Open Compute Project: An Interview with Intel's Rebecca Weekly
by Dr. Ian Cutress on August 9, 2021 9:00 AM EST
We all know that we put processors into servers, servers into racks, racks into data centers, and then they 'do stuff'. Whether that's a hyperscaler managing internal infrastructure, offering outside services, businesses processing workflows, high-performance machines working on the latest weather or nuclear simulations, social media companies scaling out their services to billions of users, or smaller startups needing scalable resources for their new monetizable idea, there's always a data center and enterprise backbone.
The need for lots of computing resources comes with a number of fundamental issues, chief among which is likely standardization. Without consistent dimensions or a common definition of the size of a server, a deployment can easily end up as a hodge-podge of malformed hardware with no discernible high-level design methodology. While the silicon vendors or the OEM partners building the systems could have their own way of doing things, without a collaborative effort to define standards, we would still be in the 1970s or 1980s, where systems ended up unique to one particular customer. On top of this, there is an important overriding drive in the 21st century to ensure that enterprise deployments are power efficient as well.
When Facebook was scaling its technologies and pivoting to completely public use in the late 2000s, it started an internal project around data center efficiency and scalability. The goal was to end up with a solution that provided scalable resources, efficient compute, and enabled cost savings. In 2011, together with Intel and Rackspace, Facebook launched the Open Compute Project to enable a set of open standards that could benefit all major industry enterprise players. OCP is also a fluid organization, providing its community a structure that is designed to enable close collaboration on these evolving standards, pushing for 'commodity hardware that is more efficient, flexible and scalable, throwing off the shackles of proprietary one-size-fits-all gear'. OCP also has a certified partner program, allowing external customers to be part of the ecosystem that covers data center facilities, hardware, networking, open firmware, power, security, storage, telecommunications, and future technologies.
While the initial founders included Intel and Facebook, other companies involved include ASUS, Arm, IBM, Google, Microsoft, Dell, HPE, NVIDIA, Cisco, Lenovo, and Alibaba. An example of how to think about OCP is that an OCP Rack is 21 inches wide, rather than a standard 19 inches, allowing for more airflow, but the racks are also taller, accommodating more units. Parts of the rack use dedicated high voltage power unit shelves that supply power to the rest of the servers in the rack, rather than relying on each system to have its own power supply. This also allows each server to fit more, such as a 2U six-blade design, or a 30 drive 2U design for storage that allows the drives to sit flat, rather than vertical. The OAM form factor for high-power accelerators, which stands for OCP Accelerator Module, also came out of the group. Two years ago we reported on Facebook's Zion Unified Training Platform, built to OCP specifications, using Intel's Cooper Lake processors.
In this interview today we have Rebecca Weekly, who not only sits as the VP and GM of Intel's Hyperscale and Strategy Execution, but is also an Intel Senior Principal Engineer. However, today we are speaking to her in her role as Chairperson and President of the Board of the Open Compute Project, a position to which she was promoted on July 1st, 2021. When the press relations team sent around the news that Rebecca was taking the role, I reached out and asked if we could do an interview to get a deeper insight into OCP.
Ian Cutress: You’ve been elected as chairperson of the Open Compute Project - how long have you been involved with OCP? And what does your position as chairperson entail exactly?
Rebecca Weekly: Great question! I have been on the board of OCP since September 2020. I started (for lack of a better term) shadowing the previous person in the role for Intel back in July, but I took the position in September. But I've been involved in projects with OCP for a long time!
At Intel I work with hyperscale customers, and three of those hyperscale cloud providers are on the board of OCP. I’ve worked with many OCP projects, whether it's Mount Olympus, which was donated [to OCP] in conjunction with Microsoft, or any of Yosemite v1/v2/v3, which were donated from the Facebook partnership. Those projects have been things we've been working on forever. With those systems we have firmware support packages, things like OpenBMC. I mean, there are so many projects from a management (and modular design) point of view that revolve around my day job - to work with customers to make sure that they have the kind of modular compute systems that are vanity-free and are ready to go in the ecosystem.
It has always been a part of my day job since I came to Intel six years ago. It felt very natural [to be a part of OCP]. But the board is very different! It's a different way to think. When you come in, you look at the open-source ecosystem and your contribution strategy to that open-source ecosystem - from a specific company, you think about the base components that have to be able to work together, and how we enable that while keeping our special sauce unique, right? That's our job at OCP. That's our responsibility to our various stakeholders. When you're sitting on a board, you're thinking about the future of that industry and the community you serve - what needs to happen in that industry given the big trends that are happening.
It has been a whirlwind - first of all, just being on a board with Andy Bechtolsheim (who I look up to) is great. Everybody who I have the opportunity to serve with, such as Mark [Roenigk, Facebook], Partha [Ranganathan, Google], Zaid [Kahn, Microsoft], and Jim [Hawkins, Rackspace] - they're all phenomenal humans, who really think about both the future of the industry and the communities they serve, and they wear those hats with a lot of grace. I’m finding that opportunity to see what I am supposed to do for my day job - what I am supposed to do for Intel, but also then what I am supposed to do for this community. It’s about making sure that all can be synergistic, but recognizing the task. I'm here in this capacity - that's the hat I'm wearing and that's what I've got to do.
IC: At times they must conflict? If different parts of the board want to do different things than what Intel wants, it’s about the industry going in a given direction?
RW: Something I did very early on was that I listed out all the different working groups at OCP. I was totally explicit with my partners on the board which groups Intel plans to ignore - as in, it's not our job to contribute in specific areas. We think we have a lot of special sauce for our key areas, and we're there to participate, and hopefully we will help ensure that everyone’s key contributions are involved. Anything we can contribute from the sense of experience though, we’ve got experience there, we’ve done that, perhaps don’t go down that path!
But in general, from the Intel side of things, we're observing most areas versus others where we're trying to lead and where we think it's critical to the future [of Intel]. So I went through OCP’s working groups, took note of all the projects, like what the status was, and how it really worked. Because there are so many different parts of Intel that do contributions to OCP, from tools, to software, firmware, BIOS, and everything that's happening on the system side for systems components, whether it's a network add-in card, or something that's happening at the switch silicon space, versus what's happening on a motherboard. So there are lots of different areas where people can contribute, and we’re trying to get everyone on the same page, with truth and transparency. [We always ask] where are we at - and then share that. So that if a topic came up, I have to say ‘I can't really talk about that, it's not something I'm either empowered to talk about, or it's just not something that we’re going to contribute to'. I can speak as myself, as chairperson of the board, but not in my capacity as working for Intel.
IC: OCP is the Open Compute Project, and it is very focused on the enterprise hyperscale industry. It's not going to be for people considering their home system or home networks! But how would you define OCP to people who have never heard about it before?
RW: Sure! OCP, as you said, is the Open Compute Project, and it's really a community first and foremost. It's a community, and it's ‘by engineers, for engineers’, which is probably why I love it so much! It started in 2011, and fundamentally it was about efficient server, storage, data center, and hardware designs. It's one of the few communities I know of, or in fact the only community I know of, that doesn't just focus on a single element, such as thermal or electrical protocol layers for interconnect in some capacity. For those we have JEDEC and PCI-SIG. [OCP] is about systems, about implementations.
[For the others], it's great to talk about Hardware Root of Trust in isolation, but if you want everybody who's participating in your supply chain to have an implementation of a Root of Trust that is consistent, you've got to go somewhere and force an implementation spec for it as well as a compliance body in some sense to make sure that happens. So OCP is really the only community I know of that does that work.
If you think back to 2011, you still had SGI, you still had all these crazy pseudo companies doing MIPS, which was part of SGI at the time. But they were doing specific individual implementations of very fancy systems - do you remember walking through the data centers with all the LEDs, and they were just so perfect? At the time companies were manufacturing their own screws as if that was important!
IC: They were focusing on bespoke designs per customer?
RW: Exactly. Then this community got together and said ‘we don't care if it's plywood, it does not matter’ - because conceptually they cared about vanity-free hardware with consistency. [The community asks itself] ‘how do we drive convergence on vanity-free components to increase supply, to decrease cost, to improve PUE*?’ The community asked about everything in the domain space that was really important for the data center to take off.
It is, to your point, very hyperscaler led. But actually, if you look at the contributions of the marketplace that people adopt, 58% of adoptions from OCP marketplace are telcos and network operators. The community has changed so much over the last 10 years, and there's a lot of change that will continue to happen. I think, as you know, we're becoming more heterogeneous especially as data is more disaggregated, and everything that we're dealing with just as a community means there are changes afoot.
*PUE = Power Usage Effectiveness, a measure of how much energy input into a data center is used in the servers. The best PUE values are 1.06 or lower, meaning for every 106 W going in, 100 W is being used. An average PUE is 1.4-2.0, bad PUE is 2.5+.
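To make the footnote's arithmetic concrete, here is a minimal Python sketch. It assumes the common definition of PUE as total facility power divided by the power delivered to IT equipment; the function names `pue` and `overhead_fraction` are illustrative only and do not come from OCP or from anything discussed in the interview.

```python
def pue(total_facility_watts: float, it_equipment_watts: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power.

    Assumes the common definition of PUE; both inputs must use the same unit.
    """
    if it_equipment_watts <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_watts < it_equipment_watts:
        raise ValueError("total facility power cannot be below IT equipment power")
    return total_facility_watts / it_equipment_watts


def overhead_fraction(pue_value: float) -> float:
    """Share of the input power that never reaches the IT equipment."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be lower than 1.0")
    return 1.0 - 1.0 / pue_value


if __name__ == "__main__":
    # The footnote's example: 106 W going in, 100 W used by the servers.
    best = pue(106.0, 100.0)
    print(f"PUE = {best:.2f}, overhead = {overhead_fraction(best):.1%}")
    # A facility at the 'bad' end of the footnote's scale.
    print(f"PUE = 2.50, overhead = {overhead_fraction(2.5):.1%}")
```

Under that definition, a PUE of 1.06 means a little under 6% of the input power goes to cooling, power conversion, and other overheads, while a PUE of 2.5 means 60% of it does.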
IC: I've noticed that over the timespan that OCP has existed, it shifts based on need. The biggest thing that's currently in the market right now is AI, and the move to make solutions that are more AI-focused for everybody to use. To your point, the shift towards 5G deployments and telcos, that seems to be a really big focus right now?
RW: In that sense, there are some interesting things happening with network anomaly detection, and more of a software perspective for using AI, obviously. But in OCP, we have OAM, or OCP Accelerator Module [which is a unified form factor for high-powered compute cards]. As part of OCP, we think about the form factors we can help create so that people can choose to, for example, take a Cerebras chip, or take a different AI accelerator or whatever, you know, the newest, latest, and greatest, and be able to take advantage of the system's footprint and validation footprint that's already in the ecosystem.
IC: I’ve noticed Intel has acquired a few AI companies recently, and they've all gone towards that sort of OAM interface!
IC: So if we look at the companies listed in OCP, we've got ASUS, ARM, IBM, Google, Microsoft, Dell, HPE, NVIDIA, Intel, Cisco, Lenovo, Alibaba. That's a lot of the industry, and you said you work with three out of seven hyperscalers in your role at Intel. Is OCP in the situation of growing its membership, or is it at a healthy level, or are there other people that need to be involved that currently aren't?
RW: The OCP Foundation right now has around 250 corporate members - it started as six. So there is definitely a huge and growing number of participants. There are over 5,000 engineers, 16,000 participants, and there are 29 active projects across the world. We move as the domain spaces keep shifting and growing. So obviously we have security and operations specifically for security, there are advanced cooling projects, areas in connectivity solutions, testing, validation, enabling. There is so much - for example, there have been so many great papers recently written by OCP members about the complexity of test and validation, or consistency as we have more heterogeneity in systems.
One of the awesome and amazing projects that has been a focus in the last year is sustainability, and looking at sustainable practices, because there is zero consistency on reporting emissions as it pertains to ICT equipment. There are also zero standards around reporting, no sort of best practices for operations. [At the base level], it’s different to how your laptop works, such as going into suspend mode and whatever, which has standards - it is really different than if you're trying to operate a public cloud infrastructure. For that, you have to have a certain commitment from an SLA perspective [for your customers] for speeding up latency, but you're actually not fully utilized most of the time. It means you operate with an ‘always-on’ mindset, but [the task is to] not burn power if unnecessary. I’m a total hippie so I get really excited at the prospect of us bringing that community together!
But also, there are companies out there making all sorts of claims. There's nothing standard to compare it against - their claim is ‘as measured by’ a hired company with a methodology that has not been validated [or standardized]. Governments haven't necessarily stepped up in this domain space, either. But I think this is a domain space where open source communities really can make a difference - at least they can start, and then others can take notice.
IC: One of the recent groups that I think most of our audience is probably interested in is the Open Domain Specific Architecture, sort of an ‘Open Chiplet Marketplace’. Because Intel is moving into that, and Intel’s competitors are moving in that direction, and even when I spoke about AI chips, some of these designs are essentially chiplets in a big infrastructure. This marketplace was announced around two years ago - do you have any insight into what's going on with the sort of chiplet standards now, with OCP?
RW: So much, and not nearly enough! But you know, I think that you made an excellent point which is that everybody is getting into this domain space. Whether it's 2D or 3D, there are a lot of interesting things happening with 3D technologies and 2D technology. I think it's fair to say we have probably three, or at least definitely two, main things that are occurring.
ODSA itself as a working group has been working to create sort of a test chip mentality, where we can actually give effectively a reference implementation of two different 2D stacking technologies with two different fabs on either side of the wire. That's one OCP project, and it's really about trying to create [combined products]. Whether a Bunch of Wires [that’s a technical term, BoW] is used, or an Advanced Interface Bus (AIB) is used, there are lots of things in this domain space.
We're all evaluating these technologies based on throughput, but also based on really simple things, like the right to license and utilize. These are things that just require communities to come together, have the debate, and have the discussion. So that's really where that team is focused and that project group is focused.
When I look at what's happening in the ecosystem, there were some really interesting conversations that happened in November of last year, at the Virtual OCP Summit. They started talking about this concept of an Open Chiplet Marketplace. This is kind of a brainchild out of Google, where companies are coming together and bringing people together to talk not just about the thermal or the electrical, but how and when we actually produce these things, as well as the software layer, creating consistency, and how do we create a security model, [especially] when chiplets are made anywhere? [We ask whether] we have some aspects of composability and manageability, those sorts of things.
So, you probably remember when NVIDIA cards started coming to the public cloud. It was a nightmare. You couldn't chop them up for virtual machines, and if something went wrong in the device, all the visibility went from the virtual machine. As the hardware vendors providing it, we never really thought about the world where customers manage the hardware. But issues like this still happen! I've been focusing on this area for six years at Intel.
But we're coming up on the 15-year anniversary of AWS. Even now, we’re still trying to understand fundamentally what it takes to operate a public cloud set of hardware. It's very different.
With chiplets, there is so much of the domain space of the actual interconnect in the marketplace, and how that could function to have consistent software and composability. I think there's a ton to go figure out in terms of validation, ensuring it works, and creating it at scale. I don't know where it will go - I mean, I'm so excited to see where it goes. This is like my typical day you know - there are some parts of my world where I know the goal is to get to a place where you just have the robots and they're composing the CPU, or XPU, with all these pre-validated chiplets. Suddenly, boom! You run a couple of burn-in tests on it, and you now have a whole new XPU at scale that is exactly what the end-user needed! Will we get there? It's decades away, but I think that's what those kinds of technologies start getting us towards.
IC: I think it's really important that when we start moving to chiplet/tile architectures, if you're buying a third-party chip, you want to verify that it's actually doing what it says, and that it has a defined secure supply chain. Because at some point there's going to be, as you say, robots designing these chips. You have got to make sure the robots are making it in a very secure way, such that you can validate start to finish, and the results you get out are right. But that sounds like the right path through!
RW: Just getting these things to do 2D stacking together from two different foundries on two different sides of the wire, to do a different and unique use case is a start. I think that getting there would never happen if people don't come together.
IC: So in OCP, the O stands for Open. But to become part of the OCP, you have to pay to be a member. But all the meetings for everything seem to be online, so anybody can watch them, and they’re all listed on the OCP website. So what exactly does ‘Open’ mean?
RW: It's a good question. To me, ‘Open’ is about open hardware, right? It's about creating specifications that anybody can pick up, whether it's Quanta, Foxconn, Wiwynn - anybody can pick up and produce. [Open hardware is] helping ensure that there's more consistency from the modularity of the computer itself. But even with ‘at scale’ operations, we have to think about management and security solutions in that domain space. So to me, Open is about the specifications that are contributed, rather than the participation.
Now the Foundation does have to pay for and administer the working groups, and ensure that those happen. Wearing my OCP hat, I would say it's a pretty nominal fee for what it does in the community! We also have summits and all our different collaborative events. But again, that monetary part isn't specifically about the community it serves, which is driving open technology and solutions. I mean anybody can get involved and anybody can learn from it - I think the [cost] opportunity is to become a voting member of the community, and be elected to be part of the various working groups or incubation committees. Those things require membership. That membership really is about contributing financially but also contributing with your time - members actually have to make contributions [as part of the fee]. It’s done in order to ensure that you are, in fact, listened to, but also committed to driving these contributions.
IC: It sounds like you can't really be a passive member in that sense - you definitely have to be involved.
RW: Involved somewhere! There are definitely particular working groups that may not be interesting to you, but the expectation is that it's a very active community. When I joined the board, I still remember Mark Roenigk, the previous chairperson, telling me that we are a working board, and the expectation is you're going to get things done. He told me here's how to set up a support network, to be able to do this for this long - and I took notes, I cannot do it on my own, there's no way.
IC: So I think the way most of our readers might interact with OCP is through the standard sort of OCP server rack design. I think you mentioned Olympus earlier, this sort of drive to having more efficient servers, helping with cooling and density. An OpenRack OCP server right now, I think at version three of the specification, is wider than a standard server, and the racks are slightly taller. This is completely different to how most of the time we look at enterprise systems! So why hasn't the industry moved to what OCP suggests, and why has it kind of stayed in its lane? I ask this because you're somebody who works for Intel, and Intel's partners sell a lot of those regular systems!
RW: So OpenRack started that design form factor, as the 21 inches instead of the 19, for a very specific kind of single socket, but super dense design. It's a phenomenal design, and OpenRack continues in the ecosystem. But it's not the only OCP-certified rack size. So if you look at Mount Olympus, it was a 19-inch form factor and a standard server configuration. You'll see both, and companies like Google make contributions that are 19-inch as well. So the OpenRack form factor is a compliant standard, but it's not necessarily true that every contribution in the marketplace has to adhere to that original OCP form factor. It was originally about Facebook creating a very unique, very low PUE, high-density design. Facebook was one of the OCP Founding members, and I love what they did there.
There are a lot of different people in OCP, and the Open 19 standard started because people were confused why Facebook did this weird 21-inch size thing. There are all sorts of conversations about it. OCP then decided to embrace 19-inch too. So you know, I think it's been an interesting journey because so many of the original designs were contributions from hyperscalers, made in partnership for their specific environment. But then as the community has grown into Telcos, into more enterprises, and as the board membership has changed, such as with Rackspace, people are thinking about it differently. So there's more of an expansion in the form factors that are available, and that will evolve as the industry evolves.
IC: Does OCP actively search out technologies it feels that it should incorporate into some of its open standard designs? Or do you rely on the companies with those technologies to become part of OCP?
RW: It's a great question. When I came on to the board, I had a very similar question - ‘so how does this work? I know what we do, I know what we think about, and it's a pretty proactive process from our lens. But how does new membership work? How do new initiatives work?’. So as a board actually, we sat down and spent days and days together (virtually, because it was during the pandemic), to come up with answers to where the industry is headed. We asked ourselves and our members what do we see as the future initiatives of the world - computation is becoming increasingly heterogeneous, data is increasingly disaggregated, and sustainability is incredibly underreported. We kind of tried to analyze both the pain points - our various lenses on the industry, as well as the changes in opportunity. We then reported to each other.
Out of that process, we came up with the OCP 2.0 framework, which we're in the process of rolling out. To meet the market of today, that's traditional OCP. It’s what we've always done, with modularity, at-scale operations, and increasing sustainability, and it's a top pillar of meeting the market today. But I think we're already late, frankly.
Then in the integrated solution space, if you think about how much money and time various vendors spend just to certify different solutions, such as SAP certifications or ensuring that vSphere runs correctly in all different configurations - it’s a huge amount of time for the industry. OCP 2.0 is about how we can do that better, faster, and stronger.
Then the other aspect of it is, again, understanding the future. The future will be more heterogeneous, more disaggregated, and all these different technologies need to be in place. To take one example - optics needs to be developed, not just from a network switch perspective which is well covered, but at a level of silicon and integrated photonics for node-level integration. When does copper (the established solution) lose? There is a point at which we're going to have to make those changes, and it's going to be chiplet integration. We ask how we are going to make sure the modules for optics can go into advanced cooling systems - light has different behavior, so how do we make sure that we're building something that actually will work?
You know, in all those different environments, optics was a big area of focus for Intel, and we're seeding future innovation as well as open silicon initiatives. Understanding all the different dynamics that are happening in the industry, whether it's our IDM 2.0, whether it's key acquisitions that have happened to meet the ecosystem, or general consolidation, fewer foundries available across the world, or all the different AI chiplets that you've mentioned earlier - these domains are requiring more and more partnership at all the different layers. At Intel we focus on a lot of layers, but not all! Then we produce a product, someone else produces a product, and we all use standards like PCIe or OAM to work together. I mean the new CXL standard is about having a fundamental understanding of the TLB on a device that is not my own!
So, I mentioned optics with cooling, but cooling anywhere - it’s about how we think about cooling at the edge in these tiny colocation facilities, versus cooling in a more hyperscale situation. If we go back to the beginning - HPC invented this stuff. There were fish tank designs, and cold plates, and all sorts of cool things that came together. OCP is about how we make sure that every single one of those things is not a new research project that takes two and a half years, but can become a commodity way we think about it. I don't know about you, but I grew up loving Fry's Electronics, and I still remember that old gaming system with an immersion cooling solution that was really kind of a cold plate solution, where it went through the front, and it was so cool, and it lit up, and I just thought it was the coolest thing I've ever seen. We need something like a server equivalent, maybe without the lights (Ian: RGB?), but it needs to be easy for us to take advantage of those technologies, because fundamentally that will make our overall power footprint in this industry more efficient and effective for serving the incredible demand that exists. So that's seeding future innovation, that's what we want to do. We have a lot of aspirations, you know, we have to get it done.
IC: I was going to talk about the Future Technologies Initiative, but I think you've basically covered it. Fantastic!
RW: What's interesting about that one is that it was a community-led initiative - the Future Technology Symposium started with the community, and then the board made it official. Now we've done a good mapping between what was already in the Future Technologies Initiative workstreams, because that started in late 2019, and what the board decided [should be other features under that umbrella]. But Future Technologies is interesting because the workstreams are more about service models, like organizing the cloud service model, or an AI hardware and software co-design model. It’s about recognizing that the domain space of heterogeneous compute exists, because in the future you will no longer be able to run a generic general-purpose computational solution that isn’t deeply aware of the software running on top. It also has to meet the needs of AI, right? We also have Software Defined Memory, and I already hinted about CXL and my view of what's going to happen there in terms of device models. It’s a completely different device model than anything that we sort of grew up with on assumptions for IO.
The industry is just, you know, it's amazing. I mean that's why I'm at it!
Many thanks to Rebecca and her team for their time.
Also thanks to Gavin for the transcription.