The Power of Power and Cooling – Six Five On The Road at SC24


Megawatt racks? It’s not science fiction! Host Keith Townsend is joined by Dell Technologies’ Tim Shedd, Engineering Technologist, Office of the CTO, on this episode of Six Five On The Road at SC24. Tim shares his insights on the future of data center power and cooling.

Their discussion covers:

  • The rise of liquid cooling
  • The evolving demands on power and cooling systems in current computing environments
  • Power-dense rack infrastructure
  • Dell Technologies’ innovative approaches to energy efficiency and sustainability
  • Insights into future trends and technologies for power and cooling optimization
  • Bridging the adaptation gap by providing solutions for both air-cooled and liquid-cooled environments

Learn more at Dell Technologies.

Watch the video at Six Five On The Road at SC24, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Or listen to the audio here:

Disclaimer: Six Five On The Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Transcript:

Keith Townsend: All right. We promised you engineering goodness throughout this coverage of SuperCompute 2024 here in Atlanta, and we are bringing it. We’ve talked power, cooling, GPUs, CPUs, and the conversation continues with our sponsor, Dell. This is Six Five On The Road, and I have with me Tim Shedd, Engineering Technologist with Dell Technologies. Welcome to the show.

Tim Shedd: Thank you very much. Pleased to meet you.

Keith Townsend: All right, so I’ve talked to my co-host, Dave Nicholson, about coming into this show. We pictured a 100 kilowatt data center rack as a lot, but this show has redefined that baseline for me. Tim, what do you see?

Tim Shedd: Yeah, so the 120 kilowatt racks that you’re starting to see out on the floor, yeah, that’s today’s technology. Dell is leading there. We’ve shipped to some of the first customers in the marketplace today. They’re out running with blinking lights. But we planned that architecture from the start to hit about a half megawatt, 500 kilowatts, in a rack. So if you take a look at our IR7000 racks, they’re designed today to take you to half a megawatt. We can take that infrastructure, with some modification, and we’re targeting what we see as possible: getting up close to a megawatt per rack, not tomorrow, but at some point in the future. Because look, anybody who underestimates what our colleagues in the silicon world are doing is doing so at their peril.

Keith Townsend: So there’s a lot to unpack there. And we’re going to start with the basics. What has been driving the use case for all of this power?

Tim Shedd: Yeah, so there are a couple of things here. Talking about a megawatt data center is not a crazy thing at all. Talking about a megawatt rack, that starts people thinking that’s crazy. The reality is that it comes down to the fact that we want to have all of this information processed in close proximity. So we’ve got all these processors, we need them talking to each other at very high speed. And that forces us to put them as close together as possible in the same rack. And if all of those processors are 1000 watts, we end up with a 100 kilowatt rack. If they’re 2000 watts, a 200 kilowatt rack, and you can do the math. So we know where we’re going and we know what that density is demanding, and we know that networking, and communication between those processors, is only going to become more central to this. And so we’ve just got to think about how we keep all of that really tight in space. And that means very, very power-dense racks.
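
To make that back-of-the-envelope math concrete, here is a minimal sketch of the rack-density arithmetic Tim describes; the processor counts, wattages, and overhead fraction are illustrative assumptions, not Dell specifications.

```python
# Illustrative sketch of the rack-density math described above.
# Processor counts, wattages, and overhead fraction are assumptions
# for illustration, not Dell product specifications.

def rack_power_kw(processors_per_rack: int, watts_per_processor: float,
                  overhead_fraction: float = 0.0) -> float:
    """Estimate total rack power in kilowatts.

    overhead_fraction approximates everything that isn't the processors
    (host CPUs, networking, drives, fans, and so on).
    """
    processor_kw = processors_per_rack * watts_per_processor / 1000.0
    return processor_kw * (1.0 + overhead_fraction)

# 100 processors at 1,000 W each -> roughly a 100 kW rack
print(rack_power_kw(100, 1000))        # 100.0
# The same count at 2,000 W each -> roughly a 200 kW rack
print(rack_power_kw(100, 2000))        # 200.0
# Add ~25% for host CPUs, NICs, drives, and fans
print(rack_power_kw(100, 2000, 0.25))  # 250.0
```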

Keith Townsend: So with these power-dense racks, the variations start to matter. So as we’re thinking about components, drives, memory, things that we just took for granted, how should engineers and organizations be thinking about these individual components as they build out these racks? Or is this something we should just leave up to Dell?

Tim Shedd: So Dell’s strength, since our founding, has been to allow the customer to make those decisions. We will obviously come in with recommendations, with validated configurations, but ultimately we’re about allowing the customer to get the compute they need and what they want. So how that relates to power density and cooling is we don’t believe in controlling your choices by constraining everything under a layer of copper. We want to cool what needs to be cooled with liquid, but then we’re going to leave some components out, easy to service, easy to change, easy to configure, like drives, like certain memory components and so on, that are going to remain air-cooled for the foreseeable future. And allow what you were just talking about, allow the customer to say, “I really want this particular drive,” which by that time, who knows, might be 250 terabytes, 300 terabytes. But let the customer choose that. Let them choose the network interface card they need and not be constrained by saying, “Well, Tim didn’t design a cold plate for that, so we can’t put that in there.” We want to keep that choice.

So that means that we are engineering the power and cooling in such a way that we can still allow very, very efficient cooling of these air-cooled components together with all the liquid that’s coming into the rack. And if we think about capturing 90% of that heat to liquid, 10% of a megawatt is still 100 kilowatts. That’s a lot of air cooling still, but we’ve got you covered. We’re taking care of that today as well. So we allow that flexibility, allow you to choose the drives, the memory, whatever you need, and still have a very efficient and power-dense rack.
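
As a rough sketch of that 90/10 split, assuming a 1 megawatt rack and a 90% liquid-capture ratio as discussed (both illustrative figures, not measured numbers):

```python
# Illustrative split of rack heat between liquid and air cooling.
# The 90% liquid-capture ratio and 1 MW rack power are assumptions
# taken from the discussion above, not measured figures.

def heat_split_kw(rack_kw: float, liquid_capture: float) -> tuple[float, float]:
    """Return (heat removed by liquid, heat remaining in air) in kilowatts."""
    to_liquid = rack_kw * liquid_capture
    to_air = rack_kw - to_liquid
    return to_liquid, to_air

liquid_kw, air_kw = heat_split_kw(rack_kw=1000.0, liquid_capture=0.90)
print(f"liquid: {liquid_kw:.0f} kW, air: {air_kw:.0f} kW")
# liquid: 900 kW, air: 100 kW -- the air side alone is still a sizable load
```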

Keith Townsend: So let’s talk about cooling. You’ve mentioned it a couple of times. Air cooling is just not an option anymore for processing; cooling these 1000 watt CPUs and processors in a single dense rack becomes unreasonable when you’re getting up to the 100 kilowatt range. However, the infrastructure for cooling doesn’t seem to be there, at least not en masse. What are you seeing?

Tim Shedd: Yeah, so we are definitely in a transition period; this has come faster than people expected. We had some visibility into this, but when you’re talking about brick and mortar, it can only move so fast. And we’ve just come out of a supply chain constrained period. So you’re exactly right. We are seeing, however, customers being surprisingly agile in figuring out how to get liquid where they need it, and we are listening to them and trying to move with them. So you’ll see in our portfolio, some people have been surprised that our GPU servers have been air-cooled to date. That was a conscious choice because most of our customers are still air-cooled. But we can bring liquid to the rack and we can help our customers figure out how to do that without necessarily having to plumb up the whole rack and have cold plates and everything.

And we can still make that really efficient with a rear door heat exchanger. And we’ve demonstrated that, shipping thousands of very dense GPU racks that are cooled with a rear door heat exchanger. It’s still liquid to the rack, but we’re showing that, hey, our energy consumption is the same or lower than some of our competition with DLC. So what I’m saying is we’re thinking about this and figuring out how we bridge that gap: provide compute today that can be deployed in these existing data centers, using these billions of dollars of investment, while bridging to the future. Because as we get beyond 1000 watts, as you said, it’s going to be really challenging to blow enough air through there. You’re going to need gale force winds inside of these servers.

But there are also new innovations, some of which we can’t quite talk about right now, where we’re looking to allow customers to stay air-cooled, even with 1200 watt, 1500 watt, 2000 watt GPUs if they need to. Now, it’s not going to be quite as dense. The data center is still going to have to add probably some cooling towers or some other components to get that heat out into the environment, so it’s not free, but we’re definitely working every aspect to allow the most dense racks possible, 200 kilowatt, 300 kilowatt, but also to allow those customers who for a lot of reasons can’t get there today to take advantage of the most advanced technology with different cooling technologies.

Keith Townsend: So we were briefed by Jeff Clarke’s team a couple of weeks ago. He was saying, “You know what? Dell isn’t a power company, so they’re not going to solve the power problem directly, but you do have to work with customers that need to solve these problems.” How have they been dealing with the challenge of, “My data center was designed for 15 kilowatt racks, and now I’m pushing that to 50, 60, and 120”? How are customers solving these problems from a practical power generation perspective?

Tim Shedd: A lot of creativity. And a lot of work being done by myself and my colleagues to look at how we can adapt infrastructure that wasn’t necessarily purpose-built data center infrastructure, but exists in a place where power exists. For some of our customers, that’s been the mode: there’s a shell, there’s power. So we have experts that are able to say, “Well, here’s what we recommend for a power bus structure that you can implement in this environment.” I can come in and say, “Well, here’s what we recommend to put in cooling that isn’t going to break the bank. We’ll make it real easy for you to just hook up a couple hoses per rack. Here are the requirements; start hiring the plumbers and get that in.”

So it’s very intimidating to think about from the start. I know a lot of our customers have some fears built up about liquid and about getting that much power in, but we’ve got the expertise to come in and advise. We’re not the final engineers. We want them to hire the proper engineering firms for that area to sign the papers, but we’ve been working on these designs to help them get there fast. One of the things we’re proud of is helping our customers deploy really fast, because they’re spending hundreds of millions or billions of dollars for this compute. They need it to start generating income. And so to get right back to your question, we’re just trying to provide validated designs to help them deploy wherever they get that power, wherever they can find it, and get to making money with their compute.

Keith Townsend: Now, we can’t have a conversation about power and cooling without having the conversation about sustainability. A lot of your customers, and Dell itself, have very aggressive sustainability goals. How has this massive explosion in compute impacted the problem at the engineering level?

Tim Shedd: Yeah, exactly. So we’re working this issue on all levels. It starts with the power itself, driving to more energy-efficient power supplies. We’re going from 90% efficient to, in this generation, especially as we look to ORV3, 96 to 97% efficient power supplies. That’s roughly 7% on 100 kilowatts, seven kilowatts we’re saving per rack, and you multiply that out. So that’s a big savings right there. We’re implementing within our firmware a lot of different techniques to optimize energy use. Again, some IP here, but some really interesting things on our GPU servers where we’re actually adding some situational awareness to the server to know, hey, there are actually some additional fans helping me out, I don’t need to drive my fans as hard, I can lower the power that way. So we’re really looking at it from that aspect, from the inside out.
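
A quick sketch of where that seven-kilowatts-per-rack figure comes from; the 100 kW IT load is an illustrative assumption, and the efficiencies are the ones mentioned above:

```python
# Illustrative sketch of the power-supply-efficiency savings discussed above.
# The 100 kW IT load is an assumption for illustration; the 90% and 97%
# efficiencies are the figures mentioned in the conversation.

def wall_power_kw(it_load_kw: float, psu_efficiency: float) -> float:
    """Facility power drawn to deliver a given IT load through the PSUs."""
    return it_load_kw / psu_efficiency

it_load_kw = 100.0
drawn_old = wall_power_kw(it_load_kw, 0.90)  # ~111.1 kW drawn at 90% efficiency
drawn_new = wall_power_kw(it_load_kw, 0.97)  # ~103.1 kW drawn at 97% efficiency

# The quick "7% of 100 kW" rule of thumb gives about 7 kW per rack;
# computing the input power explicitly puts the saving closer to 8 kW.
print(f"saved per rack: {drawn_old - drawn_new:.1f} kW")
```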

But we’re also looking at it at the rack level: how can we ensure that the rack is using the minimum energy possible by looking at power delivery and cooling? Because cooling traditionally has consumed about 40% of the energy used for compute; for every watt of compute, roughly an additional 40% is needed for cooling. So that’s basically wasted energy that isn’t doing useful work, and we want to cut that down to less than 10%. You hear about PUE and so on. That initial number comes from a PUE of about 1.5, which is a really good data center today. We want to get that down to 1.1, 1.2 or lower, so that our customers can deploy more compute for the power they have and use that power to make money.
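
For reference, PUE (power usage effectiveness) is total facility power divided by IT power. Here is a short sketch using the PUE values discussed above; the 1 MW IT load is an illustrative assumption:

```python
# Illustrative PUE arithmetic. The 1 MW IT load is an assumption;
# the PUE values are the ones discussed above.

def facility_power_kw(it_kw: float, pue: float) -> float:
    """Total facility power (IT plus cooling and power-delivery overhead) for a given PUE."""
    return it_kw * pue

it_kw = 1000.0
for pue in (1.5, 1.2, 1.1):
    overhead = facility_power_kw(it_kw, pue) - it_kw
    print(f"PUE {pue}: {overhead:.0f} kW of overhead on {it_kw:.0f} kW of compute")
# PUE 1.5 -> 500 kW of overhead; PUE 1.1 -> 100 kW: the gap Tim describes
```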

So that entails really efficient cooling, which we’re helping our customers deploy. And then also, as I said, optimizing the power, getting the redundancy they need without overdesigning, which is another way you waste power and energy, being super-conservative. We’re saying, “Well, look, we’re giving you redundancy features that allow you to keep rolling through an outage without necessarily having to double up on everything.” So there are all sorts of aspects to it: server power supplies, server firmware, rack design, cooling design, and facility power design.

Keith Townsend: Yeah, it really is a mind shift, not just for the data center industry but for the enterprise. We’re used to overbuilding and having more capacity than we need. In this new world, we really do need to tighten that up. The more productivity and efficiency we get out of our systems, the less power and cooling we need, thus helping us achieve our goals. Tim, there are two acronyms that you’ve used, and I really want to do our audience a service by making sure we define them: ORV3 and DLC?

Tim Shedd: Yeah, so ORV3 stands for Open Rack Version 3. It’s a rack architecture that came out of the Open Compute Project, or another acronym, OCP. There’s a standard form factor for servers and power shelves and other equipment that slides into that, and we conform to that. But outside of that, in what we call the payload area, where you’re putting your compute and so on, there’s some flexibility in the standard. And what Dell looked at is, well, our customers are not Microsoft or Meta, they’re not building out 100,000 of the exact same server, so we need to add some flexibility there. So we made the rack a bit deeper, 1200 millimeters versus 1000 millimeters. We made it a bit wider, 750 versus 600, so that we could actually provide that flexibility for the networking and power requirements of our enterprise customers.

That includes what we believe is the first enterprise-ready, DLC-based ORV3 rack. So we installed permanent manifolds in the back of the rack with OCP, Open Compute, standard connectors on them, and those connectors, that liquid delivery system for direct liquid cooling, is already geared for the future. It’s already designed to let us get to the 500 kilowatt rack, so our customers can buy that rack, put in our cooling, put in their servers today, and know that the next generation is just going to plug right in. There’s a lot there. So let me talk real briefly about direct liquid cooling. Direct liquid cooling refers to the fact that we’re actually taking liquid into the server, running it through some hoses to cold plates or heat exchangers that sit on top of the processors, and pulling that heat directly off of the processor. So I don’t need any air movement, I don’t need any fans, to get that heat out. So that’s DLC, or direct liquid cooling. There’s a lot of detail there. We can dive deeper on that if you wish.

Keith Townsend: No, I think all of my gaming friends are now looking at us and saying, “We told you so,” but we knew it was going to come to the data center, and here at SuperCompute 2024 in Atlanta, Georgia, we’re seeing it. There is no other platform where you’re going to get this high-level conversation on the business value of DLC and ORV3 other than Six Five On The Road. Stay tuned for more end-to-end coverage, from the business value to the deep technical conversations with experts such as Tim, as we continue our coverage from SuperCompute 2024.

Author Information

Keith Townsend

Keith Townsend is a technology management consultant with more than 20 years of related experience in designing, implementing, and managing data center technologies. His areas of expertise include virtualization, networking, and storage solutions for Fortune 500 organizations. He holds a BA in computing and an MS in information technology from DePaul University. He is the President of the CTO Advisor, part of The Futurum Group.
