The future of dense compute is HERE!
Host David Nicholson and Armando Acosta, Director of HPC Product Management at Dell Technologies, discussed the challenges of skyrocketing power demands, the need for standardization in liquid cooling, and how Dell is leading the charge with innovative solutions like external power shelves for maximum performance.
📻 Tune in to this episode of Six Five On The Road at #SC24 to hear about some of the mind-blowing advancements in #AI infrastructure, including:
- 🤯 480-kilowatt racks (yes, you read that right!)
- 🌊 The rise of direct liquid cooling (and why #Dell is going hybrid)
- 🔮 A glimpse into the next 5 years of #datacenter evolution
#HPC #Innovation #Sustainability
Learn more at Dell Technologies.
Watch the video below from Six Five Media at SC24, and be sure to subscribe to our YouTube channel so you never miss an episode.
Or listen to the audio here:
Disclaimer: Six Five On The Road at SC24 is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
David Nicholson: Welcome to SC24 here in Atlanta, the SuperComputing Conference. I’m Dave Nicholson with Six Five On the Road, and I have a very special guest this morning, Armando from Dell. We’re here to talk about the destiny of density. Welcome.
Armando Acosta: Thank you.
David Nicholson: I am here in Dell’s secret lab with Dell scientist, Armando. What do you do at Dell, Armando?
Armando Acosta: I’m an AI Product Planner, so kind of like a scientist, but there’s smarter people behind me.
David Nicholson: Fantastic. There’s always smarter people than us in the room. So we’re going to talk about what I’d like to call the destiny for density.
Armando Acosta: Yes.
David Nicholson: How’s that?
Armando Acosta: I like it.
David Nicholson: Where are we in terms of just how dense compute is and how that relates to AI? What’s sort of the state of the art? Let’s start there.
Armando Acosta: Yeah, I mean, let’s talk first about what we did a couple of years ago with the XE9680. So that was our first 8-way GPU platform. We put that in a 6U form factor.
David Nicholson: Now you say a couple of years ago. How long ago was that?
Armando Acosta: Calendar year, ’22.
David Nicholson: ’22, okay. Okay. So it has been, all right.
Armando Acosta: Yeah. And then if you fast-forward to what we just announced here at the show and at Dell Technologies World, we've now introduced a new form factor. It's a 4U direct liquid cooled system, but you can still put 8 GPUs in it. So we've shrunk the footprint by 2U. And by doing direct liquid cooling, we're able to essentially cool the GPUs and cool the CPUs. And by doing that, we're actually able to save our customers 2U of rack space. Now why is that important?
David Nicholson: Hold on, hold on. What is the chassis? If it was the XE9680 before, is it still the 9680?
Armando Acosta: Yeah, we’re creative here. So now it’s the XE9685L and the XE9680L. The L stands for liquid.
David Nicholson: But 4U, liquid-cooled, eight GPUs. Because the original 9680, if you will, is air-cooled.
Armando Acosta: Correct.
David Nicholson: Because not everybody is going to do direct liquid today.
Armando Acosta: Correct. And just so you understand, we know all the hype around direct liquid cooling. We know everybody wants density, but Dell is going to stay true to our enterprise and commercial customers. We’re still going to support 19-inch rack air-cooled systems. But if you look at where we’re going in the future and essentially trying to push the boundaries of density, we will start to look at 21-inch rack architectures as well. So what customers are telling us is, Hey, before I wanted 40 GPUs per rack. Now I want 72 GPUs per rack. Now I want 96 GPUs per rack. And now we have customers that are asking for even greater density than 96 GPUs per rack. So kind of wrap your brain around that. You’re like, okay, 96 GPUs per rack. Yeah, you’ve got to bring over a hundred kilowatts per rack in order to make that happen.
David Nicholson: Yeah. And it’s an insane amount of power. So you were going to tell us about why this is important, this additional level of density.
Armando Acosta: So this is really important because if you look at GPUs, they’re not getting smaller and they’re not consuming less power. It’s just the game we’re in today. And if you look at what we’re looking at for future roadmaps, I can’t talk about our partners’ future roadmaps because we’re not allowed to do that, but the power’s not going to get less. It’s going to get more. But here’s the deal: when you want to go run these AI workloads and you want to go do these large language models that have 20 billion, 40 billion, 80 billion parameters, guess what? You need a lot of GPUs to go and train that model. And if you look at where we’re going here, we know you’ve got to train the model, and that’s a necessary evil, but guess what? You don’t get any value until that model is in production. This is why customers want to train the model faster, to get to insight faster, and then essentially get to some new data that they need to know, maybe some new decision. And essentially that’s what brings value to our customers.
David Nicholson: So do you continue to support in that smaller form factor a variety of GPU brands, if you will?
Armando Acosta: Yeah, great question. So we’re all about addressing silicon diversity. And what I mean by silicon diversity is we’re not only talking about silicon diversity on the CPU side, we’re also talking about silicon diversity on the accelerator and GPU side. And so when you look at what we’re always trying to do, for example, the XE9680 that we just talked about, the 6U, air-cooled system, we support NVIDIA GPUs, we support AMD GPUs, and we’ll actually start supporting Intel GPUs as well. Fast-forward to what we’re doing with the direct-liquid-cooled systems, same thing. We’ll support NVIDIA GPUs and AMD GPUs. And not only that, you have your choice of either an Intel CPU or an AMD CPU. And so what’s going on here is with silicon diversity, we’re giving our customers flexibility and choice. Because you know this is not one size fits all, and there is no silver bullet where you say, Hey, I get to buy one system and run one system and it’ll solve all AI workloads. We know that’s not the case. Right?
David Nicholson: Well, you just reminded me that actually even asking about diversity in GPUs is almost a knuckleheaded question in the context of Dell, because they are accelerators. Because you actually offer more than just GPUs.
Armando Acosta: Yes.
David Nicholson: But when you talk about that level of diversity and that level of choice, Dell’s often looked at as kind of the Switzerland of AI: let’s have the customers tell us what they need, fit for function. You believe there’s room for a variety of suppliers in all of these areas. But there’s always a benefit to standardization in a certain way.
Armando Acosta: Oh, yes.
David Nicholson: Right? So what about on the direct-liquid-cooling side, who makes the cooling? If it’s a series of partners, okay, fine. We don’t have to talk about each of them, but on the direct-liquid-cooling side, does that vary from product to product? What does that look like? Or is that a Dell bespoke thing?
Armando Acosta: Yeah, no, let’s talk about direct liquid cooling, because you’re right. It’s the wild wild west right now. You’ll go talk to different partners: they design their cold plates differently, they design the electronics differently, the direct liquid cooling manifolds are different, the quick disconnects are different. And really what we’re trying to drive now is standards into direct liquid cooling. So we announced the Sensei, our integrated rack solutions, where we not only fully rack, stack, and cable, but we also include the CDUs. We include the manifolds, and essentially what we do is test and burn that in in our factory and roll that into your four walls. And so it’s ready to go and it’s not a science experiment. It’s going to work.
David Nicholson: Okay.
Armando Acosta: Now the other big thing that we’re doing is we’re actually working with all our partners, and we are trying to drive OCP standards into direct liquid cooling. There’s a gentleman named Tim Shedd, a really bright gentleman on our CTO team. He’s worked hand in hand with each of our partners to say, Hey, here’s essentially the spec we want to work to, and here are the standards we want to work to. And oh, by the way, if you follow these standards, we’re more easily able to integrate your CDUs, your manifolds, your quick disconnects into our architecture. So it’s a symbiotic relationship, a win-win for both of us, because they want to sell their gear and we want to sell fully integrated rack solutions. And so what we’re really doing right now is building that partnership and making sure that we drive standards, and not only that, that we drive up quality as well.
David Nicholson: Yeah. Yeah. Wild wild west is a great way to describe it. Over the last three years, just at this SC conference, watching the development of liquid cooling technology, we are all excited by what the latest GPU is, the latest LLM. But seriously, in the last three years it looks like we’ve gone from people trying to sell pool pumps to actual technology that looks like it was designed first for data center use. And there’s still a staggering array of standards and ways to do this. If you were advising me on building a data center that I wanted to be future-proof, if I said, “Armando, I want my data center to be future-proofed,” once you’ve stopped laughing because the question is so ridiculous… I mean, seriously, how much power per rack do I need to have in terms of a drop to at least get me through the next two or three years?
Armando Acosta: Well, let me tell you a story. So when we look at what we have out on our show floor, we were the first ones to actually ship a GB200 in a 21-inch rack. But when you look at essentially that 21-inch rack architecture, this is where we’re driving the standards, working with OCP. The first OCP version we looked at was version two, and that version two essentially could handle up to 200 kilowatts per rack. We started to look at the roadmaps-
David Nicholson: Nobody has that. Nobody has 200 kilowatt racks.
Armando Acosta: There’s some people in there.
David Nicholson: Are there?
Armando Acosta: Yeah.
David Nicholson: But I mean seriously, if you throw a rock and hit a data center, they’re going to go, we still have 10, 20, 30 kilowatts per rack.
Armando Acosta: Yeah, that’s the paradigm, right? This is where you’ve got to say, when I future-proof…
David Nicholson: Okay.
Armando Acosta: So going back to my story here: when we saw that 200 kilowatts and then we saw the roadmaps, we actually said, Hey, let’s go to Open Rack version three, and version three will actually allow you to go to 480 kilowatts in that rack. Now, why did we choose 480 kilowatts? It’s because of just what you talked about. In the next two to three years, you are going to see CPU and GPU TDPs rise. NVIDIA, when they’re coupling an Arm CPU with an accelerator, those boards are getting bigger, and there’s more power there. And so the reason that we went to that 480 is because we want to help future-proof for our customers. And so if you tell me, Hey, today I’m bringing 15 to 20 kilowatts per rack, I’m going to tell you, you probably need to double it in the next year. Now, if you’re one of these big tier two CSPs, they’re actually already running 100 kilowatts per rack. We have some big Department of Energy customers that are doing 200 kilowatts per rack.
So we’re there, man, and it is happening. And so that’s why we went to 480. The other beautiful thing about this rack infrastructure is, hey, you can start with the sled today, but when there’s a next generation of AMD processor, guess what? You just swap out the sled: the infrastructure stays the same and you just replace the compute tray. And oh, by the way, we’ve already put in the headroom for you to do that, and it’ll live for three generations. Because that’s the other big thing data centers have told us: Hey, I’m going to go build this data center. You’re telling me it’s the right thing, but once I implement this rack infrastructure and this direct liquid cooling, once I essentially have my water flowing from overhead or from the bottom, I don’t want to change it again, because this is hard. So Dell, make this easy for me. And so that’s what we’re trying to do: put an easy button on building that rack infrastructure and let it live for two to three generations.
David Nicholson: So if you step far enough back from this kind of modern data center architecture, what you see is electricity coming in, heat being generated, heat being dissipated, and then at the other end, there’s an opportunity at some point to recover energy, leverage that energy somewhere else. I know you can neither confirm nor deny the existence of a nuclear energy division of Dell, so we won’t talk about where the power is going to come from. I imagine a lot of folks standing in the parking lots of data centers holding an electric cord, kind of looking around puzzled, like okay, we got 480 kilowatt rack capability, but no power. But are you seeing yet the questions about, Hey, what can we do with this heat that’s coming out of the liquid cooling environment? Or do you think, is that a little further down the road?
Armando Acosta: I think it’s a little further down the road, but I think you are going to see that heat dissipation and how do we turn that heat dissipation into some form of energy, right? Hey, maybe you put it in your cold loop, maybe you use it to heat other parts of your organization. Maybe you turn it into heat so that you can heat your buildings. There are different things and different ideas going around that, and we are working on some ideas. But the other big thing to think about is that with direct liquid cooling, if you just cool the CPU and the GPU, you’re only solving about 85% of the heat problem. What we’re doing is using a hybrid approach: we’re going to use fans, with efficient fan algorithms, to push that other 15% out.
Now, the reason that we do that hybrid is because, hey, if you go direct liquid cooling on your memory, you do it on your voltage regulators, you do it on your CPU and GPU, that’s a lot of copper that goes into the system. It drives up cost, it essentially drives up more resources, and your serviceability goes out the window. And so what we’re trying to do is do the smart thing and combine it with air: hey, if you do direct liquid cooling on just the CPU and GPU and use air for the rest, you’re not going to have to put in more components and more heat dissipation. So it’s a much more efficient way to do that.
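To put rough numbers on the hybrid split described above, here is a minimal illustrative sketch. The 85/15 split comes from the conversation; the total rack power figure is purely an assumption for illustration, not a Dell specification.

```python
# Rough split of rack heat between the liquid loop and air, per the hybrid
# approach described above. The 85/15 split comes from the conversation;
# the total rack power is an illustrative assumption only.
rack_power_kw = 100.0    # assumed total rack draw (not a Dell figure)
liquid_fraction = 0.85   # share of heat captured by cold plates on CPUs/GPUs

liquid_kw = rack_power_kw * liquid_fraction
air_kw = rack_power_kw - liquid_kw
print(f"Liquid loop removes ~{liquid_kw:.0f} kW; fans and air handle ~{air_kw:.0f} kW")
```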
David Nicholson: So you see a hybrid approach?
Armando Acosta: We’re going to take the hybrid approach. Others have not, but we do believe the hybrid approach is the way, because guess what? The more copper you put in there and more cooling plates you get in there, the more power you’re consuming.
David Nicholson: To me, it makes sense intuitively, because the engineer in your head says, oh, wouldn’t it be cool if we had everything completely surrounded in copper. In fact, at one end of the spectrum it’s let’s dip these things in vats of oil, because they would argue complete, perfect coverage of absolutely everything. But there’s always a trade-off. And so it just seems like classic Dell to look at it and go, okay, what if we could… Let’s look at the 80/20 rule. Let’s see what the most efficient combination is. And so in some cases, people will still be using only air cooling because they’re not trying to get the maximum density possible. But today, maximum density possible equals direct liquid cooling. No ifs, ands, or buts.
Armando Acosta: No ifs, ands, or buts. Yes, sir. You got it.
David Nicholson: You’re doing it. The folks who have been dragging Cat-5 cable, do we just all of a sudden put hoses in their hands and assume that they’re home gardeners and they can do this stuff? I mean, do you have certification programs for folks on the cooling part? What does that look like?
Armando Acosta: Great question. Yeah, so we’ve announced these L11 rack integration services, right? We talked about it at OCP, and we’re talking about it at SC24. With these rack integration services, we actually do a site assessment with our customers. Just so you know, we’ve done some large deployments over the last year, and we took those learnings and developed them into a service so that we can essentially have the best outcome for our customers. In that site assessment, we go to the site and we look at essentially: how many AC bus bars do you want to bring per rack? How much power per rack can you bring? Okay, what is essentially your liquid cooling like? What are your flow rates? Essentially, what size CDU do we need to put into this? Okay. How do you want to design your hot and cold aisles?
We go through the whole assessment before we actually go in and implement it. Here’s the deal. If you do not do that, you’re going to have some big gotchas when you get onsite, right? Oh, hey, I put the CDU at the bottom of the rack. Oh, they’re actually bringing in water from the top of the rack. Oh, that doesn’t work. Now I’ve got to go onsite and I’ve got to move my CDUs, got to move all my compute trays, got to remove all my power shelves. It’s not a pretty thing, and you don’t want your customer to see that sausage making. And so if you do that initial site assessment upfront, you solve all those gotchas. Essentially, we plan it out, we do a T-minus schedule, and then, Hey, when you’re ready to go and this is your production date, we’re going to make sure we hit that for you.
David Nicholson: Do you see issues with true IT data center veterans sort of having the belief that they can kind of rack and stack it on their own, and then coming to grips with the added complexity? Because everyone tells me, no, no, it’s way more complex than it used to be. So it really is kind of an emerging frontier. So if you had to predict… Now, I’m not talking about product releases or anything else. If we were to sit here five years from now, and I know that’s an infinite amount of time in terms of IT, where do you think we go? Are we headed toward just doubling, tripling, quadrupling density in the same direction? Or in your mind, are you imagining something different?
Armando Acosta: We are imagining something different. In five years I probably won’t have as much hair as I have right now, so we don’t want to think about five years from now. But no, what we’re imagining is we are looking at different ways to deliver power to these systems. So if you look at the architecture that we’ve just announced with the IR7000, the 21-inch rack infrastructure, we’ve actually taken the power supplies out of the chassis, and we use external power shelves. Now, here’s the reason why. If you have four power supplies in a chassis and you’re trying to run eight GPUs and one of those power supplies goes down, guess what happens? You’ve got to throttle and you lose performance. And that’s the number one thing customers hate when they’re running large AI jobs, and also when they’re running large HPC and simulation jobs.
David Nicholson: When that GPU was tens of thousands of dollars just for the-
Armando Acosta: Exactly. And so what we’ve done is said, okay, well hey, let’s actually build external power shelves, and in those external power shelves I actually put six power supplies. You could still lose two power supplies in that power shelf, and I still give you full performance. Not only that, if you want more redundancy, I can put in more power shelves, and you could lose a whole power shelf and I’m still going to give you full performance. Go look at other architectures out there. They’re not doing the external power shelf. And not only that, that external power shelf allows you to bring 480 directly into the rack so that you don’t have to step it down to 240, and you don’t strand any power.
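To make the redundancy argument concrete, here is a minimal sketch of the failure math described above. The supply counts (six per external shelf versus four in-chassis, riding through two failures) come from the conversation; the per-supply and rack load wattages are hypothetical placeholders, not Dell figures.

```python
# Illustrative sketch of the power-shelf redundancy described above.
# Supply counts come from the conversation; wattage figures are hypothetical.

def can_run_full_speed(load_kw: float, psu_count: int, failed: int,
                       psu_kw: float = 33.0) -> bool:
    """True if the surviving power supplies can still carry the full load."""
    return (psu_count - failed) * psu_kw >= load_kw

load_kw = 132.0  # hypothetical draw of an eight-GPU node group
print(can_run_full_speed(load_kw, psu_count=6, failed=2))  # True: external shelf rides through
print(can_run_full_speed(load_kw, psu_count=4, failed=1))  # False: in-chassis case must throttle
```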
David Nicholson: You just say 480 like it’s just this casual thing. 480.
Armando Acosta: Yeah, that’s crazy.
David Nicholson: 480 kilowatts. Yeah, that’s like a village’s worth of electricity. So what I’m hearing, Armando, is that five years from now, all of the stuff that goes on in a data center will fit in my pocket and it won’t make me warm. That’s awesome. No, it is great to hear that Dell is thinking ahead of this and not just responding to the madness.
Armando Acosta: And I mean, the other big thing is sustainability. You’ll hear that a lot from us; we’re looking at how we make AI sustainable. When you look at these big new systems, they do the work of essentially two or three racks of old gear, right? And so if you think, Hey, I can buy two or three of these eight-GPU systems, and they do the work of 10 to 20 GPU systems from two to three generations ago, well, guess what? You are saving power, because instead of having two or three racks, now you only have one rack. Now that one rack might draw more power, but if you had each rack drawing 50 kilowatts, and now you only have one rack drawing 100 kilowatts, well, guess what? You did save some power, and that is sustainable.
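As a back-of-the-envelope check on the consolidation point above, using the figures from the conversation: replacing two or three 50-kilowatt racks of older gear with a single denser 100-kilowatt rack still nets out to less total power.

```python
# Back-of-the-envelope consolidation math using the figures from the conversation.
old_racks = 3            # racks of previous-generation gear being replaced
old_rack_kw = 50.0       # draw per older rack
new_rack_kw = 100.0      # draw of the single denser rack replacing them

old_total_kw = old_racks * old_rack_kw
savings_kw = old_total_kw - new_rack_kw
print(f"Before: {old_total_kw:.0f} kW, after: {new_rack_kw:.0f} kW, saved: {savings_kw:.0f} kW")
```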
David Nicholson: Fantastic. Armando, thanks so much for being with us here. For Six Five On the Road, I’m Dave Nicholson, and everyone should remember that when Dell tells you you are dense, it’s a compliment. Thanks for joining us from SC24. Again, Armando from Dell. Thanks so much.
Armando Acosta: Thank you.
David Nicholson: Stay tuned.
Author Information
David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.
David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.
Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.