
Rack-scale Compute & Rack-scale Management – Six Five On The Road at SC24

The complexity of rack-scale AI infrastructure is staggering, but Dell’s expertise and build-to-order approach make it manageable. Host David Nicholson is joined by Dell Technologies’ Eric Kuzmack, Senior Product Planner, AI Compute Systems, on this episode of Six Five On The Road at SC24 for a conversation on the latest in rack-scale compute and management innovations.

Here’s the TL;DR:

  • Forget DIY. A 100-kilowatt rack is the equivalent of a 130-horsepower engine at redline, and these racks go up to 130 kilowatts and need serious liquid cooling. We’re talking 3,000 lbs of hardware with complex network and cooling interconnects.
  • Data centers need a MAJOR upgrade. Legacy facilities often lack the power, cooling, and even door size (!) to handle these beasts.
  • Dell is leading the charge with customizable rack-scale solutions. They’re helping enterprises design bespoke AI pods tailored to their exact needs and future growth.

Check out the full conversation ➡️

Learn more at Dell Technologies.

Watch the video below at Six Five Media at SC24 and be sure to subscribe to our YouTube channel, so you never miss an episode.

Or listen to the audio here:

Disclaimer: Six Five On The Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Transcript:

David Nicholson: Welcome to SC24, the Supercomputing Conference here in Atlanta, Georgia. I’m Dave Nicholson with Six Five On the Road, and I am joined here in Dell’s massive presence on the show floor at SC24 by Eric from Dell. Eric, welcome.

Eric Kuzmack: Good morning.

David Nicholson: We’re here to talk about something that is particularly interesting in the space that is high-performance computing, supercomputing, all stuff related to AI, and that is the idea of rack-scale computing and integration. I kind of want to set the stage by saying that I have built my own gaming PC before, so I think that I should be able to rack and stack my own AI gear. What do you say?

Eric Kuzmack: That’s probably a bad idea.

David Nicholson: Okay. Tell me why.

Eric Kuzmack: Well, your gaming computer probably weighed 30 pounds and was air-cooled, and most of these AI systems weigh about 250 pounds, and many of them are liquid-cooled. And the kind of liquid cooling that you get with your home computer if you do a liquid-cooled home computer is about this much worth of capacity when you need this much worth of capacity.

David Nicholson: So this isn’t simple racking and stacking like we may be used to, even in the enterprise data center context?

Eric Kuzmack: Exactly. In a standard enterprise data center, you rack the compute, you plug in two power cords and two or three network cables, and you’re done. And each one of those systems only generates a couple of kilowatts’ worth of heat and consumes about that much power.

David Nicholson: So in that context, when we think of Dell and enterprise IT, people often think of storage subsystems or of the servers themselves. But you’re saying that this rack-scale integration of network, storage, compute, and, where necessary, water cooling is something you do with parts from a whole variety of partners. Talk about that integration and validation process.

Eric Kuzmack: Yeah. So within the server, you’ve got integration between CPUs and GPUs from a variety of vendors, and because of the weight and thermal requirements of these systems, a whole lot of engineering goes into everything from making sure the metal doesn’t bend to making sure the thing doesn’t burn up. There’s a lot of thermal engineering that goes into just designing these servers. And then when you take 8, 9, 10, 12 of those and put them in the relatively small space of a rack, you have to have a lot of cool air coming in, and you have to take the hot air coming out and do something with it other than just eject it into the data center for an air conditioner to take care of. That works when you’re talking about 10 or 15 kilowatts per rack.

David Nicholson: What are we talking today though?

Eric Kuzmack: We’ve got racks that range anywhere from about 50 to 60 kilowatts all the way up to 130 kilowatts. Once you get above that 40-to-50-kilowatt limit, you have to do liquid cooling. Generally, you’ll also need rear door heat exchangers so that you can get all of that heat rejected to liquid and then sent out of the data center to be cooled.

David Nicholson: Yeah. And for folks, when you hear 50 or 60 kilowatts, that may not mean much on its own: each kilowatt is roughly equivalent to 1.3 horsepower. So a 100-kilowatt rack is like a 130-horsepower engine running at redline.

Eric Kuzmack: Right.
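For readers who want to check that math, here is a minimal arithmetic sketch in Python; the 1.34 factor is the standard kilowatt-to-mechanical-horsepower conversion that Nicholson rounds to 1.3.

```python
# Back-of-the-envelope conversion of rack power draw to "engine" horsepower.
KW_TO_HP = 1.341  # 1 kW is about 1.34 mechanical horsepower (the "roughly 1.3" above)

def rack_power_in_hp(rack_kw: float) -> float:
    """Express a rack's electrical draw as an equivalent horsepower figure."""
    return rack_kw * KW_TO_HP

for rack_kw in (15, 50, 100, 130):
    print(f"{rack_kw:>4} kW rack ~ {rack_power_in_hp(rack_kw):.0f} hp")

# 100 kW comes out to roughly 134 hp, the "130-horsepower engine at redline"
# in the conversation; a 130 kW rack is closer to 174 hp.
```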

David Nicholson: You think about the power. Is a 42U rack still standard height in what you’re seeing?

Eric Kuzmack: No. What we’re seeing now is 48U and 52U racks, because you need the extra height for all of the networking that goes into these racks. Your typical enterprise server has two or three network cables; your typical AI server can have anywhere from nine to as many as 12.

David Nicholson: Including the liquid cooling?

Eric Kuzmack: No, that’s just networking.

David Nicholson: Okay. Okay. Just network.

Eric Kuzmack: And then on the liquid side, there are no industry standards for liquid cooling right now. It’s really in its infancy. So for example, on our recently announced XE9680L and XE9685L, there are 12 cooling hoses that come out the back that have to be connected correctly to the manifolds. We get beyond that with our Integrated Rack 7000, where all of the server nodes slide into the rack and have a blind-mate connection with the DLC manifold. That way you don’t risk human error in connecting the wrong hose to the wrong place. You just slide it in, click it shut, and you know the liquid is going to go where it needs to go.

David Nicholson: I’m thinking of the enterprise data center customer as opposed to the cloud service provider customer. You have data center customers you’ve worked with literally for decades. Some of those data centers have been around-

Eric Kuzmack: For decades.

David Nicholson: … for decades. Is it unusual to enter a space and find that there isn’t enough power dropped in to support the power in a 48U rack with all that stuff? What does that look like typically?

Eric Kuzmack: It’s extremely common. In the typical legacy data center, 15 kilowatts per rack is a lot and 20 is top tier. So when you’re dealing with a server that draws roughly 10 kilowatts, you get one, maybe two in a rack. But it goes beyond just power availability. Can the floor support the weight? One of these fully loaded racks can get close to 3,000 pounds. Can the air handling system deliver the cooling required? Are the doors tall enough to let a 48U rack through? We know in some areas, especially in Europe, they’re not, so we will have 42U rack options, but obviously a shorter rack holds less. In the large enterprise and CSP space, though, we’re seeing everything go to 48 and 52U.
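The site constraints Kuzmack lists, power per rack, floor loading, cooling, and even door height, lend themselves to a simple checklist. The sketch below is a hypothetical illustration using the figures from this conversation; it is not a Dell assessment tool, and the function and field names are invented.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Site:
    power_per_rack_kw: float  # power the facility can actually deliver per rack
    floor_load_lbs: float     # rated floor load per rack footprint
    door_height_u: int        # tallest rack that fits through the doors (42, 48, 52...)
    liquid_cooling: bool      # facility-level liquid / rear-door heat exchange available

def readiness_issues(site: Site, rack_kw: float = 130,
                     rack_weight_lbs: float = 3000,
                     rack_height_u: int = 48) -> list[str]:
    """Flag facility gaps for a fully loaded AI rack, using figures from the episode."""
    issues = []
    if site.power_per_rack_kw < rack_kw:
        issues.append(f"power: {site.power_per_rack_kw} kW available, {rack_kw} kW needed")
    if site.floor_load_lbs < rack_weight_lbs:
        issues.append(f"floor: rated {site.floor_load_lbs} lbs, rack is ~{rack_weight_lbs} lbs")
    if site.door_height_u < rack_height_u:
        issues.append(f"doors: {site.door_height_u}U max, rack is {rack_height_u}U")
    if rack_kw > 50 and not site.liquid_cooling:
        issues.append("cooling: above ~50 kW per rack, liquid cooling is required")
    return issues

# A legacy facility like the one described above fails on every check.
legacy = Site(power_per_rack_kw=15, floor_load_lbs=2000, door_height_u=42, liquid_cooling=False)
print(readiness_issues(legacy))
```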

David Nicholson: Do you have SKUs for rack-scale products, or are these designed in collaboration with individual customers, or both? What does that look like? I imagine that if you’re going into a situation where the power constraints are fixed and they’re telling you you’re not getting more than 30 kilowatts in this rack, period, then that might preclude certain solutions. So when you talk about rack-scale specifically and rack-scale management, is the concept that you make sure it’s all going to work together, or are these prepackaged?

Eric Kuzmack: No.

David Nicholson: What does it look like?

Eric Kuzmack: So we’re very purposefully not doing a single SKU for a rack, because we want to bring Dell’s build-to-order capabilities into this space. We’ll help you do a data center assessment, and then based on the results of that, we can say, okay, you can fit four of those or two of those, or whatever the case may be. Then within that, you decide what you want in the server, picking from all the vast configuration options we have for our servers. And then based on the data center environment: what PDUs do you need? What height rack do you need? Can you use an in-rack CDU? Do you need an in-row CDU? All of that is tailored to the customer. So you won’t be able to go to dell.com, click, click, click, buy now, and get one of these racks.

David Nicholson: But if I’m hypothetically an enterprise data center customer and I’m starting with the equivalent of two racks of stuff, is it probably a best practice for me to then have that be my standard, my personal standard?

Eric Kuzmack: Yes.

David Nicholson: So call it my personal SKU. It probably makes no sense to come in and say, “I’m going to try five different kinds of coolant distribution units.” So effectively, they’re working with you to create their own pod of infrastructure that can then be copied exactly moving forward?

Eric Kuzmack: Exactly right. And even in our sales tools, we follow that flow: design the compute, decide how many compute nodes you want in a rack, design the rest of the rack, then decide how many racks you want. You say 10 racks, and the sales tools automatically calculate all the right numbers.
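The per-rack arithmetic those sales tools automate can be sketched in a few lines. This is a hypothetical illustration, not Dell’s configurator; the roughly 10 kW per-node draw and 130 kW rack ceiling come from the conversation above, and the overhead allowance for switches, PDUs, and CDU losses is an assumption.

```python
import math

def plan_pod(total_nodes: int, node_kw: float = 10.0,
             rack_budget_kw: float = 130.0, overhead_kw: float = 10.0) -> dict:
    """Rough pod sizing: nodes per rack within a power budget, then rack count."""
    nodes_per_rack = int((rack_budget_kw - overhead_kw) // node_kw)
    racks = math.ceil(total_nodes / nodes_per_rack)
    return {
        "nodes_per_rack": nodes_per_rack,
        "racks": racks,
        "total_power_kw": racks * (nodes_per_rack * node_kw + overhead_kw),
    }

# Example: 96 nodes at ~10 kW each -> 12 nodes per 130 kW rack -> 8 racks.
print(plan_pod(total_nodes=96))
```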

David Nicholson: Yeah. So I come from a hardware background, so I generally think of software people as weak, frankly. But I guess there’s something to be said for this idea that if you’re integrating all of these hardware components together, it would be kind of nice to have less than 10 software interfaces to manage it. Do you have rack-scale management software?

Eric Kuzmack: Yes. We have software based on our existing management infrastructure, where you bring together the compute, the networking, where each system sits in the rack, the PDU, the CDU; anything with an ethernet port is managed via this console, and it gives you a per-rack view. And because all of our servers have thermal sensors, we’re able to give a thermal view of the front and the rear of the rack. But that’s all the nice fluffy stuff. What’s really important is that if you’ve got 10 racks’ worth of gear, they’re all liquid-cooled, and one server springs a leak, you want to make sure you shut down the rack that server is actually in. If over the years you’ve moved a server from rack one to rack five and the management system doesn’t know that, then when that server springs a leak, the management system will shut down rack one, where the server is not, and the leak will keep going in the other rack. So it becomes really important to truly understand the rack-scale picture, where everything is and what’s connected to what, so you don’t end up turning off the wrong thing.
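That leak scenario comes down to keeping the server-to-rack inventory authoritative. Here is a minimal sketch of the failure mode, with hypothetical names rather than Dell’s management software.

```python
# Minimal illustration of why accurate rack inventory matters for leak response.
rack_of_server = {"srv-0042": "rack-01"}  # stale record: the server was later moved

def handle_leak(server_id, inventory):
    """Return the rack to shut down when a server reports a coolant leak."""
    rack = inventory.get(server_id)
    if rack is None:
        raise LookupError(f"{server_id} is not in inventory; cannot isolate the leak")
    return rack

# If srv-0042 was physically moved to rack-05 but the inventory was never updated,
# the automation shuts down rack-01 while the leak continues in rack-05.
print(handle_leak("srv-0042", rack_of_server))  # -> 'rack-01' (the wrong rack)
```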

David Nicholson: Yeah, liquid cooling. What could possibly go wrong with liquid in electronic components?

Eric Kuzmack: Exactly.

David Nicholson: What’s the minimum entry point for someone? I can imagine a small to medium enterprise wanting to dabble in something like model training, or I should say fine-tuning. Maybe they want their bespoke data to not be off in the cloud somewhere; they definitely want it on-premises, but maybe they only need an eight-node GPU cluster, maybe 16. At what point does it become rack-scale, and can you start with half a rack with a plan to scale out to a full rack?

Eric Kuzmack: Absolutely.

David Nicholson: Because these things are expensive. And I want to say something clever here, but we’re talking millions of dollars in some cases for a system like this.

Eric Kuzmack: Yes.

David Nicholson: So you can start with less than a full rack with plans to intelligently scale?

Eric Kuzmack: Absolutely.

David Nicholson: Okay.

Eric Kuzmack: Generally speaking, if you only need, say, four nodes, we recommend you just go air-cooled. As you move up in node count, the power consumption and heat grow, and you’ll start wanting to view things by the rack. But we’ll have solutions that go from very small all the way up to massive. For example, the IR7000 platform, which is our OCP ORV3 rack infrastructure, starts at 18 sleds, or 36 nodes, and then depending on the height of the rack, you just add more nodes and more power shelves as needed. It’s almost like a blade chassis: you slide your nodes in, and when the rack is full, you bring in another rack and rinse and repeat.

David Nicholson: Final question for you: If you were to talk to someone who’s been managing a data center, managing IT, really knows hardware, but isn’t aware of what the requirements are moving forward, what are the kinds of things that would surprise them, or what do you think is sort of a surprising thing to understand about this that some people overlook? You talked about things like power density and the massive amount of cooling and the weight. You already touched on some of those, but is there anything else you can think of that people tend to forget about how different the future is than the past?

Eric Kuzmack: The complexity in which these things are all interconnected goes far beyond traditional enterprise networking. You’ve got dedicated GPU fabrics for all the GPUs to communicate with each other. Then you’ve got what we call north/south fabrics for data to get in and out. You’ve got management networks, you’ve got all of your liquid cooling, which is a network of water and coolant. So it’s really a much more complicated thing than just, okay, yeah, I got to run a couple of ethernet cables from here to there.

David Nicholson: Yeah. Very, very interesting. It’s very interesting because one approach could be, we’re coming out with an appliance and it’s going to have a SKU, but what does that do? That eliminates all of the wonderful choice that you’re famous for as Dell democratizes AI, if you will. So this makes a lot of sense. So Eric, I think it’s a fascinating look at the future for HPC and supercomputing in general, but specifically for folks dealing with AI at all, it’s this idea of integrating at the rack scale. It’s what Dell does very well. Frankly, it’s what Dell has done very well for a long time. It’s just gotten more and more complicated. So Eric, you’re going to have a job for as long as you want a job. Congratulations.

Eric Kuzmack: It’s a lot of work to do.

David Nicholson: It gets more and more complicated. AI not taking our jobs. Hopefully AI not taking my job. I figure I’ve got a few weeks. For Six Five On the Road here in the Dell booth at SC24, I’m Dave Nicholson. Stay tuned for more Supercomputing action.

Author Information

David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.

David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.

Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.
