AI workloads are exploding. Is your network ready? David Nicholson is joined by Dell Technologies’ James Wynia, Director of Product Management, Networking, and Broadcom’s Hemal Shah on this episode of Six Five On The Road at SC24 for a discussion on the INSANE speeds and feeds driving the future of AI and supercomputing ⚡
Their discussion covers:
- Building AI fabrics with Dell PowerSwitch and Broadcom Thor 2
- The collaboration between Dell and Broadcom to advance AI infrastructure
- Key features of the Dell PowerSwitch and Broadcom Thor 2 and their impact on AI workloads
- Challenges & solutions in scaling AI operations
- Insights into the future of AI infrastructure and networking technologies
- Best practices for deploying AI fabrics in varied enterprise environments
Learn more at Dell Technologies.
Watch the video below at Six Five Media at SC24 and be sure to subscribe to our YouTube channel, so you never miss an episode.
Or listen to the audio here:
Disclaimer: Six Five On The Road at SC24 is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
David Nicholson: Welcome to SC24, the SuperComputing conference to end all SuperComputing conferences. We are here in Atlanta, Georgia, and the thing about SuperComputing is that it doesn’t tend to happen very often within what you think of as a single computer. In fact, the TOP500 list of supercomputers was announced this week, and the top system had over 11 million CPU cores, 11 million with an M. All of those had to somehow be connected together. We have interconnects on the motherboard, and then we have networking that attaches all of these things together.
So networking is a critical part of the equation. In fact, you might say that in SuperComputing, the network is the computer. And we have two fabulous gentlemen to talk about networking. Jim from Dell. Hemal from Broadcom. Let’s talk about how we connect these things together, because without what you do, there would be no supercomputer. There would just be, like, regular computers, right? Paperweights, effectively. Paperweights. So start us off, Jim. Tell us a little about what you do at Dell, how the partnership with Broadcom around networking works, and what’s going on in the world of networking and SuperComputing.
James Wynia: Absolutely. This is just a crazy exciting time in the networking field as AI has just taken off so prolifically in the last 12, 18 months. It’s been crazy. And honestly, on the networking side, there’s always been a little love, but all of a sudden we’re getting crazy love because the speeds are just going through the roof and the network has to keep up.
David Nicholson: Can I just get a fist bump for us hardware guys?
James Wynia: Absolutely.
David Nicholson: It’s been a little too much of this. Just give me the service. Don’t tell me what you’re doing behind the scenes.
James Wynia: Yeah, okay. Oh, you’re networking. Oh, okay.
David Nicholson: Yeah, exactly. We’re getting our day in the sun. Continue.
James Wynia: Absolutely. So it’s an exciting time to be in the industry and working with strategic partners who are at the front edge, the cutting edge. And so my role, you asked about, I’m the product manager for Dell Networking, the hardware portfolio, and I’ve been working with Hemal and Broadcom for I can remember at least 15 years. He remembers even longer. So yeah, so exciting.
David Nicholson: That’s good. That’s good. Glad someone has a good memory here at the table, as they say. I can hide my own Easter eggs at this point. Hemal, let’s get straight to it. You’ve got show and tell. What is the leading edge from Broadcom’s perspective in terms of networking? What do you have here to show us? Talk about what you’re doing.
Hemal Shah: Yeah, so I’ve been the architect of our networking products. I have been with Broadcom almost two decades, and it’s always exciting. With AI, the excitement has gone even higher, with scale that we never thought about, a million GPUs connecting in the near future.
David Nicholson: Yeah. Crazy.
Hemal Shah: So what we have is, what I’ll say is, Ethernet as a fabric, and I’m going to show that we have two types of fabric solutions. So this is the scheduled fabric we call Jericho3-AI. What it does is the fabric provides all the congestion control, reliability, and other services, and then endpoints will connect to this. And all the services provided by the fabric are completely transparent to the end user.
David Nicholson: So this would be in a chassis that is a switch.
Hemal Shah: That is a switch.
David Nicholson: Okay. Okay. And just to be clear, because you said 20 years back, so the original thing you worked on was the tin cans with strings between-
Hemal Shah: Yes. Gigabit Ethernet is where it started.
David Nicholson: So we have the switch, and the switch here connects to those endpoints.
Hemal Shah: So this is the scheduled fabric. Then there is another type of fabric solution, which is our industry-leading Tomahawk 5. This is 51.2 terabits per second in a single chip. So with two tiers of this, you can build a cluster of thousands of GPU nodes or accelerators. And then once you build all this fabric and you get all the load balancing, congestion control, and telemetry services from the fabric, you also need the endpoint, which is the NIC. This is our 400 gig RDMA NIC.
David Nicholson: Right. Network interface card. Yeah.
Hemal Shah: This is where all the transport functions are executed by the NIC. All the endpoint congestion control and the ability to move data between GPU memory from one location to other location is facilitated by this.
David Nicholson: Explain again the difference between these two networks. You said the scheduled network.
Hemal Shah: This is the scheduled network.
David Nicholson: The scheduled network. What does that mean exactly?
Hemal Shah: So the network itself has a built-in capability where it’ll schedule traffic to be congestion free. So the congestion control is built into the fabric.
David Nicholson: Okay.
Hemal Shah: That’s one piece. The QoS comes with that, and it’ll also take Ethernet packets and convert them into cells, and the whole cell routing and everything is completely transparent.
David Nicholson: Okay. And you mentioned Ethernet. What’s the state of the art in terms of bandwidth or throughput that’s available? Speeds and feeds.
Hemal Shah: Speeds and feeds. This switch can support a single port of 800 gig. And in a single switch, you can have up to 64 ports of 800 gig or 128 ports of 400 gig.
David Nicholson: Okay. And the NICs?
Hemal Shah: This NIC is 400 gig.
David Nicholson: 400 gig.
Hemal Shah: A single port.
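The speeds and feeds above are easy to sanity-check. Here is a minimal back-of-the-envelope sketch, assuming a non-blocking two-tier leaf-spine design (one reading of the "two tiers" Hemal mentions) in which each leaf switch splits its ports evenly between downlinks to NICs and uplinks to spines; the function names and the sizing formula are illustrative, not a Broadcom reference design.

```python
# Sanity-check the Tomahawk 5 numbers from the discussion.
# Assumption (not from the transcript): a non-blocking two-tier
# leaf-spine fabric where each leaf uses half its ports as downlinks,
# giving at most (radix/2) * radix endpoints.

CHIP_CAPACITY_GBPS = 51_200  # Tomahawk 5: 51.2 Tb/s per chip

def ports_at(speed_gbps: int) -> int:
    """Port count (radix) of the chip at a given port speed."""
    return CHIP_CAPACITY_GBPS // speed_gbps

def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine: radix^2 / 2."""
    return (radix // 2) * radix

print(ports_at(800))            # 64 ports of 800 gig, as stated
print(ports_at(400))            # 128 ports of 400 gig, as stated
print(two_tier_endpoints(128))  # 8192 endpoints at 400 gig: "thousands of GPUs"
```

At 800 gig the same formula gives 2048 endpoints, which is consistent with the "thousands of GPU nodes" claim for a two-tier build.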
David Nicholson: State of the art. Okay. If you turn that over... did I see a little hint of the other side? The other side? Okay. Yes. So each of these little dots has to connect to something else.
Hemal Shah: Correct. These are the pins coming out. And you can think about it this way: with 128 ports, we are talking about 128 SerDes lanes coming out, or more than that, depending on port speed.
David Nicholson: So Jim, admit it, reverse engineered alien technology.
James Wynia: You know, it starts to make sense now.
David Nicholson: So what’s interesting from a Dell perspective, and I want to ask you this, pretend Hemal’s not here for a second. You support all sorts of networking technologies.
James Wynia: We do.
David Nicholson: I mean, you answer to your customers. What’s different about your relationship with Broadcom in terms of Ethernet? I know there’s this thing called the Ultra Ethernet Consortium.
James Wynia: Yeah. I’d love to hear that.
David Nicholson: Yeah. What’s the story with that? What’s the real link between Broadcom and Dell on this stuff?
James Wynia: So every kind of networking technology partner has its own story. And the thing that we love about pure Ethernet and our partnership with Broadcom is that we are true collaborators. It’s not just “do this and take this.” No, we have a conversation: this is what we’re seeing. Oh, really? And things evolve together, as it were. And so we’re not just told, “This is what you have, thou shalt sell it.” It’s more of a, “How can we do better? What can be improved? What are your customers saying? And let’s both go together and talk to a customer.” Because we trust them and they’re always there for us. So that’s a big part of it, certainly.
Technology-wise, let’s talk about Ethernet for a second. As a whole, Ethernet is absolutely the dominant networking type. It’s the one that won, the same way Intel won back in the day: we used to buy workstations in all these different form factors, and eventually now we buy servers. That’s where we ended up. At the same time, Ethernet won. So are there other options out there? Sure, of course there are. But at the same time, there’s no doubt that if you stack up all the networks in the world, there’s going to be this much Ethernet, and the closest competitor is going to be way over here, in very niche kinds of solutions.
But general purpose is not a curse either. Sometimes people say, “Oh, we’ve got to be watered down by doing all this stuff.” No, it has only gotten better and more efficient over time. So we look at the Tomahawk 5 sitting here, which is an amazing chip. It completely raises the bar for what you need from Ethernet in a reliable, high-capacity AI, ML, HPC solution. Congestion management, link control, transport, all those things. And they just keep getting better. The analogy that I really like: McLaren is a Dell customer, as we all know. They’re very vocal and they have an amazing car, and you can see it today. Next year they’ll have a whole new car. It’ll be better, it’ll be faster, it’ll be more efficient. It’s not like what they have today is rubbish. No, it’s an amazing car, but there’s always more you can do. And so I look at Ethernet the same way. It has come so far. It is the Formula One car today. Yeah, there’s more that we can do. And as we get into Ultra Ethernet, there’s some great stuff coming there, and some of those inspirations for Ultra Ethernet we’re already using, but the whole industry will be able to participate at that point.
David Nicholson: What I love about the collaboration between your companies is that I know that from Broadcom’s perspective, Broadcom sees the success of AI depending upon scalability, performance, power efficiency, all of those things. And openness. And Dell, at the same time, is, I like to say, Switzerland from an AI perspective: whatever customers need, they’ll provide. But more importantly, the fact that Jim, you and I will still be friends next year, even after Lewis Hamilton wins the driver’s Championship in a Ferrari, we will still be friends. We can still wear our papaya shirts and toast to Lewis Hamilton’s win. But on another note, back to the SuperComputing conversation, a lot of the conversations we’ve been having this week have been around this idea: you have electricity coming in, and the systems that we all build, what do they mainly do with it? They mainly generate heat.
James Wynia: Love that.
David Nicholson: Right? That has to be dissipated. Then what do you do with that? So what are you doing from a power consumption perspective? Because you’re at the foundation of these systems that are being built out. The network is, and Broadcom in particular, how much do you pay attention to how much power is consumed, therefore, how much heat is generated?
Hemal Shah: Right.
David Nicholson: You doing anything about that? Or is this the first? Is this first you’re hearing about this issue, Hemal?
Hemal Shah: We have heard about this with every generation we came out with. Let’s start with the NIC. We always design a very power-efficient NIC. One of the goals behind doing this is that when you have a fixed-function device, it tends to give you the lowest power, rather than trying to put general-purpose logic inside, unlike DPU-style architectures. That’s one thing. Same thing on the switch side. When we take it to a higher speed, we always look at what the power per port is, and we tend to have a metric. Even if you go to an 800 gig switch device, you want to have less than one watt per port. That kind of metric helps. In addition to that, for what connects all this, the cables and optics, we’re working on linear pluggable optics, which brings down the cost and power. We’re also doing co-packaged optics, which also results in low power. So yes, we do pay a lot of attention to power, and at the same time, we have to make sure that at that power we can still serve the high speed, because these networks are always on. It’s not like we are saving power by keeping them idle. Even at full load, we are still consuming the lowest amount of power.
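The power-per-port metric Hemal describes reduces to simple arithmetic. In this sketch, the chip wattage is a made-up placeholder, not a Broadcom specification; only the under-one-watt-per-port target comes from the discussion above.

```python
# Illustrative sketch of a power-per-port budget check.
# The 60 W figure below is hypothetical, chosen only to show the math.

def watts_per_port(total_chip_watts: float, num_ports: int) -> float:
    """Average power attributable to each port."""
    return total_chip_watts / num_ports

def meets_budget(total_chip_watts: float, num_ports: int,
                 budget_watts_per_port: float = 1.0) -> bool:
    """Target from the discussion: under ~1 W per port for the switch device."""
    return watts_per_port(total_chip_watts, num_ports) < budget_watts_per_port

# A hypothetical 60 W device with 64 x 800 gig ports:
print(watts_per_port(60.0, 64))  # 0.9375 W per port
print(meets_budget(60.0, 64))    # True: under the 1 W/port target
```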
David Nicholson: Now, a very basic question about the connectivity. Say I’m looking at the back of a rack: is it copper coming out, or is this all optical? Where do we have copper connections versus optical connections today, in the modern configuration?
Hemal Shah: So today, if you look at the 400 gig, 800 gig kind of connections within the rack, you can still use copper. With our SerDes being on the switches and NICs, you can have extended distance, so your adjacent racks can also use copper. Once you go beyond four or five meters, that’s when you start seeing the optics. So we have seen people, of course working with a partner like Dell, build a three-rack solution where copper connects the nodes, but then when you go out from there to end of row, that’s where you start seeing the optics.
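The copper-versus-optics rule of thumb Hemal outlines comes down to a distance threshold: copper within the rack and to adjacent racks, optics beyond roughly four or five meters. A minimal sketch with an assumed five-meter cutoff; real reach depends on port speed, cable gauge, and SerDes capability.

```python
# Media selection by cable run length, per the rule of thumb above.
# The 5 m cutoff is an assumption taken from the "four or five meters"
# figure in the discussion, not a datasheet value.

COPPER_REACH_METERS = 5.0

def choose_media(distance_m: float) -> str:
    """Pick copper (DAC) for short runs, optics beyond the copper reach."""
    if distance_m <= COPPER_REACH_METERS:
        return "copper (DAC)"
    return "optics"

print(choose_media(2.0))   # in-rack or adjacent rack
print(choose_media(20.0))  # end of row
```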
David Nicholson: Okay. And Jim, do you work with the networking software also, or are you primarily on the hardware side?
James Wynia: I get my fingers in the software as well.
David Nicholson: Okay. So what does that look like in terms of, you have, let’s say that you’re building out a massive AI cluster and you’ve decided to go with Ethernet for the benefits that it brings. What about the software? Is the software as open? What does that look like from a Dell perspective?
James Wynia: That is a really important question. And from the Dell perspective, I mean, every vendor is going to have their own operating system, their own story. From the Dell perspective, we have decided to focus exclusively on open source based solutions. We use an operating system called SONiC.
David Nicholson: SONiC. Okay, yeah, yeah.
James Wynia: Yes. And that has roots back to Microsoft, and they use it exclusively in all of their mega datacenters. And so we worked with them years back and really understood the power of it. And there’s a whole community contributing to it. And you’re starting to see a lot of big companies getting on the bandwagon, starting to do press and put in some amount of effort. We are all in. For us, as we move forward, it’s all or nothing. And so we see it as very important. We actually partner with Broadcom to help make sure we get the right SAI development happening, and working with them, we make sure that we have the full, robust solution that our partners can go to market with confidence.
David Nicholson: So where’s the bottleneck today? Is it networking? Is it the accelerator? Is it the CPU? We chase it around, decade after decade, whack-a-mole, chasing the bottleneck. Where is it now? Can we saturate a network like this with what’s coming out of these processors at this point? Or do we have a little headroom? Or is it, I know, all together now: it depends.
James Wynia: Right, there is some amount of that, but there are some generalities. It’s a very interesting question, actually, because it used to be that the networking part was so far ahead that we just never got complaints. And honestly, we work with the server team and the storage team. They were like, “Oh, you’re still using that? Are you serious?” But that’s not the case anymore, because of GPUs and the high-capacity NICs that have come out from the likes of Broadcom. Thank you. They’ve made it so that they’re basically stepwise about the same now. I mean, we are releasing 800 gig ports, so our fabrics can be based on 800 gig, and the server today is 400, but within six months they’re at our heels. And it’s never been that way historically. And so you’re right, we have to be very careful to make sure we keep enough in front of them, so that when they assume it’s going to work, they assume the NIC is going to be there, they assume the connectivity is going to be there: solid, tested, and grounded. And it’s definitely more of a challenge now than it was even two years ago.
David Nicholson: Okay. Interesting. Final question, Hemal. Are there any pesky constraints associated with physics that are looming on the horizon here? Or is the main concern, frankly, the fact that these things that you build that go into systems are part of an environment that’s becoming increasingly dense and hot? Are thermals the main concern or is there something else that’s going to limit this doubling every period? What are we faced up against?
Hemal Shah: For the near term, I don’t see physics preventing it. We will keep doubling. To tell you the truth, frankly, what I see is on the system side: as this gets to large-scale, complex systems, software-level libraries and tuning are where we need even more effort, because how are they going to take advantage of all of this, all the way from the top layer?
David Nicholson: So gentlemen, I think we can all agree it’s the software guys’ fault. We’re throwing them under the bus. No bottlenecks here in hardware land. Hemal from Broadcom. Thanks for being here. Jim from Dell. Thank you very much.
James Wynia: Thank you very much, Dave. We’ve enjoyed being here.
Hemal Shah: Thank you, Dave.
David Nicholson: Absolutely. And for myself, Dave Nicholson, thank you so much for joining us here on Six Five On The Road at SuperCompute 24. Stay tuned.
Author Information
David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.
David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.
Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.