Feel like tech is passing you by? Think again! 💾 Broadcom’s Kim Leyenaar, Principal Performance Architect, and host David Nicholson share a conversation on Six Five On The Road at #SC24 about the surprising relevance of “old school” storage tech like RAID controllers in AI infrastructure. They highlight the critical role of infrastructure in AI performance, from power requirements to liquid cooling, and share insights from Broadcom’s AI Labs.
Tune in for more on these key topics:
- AI is a team sport: GPUs, CPUs, networking, and storage all play a crucial role
- Broadcom’s commitment to open, power-efficient, and scalable solutions
- The cutting-edge AI research happening within Broadcom’s labs
- Kimberly Leyenaar’s journey and insights as a Principal Performance Architect
- The collaboration between Broadcom and its technology partners like Dell Technologies
- Future directions for AI research and applications in the tech industry
Learn more at Broadcom.
Watch the video below from Six Five Media at SC24, and be sure to subscribe to our YouTube channel so you never miss an episode.
Or listen to the audio here:
Disclaimer: Six Five On The Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
David Nicholson: Welcome to SuperCompute ’24 from Atlanta, Georgia. It’s the Supercomputing Conference. I’m here at the Dell Technologies presence with someone from Broadcom, Kim Leyenaar. Kim, tell us something about yourself, your background, what have you focused on over the last few years since you graduated from high school?
Kim Leyenaar: So I’ve been working for Broadcom now for 17 years. I came from the storage side of the house. I’ve been a storage architect, I guess, a little over 20 years, but over the last couple of years, of course, there’s been this shift to AI. So I’ve become an AI architect. So we have lots and lots of servers now available to us in the lab, and I’ve been analyzing all the data flows and data paths for inferencing and for training. So it’s been really exciting.
David Nicholson: So we’re going to talk about that. We’ve titled this segment, Tales From the AI Lab, specifically Broadcom. Now, Broadcom at a high level is all about, in the world of AI, and pretty much everything else, the idea that being open, power efficient and scalable is critical moving forward, whether that’s networking or any other technology, that has been their mantra. And Kim, you’ve had a chance to get in and really dive in with both hands into this world of AI. What are you seeing? Tell us some of the cool stuff that you’ve been working with.
Kim Leyenaar: So as you probably know, and a lot of people know, Broadcom is really integral to a lot of these servers and the data movement along the data paths. So we make network switches, we make the NICs for the network. We have brand new ones that are coming out. We make storage controllers, we make RAID controllers, and of course we make PCIe switches. So part of what I’ve been doing is trying to really understand, how are we filling these pipes? We’re getting ready to go from PCIe Gen 5 to PCIe Gen 6. Do we need it? Do we need Gen 7 even quicker? And so it’s been really interesting trying to analyze exactly how much of these components within these AI servers are being utilized, how much memory.
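For a sense of the pipes in question: each PCIe generation doubles the per-lane transfer rate. A quick back-of-the-envelope sketch, using the headline rates from the published PCIe specs (encoding and protocol overhead are ignored here, so delivered throughput is somewhat lower in practice):

```python
# Raw one-direction bandwidth of a x16 link by PCIe generation.
# Transfer rates are the published headline numbers; real-world
# throughput is lower once encoding and protocol overhead are counted.
TRANSFER_RATE_GT_S = {"Gen4": 16, "Gen5": 32, "Gen6": 64, "Gen7": 128}
LANES = 16

for gen, gt in TRANSFER_RATE_GT_S.items():
    gb_per_s = gt * LANES / 8   # roughly 1 byte per 8 transfers per lane
    print(f"PCIe {gen} x16: ~{gb_per_s:.0f} GB/s per direction")
```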
David Nicholson: So you’re seeing the data move around.
Kim Leyenaar: Oh, yeah.
David Nicholson: Are you seeing bottlenecks now or are you foreseeing where the next bottleneck might be?
Kim Leyenaar: We are. We absolutely are. And we’re trying to make sure that we improve on that. And some of the bottlenecks could just be things like CPU utilization and utilizing the NUMA architecture properly for different AI projects and such. Of course, some of it could be the networking, the collectives overhead. I did some analysis of DLRM training, and it’s the sheer amount of collectives, not even the bandwidth. It’s just the sheer number. And of course, in the storage world, which I come from, we call those IOPS, but they’re really not. So trying to change my thinking about utilization and about latency and such has been really, really eye-opening.
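To make the NUMA point concrete, here is a minimal Linux-only sketch, not Broadcom’s tooling, of pinning a process (say, a data loader) to the cores of one NUMA node so its memory traffic stays local; the node number is illustrative:

```python
# Pin the current process to the CPU cores of one NUMA node so that
# memory allocations and data loading stay local to that node (and to
# whatever GPU hangs off it). Linux-only; node 0 is illustrative.
import os

def cores_for_numa_node(node: int) -> set[int]:
    """Parse the core list for a NUMA node from sysfs, e.g. "0-15,64-79"."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cores: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores

if __name__ == "__main__":
    node = 0
    os.sched_setaffinity(0, cores_for_numa_node(node))  # 0 = this process
    print(f"Pinned to NUMA node {node}: cores {sorted(os.sched_getaffinity(0))}")
```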
David Nicholson: You’re talking about CPU utilization. Of course, when we think AI, often we’re thinking accelerators that are GPUs, but there are still CPUs working in concert with those GPUs.
Kim Leyenaar: Absolutely. The CPUs are still quite vital. And I do monitor the GPU utilization as well. And luckily, for the most part, they’re highly utilized. When I’m running most of… And I hate to call them benchmarks, I’m really just kind of exercising some of the different AI models. There’s quite a bit. But we do things like, we’ve been scaling up the number of GPUs that we’re utilizing, mostly for inferencing. So we say, what does one buy you? What does two, what does four, going to eight? And then of course, putting them inside of the cluster. And so we’re getting ready to do even more cluster analysis because that’s where it’s going to get really interesting, especially from a networking standpoint, trying to understand all the data paths and making sure that we are promoting Broadcom as the next generation for scalable architectures from a networking standpoint. And we are headed that way.
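That “what does one buy you, what does two, what does four” question boils down to scaling efficiency. A tiny sketch with placeholder throughput numbers (not lab results) shows the calculation:

```python
# Compute speedup and scaling efficiency as GPU count grows.
# Throughput figures below are placeholders, not measured data.
throughput = {1: 100.0, 2: 190.0, 4: 352.0, 8: 610.0}  # tokens/s, illustrative

base = throughput[1]
for gpus in sorted(throughput):
    speedup = throughput[gpus] / base
    efficiency = speedup / gpus
    print(f"{gpus} GPU(s): {throughput[gpus]:7.1f} tok/s, "
          f"speedup {speedup:4.2f}x, efficiency {efficiency:6.1%}")
```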
David Nicholson: Yeah, I mean, there is no AI without networking things together. Essentially, very few things are going to run on a single GPU. Often they’re going to be clusters. If someone were thinking about building out infrastructure for AI right now, what have you learned that you could pass on? Any cautionary tales or surprises, things that maybe were surprising to you in terms of performance characteristics?
Kim Leyenaar: Well, in our lab, we are actually getting ready to add 12 more GPU servers, the Dell PowerEdge XE9680 servers. And one of the things that we were shocked at was the power requirements, and that’s why power is such a big deal right now. I had somebody earlier in the show tell me that next year’s show… Of course, at this year’s show, there’s been a big focus on cooling, especially liquid cooling. If you look around, it’s everywhere. And he said next year they’re going to be talking about nuclear energy. That’s going to be everywhere.
David Nicholson: I think so, yeah, it’s crazy.
Kim Leyenaar: So the power requirements, and of course, we’re putting liquid cooling into our labs so that we can make sure that we have these servers optimized and they’re running in an optimal environment as we analyze them. So that shocked me.
David Nicholson: Do you automatically assume that the people who can string CAT5 cable can also plumb liquid-filled things? I worry about that. I mean, do we need a direct liquid cooling certification for lab people?
Kim Leyenaar: Probably. Our lab engineer, lab manager, Lewis, is fantastic and he’s actually been studying all this, so he’s ready for it. He is so ready for it. And we’re excited. We’re getting ready to have that installed just over the next couple of days.
David Nicholson: But you see liquid as the way forward. At some point, liquid’s a prerequisite. It’s not negotiable.
Kim Leyenaar: And that kind of surprised me. I guess in my little world that I’ve been in for so many years, you never needed liquid to cool anything. So yeah, but going forward, we’re definitely going to be… So that kind of surprised me. It really was shocking. The other thing that surprised me, and I’m in storage, but I was blown away by how much capacity I need in order to do these tests, the storage that was required. So I started out. I’m like, “La, la, la, la, la.”
David Nicholson: Why?
Kim Leyenaar: Oh, just the data. The data that you need to do the training is incredible. It really is. So we’ve been sucking all this data in, and initially I’m like, “Oh, just give me two big fat NVMe drives in there.” And then I’m like, “Give me another big fat NVMe.” And then I’m like, “Wait a minute. We’re a storage company. Give me RAID,” because there was no way I was going to be happy if I lost this data, because it took me so long, not only to download it, but to get it together, to clean that data, to organize that data. I’m like, “I’m not doing this anymore.” So I realized we really need protected data as well. So our RAID controllers have come in quite handy, and I even did an analysis. I’m like, “Well, what was the difference?” One NVMe drive versus eight of them behind a RAID controller, and it was great. One of my data preps was taking me five days, five full days to do all the data prep, and when I put in a RAID controller, that went down to two days. So I’m like, “Yes.”
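That five-days-to-two-days improvement is the kind of thing a simple sequential-read timing can sanity-check. A rough sketch, with placeholder paths standing in for a file on a single NVMe drive versus one on a RAID volume (a real benchmark such as fio, with cache control, is the proper tool):

```python
# Time large sequential reads from a file on each volume and report MB/s.
# Paths are placeholders; page-cache effects are ignored in this sketch.
import time

def read_mb_per_s(path: str, chunk: int = 1 << 20) -> float:
    """Sequentially read `path` in 1 MiB chunks; return throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    elapsed = time.perf_counter() - start
    return (total / (1 << 20)) / elapsed

# Placeholder mount points: one plain NVMe drive, one RAID volume.
for path in ("/mnt/nvme_single/dataset.bin", "/mnt/raid_volume/dataset.bin"):
    print(f"{path}: {read_mb_per_s(path):.0f} MB/s")
```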
David Nicholson: Interesting. So some of the stuff we might think about as old school technology actually is relevant.
Kim Leyenaar: Yes.
David Nicholson: So, I know we’ve had conversations about where certain storage modalities fit in the world of AI, and there have been many questions about that. I want to ask you a question that gets me blank stares from AI people, like I’m from Mars. I will ask them what I think is a very simple question: “When you say you’re going to run a model on my laptop, how much space does that model take?” And they look at me, like, “Don’t you understand? It’s running in memory.” And it’s like, “Yeah, I get that.” So let me ask you, when you say you’re going to run this model on a laptop, can you think of a model as a file at a certain level, or no?
Kim Leyenaar: Oh, yeah.
David Nicholson: So how big are these things?
Kim Leyenaar: Well, it really-
David Nicholson: It depends. Come on.
Kim Leyenaar: You know I’m going to say it depends. And it really does depend. If you’re going to be running it on a laptop, not my laptop, my poor laptop’s old, I don’t think it can handle anything right now. But they certainly can be run on laptops. But yeah, we’re talking about upwards of a terabyte for a reasonable size model.
David Nicholson: Okay, but manageable. In today’s day and age, that’s a thumb drive, right?
Kim Leyenaar: Absolutely. It really is. It really is. And you’re moving all this data around.
David Nicholson: But that model is loaded into memory and it’s living in memory when it’s running.
Kim Leyenaar: Yes it is.
David Nicholson: Okay. So then, hold on. So that means that a model on an AI PC needs to be… I have 64 gigs of memory, I think, in my laptop, it’s pretty good. The entire model is running in that memory, in that laptop?
Kim Leyenaar: That’s the size of the unloaded model. So it really-
David Nicholson: Okay. The unloaded model. So not all of the model is necessarily-
Kim Leyenaar: That’s correct, yes.
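The arithmetic behind “it depends” is straightforward: weights on disk are roughly parameter count times bytes per parameter, so numeric precision matters as much as parameter count. A quick sketch with illustrative round numbers (no specific product implied):

```python
# On-disk model size ~= parameter count * bytes per parameter.
# Parameter counts and precisions below are illustrative.
GIB = 1024 ** 3

models = {"7B": 7e9, "70B": 70e9, "405B": 405e9}        # parameters
precisions = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}    # bytes per parameter

for name, params in models.items():
    sizes = ", ".join(f"{prec}: ~{params * nbytes / GIB:,.0f} GiB"
                      for prec, nbytes in precisions.items())
    print(f"{name:>4} params -> {sizes}")
```

At fp16, a 405-billion-parameter model is already around three-quarters of a terabyte before any checkpoint or optimizer state, which squares with the “upwards of a terabyte” figure above.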
David Nicholson: Okay. Okay. But when you’re talking about training, that’s a whole different story because that’s not-
Kim Leyenaar: Yeah.
David Nicholson: Okay. Now, is that training what we would call fine-tuning, or is that from the ground up?
Kim Leyenaar: I’ve been doing that.
David Nicholson: You have been.
Kim Leyenaar: Yeah. So I’ve been doing some Llama fine-tuning, but also full training on some of the more popular models. So it’s been really, really interesting because we wanted to understand, what are the data paths? What are the data flows? What are the IO sizes? What does it look like? Coming from a storage background, I am very invested in that. And so we’ve been able to ascertain how much data is passed through during some of these training runs. So how much is going through the PCIe? How much is going through the NVLinks, of course, if we’re using NVIDIA? How much is going through the network? How much is going through the NUMA nodes even, because that can really seriously affect the performance of it. So it’s been very interesting. Of course, looking at the memory and the CPU utilization during that, which tends to be low, but there are definitely cases where even the memory and the CPU are critical to making sure that you are able to do the training as quickly as possible.
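For readers who want to watch some of these utilization numbers on their own hardware, a minimal sampling loop over NVIDIA’s NVML bindings (the nvidia-ml-py package; this is a generic sketch, not Broadcom’s instrumentation) looks something like this:

```python
# Sample GPU compute utilization, memory use, and PCIe throughput
# for every visible NVIDIA device. Requires: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    for _ in range(5):                                   # take a few samples
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"GPU{i}: sm {util.gpu:3d}%  mem {mem.used / 2**30:6.1f} GiB  "
                  f"PCIe tx/rx {tx // 1024:5d}/{rx // 1024:5d} MB/s")  # NVML reports KB/s
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```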
David Nicholson: A final question for you. One of the interesting things about this conference is that we have this convergence of, what we think of as traditional IT vendors, government agencies, we have NASA here, Department of Energy and academia. So you see these bright-eyed, bushy-tailed students running around, all excited about getting into their careers. Many of course, are going to be focused on AI moving forward, but there’s a whole bunch of us who have been in this world of IT for decades. It could be intimidating for some folks to think in terms of, “Well, I’m not sure if I’m ready for AI.” You made a pretty big jump over the last couple of years moving from being one of those people who has forgotten more than 99% of people know about storage and associated things, to this new thing. What would your thoughts and advice be to people like us who have been in the IT business for a long time? Should we be afraid of AI? Should we be racing to retire? Or is it a lot of fun and interesting to get into? I mean, what do you think?
Kim Leyenaar: It’s a lot of fun, and it’s very interesting, and it’s very frustrating sometimes too. Whenever you’re on the bleeding edge like this, things change so quickly. What you wrote down yesterday is probably not correct today. So things change very quickly. So you just have to be willing to go with the flow. You have to be willing and able to debug. There’s a lot of debugging that’s been required to get things running, and you’re like, “What’s this error? I’ve never…” You try and Google it. Nobody’s ever heard of it. And so being willing to stick with it, being willing to go through the code. And in fact, it was a great question because on Monday, I was in one of the classrooms here, and there was a young kid sitting next to me from UCF, which is my alma mater, and he’s getting ready to graduate in three semesters. And he just had a zillion questions for me: “What’s important? What should I look for in a company? Should I go into AI?” And so we had this conversation, and I love seeing these eager, aspiring young engineers that are coming into this market. I welcome it all.
David Nicholson: Yeah, very exciting. Great commentary, Kim. Well, Kim, it’s been great spending time with you here at SC24. From the rest of us here at Six Five Media, the Six Five On The Road gang, I’m Dave Nicholson. Stay tuned for more content.
Author Information
David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School’s Aresty Institute of Executive Education.
David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.
Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.