On this episode of the Futurum Tech Webcast, host David Nicholson welcomes Delmar Hernandez, Senior Principal Engineer at Dell Technologies, and Steen Graham, Founder at Scalers AI, for a conversation on the democratization of AI, focusing on the scalability and versatility of heterogeneous AI inferencing.
Their discussion covers:
- The current trends and challenges in AI democratization and how companies are navigating these waters
- How Dell Technologies and Scalers AI are advancing the field of AI inferencing with innovative solutions
- The importance of scalable and heterogeneous AI systems in unlocking new opportunities and applications
- Best practices for implementing AI technologies in various sectors
- Predictions for the future of AI development and its impact on industries
Learn more at Dell Technologies and Scalers AI. Download our related report, Dell POC for Scalable and Heterogeneous Gen-AI Platform, here.
Watch the video below, and be sure to subscribe to our YouTube channel, so you never miss an episode.
Disclaimer: The Futurum Tech Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
Dave Nicholson: Welcome to the Dell Experience Lounge, in lovely Round Rock, Texas. I’m Dave Nicholson, Chief Research Officer at The Futurum Group. And I am joined by Delmar Hernandez, Dell Technical Marketing Engineer, and Steen Graham, CEO of Scalers AI. Let’s talk about distributed inference. So, Delmar, remind us what the difference is between training and inference to start.
Delmar Hernandez: So, training is when you’re teaching a model, a new trick, right?
Dave Nicholson: Yeah.
Delmar Hernandez: Fine tuning the model. And then inferencing is when you put it to work. So like Steen said before, you’re building and then deploying.
Dave Nicholson: Okay.
Delmar Hernandez: So inferencing would be the deployment of that model.
Dave Nicholson: So, what did we deploy here and how was it distributed?
Steen Graham: Yeah, so we wanted to showcase the leading model in the industry at the time we were developing this, with open licensing terms that many businesses can use. So we used the Llama 2 70-billion-parameter class of models. And we’re deploying that across a diverse set of infrastructure. We wanted to show enterprises how they can leverage a multi-node cluster powered by Ethernet and serve models that meet their requirements. So once they’ve built or trained a model, they can deploy it on affordable, off-the-shelf infrastructure using a clustering methodology that some of the leading companies in the world would use.
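For readers who want a concrete picture of what serving a 70B-class model looks like, here is a minimal sketch using the open-source vLLM library with tensor parallelism. This is an illustrative example only, not the Dell/Scalers AI reference code (that lives in the Dell GitHub repo mentioned later); the model ID, parallelism degree, and prompt are assumptions.

```python
# Illustrative sketch only: serving a Llama 2 70B class model with vLLM.
# Assumptions: vLLM is installed, the Hugging Face model ID below is
# accessible, and the node exposes enough GPUs for the chosen
# tensor_parallel_size. This is NOT the Dell/Scalers AI reference code.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed model ID
    tensor_parallel_size=8,                  # shard weights across 8 GPUs
    dtype="float16",
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Summarize the benefits of distributed inference on Ethernet clusters."]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

In a multi-node Ethernet cluster like the one described here, the serving layer would shard or replicate the model across machines rather than a single box; vLLM, for example, can coordinate workers across nodes via Ray.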
Dave Nicholson: So Steen mentioned a diverse set of infrastructure. When we say diverse, diverse how?
Delmar Hernandez: Different generations of PowerEdge servers. So 16G, 15G. Do you want me to get into specific model numbers?
Dave Nicholson: Sure, yeah.
Delmar Hernandez: Well, let me see if I can remember them all. So XE9680, that’s our current gen AI server. That’s the big guy. 8 GPUs. XE8545 is like the little brother of that server, last gen.
Dave Nicholson: GPUs also in that?
Delmar Hernandez: GPUs, yes. It has four GPUs.
Dave Nicholson: Okay.
Delmar Hernandez: 9680 has 8. So I’m going to go down the stack, right? Then we have the R760XA, which supports four GPUs, but those are PCIe.
Dave Nicholson: Okay.
Delmar Hernandez: So they’re a little less performant than the big GPUs in the XE9680. And then for fun, we added a PowerEdge R7625, which is an AMD-based server, so AMD CPUs, AMD GPUs.
Dave Nicholson: Okay.
Delmar Hernandez: So you’ve got a mixture of AMD, Intel and NVIDIA compute devices.
Dave Nicholson: Cats and dogs living together.
Delmar Hernandez: Yes.
Dave Nicholson: That’s awesome. So when you distribute inferencing and when you’re scaling inferencing, are we talking about scaling the number of concurrent users, the number of people who are asking things of the model at the same time? Or is it the size of the queries? What’s being scaled when you distribute something like this?
Steen Graham: Yeah, so absolutely, with the large language models we tested and deployed here, you’d be supporting concurrent users. If you’re looking at another enterprise scenario where they’ve integrated the APIs into their workflow, you could support many simultaneous calls to the APIs as well. It doesn’t necessarily need to be a human on the other end of it. It could be an auto-generated report within the company that documents their compliance with a policy.
So there’s scalability beyond just the human-level interaction, but concurrent users are kind of the north star for what we’re supporting on those distributed inferencing clusters. Ultimately, I want to show that you don’t always need the latest, highest-end GPU-based implementation. You can take advantage of your existing infrastructure, or whatever infrastructure is actually available to buy in the market today, and get highly performant inferencing capability to meet your requirements.
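As a rough illustration of the concurrent users and API callers point, the sketch below fires a batch of simultaneous requests at an OpenAI-compatible chat completions endpoint, which many open-source serving stacks expose. The endpoint URL, model name, and prompts are placeholders, not values from the POC.

```python
# Hypothetical load sketch: many concurrent callers hitting one
# OpenAI-compatible endpoint served by the inference cluster.
# The URL and model name below are placeholders, not values from the POC.
import concurrent.futures
import requests

ENDPOINT = "http://inference-cluster.example.com:8000/v1/chat/completions"
MODEL = "llama-2-70b-chat"

def ask(question: str) -> str:
    resp = requests.post(
        ENDPOINT,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Simulate 32 simultaneous callers, human or automated.
questions = [f"Summarize compliance policy section {i}." for i in range(32)]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for answer in pool.map(ask, questions):
        print(answer[:80])
```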
Dave Nicholson: Is this just for show or is this something that people can actually use as a reference?
Delmar Hernandez: So the infrastructure that we used is infrastructure that a lot of our customers already have. So Broadcom Ethernet, PowerEdge servers, switches… This is not new science-experiment-type stuff, right? This is stuff that a lot of data centers have today. So we just wanted to showcase how you can bring all that together with the magic of Scalers AI’s software expertise. They kind of took the complexity out of making all of these GPUs and CPUs work together through software.
Steen Graham: And you can find the reference implementation documentation and solution code available in the Dell GitHub repo. So it’s there. So for those companies that have that existing infrastructure on PowerEdge, this generation or the last, they can get up and running.
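Scalers AI’s actual implementation is in the Dell GitHub repo just mentioned. For a flavor of what pooling mixed hardware through software can look like, here is a hypothetical sketch using Ray custom resources; the resource tags, worker functions, and placeholder outputs are all invented for illustration.

```python
# Hypothetical sketch of pooling heterogeneous nodes with a scheduler (Ray).
# Resource tags such as "amd_gpu" are assumptions: they would be set when a
# node joins the cluster (e.g. `ray start --resources='{"amd_gpu": 4}'`).
# This is not the Scalers AI implementation, just one common pattern.
import ray

ray.init(address="auto")  # connect to an existing multi-node cluster

@ray.remote(num_gpus=1)                 # schedule on any node with a GPU
def generate_on_gpu(prompt: str) -> str:
    return f"[gpu worker] placeholder response to: {prompt}"

@ray.remote(resources={"amd_gpu": 1})   # pin to nodes tagged as AMD GPU hosts
def generate_on_amd(prompt: str) -> str:
    return f"[amd worker] placeholder response to: {prompt}"

@ray.remote(num_cpus=8)                 # CPU-only workers as a fallback tier
def generate_on_cpu(prompt: str) -> str:
    return f"[cpu worker] placeholder response to: {prompt}"

futures = [
    generate_on_gpu.remote("hello"),
    generate_on_amd.remote("hello"),
    generate_on_cpu.remote("hello"),
]
print(ray.get(futures))
```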
Dave Nicholson: So is this what the democratization of AI looks like?
Delmar Hernandez: Exactly. And going forward, we’re going to be doing the same thing.
Dave Nicholson: So if you had to summarize the value proposition of what you were able to highlight here with this reference implementation, what’s the main take away?
Steen Graham: The main takeaway is that you don’t need to compromise or wait for the leading edge hardware. You can take advantage of your existing infrastructure and deploy production grade models across a diverse portfolio of semiconductor offerings.
Dave Nicholson: Delmar, any surprises from any actual testing results? Were you surprised by the performance of various systems?
Delmar Hernandez: I was surprised at how few hurdles we had to jump over to make this work. I was expecting us to go in and spend a year on it. I think we spent maybe two months, three months from start to finish, and it just worked. Most of the things that we put together just worked, right? The libraries worked out of the box, the networking infrastructure worked. I think the folks out there might be wondering, well, was networking a bottleneck, using Ethernet? Not InfiniBand, surprise. It wasn’t a bottleneck. It worked. It was sufficient. It was more than sufficient. So those were some of the things that we were wondering going in. Is networking going to be a bottleneck? Can we combine all of these heterogeneous compute devices and make them play nice together? And then what does it look like when it’s deployed? Is it performant? Is it getting the, what is it, a hundred… How many tokens per second, 130?
Steen Graham: 10.
Delmar Hernandez: 10 tokens per second, right?
Steen Graham: Yeah.
Delmar Hernandez: Even CPUs are getting that, if I’m not mistaken.
Steen Graham: Yeah.
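For context on the 10-tokens-per-second figure, per-user throughput is typically measured by timing a generation and dividing the number of generated tokens by the elapsed time. The sketch below shows that calculation against an OpenAI-compatible endpoint whose responses report token usage; the URL and model name are placeholders, not values from the POC.

```python
# Hypothetical throughput check: tokens generated per second for one request.
# Assumes an OpenAI-compatible endpoint that reports usage.completion_tokens;
# the URL and model name are placeholders, not values from the POC.
import time
import requests

ENDPOINT = "http://inference-cluster.example.com:8000/v1/chat/completions"

start = time.perf_counter()
resp = requests.post(
    ENDPOINT,
    json={
        "model": "llama-2-70b-chat",
        "messages": [{"role": "user", "content": "Explain Ethernet-based inference clusters."}],
        "max_tokens": 256,
    },
    timeout=300,
)
resp.raise_for_status()
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/second")
```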
Dave Nicholson: And that’s based on the size of the cluster that you created. So this is something that could scale out and deliver more performance, right?
Steen Graham: Yeah, and I think for enterprises and infrastructure companies, there’s an opportunity for new forms of monetization. So if you think about it, you might have some really, really high-speed, top-end GPUs in a cluster. Your quality-of-service offering can be higher there, and then you can cascade it down into a freemium-tiered model as well. So this cluster plays really well to that: if you want to emulate what the leaders in this industry do, you offer a free, less performant tier all the way up to a paid offering. You can do a similar thing, whether that’s a large language model that you’re offering internally to your organization or API calls of your production-based services to your clients. So that kind of tiered topology is another way of thinking about exposing this across the distributed inferencing cluster.
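To make the tiering idea concrete, here is a hypothetical routing sketch: premium-tier requests go to the pool of high-end GPU nodes, and free-tier requests go to the older-GPU or CPU pool. The pool URLs, tier names, and model name are invented for illustration and are not part of the Dell/Scalers AI POC.

```python
# Hypothetical tier-based router: send paid traffic to the high-end GPU pool
# and free traffic to the older-GPU / CPU pool. Pool URLs and tier names are
# invented for illustration; they are not from the Dell/Scalers AI POC.
import requests

BACKENDS = {
    "premium": "http://high-end-gpu-pool.example.com:8000/v1/chat/completions",
    "free":    "http://cpu-pool.example.com:8000/v1/chat/completions",
}

def route_request(tier: str, question: str) -> str:
    url = BACKENDS.get(tier, BACKENDS["free"])  # unknown tiers fall back to free
    resp = requests.post(
        url,
        json={
            "model": "llama-2-70b-chat",
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(route_request("premium", "Draft the quarterly compliance summary."))
print(route_request("free", "What is distributed inference?"))
```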
Dave Nicholson: So you mentioned the amount of time it took to actually put this together. Now that you have this reference implementation and the documentation, how much quicker would it be if someone were to emulate or replicate what you did? Or am I looking at it the wrong way? Is it something that you would be able to do the second time more quickly?
Delmar Hernandez: So I’m not an AI software developer, but I will tell you that I took the GitHub repo that Scalers AI put together and I cloned it on another infrastructure that we set up, and it took me maybe two days.
Steen Graham: I think the other thing, to frame this from our engineering team’s perspective, is first of all just letting people know that this wasn’t a failed experiment, that it was possible. Because I think most people are just defaulting to the latest, greatest infrastructure, as they should, because it’s highly performant. But there are trade-offs in that on affordability and availability. And so realize that we invested thousands of hours of engineering resources to make this solution code available in public so that Delmar could do it in a few days, in a few hours. So that’s the chasm that we’ve crossed here for democratizing AI: affordable, off-the-shelf AI with leading off-the-shelf infrastructure. And of course, Ethernet-based technology as well.
Dave Nicholson: Fantastic. Well, we’ve covered kind of an overview of what you put together from a distributed training and distributed inference perspective. With that, I’d like to thank you again for joining us here at The Dell Experience Lounge in Round Rock, Texas. I’m Dave Nicholson with The Futurum Group.
Author Information
David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.
David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.
Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.