AI Democratization: Scale Out Generative AI Platforms – Futurum Tech Webcast

On this episode of the Futurum Tech Webcast, host David Nicholson welcomes Delmar Hernandez, Senior Principal Engineer at Dell Technologies, and Steen Graham, Founder at Scalers AI, for a conversation on the democratization and scaling out of generative AI platforms.

Their discussion covers:

  • The current state of generative AI technology and its accessibility
  • Strategies for scaling out generative AI platforms to support widespread adoption
  • Challenges and solutions in making AI tools more available to non-experts
  • Insights into future directions for AI democratization and its impact on various industries
  • Collaboration between Dell Technologies and Scalers AI to promote AI accessibility

Learn more at Dell Technologies and Scalers AI. Download our related report, Dell and Broadcom Deliver Scale-Out AI Platform for Industry, here.

Disclaimer: The Futurum Tech Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.


Dave Nicholson: Welcome to the Dell Experience Lounge here in lovely Round Rock, Texas. I’m Dave Nicholson, Chief Research Officer at Futurum, and I’m joined by the esteemed Steen Graham, CEO of Scalers AI and Delmar…

Delmar Hernandez: Delmar Hernandez.

Dave Nicholson: And you’re in technical marketing here?

Delmar Hernandez: Yep. Technical marketing.

Dave Nicholson: And a technical marketing engineer. We’re going to talk about something that I’m fascinated by. When we talk about AI, you can divide that up into the ideas of training and inference. And we’re going to be talking about a distributed training reference implementation that you’ve put together. But I want to start out, Delmar: what’s the difference between training and inference? What are those things?

Delmar Hernandez: So training would be the process of, well, we do fine-tuning, right? So we’re teaching a model new tricks. So in our case, we took Llama 2 70B, and we taught it how to speak. We trained it on PubMed, which is a medical data set.

Dave Nicholson: When we say training, what are we doing here? What does the casual observer need to understand about that?

Delmar Hernandez: This is where Steen needs to correct me if I’m wrong, but training is like you have foundational models, right? Llama 2 would be a foundational model that’s developed. It takes many hours and days of training to release that. What we do is we take that foundational model and then we teach it a new trick, which means we don’t need to spend as much time training it. We fine-tune it, right?

Dave Nicholson: But training takes a lot of work. Training is really hard in terms of horsepower…

Delmar Hernandez: It’s time consuming.

Dave Nicholson: And hardware power. Time-consuming. And then, what is inference? Inference is a fancy word for what?

Delmar Hernandez: When you put the model to work, right? You’re basically throwing… so for LLMs specifically, you throw a prompt at it and then it responds. The process of it responding is making inferences on what you want it to respond with.

Dave Nicholson: So you’re asking it to do something. So people who have used ChatGPT know what inference is.

Delmar Hernandez: Exactly.

Dave Nicholson: When they say, tell me about this.

Dave Nicholson: What about distributed training? What’s being distributed here?

Steen Graham: Yeah, so what you want to do when you train a model, and particularly a bigger model, is you need a distributed cluster. And so we’ve all seen that the leading large language models today have all been trained on tens of thousands of GPUs. And so there’s a lot of insight and know-how in creating a cluster at that scale. And so what we wanted to do through this initiative is we wanted to offer developers the ability to set up multi-node training clusters and know how to do that, and how to do that effectively with the right optimal software stack.

So we can get enterprises fired up about deploying training workloads in their own infrastructure so they can innovate with their own proprietary data. The key thing about distributed training is ultimately distributing the model, sharding the model, and then communicating the weights back and forth in that continuous exercise of driving epochs through that data set. And in the case that Delmar alluded to here, we used the PubMed data set, which is a massive text-based medical data set, publicly available. And we wanted to take an off-the-shelf pre-trained model like Llama 2 70B, and actually fine-tune that model to be a medical expert based on that PubMed data set.

And when you want to drive a big model like the Llama 2 70B model, you want to drive it in a distributed way so you can get the training workload to happen faster, and then communicate via Ethernet, in this case, the weights and the learnings of that model to create a new model that would be essentially an industry-specific implementation of that. So that’s the flow of how we do the distributed training workloads.
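Editor’s note: the loop Steen describes (each node works on its own slice, then the nodes exchange what they learned over the network) can be illustrated with a toy example. The sketch below is not the reference implementation discussed here. It shows the simpler data-parallel variant rather than model sharding, uses a one-parameter model, and replaces the Ethernet exchange with an in-process average. The function names and the data are hypothetical.

```python
# Toy data-parallel training: each "node" holds a shard of the data,
# computes a local gradient, and the gradients are averaged (an all-reduce)
# before every weight update. Pure Python; no GPUs or network involved.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the network all-reduce: every node receives the mean."""
    return sum(values) / len(values)

def train(shards, epochs=200, lr=0.01):
    w = 0.0  # every node starts from the same weight
    for _ in range(epochs):
        # On real hardware these run in parallel, one per node.
        grads = [local_gradient(w, shard) for shard in shards]
        # Every node applies the same averaged update, so weights stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w

# Hypothetical data following y = 3x, split evenly across three "nodes".
data = [(x, 3.0 * x) for x in range(1, 7)]
shards = [data[0:2], data[2:4], data[4:6]]
w = train(shards)
print(round(w, 3))
```

Because the shards are the same size, the averaged gradient equals the gradient a single machine would compute on all the data, which is why splitting the work does not change the result, only how long each step takes.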

Dave Nicholson: But this distributed training that’s happening, we’re talking about something that is happening on-premises, correct, in the reference implementation that we’re talking about? So what did that look like specifically? What were the nodes that you distributed across?

Delmar Hernandez: So we actually have a lab right down the road, Round Rock 5. So we deployed an XE9680 server with Nvidia H100s.

Dave Nicholson: Okay.

Delmar Hernandez: That’s our current generation.

Dave Nicholson: So GPU in this case?

Delmar Hernandez: GPUs, yeah. And then we have a PowerEdge XE8545, which is our last-gen server. That had A100s, so Nvidia’s last-gen GPU. And then for good measure, we added in a PowerEdge R760xa, which is a box that allows you to put PCIe GPUs in it, right? The two I mentioned previously are SXMs: higher power, more performance than the PCIe cards. So we clustered those together with a Dell PowerSwitch over 100-gig Ethernet.

And there was a little bit of myth busting going on here, right? There’s this perception that you need InfiniBand to make AI clusters. So that was one of the reasons we reached out to Steen. Anytime we need to do myth busting, I call Steen. Like, “Hey, I’m being told everybody needs InfiniBand. We’d like to understand if Broadcom 100-gig Ethernet can accomplish the same task without being the bottleneck to these AI servers.”

Dave Nicholson: Did it work?

Steen Graham: Of course it worked. There are features in Broadcom Ethernet, notably, where you can bypass to that distributed inferencing framework and leverage the GPUDirect features, if you will, to address latency challenges. So we’re really showing them how to take a leading open source model and pre-train or fine-tune that model based on their custom dataset. And then we’re showing that you don’t need all the latest infrastructure.

So we were a little bit of a MacGyver across the Dell PowerEdge portfolio: whether they had the legacy, anchor-tenant high-end system, or the latest high-end system, or the R760xa that just does it all, they can just pair all that infrastructure. I think what’s notable as well is we also supported this infrastructure across AMD Instinct GPUs and Nvidia GPUs. So that framework scales across the leading GPUs in the market today.

Dave Nicholson: So you’re talking about a heterogeneous environment in terms of these server nodes. How realistic is that? Are people really going to cobble this stuff together, or are you trying to make the point that things are moving so quickly that people are going to have infrastructure that’s perfectly good, that they’re not going to want to kick out the door? Look at what we’re hearing about Blackwell coming next year. I mean, how realistic is this? Was this sort of “science experimenty,” or is this something legit? Do you think people are going to be running this kind of heterogeneous environment?

Delmar Hernandez: You took the words out of my mouth. If you have existing infrastructure and something new comes out, I’m not just going to toss that out, right? I’m going to figure out how to leverage what I have to build my capabilities. And that’s the reason why we tackled this, because we know that our customers are not simply trashing their existing infrastructure as they move forward, right? They’re increasing their capacity.

Steen Graham: Just as there’s massive innovation, and a massive bottleneck today in the workloads that we want to run that hardware can solve, software innovation is also happening. And so for this particular engagement, we also used new techniques in fine-tuning models: traditional fine-tuning techniques, but also LoRA-based techniques that allow us to fine-tune models more affordably.

So I think there’s going to be innovation on the software that enables you to use some of the existing infrastructure you already have. And of course, I think the semiconductor industry is continuously innovating on performance. And so it’s great to be able to future-proof your workloads where you can take your existing infrastructure and your new infrastructure and get them working together.
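Editor’s note: one way to see why LoRA-based fine-tuning is more affordable is to count trainable parameters. Full fine-tuning updates every entry of a weight matrix. LoRA freezes that matrix and trains two small low-rank matrices alongside it. The arithmetic below is a sketch only: the layer size and the rank are illustrative assumptions, not figures from the reference implementation discussed here.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter on the same matrix.

    The frozen matrix W is adjusted by B @ A, where B is d_out x rank
    and A is rank x d_in. Only A and B are trained.
    """
    return rank * (d_out + d_in)

# Assumed, illustrative sizes: one 8192 x 8192 projection and rank 16.
d = 8192
rank = 16
full = full_finetune_params(d, d)
lora = lora_params(d, d, rank)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"reduction: {full // lora}x")
```

Fewer trainable parameters means less optimizer state to hold in GPU memory and fewer gradients to exchange between nodes, which is part of what makes older or smaller hardware usable for fine-tuning.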

Dave Nicholson: I’ve got one final question for you on the subject of training. When we talk about proprietary data, if a model is being trained on everything, in theory, why do we need proprietary data to make it more valuable?

Steen Graham: But I think that the public domain obviously is different than the private domain. So enterprises might have their own proprietary workflows, trainings, IP, and usually that wouldn’t be in the public domain. And so being able to take a model that’s pre-trained on the entirety of human knowledge on the internet and pair that with your proprietary insights is really what could be ultimately transformative to your business.

Dave Nicholson: So Dell sees this, sees AI as not being just the purview of hyperscale clouds, but also customer data centers and all the way out to the edge.

Delmar Hernandez: Yeah, exactly. I mean, these are tools that we’re using inside of Dell too. We have a lot of proprietary data, our own IP, and we’re leveraging the power of LLMs to gain insights on all of that information. So instead of “alt f’ing” in a document or scrolling through thousands of PDFs, you go to a prompt, ask a question, and you get an answer. Right.

Dave Nicholson: Fantastic. We will have links to the specifics about all of the reference implementations that we talk about. Thanks again for joining us here at The Dell Customer Experience Lounge here in Round Rock, Texas.

Author Information

David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.

David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.

Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.

