AI Chip Trends, RAG vs. Fine-Tuning, AI2 – The AI Moment, Episode 5


On this episode of The AI Moment, we explore what’s going on with AI chips, the pros and cons of RAG and fine-tuning for making LLMs better, and how the Allen Institute for AI has released a tool that makes it an Adult in the AI Rumpus Room.

The discussion covers:

  • Key Trends in Generative AI:
    • New chips such as NVIDIA’s GH200 Grace Hopper, Microsoft’s Azure Maia AI Accelerator, and Intel’s Gaudi 2 are showing the potential for significant gains in performance and efficiency, which means lower AI compute costs.
    • Practices for making AI models better: RAG is the new rage, but fine-tuning gets a boost from SLoRA. Also, Vectara’s hallucination measuring tool and Hugging Face’s LLMs-as-a-judge
  • Adults in the AI Rumpus Room: The Allen Institute for AI (AI2) recently published research that lays out the challenges of having bad data sets feed AI models. They have developed a framework to address it called What’s In My Big Data? (WIMBD)

Watch the video below, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Listen to the audio here:

Or grab the audio on your favorite podcast platform below:

Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this webcast.

Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.


Mark Beccue: Hello, I am Mark Beccue, Research Director for AI with The Futurum Group. Welcome to The AI Moment, our weekly podcast that explores the latest developments in enterprise AI. The pace of change and innovation in AI is dizzying and unprecedented. I’ve been covering AI since 2016, and I’ve never seen anything like what we’ve experienced since ChatGPT launched about this time last year and kickstarted the generative AI era. With The AI Moment podcast, we try to distill the mountain of information, separate the real from the hype, and provide you with shorthand analysis of where the AI market will go. In each episode, we deep dive into the latest trends and technologies that are shaping the AI landscape: discussions of the latest advancements in AI technology, parsing the mutating vendor landscape, including the big announcements, and talk of AI regulations, ethics, risk management, and more. We cover a lot of ground, though not every topic every week. Each 30-minute episode is typically made up of three or four segments: the guest spotlight, key trends in generative AI, adults in the generative AI rumpus room, and companies we like doing AI. Today, we’re covering two segments. First, key trends in generative AI, where we’re going to talk about chips and about the trend toward making models better. Second, adults in the AI rumpus room, where we have one candidate this week: the Allen Institute for AI.

So, given all that, let’s go ahead and get started. Key trends in generative AI. I’d like to just remind you that we talk about these as trends, not fads. Things are moving so quickly that we try to parse this a little bit and talk about things in terms of a trend rather than a fad. So, it’s not just happening this moment; we see a trend with it. In this case, the first one I want to talk about is chips and how chips are progressing. One of the big challenges with AI right now is that generative AI takes a lot of compute power. The way things are architected today, these are the biggest compute loads the world has ever known. That’s probably not sustainable, and there’s been a lot of pressure. It’s not just AI model training, it’s AI inference too, but model training gets a lot of publicity because it’s so heavy. That’s been a challenge. Chip manufacturing takes a while, and there were really no chips purpose-made for AI until recently. There are a few coming out now, but GPUs have been the workhorse for AI workloads; they weren’t invented or made specifically for running AI, but they tend to do it pretty well. So, there’s a little bit in the news about this that’s kind of interesting. This week we’ve had major announcements from NVIDIA. They’ve introduced their latest GPU, the GH200 Grace Hopper. The efficiencies on that compute chip they claim are significantly better than their foundational piece right now, the H100. So, there are much better compute efficiencies there, which is kind of interesting, and those are hitting the market now for data centers.

Today, we also had a big announcement from Microsoft: they are building their own chips. This one’s called the Azure Maia AI Accelerator. The news there is interesting in that here’s one of the leaders in AI making a purpose-built chip, having never made chips before, and now they are kind of completing the whole stack of what they do. It’s interesting because you could say that Microsoft brings a lot of AI thinking to bear when building a chip and thinking about what it needs to do. We’ll probably see better performance and efficiency because of Microsoft’s experience in running AI. They actually talked about it a little bit today, noting that they have tested those chips and worked closely with their partner OpenAI on how they run LLMs, so that’s kind of interesting. Finally, Intel has been releasing more performance statistics around their AI chips, particularly the Gaudi 2, and continues to show a lot of progress in how efficiently those chips run AI workloads. What I want to say in general is that you’re hearing a lot of different things going on. There are other players building purpose-built chips, some of them big, some of them small, everybody from AMD to Groq. What we’re seeing, and what I think we’ll keep seeing, is better performance and efficiency, which means the cost of these workloads is going to come down per run; it has to. When you think about how you would run AI and what the cost model is, compute is a huge chunk of running any kind of AI use case. Now, we’re seeing quick, rapid progress toward better efficiency and better performance, which means cheaper compute for AI, which means more useful use cases, more use cases that can be supported by business.
So, that’s one trend that I think is interesting.

The second one: I want to talk a little bit about an ongoing trend around using tools and ideas to make AI models better and more efficient. You might’ve heard recently about a couple of different approaches. One is called fine-tuning a model; we’re going to talk about that in a second. The other one that’s gaining a lot of momentum is called RAG, or retrieval augmented generation. We’re not going to get too far into the weeds here, but I’ll talk a little bit about these. Fine-tuning helps adapt general large language models to perform well on specific tasks, making them more task-specific. RAG, on the other hand, focuses on connecting an LLM to an external knowledge source through what are called retrieval mechanisms. What RAG does is allow an LLM to tap into additional data resources without the time and expense of constant model fine-tuning. It uses these external data sets during inference to deliver more timely insights and better quality answers, and proponents claim it practically eliminates hallucination.
All right. So, these are some things that are out there and going on. There’s a lot of momentum around companies using RAG instead of fine-tuning to make these models run better, be more accurate, and be more specific to what those companies want them to do.
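The retrieval idea described above can be sketched in a few lines. This is a toy illustration, not a production system: the corpus, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring are all hypothetical stand-ins; real RAG pipelines use vector embeddings and a vector database for retrieval.

```python
import re

# Hypothetical mini-corpus standing in for an external knowledge source.
corpus = [
    "The GH200 Grace Hopper is NVIDIA's latest data center superchip.",
    "Azure Maia is Microsoft's first purpose-built AI accelerator.",
    "Gaudi 2 is Intel's AI training and inference chip.",
]

def _words(text: str) -> set[str]:
    """Naive tokenizer: lowercase word set (real systems embed text instead)."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    q = _words(query)
    return sorted(docs, key=lambda d: len(q & _words(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Augment the user question with retrieved context before calling the LLM."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What chip did Microsoft build?", corpus)
```

The key point is the shape of the pipeline: retrieval happens at inference time, so the knowledge source can be updated without retraining the model.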

And I’m going to introduce another couple of things that are helping fine-tuning work better, and those are called LoRA and SLoRA. LoRA is an acronym for Low-Rank Adaptation. It dramatically reduces the cost of fine-tuning by identifying small subsets of the parameters to fine-tune. It doesn’t fine-tune the whole model; it just takes certain parameters and fine-tunes those. That’s all well and good, but there were some issues around GPU memory management, and the batching process around it is slow. So, there were some challenges with LoRA. Today, a paper was published by researchers from Stanford and UC Berkeley about a methodology they call SLoRA. SLoRA is designed to serve multiple LoRA models at once, and it has dynamic memory management that addresses the memory issue. What they’re seeing from early tests is significant gains in performance, and it dramatically reduces the cost of fine-tuning. The really interesting thing is that it means, in theory, an enterprise could run hundreds or even thousands of models on a single GPU. If this ends up being true, we’re looking at something that would really unlock value and reduce costs for AI use cases going forward. So, I think this is interesting. It’s a big deal. Making models better and addressing these issues around cost: a lot of trends there.
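The parameter savings behind LoRA can be shown with simple arithmetic. This is an illustrative sketch with made-up dimensions, not any real model’s numbers: instead of updating a full d x d weight matrix, LoRA freezes the pretrained weights W and trains two small factors B (d x r) and A (r x d) with rank r much smaller than d, so the effective weight becomes W + B @ A.

```python
# Illustrative dimensions (hypothetical, chosen only to make the ratio clear).
d, r = 1024, 8

# Full fine-tuning updates every entry of the d x d weight matrix.
full_finetune_params = d * d

# LoRA trains only the two low-rank factors B (d x r) and A (r x d).
lora_params = d * r + r * d

# Trainable footprint shrinks from ~1.05M parameters to ~16K here.
savings = lora_params / full_finetune_params  # about 1.6% of full fine-tuning
```

This is also why serving many LoRA adapters on one GPU (the S-LoRA idea of keeping one frozen base model and swapping small adapters in and out of memory) is plausible: each adapter is tiny relative to the base model.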

I’ll give you a couple more that are interesting. Hugging Face has been at the forefront of helping advance AI development, and one of their lead folks worked on a thing they call LLMs-as-a-judge. It’s kind of a framework for evaluating LLMs for helpfulness, relevance, and harmfulness, and there’s a tool on Hugging Face that allows you to do that. These are challenges we have with these models that have to be addressed. A second one came out this week that caught a little bit of controversy. It’s from Vectara, also on Hugging Face, and it’s a hallucination measurement tool. The early pushback is that it only measures summaries; it doesn’t measure the full output of a model, and it doesn’t necessarily measure how the model could hallucinate across everything it puts out. It just does summaries. It’s another example of how there’s going to be a cottage industry of tools, open source tools, things you buy, all sorts of methodologies to make these models work better. That’s the trend. All right. So, that’s a little bit about key trends.

This week we’re going to move on to the rumpus room. Adults in the AI rumpus room, just as a reminder since I didn’t talk about it upfront, is where we highlight companies and organizations that have looked at generative AI, which has been this wild, crazy thing, and done it thoughtfully. A lot of people have been a little less careful and thoughtful about how they do generative AI, and there are lots of pros, cons, and challenges to that. So, we like to applaud and highlight when companies or organizations do AI the right way, the responsible way. This week it’s the Allen Institute for AI. They came up with a methodology they’re calling What’s in My Big Data?, or WIMBD, in a paper published this week. It’s interesting that they are addressing an issue that’s fundamental to generative AI: bad data sets. The largest language models are typically trained on large public data sets, and we’ve talked about this since generative AI has been out there. The challenge is that there’s a lot of data there; some of it’s good, some of it’s bad, some of it’s biased, some of it’s just wrong. So, AI2 went out and did some research on this. Across a representative sample of data sets, the paper found that somewhere between 50% and 60% of the documents were duplicates. Duplicates mean your model accuracy goes down; the more duplicates there are, the less accurate the model is. So, that was a big deal. They said it was over 60% in one of the data sets, called The Pile, and in RedPajama and laion2B there were about 50% duplicate documents.
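The count-at-scale idea behind a duplicate analysis like this can be sketched simply. This is a toy version with a hypothetical five-document corpus; WIMBD-style tooling does the same kind of counting over billions of documents with distributed infrastructure, not a Python loop.

```python
import hashlib
from collections import Counter

# Hypothetical mini-corpus; two of the five documents are exact duplicates.
docs = ["the cat sat", "the dog ran", "the cat sat", "hello world", "the cat sat"]

def duplicate_fraction(documents: list[str]) -> float:
    """Fraction of documents that are exact duplicates of an earlier one.

    Hashing each document lets us count duplicates without holding full
    texts in memory, which is what makes this approach scale.
    """
    counts = Counter(hashlib.sha256(d.encode()).hexdigest() for d in documents)
    dupes = sum(c - 1 for c in counts.values() if c > 1)
    return dupes / len(documents)

frac = duplicate_fraction(docs)  # "the cat sat" appears 3 times -> 2/5 = 0.4
```

Exact-match hashing is the simplest case; near-duplicate detection (for lightly edited copies) needs fuzzier techniques like MinHash, but the counting principle is the same.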

So, that was an interesting finding. Another interesting finding: they ran what they call automatic toxicity detection on these data sets, and using an automatic classifier they found that anywhere between 1% and 16% of the documents in the corpora contained toxic language. So, what What’s in My Big Data? does is provide a platform and a set of these analyses that allows you to compare the contents of large text corpora. It works on a couple of different ideas; one is how to count and search at scale, which allows you to analyze these large amounts of data on standard compute nodes, which is a great thing. So, there we go: the Allen Institute for AI coming out with something that helps us get these models to act better is an adult in the AI rumpus room. All right. That’s what we have this week, a couple of short ideas. I want to thank you for joining me here at The AI Moment. Be sure to subscribe, rate, and review the podcast on your preferred platform, and we’ll see you next week. Thanks.

Other Insights from The Futurum Group:

A Key Trend, Enterprise-grade Generative AI SaaS Applications, and Adobe’s Blueprint for AI Success – The AI Moment, Episode 3

Two Trends in AI Regulations and a Look at Microsoft Copilot – The AI Moment, Episode 2

Key Trends in Generative AI – The AI Moment, Episode 1

Author Information

Mark comes to The Futurum Group from Omdia’s Artificial Intelligence practice, where his focus was on natural language and AI use cases.

Previously, Mark worked as a consultant and analyst providing custom and syndicated qualitative market analysis with an emphasis on mobile technology and identifying trends and opportunities for companies like Syniverse and ABI Research. He has been cited by international media outlets including CNBC, The Wall Street Journal, Bloomberg Businessweek, and CNET. Based in Tampa, Florida, Mark is a veteran market research analyst with 25 years of experience interpreting technology business and holds a Bachelor of Science from the University of Florida.

