The News: The CTO Advisor team worked with Dell Technologies to showcase low-cost large language model (LLM) updates using retrieval augmented generation (RAG) with a single GPU. Ryan Shrout from Signal65 Labs joined a discussion of the impact and importance of RAG in the business use of LLMs.
Dell Delivers Low-Cost LLM Updates by Retrieval Augmented Generation
Analyst Take: RAG is a way to update an LLM without retraining it. RAG takes a trained LLM and supplements it with task-specific content, such as up-to-date financial information. This supplemental information can be refreshed without changing the LLM, potentially in near-real time, so that your application can deliver up-to-date AI insights.
Foundation LLMs are created by training on vast amounts of data, typically a complete crawl of the public Internet such as the Common Crawl. Training on this much data requires enormous resources and time, leading to multimillion-dollar costs. These models have intimate knowledge of the data they were trained on but total ignorance of anything outside it. A foundation LLM can provide answers on many topics, much like a massive encyclopedia. Like a printed encyclopedia, a foundation LLM does not add new knowledge over time or go extremely deep on every topic. A customized LLM can be built from a foundation model by training on newer or private data, producing an LLM with specialized knowledge that can give specialized insights. Fine-tuning an LLM is less intensive than building the foundation model but still requires extensive and expensive computing resources, making fine-tuning a necessarily infrequent activity.
RAG does not change the LLM but supplements the model with another data source. New or specialist data is vectorized, that is, translated into numerical embeddings that can be searched at query time, and stored in a vector database. Vectorization is a simple and fast process, requiring far less compute and cost than fine-tuning on the same input data. The vectorized data is then used to supplement the LLM’s knowledge when a question is posed. Vectorization can be a frequent and routine process that updates the vector database and makes new information available as low-cost LLM updates.
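To make the vectorization step concrete, here is a minimal sketch of embedding documentation chunks and retrieving the most relevant ones with a similarity search. The embedding model, the in-memory "database," and the sample chunk text (including the BIOS version) are illustrative assumptions, not the specific components or documents used in the Dell demonstration.

```python
# Minimal RAG vectorization sketch: embed document chunks once, then find the
# chunks most relevant to a question via cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, runs on CPU or a single GPU

# These chunks stand in for Dell technical documentation; the text is hypothetical.
chunks = [
    "PowerEdge servers require BIOS 1.8.2 or later before upgrading to vSphere 8.0U1.",
    "iDRAC firmware should be updated before any major ESXi upgrade.",
]

# One-off vectorization: each chunk becomes a normalized embedding vector.
doc_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k chunks whose embeddings are closest to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec           # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]     # indices of the best-matching chunks
    return [chunks[i] for i in top]

print(retrieve("What BIOS version do I need for vSphere 8.0U1?"))
```

Re-running the vectorization as documents change is what keeps the augmentation current without ever touching the LLM's weights.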
The CTO Advisor team worked with Dell to demonstrate the practicality of using RAG without requiring a vast upgrade to the hardware platform for the application. What is a more modest hardware specification than a farm of servers with huge GPUs? How about a single laptop with a GPU? The demonstration used the Llama3 foundation LLM, augmented with Dell technical documentation. The LLM and the whole augmentation process were run on a single laptop, including vectorizing the Dell documentation and querying the LLM for information from its built-in knowledge and augmented information. Asking an unmodified foundation model about recent software such as vSphere 8.0 yielded an answer that when the model was trained there was no such product version. Asking the same questions of a foundation model with Dell server documentation as augmentation yielded a helpful answer, in this case, the specific minimum version of Dell BIOS needed for a successful upgrade to vSphere 8.0U1. The few minutes of vectorizing Dell documents using a single GPU upgraded the LLM’s knowledge of Dell servers by 2 years.
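The query side of the demonstration follows the same pattern as any RAG pipeline: retrieve the relevant documentation and place it in the prompt so the model can answer from it. The sketch below uses the retrieve() helper from the earlier example; the local Ollama endpoint and the "llama3" model name are assumptions for illustration, as the article does not state which runtime the demonstration used.

```python
# Minimal RAG query sketch: prepend retrieved documentation to the question and
# send it to a locally hosted Llama 3 model (assumed here to be served by Ollama).
import json
import urllib.request

def ask_llm(question: str, context_chunks: list[str]) -> str:
    context = "\n".join(context_chunks)
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # default Ollama endpoint (assumed runtime)
        data=json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Without context the model can only say vSphere 8.0U1 post-dates its training;
# with retrieved Dell documentation it can answer the BIOS question directly.
question = "What is the minimum Dell BIOS version for upgrading to vSphere 8.0U1?"
print(ask_llm(question, retrieve(question)))
```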
RAG becomes even more helpful when you need to control the augmentation data because it is proprietary or regulated. Because that data is not trained into the LLM itself, you retain control over it and can prevent data leakage. RAG also allows the application to identify the specific augmentation source that provided an answer, in contrast to the usual black-box nature of LLMs, where answers are not easily attributed to sources.
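Source attribution falls out of the retrieval step naturally: keep a source reference alongside each chunk and return it with the answer. The short sketch below extends the earlier retrieval example; the file names are hypothetical, not the documents used in the demonstration.

```python
# Extension of the retrieval sketch: track a source reference for each chunk so
# the application can cite where the supporting text came from.
sources = [
    "dell-poweredge-bios-notes.pdf",   # hypothetical source for chunks[0]
    "dell-idrac-upgrade-guide.pdf",    # hypothetical source for chunks[1]
]

def retrieve_with_sources(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Return (chunk, source) pairs so answers can be attributed to documents."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], sources[i]) for i in top]

for chunk, source in retrieve_with_sources("BIOS version for vSphere 8.0U1"):
    print(f"{chunk}  [source: {source}]")
```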
The significant value of RAG is its ability to update a foundation model with new information, providing low-cost LLM updates. With RAG, the Dell servers you already have in your data center may be sufficient for your AI application needs. A small number of GPUs can be effective at providing up-to-date inference using a foundation model and RAG. Naturally, RAG is not the solution to every LLM problem. There are plenty of use cases for fine-tuning, either on its own or with RAG used between fine-tuning updates.
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.
Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.
Other Insights from The Futurum Group:
Dell Rolls Out Its 2024 Partner Program
The Evolving Role of Developers in the AI Revolution
Author Information
Alastair has made a twenty-year career out of helping people understand complex IT infrastructure and how to build solutions that fulfil business needs. Much of his career has included teaching official training courses for vendors, including HPE, VMware, and AWS. Alastair has written hundreds of analyst articles and papers exploring products and topics around on-premises infrastructure and virtualization and getting the most out of public cloud and hybrid infrastructure. Alastair has also been involved in community-driven, practitioner-led education through the vBrownBag podcast and the vBrownBag TechTalks.