Unstructured data is the next frontier for AI: think video, audio, and more. David Nicholson is joined by Dell Technologies' Vice President of Product Management for Artificial Intelligence and Data Management, Chad Dunn, for a conversation on the strategic importance of high-quality data and the capabilities of the Dell Data Lakehouse in supporting effective AI workloads.
Highlights include ⤵️
- Data quality is paramount: “Garbage in, expensive garbage out” applies more than ever in the age of generative AI
- Dell’s Data Lakehouse: This intelligent platform helps organizations extract, prepare, and analyze data for AI workloads, including both structured and unstructured data with tools like Apache Spark and Trino
- Customer experiences: The evolving landscape of data challenges in large enterprises
- Pushing the boundaries: Dell’s approach to managing unstructured data and integrating AI Factory visions into Lakehouse functionalities
Learn more at Dell Technologies.
Watch the video below at Six Five Media at SC24 and be sure to subscribe to our YouTube channel so you never miss an episode.
Or listen to the audio here:
Disclaimer: Six Five On The Road at SC24 is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
David Nicholson: Welcome to Six Five On The Road, continuing coverage of SuperComputing 2024. I’m Dave Nicholson, and I’m joined by a very distinguished guest from Dell Technologies, Chad Dunn. Welcome, Chad.
Chad Dunn: Thanks very much. It’s great to be here.
David Nicholson: So we’re going to talk about something, actually, the subject took me off guard a little bit because I thought that data didn’t really matter anymore. Chad, I mean, can’t we just use AI? Doesn’t that solve all of our data sins? How’s that for a softball question?
Chad Dunn: Boy, what a terrible question. Data always matters. Data is the most important thing, and we’re lucky enough to store most of the world’s most critical data on our systems, and they trust us with it, which is an amazing thing. But in terms of AI, data matters a lot because you really need to think about the quality of the data that you’re introducing to your generative AI to get to business outcomes. And if you think of the old saying, garbage in, garbage out, if you feed garbage into generative AI, you’re not going to just get garbage out, you’re going to get very expensive garbage out. So you really got to pay attention to what are your data sources? What’s the quality? What’s the lineage? How’s it governed? There’s so many aspects to having a good data strategy before you have a generative AI strategy.
David Nicholson: Well, we’ve been hearing a lot at the conference about Dell’s AI factory. So under the heading of the AI factory, you open the doors to the factory and you walk in, what are we going to see there in terms of what you’re doing from a data perspective? We’ve heard the term data lakehouse, want to hear what that’s all about, the Dell Data Lakehouse. But what are some of the underlying technologies that you’re leveraging that you’re bringing to bear, specifically for AI?
Chad Dunn: Well, if you go under the covers and you sort of look at the AI factory, you’re going to see a lot of really sexy things. You’re going to see a lot of servers with lots and lots of GPUs. You’re going to see really complex, really high-performing networks, high-performing storage, object storage, and then you’re going to see where the data comes from, and you’re going to see something called the Dell Data Lakehouse. And this is the product that we use to be able to extract the data from those enterprise sources and prepare it to be used by generative AI. Now, we’ve largely had a structured data focus for the data lakehouse before. We partnered with a company called Starburst, a very key partner for us who produces a software product called Trino. And this is basically enabling us to do things like federated search and data virtualization. So being able to get your data no matter where it is through a single SQL query. But we’re starting to augment that; we’re starting to move that into the world of unstructured data, starting with embedding something called a Spark engine in the data lakehouse.
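For readers who want to see what that single federated SQL query looks like in practice, here is a minimal sketch using the open-source Trino Python client. The coordinator hostname, catalog names, schemas, and tables are hypothetical placeholders for illustration, not Dell Data Lakehouse specifics.

```python
# Minimal sketch of a federated Trino query joining two sources through one SQL statement.
# Host, port, user, catalog, schema, and table names below are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="lakehouse.example.com",  # hypothetical Trino/Starburst coordinator
    port=8080,
    user="analyst",
    catalog="hive",                # default catalog; the query also references another
    schema="default",
)

cur = conn.cursor()
# One SQL statement reaching into a relational source and an object-store-backed table.
cur.execute("""
    SELECT o.order_id, o.amount, c.segment
    FROM postgresql.sales.orders AS o      -- hypothetical federated PostgreSQL catalog
    JOIN hive.curated.customers AS c       -- hypothetical table over object storage
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2024-01-01'
""")

for row in cur.fetchmany(10):
    print(row)
```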
David Nicholson: Got it. Got it. So when people talk about Spark and Trino, that duo has you covered for structured and unstructured data, is that the idea?
Chad Dunn: It’s two steps along the path. So Trino is very good at structured data. So you can point it at your transactional systems, your databases, and run structured queries there, or you can copy that data into a centralized data store and use it there. When you bring in something like a Spark engine, you’re able to ingest streaming data. Now, that data could be from transactional systems, it could be telemetry from hardware, it could be from software, it could be from IoT devices, and that’s largely what I would call semi-structured. You’re able to ingest that data at the same time and process it in real time. And as we move forward, you’ll see us add more and more truly unstructured capabilities to the data lakehouse. So not simply just semi-structured data, but things like audio and video files and other things.
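As a rough illustration of the kind of semi-structured streaming ingest described here, the sketch below uses standard PySpark Structured Streaming to read JSON telemetry from a Kafka topic and land it as Parquet. The broker, topic, schema, and paths are assumptions for illustration, not details of Dell's embedded Spark engine.

```python
# Illustrative sketch: ingest semi-structured telemetry from Kafka and land it as Parquet.
# Broker address, topic name, schema, and output paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("telemetry-ingest").getOrCreate()

telemetry_schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.com:9092")  # hypothetical broker
    .option("subscribe", "device-telemetry")                       # hypothetical topic
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into typed columns.
parsed = (
    raw.select(from_json(col("value").cast("string"), telemetry_schema).alias("t"))
    .select("t.*")
)

query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3a://lakehouse/telemetry/")                    # hypothetical path
    .option("checkpointLocation", "s3a://lakehouse/checkpoints/telemetry/")
    .start()
)
query.awaitTermination()
```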
David Nicholson: So I’m going to put my naive hat on for a minute here. Under the heading of AI, there are a lot of different things that people do. You’re talking about preparing data for AI. Are you preparing data for the fine-tuning of models, or is this primarily so that it’s ready to be retrieved by RAG, retrieval-augmented generation, or both? Is this sort of a foundational thing that you do from a data lakehouse perspective upon which everything that we think of as AI rests? Is that what this looks like?
Chad Dunn: Yeah, it really is that foundational step. I find whenever I have a generative AI conversation with a customer, a lot of the meeting gets consumed by talking about their data strategy, whether it’s discovery of data, where it’s stored, what format it’s in, how it needs to be transformed, where it needs to be stored. I’ll give you a great example of what we do internally at Dell. We have an AI initiative inside of Dell called Next Best Action, and that’s all about analyzing what’s happening with our hardware in the field and what our customers are telling us when there’s a service event where they need to call us or contact us. So we’re monitoring that telemetry, actually using this product, to get it into the Dell data lake. And then we’re making AI decisions on what the next best action is to get to a faster resolution of the problem, and we’re arming our customer support representatives with that data. That’s done as a RAG search. So we’re using that data lake, with that high-quality data, which is the telemetry from our products, to arm them to have that conversation.
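To make the RAG pattern described here concrete, below is a minimal, generic retrieval sketch over telemetry-style text snippets: embed the documents, find the ones closest to a support question, and prepend them to a prompt for a generative model. The embedding model name and example documents are illustrative assumptions; this is not Dell's Next Best Action implementation.

```python
# Generic retrieval-augmented generation sketch, not Dell's Next Best Action implementation.
# Model name and example documents are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Array A reported fan failure codes on nodes 3 and 4 before shutdown.",
    "Firmware 9.2.1 resolves intermittent NVMe timeouts seen under heavy write load.",
    "Customer reported slow rebuilds after replacing a failed drive in enclosure 2.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of embedding model
doc_vecs = model.encode(docs, normalize_embeddings=True)

question = "What should support do about NVMe timeouts during heavy writes?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product because the vectors are normalized.
scores = doc_vecs @ q_vec
top = [docs[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this assembled prompt would then be sent to a generative model
```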
David Nicholson: So there’s a lot of conversation among the CIOs and CTOs that I work with. This question of, shall we build it and hope that they will come? Or do we absolutely have to figure out exactly what we’re going to do in AI before we make a move at all? Could you work with a company that knows, for example, that they’re going to be fine-tuning a model with their own bespoke data, and let’s say, in their mind, they’re thinking, look, and by the way, we’re going to do this on premises because we don’t want this data living off-prem, and we’re not exactly sure what this is going to look like moving forward? Could you tell them that even if they don’t know what the ultimate outcomes are that they’re looking for, they can at least get started on data hygiene and leveraging this stuff right away? Is that a reasonable step to begin instead of waiting until you think you’ve got everything figured out to make any move at all?
Chad Dunn: Yeah. Generative AI will expose a bad data management strategy very, very quickly. And so it is important to get ahead of this, and you really need to consider your data management strategy before you even head down the path of generative AI. Now, I’ll say that you mentioned build it and they will come, what we’re really finding with most customers is there’s a line already forming where something needs to be built. We had a customer support organization who wanted to provide a faster time to resolution, who wanted to provide a better support experience. We had programmers who wanted to write better quality code. We have marketing people who want to generate content much, much more quickly, and they’re sort of waiting for these tools and waiting for us to say, here is the Dell-blessed generative AI solution to be able to do so. And I suspect, in fact, I know that many of our customers are in that same position where they sort of know now what the use cases are, and they want to know how to get there.
David Nicholson: What does the actual underlying storage look like in these environments? Dell provides a whole variety of backend storage solutions, internal RAID systems, external block, S3, I mean, pretty much everything. What does this look like? Is it a mix of all of the above?
Chad Dunn: Well, there’s a mix, but I’ll tell you the real workhorse of the AI storage portfolio for the AI factory is PowerScale because that’s giving you the unstructured file storage. That’s also giving you an S3 interface. And so that’s the one that you really can’t do without. And that’s where we see a lot of value in using that with the data lakehouse for storage.
David Nicholson: And in that case, are those PowerScale nodes, separate nodes that are dedicated to storage, or is that software that’s running on top of servers that are doing compute work along with storage work? What does that look like?
Chad Dunn: It’s more the latter. And the reason being, there is a characteristic difference between something that is optimized to be a compute workload versus a storage workload. I mean, we’ve certainly seen, and maybe in the past been victims or believers in, putting compute workloads on storage, and that has never ended very well for really anyone that we’ve seen. And so what we tend to do is put some of these advanced processing capabilities where they belong, and that’s on compute nodes. And that’s why the Data Lakehouse is what it is. It’s a separate appliance. It sits alongside storage and does that data management function, does that federated search, does vector search, does semantic search outside of the primary storage. And we believe that’s a better way to go because you’re going to want to be able to flex the capacity of that data management software in a much different way than you would the capacity or the compute power in your storage arrays.
David Nicholson: Okay. So even though it might look cool to see a picture of a Ferrari towing an Airstream or a boat, it’s probably better to have a good old-fashioned Ford F-150, or at least a device that’s designed for the job.
Chad Dunn: Well, if we’re going to vacation together and we’re taking your Ferrari, we’re going to tow an Airstream, okay.
David Nicholson: Just won’t go up any hills. I think we’ll be okay. So what else do we need to understand about this concept of… Because look, there was a time when we in the industry touted the value of information being the next oil, the next gold. It’s information. Data mining was all the rage. Data lakes were all the rage. If people have heard the term data lake, and now you’re saying data lakehouse, be explicit. What’s the difference between those two concepts?
Chad Dunn: So there’s tremendous value locked up in that unstructured data. So today, if we look at the metadata associated with say, audio files, great. We know when it was created. We know how big it is. We know who modified it last. Great. What if we knew more things about what was inside it? What if we used a language model and GPUs to inspect the contents? Let’s say it’s audio recordings of a customer service call. What if I could tell what’s a good call versus a bad call? What if I could base my training on good calls instead of bad calls? What to do, what not to do? What if I could infer things from my video files? What if I could search through all my picture files and find all the pictures of cats or find all the pictures of 20-year-old Chad with frosted tips and say delete? That would be a big value to me and probably to society. So being able to use GPUs and language models to inspect the unstructured data increases the value immensely. And so I think that as we start to add more and more unstructured features to the data lakehouse, to the data platform, we’re going to be able to realize those features.
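As one hedged illustration of using a GPU-backed model to look inside unstructured files, along the lines of the cat-picture example, the sketch below runs zero-shot image classification with an openly available CLIP-style checkpoint via Hugging Face Transformers. The model choice, labels, and file paths are assumptions for illustration, not a feature of the Dell Data Lakehouse.

```python
# Illustrative sketch: score image files against text labels with a CLIP-style model.
# Model name, labels, and file paths are hypothetical; this is not a Lakehouse feature.
from pathlib import Path
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",  # assumed openly available checkpoint
    device=0,                              # GPU index; use -1 to run on CPU
)

labels = ["a photo of a cat", "a photo that does not contain a cat"]

for image_path in Path("pictures/").glob("*.jpg"):  # hypothetical image folder
    scores = classifier(str(image_path), candidate_labels=labels)
    best = scores[0]  # results are sorted by descending score
    if best["label"] == labels[0] and best["score"] > 0.8:
        print(f"{image_path} looks like a cat photo (score {best['score']:.2f})")
```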
David Nicholson: It’s interesting because we’ve been talking about unlocking the value from data for a long time. It really seems like we’re there now. We’ve got the horsepower and the intelligence to do it. We’ve certainly got the capacity to do it. It’s mind-blowing. What do you see moving forward in terms of advances from a data perspective, or if you think about what’s going on now from a Dell Data Lakehouse perspective, what you’re doing with Spark and Trino at this point, are there bottlenecks, things that you’re looking to overcome when you look into the next year, advancements that you’re looking forward to? What does the future hold from a data perspective?
Chad Dunn: I think for us, what we’re looking at is what are the limits of the data functionality we can place in the platform to start to inspect those unstructured files and to do interesting things with them. And the technology is moving, again, very, very fast ’cause we’re on that sort of generative AI innovation cycle, which is faster than anything I think that I’ve ever seen, at least in my career so far. So I think that really is the next frontier, is getting into those unstructured files and gaining insights from them. If you look at most of the market projections, by 2030, the vast majority of AI inferencing will be on video files, not the text files, not the things that we’re used to today. So I think it’s going to make us incredibly more productive. It’s going to give us insights that we didn’t know we could get out of this content.
David Nicholson: And key point, tell me if I have this wrong, but it feels like a key point is this isn’t our grandparents’ storage. This isn’t simply we have a place where we can securely and with enough performance, store your data. This idea of a data platform involves intelligence that can provide insights into, going back to what you were talking about, structured data. Fine, it’s in a database and we had that power, but specifically on the unstructured side, the platform you’re talking about is intelligent and it’s pushing something up other than simply saying, here’s the file, here’s where it exists, or here’s the block address, here’s where it exists. So that’s the key to what you’re doing. There’s value add there. It’s an intelligent platform. Is that a fair way to characterize it?
Chad Dunn: No, you’re absolutely right about that. I mean, the symbiosis between the data platform and the underlying storage technology is absolutely critical. So you just look at some of the things that we’ve done just recently in PowerScale. So the ability to export the entire metadata catalog so we don’t have to go traverse directory trees to find things. A robust S3 interface that notifies us of changes so we can constantly hydrate the data lake and look for changes and be proactive as the data changes. Things like open data formats for tables like Iceberg or storing things in a Parquet format. These are the things that make storage arrays smarter and more able to cope with the needs that we’re going to see in generative AI going forward. And PowerScale is right there with us the whole time.
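For readers unfamiliar with the open table formats mentioned here, below is a minimal sketch of configuring a Spark session with an Apache Iceberg catalog over an S3-compatible warehouse and writing a table whose data files are Parquet. The catalog name, warehouse path, and package versions are assumptions, not Dell Data Lakehouse or PowerScale configuration.

```python
# Minimal sketch: an Iceberg catalog over S3-compatible storage, with data stored as Parquet.
# Catalog name, warehouse path, and package versions are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://lakehouse/warehouse/")  # hypothetical
    .getOrCreate()
)

df = spark.createDataFrame(
    [("array-17", "healthy"), ("array-42", "degraded")],
    ["asset_id", "status"],
)

# Iceberg tables default to Parquet data files and carry their own metadata and manifests,
# so engines like Spark and Trino can share the same table definition.
df.writeTo("lake.telemetry.asset_status").using("iceberg").createOrReplace()

spark.table("lake.telemetry.asset_status").show()
```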
David Nicholson: Okay. So Dell’s Data Lakehouse is not simply a storage array with gray Martha’s Vineyard shingles hammered to the outside of it. It is actually something far more sophisticated. This is good to know. Final thoughts or final new things that have come out around SuperComputing 2024 that we need to make sure we know about?
Chad Dunn: Well, look, you got the big news. The next big thing is the embedding of the Spark engine, which we think is a game changer for us. And I think as you watch our roadmap evolve, you’re going to see more and more of those unstructured features that I talked about start to be realized. And it’s an amazing time to be in the data management space.
David Nicholson: Fantastic. Fantastic. And to all of our viewers, go out and make sure you know the difference between structured and unstructured data, because it’s one of those things that people get confused about all the time. For Six Five On The Road, continuing coverage of SuperComputing 2024, I’m Dave Nicholson. Thanks for spending time with us.
Author Information
David Nicholson is Chief Research Officer at The Futurum Group, a host and contributor for Six Five Media, and an Instructor and Success Coach at Wharton’s CTO and Digital Transformation academies, out of the University of Pennsylvania’s Wharton School of Business’s Aresty Institute for Executive Education.
David interprets the world of Information Technology from the perspective of a Chief Technology Officer mindset, answering the question, “How is the latest technology best leveraged in service of an organization’s mission?” This is the subject of much of his advisory work with clients, as well as his academic focus.
Prior to joining The Futurum Group, David held technical leadership positions at EMC, Oracle, and Dell. He is also the founder of DNA Consulting, providing actionable insights to a wide variety of clients seeking to better understand the intersection of technology and business.