On this episode of DevOps Dialogues: Insights & Innovations, I am joined by Kalyan Veeramachaneni, Co-founder and CEO of DataCebo to discuss access to synthetic data for generative AI models.
Our conversation also covered:
- Beyond LLMs: Generative AI models for data locked up in enterprise data warehouses
- Synthetic data use cases that can bring productivity gains
- Scalable approach to adopting synthetic data within an enterprise
- Synthetic data use cases that can alleviate data availability problems
These topics reflect ongoing discussions, challenges, and innovations within the DevOps community.
Watch the video below, and be sure to subscribe to our YouTube channel, so you never miss an episode.
Listen to the audio here:
Or grab the audio on your favorite audio platform below:
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this webcast. The author does not hold any equity positions with any company mentioned in this webcast.
Analysis and opinions expressed herein are specific to the analyst individually, based on data and other information that might have been provided for validation, and are not those of The Futurum Group as a whole.
Transcript:
Paul Nashawaty: Hello and welcome to a special edition of DevOps Dialogues. My name is Paul Nashawaty and I am joined today by Kalyan Veeramachaneni, CEO of DataCebo. Kalyan, welcome.
Kalyan Veeramachaneni: Thank you. Thank you for having me.
Paul Nashawaty: So, exciting times at DataCebo. What’s happening? Why don’t you tell us a little bit?
Kalyan Veeramachaneni: Great. We were founded in 2020. We just closed our seed round last fall and launched our SDV Enterprise product, which is a commercial offering of a popular open source project called the Synthetic Data Vault. It spun out of my lab at MIT. It allows enterprises to build generative AI models on their specific data, the data inside their enterprises. We focus on something we call the third kind of data. You have images, you have language, and now this is the third kind, which is structured and semi-structured tabular datasets. Pretty much every enterprise has them, their business depends on that data, and a lot of the time the business secrets and the business intelligence are in that data, so we focus on it.
So SDV allows people to build generative models on top of that data and use those models to sample a lot of data. That sampled data is called synthetic data, and it can be used for a variety of purposes. It’s very exciting to see that this idea was born at MIT several years ago out of our own pain in getting access to data and building these models, and then to transition it from a research project into open source, see it get a lot of traction, and launch the SDV Enterprise product last December.
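For readers who want to see what this looks like in practice, here is a minimal sketch using the open-source SDV library’s single-table Python API; the toy table and its column names are invented for illustration, and the calls reflect recent SDV versions:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A toy stand-in for an enterprise table (hypothetical columns).
real_data = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "region": ["US", "UK", "US", "DE", "UK"],
    "balance": [1200.50, 89.99, 4300.00, 560.25, 77.10],
})

# Infer column types, fit a lightweight copula-based model, then sample.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=1000)  # statistically similar rows, not real ones
print(synthetic_data.head())
```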
Paul Nashawaty: It’s really exciting to see how it’s coming together, because with the explosion of data, data is the lifeline of business, right?
Kalyan Veeramachaneni: Exactly.
Paul Nashawaty: And if organizations don’t have that access to information, their businesses can potentially fail. And then with the data, it has to be a holistic view of the entire organization. So having access to all different datasets and unlocking that dataset is critically important. We see this in our research. We see that access to this information is growing, and the importance of accessing multiple data sets is also key. When you look at gen AI and you look at LLMs, large language models, and you look at how some of that data might be trapped in some of these enterprises, what does that mean to DataCebo.
Kalyan Veeramachaneni: So for LLMs, I think the interesting thing is that people were able to build those models because language data is so publicly and widely available. For this kind of data, it’s not like all insurance companies are going to get together and say, “Release all the data.” It’s not going to happen. So as a result, a generative AI model for this kind of data has to be built foundationally within an enterprise’s premises. That’s just a very different style of functioning. What that also means is that you have to bring algorithms to the data, and you also have to make the algorithms work in a way that they don’t need a lot of compute.
So we designed several modeling techniques and algorithms at MIT that are very well known; they have long been called probabilistic graphical models. We use them to build these models and allow enterprises to build, on their premises, within their walls, models specific to their data. And that’s really, really important for them because, as I said, for language you have tons of data available on the internet, and once you have a model, you fine-tune it with your enterprise data. But for this kind of data, you have to allow generative AI models to be built internally within the enterprise.
Paul Nashawaty: Yeah, it makes sense to use those models, but when you build out that methodology, there is research that goes behind it. So I’d like to hear a little bit about what you learned in your lab at MIT, how that was brought into the commercialization of DataCebo, and how those work together. What does that look like?
Kalyan Veeramachaneni: Yeah, that’s really good. At MIT, this project went through certain phases, going all the way back to 2014. Our first challenge was that we built it for ourselves, because we had the data from an educational platform. MIT launched edX, an educational platform, and it was collecting a lot of sensitive data, and we wanted to have access controls on it. In my lab at that time, we had one computer with access, and we used to shuffle students around that computer, so it was a lot of productivity loss. And then we decided, “Well, why don’t we create a generative model of the data?
Generative models are known to allow you to sample data from them.” So we built something called a synthetic student, which allowed us to create synthetic data that’s realistic and use that. Once that project was successful, we thought, “How generalizable is this? Can this generalize to all the databases and perhaps solve all our access issues when we try to work with our clients at MIT?” So then we built the Synthetic Data Vault. Transitioning it to open source was a very good idea because it allowed us to give it to a lot of people so they could try it on their enterprise datasets.
So that went very well. We had one million-plus downloads, and people used the software. Then we started working with enterprises very, very closely and looking at how much more complex their datasets are than anything we had ever seen in the public domain. That helped us put features into SDV Enterprise, and it actually started taking off.
Paul Nashawaty: It sounds like the synthetic data approach really helped with productivity gains. So what other use cases are there, not just from the lab’s perspective, but from the enterprise perspective? Can you expand a little bit more on that?
Kalyan Veeramachaneni: Yeah, yeah. One use case that we see as very powerful is testing software applications. As you can imagine, in the last decade almost all software applications have become data-driven, which means a lot of their logic depends on the data they see. They look at the data and say, “Okay, if this happens, take this path; if that happens, take that path.” And there are even sub-paths in there too. So to test these applications when you try to make new releases, developers end up spending time creating the test data: sometimes for performance, sometimes for functionality testing, sometimes for API testing when two APIs are talking to each other.
Creating that data usually takes a few days’ worth of time before every release. Either they write it manually or they request access to production data. So there are a number of ways they try to get it. And we realized, in talking to our beta customers, that that’s where a lot of time is … not wasted, but spent. Manual time is spent, and developers actually don’t like it. One of the developers said, “It’s the most boring part of my work. I would like to write more features, ship more features, and build more functionality.” So what they do is use SDV Enterprise to build the generative model and then sample from it to get the data to test their applications.
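Here is a minimal sketch of that pattern, assuming SDV’s single-table Python API: train once on a small seed of production-like data, then sample fresh synthetic rows for each test run. The table, columns, and test are hypothetical illustrations, not DataCebo’s implementation:

```python
import pandas as pd
import pytest
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical seed table standing in for production-like data.
SEED_DATA = pd.DataFrame({
    "order_id": range(1, 21),
    "amount": [19.99, 5.00, 120.10, 42.00, 7.25] * 4,
    "status": ["paid", "refunded", "paid", "paid", "failed"] * 4,
})

@pytest.fixture(scope="session")
def synthetic_orders():
    # Fit a small generative model once per test session, then sample.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(SEED_DATA)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(SEED_DATA)
    return synthesizer.sample(num_rows=500)

def test_pipeline_accepts_any_generated_status(synthetic_orders):
    # Stand-in for exercising the application under test: the model only
    # emits categories it saw, so routing logic should cover all of them.
    assert set(synthetic_orders["status"]) <= {"paid", "refunded", "failed"}
```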
Paul Nashawaty: Yeah, I haven’t met a developer yet that said, “I really want to be in maintenance mode.” They really want to innovate. They want to drive that innovation and be excited about what they’re doing. The interesting thing about data though, and with modernizing applications, historically you have heritage applications that may be monolithic and they have single data sources. But really as we start growing into these modernized approaches, it’s more distributed and scalable. I’m interested to hear how DataCebo connects these modernized applications into a scalable solution.
Kalyan Veeramachaneni: So I think for a lot of them, their application teams have their own data sources. There is also the centralized approach, which may be what you’re referring to: a centralized data lake where the data resides. The decentralized approach is more scalable, in my opinion. A team that maintains an application looks at the data that application consumes and builds a generative model, an SDV model, specific to that application. And then it becomes part of their pipeline, their regular testing and CI/CD pipeline.
And once one team does that, adjacent teams also want to do it, so it’s much more bottom-up, in the sense of an application-centric view: “Let’s do it for one application, and then another and another and another,” rather than going from the data warehouse point of view, which is gigantic and messy, where there isn’t much governance, or it takes a lot of time. Even if you create a model there, you have to govern it and all that. So a lot of the time when we see adoption, we see it in a much more decentralized way, growing in that fashion.
Paul Nashawaty: Okay. Does that cause challenges for organizations because it is decentralized from the data sources?
Kalyan Veeramachaneni: That’s a good question. It doesn’t, actually. The reason is that when you build a very localized model, because our modeling is so fast and sampling is so fast, teams don’t store the samples. They don’t need to store the samples. So it becomes part of their workflow and their pipeline. It’s not competing with the data warehouse or data lakes or anything; it’s a completely different workflow. It’s part of the software design and development workflow.
Paul Nashawaty: Yeah, it makes sense. And regarding having the workflow at the location, you hear a lot, especially with the large workloads and use cases that are out there, about processing data in place, because moving those large datasets around just can’t be done. It’s too much data. So you need to process it, get the results, and then send the results back to a centralized location. That’s what we’re seeing in our research as well. We see that application portability is key, but it’s really about processing the information in place.
Kalyan Veeramachaneni: Actually, data portability is another big use case for this kind of modeling, where you don’t want to move the data around if you’re moving an application and testing it in different environments, if you’re migrating to the cloud, or if you’re testing across environments. So what you would do is build a model. And the model is what’s most fascinating about this whole generative modeling for this kind of data: the model is just one file. It’s just one file. In a way, it’s a very compressed form of the database. So you can move the model around and you can sample from it.
And by definition, the model only captures the aggregate statistics and aggregate properties. It doesn’t have any real data in it. So you can move the file around, put it next to your application, and sample. Now you’re suddenly not moving the data around between different environments. We see a lot of customers passionate about that aspect because, A, you don’t want to move data around a lot, and B, it’s expensive. Once you’ve migrated data to the cloud, moving things around is actually expensive. So they prefer to build a model, use that model in different environments, and sample from it on demand.
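To make the portability point concrete, here is a hedged sketch using SDV’s save/load API and its public demo dataset; the file name and the two environments are hypothetical:

```python
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

# Environment A: train once where the real data lives
# (SDV's public demo table stands in for an enterprise table).
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthesizer.save("guests_model.pkl")  # the entire model is this one file

# Environment B: ship only guests_model.pkl, then sample on demand;
# no real rows ever leave environment A.
restored = GaussianCopulaSynthesizer.load("guests_model.pkl")
synthetic = restored.sample(num_rows=100_000)
```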
Paul Nashawaty: Yeah, that makes sense. I like the approach you’re describing, where you move the sampled information, take what you need, and then produce the results. It’s really what I’m hearing from clients and customers who are asking about these business challenges. The other thing I hear a lot about when we talk about data is availability, whether you actually have the data available. And when you build these models you’re talking about, how do you ensure that the data is being put in place and used appropriately?
Kalyan Veeramachaneni: So we have a lot of guardrails. I’ll briefly say a few things about the data availability problem. It comes from a couple of issues. One is that some events or things that you’re reporting happen so rarely that you don’t have much data about those occurrences. That’s one availability problem. The second is that you had a product, let’s say in the UK, and you’ve just launched in the US. Now you’re slowly getting customers, so you don’t have as much data as you had in the UK. As a result, you have a data availability problem. The generative model, and generating synthetic data from it, can help with the availability problem in both cases. But one thing we do is provide guardrails: when we synthesize the data, we can tell you the quality of that data and how confident we are in it, because some of it is in the neighborhood of data points where availability is very sparse.
So how real are these datasets if you can’t even know what the reality looks like in that space? We have a lot of quality metrics to assess that and say, “Okay, this is the quality of this data. This is how much you should trust it.” And one of the interesting things that came from one of our customers, because it is going to happen that a lot of the data supply is going to be synthetic, was the question, “Kalyan, if that’s going to happen everywhere, how do I know whether I’m using synthetic or real data? Because you’re claiming that synthetic looks as good as real.” So that brought up the idea of putting in some markers.
So maybe in the keys we put “SDV,” or somewhere in the primary keys we put encoded markers, so it can clearly be seen that you’re working with synthetic data, not real data. That’s the first guardrail we want to put in place, so people are aware they’re working with synthetic data. The second is obviously all the quality metrics we can provide. We can also tell them how much trust or confidence they should have in the synthetic data based on its measured quality.
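Continuing the portability sketch above, SDV ships quality-report tooling along these lines; the key-marker line at the end is a hypothetical illustration of the guardrail idea, not a built-in SDV feature:

```python
from sdv.evaluation.single_table import evaluate_quality

# Score how faithfully the synthetic table matches the real one
# (reusing real_data, metadata, and restored from the sketch above).
synthetic = restored.sample(num_rows=len(real_data))
report = evaluate_quality(real_data, synthetic, metadata)
print(report.get_score())                   # overall 0-1 fidelity score
print(report.get_details("Column Shapes"))  # per-column breakdown

# Hypothetical marker convention: prefix primary keys so downstream
# consumers can immediately see these rows are synthetic.
synthetic["guest_email"] = "SDV-" + synthetic["guest_email"].astype(str)
```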
Paul Nashawaty: Very good. Well, we’re coming to the end of our session here, and we’ve talked a lot about data, the application modernization process, the impacts on the CI/CD pipeline, and what it means to move ahead. What would you recommend to the audience on how they can get started with DataCebo?
Kalyan Veeramachaneni: I would say start with the Synthetic Data Vault. It’s publicly available open source. Go there; that’s the first stop in our product adoption. Build an SDV model, synthesize data, and test out the ROI. We find that teams who assess the ROI upfront with the publicly available version are more likely to succeed at broader adoption. And then, once they hit the complexity of multiple teams or multiple applications, they reach out to us. We have SDV Enterprise, which can handle complex datasets, complex data structures and formats, and the whole holy grail of enterprise data. From there they can go application by application, in that direction.
Paul Nashawaty: Very good, very good. Kalyan, I’d like to thank you for your time, your perspective and insights today. And I’d like to thank the audience for attending and watching our session today. If you want to learn more about what we’re talking about and our research and our information, please follow us at futurum.com. Thank you.
Other insights from The Futurum Group:
Application Development and Modernization
The Evolving Role of Developers in the AI Revolution
Docker Build Cloud Aims to Revolutionize DevOps
Author Information
At The Futurum Group, Paul Nashawaty, Practice Leader and Lead Principal Analyst, specializes in application modernization across build, release and operations. With a wealth of expertise in digital transformation initiatives spanning front-end and back-end systems, he also possesses comprehensive knowledge of the underlying infrastructure ecosystem crucial for supporting modernization endeavors. With over 25 years of experience, Paul has a proven track record in implementing effective go-to-market strategies, including the identification of new market channels, the growth and cultivation of partner ecosystems, and the successful execution of strategic plans resulting in positive business outcomes for his clients.