Enterprising Insights, Episode 15: Text-to-Image Generation for Enterprises

In this episode of Enterprising Insights, The Futurum Group’s Enterprise Applications Research Director Keith Kirkpatrick discusses the benefits and challenges of using text-to-image tools powered by generative AI in an enterprise setting. He delves into the various tools that are available, the need for strong guardrails to protect against misuse, toxicity, and bias, and the challenge of making sure that models and guardrails reflect the real world. Kirkpatrick will also highlight a few notable news items from the enterprise application space, and, as always, will close the show with the Rant or Rave segment.

You can grab the video here and subscribe to our YouTube channel if you’ve not yet done so.

Listen to the audio below:

Disclaimer: The Enterprising Insights podcast is for information and entertainment purposes only. Over the course of this podcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.


Keith Kirkpatrick: Hello everyone. I’m Keith Kirkpatrick, Research Director with The Futurum Group, and I’d like to welcome you to Enterprising Insights. It’s our weekly podcast that explores the latest developments in the enterprise software market and the technologies that underpin these platforms, applications, and tools. This week I’d like to discuss the benefits and challenges of using text-to-image tools powered by generative AI in an enterprise setting.

So I’m going to talk about some of the various tools that are currently available, the need for strong guardrails to protect against misuse, toxicity and bias, and the challenge of making sure that the models and guardrails ensure that the outputs reflect the real world. Then I’m going to highlight a few notable news items from the enterprise application space. And as always, I’ll close out our show with our Rant Or Rave segment where I’ll pick one item in the enterprise software market and I will either champion it or criticize it. So let’s get right into it.

So this week’s topic is generative AI-based text-to-image tools. What are these? Well, essentially these are tools that allow people to use generative AI to create imagery simply by typing in a text prompt. So for example, a user could type in a description of an image or a scene like, “I’d like to create a red sports car driving down a highway with a city scene in the background.” Once someone types in that prompt, the tool will return an image based on that description or that prompt.

Now, how does this happen? Is this magic? How does this actually occur? Well, essentially this is using an AI model that has been trained on a massive data set of both text and images. So the basic approach starts with taking hundreds of millions of different images that are each paired with a caption that describes it in words. So for example, if you were to take this example of a car, or a sports car let’s say, you would actually describe each image in words with what it is. Now, the more complex models take it a step further, and they will actually break down large images into smaller chunks or smaller pieces to allow a more granular description of an item and some of the components that make up that item.

So for example, if we’re using that car example, or a sports car let’s say, you could actually go through and describe certain wheels or tires that are generally used on sports cars. So things with wider tread, perhaps they have a certain style or look to them that goes on a sports car versus let’s say a tractor or an 18-wheeler. So the idea here is to basically train the model on very granular images and make sure that each image is described in as much detail as possible with what it is and what the attributes are for that image. And essentially, then the goal is for the model to basically break that down into pixels and then reconstruct them again based on the text prompts.

Now, ultimately the goal here is to be able to create images in almost the same way that we think of things that humans create images, where if you think of again, a car, you understand what are the elements that make a car a car and what are the elements that make a sports car versus a family minivan or an SUV. And the idea is that the model is only able to really provide that feedback or generate an image based on the number of images and the complexity and the granularity of its inputs.
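To make the chunking idea above concrete, here is a minimal sketch of breaking an image into smaller patches; the image size, caption, and patch size are illustrative stand-ins, not any model’s actual training pipeline.

```python
import numpy as np

# A hypothetical training pair: an image (as a pixel array) plus the
# caption that describes it in words.
image = np.zeros((64, 64, 3), dtype=np.uint8)  # stand-in for a sports car photo
caption = "a red sports car driving down a highway, city scene in the background"

def split_into_patches(img, patch_size):
    """Break a large image into smaller chunks, as the more complex
    models do so each region can carry a more granular description."""
    height, width, _ = img.shape
    patches = []
    for y in range(0, height, patch_size):
        for x in range(0, width, patch_size):
            patches.append(img[y:y + patch_size, x:x + patch_size])
    return patches

patches = split_into_patches(image, 16)
print(len(patches))  # a 64x64 image in 16x16 chunks yields 16 patches
```

In a real pipeline, each patch would get its own fine-grained label (wheels, tires, headlights), which is what lets the model learn the components that make a sports car a sports car.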

So when we’re talking about this, let’s just quickly talk about some of the key providers, and there are really a ton of companies out there doing it. Some of the major ones are those that a lot of people may be familiar with. For one, Adobe. They have their Firefly product. This is essentially enterprise-grade image generation technology. So let’s just say a user could type in, again, a prompt saying, “I’d like to see a farm landscape with butterflies.” And because there’s such a massive amount of imagery that has been captured by Adobe over the years within its stock repository of photographs and artwork and so forth, the software is able to generate images based off of those text prompts.

Now, the interesting thing about Adobe of course is that they’ve done a great job of making their image generation offering enterprise grade, mainly around the idea of making sure that for any image being used to generate content, either there is a known content provider who has been compensated for the use of their artwork, or the image is in the public domain. And that’s really important, because when we get into issues of commercial usage of this technology, companies are going to want to make sure that they have the rights to use the images that are created. I’ll get into a little more of that later.

Obviously there are other companies out there that are doing this. Midjourney is one; their model is known for very, very high-quality, photorealistic images. Then there are the DALL-E 2 and DALL-E 3 models, which come from OpenAI. Those are either available directly through them, or a lot of times you will have other image creation or image editing software, like Canva, that will allow you to access those models through their tool.

The interesting thing about these is that, again, the goal is to really be able to create a very, very vivid image based on all of the images that have been collected over time in their repository so you can actually create a very realistic-looking image. There are a couple other ones out there that are notable. Stability AI actually has a new model out, which again is designed to improve the quality and granularity of text-to-image generation. I believe their offering is available now.

Microsoft also offers this through Copilot Pro. I think they’re pricing this at about 20 bucks a month. And I’ll get into pricing in a little bit, but really a lot of these providers are taking sort of a hybrid approach in that they are charging a basic access fee, and that could be anything from 10 bucks a month, 20, 30, 40, what have you. That will get you a certain number of credits. And really what a credit is, is a way to assign value to the amount of compute or processing power that is required to generate these images. So for example, a very, very simple image like a stick figure drawing would not require the same amount of generative credits that a highly photorealistic, very intricate or complex scene might. So the goal here with all of these providers is to figure out a way to make sure that the value is appropriately captured in the pricing.
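The credit model described above can be sketched roughly like this; the tier names, credit costs, and monthly allotment are illustrative assumptions, not any provider’s actual rates.

```python
# Hypothetical credit schedule: each generation consumes credits roughly
# proportional to the compute it requires. Tier names, costs, and the
# monthly allotment below are illustrative, not any vendor's rates.
CREDIT_COST = {"stick_figure": 1, "standard": 4, "photorealistic": 10}

def spend_credits(balance, jobs):
    """Deduct credits for a list of generation jobs, failing once the
    monthly allotment is exhausted."""
    for complexity in jobs:
        cost = CREDIT_COST[complexity]
        if cost > balance:
            raise RuntimeError("credit balance exhausted")
        balance -= cost
    return balance

# A $20/month plan might include, say, 100 credits.
remaining = spend_credits(100, ["stick_figure", "photorealistic", "standard"])
print(remaining)  # 100 - 1 - 10 - 4 = 85
```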

And really, that just comes down to assessing how much compute power is required to do a simple task versus a very complex one. So with all of these different providers, there are obviously two different goals here. One is obviously to be able to really create very, very detailed, highly realistic-looking images. That’s obviously a big focus for all of these companies because they want to make sure that the images that are created can rival, or perhaps even exceed, real-world photographs or pieces of art.

The second part, of course, is looking at the accuracy and the ease of use when it comes to typing in prompts that deliver what the end user wants. So for example, if you are just sort of blindly typing in a prompt, “Show me a red sports car driving down the street,” well, the chance that you’re getting something that exactly conforms to what you envisioned is pretty low, because you really haven’t given any sort of constraints or put in any parameters.

Some of these applications do a good job by providing templates or guides to help create a more complex, information-rich prompt, which obviously leads you to a better result with fewer iterations or shots in terms of trying to create that. So essentially, it would be like if you told a person to draw you an image of a sports car; if you just give that sort of wide-open framework, who knows what you’re going to get back? But if you were to explain in detail exactly what you’re looking for, the likelihood that you would get a result back that matches what you’re thinking is much higher. One of the goals of all of these applications is to make it easy to do that, often with a combination of prompt aids and templates, that sort of thing, to help guide you quickly from the idea in your head to your eventual output.
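A prompt template of the kind these applications provide might look something like this sketch; the slot names and defaults are hypothetical, not any specific product’s template.

```python
# A hypothetical prompt template of the sort these tools provide as a
# guide. The slot names and default values are illustrative assumptions.
def build_prompt(subject, setting, style="photorealistic",
                 lighting="natural light", framing="wide shot"):
    """Combine a bare idea with constraining parameters to produce a more
    information-rich prompt, reducing iterations toward the desired result."""
    return f"{subject}, {setting}, {style}, {lighting}, {framing}"

prompt = build_prompt("a red sports car", "driving down a city highway at dusk")
print(prompt)
# a red sports car, driving down a city highway at dusk, photorealistic, natural light, wide shot
```

The template’s value is that the user only supplies the idea; the constraining parameters come pre-filled, which is exactly the difference between the vague and detailed requests described above.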

Now, why is all of this important? Well, at the most basic level, it’s interesting. It’s really, really interesting and really cool that you’re able to create an image by simply typing in a prompt instead of needing to do it by hand or Photoshop an existing image. And really when we’re thinking about it, there are so many really interesting commercial uses here. The obvious one, the low-hanging fruit, would be marketing campaigns, where instead of, again, having to go through and select some artwork from a stock photo repository and then spend a lot of time manipulating that image, by using generative AI you’re able to customize that much more quickly. That’s a massive time saver.

It can also be used to create multiple variations of that image. So for example, let’s use an article of clothing, like a jacket. Well, if you want to merchandise a jacket, you probably want it to appeal to as many situations or people as possible. And how do you do that? Well, you show that jacket in a number of scenarios. Maybe it’s someone going out to a club who wants to wear it for a night out, but maybe it also looks good in a more formal setting, in an office, or perhaps in a casual setting just down at your local coffee shop. By using generative AI, you can actually generate backgrounds very, very quickly with that one image as the focal point, and do that in such a way that it saves a significant amount of time in creating those background scenes.

And of course, you could do the same thing with variations; instead of having just the jacket in a shade of gray, you could do any number of variations in different colors. So the idea is you’re reducing the amount of time it takes to create imagery that can be used for a commercial purpose. Now, the other thing that’s also really interesting about this is you can use it in such a way that if you are a shopper on a website and you were looking for a particular outfit that would look good at the club, there again, on the back end, a lot of these platforms are using generative AI in such a way that you can type in, “I need a shirt to go out in the club.” And guess what? They will be able to take an item that has been appropriately tagged, along with a scene of a club background, and generate that very quickly on demand so that the product would appear in context as being club-worthy or office-worthy or what have you.
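The variation idea above, one hero product rendered across combinations of color and scene, can be sketched as follows; the colors, scenes, and prompt wording are illustrative, not any merchandising platform’s actual workflow.

```python
from itertools import product

# Hypothetical merchandising sketch: one hero product rendered across
# combinations of color and background scene. Each string would become a
# text prompt for the image model; the values here are illustrative.
colors = ["gray", "navy", "olive"]
scenes = ["a night out at a club", "a formal office setting", "a casual coffee shop"]

variations = [f"a {color} jacket, worn during {scene}"
              for color, scene in product(colors, scenes)]
print(len(variations))  # 3 colors x 3 scenes = 9 prompts from one source image
```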

The idea is that by creating context and personalizing that content, it will go that much further in terms of generating a more relevant experience for the end user when they’re shopping online and trying to figure out what they want to purchase. So you can see there’s a massive opportunity here for utilizing this technology. Now, this sounds all well and good, and it really is pretty amazing. The challenge, of course, is that when you have something like an image and you want to describe it, there are quite a few pitfalls out there in terms of how you describe particular images. I’ll use a really obvious example: let’s talk about a particular occupation, like a nurse.

Traditionally, if we go back 50 or 60 years, most people employed in that position were women. And if you were thinking about a nurse from back in the day, it was a woman, probably a Caucasian woman. Now, obviously that is not reflective of the world today. There are people of all backgrounds and genders who are nurses. And it is the responsibility of the model developer to make sure that when you type in “nurse,” it responds with a range of images that would be appropriate, not just the stereotypical World War II nurse in the whites that you might’ve seen in the movies from the ’30s or ’40s.

So that is very important, to make sure that the model does not absorb conscious or even unconscious human biases during that training stage, particularly when it comes to labeling imagery. Now, the other thing that can happen is there is that fear, and it actually has happened, where the model will actually overcorrect for these issues of bias and toxicity. So if you were to say, “Create a picture of a Caucasian nurse,” depending on the model, it may not return that, because given the way the model or the guardrails have been set up, it wouldn’t want to reinforce that sort of stereotype.

And the challenge, of course, is that aside from the fact that there still are plenty of Caucasian nurses out there, the other problem is that we started to see issues where the model overcorrected to the point where someone would request an image of an actual or historical person, and the imagery that came back would be, quite frankly, incorrect, because the model was essentially overcorrecting and basically distorting history with the types of images that were generated. And that is a real problem because it really does two things. Number one, it does not engender trust that the model has been properly trained and looked at from all perspectives. And two, it also tends to make users question how these models are actually being evaluated. Who is doing the evaluation? Is there bias there? Is there an agenda?

And I don’t know whether that is the case at any organization, or if it’s just that they didn’t tune their models or their guardrails properly, but at any rate, it’s not a good situation and it is a real challenge. I do think the companies are going to get there, but it’s something that requires not just a cursory look at the model. There needs to be a lot of testing. There need to be a number of stakeholders who look at it from a number of different perspectives to make sure that the models perform the way that they are supposed to, whether we’re talking about image generation or a more general-purpose assistant or other tool that uses generative AI.

But we’ll come back to this in a minute. Before I let this episode go too long, I wanted to talk a little bit about a couple other really interesting things in the world of enterprise applications. This past week we saw some interesting news from Microsoft. They just released, for public preview, Copilot for Finance. This is an interesting generative AI tool. And really what this is, is a version of Copilot that’s been tuned or tailored for finance professionals and some of the tasks that they do. The interesting thing about this is that if you think about finance and information and some of the tasks that they need to do, well, sometimes the information is not held in one place.

Sometimes the information is held in different systems around the organization. What Copilot is going to be able to do is allow these workers in finance to interact with data held within other corporate sources from within the applications that they are used to using. So it could be things like Excel or Outlook. And that’s really interesting, because one of the big challenges for folks is sometimes they spend so much time hunting down information that it kind of pulls them away from their core focus, which is doing their job. And in these cases, Microsoft highlighted three sort of typical use cases around variance analysis, reconciliation of accounts, and collections.

All very important jobs, very important tasks where accuracy is paramount. These Copilot systems are really designed to make it easier to interact with data within the applications that these folks are familiar with, and allow them to focus in on the information that they need to make the right business decisions. And of course, they’re using AI here to help guide various workflow automations and recommendations based on the data. And really, these are also tuned to be very role-specific, which is a great way to make sure that the people who are interacting with the data are getting what they need in a way that’s relevant to them. So that’s one interesting thing that we saw over the past week.

Another one, and we’ll kind of go back a little bit to generative AI here: Adobe. They just announced Project Music GenAI Control. This is a very, very early-stage generative AI music generation and editing tool. Really cool. It’s very similar to image generation in that you can use a text prompt that feeds into a generative AI model and basically say, “Hey, I want to generate music and I want it to sound like hair metal or acid jazz or ’40s big band music.” And the tool will actually go through and generate music based on a reference melody, and really adjust the tempo, the structure, the patterns, all of that kind of information that makes up music, to create a repeatable loop based on that style of music. And that’s really interesting because it will help creators, music creators, generate music very quickly within a particular style, and again, using these almost predefined, I don’t want to say templates, but they’re basically style guides.

Now, really interesting technology. It’s in a very, very early stage. There are obviously questions around at what point this further contributes to the issue of music sounding the same, particularly when it comes to things like commercial use of music for jingles or commercials or what have you. That is certainly a big question that is always going to be out there whenever we think about using generative AI to create a creative work as opposed to just handling a repetitive task. Ultimately, I think it may be used as a tool by music creators to speed up their workflow. There are still obviously questions around making sure that the source material has been properly cleared in terms of rights, and making sure that anyone who originally contributed music to the repository on which the model has been trained is appropriately compensated. But at first glance, it looks very, very interesting.

And finally, I wanted to talk a little bit about experience management software provider Qualtrics. These guys have just announced a new pricing model, which I believe is in effect now, but they’re really doing sort of a slow rollout of it. The interesting thing about them is they’re shifting from a basic seat license model to more of a consumption-based pricing model. So basically, it takes into account how much a customer uses, and that sets the rate for what they pay.

This is really interesting because they have three different product areas. XM for Customer Frontlines is really sort of that customer experience software that can be used in contact centers, physical locations, and across digital channels. Pricing there is based on the number of interactions that are occurring through the platform, which is interesting because it makes sure that, essentially, if you’re using it more, you’re paying more, but you’re not paying for licenses that you don’t use.

Another major platform that they offer is XM for Strategy and Research. And again, this is really software to do things like surveys, conversational surveys; you can really generate insights from this. And again, this uses interaction-based pricing. And then the third one is XM for People Teams. Now, this one’s a little bit different in that you do have pricing based on the number of employees, but I think included within each license is a certain number of interactions for continuous employee listening. So if they’re trying to collect feedback from employees, it’s not like you’re just getting priced on the number of people. You’re actually able, on an ongoing basis, to listen to what employees have to say in terms of providing feedback as part of employee experience programs.

So with all of these, AI is included within the packages. And again, the really interesting overall comment I’d make is that this kind of falls in line with some of the trends that I was talking about back in January, about this sort of shift to a more consumption-based pricing model that we’re going to see across applications moving forward. Because really, enterprises are looking to tie usage to pricing, because it makes more sense to make sure they’re not overspending for seat licenses that they’re not using.
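The difference between seat-based and consumption-based pricing can be sketched with some back-of-the-envelope math; all of the numbers below are illustrative, not Qualtrics’ actual rates.

```python
def seat_license_cost(seats, price_per_seat):
    # Traditional model: you pay for every licensed seat, used or not.
    return seats * price_per_seat

def consumption_cost(interactions, price_per_interaction, included=0):
    # Consumption model: billing tracks actual usage above any included allotment.
    return max(0, interactions - included) * price_per_interaction

# Illustrative numbers only.
print(seat_license_cost(500, 30))       # 15000, even if half the seats sit idle
print(consumption_cost(120_000, 0.10))  # 12000.0, tied directly to usage
```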

And it will help them also forecast a little bit better for the future based on actual usage of the product, as opposed to worrying about, “Well, is my head count going up or down because of AI?” They’re actually going to be able to tie those metrics to the health of their business and the amount of interactions they have with their business, whether it’s through humans or perhaps through automated bots or a combination thereof. So really interesting to see this news, and I’m certainly going to continue to keep an eye on them and see how that looks moving forward.

Okay. And finally, we’re going to wrap up today with, of course, our Rant Or Rave section. And today, I actually have a rave, and it revolves around the issue that we were talking about earlier with respect to generative AI. I just published a research note on this focusing on Google’s Gemini AI service. Now, basically they were a company that got hit with a lot of negative feedback because their service essentially didn’t do a great job of incorporating the correct guardrails when it came to not overcorrecting for toxicity, bias, all of that kind of stuff. So essentially, they got hit because people would request images of, say, the Pope, and the image that came back was not historically accurate, all of that kind of stuff.

Certainly, there was a lot of news made about this in the press. There were obviously accusations that Google was trying to deliberately change the responses in their generative AI to sort of support a certain worldview in terms of their own particular views, what have you. I don’t know whether any of that is true or not. My sense is probably that they just got it wrong in terms of the model tuning. This stuff is not easy. It is not easy. So why am I raving, though? Because Google’s CEO, Sundar Pichai, actually acknowledged in a letter, somebody must’ve leaked it to the press, and basically he offered up a mea culpa. He said, “I know that some of the responses have offended our users and shown bias. To be clear, that’s completely unacceptable and we got it wrong.”

It’s really refreshing to hear the CEO, particularly the CEO of Google, just come right out and flatly say, “We got it wrong.” He didn’t try to hem and haw, saying, “Well, because of this or because of that.” He just said, “We got it wrong.” And I believe that is worthy of praise in this day and age, where a lot of times organizations will do everything they can to avoid taking responsibility. I think ultimately this is going to be great for Google in the long term because they’re going to be laser-focused on this moving forward. They have to be; there’s so much competition. They’re going to have to do a great job of going back, looking at their model, retuning where necessary, and making sure that this doesn’t happen again. And I think when it comes down to Google’s end customers, it’s a win for them as well, because they’re going to have the added confidence of knowing that this is clearly in the crosshairs, or in the focus, of Google from the top on down. So again, kudos to Google for that.

So anyway, that is all the time I have today. I’d like to thank everyone for joining me here on Enterprising Insights. And I’ll be back again next week with another episode focused on the happenings within the enterprise application market. So thanks for tuning in and be sure to subscribe, rate and review the podcast on your preferred platform. Thanks, and we’ll see you next time.

Author Information

Keith has over 25 years of experience in research, marketing, and consulting-based fields.

He has authored in-depth reports and market forecast studies covering artificial intelligence, biometrics, data analytics, robotics, high performance computing, and quantum computing, with a specific focus on the use of these technologies within large enterprise organizations and SMBs. He has also established strong working relationships with the international technology vendor community and is a frequent speaker at industry conferences and events.

In his career as a financial and technology journalist he has written for national and trade publications, including BusinessWeek, Investment Dealers’ Digest, The Red Herring, The Communications of the ACM, and Mobile Computing & Communications, among others.

He is a member of the Association of Independent Information Professionals (AIIP).

Keith holds dual Bachelor of Arts degrees in Magazine Journalism and Sociology from Syracuse University.

