Watermarking & Other Strategies for Licensing AI Training Data & Combating Malicious AI Generated Content | The AI Moment – Episode 10

On this episode of The AI Moment, we discuss an emerging Gen AI trend: Watermarking & other strategies for licensing AI training data & combating malicious AI generated content. As we move into year two of Generative AI, some themes have emerged in terms of the downsides to the technology.

Two of the biggest downsides have been:

  • Combating malicious or misleading AI-generated content, and
  • Copyright/IP rights for both non AI generated and AI generated content

Initiatives by Google, Fox and Polygon, the Content Authenticity Initiative, and academic researchers, focused primarily on digital watermarking, are the latest and most prominent attempts to address these issues. What will the impact of this trend be on enabling or stifling gen AI?

Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this webcast.

Analysis and opinions expressed herein are specific to the analyst individually, based on data and other information that might have been provided for validation, and are not those of The Futurum Group as a whole.


Mark Beccue: Hi, I am Mark Beccue, Research Director for AI with The Futurum Group. Welcome to The AI Moment. It’s our weekly podcast that explores the latest developments in enterprise AI, and we are literally having a moment. The pace of change and innovation in AI is unprecedented. I’ve been covering AI since 2016. I’ve never seen anything like what we’ve experienced since ChatGPT was launched late in 2022 and kick-started the generative AI era. With The AI Moment podcast, we try to distill this mountain of information, separate the real from the hype, and provide you with sure-handed analysis about where the AI market is headed. We deep dive into the latest trends and technologies that are shaping the AI landscape, and we’ll cover things from analyzing the latest developments in technology to parsing the mutating vendor landscape, also AI regulations, ethics, risk management, and a lot more.

Usually it’s about 30 minutes, usually a little less. It’s typically made up of three or four segments, not always all of them, sometimes just one of them. Today, we’re going to do what I call the key trends segment of generative AI. The title today is Watermarking and Other Strategies for Licensing AI Training Data and Combating Malicious AI Generated Content, or in other words, fake news. There are two issues that have really sprung to the forefront as we keep moving through 2023 and into 2024. One, there’s this idea around malicious AI generated content: deep fakes, misinformation, fake news, that kind of thing. The second issue that we’re dealing with, and they’re related, is around copyrighted content and the rights of the copyright holders in that content, and the data that they generate being used for AI training. Best known right now is the New York Times lawsuit against OpenAI.

Let’s talk about those issues for a little bit. What I’ve got is a few examples of where we’ve been, where we’re headed, and maybe what’s going to happen next. The issues. There’s a lot more worry right now, particularly around deep fakes and fake news and misinformation, because in 2024 there are 50 national elections that are going to happen across the globe. Media companies in particular are working to look at this. I will say it’s not necessarily out of the goodness of their heart. There are a lot of liability issues around this as well. They’re not necessarily being forced by their governments to do anything. They are taking action, and I think it’s because of the risk they face of possibly being sued or other awful things like that.

The second issue is, if you think about where we first started with generative AI, the real big use cases were around image generation, text generation and assistance. The outputs that they give come from models trained, a lot of times, on copyrighted data from content producers. There’s this evolving thinking, and there were some original thoughts, around how you protect that copyrighted data, that copyrighted content, whether it’s blocking it out completely or receiving compensation for it. That’s where we’re going. Let’s talk real quick about what’s happened over time. The issues are related and I’m going to talk about them interspersed, but here’s what happened. When all of this started to turn up in the middle of last year, the media did a few things.

There were three things that happened. The media started through a negotiation process, and the first one to start that process was News Corporation. Around the June timeframe, it came to light that they were negotiating with some of the AI vendors, they didn’t specifically say who, about compensation for using their data to train AI models. Another one, Axel Springer, came up around the same time. That’s one way to do this: media companies grant permission to use that content and license it, at least the media content that they control. It doesn’t stop the deep faking, but it does help with the training. The second part has to do with lawsuits, and there have been a few. Sarah Silverman, as a content producer, has sued, some others who aren’t as well known have sued collectively in a class action lawsuit, and then the most famous one right now is the New York Times.

You can sue and see if you can either force the issue to get compensation when models are trained on your data, or keep those AI companies from training on the data. You’ve got negotiations and lawsuits as two things, but really what we’re going to talk about mostly today are technical solutions. There’s been an evolving idea around technical solutions. A germ of an idea started with what was called the News Provenance Project. It was launched by IBM and the New York Times. That was an early attempt to fight disinformation and misinformation by using a blockchain-based digital watermark. That group launched that initiative in July 2019, but since then it’s gone dormant. I haven’t seen any more news about them lately, and I think it might’ve morphed into what I’m going to talk about next, which is the Content Authenticity Initiative.

This is, again, a movement or initiative or group that was initiated by Adobe. They use a similar idea. They use cryptographic asset hashing. It’s a digital watermark for images and metadata, and they have a lot of things going on. One thing they’re working on now is standards around that. What’s interesting there is the members include a lot of big media companies: AP, BBC, Axel Springer, Gannett, the New York Times, I keep hearing that name, Reuters, Wall Street Journal, Washington Post. There are also these image companies. They’re not known as AI vendors, but they sort of are. There’s Adobe, which has created its own products like Firefly for creating images. Adobe is obviously leading it, but in there with them are Getty Images and Shutterstock. There’s nothing so far in that initiative around digital watermarking that addresses text. Set that aside.
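To make the idea of cryptographic asset hashing concrete, here is a minimal sketch of binding metadata to an image by hashing the image bytes and signing the result. It is not the Content Authenticity Initiative's actual format or API. The function names, the manifest layout, and the shared `SIGNING_KEY` are all assumptions made for illustration, and a real provenance system would use public-key certificates rather than a shared secret.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret, used only to keep the sketch self-contained.
# A real provenance system would sign with a private key and certificate.
SIGNING_KEY = b"demo-key-not-for-production"


def make_manifest(asset_bytes: bytes, metadata: dict) -> dict:
    """Bind metadata to an asset by hashing the asset and signing both."""
    claim = {
        "asset_sha256": hashlib.sha256(asset_bytes).hexdigest(),
        "metadata": metadata,
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(asset_bytes: bytes, manifest: dict) -> bool:
    """Return True only if the signature and the asset hash both match."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the claim (hash or metadata) was altered after signing
    actual_hash = hashlib.sha256(asset_bytes).hexdigest()
    return manifest["claim"]["asset_sha256"] == actual_hash


# Usage: placeholder bytes standing in for a real image file.
image = b"placeholder image bytes"
manifest = make_manifest(image, {"creator": "Example Newsroom", "tool": "camera"})
```

In this sketch, changing a single byte of the image or a single field of the metadata causes `verify_manifest` to return `False`, which is the property that lets a viewer trust where an asset came from.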

Those were started last year, early… The News Provenance Project was old. Last year, you got the CAI. Moving forward to this year, really in the last few weeks, there have been a couple of movements we’re going to talk about. These developments are emerging. The first one is Google. Google has some stuff going on that’s interesting. In December, it was the 19th of December, I wrote a research note about this. Susan Jasper is the vice president of trust and safety solutions at Google. She published a blog post that outlines what Google’s got going on to, and this is how they phrase it, combat misuse and abuse of Google products and services, particularly those leveraging generative AI. They noted in their piece that this was really urgent to them because of the elections coming up.

Here’s what they’ve got going on. This is interesting to me. They said that with the recent advances in their LLMs, Google has been experimenting with building what they called faster, more adaptable enforcement systems. They have longstanding policies and software that they use for content filtering and for identifying and removing content. They said they were already using AI to do that before, with AI classifiers, but now they’re using generative AI to make that a better process. That was the first thing to mention that I thought was cool. Also, they are prioritizing their testing for safety risks. It’s everything from cybersecurity to misinformation and fairness of these generative AI products. They mentioned specifically Bard and the Search Generative Experience, which is their project for next-generation web search. They said they’re prioritizing safety tests for those. Then… This one’s non-AI… The biggest news they shared that day was that they are restricting the types of election-related queries that Bard and SGE, the Search Generative Experience, could return responses on.

In other words, they’re keeping those two generative AI tools from answering questions about the elections, which is interesting. It seems like a heavy-handed thing, but I think it makes sense given that they really can’t control these things as well as they’d like to, so why not just turn it off? It was a mature thing to say and do. A few other things they’re doing, some are related to AI and some are not. They require election advertisers to disclose when their ads include realistic synthetic content that’s been either digitally altered or generated, including by AI tools. That’s interesting.

All right, so next, YouTube is requiring creators to disclose when they’ve created realistic altered or synthetic content, and they have to display a label that indicates to people watching that the content is synthetic. That was interesting. They really did a lot in this announcement. A few other things. They have a feature called About this result, which is in SGE, the Search Generative Experience. What it does is provide context about the result, and they do it in Bard as well. It enables people to evaluate whether there’s content across the web that substantiates Bard’s English-language response. It’s interesting. They have About this image, which is similar, in Search, and it helps people understand the credibility and context of the images they see.

The last piece they have, they actually did announce this one back in early… It was late fourth quarter. It’s called SynthID. This was interesting. It’s in beta and it’s a piece that’s being offered by Google DeepMind that is a digital watermarking tool. It embeds these digital watermarks into AI-generated images. Hold that because we’re going to talk about that in the next part. When you hear all of those pieces that Google’s doing, I thought there were some interesting pieces, and I’ll talk about these individually instead of collectively. What that meant to me is, here’s Google, they’ve got a lot at risk. They’re the world’s largest search engine, they are an AI leader, and they’re in a position to have a very big impact on these election cycles for 2024. Taking Bard and SGE out of this election equation was really a leadership move, which I mentioned a little earlier. I think that decision was not made easily if you understand how much political advertising means to most companies. I found some estimates that political digital advertising in the US alone will reach about $1.2 billion in 2024. Those products that Google has aren’t going to be responsible for all of that, but they would be missing a portion of it. Google’s willing to forgo that to ensure safety.

The idea that these products could really spread disinformation or misinformation is a bigger issue for them and outweighed the revenue upside. One thing from that, I hope that other players in that space will think about that as well and maybe make some moves to follow Google’s leadership of just blocking these tools for now until we figure this out. The other part to this was thinking about how they are looking at tracing content and this range of strategies to identify it. You had the ad disclosures, the content labels, About this image, and SynthID. It’s interesting that they are thinking about these things, and it’s playing into a trend I think we’re going to see going forward. Whether that costs money or not, we’ll see. Google made some nice, interesting moves, and that’s a pretty comprehensive take on what they’re doing.

The second one… We’ve got two others to talk about. This was a very recent announcement, actually this week, that Fox Corporation made with… Actually, the announcement came from a small company called Polygon Labs, and it was on January 9th that Polygon Labs announced that Fox was going to publicly release a beta version of this thing they call Verify. I caution you that a lot of people use this word, verify, as the name of a product. There’s actually a product or a mechanism from the CAI called Verify as well. It’s not that. This is Polygon and Fox’s Verify. What it is, it’s an open source protocol that’s meant to establish the history and the origin of registered media, and it’s built on a protocol from Polygon. The details of that were interesting. What they’re saying is that publishers can register content on Verify in order to prove its origin, and individual pieces of content are cryptographically signed on chain. That’s wording that Polygon is using, but basically it’s blockchain technology like we heard about from both CAI and from Google. Well, Google didn’t say that actually, but blockchain. It allows consumers to identify content from trusted sources using this tool.

Again, I look at this and say, okay, I can verify that it’s from a legitimate source and not created by someone not saying who they are. The beta for this was launched by Fox in August, around the first Fox News GOP debate, and they posted some statistics, which I thought was interesting. Since that day, they’ve had 89,000 pieces of content, spanning text and images, signed to Verify. They came from different sources: Fox News, Fox Business, Fox Sports, and Fox TV affiliates. That was just a neat thing that they shared, how much has been signed so far. It’s now going to be open-sourced, which is an interesting idea as well. It was developed in-house by Fox Technology, and it’s built on this protocol by Polygon. That’s the basics of that.
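As a rough illustration of registering content so that its origin can be checked later, the sketch below keeps a toy append-only ledger in which every entry commits to the one before it. It is not the Verify protocol or anything from Polygon. The `Registry` class and its method names are hypothetical, and it uses plain SHA-256 hashing where a real system would add publisher signatures and a distributed chain.

```python
import hashlib
import json
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Registry:
    """Toy append-only ledger: each entry commits to the previous entry."""
    entries: list = field(default_factory=list)

    def register(self, publisher: str, content: bytes) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        record = {
            "publisher": publisher,
            "content_hash": sha256_hex(content),
            "prev_hash": prev_hash,
        }
        # The entry hash covers the publisher, the content hash, and the link back.
        record["entry_hash"] = sha256_hex(json.dumps(record, sort_keys=True).encode())
        self.entries.append(record)
        return record

    def lookup(self, content: bytes):
        """Return the publisher that registered this exact content, or None."""
        digest = sha256_hex(content)
        for entry in self.entries:
            if entry["content_hash"] == digest:
                return entry["publisher"]
        return None

    def chain_is_intact(self) -> bool:
        """Recompute every hash and link to detect edits to past entries."""
        prev_hash = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash:
                return False
            if sha256_hex(json.dumps(body, sort_keys=True).encode()) != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True


# Usage with made-up publishers and content.
registry = Registry()
registry.register("Example News", b"Original article text")
registry.register("Example Sports", b"Match report")
```

A consumer-side check is then a lookup: content that matches a registered hash resolves to its publisher, and anything altered or never registered resolves to nothing.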

It also establishes a way for these media companies to work with LLMs and other AI platforms. They’ve got a thing called a Verified Access Point. That creates new commercial opportunities for content owners to build smart contracts that set up what they call programmatic conditions to access content. Translation: this is a way to license your digital content to LLMs. My breakdown on that was that this is, like we said, the latest move of anyone as far as this goes. You’re seeing these trends line up as similar, and I looked at it from two angles and broke it down: how good it will be at combating malicious content, and what it means for the licensing piece.
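One way to picture "programmatic conditions to access content" is as a small rule check that runs before content is handed over. The sketch below is ordinary Python, not a smart contract and not the Verified Access Point. The `LicenseTerms` fields, the `may_access` function, and the example licensee are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical conditions a publisher might attach to its content."""
    allowed_uses: frozenset   # for example {"training", "realtime"}
    licensees: frozenset      # parties that have signed a deal
    expires: date             # last day the license is valid


def may_access(terms: LicenseTerms, licensee: str, use: str, today: date) -> bool:
    """Grant access only when every condition in the terms is met."""
    return (
        licensee in terms.licensees
        and use in terms.allowed_uses
        and today <= terms.expires
    )


# Usage: a made-up deal that covers model training through the end of 2024.
terms = LicenseTerms(
    allowed_uses=frozenset({"training"}),
    licensees=frozenset({"ExampleAI"}),
    expires=date(2024, 12, 31),
)
```

The point of encoding terms this way is that the same check can be applied automatically to every request, whether it is for model training or a real-time use case.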

On the malicious AI generated content piece, this is in addition to content filtering. Filtering has been out there for a bit, but content filtering doesn’t necessarily discern whether content is AI generated or not. It just looks for misused or inappropriate content. It’s different in that case. When you look at the drivers for Fox, I mentioned this up at the top, I think this is important: it could be a liability. Having malicious content that’s related, or possibly related, to these media companies is a liability for them, and they could get sued. That’s one driver for Fox and these media companies to do it. The second might be that Fox builds ancillary products around Verify that they could sell. Part of it’s open source and other media companies can use it, but maybe they build some tooling around Verify and make some money selling this kind of technology to other media companies. That’s one thing.

The bigger part is back to this copyright piece that they talked about, this licensing. I think this is a bigger deal. One thing I found interesting is that TechCrunch interviewed the CTO of Fox, and when she was talking to them about Verify, she said, “Verify is also a technical on-ramp for AI platforms to license publisher content with encoded controls via smart contracts for LLM training or real-time use cases. We’re in discussion with several media companies and expect to be able to share more soon on that front.” They’re going to sell this to those media companies, not for… The bigger idea would be to license content and protect content. I think that’s interesting, and we’ll see what moves forward with that whole piece. Now, the last one we’re going to talk about is totally different in a way, because it is not a commercial product. It’s actually a research breakthrough.

On January 3rd, a group of Chinese researchers from universities, most of them based in China and one of Chinese descent who is based here in the US, published a paper called Text Watermarking. It was brought to my attention by Raphaël Mansuy, the CTO of Elitizon. He’s one of the people I watch who writes about these things. There’s a bit from his synopsis I’d like to tell you about. He asks, what is text watermarking and why does it matter? In simple terms, text watermarking is subtly altering text to embed a digital watermark that contains identifiable information. It works the same as it does for image or video files. It allows the text’s origin, authorship, or intended use to be tracked. At the same time, text readability is preserved. Obviously, it has powerful implications when we’re looking at things like copyright protection, source tracking, which we’ve talked about, securing sensitive data, and all those other pieces that we think are drivers for why these things are moving forward.

How it works. He gave a synopsis about that, and I’ll give you a real quick look. There are two approaches: you can do watermarking for existing text and you can do watermarking for AI models. For existing text, you insert a watermark by making very subtle modifications. It’s a format change. They do a little bit of alternate spacing or Unicode code points, and a couple of other things. They say you don’t have to start from scratch. You could go back to content that’s already there and watermark it. The second one he talked about is watermarking for AI models. He said this means two things. You can do data set modification, where you add specifically watermarked data samples, and you can do generative process alteration, which is biasing model outputs during text generation. The idea here is that a watermarked model will naturally output watermarked text.
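The first approach, a format change to existing text, can be sketched with invisible Unicode characters. The example below hides a short tag in a string as zero-width characters. It is a toy illustration, not the method from the paper: the `embed`, `extract`, and `strip_mark` names and the choice of these two code points are assumptions, and a watermark like this is easy to remove, so it shows the idea rather than a robust scheme. It covers only the existing-text approach, not watermarking inside a model.

```python
ZERO = "\u200b"  # zero-width space, used here to encode a 0 bit
ONE = "\u200c"   # zero-width non-joiner, used here to encode a 1 bit


def embed(text: str, tag: str) -> str:
    """Hide `tag` as invisible characters right after the first word."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    hidden = "".join(ONE if bit == "1" else ZERO for bit in bits)
    head, sep, tail = text.partition(" ")
    return head + hidden + sep + tail


def extract(text: str) -> str:
    """Recover a hidden tag, or return an empty string if none is present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore any incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def strip_mark(text: str) -> str:
    """Remove the invisible characters, leaving the visible text unchanged."""
    return text.replace(ZERO, "").replace(ONE, "")


# Usage with a made-up publisher tag.
original = "Breaking news from the newsroom today"
marked = embed(original, "pub-42")
```

The marked string reads the same on screen as the original, which is the readability property described above, while `extract(marked)` recovers the tag.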

Take that for what it is. Let’s say that content owners now have these tools for themselves. They have tools, via cryptography and blockchain, for image and video digital watermarking. You have this idea, and at least a paper, that says you can do that with text as well. Where does that lead us? Back to our original premise: what may happen is that we’re going to see content owners start, on their own, to work through digital watermarking of all their content, either looking for compensation or for a way to block it from being used. It does a dual-purpose thing. It not only helps protect their IP and copyright, it also gives us a bunch of interesting things to do when we can trace data that’s being used for training.

That’s an interesting piece as well. It gives them protection over their stuff, it may give them the choice of either being compensated by those AI players that want to use that data, or the option that it doesn’t get used. Since the digital watermarking is there, they get that choice. I think the byproduct of all this is when we’re able to use those digital watermarking pieces, and for those purposes, it helps fight deep fakes and all the fake news and the misinformation and the… It’s a dual purpose, it’s really going after two things. It’s great. That’s where we are. This is going to evolve, keep moving. I’ll keep an eye on it. I think it’s going to be fascinating to see what happens because this does impact a lot of the business models and the structures for using AI going forward. It’s going to be a key piece of it, so we’ll keep watching it.

That’s our show for the week. Thank you for joining me. Be sure to subscribe or rate and review the podcast on your preferred platform. We’re on lots of different places, including YouTube. Thank you again. We’ll see you next time.

Other Insights from The Futurum Group:

2023 AI Product of the Year, AI Company of the Year | The AI Moment, Episode 9

Adults in the Generative AI Rumpus Room: The Best of 2023 | The AI Moment, Episode 8

Top AI Trends for 2024 | The AI Moment, Episode 7

Author Information

Mark comes to The Futurum Group from Omdia’s Artificial Intelligence practice, where his focus was on natural language and AI use cases.

Previously, Mark worked as a consultant and analyst providing custom and syndicated qualitative market analysis with an emphasis on mobile technology and identifying trends and opportunities for companies like Syniverse and ABI Research. He has been cited by international media outlets including CNBC, The Wall Street Journal, Bloomberg Businessweek, and CNET. Based in Tampa, Florida, Mark is a veteran market research analyst with 25 years of experience interpreting technology business and holds a Bachelor of Science from the University of Florida.

