Womble Perspectives

The Essential Role of Human Input in Sustaining AI Models

Womble Bond Dickinson


About the author
Dr. Christian E. Mammen

Welcome to Womble Perspectives, where we explore a wide range of topics, from the latest legal updates to industry trends to the business of law. Our team of lawyers, professionals, and occasional outside guests will take you through the most pressing issues facing businesses today and provide practical, actionable advice to help you navigate the ever-changing legal landscape.

With a focus on innovation, collaboration, and client service, we are committed to delivering exceptional value to our clients and to the communities we serve. And now, our latest episode.

By now, we all know that Generative AI models are trained on massive amounts of data, much of which is collected from the Internet, as well as other sources.

This AI is also generating tons of new content, much of which ends up online. But what happens when this AI-generated content is used again to train the AI models?

It might seem like no big deal, right? Maybe the models get better because the AI outputs are so fluent. After all, a lot of AI-generated content looks pretty similar to human-made content, at least to our eyes. Plus, there's still more human-created content out there than AI-generated content, so the impact might seem minimal.

However, new research led by Oxford researcher Ilya Shumailov has produced some concerning findings. Even a small amount of AI-generated data in the training set can cause what's known as model collapse. This means the AI starts to "forget" the characteristics of its original training data, producing outputs that grow increasingly nonsensical. The exact reason for this loss of coherence isn't fully understood; one explanation is that small statistical anomalies are amplified each time the data set is reused, so rare features of the original data gradually disappear. It's worth mentioning that other researchers have debated whether model collapse is inevitable.
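To make that feedback loop concrete, here is a toy sketch in Python. It is not taken from the research itself, and it is a drastic simplification: each "generation" fits a simple Gaussian model to samples drawn from the previous generation's model, standing in for a model trained only on the prior model's output. Over many rounds, the spread of the data tends to shrink as rare values in the tails are lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: stand-in for "human" data with a broad spread of values.
n_samples = 100
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for generation in range(1, 1001):
    # "Train" a very simple model on the current data: here the model is
    # just a Gaussian with the fitted mean and standard deviation.
    mu, sigma = data.mean(), data.std()

    # The next generation is trained only on content sampled from the
    # previous generation's model, mimicking AI output flowing back
    # into the training set.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)

    if generation % 100 == 0:
        print(f"generation {generation:4d}: mean={mu:+.4f}, std={sigma:.4f}")

# The standard deviation tends to drift toward zero: rare values in the
# tails are under-sampled each round, so the fitted model gradually
# "forgets" the diversity of the original data.
```

Real language models are vastly more complicated than a fitted Gaussian, but the basic dynamic is the same one the research describes: when a model learns only from another model's output, sampling errors compound from one generation to the next.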

Despite this debate, it comes as little surprise that OpenAI supports California's proposed AB 3211, which would require watermarking of AI-generated content. In addition to performing an intended consumer-notice function, such watermarks would give online data-scraping tools a practical way to identify and exclude LLM-generated content, at scale, as they collect new training data.
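As a purely illustrative sketch of how that exclusion step could look in a scraping pipeline, assume each collected document carries provenance metadata with an AI-generated flag. The field names and helper functions below are hypothetical and are not drawn from AB 3211, OpenAI, or any particular watermarking standard.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    text: str
    # Hypothetical provenance metadata, e.g. parsed from an embedded
    # content-credentials manifest; the field name is illustrative only.
    metadata: dict = field(default_factory=dict)


def is_ai_generated(doc: Document) -> bool:
    """Illustrative check for a hypothetical AI-provenance watermark flag."""
    return bool(doc.metadata.get("ai_generated", False))


def filter_training_corpus(docs: list[Document]) -> list[Document]:
    """Keep only documents that do not carry the AI-generated flag."""
    return [doc for doc in docs if not is_ai_generated(doc)]


# Example usage with toy documents.
corpus = [
    Document("https://example.com/human-essay", "An essay...", {"ai_generated": False}),
    Document("https://example.com/llm-output", "A generated post...", {"ai_generated": True}),
]
print([doc.url for doc in filter_training_corpus(corpus)])
# -> ['https://example.com/human-essay']
```

The point of the sketch is simply that a reliable, machine-readable watermark turns "exclude AI-generated content" into a cheap metadata check rather than a hard detection problem.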

Thank you for listening to Womble Perspectives. If you want to learn more about the topics discussed in this episode, please visit the show notes, where you can find links to related resources mentioned today. The show notes also have more information about our attorneys who provided today's insights, including ways to reach out to them.

Don't forget to subscribe via your podcast player of choice so that you never miss an episode. Thank you again for listening.