AI has poisoned its own well

Categories: Featured, Technology, The Internet

By Tracy Durnell

June 18, 2023

Replied to: The Curse of Recursion: Training on Generated Data Makes Models Forget (arXiv.org)

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

I suspect tech companies (particularly Microsoft/OpenAI and Google) have miscalculated: in their fear of being left behind, they have released their generative AI models too early and too widely. By doing so, they’ve essentially capped the maximum improvement of their products, thanks to the threat of model collapse. I don’t think the quality that generative AI will be able to reach on a poisoned data supply will be good enough to get rid of all us plebs.
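
To make the mechanism concrete, here’s a minimal sketch of model collapse in the simplest setting the paper analyzes, a single Gaussian: each generation is “trained” only on samples drawn from the previous generation’s model. This is an illustration of the paper’s idea in Python, not code from the paper; the sample size and generation count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
# Generation 0: genuine "human" data from the original distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=200)
for generation in range(1, 21):
    # "Train": fit the model by estimating the mean and standard deviation.
    mu, sigma = data.mean(), data.std()
    # "Publish and re-scrape": the next generation sees only model output.
    data = rng.normal(loc=mu, scale=sigma, size=200)
    print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
# Estimation error compounds across generations: sigma tends to drift
# downward, so rare tail events stop being represented at all. That is the
# abstract's "tails of the original content distribution disappear."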

They need an astronomical amount of training data to make any model better than what already exists. Because they released their models for public use now, while the models are not yet very good, too many people have pumped the internet full of mediocre generated content with no indication of provenance. Stack Overflow has thrown up its hands and said it can’t moderate generative AI content, meaning the site can no longer serve as a training source for coding material. Publishers of formerly reputable sites are laying off their staff and experimenting with AI-generated articles. And there is no consistent system for marking up generated content online that would allow companies to trust material of unknown origin as training data. Because of this approach, 2022 and 2023 will be essentially “lost years” of internet-sourced content, even if companies can establish a tagging system going forward and persuade people who are hostile or ambivalent toward them to use it.
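
To show why untagged content is the sticking point, here’s a deliberately hypothetical Python sketch of the provenance filter such a system would enable. The “generator” field and its values are invented for illustration; no widely adopted standard like this exists today, which is exactly the problem.

documents = [
    {"text": "Hand-written tutorial from 2019.", "generator": "human"},
    {"text": "Auto-generated listicle from 2023.", "generator": "llm"},
    {"text": "Scraped page, origin unknown.", "generator": None},
]

def usable_for_training(doc):
    # Without a trusted, widely adopted tag, untagged post-2022 content
    # cannot be assumed human-made, so it has to be discarded too.
    return doc["generator"] == "human"

training_set = [d for d in documents if usable_for_training(d)]
print(len(training_set))  # 1 of 3 documents survives the filter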

In their haste to propagandize the benefits of generative AI and encourage adoption so widespread it can’t be stopped, they’re already encouraging people to lean on LLMs to write for them. Microsoft is plugging AI tools into its flagship Office suite. Writing well is a skill few possess today, and these companies are creating an environment where even fewer people will bother to learn and practice professional writing. As time goes by, less human-created material, especially material of quality and complexity, will be available as new training data.

Obtaining quality training data is going to be very expensive in five years if AI companies don’t win all their lawsuits over whether training on scraped work is fair use. By allowing us a glimpse into their vision and process, they’ve turned nearly every professional artist and writer against them as an existential threat. Even if they do win their fair use lawsuits, accessing the data may still be a challenge; every creative person who relies on their work for pay will do everything they can to keep their creations from becoming future training data.

Even worse, these companies’ misuse of the internet commons, of humanity’s collective creativity, as fuel for their own profit could lead to fragmentation and the closing off of online information to prevent its theft. Bloggers don’t want their words stolen, and social media companies are getting wise to the value of “their” data and beginning to charge for API access. The difficulty and cost of gathering sufficient high-quality training data for future models will incentivize continued use of whatever is easiest to grab, only hastening model collapse and increasing the likelihood of malicious actors perpetrating poisoning attacks.

Tags: AI, commons, creative work, data, LLM, scientific paper, writing

By Tracy Durnell

Writer and designer in the Seattle area. Freelance sustainability consultant. Reach me at tracy.durnell@gmail.com. She/her.

7 replies on “AI has poisoned its own well”

June 20, 2023 at 4:21 am:
Really appreciate this perspective. Thanks!

June 20, 2023 at 6:42 am:
We need our regulators and legislators working on this now. Provide protection for content creators so that AI cannot train on their work without conforming to some rule. Creative Commons contemplated this back in 2021.

Tracy Durnell, June 20, 2023 at 6:58 pm:
Agreed, our current legislation isn’t up to today’s needs!

AI has poisoned its own well - Smeargle Fans (lemmy.smeargle.fans), June 20, 2023 at 11:05 pm:
This article was mentioned on lemmy.smeargle.fans.

June 21, 2023 at 6:51 am:
Thank you for this perspective. It is quite thought-provoking indeed.

July 3, 2023 at 7:54 am:
I think this is a fascinating take, and it echoes what I have been thinking (so thank you for writing it far more eloquently than I ever could!).

Six Links Worthy of Your Attention #683 - Six Pixels of Separation, July 29, 2023 at 3:01 am:
[…] AI has poisoned its own well – Tracy Durnell. “When an AI consumes data it hallucinates, it gets dumb, fast. AI has a malnutrition problem, which is a major challenge for the growth of generative AI. There aren’t good solutions (sidenote: I’m actually working on a proposal for one; more details soon, hopefully.) But it’s an issue Tracy Durnell explains perfectly in this blog post.” (Alistair for Hugh). […]
