AI has poisoned its own well
Categories
Featured
Technology
The Internet
Post author By Tracy Durnell
Post date June 18, 2023
7 Comments on AI has poisoned its own well
Replied to
The Curse of Recursion: Training on Generated Data Makes Models Forget
arXiv.org
What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
I suspect tech companies (particularly Microsoft / OpenAI and Google) have miscalculated, and in their fear of being left behind, have released their generative AI models too early and too widely. By doing so, they've essentially capped the maximum improvement of their products through the threat of model collapse. I don't think the quality that generative AI can reach on a poisoned data supply will be good enough to get rid of all us plebs.
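The collapse mechanism the paper describes can be sketched with a toy simulation. This is not the paper's experiment, just an illustrative assumption: stand in for "the internet" with a simple Gaussian, and for "an LLM" with a maximum-likelihood fit to samples. Because each generation trains only on the previous generation's output, estimation error compounds and the tails of the distribution shrink away.

```python
import numpy as np

def collapse_demo(n_samples=20, generations=500, seed=0):
    """Toy model collapse: repeatedly fit a Gaussian to samples drawn
    from the previous generation's fitted Gaussian. The ML variance
    estimate is biased low by a factor of (n-1)/n, so the fitted
    distribution's spread (its 'tails') decays generation over
    generation instead of staying put."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original "human" data distribution
    stds = [sigma]
    for _ in range(generations):
        data = rng.normal(mu, sigma, n_samples)  # train on generated content
        mu, sigma = data.mean(), data.std()      # refit (ddof=0, biased low)
        stds.append(sigma)
    return stds

stds = collapse_demo()
print("std at generation 0:", stds[0])
print("std at final generation:", stds[-1])
```

The small sample size and generation count here are arbitrary choices to make the decay visible quickly; the point is only that a model trained on its own output loses the rare, tail-end material first, which is exactly the content that made the original data valuable.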
They need an astronomical amount of training data to make any model better than what already exists. By releasing their models for public use now, while they're not very good yet, they've let too many people pump the internet full of mediocre generated content with no indication of provenance. Stack Overflow has thrown up its hands and said it can't moderate generative AI content, meaning the site can no longer serve as a training source for coding material. Publishers of formerly reputable sites are laying off staff and experimenting with AI-generated articles. There is no consistent system for marking up generated content online that would let companies trust material of unknown origin as training data. Because of this approach, 2022 and 2023 will be essentially "lost years" of internet-sourced content, even if a tagging system is established going forward, and even if people hostile or ambivalent to these companies can be persuaded to use it.
In their haste to propagandize the benefits of generative AI and encourage adoption so widespread it can't be stopped, they're already encouraging people to lean on LLMs to write for them. Microsoft is plugging AI tools into its flagship Office suite. Writing well is a skill that few possess today, and these companies are creating an environment where even fewer people will bother to learn and practice professional writing. As time goes by, there will be less human-created material (especially of quality and complexity) available as new training data.
Obtaining quality training data is going to be very expensive in five years if the AI companies don't win all their lawsuits over whether training on scraped data is fair use. By allowing us a glimpse into their vision and process, they've turned nearly every professional artist and writer against them as an existential threat. Even if they do win their fair use lawsuits, accessing the data may still be a challenge; every creative person who relies on their work for pay will do everything they can to keep their creations from becoming future training data.
Even worse, these companies' misuse of the internet commons, of humanity's collective creativity, as fuel for their own profit could lead to fragmentation and the closing off of online information to prevent its theft. Bloggers don't want their words stolen, and social media companies are getting wise to the value of "their" data and beginning to charge for API access. The difficulty and cost of gathering sufficient high-quality training data for future models will incentivize continued use of whatever is easiest to grab, only hastening model collapse and increasing the likelihood of malicious actors perpetrating poisoning attacks.
Tags AI , commons , creative work , data , LLM , scientific paper , writing
By Tracy Durnell
Writer and designer in the Seattle area. Freelance sustainability consultant. Reach me at tracy.durnell@gmail.com. She/her.
7 replies on "AI has poisoned its own well"
says:
June 20, 2023 at 4:21 am
Really appreciate this perspective. Thanks!
says:
June 20, 2023 at 6:42 am
We need our regulators and legislators working on this now. Provide protection for content creators so that AI cannot train without conforming to some rule. Creative Commons contemplated this back in 2021.
Tracy Durnell
says:
June 20, 2023 at 6:58 pm
Agreed, our current legislation isn't up to today's needs!
AI has poisoned its own well - Smeargle Fans
says:
lemmy.smeargle.fans
June 20, 2023 at 11:05 pm
This Article was mentioned on lemmy.smeargle.fans
says:
June 21, 2023 at 6:51 am
Thank you for this perspective. It is quite thought-provoking indeed.
says:
July 3, 2023 at 7:54 am
I think this is a fascinating and interesting take, and it echoes what I have been thinking (so thank you for writing something way more eloquently than I ever could!).
Six Links Worthy of Your Attention #683 - Six Pixels of Separation
says:
July 29, 2023 at 3:01 am
[…] AI has poisoned its own well - Tracy Durnell. "When an AI consumes data it hallucinates, it gets dumb, fast. AI has a malnutrition problem, which is a major challenge for the growth of generative AI. There aren't good solutions (sidenote: I'm actually working on a proposal for one; more details soon, hopefully.) But it's an issue Tracy Durnell explains perfectly in this blog post." (Alistair for Hugh). […]