If we all start opting out of our posts being used for training models, doesn’t that reduce the influence of our unique voice and perspectives on those models? Increasingly, the models will be everyone’s primary window into the rest of the world. It seems like the people who care the least about these things will be the ones with the most data that ends up training the models’ default behavior.
—Data Influencer
Honestly, it’s frustrating to me that internet users are forced to opt out of artificial intelligence training by default. Wouldn’t it be nice if affirmative consent were the norm for generative AI companies as they scrape the web and any other data repositories they can find to build ever-larger frontier models?
But, unfortunately, that’s not the case. Companies like OpenAI and Google argue that if fair-use access to all this data were taken away from them, none of this technology would even be possible. For now, users who don’t want to contribute to generative models are stuck with a morass of opt-out processes across different websites and social media platforms.
Even if the current bubble surrounding generative AI does pop, much like the dotcom bubble did after a few years, the models that power all of these new AI tools won’t go extinct. So, the ghosts of your niche forum posts and social media threads advocating for strongly held convictions will live on inside the software tools. You’re right that opting out means actively attempting not to be included in a potentially long-lasting piece of culture.
To address your question directly and realistically, these opt-out processes are basically futile in their current state. Those who opt out right now are still influencing the model. Let’s say you fill out a form for a social media site to not use or sell your data for AI training. Even if that platform respects that request, there are countless startups in Silicon Valley with plucky 19-year-olds who won’t think twice about scraping the data posted to that platform, even if they aren’t technically supposed to. As a general rule, you can assume that anything you’ve ever posted online has likely made it into multiple generative models.
OK, but let’s say you could realistically block your data from these systems, or demand it be removed after the fact. Would doing so lessen your voice or impact on the AI tools? I’ve been thinking about this question for a few days, and I’m still torn.
On one hand, your individual contribution is infinitesimally small relative to the vastness of the dataset, so your voice, as a nonpublic figure or author, likely isn’t nudging the model one way or another.
From this perspective your data is just another brick in the wall of a 1,000-story building. And it’s worth remembering that data collection is just the first step in creating an AI model. Researchers spend months fine-tuning the software to get the results they desire, sometimes relying on low-wage workers to label datasets and gauge the output quality for refinement. These steps may further abstract data and lessen your individual impact.
On the other hand, what if we compared this to voting in an election? Millions of votes are cast in American presidential elections, yet most citizens and defenders of democracy insist that every vote matters—with a constant refrain of “make your voice heard.” It’s not a perfect metaphor, but what if we saw our data as having a similar impact? A small whisper amid the cacophony, but still impactful on the AI model’s output.
I’m not fully convinced of this argument, but I also don’t think this perspective should be dismissed outright. Especially for subject matter experts, your distinct insights and your way of approaching information are uniquely valuable to AI researchers. Meta wouldn’t have gone through the trouble of using all those books in its new AI model if any old data would do the trick.
Looking toward the future, the true impact your data could have on these models will likely be to inspire “synthetic” data. As the companies that make generative AI systems run out of quality information to scrape, they will enter their ouroboros era: they’ll start using generative AI to replicate human data, then feed it back into the system to train the next model to better replicate human responses. As long as generative AI exists, just remember that you, as a human, will always be a small part of the machine—whether you want to be or not.