The next significant challenge in developing generative AI will be the availability of data, and, tied to that, access to enough genuine human input to keep producing human-like responses.
Which may mean that social platforms have the best opportunity to lead the way, with Meta's and xAI's chatbots having more direct access to human data inputs than any others. Google has Search queries and review inputs, too. But smaller players, without such access, could be left in the cold as publishers look to lock down their content, both to control access and to maximize profit.
The latest push on this front is a petition signed by thousands of well-known artists, calling for a ban on the unlicensed use of creative works in generative AI training. Penguin Random House is also taking a stand against the use of its authors' work for AI training, while several news publications are now signing official licensing deals with individual AI developers for their content.
Indeed, if this trend does give rise to official regulations that rightfully provide copyright holders with revenue from the licensed use of their works, that would limit access to the large data inputs needed to train AI models. Which will leave smaller developers with bad or worse choices: either scrape whatever data they can from the broader web (and more publishers are updating their robots.txt files to disallow unlicensed crawling of their data), or, worse, use AI-generated content to further train their AI models.
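For context, that robots.txt lockdown is a simple, voluntary opt-out. A publisher's file might look something like this (the user-agent tokens below are the published ones for these crawlers, though whether any given site blocks these exact bots is an assumption):

```
# Sample robots.txt entries blocking common AI training crawlers.
# Compliance is voluntary -- these directives are requests, not enforcement.

User-agent: GPTBot            # OpenAI's web crawler for training data
Disallow: /

User-agent: Google-Extended   # token controlling Google's AI training use
Disallow: /

User-agent: CCBot             # Common Crawl, a frequent source of training corpora
Disallow: /
```

The catch, of course, is that this only deters crawlers that choose to honor it.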
That's a road to degraded AI outputs, as the continued use of AI-generated content in training LLMs poisons the well in its own way, compounding the errors in the dataset. That ain't sustainable, so data from genuine human input is going to be in high demand, which puts Meta, X, and Reddit in the driver's seat.
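To see why, here's a toy simulation of that degradation, often called "model collapse" (a deliberately simplified sketch, not any vendor's actual pipeline): each generation is "trained" only on the previous generation's output, and, like a generative model, it underrepresents rare cases.

```python
import random
import statistics

# Toy illustration of model collapse: fit a Gaussian to the data, sample a
# new dataset from that fit while dropping "atypical" samples beyond two
# standard deviations, then refit on the synthetic data and repeat. The
# spread of the data shrinks generation after generation.

random.seed(0)
N = 1000  # samples per generation

# Generation 0: "human" data, mean 0, standard deviation 1.
data = [random.gauss(0, 1) for _ in range(N)]

for generation in range(10):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    print(f"gen {generation}: stdev = {sigma:.3f}")
    # The next generation is trained purely on this generation's output,
    # keeping only "typical" samples within two standard deviations.
    data = []
    while len(data) < N:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= 2 * sigma:
            data.append(x)
```

Run it and the spread of the data shrinks every generation, because the tails vanish first. Fresh human input is what keeps replenishing them.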
Reddit CEO Steve Huffman acknowledged as much this week, saying in an interview:
"The source of artificial intelligence is actual intelligence, and that's what you find on Reddit."
Reddit has already signed a data-sharing deal with Google to help power the search giant's Gemini AI experiments, and that could prove to be a key collaboration for the future of Google's tools.
Which social platform, then, has the most valuable data for AI model creation?
Meta has a huge stack of posts from its billions of human users, but original posting has declined in recent years, with video viewing increasingly taking its place across its apps. Which is why Threads may prove valuable, and why question-style posts may get more favorable algorithmic treatment, as a way of generating conversational data to train its AI systems.
X, too, sees more than 200 million original posts and replies uploaded to its platform every day, but the nature of those posts matters when it comes to training a system to understand human-like interaction and respond correctly.
Which is why Reddit, as Huffman notes, could be the best platform for AI training.
It's built around Q&A-style engagement, where users post questions and others supply relevant answers, which are then upvoted or downvoted in the app. Building an AI tool around that structure, combined with each developer's own AI models, could potentially produce the most accurate responses, and it'll be interesting to see how that ends up fueling Google's AI efforts, and what Google ends up paying for the privilege going forward.
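The appeal is structural: each question and its best-voted answer is already in the prompt/response shape used for instruction tuning. As a rough sketch (the thread layout and field names here are hypothetical, not Reddit's actual API schema):

```python
import json

# Hypothetical sketch of why Q&A threads map so cleanly onto training data:
# each (question, top-voted answer) pair is already a prompt/response
# example. The data below is illustrative, not real Reddit content.

threads = [
    {
        "question": "How do I flush the DNS cache on macOS?",
        "answers": [
            {"text": "Run: sudo dscacheutil -flushcache", "score": 412},
            {"text": "Restarting also works, technically.", "score": 37},
        ],
    },
]

def to_training_examples(threads, min_score=50):
    """Turn each question plus its best-voted answer into a prompt/response pair."""
    for t in threads:
        best = max(t["answers"], key=lambda a: a["score"])
        if best["score"] >= min_score:  # votes act as a free quality filter
            yield {"prompt": t["question"], "response": best["text"]}

for example in to_training_examples(threads):
    print(json.dumps(example))
```

Vote scores effectively serve as free human quality labels, which is a big part of what a licensing partner like Google is paying for.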
Though that also means that others could end up falling away in the race.
For instance, OpenAI has no ongoing feed of social data, other than from LinkedIn via its partnership with Microsoft. Will that eventually impede the development of ChatGPT, as more publishers lock down their content and withdraw it from AI training?
It's a fair consideration for the future development of AI models: without fresh data sources, such tools could quickly fall out of date, and users might switch to other models.
So who emerges victorious in this contest? Meta? xAI? Google?
At least currently, it appears that one of these three will eventually have the superior model, and will be at the front of the queue with the next wave of gen AI tools.
Or we're going to see the rise of more megadeals for preeminent data inputs, with more siloed AI models built around different datasets.
That may well prove to be a more useful and sensible path, one that reorients how generative AI gets built from here.