Reddit announced that it is updating its Robots Exclusion Protocol (robots.txt) file. As Reddit describes it, this file tells automated web crawlers whether they are allowed to crawl a site. Historically, site owners used robots.txt to give search engines permission to scrape their content and then direct users to it.
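To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the bot names (`ExampleSearchBot`, `ExampleAIBot`), and the URL are hypothetical and chosen only for illustration; they are not Reddit's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere; every other crawler is disallowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    page = "https://example.com/r/some-community/"  # hypothetical URL
    for agent in ("ExampleSearchBot", "ExampleAIBot"):
        print(agent, "allowed:", may_crawl(parser, agent, page))
```

The check runs entirely on the crawler's side. Nothing in the file forces a crawler to perform it, which is why robots.txt works only for crawlers that choose to honor it.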
Beyond the updated robots.txt file, Reddit will continue to rate-limit and block unknown bots and crawlers from accessing the site. According to the company, bots and crawlers will be rate-limited or blocked if they do not comply with Reddit's Public Content Policy and do not have an agreement with the platform.
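Reddit has not said how its rate limiting works. As a rough illustration of the general idea, the sketch below uses a token bucket, one common rate-limiting technique, which is an assumption here and not Reddit's documented method. The class name `TokenBucket` and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows short bursts up to `capacity` requests,
    then a sustained `rate` requests per second. Illustrative only."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens added back per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.clock = clock            # injectable clock, handy for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if a request may proceed now, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # A hypothetical unknown crawler limited to 1 request/second, burst of 2.
    bucket = TokenBucket(rate=1.0, capacity=2)
    for i in range(4):
        print("request", i, "allowed" if bucket.allow() else "rate-limited")
```

In practice a site would keep one bucket per client (for example, per user agent or IP address) and could skip the limit for partners with an agreement. Unlike robots.txt, this kind of check is enforced by the server, so a crawler cannot simply ignore it.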
Reddit says the change won't affect most users or good-faith actors, such as researchers and organizations like the Internet Archive. Rather, the goal is to deter AI companies from training their large language models on Reddit content. Of course, AI crawlers could simply ignore Reddit's robots.txt file.
The news comes just a few days after a Wired investigation found that AI-powered search startup Perplexity has been scraping and stealing content. According to Wired, Perplexity appears to ignore requests not to scrape Wired's website, even though Wired blocked the startup in its robots.txt file and Perplexity reportedly claims to follow the rules set in that file. Perplexity CEO Aravind Srinivas responded to the claims by saying that the robots.txt file is "not a legal framework."
Reddit's changes will not affect companies that have an agreement with it. For example, Reddit has a $60 million deal with Google that allows the search giant to train its AI models on the social platform's content. With these changes, Reddit is signaling to other companies that want to use its data for AI training that they will have to pay.
"Any person accessing the content on Reddit has to be in line with our policies, including those we have implemented to safeguard redditors," the company said on its blog. "We are choosy about with whom we collaborate and entrust such a massive breadth of access to our content."
The announcement isn't a surprise, as Reddit recently released a new policy designed to guide how its data is accessed and used by commercial entities and other partners.