Reddit's upcoming changes aim to protect the platform from AI crawlers.

The company said on Tuesday that it's updating its Robots Exclusion Protocol, better known as the robots.txt file, which tells web bots whether or not they are allowed to crawl a site.
Traditionally, the robots.txt file let search engines crawl a site and then direct people to its content. With the rise of AI, however, websites are now scraped to train models, often without any acknowledgment of the original source of the content.
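For illustration, here is what a simple robots.txt file might look like; the bot names and paths below are hypothetical examples, not Reddit's actual rules:

    # Block a hypothetical AI training crawler from the entire site
    User-agent: ExampleAIBot
    Disallow: /

    # Allow all other bots, but keep them out of one directory
    User-agent: *
    Disallow: /private/
    Allow: /

Note that compliance is voluntary: the file is a request, not an enforcement mechanism, which is why Reddit is pairing it with rate limiting and blocking.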

In addition to the new robots.txt file, Reddit will continue rate-limiting and blocking unknown bots and crawlers from accessing its platform. According to TechCrunch, the company said that bots and crawlers will be rate-limited or blocked if they don't adhere to Reddit's Public Content Policy or don't have an agreement with the platform.

According to Reddit, the change should not affect the vast majority of users or good-faith actors, such as researchers and organizations like the Internet Archive. Rather, it is meant to deter AI companies from training their large language models on content hosted on Reddit. Of course, AI crawlers could simply ignore Reddit's robots.txt file.
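That's because honoring robots.txt is entirely up to the crawler. A well-behaved bot checks the file before fetching a page, as in this sketch using Python's standard-library robotparser (the bot name and URLs are placeholder examples):

    from urllib import robotparser

    # Load the site's robots.txt; the URL here is a placeholder example
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.reddit.com/robots.txt")
    rp.read()

    # A compliant crawler asks before fetching; a bad actor simply skips this check
    if rp.can_fetch("ExampleAIBot", "https://www.reddit.com/r/all/"):
        print("Allowed to crawl")
    else:
        print("Disallowed by robots.txt")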

The announcement comes days after a Wired investigation found that AI search startup Perplexity has been stealing and scraping content. Wired reported that Perplexity ignored requests not to scrape its website, even though Wired had blocked the startup in its robots.txt file. Responding to the claims, Perplexity CEO Aravind Srinivas said that the robots.txt file is "not a legal framework."

The changes Reddit is introducing won't touch companies it has an agreement with. For instance, the social platform has a $60 million deal with Google that lets the search giant train its AI models on Reddit content. The move sends a message to other companies that want to use Reddit's data for AI training: they will have to pay for it.

"Anyone accessing Reddit content must comply with our policies, including those designed to protect redditors," Reddit wrote on its blog. "We are selective about who we work with and trust with large-scale access to Reddit content."

The announcement doesn't come as a surprise: Reddit recently released an updated policy outlining how its data can be accessed and used by commercial entities and other partners.
