OpenAI Accidentally Erases Potential Evidence in NY Times Copyright Lawsuit (Updated)

Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case.
Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could search for their copyrighted content in its AI training sets. (Virtual machines are software-based computers that exist within another computer's operating system, often used for testing, data backup, and running apps.) In a letter, attorneys for the publishers wrote that they and the experts they hired have spent more than 150 hours since November 1 searching OpenAI's training data.

But on November 14, OpenAI engineers erased all of the publishers' search data stored on one of the virtual machines, according to the letter, which was filed Wednesday evening in the U.S. District Court for the Southern District of New York.

OpenAI tried to recover the data — and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.

"News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time," counsel for The Times and Daily News wrote. "The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week's worth of its experts' and lawyers' work must be re-done, which is why this supplemental letter is being filed today."

Counsel for the plaintiffs makes clear that they have no reason to believe the deletion was intentional. But they say the incident underscores that OpenAI "is in the best position to search its own datasets" for potentially infringing content using its own tools.

An OpenAI spokesperson would not comment for this story.

However, late on Friday, November 22, OpenAI's lawyers filed a response to the letter sent on Wednesday by lawyers for The Times and Daily News. In their response, OpenAI's lawyers unequivocally denied deleting any evidence and instead blamed the plaintiffs for a configuration change that led to a technical malfunction.

“Plaintiffs requested a configuration change to one of several machines that OpenAI has provided to search training datasets,” OpenAI’s counsel wrote. “Implementing plaintiffs’ requested change, however, resulted in removing the folder structure and some file names on one hard drive — a drive that was supposed to be used as a temporary cache … In any event, there is no reason to think that any files were actually lost.”

In this case and others, OpenAI has argued that training models on publicly available data — including articles from The Times and Daily News — constitutes fair use. That is, in creating models like GPT-4o, which "learn" from billions of examples of e-books, essays, and more to generate human-sounding text, OpenAI believes it isn't required to license or otherwise pay for the examples — even if it makes money off those models.

Still, OpenAI has been signing licensing agreements with a growing number of publishers, among them the Associated Press, Business Insider owner Axel Springer, Financial Times, People parent company Dotdash Meredith, and News Corp. OpenAI has declined to disclose the terms of these deals, but at least one content partner, Dotdash Meredith, is being paid at least $16 million per year, according to reports.

OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.

2024-11-23 17:37:56