Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case.
Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could perform searches for their copyrighted content in its AI training sets. (Virtual machines are software-based computers that exist within another computer’s operating system, often used for the purposes of testing, backing up data, and running apps.) In a letter, attorneys for the publishers say that they and experts they hired have spent over 150 hours since November 1 searching OpenAI’s training data.
But on November 14, OpenAI engineers erased all the publishers' search data stored on one of the virtual machines, according to the letter, which was filed late Wednesday in the U.S. District Court for the Southern District of New York.
OpenAI tried to recover the data — and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.
"News plaintiffs have had to redo all their work in a massive use of person-hours and computing time," wrote The Times and Daily News' counsel. "The news plaintiffs only learned yesterday that the recovered data is not usable and that an entire week's work of experts and lawyers must be started over, hence this supplemental letter today."
The plaintiffs' counsel said they have no reason to believe the data was deleted intentionally. They do, however, say that the episode drives home that OpenAI "is in the best position to search its own datasets" for possibly infringing content using its own tools.
OpenAI did not provide an official comment.
In this case and others, OpenAI has argued that training models using publicly available data — including articles from The Times and Daily News — constitutes fair use. Which is to say: When creating models like GPT-4o, which "learn" from billions of examples of e-books, essays, and more to generate human-sounding text, OpenAI believes it isn't required to license or otherwise pay for the examples — even if it makes money from those models.
That said, OpenAI has inked licensing deals with a growing number of news publishers, including the Associated Press, Business Insider owner Axel Springer, Financial Times, People parent company Dotdash Meredith, and News Corp. The terms of these deals have not been disclosed. According to reports, Dotdash has received at least $16 million in one year.
OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.