Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case.
Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could perform searches for their copyrighted content in its AI training sets. (Virtual machines are software-based computers that exist within another computer’s operating system, often used for the purposes of testing, backing up data, and running apps.) In a letter, attorneys for the publishers say that they and experts they hired have spent over 150 hours since November 1 searching OpenAI’s training data.
But on November 14, OpenAI engineers erased all of the publishers' search data stored on one of the virtual machines, according to the letter, which was filed in the U.S. District Court for the Southern District of New York late Wednesday.
OpenAI tried to recover the data — and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.
"News plaintiffs have been required to redo their work entirely from scratch using substantial person-hours and computer processing time," wrote counsel for The Times and Daily News. "The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week's worth of its experts' and lawyers' work must be redone, which is why this supplemental letter is being filed today."
The plaintiffs' counsel explicitly state that there is nothing to suggest the deletion was willful. They do say, however, that the episode shows OpenAI "is in the best position to search its own datasets" for potentially infringing content using its own tools.
An OpenAI spokesperson had no comment.
But late Friday, November 22, lawyers for OpenAI responded to the letter sent on Wednesday by counsel for The New York Times and Daily News. In their reply, OpenAI's attorneys strongly denied that OpenAI deleted evidence, and instead blamed the plaintiffs for a system configuration issue that resulted in a technical glitch.
“Plaintiffs requested a configuration change to one of several machines that OpenAI has provided to search training datasets,” OpenAI’s counsel wrote. “Implementing plaintiffs’ requested change, however, resulted in removing the folder structure and some file names on one hard drive — a drive that was supposed to be used as a temporary cache … In any event, there is no reason to think that any files were actually lost.”
In this and other cases, OpenAI has asserted that using publicly accessible data to train models — articles from The Times and Daily News among them — is fair use. That is, in building models like GPT-4o, which "learn" from billions of examples of e-books, essays, and more to generate human-sounding text, OpenAI believes it needn't license or otherwise pay for examples, even if it profits off the models.
With that said, OpenAI has signed licensing agreements with a growing number of news publishers, including the Associated Press, Business Insider parent Axel Springer, Financial Times, People parent Dotdash Meredith, and News Corp. OpenAI refuses to disclose the terms of those deals, but one content partner, Dotdash, is said to be receiving at least $16 million annually.
OpenAI has neither confirmed nor denied that it trained its AI on any specific copyrighted works without permission.