Republican senator Josh Hawley is furious. In a lively Senate subcommittee hearing, Hawley describes AI companies’ use of copyrighted materials to train their LLMs as “the largest intellectual property theft in American history.”
Although Hawley may be right about that, judges have not fully agreed with his assessment. It is a contentious issue, and one that will likely make its way to the Supreme Court.
As of January 2026, there are over 40 major copyright infringement cases working their way through the courts, and new ones are popping up all the time. On December 22, 2025, investigative reporter and Bad Blood author John Carreyrou, along with five other writers, filed a high-profile copyright infringement suit against Anthropic, Google, OpenAI, Meta, xAI, and Perplexity.
Similarly, on December 5, 2025, The New York Times sued Perplexity, accusing the AI company of copyright infringement as well as brand damage. In the lawsuit, The New York Times contends that Perplexity has been “hallucinating” content and attributing false citations to the newspaper, hence the brand damage claim. Incidentally, The Chicago Tribune and The Wall Street Journal parent company Dow Jones have filed similar suits against Perplexity.
Although these copyright lawsuits are still making their way through the courts, there have been a few landmark rulings over the past year—namely, Thomson Reuters v. Ross Intelligence, Bartz v. Anthropic, and Kadrey v. Meta.
In all of these cases, AI companies employ a fair use defense, arguing that they can legally train their LLMs on copyrighted data. When judges decide whether that defense holds, they apply a “four-factor fair use test.”
The four-factor fair use test
In a nutshell, the four-factor fair use test is a guide for judges to ascertain whether a use of copyrighted materials qualifies as fair use under Section 107 of the U.S. Copyright Act. When making a ruling, judges weigh all four factors together.
1. The purpose of the use: Are these copyrighted works being used for commercial purposes? Or are they being used for educational or nonprofit purposes? Is the use transformative? Or are these copyrighted works simply being copied?
2. The nature of the work being used: Is it published or unpublished? Is it a creative work? Or a factual work?
3. How much content was used: The judge assesses both the quantity copied and its qualitative importance to the original work.
4. The effect of the use on the marketplace: Does the use create a new competitor in the marketplace or otherwise hinder the copyright holder’s economic prospects?
Thomson Reuters v. Ross Intelligence
In this landmark case, Thomson Reuters (owner of the legal research platform Westlaw) sued legal tech start-up Ross Intelligence for using Westlaw’s proprietary data to train its AI model.
On February 11, 2025, the Delaware District Court ruled that Ross’ use of Westlaw’s “headnotes” (summaries of judicial decisions) did not meet the standard of fair use, and the Court found Ross liable for copyright infringement.
In his Thomson Reuters v. Ross Intelligence ruling, Judge Stephanos Bibas found that two of the four factors favored each side. However, he ultimately decided that Westlaw’s headnotes were original enough to warrant copyright protection; moreover, Ross’ use of the headnotes was commercial in nature, and Ross’ business model was designed to compete directly with Westlaw.
Thomson Reuters was accordingly granted a partial summary judgment, which is currently being appealed.
This ruling is important because it suggests that using copyrighted material to train LLMs may not qualify as fair use, especially if the AI company is using proprietary materials from a competitor.
Bartz v. Anthropic
In this case, a group of authors sued Anthropic for copyright infringement, claiming that the Google-backed company used pirated books from “shadow libraries” (Library Genesis, Pirate Library Mirror) to train its LLM, Claude. Anthropic also purchased millions of print books, scanned them, and used the resulting digital files to train Claude.
Judge William Alsup, a senior judge for the Northern District of California, issued a summary judgment declaring that Anthropic’s use of legally purchased books to train Claude was “transformative” and thus covered under fair use; however, the fact that Anthropic copied and stored over 7 million pirated books did not sit well with Judge Alsup.
Nor did it sit well with Illinois senator Dick Durbin, who lamented to his colleagues,
“Anthropic pirated over 7 million books from shadow libraries. As Anthropic’s CEO put it, Anthropic had many places from which it could have purchased, but it preferred to steal them to avoid, quote, ‘legal practice business slog’, whatever that means. While Anthropic later became not so gung-ho about training their LLM on pirated books for legal reasons, it kept the pirated copies it had already downloaded anyways. I don’t get that.”
Anthropic ultimately agreed to pay $1.5 billion in a class action settlement over the pirated books.
This ruling is important because it suggests that the use of pirated materials to train LLMs may not be covered under fair use. Such behavior may increase the risk of copyright infringement liability for AI companies. And rightfully so.
Kadrey v. Meta
Similar to Bartz v. Anthropic, this case involves a group of authors suing an AI company for downloading pirated books from shadow libraries and using those files to train its LLM.
While applying the four-factor fair use test, Judge Vince Chhabria argued that Meta’s use of the materials was transformative and would not impact the authors’ financial prospects. That said, Chhabria also wrote, “This ruling [in favor of Meta] does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful.”
So, although Meta won this particular suit, Chhabria suggests that not all training with copyrighted materials will fall under fair use.
This Meta case drew the ire of Hawley, Durbin, and many of the witnesses in the Senate subcommittee hearing on the use of copyrighted works to train AI models.
According to Hawley and Maxwell Pritt, a partner at Boies Schiller Flexner LLP, Mark Zuckerberg was explicitly warned about the illegality of pirating works, but he gave his company the go-ahead to do so anyway. In fact, Pritt and Hawley allege that Meta employees tried to hide the company’s illegal behavior by moving the torrenting (via Library Genesis) off Facebook servers and onto AWS servers.
Citing conversations between Meta employees (Eleanora Presani, Nisha Deo, Xavier Martinet, David Esiobu, and Nikolay Bashlykov), Pritt says that Meta acquired over 200 terabytes of pirated data to train Llama. In his Senate testimony, Pritt didn’t mince words: “Yes, Meta uses torrents to acquire pirated data for its Llama model.”
Looking at the big picture: Big Tech’s rationale for using copyrighted materials to train their LLMs
Fair use arguments aside, Big Tech companies contend that they can’t compete with firms in other countries if they aren’t allowed to train their models on copyrighted materials. These tech companies are also quick to claim that blocking such LLM training would stifle innovation.
As Michael Smith, professor of IT and marketing at Carnegie Mellon University, points out, these arguments from Big Tech companies and their lobbyists are far from new. In his subcommittee testimony, Smith says,
“In the context of gen AI training, we are now hearing many of the same arguments that we heard in the early days of the Internet: Allowing generative AI companies to use pirated content to train their models is fair use because it won’t harm legal sales, is unlikely to harm creativity, and any legislative efforts to curtail the use of pirated materials for training will not only be ineffective, but will also stifle innovation. It’s important to recognize that while the time has changed, the underlying economic principles are the same today as they were in 2000.”
Calls for Congress to change existing copyright law
In this same hearing, Richmond, Virginia-based author David Baldacci—who is part of a federal class-action copyright lawsuit against OpenAI—suggests that there is a clear power imbalance and double standard when it comes to copyright. Baldacci says,
“I urge Congress to consider what the AI companies’ position means for copyright as a whole and the future of the creative professions in this country. Source code and elements of algorithms are also protected by copyright. I would hazard to bet that if I stole any of the AI communities’ source codes or algorithms and then tried to profit off them, they would unleash a tsunami of lawsuits against me.
However, if as AI companies contend, fair use is actually my entire body of work, there is no more copyright protection for anyone. I’m sure the AI community believes that their IP should be fully protected against interlopers, and I agree with them. Thus, I am deeply disappointed that they don’t feel that people such as myself should enjoy the same rights and protections.”
Josh Hawley agrees with Baldacci and takes it a step further by arguing that U.S. copyright law itself needs to be examined. Speaking explicitly about existing copyright law and fair use statutes, Hawley gets extremely animated at one point. He says, “If the biggest corporation in the world [Meta], worth trillions of dollars, can come take an individual author’s work, like Mr. Baldacci, lie about it, hide it, profit off it, and there’s nothing our law does about that, we need to change the law. And if nothing else comes out of this hearing today, I hope that’s it.”
Hawley paints this situation as an intrinsically moral issue. He continues, “Mr. Baldacci, you said you’d rather live on a different planet if there was AI that could write your books. I’m sure that will never happen; they’ll never write your books. I want to live on a different planet if this can go on [referring to an incriminating conversation between Meta employees David Esiobu and Frank Zhang] and it’s perfectly legal. We’ve got to do something about this.”
Josh Hawley’s AI Accountability and Personal Data Protection Act
A week after the Senate hearing, Hawley introduced the AI Accountability and Personal Data Protection Act. Co-sponsored by Senator Richard Blumenthal (D-CT), the bill would allow individuals to sue AI companies that use their personal data or copyrighted works to train LLMs, and it would require AI companies to gain informed consent before doing so.
Given the current Congressional makeup, I don’t like the bill’s chances of passing. That said, it’s important to bring attention to this issue, and Josh Hawley, Richard Blumenthal, and Dick Durbin are certainly doing that.
Key takeaways
The recent Anthropic and Meta decisions suggest that fair use can apply to AI training under certain circumstances. At the same time, both rulings flag the use of pirated content in LLM training as a practice that significantly increases AI companies’ exposure to infringement liability.
Also, these court decisions suggest that moving forward, AI output will factor into the courts’ four-factor fair use assessment and market harm analysis. If other industries, such as music or film, are able to show market harm more clearly, it may become more difficult for AI companies to successfully employ a fair use defense.
In the wake of these court rulings, some politicians and academics, including Josh Hawley (R-MO), Richard Blumenthal (D-CT), Dick Durbin (D-IL), and New England Law professor Bhamati Viswanathan, are framing the AI companies’ LLM training as a moral issue, even going so far as to call for modifications to current U.S. copyright law.
In summary, when it comes to the issue of using copyrighted works to train LLMs, courts in the U.S. have not reached a consensus yet. Although it’s early days, I suspect this issue will eventually reach the Supreme Court.


