Tool Without A Handle: Are You Not Trained?

This post takes up the question of how copyright law may impact the development and commercialization of Artificial Intelligence ("AI") tools, given their use of other people's data, generally without prior notice or permission.

As has been well documented, training the Large Language Models (“LLMs”) used in Generative AI tools requires lots of data, much of it acquired by scraping the Internet.[1]  And, the thinking goes, the more data and the higher its quality, the better the outputs from such tools.[2]  Some such input data is open for such use, but much else is protected by copyright,[3] or is personal information regulated by privacy laws.[4]  This article summarizes some of the key copyright legal and policy issues this raises; I'll take up privacy issues in a future post.

To date, the responses of AI builders and sellers to these issues have ranged from ignoring them, to attempting negotiations with certain parties, to lawsuits, but to my knowledge (and to the knowledge of experts I’ve spoken with), no definitive solutions have been established, and some contend these conversations are just beginning.[5] The challenge is to find solutions to data scraping for training AI models that fairly address the interests of rights holders and data subjects while also maintaining optimal momentum for generative AI and its useful and profitable applications.[6]

The responses of rights holders and creators to the commercialization of ChatGPT and other tools trained on their works range from asking nicely (usually getting a reasonable response)[7] to working out agreements[8] to litigation.[9]  Some of the litigation has been inartfully pleaded, or based on copyright theories that were never going to bear the weight of the claims.  For example, some claims arguing that AI models regularly and generally infringe on protected works by creating ‘derivatives’ of a protected work have not been sustained.[10]

Those “output focused” claims failed in part because plaintiffs presented no evidence that the AI model outputs contained protected components of the original work.[11]  Notably, though, the New York Times’s lawsuit against OpenAI and Microsoft does include examples of model outputs that contain substantial passages that reproduce verbatim the content of New York Times articles.[12]

The US Copyright Office has sought comment on these (and other) questions,[13] and received thousands of responses.  Stakeholder perspectives lined up predictably, with creative communities arguing such use requires permission (and compensation) while AI model builders and their supporters argued for a broad view of “fair use” that would allow reproduction of protected content without licensing being required or compensation being owed.[14]

Loosely speaking, “fair use” is a doctrine of copyright law establishing that certain uses of copyrighted material, e.g., for purposes of commenting on the original work, do not infringe that work’s copyright.  Whether a use qualifies as a “fair use” is determined by an inquiry that examines the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect on the market for the original work,[15] including whether the subsequent work is a “transformation” of the original (and therefore not a market substitute).[16]

Proponents of the fair use basis argue both that what AI models do with protected content is permitted fair use,[17] and that policy considerations weigh in favor of that approach.  For example, Microsoft noted a licensing scheme would impede AI innovation, including from start-ups and entrants who lack the resources to obtain licenses.[18] Proponents have also noted that some analogous digital uses of in-copyright works have qualified as transformative fair uses.[19]

Another perspective, more skeptical of the fair use arguments, is that the AI models relying on data scraped without permissions are living on borrowed time, and that copyright claims will eventually lead to disgorgements.[20] There is some support for this view as well.  For one, Microsoft and OpenAI, proponents of the “fair use” basis for use of protected works in training, have offered to indemnify customers facing copyright litigation due to their use of Microsoft/OpenAI tools. While Microsoft has stated it’s extending this offer on the basis that few claims are likely to succeed,[21] the offer also indicates that Microsoft believes some claims will survive to settlement or even to judgment, and that for those claims it would be optimal for Microsoft (rather than its customer) to control the litigation.

Creative artists are, naturally, more firm in their views.  Comments from creative interests have told the FTC such use of their work for training AI models is “theft, not fair use,” and the FTC has raised concerns about both copyright and unfair competition arising from the use of original artists’ work.[22]  That said, the issue is sufficiently uncertain that rights holders are also hedging their bets, e.g., advocating changes to the Copyright Act, on the assumption that existing law is insufficiently clear on the point.[23]

Court opinions to date suggest answers may turn on what the AI models are designed to do with the ingested works.[24]  In a 2015 case involving the mass reproduction of books to create Google online search services,[25] a court found that “fair use” applied where the reproduction was used solely to provide a public affordance to search for keywords within the text of books, and to see as results certain “snippets” showing context where the word appears.  This was found to be a “transformative” use: “transforming” the books to make available information about the work without making available a full copy that could serve as a substitute or create harm to authors.  Thus, the outcomes of litigation may vary, depending on the nature of the AI model and how well it has been trained to avoid simply plagiarizing its answers and/or publishing more than “snippets” in replies to prompts.[26]

There are other approaches as well, and reason to favor more comprehensive approaches instead of piecemeal solutions, which are likely to be shaped by variations in market power and the ability of a given AI builder/seller to litigate claims, and would result in unpredictable or uneven protections for the interests of rights holders.  Microsoft's policy argument that a licensing scheme would impede AI innovation cuts the other way too: the needs of start-ups and entrants who lack the resources to obtain licenses can be addressed not only by a definitive court ruling of "fair use" but also by a licensing / compensation scheme that applies equitably to all concerned.

That could be accomplished through law and regulation, of course, and there are examples in the law of ways in which such laws balance incentives and interests, e.g., compulsory licensing schemes for cable TV stations retransmitting copyrighted content.  But there's also much to offer in a voluntary, self-regulatory scheme, similar to how protected works are licensed and made broadly available through music licensing processes that allow licenses for public performances of copyrighted songs. 

In this model a builder of an AI model would deal with one organization, which would collect the license revenue and then redistribute compensation proportionately (e.g., allocating a higher share to Wikipedia, the New York Times, or other sources that weigh heavily in the model).  An AI model builder could similarly check with one central organization as to whether a given data set has limits on how it can be used, e.g.

To be sure, music licensing has not been without its controversies, but by that token there's no reason to think aspects of the music licensing model that don't work well would necessarily need to be recreated here. Indeed, training AI models does have some important differences from a public performance of a musical work, so it should differ anyway.  Identity and authentication issues (e.g., how to attribute protected works to their proper owners) would be fundamental to making this approach work.  All of which is to say the details of a licensing scheme for AI are weighty enough to deserve their own post, as are privacy issues, which I'll also tackle in a future installment.


[1]See, e.g., Lauren Leffer, “Your Personal Information Is Probably Being Used to Train Generative AI Models,” Scientific American (October 2023); Congressional Research Service, “Generative Artificial Intelligence and Copyright Law” (2023), online at: LSB10922 (congress.gov) (“AI systems are ‘trained’ to create literary, visual, and other artistic works by exposing the program to large amounts of data, which may include text, images, and other works downloaded from the internet”); but see Emily M. Bender, “On NYT Magazine on AI: Resist the Urge to be Impressed,” Medium (pointing out reasons why “training” is an imperfect metaphor that may overly anthropomorphize the software tools).

[2]Hence the “large” in the “large language models” that are currently the focus of much attention.  OpenAI published a paper in 2020, for example, outlining a scaling analysis for AI models, finding that “language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute”; see Kaplan, McCandlish, et al., “Scaling Laws for Neural Language Models,” online at:  2001.08361.pdf (arxiv.org).  That said, there are views that the “bigger = better” principle may no longer hold; see, e.g., The Economist, “The bigger-is-better approach to AI is running out of road” (economist.com) (June 21, 2023).  In addition, some commentators have argued that smaller models may be better suited for many AI tasks (see, e.g., “When It Comes to AI Models, Bigger Isn't Always Better,” Scientific American); that the premise “larger is better” was never persuasive in the first place (see, e.g., “LLaMA: Open and Efficient Foundation Language Models,” arXiv:2302.13971 (arxiv.org) (reporting that a smaller model from Meta outperforms the much larger GPT-3 on most benchmarks)); or that large models create negative externalities in the form of energy consumption, financial costs and the perpetuation of harmful social bias.  See, e.g., Bender, Gebru, et al., “On the Dangers of Stochastic Parrots:  Can Language Models Be Too Big?” online at: https://dl.acm.org/doi/10.1145/3442188.3445922 (hereafter “Stochastic Parrots”).

[3]The primary copyright issues I deal with in this post involve the potential infringement by use of a protected work to train an AI model.  Some commentators refer to these as “input” questions, as distinct from “output” questions such as whether an AI model can hold a copyright for works it created. See, e.g., “AI created a song mimicking the work of Drake and The Weeknd. What does that mean for copyright law?” (Harvard Law School).  I focus on the input questions in part because they are, I think, more susceptible to solutions and in part because it’s those questions that survive in the lawsuit filed against Stability AI, et al.   Andersen v. Stability AI, U.S. District Court for the Northern District of California, No. 3:23-cv-00201.  Other litigation includes claims against OpenAI and Meta, claims against Stability AI filed by Getty Images (see Navigating the Legal Labyrinth of AI’s Intellectual Property Challenge (msn.com)), and the copyright claims filed by the New York Times against OpenAI and Microsoft (NYT_Complaint_Dec2023.pdf (nytimes.com)) (“NYT Complaint”). “Input” and “output” issues are inter-related, however.  As the NYT Complaint outlines, its claim of infringement rests in part on the fact that many reported outputs from the OpenAI/Microsoft ChatGPT tools reproduce verbatim text from NYT articles.  The complaint argues that these direct reproductions not only indicate that the OpenAI tools do not “transform” the NYT article as a claim of “fair use” would require, but also work direct competitive harms to the New York Times as a business by providing content to non-subscribers and adding value to OpenAI and Microsoft products without fair compensation to the New York Times for its labor, capital and other investments in creating works of journalism.

[4]See, e.g., Italy gives OpenAI initial to-do list for lifting ChatGPT suspension order | TechCrunch (setting out requirements for offering ChatGPT in Italy so as to comply with GDPR, as applied by the Italian Data Protection authority); Data Scraping, Privacy Law, and the Latest Challenge to the Generative AI Business Model | Arent Fox Schiff - JDSupra.

[6]It’s possibly the understatement of the year to say there is principled disagreement as to how fast or how slow is “appropriate” or even safe for the existence of humanity.  See, e.g., Sam Altman's firing at OpenAI highlights AI culture clash (axios.com).

[8]See, e.g., Writers Guild of America settlement agreement, Memorandum of Agreement for the 2023 WGA Theatrical and Television Basic Agreement (wgacontract2023.org), Art. 72 (“Generative Artificial Intelligence”), p. 68. And while OpenAI was apparently unable to reach agreement with the New York Times on use of the Times’s content, it has struck deals with Associated Press (Associated Press licensing news archive to OpenAI | The Hill) and German publisher Axel Springer (Partnership with Axel Springer to deepen beneficial use of AI in journalism (openai.com)).  The full terms of these deals are not public, though in its announcement with Axel Springer, OpenAI notes that ChatGPT’s answers to user queries will include attribution and links to the full articles for transparency and further information. Id.

[10]See, e.g., gov.uscourts.cand.415175.56.0_1.pdf (courtlistener.com); Court Dismisses Lawsuits Against Meta's LLaMA and DoNotPay AI - Pearl Cohen.  Thus, as one commentator put it, “courts are leaning in favor of the AI companies so far, but the battle is just beginning.” “The Plaintiffs Are Wrong”: OpenAI Submits New Authority in Attempt to Knock Out Sarah Silverman’s Claims | Gadgets, Gigabytes & Goodwill Blog (gadgetsgigabytesandgoodwill.com).  These cases raised the claim that use of protected works to train AI models means that any output of the generative AI tool is an infringing “derivative work.”

[11]See, e.g., Order on Motions to Dismiss, Andersen v. Stability AI Ltd., 23-cv-00201-WHO (Casetext);  see also  Federal Judge Dismisses Major Claims in Artists' Copyright Case Against AI Platforms (legal.io) (summary of Judge Orrick’s order).  Certain AI models may raise different issues than others.  For example, generative AI models that use “diffusion” techniques arguably do not create a new image from scratch, but instead replicate the original image and then transform it by adding new or different data (aka “noise”).  Courts still are wrestling with whether this involves a “copying” of a protected work, though it quite likely does.  See, e.g., Image-generating AI can copy and paste from training data, raising IP concerns | TechCrunch.  Demonstrating that a given scraped work served as the origin of a derivative output may prove difficult or impossible given the opacity of many generative AI tools. Various proposals attempt to address that opacity, such as the proposed EU AI Act; the political agreement for that legislation contains rules on both transparency (such as summaries of training data) and explainability.  See, e.g., Artificial Intelligence Act: deal on comprehensive rules for trustworthy AI | News | European Parliament (europa.eu); eu_ai_act_cheat_sheet.pdf (iapp.org) (proposed requirements summarized).

[12]See NYT Complaint at 29-39; Exhibit J.

[13]See US Copyright Office, Artificial Intelligence and Copyright, Federal Register Vol. 88, No. 167 / Wednesday, August 30, 2023, 2023-18624.pdf (govinfo.gov).

[14]Compare Comment from Songwriters Guild of America, Society of Composers & Lyricists, and Music Creators North America (online at: https://www.regulations.gov/comment/COLC-2023-0006-10291) with Comment from Microsoft, online at:  https://www.regulations.gov/comment/COLC-2023-0006-8750 (“Microsoft believes that the fair use doctrine in the U.S. is the Intellectual Property framework best suited to supporting AI development”).

[15]See 17 U.S.C. § 107 (Copyright Act provisions describing the “fair use” test); Folsom v. Marsh, 9 F. Cas. 342 (C.C.D. Mass. 1841) (court opinion setting out the four factors since codified in the Copyright Act).

[16]See Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 578-585 (1994) (“Under the first of the four § 107 factors, "the purpose and character of the use, including whether such use is of a commercial nature ... ," the enquiry focuses on whether the new work merely supersedes the objects of the original creation, or whether and to what extent it is "transformative," altering the original with new expression, meaning, or message”).

[17]See, e.g., Microsoft Comments, p. 7-8 (arguing both that AI model use of protected data “works a transformation of the original work” and that the potential market for, or value of, a copyrighted work is not affected by use of the work to train an AI model (as distinct from the effects of any outputs)); Testimony of Sy Damle, Latham & Watkins, online at: https://judiciary.house.gov/sites/evo-subsites/republicans-judiciary.house.gov/files/evo-media-document/damle-testimony.pdf; see also Five key takeaways from the House Judiciary Committee hearing on AI and copyright law – The Passive Voice (summarizing Mr. Damle’s testimony).

[18]See, e.g., Microsoft Comments, p. 9.

[22]See, e.g., FTC Comment to the Copyright Office, online at: https://downloads.regulations.gov/COLC-2023-0006-8630/attachment_1.docx; FTC Workshop, “Creative Economy and Generative AI”, online at: Creative Economy and Generative AI | Federal Trade Commission (ftc.gov).

[23]See, e.g., Testimony of Ashley Irwin, Society of Composers & Lyricists, irwin-testimony.pdf (house.gov).

[25]Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), online at: https://bit.ly/47h7i8i

[26]The NYT Complaint alleges several contrasts, for example, noting that the outputs of Bing/ChatGPT search both display significantly more content from original NY Times articles than would ordinarily be in search results, and omit links to the original articles.  NYT Complaint, para. 114.  The complaint also notes Bing Chat search results “display extensive excerpts or paraphrases of Wirecutter [“Wirecutter” is a NYT publication featuring product reviews for consumer electronics and similar goods and services] content when prompted. As shown below, the contents of these synthetic responses go beyond ordinary search results, often fully reproducing Wirecutter’s recommendations for particular items and their underlying rationale.” NYT Complaint, para. 127.

 
