
Tool Without A Handle: Are You Not Trained? - Part 2

By Chuck Cosson


The last post in this series[1] addressed how copyright law may impact the development and commercialization of Artificial Intelligence (“AI”) tools, given that their development relies on the use of other people’s creative works, often without notice or consent.  This is an unavoidable question; as OpenAI noted in a public comment, it is impossible to create generative AI tools without the use of copyright-protected content.[2]  This post addresses a follow-up question:  whether, in light of what’s noted thus far, copyright law requires AI model builders to license all such content, and thus whether a mandatory licensing scheme to enable a fair exchange of value between creators of protected content and AI model builders is legally required.

My answer is a caveated “no,” for at least three reasons.  First, for the moment, no copyright claim has persuaded a court that licensing is required to avoid infringing the rights of copyright holders in scanned or scraped works used to train AI models.[3]  And no US court has yet ruled against the argument that such use of content to train AI models is protected by fair use principles, which permit the use of a protected work to create transformative new works.[4]  Second, as a practical matter, the economic forces propelling the growth of generative AI presently appear much stronger than any resistance to that growth based on copyright concerns.[5]  Third, the major platform firms are conscious of these concerns and have staked out responses.

OpenAI, for example, notes it has adopted a policy of training its models only on data that is either public, provided directly by its staff, or licensed.[6]  Recognizing it is important to “support and empower” creators, OpenAI also notes it is actively engaged with creative interests to find mutually beneficial arrangements, such as partnership deals with publishers.[7]  Other AI developer firms are also responding; a new nonprofit offers a trust mark program for companies that have trained a generative AI model using only work that is properly permissioned.[8]

That said, my “no” comes with a caveat, and it’s an important one:  I think a licensing scheme should be created and required, for three reasons independent of copyright litigation or the scope of the fair use defense:

  1. The market for generative AI tools already fails to provide creators with a reasonable and sustainable share of revenue.  Even if not required to satisfy copyright claims, there is long-term value (as well as fundamental fairness) in compensating the artists who created the works (including the concepts underlying them) and undertook the risks, time, and other expense of bringing them into existence.  AI could tip the balance such that it works a material harm on the pace and profit of creative work, and everyone in the AI ecosystem will benefit if the pace and profitability of content creation continue to be fairly incentivized;
  2. Misinformation concerns already constitute a compelling interest that government regulation can and should address.  The fight against misinformation enjoys substantial advantages where content can be properly attributed to its true creator, including creators who build off of licensed content to create their own works (e.g., parody entertainment).  Establishing the provenance of content used to train AI models also enables content owners to protect against derogatory use of their work, including redress against counterfeiting.  A licensing scheme thus furthers both public and private interests in reducing misinformation;
  3. It’s easier to contract than to sue, and it’s preferable to contract than to operate against uncertainty about material issues such as the lawfulness of model training.  While the market for generative AI models may currently function reasonably well, even with the overhang of uncertainty about rights ownership, generative AI will not be in its nascent stage indefinitely.  Especially if a settlement means no court provides clarity as to how far “fair use” protects the use of content to train AI models, licensing will likely prove an attractive path to greater legal certainty for risk-averse parties to complex and expensive transactions.

Taking the third of these reasons first, business model innovation is one good reason for parties to seek greater legal certainty.  It’s reasonable to assume the future holds more aggressive and content-intensive applications of AI (e.g., “CoPilot, write the film score for the next season of my Netflix series”), and thus more aggressive contracting with respect to commissioned uses of AI.  In such negotiations, parties, especially the more risk-averse ones, will mutually benefit if licensing is available as an alternative to “winging it” with respect to potential copyright infringement.

So while current cases against OpenAI continue to wrestle with how to effectively plead a complaint[9] or to sort out facts,[10] and notwithstanding the strength of fair use arguments as a defense to the use of copyrighted works in training large language models,[11] the ability of generative AI companies to sustain investment levels throughout a long and accelerating innovation cycle will likely depend on, or at least materially benefit from, reducing current levels of legal uncertainty.

Professor Andres Guadamuz of the University of Sussex, quoted in The Economist in support of the “fair use” arguments, nonetheless believes a licensing regime will ultimately arise, and that AI developers will have to pay creators in return for the use of their work in AI model development.[12]  Professor Guadamuz elaborates on the reasons for this in a recent paper,[13] which emphasizes the fact-specific nature of the data collection, model training, and output generation processes (not all generative AI models create outputs in the same way).  Because of these variances, some AI model development and use may violate copyright while other development and use may not, and legal process would be required to sort out which is which.  Developers may well prefer a licensing fee over the time and cost of such a process.

Law firms, as well, are therefore encouraging clients to pursue contractual certainty, given that the intellectual property issues surrounding generative AI software are in flux.  For example, from Hogan Lovells:

“[I]f generative AI software has been fed or trained using private data (such as a proprietary provider or user database) the agreement should explicitly define ownership of such data. Negotiating appropriate IP and data ownership terms is critical in generative AI software contracts, because, among other aspects, these terms (or the lack thereof) can impact the user's ability to use and further develop AI-generated assets. However, IP ownership related to AI-generated content under United States law is still in flux.”[14]

Hogan Lovells goes on to note that, because these issues are in flux, the rights enforceable under a contract may not be enforceable against the world, i.e., non-parties to the contract.  This, in turn, creates incentives to establish a broader licensing structure so that parties can minimize legal uncertainty at a lower absolute cost (and a far lower transactional cost) than party-by-party contracting.

Licensing regimes can create cost efficiencies in the public context as well, including by reducing the social costs of unattributed “deep fakes” and their abuse in misinformation campaigns.  As CISA Director Jen Easterly noted to the New York Times,[15] the most “critical infrastructure” of the United States is our cognitive infrastructure – the framework and tools by which citizens examine and analyze reality. When this is weakened, both commercial marketing and public safety efforts are weakened as well.  In this way, misinformation, including in the form of AI-generated “deepfakes” that use the image, likeness, or other characteristics of a famous work (or person), is a key threat to that critical infrastructure, in particular to politics and elections.[16]

Accordingly, some major platforms have adopted policies requiring posts to disclose whether they were created with AI, and requiring similar disclosures for altered or synthetic content.[17]  Adding such disclosures, through metadata or otherwise, to new content may be fairly straightforward, but AI models are typically trained on the substantial volume of untagged, already-created content.  Tagging that content to identify ownership and indicate usage rights would be challenging and expensive.

We have done challenging and expensive things before, though, particularly where public interests and private commercial incentives align.  Tagging AI training content may prove less challenging than, say, scanning all the books in the world’s libraries.[18]  For one thing, such tagging could leverage existing metadata to drive scale, and take advantage of existing standardization to promote interoperability across platforms, as well as traceability for misinformation researchers and public safety officials.

Establishing the provenance of data used to train AI models will be required to address other issues as well, such as privacy.[19]  The EU AI Act will require compliance with EU copyright law (even where the model is trained outside the EU).[20]  The tools being developed to establish the provenance of digital content (such as its location and date of creation)[21] afford opportunities to simultaneously tackle privacy, international compliance, and the tracking and redress of misinformation, as well as to associate rights permissions with digital content, identify the rightful creator, and track use of the content for purposes of calculating royalties.
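To make that concrete, below is a minimal sketch (in Python) of what a rights-aware provenance record might carry.  The structure and field names are illustrative assumptions on my part, not drawn from the C2PA specification or any other existing standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProvenanceRecord:
    """Illustrative provenance/rights record for a single work.

    The fields are assumptions for discussion, not taken from C2PA
    or any other existing standard.
    """
    content_id: str                   # unique identifier for the work (e.g., a registered ID or hash)
    creator: str                      # rightful creator or rights holder
    created_on: str                   # ISO-8601 date of creation
    location: Optional[str] = None    # where the work was created, if known
    license_terms: str = "unlicensed" # e.g., "CC-BY-4.0" or "commercial-training-ok"
    training_permitted: bool = False  # whether use in AI model training is permissioned
    royalty_contact: Optional[str] = None  # where royalty accounting should flow

# Example record for a hypothetical licensed photograph.
record = ProvenanceRecord(
    content_id="urn:example:work-0001",
    creator="Example Photographer",
    created_on="2023-06-01",
    license_terms="commercial-training-ok",
    training_permitted=True,
    royalty_contact="rights@example.com",
)
```

A record along these lines could travel with the content as metadata, letting a single tag serve provenance, compliance, and royalty-accounting purposes at once.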

Such a provenance system should also afford a chokepoint for detecting and blocking child sexual abuse material (CSAM) before it is input into a model.  As David Thiel noted for the Stanford Internet Observatory, CSAM has been detected in image databases used to train AI models.[22]  And while the report notes that such CSAM is being quarantined and removed, and that there are methods to minimize CSAM in datasets used to train AI models, “it is challenging to clean or stop the distribution of open datasets with no central authority that hosts the actual data.”

Of course, there’s no need to have a central authority host training datasets, and there are a number of downsides (including security) to that approach.  The point remains, though, that it is important to clean such datasets, and that such work creates opportunities and affordances useful for tracking content for purposes of compensating rights owners.  There would be value added, for example, in a common standard by which the data is tagged and tracked to the owners or originators who then license its use.  And the process of screening out unattributed or unlicensed data also affords opportunities to screen out CSAM, in combination with tools like PhotoDNA.
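As a rough sketch of how the licensing screen and the CSAM screen could share one pipeline, consider the following.  It builds on the hypothetical ProvenanceRecord above; matches_known_csam_hash is a placeholder standing in for a perceptual-hash service such as PhotoDNA, whose real interface is not reproduced here.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a perceptual-hash lookup (e.g., a PhotoDNA-style service);
# a real deployment would call the vendor's API rather than check a local set.
KNOWN_BAD_HASHES = {"phash:0000deadbeef"}

def matches_known_csam_hash(perceptual_hash: str) -> bool:
    return perceptual_hash in KNOWN_BAD_HASHES

def screen_training_item(record: "ProvenanceRecord", perceptual_hash: str) -> Tuple[bool, str]:
    """Decide whether one candidate item may enter the training set.

    `record` is the hypothetical ProvenanceRecord sketched earlier.
    """
    if matches_known_csam_hash(perceptual_hash):
        return False, "blocked: matched a known CSAM hash"
    if not record.training_permitted:
        return False, "excluded: rights record does not permit training use"
    return True, "included: permissioned and clean"

def build_clean_dataset(
    items: Iterable[Tuple["ProvenanceRecord", str]]
) -> List["ProvenanceRecord"]:
    """Keep only items that pass both the licensing screen and the CSAM screen."""
    return [record for record, phash in items if screen_training_item(record, phash)[0]]
```

The same pass over the data that enforces licensing terms can thus also enforce safety screens, which is part of why the two goals reinforce each other.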

Moreover, such a system could also help preserve free speech values.  One way platforms might deal with various legal risks (and respond to misinformation concerns) is by simply blocking AI-generated “deepfake” content.  But some such “deepfake” content will have social value as commentary, parody, or as a demonstration of technical capabilities.  Where copyright checks provide a way to verify the origin, and thus the authenticity, of a given work, more such content could be cleared for distribution.  In general, incentives to deal with misinformation also create incentives for licensing regimes that both authenticate works and allow for more legal clarity and certainty around the rights to use a work, including for commercial purposes.

I don’t want to put undue weight on the fight against misinformation as a reason for a rights licensing regime to govern AI training datasets.  I believe the primary motivators will be commercial.  Among other things, policy approaches to misinformation vary considerably across jurisdictions.  For EU countries with reasonably homogeneous domestic populations and a stronger regulatory backdrop, public sector efforts to tackle misinformation are likely to have some heft.[23]  In the US, it’s less clear that would be the case.  Some actors, including politically influential ones, have incentives to perpetuate the distribution of certain misinformation and to block efforts to curtail such misinformation (and disinformation).[24]  Even so, a licensing regime that incidentally also helps fight misinformation has one additional thing going for it.

While incentives in the US are likely to be primarily commercial rather than regulatory, those commercial incentives are meaningful, as evidenced by the current use of various systems to uniquely identify a piece of content, establish ownership rights, and allocate revenue.  Creative works such as books, films, television, games, and many apps come with unique identifiers, such as ISBNs for books.  In cybersecurity, an important way to know one is interacting with a legitimate website is the use of certificates (and the trusted third parties who issue them) that provide independent confirmation that a website’s public key is valid.[25]  These systems generate trust and facilitate sales tracking, marketing analytics, and regional copyright protections.  They also incentivize investment by providing greater certainty about both the value of creating a work and the legality of using a work.
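As a small illustration of how such identifier systems build verifiability in at low cost, an ISBN-13 carries a check digit that anyone can validate with simple arithmetic (a minimal sketch):

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Validate an ISBN-13 check digit (digit weights alternate 1 and 3)."""
    digits = [int(ch) for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(isbn13_is_valid("978-0-306-40615-7"))  # a commonly cited valid example: True
```

Identifiers for AI training content would need richer semantics than an ISBN, but the underlying idea, a cheap check that anyone in the distribution chain can run, carries over.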

Finally, as a variety of stakeholders have noted, current content distribution models already fail to allocate a reasonable and sustainable share of revenue for creative works.[26]  Absent fair compensation for the entrepreneurial artists who generate the content on which AI models are trained, the proliferation of AI tools could work a material harm on the pace and profit of creative work, with both economic and cultural loss as the consequence.  I agree one should weigh the primary purposes of copyright (e.g., to “promote the useful arts”) in considering whether licensing is a preferable approach; but if treating the unauthorized use of protected material to train AI models as a copyright violation works against that purpose,[27] the absence of any acknowledgment of creators’ rights works against that purpose at least as much.

And, as with other examples of corporate sustainability, a mix of commercial responses and regulation is most likely to be effective here.  Regulations should be narrowly tailored, including in order to meet First Amendment scrutiny.  That’s not impossible to do, though.  As noted above, protecting incentives that “promote the useful arts” is a legitimate government interest, as are goals such as protecting opportunity in the market for AI tools by leveling the playing field for rights negotiations.  Licensing rules that protect the expressive entrepreneurship of content creators further First Amendment interests and could be crafted to do so in the least restrictive way possible.

It’s optimal, though, for the private sector (commercial and non-profit alike) to decide how these interests are best advanced, and how policy responses should be designed.  For while a licensing clearinghouse is an obvious approach, there are options that are less complex and could be set up more quickly.  These include voluntary ‘safe harbors’ that provide insulation from litigation if a firm signs certain commitments with rights holders and/or offers certain contractual terms.  Similarly, model terms and rate schedules for use in licensing contracts could help foster grass-roots development of firmer protections for content creators.  Most of the benefits I identified above, though, arrive via an approach that, at a minimum, (1) includes omnidirectional identifiers for creative content that allow tracking of use rights and compensation, and (2) has enough oomph and scale to be sustainable and effective.

[4]See, e.g., Stability AI, Response to United States Copyright Office Inquiry into Artificial Intelligence and Copyright (October 2023) online at: Stability AI – submission to USCO (berkeley.edu); see generally 17 U.S.C. § 107 (Copyright Act provision describing the “fair use” test).

[5]The market for generative AI is expected, by one forecast, to top $407B by 2027, and to grow at a CAGR of 31.4% from 2023 to 2032. See OpenAI Statistics 2024 By Revenue and Growth (enterpriseappstoday.com)

[6]See, e.g., OpenAI, “OpenAI—written evidence (LLM0113) House of Lords Communications and Digital Select Committee inquiry: Large language models,” online at: committees.parliament.uk/writtenevidence/126981/pdf/

[7]Id.  As noted in my earlier post, an additional reason for a comprehensive licensing scheme is to ensure reasonable terms are available to all creators, not simply those with larger economic power and litigation departments. 

[9]See generally Tremblay v. OpenAI, Inc. | Loeb & Loeb LLP (noting the dismissal, with leave to amend, of plaintiffs’ claims for vicarious copyright infringement, violation of the DMCA, negligence and unjust enrichment);

[10]See, e.g., "New York Times denies OpenAI's 'hacking' claim in copyright fight | Reuters" (describing the dispute between the NYT and OpenAI as to whether apparently infringing outputs of the OpenAI models were naturally arrived at through ordinary prompts, were ‘hacked’ (created with artificially rigged conditions), or neither); New York Times Opposition to OpenAI Motion to Dismiss (thomsonreuters.com).  Among other things, the NYT Opposition also notes that copyright issues arise when model training strips out or evades rights-protecting technologies, something the Digital Millennium Copyright Act expressly prohibits.  See NYT Opposition at 14.  DMCA claims were also pled, but did not survive, in the Tremblay v. OpenAI litigation (see supra n.9).  But even so, the DMCA rules against interference with rights-protecting tools create additional legal uncertainty that a licensing scheme would remove.

[11]See, e.g., Does generative artificial intelligence infringe copyright? (economist.com) (noting that both OpenAI and an independent academic expert agree the “fair use” arguments should defeat plaintiff’s infringement claims)

[12]See Id.

[13]See Guadamuz, Andres, “A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs,” GRUR International, Jan 2024, online at: https://ssrn.com/abstract=4371204

[15]Jim Rutenberg and Stephen E. Myers, New York Times, 17 March 2024, "How Trump’s Allies Are Winning the War Over Disinformation"

[16]See, e.g., Social media accounts use AI-generated audio to push 2024 election misinformation (dallasnews.com) (noting examples of how AI is being misused or demonstrating potential for misuse).

[21]See, e.g., Coalition for Content Provenance and Authenticity, Overview - C2PA; How we're helping creators disclose altered or synthetic content - YouTube Blog

[23]See, e.g., About us - The Swedish Psychological Defence Agency (mpf.se) (a government agency which “identifies, analyses, prevents, and counters foreign malign information influence activities and other disinformation directed at Sweden or at Swedish interests.”)

[24]See, e.g., New York Times, supra, n.15

[26]See, e.g., Comment from Songwriters Guild of America, Society of Composers & Lyricists, and Music Creators North America (online at: https://www.regulations.gov/comment/COLC-2023-0006-10291)

[27]See, e.g., The Soul of Creativity in Copyright: From Inspiration to Information - R Street Institute (“The copyright clause aims to “promote the progress of science and useful arts,” suggesting that copyright law should facilitate—not hinder—innovation and creativity. By labeling the training of AI as copyright infringement, we risk contravening this constitutional directive, stifling the advancement of science and the arts. Copyright was conceived as a means to encourage creators by providing them with a limited privilege over their works, thereby incentivizing the production of new knowledge and artistic expressions. This incentive structure is predicated on the belief that creativity and innovation flourish when authors and artists can derive tangible benefits from their creations.”); see also Adam Thierer, “The Most Important Principle for AI Regulation,” R Street Institute, June 21, 2023, online at: https://www.rstreet.org/commentary/the-most-important-principle-for-ai-regulation/

 
