Wikipedia:Large language models and copyright
This is an essay. It contains the advice or opinions of one or more Wikipedia contributors. This page is not an encyclopedia article, nor is it one of Wikipedia's policies or guidelines, as it has not been thoroughly vetted by the community. Some essays represent widespread norms; others only represent minority viewpoints.
An LLM can generate copyright-violating material.[a] Generated text may include verbatim snippets from non-free content or be a derivative work. In addition, using LLMs to summarize copyrighted content (like news articles) may produce excessively close paraphrases.
The copyright status of LLMs trained on copyrighted material is not yet fully understood. Their output may not be compatible with the CC BY-SA license and the GNU Free Documentation License used for text published on Wikipedia.
Does LLM output inherently violate copyright law?
The copyright status of LLM-generated text is not defined by statute, so it is hard to make confident claims, but precedent exists for computer-generated art and other works created by non-humans. Here is what the U.S. Copyright Office has to say:[1]
- The Office will not register works produced by nature, animals, or plants. Likewise, the Office cannot register a work purportedly created by divine or supernatural beings, although the Office may register a work where the application or the deposit copy(ies) state that the work was inspired by a divine spirit.
- [...]
- Similarly, the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.
Not all jurisdictions take the same view. For example, the UK maintains that AI-generated works are normally copyrightable, and that the copyright is held by the operator/creator/custodian of the AI:[2]
- The UK is one of only a handful of countries to protect works generated by a computer where there is no human creator. The “author” of a “computer-generated work” (CGW) is defined as “the person by whom the arrangements necessary for the creation of the work are undertaken”. Protection lasts for 50 years from the date the work is made.
Both maintain the view that AI systems do not have legal personhood and cannot hold copyright in their own right.
Whether artificial neural networks are capable of producing original intellectual output is less a legal issue than a philosophical/anthropological one. Human brains are themselves neural networks; much has been said, in a variety of fields, about whether humans create original works or merely juxtapose and recombine motifs and concepts they have been exposed to through participation in society. While interesting (and humbling), these discussions are unrelated to whether neural networks that have been exposed to copyrighted material in the course of their existence are capable of later creating original works under the purview of intellectual property law: they are. If this were not the case, a large majority of creative work would be illegal; good luck finding a band in which none of the musicians has ever heard a copyrighted song.
In any case, there is no strong legal precedent regarding the specific circumstance of licensing LLM output under free licenses like Creative Commons or the GPL. On Wikimedia Commons, the template and copyright category for PD-algorithm assert that all algorithmically generated works are public domain: "This file is in the public domain because, as the work of a computer algorithm or artificial intelligence, it has no human author in whom copyright is vested".
Companies that offer public access and use of proprietary software do not automatically hold a claim to copyright on all works produced using their products merely because the products themselves are proprietary: Adobe does not own copyright on all works produced using Photoshop, and Apple does not own copyright on all photographs taken with iPhones. It is unlikely that OpenAI, for example, could require attribution or issue takedowns under the DMCA.
However, there are circumstances under which the terms and conditions of an API may cause a company to restrict continued access to the model based on adherence to certain criteria: this could include any number of things like providing attribution for LLM output, not using it to write racial slurs, not using it to write material critical of the company, et cetera. In these cases, it may be prudent to ask yourself whether it is worth getting your account at Honest Bob's LLMs suspended for using them to summarize rap lyrics or write about Honest Bob's SEC investigation.
Is LLM output capable of violating copyright law?
The answer here is obviously "yes". An example, prompted from ChatGPT:[b]
- What are the lyrics to "Moonage Daydream", by David Bowie?
Keep your 'lectric eye on me babe
Put your ray gun to my head
Press your space face close to mine, love
Freak out in a moonage daydream oh yeah
These are, indeed, the verbatim lyrics to an unambiguously copyrighted song. Note that the model was specifically asked to do this. Compare with this prompt, using the first couple of lines of the song:
- I'm an alligator / I'm a mama papa comin' for you / I'm a space invader / I'll be a rock-'n-rollin' bitch for you
I'm a force of nature, a wild beast / A creature of the swamps, a king of the feast / I'm a living legend, a natural high / 'Cause I'm an alligator, and I'll never die
While this would be a sick Judas Priest song, it isn't one; in fact, this text seems never to have been published.
It bears mentioning that many human beings also know the lyrics to "Moonage Daydream" and are capable of typing them into the edit box with no assistance whatsoever; doing this in an article would still be a copyright violation. For LLMs, the critical point is that output needs to be checked: it is not safe to assume that generated text is always novel.
Keep in mind that, while these examples were blindingly obvious, real cases may be less so. If you are producing a large amount of text, it is a good idea to run a few snippets through a search engine, on the off-chance that the model has coincidentally duplicated previously published material.
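One low-tech way to do this spot-check, sketched below in Python, is to pull a few mid-length snippets out of the generated text and turn them into exact-phrase web searches that an editor can open manually. This is an illustrative sketch, not part of any Wikipedia tool; the snippet length, number of samples, and search URL are arbitrary assumptions.

import random
import re
from urllib.parse import quote_plus

def snippet_search_urls(generated_text: str, num_snippets: int = 3,
                        words_per_snippet: int = 7) -> list[str]:
    """Build quoted ("exact phrase") search URLs for a few random snippets
    of LLM output, so an editor can check for verbatim duplication."""
    # Keep sentences long enough to be distinctive rather than generic.
    sentences = re.split(r"(?<=[.!?])\s+", generated_text)
    candidates = [s for s in sentences if len(s.split()) >= words_per_snippet]
    urls = []
    for sentence in random.sample(candidates, min(num_snippets, len(candidates))):
        words = sentence.split()
        start = random.randrange(len(words) - words_per_snippet + 1)
        snippet = " ".join(words[start:start + words_per_snippet])
        # Most search engines treat a quoted string as an exact-phrase query.
        urls.append("https://www.google.com/search?q=" + quote_plus(f'"{snippet}"'))
    return urls

if __name__ == "__main__":
    sample = ("Keep your 'lectric eye on me babe. "
              "Put your ray gun to my head. "
              "Press your space face close to mine, love.")
    for url in snippet_search_urls(sample):
        print(url)

If quoted searches for several such snippets turn up published pages, treat the passage as suspect and rewrite or discard it.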
Apart from the possibility that saving LLM output may carry verbatim non-free content over into the article, these models can produce derivative works. For example, an LLM can rephrase a copyrighted text using fewer, the same, or more words than the original – editors should mind the distinction between a summary and an abridgement.
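There is no mechanical test for when a rephrasing crosses into a derivative work, but a rough heuristic can flag output that tracks its source too closely: count how many word n-grams the generated text shares with the material it was asked to summarize. The sketch below is purely illustrative; the n-gram length and warning threshold are arbitrary assumptions, not a legal standard.

import re

def ngrams(text: str, n: int = 5) -> set:
    """Return the set of lowercased word n-grams in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(source: str, generated: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams that also appear in the source."""
    generated_grams = ngrams(generated, n)
    if not generated_grams:
        return 0.0
    return len(generated_grams & ngrams(source, n)) / len(generated_grams)

if __name__ == "__main__":
    source = ("The committee voted on Tuesday to approve the new zoning plan, "
              "which had been debated for several months before the final vote.")
    paraphrase = ("On Tuesday the committee voted to approve the new zoning plan, "
                  "which had been debated for several months.")
    ratio = overlap_ratio(source, paraphrase)
    if ratio > 0.2:  # arbitrary illustrative threshold
        print(f"Warning: {ratio:.0%} of the paraphrase's 5-grams appear verbatim in the source")

A high ratio does not prove a copyright problem, and a low one does not rule one out; it only marks passages worth a closer manual comparison.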
Notes
- ^ This also applies to cases in which the AI model is in a jurisdiction where works generated solely by AI are not copyrightable, although with very low probability.
- ^ Note: This is of course not a copyright violation in itself; it would be if it were published in such a way as to deprive the copyright owner of income. Moreover, it is not even plagiarism, since the lyrics are attributed to Bowie.
References
edit- ^ "Compendium of U.S. Copyright Office Practices, § 313.2" (PDF). United States Copyright Office. 22 December 2014. p. 22. Retrieved 18 January 2023.
- ^ "Artificial Intelligence and Intellectual Property: copyright and patents". GOV.UK. Retrieved 2023-03-19.