Copyright Problems Arise from Generative Artificial Intelligence

Since ChatGPT’s launch in November 2022 by OpenAI, followed closely by other generative artificial intelligence (GenAI) platforms from various providers, these platforms have remained consistently in the public eye. In direct response to user-input prompts, these GenAI algorithms can craft novel content, including text, images, videos, audio, and computer code. These AI systems have already showcased their immense potential for generating fresh content, offering positive and negative possibilities that could significantly impact humanity. For instance, ChatGPT can seamlessly formulate responses in a natural, human-like language to any user inquiry or request, while DALL-E can conjure images based on user-provided prompts. Consequently, GenAI has the potential to enhance the capabilities of creative and other industries, streamlining the process of generating new material swiftly and effectively, thereby boosting productivity and efficiency. However, it also raises concerns about its potential to render human creators largely obsolete, casting a shadow of existential threat over their livelihood.

Regardless of one's perspective on GenAI, it is undeniable that numerous legal complexities are intertwined with it, with intellectual property concerns, particularly copyright, standing out. Prominent GenAI systems such as ChatGPT, which is built on a large language model, and DALL-E, an image generator, undergo training on extensive repositories of human-written language, textual passages, and graphical information sourced through web scraping. These repositories include books, essays, articles, Wikipedia entries, images, and other web pages. For instance, the training of OpenAI's GPT-3 model entailed roughly 570 GB of data derived from diverse internet content sources, amounting to roughly 300 billion words. This number will likely increase with subsequent iterations. This training equips these models to "comprehend" diverse subjects and formulate responses to human-input prompts in a conversational manner that emulates human cognition and speech, creating novel content based on the patterns discerned within the dataset used for training.

GenAI introduces copyright concerns at two distinct junctures: during training data integration into the AI model (input) and upon the production of AI-generated content derived from the training dataset (output).

Input

As previously mentioned, the extensive data used for training (“machine learning”) is sourced from the internet, copied, and input into the AI software. It encompasses various protected materials, including entire books. Utilizing such content without obtaining consent from the copyright owners could potentially constitute a breach of the exclusive rights afforded to authors under 17 U.S.C. § 106. Given the sheer volume of data extracted from the internet for training purposes, it is impractical for AI developers to individually contact every copyright holder to secure permission or engage in licensing negotiations to use their copyrighted material. Instead, AI developers assert that their use of such material falls under the umbrella of fair use.

The doctrine of fair use is codified in 17 U.S.C. 107. It permits the usage of copyrighted material belonging to others without copyright violation, subject to specific circumstances and the consideration of factors delineated within that section. One of these factors pertains to “the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.” If the purpose serves nonprofit educational objectives, it leans toward being deemed fair use. Naturally, this assessment must be weighed against findings derived from the other factors, as no single factor can be determinative.

In Thomson Reuters Enterprise Centre GmbH et al. v. ROSS Intelligence Inc., 20-cv-00613 (D. Del.), both parties submitted cross-motions for summary judgment regarding the defense of fair use. This case revolves around allegations that ROSS utilized proprietary content from Thomson's Westlaw research database without proper authorization, employing it to train its own GenAI legal research tool. ROSS contends that its use falls under the fair use doctrine, asserting that it is primarily for research purposes in developing its generative AI model. Additionally, ROSS argues that its utilization of headnotes and key numbers from the Westlaw database does not negatively impact the marketability of Westlaw's database. The assessment of how such use affects the market for copyrighted material constitutes another pivotal factor in determining fair use.

Thomson presents a counterargument, asserting that ROSS's intent is geared towards creating a competitive commercial legal research tool, thus harming the market value of Westlaw's subscription services. A usage that undermines the market worth of copyrighted material would tend to disfavor a fair use designation. Another significant factor considered in fair use evaluation is the nature of the copyrighted work itself, whether it is of a creative or factual nature. The court's decision is expected to draw significant attention.

Besides the Thomson case, which involves identified parties and specific data, Tremblay v. OpenAI, Inc., case 3:23-cv-03223 (N.D. Cal.), includes as-yet unidentified plaintiffs. This lawsuit, filed on June 28, 2023, centers around two named plaintiffs who are authors. They have brought a class action against OpenAI, representing themselves and “[A]ll persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models ---.” The plaintiffs allege that OpenAI infringed the copyright in their duly registered books and in those of other as-yet-unidentified authors. This infringement purportedly occurred when OpenAI incorporated these books into the ChatGPT training dataset without obtaining the authors' authorization or providing them with due compensation. Under the Copyright Act, they are seeking statutory and actual damages, along with attorneys' fees. Furthermore, they seek a permanent injunction to restrain OpenAI from engaging in such activities.

Another lawsuit alleging copyright infringement in the training of large language models is Silverman et al. v. OpenAI, Inc. et al., case 3:23-cv-03416 (N.D. Cal.), filed on July 7, 2023. This case garnered heightened media attention, potentially because of the celebrity status of one of the plaintiffs, Sarah Silverman, a distinguished writer and performer. In this instance, the plaintiffs, including Sarah Silverman, assert claims on their own behalf and on behalf of “all others similarly situated.” They allege extensive copyright infringement by OpenAI through the use of their copyrighted books as part of a training dataset for its models.

Output

Regarding the output produced by GenAI, copyright concerns are not limited to the input data used for training; they extend to the content generated by GenAI itself. The results generated by GenAI in response to prompts could potentially encompass excerpts or substantial segments of the underlying training data. If the training dataset is subject to copyright protection, reproduced or integrated portions appearing in the AI's output without the explicit authorization of the copyright holder could amount to copyright infringement. Additionally, even if the reproduced content is somewhat altered, output containing modified versions of the training data may be classified as a derivative work. Once more, without the author's consent, such output could infringe the author's exclusive right to prepare derivative works.

In late 2022, a pair of unidentified programmers filed a lawsuit against GitHub, Inc., acting on their own behalf and on behalf of “all others similarly situated.” The case, J. Doe1 and J. Doe2 v. GitHub, Inc., case 4:22-cv-06823-JST (N.D. Cal.), pertains to Copilot, an AI tool designed for software development tasks. Copilot generates code suggestions in real time as users write code, resembling an auto-completion function.

The plaintiffs asserted that the output produced by Copilot frequently incorporated copyrighted content from its training data, sometimes even reproducing portions verbatim. They alleged Copilot failed to adhere to the stipulations outlined in relevant licenses, including open-source licenses, which governed the inclusion of code lines from GitHub's public repositories. This noncompliance encompassed the omission of copyright notices and attributions to the code's original authors and the neglect to incorporate the terms and conditions of the applicable licenses into the generated output.

In response, the defendant contended that the GitHub license extended the authority to the company to store, employ, and distribute content from public repositories to other users. While the court dismissed the plaintiffs' claims for damages, it allowed them to pursue injunctive relief concerning instances of reproduction lacking proper attribution. Although this case leaned more toward asserting a breach of contract rather than copyright infringement, it serves as a reminder that copyright issues are an ever-present concern in the realm of GenAI.

Another copyright-related facet concerning AI-generated output revolves around whether such resulting products qualify for copyright protection under U.S. copyright law. The Compendium of U.S. Copyright Office Practices outlines that the Office will grant registration to an original work of authorship if it “was created by a human being.” It further clarifies that the Copyright Office will decline to register a claim if it determines that a human being did not contribute to the creation of the work since copyright law is confined to the "original intellectual conceptions of the author."

Operating within this framework, the U.S. Copyright Office (USCO) declined to register an image titled "A Recent Entrance to Paradise," which was entirely generated by an AI system called the "Creativity Machine," an invention of Stephen Thaler. Subsequently, Thaler initiated legal action against Shira Perlmutter, the Register of Copyrights and Director of the USCO. Thaler filed a motion for summary judgment, urging the court to direct the USCO to reverse its decision to withhold copyright registration for the work. In response, the defendant submitted a cross-motion for summary judgment.

With no disputed factual elements, the central question before the court was whether AI-generated works (formed without human intervention) are legally eligible for copyright protection. On August 18, 2023, the U.S. District Court for the District of Columbia granted summary judgment in favor of the USCO. The court affirmed that copyright protection could not be extended to a work exclusively generated by an autonomous AI program, as it lacked the human authorship essential for copyright protection. However, the court left open the question of whether a human user of AI could secure copyright for content created by the AI, provided that the user substantially contributed to the work's creation through intricate prompts and additional post-production work.

The situation grew even more intricate with "Zarya of the Dawn," a graphic novel crafted by an artist who incorporated images generated by Midjourney, an AI specializing in image creation, to complement her written content. Initially, the USCO granted copyright registration for the entire work. However, upon realizing that a segment of the work was AI-generated, the USCO rescinded copyright protection for that specific portion.

The USCO's determination was grounded in the notion that the images produced by Midjourney lacked sufficient human involvement in their creation process, rendering them ineligible for copyright protection. The AI's creation process operated autonomously following the input of text prompts, and the user had no control over the ensuing creative process. Anticipating the AI's specific image outputs in response to prompts was also unfeasible.

Nevertheless, the USCO did acknowledge that copyright protection could extend to the artist's original textual content and the specific arrangement of text in conjunction with select images chosen from the AI's creations. In these instances, the requisite human creative control was deemed present. As a result, the USCO concluded that only the artist's original expressive material met the copyright registration criteria.

Undoubtedly, copyright in the GenAI era is a rapidly evolving area of law. To provide clarity, the Copyright Office has issued a policy statement outlining its approach to examining and registering works containing AI-generated content, including how the requirement of human authorship is interpreted in this context. In line with this effort, the USCO held a series of listening sessions in 2023, inviting participants to voice their expectations, reservations, and questions concerning the intersection of GenAI and copyright law.

Furthermore, as part of its commitment to delve into copyright-related aspects raised by AI technology, the USCO has conducted webinars. These webinars have been instrumental in exploring issues encompassing the use of copyrighted materials in AI training and delving into the parameters of copyright protection for works originating from AI tools. This multifaceted initiative underscores the Copyright Office's dedication to navigating the intricate terrain of copyright law as it intertwines with the dynamic landscape of AI advancements.

In the rapidly evolving world of GenAI, where the coverage since ChatGPT's introduction has been nothing short of enthusiastic, viewpoints diverge on the application of existing copyright law and the necessary legal evolution to align with GenAI's rapid progress. Amidst this dynamic landscape, one certainty emerges: the current state of affairs leaves several critical questions unanswered.

Presently, the legality of using data extracted from the internet to train AI models without the authorization of copyright holders remains unresolved. The question of whether GenAI's creations are eligible for copyright protection is equally unsettled. We anticipate additional guidance from the judiciary, the U.S. Copyright Office, and Congress.

Should you have questions about protecting your copyrights, or if you are using generative AI content and wish to steer clear of potential copyright infringement allegations, feel free to contact Veterans Advocacy Law Group. Our team of Intellectual Property attorneys will assist you in preserving your established copyrights and registering new ones. Furthermore, if you use generative AI systems to create AI art or content, we will guide you in using AI appropriately and in strategies for avoiding copyright problems.