Copyright Content and AI training

Published July 11, 2023, New Jersey Law Journal

By Jonathan Bick | Jonathan Bick is counsel at Brach Eichler in Roseland, and chairman of the firm’s patent, intellectual property, and information technology group. He is also an adjunct professor at Pace and Rutgers Law Schools.

Traditional software differs from artificial intelligence (AI) in that programmers write the algorithms that tell a traditional program how to proceed, whereas AI writes its own algorithms. When writing algorithms (a process known as deep learning), AI must copy and process a significant amount of existing content. Since copyright automatically arises upon the creation of any sort of content (images, text or sound), AI copying without consent is likely to cause legal difficulties. Technical, legal and business options are available to ameliorate these AI training legal difficulties.

AI training starts with data and processes that data as follows. First, an AI model is given a set of training data and asked to make decisions based on that information. The data allows the AI to produce both correct and incorrect output. Each time the AI delivers an output, it is told whether the output is correct. The AI then repeats the process, adjusting its data processing steps so that it becomes more accurate by producing increasingly better algorithms (ordered data processing steps resulting in correct output).
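The feedback cycle described above can be illustrated with a very small sketch. It is a toy perceptron, not a deep learning system and not any particular vendor's training method; the data set (a logical OR table), the function names and the learning rate are all assumptions chosen for illustration.

```python
# Toy illustration of the training cycle: produce an output, be told whether
# it is correct, and adjust. All names and data here are hypothetical.

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(examples, epochs=20, learning_rate=0.1):
    """Repeat over the training data, adjusting after each incorrect output."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            # The "feedback": compare the output with the known correct answer.
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every output was correct, so stop repeating
            break
    return weights, bias


# Assumed training data: inputs paired with the correct output (logical OR).
training_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(training_data)
outputs = [predict(weights, bias, features) for features, _ in training_data]
print(outputs)
```

The point for present purposes is that the training data must be copied into the program and read repeatedly, which is the copying that raises the consent question.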

Once the initial training is completed, the second step of AI training is to validate the algorithm. In this phase, the AI validates the assumption that the algorithm it created yields acceptably correct output when using a new set of test data. If the output is accepted, the AI is finally tested using live data from real-world sources. If either the new set of test data or the real-world data yields unacceptable output, the training begins again with the first step.
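The validation step can be sketched in the same spirit. The example below assumes a deliberately simple one-number "model" (a cut-off value) and an arbitrary 90 percent accuracy requirement; the data, the names and the threshold are hypothetical and stand in for whatever acceptance standard a real project would set.

```python
# Toy illustration of validation: check the trained model on new test data,
# then on live data, and send it back to training if either check fails.

def fit_threshold(examples):
    """'Train' by placing a cut-off halfway between the two class averages."""
    zeros = [x for x, label in examples if label == 0]
    ones = [x for x, label in examples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, examples):
    """Fraction of examples for which the model's output is correct."""
    correct = sum(1 for x, label in examples
                  if (1 if x > threshold else 0) == label)
    return correct / len(examples)


def validate(threshold, test_data, live_data, required=0.9):
    """Return 'accepted', or report which stage sends the model back."""
    if accuracy(threshold, test_data) < required:
        return "retrain: failed on test data"
    if accuracy(threshold, live_data) < required:
        return "retrain: failed on live data"
    return "accepted"


# Assumed data sets: (measurement, correct label) pairs.
training_data = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
test_data = [(1.5, 0), (3.0, 0), (7.0, 1), (8.5, 1)]
live_data = [(0.5, 0), (4.0, 0), (6.0, 1), (9.5, 1)]

threshold = fit_threshold(training_data)
result = validate(threshold, test_data, live_data)
print(threshold, result)
```

Note that the test data and live data are separate bodies of content from the training data, so each may raise its own acquisition and consent questions.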

The primary legal difficulty associated with AI training is the acquisition and use of training data without the consent of the owner of said training data. The case of Getty Images v. Stability AI (U.S. District Court, District of Delaware Case 1:23-cv-00135-UNA filed 02/03/23) exemplifies the legal difficulties associated with AI training. 

Getty claims that Stability AI copied more than 12 million photographs from Getty Images’ collection, along with the associated captions and metadata, without permission from Getty Images and used the copied material in part to train its AI. More specifically, Getty Images makes hundreds of millions of visual assets available to customers via internet sites, such as gettyimages.com and istock.com, and Stability AI used the copied images to train its AI. 

Copying images without consent has resulted in several types of legal difficulties. These legal difficulties include unlawful acts pursuant to the Copyright Act of 1976, 17 U.S.C. Section 101 et seq., the Lanham Act, 15 U.S.C. Section 1051 et seq., as well as state trademark and unfair competition laws.

Contract violations may also result from copying images without consent. For example, the method that the Getty complaint alleges Stability AI used to access the Getty content violated the terms of use agreements for both the gettyimages.com and istock.com internet sites. Allegedly, Stability AI accessed Getty content via Getty Images’ public-facing websites. The Getty Images websites from which Stability AI allegedly copied images without permission are subject to express terms and conditions of use which, among other things, expressly prohibit (i) downloading, copying, or re-transmitting any or all of the website or its contents without a license; and (ii) using any data mining, robots or similar data gathering or extraction methods. As a result, a contract breach has allegedly occurred.

The first step in ameliorating AI training legal difficulties is the lawful acquisition of AI training data. From a legal perspective, securing consent to use existing content for AI training is one option. Generally, such consent is specific in nature and takes the form of a contract or license between the content owner and the content user. Some AI training content prepared by third parties does not require specific consent to use. For example, data.gov offers public government data sets that can be used to train machine learning models that help discover patterns, identify trends and detect anomalies.

From a technological perspective, AI training data may be acquired by creating it, such as by taking measurements of real-world physical occurrences using signals and digitizing them so that a computer and software may process them. For example, one may take a photo of an object, digitize that photo and upload it into the AI software. From a business perspective, AI training data may be acquired by contracting with a third party to secure or develop the required data as a work for hire.
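One practical way to act on these acquisition options is to record, for each item of training data, the basis on which it was acquired, and to hold back anything without a documented basis. The sketch below is only an illustration of that bookkeeping: the record fields, the basis labels and the example URLs are hypothetical, and whether a given basis is legally sufficient remains a question for counsel, not for the code.

```python
# Hypothetical provenance check: only items with a documented acquisition
# basis are cleared for training; everything else is held for review.
from dataclasses import dataclass


@dataclass
class TrainingItem:
    source: str  # where the item came from (illustrative URLs only)
    basis: str   # assumed label recording how the right to use it arose


# Assumed labels mirroring the options discussed above.
DOCUMENTED_BASES = {
    "license",             # contract or license with the content owner
    "public-government",   # openly published government data
    "self-created",        # measured or photographed by the trainer itself
    "work-for-hire",       # developed by a third party under contract
}


def partition(items):
    """Split items into those cleared for training and those held back."""
    cleared = [item for item in items if item.basis in DOCUMENTED_BASES]
    held = [item for item in items if item.basis not in DOCUMENTED_BASES]
    return cleared, held


items = [
    TrainingItem("https://example.com/licensed/photo-1.jpg", "license"),
    TrainingItem("https://example.com/open-data/census.csv", "public-government"),
    TrainingItem("https://example.com/scraped/photo-2.jpg", "unknown"),
    TrainingItem("https://example.com/studio/photo-3.jpg", "work-for-hire"),
]

cleared, held = partition(items)
print(len(cleared), "cleared;", len(held), "held for review")
```

Keeping such a record at the time of acquisition also makes it easier to show later which content was used with consent.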

Special training precautions are needed for generative AI. Generative AI is a type of AI system capable of generating text, images, or other media in response to prompts. Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics.

In particular, copyright owners may consider suing a user of generative AI software that has been trained using the copyright owner’s copyrighted data. This risk of exposure becomes higher when generative AI models are used to generate an image that is substantially similar to the copyrighted works of a particular visual artist, or when the output includes a watermark or other insignia indicating that the model was trained using copyrighted data of the visual artist or image source.

As individuals and entities adopt generative AI solutions, additional attention should be paid to understand the risks associated with the adoption of generative AI. Furthermore, individuals and entities who adopt generative AI solutions should establish policies that will help mitigate such risks.

For example, various generative AI software licenses or terms of use agreements require AI software users to indemnify the generative AI software developers and distributors. In addition to indemnification agreements, errors and omissions insurance coverage should be considered.

Additionally, if the AI software developer, AI software user or AI software distributor is an internet service provider, then protection via compliance with the Digital Millennium Copyright Act (DMCA), 112 Stat. 2860, should be contemplated. The DMCA’s principal innovation in the field of copyright is the exemption of internet service providers and other intermediaries from direct and indirect liability resulting from the unconsented use of another’s copyrighted content.