IP Challenges of AI Training on Scraped Data

The OECD report, “Intellectual property issues in artificial intelligence trained on scraped data,” published in February 2025, analyzes the interplay between artificial intelligence (AI) and intellectual property (IP) rights, with a particular focus on data scraping for AI training. The paper aims to inform policymakers about the role of data scraping, existing legal frameworks, diverse stakeholder views, and potential policy avenues, and it emphasizes the need for a balanced approach that supports AI innovation while safeguarding IP protections.

TL;DR

In essence, the OECD report provides a timely and detailed analysis of the complex IP challenges arising from AI’s reliance on scraped data. It underscores the critical need for clear legal frameworks, adaptable licensing solutions, enhanced transparency, and international collaboration to foster responsible AI innovation while safeguarding the rights of intellectual property holders.

The Role of Data Scraping in AI Training

Data scraping, or web scraping, involves the automated extraction of vast quantities of data from online sources. This practice is fundamental to training advanced AI models, especially generative AI, which rely on extensive and diverse datasets—including text, images, audio, and video—to learn patterns, understand context, and generate new content. The scale and variety of data required make automated scraping a practical necessity for AI developers, enabling the rapid development and refinement of AI systems across various domains like natural language processing and computer vision.

However, a core challenge arises because much of this online data is protected by intellectual property rights, primarily copyright. The indiscriminate collection of such data through automated means raises significant legal and ethical questions regarding infringement, fair use, and the rights of content creators.

Legal Frameworks Governing Data Scraping and AI Training

The report meticulously examines the key legal frameworks applicable to data scraping for AI training, acknowledging the jurisdictional variations in their interpretation and application.

1. Copyright Law: Copyright is the most significant IP right in this context, protecting original works of authorship.

Originality and Fixation: For scraped content to be copyrightable, it must be original and fixed in a tangible medium. While complete works like articles or images clearly meet this, the copyright status of snippets or raw data points extracted for training can be ambiguous.
Infringement: Copying, distributing, or adapting copyrighted works without permission constitutes infringement. Data scraping invariably involves making copies. A central legal debate concerns whether copying for AI training, where the output may not directly reproduce the input, still constitutes infringement.
Exceptions and Limitations: Many jurisdictions provide exceptions to copyright infringement to balance creators’ rights with public interest.
- Fair Use (United States): This flexible doctrine considers the purpose and character of the use (e.g., transformative nature), the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the potential market. AI training is often argued to be transformative, but the commercial nature of many AI applications can complicate this.
- Fair Dealing (e.g., UK, Canada): Generally more prescriptive than fair use, fair dealing exceptions apply to specific purposes (e.g., research, criticism). Their applicability to AI training is less clear and often debated.
- Text and Data Mining (TDM) Exceptions (e.g., EU, Japan, Singapore): Some jurisdictions have enacted specific TDM exceptions to facilitate AI development. The EU Directive on Copyright in the Digital Single Market (DSM Directive) includes an exception for TDM carried out for scientific research (Article 3) and a broader exception for other TDM (Article 4). Rights holders can opt out of the broader exception by expressly reserving their rights. The effectiveness and enforcement of these opt-out provisions remain areas of discussion.
AI Outputs and Originality: A significant challenge is determining if AI-generated outputs, derived from copyrighted training data, constitute copyright infringement. Current legal frameworks struggle to trace specific inputs to specific outputs in complex AI models, especially when the AI generates novel content. The copyrightability of AI-generated content itself is also a developing legal area.
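
The opt-out mechanisms mentioned under the TDM exceptions above are often discussed in terms of machine-readable signals. The report summary does not prescribe a specific mechanism, so the following is only a minimal sketch under one assumption: that a site expresses its reservation through robots.txt rules. The crawler name ExampleAIBot, the example URL, and the helper may_fetch are hypothetical, and whether a robots.txt rule counts as a valid legal opt-out in a given jurisdiction is a legal question this code does not answer.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: it only evaluates robots.txt rules and says nothing
    about copyright status, licences, or terms of service.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt: one (made-up) AI crawler is excluded from the whole
# site, while every other user agent is allowed.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_fetch(EXAMPLE_ROBOTS, "ExampleAIBot", page))  # False
    print(may_fetch(EXAMPLE_ROBOTS, "OtherBot", page))      # True
```

A scraper that respects such signals would call a check like this before every request and skip any URL for which it returns False. The sketch also shows why enforcement is debated: the check is voluntary on the crawler's side.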

2. Database Rights: The European Union has distinct sui generis database rights, protecting substantial investments in database creation, even if the content itself isn’t copyrightable. This right can prevent unauthorized extraction and re-utilization of significant parts of a database, offering another layer of protection for rights holders in the EU.

3. Contract Law: Terms of Service (ToS) or Terms of Use (ToU) on websites often prohibit automated scraping. The enforceability of these contracts against scrapers, particularly “browse-wrap” agreements (where agreement is implied by usage) versus “click-wrap” agreements (requiring explicit assent), varies significantly in judicial interpretation.

4. Privacy and Data Protection Law: Laws like the EU General Data Protection Regulation (GDPR) are crucial when scraped data includes personal information. Training AI on such data must rest on a lawful basis for processing (for example, consent) and comply with principles such as purpose limitation and data minimization, adding a critical layer of regulatory compliance for AI developers.

5. International Frameworks: International treaties such as the Berne Convention and the WIPO Copyright Treaty (WCT) provide foundational IP protections but allow member states flexibility in implementing exceptions, contributing to the global variations in TDM laws. The “three-step test,” which originated in the Berne Convention and was carried into the WCT, guides the creation of exceptions: they must be limited to certain special cases, must not conflict with the normal exploitation of the work, and must not unreasonably prejudice rights holders’ legitimate interests.

Stakeholder Perspectives

The report highlights the often-conflicting perspectives of key stakeholders:

Content Creators and Rights Holders: They argue that unauthorized scraping devalues their copyrighted works, undermines their ability to monetize content, and diminishes control over usage. They fear AI-generated content, trained on their material, could directly compete without proper attribution or compensation. They advocate for stronger IP protections, clear licensing, and fair remuneration.
AI Developers and Researchers: They contend that extensive, diverse datasets are essential for developing robust and effective AI models, and overly restrictive IP regimes could stifle innovation. They often argue that training is a transformative use and seek broad TDM exceptions and streamlined data acquisition processes. They also highlight the technical difficulties in tracing specific inputs to AI outputs.

Preliminary Considerations and Potential Policy Approaches

The report concludes by outlining key considerations and potential policy approaches, emphasizing the need for a multi-faceted and internationally coordinated response:

Clarifying Legal Interpretations: There’s an urgent need for legal clarity on how existing IP laws, especially copyright exceptions, apply to AI training and data scraping. This includes defining “transformative use” in the AI context.
Facilitating Licensing Mechanisms: Promoting voluntary licensing frameworks is crucial to enable AI developers to legitimately acquire copyrighted data while ensuring fair compensation for rights holders. Collective licensing or data marketplaces could offer legal certainty.
Promoting Transparency and Traceability: Increased transparency in AI development, such as disclosing training data sources, is vital. Technical measures like watermarking or metadata could aid in tracing data origins, though implementation faces challenges. The OECD AI Principle 1.5, emphasizing accountability and traceability, is a guiding framework.
Revisiting Exceptions and Limitations (TDM): Policymakers should assess whether current TDM exceptions are adequate for generative AI, considering the balance between fostering innovation and protecting rights, and the efficacy of opt-out mechanisms.
International Cooperation: Given the global nature of AI and data flows, international collaboration is essential to prevent regulatory fragmentation and develop consistent approaches to IP in AI. This involves dialogue and sharing best practices among nations and international bodies.
Balancing Innovation and Protection: The overarching goal is to strike a balance between the two. Overly strict IP regimes could impede AI progress, while insufficient protection could undermine the creative industries that produce the content AI systems are trained on. Policies must enable both technological advancement and a thriving creative ecosystem.
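
The transparency and traceability point above mentions metadata as one way to trace data origins. As a rough illustration only, the sketch below attaches a provenance record to each scraped item, using a SHA-256 content hash so that a stored item can later be matched against its recorded source. The record fields, the function names, and the example URL are all assumptions made for this illustration. They are not drawn from the OECD report or from any standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-item metadata a dataset builder might keep."""
    source_url: str    # where the item was collected
    retrieved_at: str  # ISO 8601 timestamp supplied by the caller
    license_note: str  # free-text licence or rights information, if known
    sha256: str        # hex digest of the exact bytes that were stored


def make_record(content: bytes, source_url: str, retrieved_at: str,
                license_note: str = "unknown") -> ProvenanceRecord:
    """Build a provenance record for one scraped item."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(source_url, retrieved_at, license_note, digest)


def matches(record: ProvenanceRecord, content: bytes) -> bool:
    """Check whether content is byte-identical to what the record describes."""
    return hashlib.sha256(content).hexdigest() == record.sha256


# Illustrative item; the text and URL are made up.
record = make_record(
    b"Example article text.",
    source_url="https://example.com/articles/1",
    retrieved_at="2025-02-01T00:00:00Z",
)

if __name__ == "__main__":
    # A manifest of such records could be published as a training-data disclosure.
    print(json.dumps(asdict(record), indent=2))
```

A hash only identifies byte-identical copies, so it helps with disclosure and auditing of inputs. It does not address the harder problem noted above of tracing a specific model output back to specific training items.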