IP Challenges of AI Training on Scraped Data

2025-05-22 | 7 min | 1224 words | Jonas

The OECD report “Intellectual property issues in artificial intelligence trained on scraped data”, published in February 2025, analyzes the interplay between artificial intelligence (AI) and intellectual property (IP) rights, with a particular focus on data scraping for AI training. The paper aims to inform policymakers about the role of data scraping, existing legal frameworks, diverse stakeholder views, and potential policy avenues. It emphasizes the need for a balanced approach that supports AI innovation while safeguarding IP protections.

TL;DR

The OECD report gives a timely and detailed analysis of the IP challenges arising from AI’s reliance on scraped data. It underscores the need for clear legal frameworks, adaptable licensing solutions, enhanced transparency, and international collaboration to foster responsible AI innovation while safeguarding the rights of intellectual property holders.

The Role of Data Scraping in AI Training

Data scraping, or web scraping, involves the automated extraction of vast quantities of data from online sources. This practice is fundamental to training advanced AI models, especially generative AI, which rely on extensive and diverse datasets—including text, images, audio, and video—to learn patterns, understand context, and generate new content. The scale and variety of data required make automated scraping a practical necessity for AI developers, enabling the rapid development and refinement of AI systems across various domains like natural language processing and computer vision.

However, a core challenge arises because much of this online data is protected by intellectual property rights, primarily copyright. The indiscriminate collection of such data through automated means raises significant legal and ethical questions regarding infringement, fair use, and the rights of content creators.
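To make the contrast with indiscriminate collection concrete, here is a minimal sketch of a scraper checking a site's robots.txt rules before fetching a page. It uses Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the URLs are hypothetical, and the helper `is_fetch_allowed` is introduced only for illustration. robots.txt is a voluntary convention, so honoring it says nothing about whether a given use is lawful under copyright, database, contract, or privacy rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this
# from the site's /robots.txt path instead of hard-coding it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/x"),
    ]:
        print(agent, url, is_fetch_allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

In this sketch the site blocks the hypothetical AI crawler everywhere, while other crawlers are only kept out of `/private/`. A check like this is a technical courtesy signal, not a substitute for the legal analysis described in the following sections.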

Legal Frameworks Governing Data Scraping and AI Training

The report meticulously examines the key legal frameworks applicable to data scraping for AI training, acknowledging the jurisdictional variations in their interpretation and application.

1. Copyright Law: Copyright is the most significant IP right in this context, protecting original works of authorship.

2. Database Rights: The European Union has distinct sui generis database rights, protecting substantial investments in database creation, even if the content itself isn’t copyrightable. This right can prevent unauthorized extraction and re-utilization of substantial parts of a database, offering another layer of protection for rights holders in the EU.

3. Contract Law: Terms of Service (ToS) or Terms of Use (ToU) on websites often prohibit automated scraping. The enforceability of these contracts against scrapers, particularly “browse-wrap” agreements (where agreement is implied by usage) versus “click-wrap” agreements (requiring explicit assent), varies significantly in judicial interpretation.

4. Privacy and Data Protection Law: Laws like the GDPR are crucial when scraped data includes personal information. Training AI on such data requires a lawful basis for processing (consent is one possible basis) and must comply with principles such as purpose limitation and data minimization, adding a critical layer of regulatory compliance for AI developers.

5. International Frameworks: International treaties such as the Berne Convention and the WIPO Copyright Treaty (WCT) provide foundational IP protections but allow member states flexibility in implementing exceptions, which contributes to global variation in text and data mining (TDM) laws. The “three-step test”, which originates in the Berne Convention and is carried over into the WCT, guides the creation of exceptions: they must be limited to certain special cases, must not conflict with normal exploitation of the work, and must not unreasonably prejudice rights holders’ legitimate interests.

Stakeholder Perspectives

The report highlights the often-conflicting perspectives of key stakeholders.

Preliminary Considerations and Potential Policy Approaches

The report concludes by outlining key considerations and potential policy approaches, emphasizing the need for a multi-faceted and internationally coordinated response:

  1. Clarifying Legal Interpretations: There’s an urgent need for legal clarity on how existing IP laws, especially copyright exceptions, apply to AI training and data scraping. This includes defining “transformative use” in the AI context.
  2. Facilitating Licensing Mechanisms: Promoting voluntary licensing frameworks is crucial to enable AI developers to legitimately acquire copyrighted data while ensuring fair compensation for rights holders. Collective licensing or data marketplaces could offer legal certainty.
  3. Promoting Transparency and Traceability: Increased transparency in AI development, such as disclosing training data sources, is vital. Technical measures like watermarking or metadata could help trace data origins, though implementation faces challenges. OECD AI Principle 1.5 on accountability, which includes traceability, serves as a guiding reference.
  4. Revisiting Exceptions and Limitations (TDM): Policymakers should assess whether current TDM exceptions are adequate for generative AI, considering the balance between fostering innovation and protecting rights, and the efficacy of opt-out mechanisms.
  5. International Cooperation: Given the global nature of AI and data flows, international collaboration is essential to prevent regulatory fragmentation and develop consistent approaches to IP in AI. This involves dialogue and sharing best practices among nations and international bodies.
  6. Balancing Innovation and Protection: The overarching goal is to strike a balance between the two. Overly strict IP regimes could impede AI progress, while insufficient protection could undermine the creative industries that produce the content AI models learn from. Policies must enable both technological advancement and a thriving creative ecosystem.
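
As an illustration of the metadata-based traceability mentioned in point 3, the following sketch records where each training document came from, along with a content hash. It is not a format proposed by the report. The `ProvenanceRecord` fields, the helper names, and the example URL and license note are all assumptions made for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-document provenance entry for a training dataset."""
    source_url: str
    retrieved_at: str   # ISO 8601 timestamp supplied by the caller
    license_note: str   # free-text note on the licence or terms observed
    sha256: str         # hash of the exact bytes that were collected


def make_record(content: bytes, source_url: str,
                retrieved_at: str, license_note: str) -> ProvenanceRecord:
    """Build a record whose hash ties the metadata to the collected bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(source_url, retrieved_at, license_note, digest)


def to_manifest(records: list[ProvenanceRecord]) -> str:
    """Serialize records to JSON so the manifest can be published or audited."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


if __name__ == "__main__":
    record = make_record(
        content=b"Example article text.",
        source_url="https://example.com/articles/1",   # hypothetical source
        retrieved_at="2025-05-22T10:00:00Z",
        license_note="terms of use not reviewed",       # placeholder note
    )
    print(to_manifest([record]))
```

A manifest like this only documents what was collected and from where. Whether the collection was permitted still depends on the legal questions discussed above.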