
AI companies such as Google, Anthropic, OpenAI, and Meta may have found a way to obtain training data from paywalled publishers like the New York Times, Wired, and the Washington Post. A report by Alex Reisner in The Atlantic indicates that these firms have worked with the Common Crawl Foundation, a nonprofit that scrapes the internet to build a massive public archive for research purposes. Common Crawl's database, which spans several petabytes, has allegedly allowed AI companies to train their models on paywalled content. Common Crawl, however, has denied the allegations in a recent blog post.
The foundation's website says its data comes from freely available webpages. Richard Skrenta, Common Crawl's executive director, told The Atlantic that AI models should be able to access everything on the internet, saying, "The robots are people too."
AI chatbots like ChatGPT and Google Gemini have created a crisis in the journalism industry by delivering information directly to readers, cutting traffic to publishers. The phenomenon has been dubbed the "traffic apocalypse" and "AI armageddon." (Disclosure: Ziff Davis, Mashable's parent company, has filed a lawsuit against OpenAI, alleging copyright infringement.)
Some news publishers have blocked Common Crawl's scraper, typically by adding rules to their site's robots.txt file (an example appears below), but that only protects future content. Many publishers have asked to have their material removed from Common Crawl's archives, but the process has been slow, and no takedown request appears to have been completed since 2016.
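For context, the standard way to turn away a specific crawler is a robots.txt rule naming its user agent, and Common Crawl documents its crawler's user agent as CCBot. A minimal sketch of such a rule, which asks CCBot to skip an entire site (and, as noted above, does nothing about pages already archived):

    User-agent: CCBot
    Disallow: /

Compliance with robots.txt is voluntary on the crawler's side, though Common Crawl says CCBot honors it.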
Skrenta explained that the file format used to store the archives is "designed to be immutable," making deletion of content infeasible. Even so, Reisner reports that Common Crawl's public search tool returns misleading results for certain domains.
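Common Crawl distributes its crawls as WARC (Web ARChive) files, a format in which records are written sequentially into large compressed archives rather than stored in an editable database; deleting one page effectively means rewriting the file around it. As a rough illustration of that structure, here is a minimal sketch using the open-source warcio Python library to walk the records in a WARC file (the filename is a placeholder, not an actual Common Crawl artifact):

    from warcio.archiveiterator import ArchiveIterator

    # Placeholder filename; Common Crawl publishes its crawl data
    # as large gzipped WARC files of this general shape.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            # Records are laid down one after another. There is no
            # in-place update or delete; removing a page would mean
            # rewriting the archive without that record.
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))

Each "response" record pairs WARC headers (like the target URI) with the captured page itself, which is why an archive can be read front to back but not edited piecemeal.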
Mashable contacted Common Crawl, and a representative pointed to a public blog post by Skrenta. In it, Skrenta rejected the claim that the foundation had misled publishers and said Common Crawl does not bypass paywalls. He emphasized the foundation's financial independence and denied doing "AI's dirty work."
The blog post asserts, “The Atlantic makes numerous false and misleading assertions regarding the Common Crawl Foundation, including the claim that our organization has ‘deceived publishers’ about our practices.” It continues, “Our web crawler, known as CCBot, gathers data from publicly accessible web pages. We do not go ‘behind paywalls,’ do not log in to any sites, and do not utilize any strategies aimed at circumventing access limitations.”
Still, Reisner notes that Common Crawl has accepted donations from AI companies such as OpenAI and Anthropic, and names NVIDIA as a "collaborator" on its website. The foundation also helps assemble and distribute AI training datasets.
The broader debate over AI's use of copyrighted content continues. OpenAI currently faces multiple lawsuits from major publishers, including the New York Times and Mashable's parent company, Ziff Davis.