The AI start-up Anthropic is accused of aggressively collecting data from websites to train its systems and, according to those affected, may have violated the publishers’ terms of use.
AI developers rely on ingesting massive amounts of data from a variety of sources to build large language models. This technology is behind chatbots like OpenAI’s ChatGPT and its competitor Claude from Anthropic.
Anthropic was founded by a group of former OpenAI researchers with the promise of developing “responsible” AI systems.
However, Matt Barrie, CEO of Freelancer.com, accused the San Francisco-based company of being “by far the most aggressive scraper” of his freelance portal, which receives millions of visits every day.
Other web publishers share Barrie’s concerns that Anthropic is flooding their sites and ignoring their instructions to stop collecting content to train its models.
Freelancer.com received 3.5 million visits in four hours from a web “crawler” linked to Anthropic, according to data obtained by the Financial Times. That gives Anthropic “probably about five times the volume of the second-largest” AI crawler, Barrie said.
Visits from the bot continued to increase even after Freelancer.com tried to refuse its access requests using the standard web protocols for controlling crawlers, he added. After that, Barrie decided to block traffic from Anthropic’s internet addresses entirely.
“We had to block them because they don’t follow the rules of the internet,” said Barrie. “This is outrageous scraping [which] slows down the site for everyone who uses it and ultimately impacts our revenue.”
Anthropic said it was investigating the case and respected the publishers’ wishes, saying it did not want to be “intrusive or disruptive.”
Scraping publicly available data from across the web is generally legal, but the practice is controversial, can violate websites’ terms of service, and can be costly for site hosts.
Kyle Wiens, CEO of iFixit.com, said his electronics repair website received a million hits from Anthropic’s bots in 24 hours. “We have a lot of alerts [for high traffic]; people are being woken up at 3 a.m. That sets off all our alarms,” he said.
iFixit’s terms of service prohibit the use of its data for machine learning, Wiens said. “My first message to Anthropic is: If you use this to train your model, that’s illegal. My second is: That’s not polite behavior on the internet. Crawling is a matter of etiquette.”
Websites use a protocol called robots.txt to keep crawlers and other web robots away from parts of their websites, but compliance with the protocol is voluntary.
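To illustrate how such a check might work in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt rules, the bot names (`ExampleAIBot`, `OtherBot`) and the `example.com` URLs are hypothetical and are not taken from any site or crawler mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from the whole site,
# and every other crawler is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/guides/phone"
    print(is_allowed("ExampleAIBot", page))  # the named bot is disallowed everywhere
    print(is_allowed("OtherBot", page))      # other bots may fetch public pages
```

The check runs on the crawler's side: nothing in the file stops a bot that never consults it, which is why site owners who are ignored fall back on measures such as blocking addresses.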
“We respect robots.txt and our crawler respected that signal when iFixit implemented it,” Anthropic said. The company also said its crawlers respected “anti-evasion technologies” such as CAPTCHAs and that “our crawling should not be intrusive or disruptive. We strive for minimal disruption by carefully considering how quickly we crawl the same domains.”
Data scraping is not a new practice, but it has increased dramatically in the last two years due to the AI arms race, creating new costs for websites.
“AI crawlers have cost us a lot of money in bandwidth fees and cost us a lot of time dealing with abuse,” Eric Holscher, co-founder of document hosting site Read the Docs, wrote in a blog post on Thursday. “AI crawlers are not respectful to the sites they crawl, and that will lead to a backlash against AI crawlers in general,” he added.
Anthropic has developed some of the world’s most advanced chatbots – rivaling OpenAI’s ChatGPT – that can respond to a range of natural language prompts, while positioning itself as a more ethical actor than some competitors. Anthropic’s stated goal is “the responsible development and maintenance of advanced AI for the long-term benefit of humanity.”
As leading AI companies compete to develop ever more powerful and adept models, they are penetrating deeper into untapped corners of the internet, partnering with publishers or creating synthetic training data.
OpenAI has signed a number of deals with publishers and content providers in recent months, including Reddit, The Atlantic and the Financial Times. Anthropic has not publicly announced similar partnerships.
“Search engines have always done a lot of scraping,” Barrie said, “but with the training of generative AI, that has gone to a whole new level.”
iFixit’s mission “is to share information,” Wiens said, to encourage people to do their own repairs. “We don’t mind them using our content to train models, we just want to be part of the conversation.”
He added: “I’m not an advocate on this issue, I’m just trying to keep a website online.”