Jan 11, 2025 23:50:00

OpenAI's crawler bot takes down 3D scan data sales site with thorough scraping that's almost like a DDoS attack

It turns out that a website called '

Triplegargs ,' which sells 3D scans of human bodies, faces, hands, etc., was taken down by an OpenAI crawler bot. The bot was sending download requests for each of the hundreds of thousands of pieces of data that were for sale, and the CEO explained that it was 'essentially like a DDoS attack.'

How OpenAI's bot crushed this seven-person company's website 'like a DDoS attack' | TechCrunch
https://techcrunch.com/2025/01/10/how-openais-bot-crushed-this-seven-person-companys-web-site-like-a-ddos-attack/

Triplegangers is a website that has been selling 3D scan data for over 10 years. The data they handle is divided into categories such as 'face', 'full body', 'full body with pose', 'full body couple', 'hand', 'full body clothed', 'hand statue', and 'face statue'.

For example, at the time of writing, there is data for 1,509 people in the 'face' category, with around 20 different facial expressions registered for each person. In other categories, at least three images are registered for each product, so the total data is said to be hundreds of thousands of points.

High-Resolution Face Models, 21 FACS expressions each. | Triplegangers
https://triplegangers.com/browse/scans/faces

Triplegangers CEO Oleksandr Tomchuk told the news site TechCrunch that he had received a warning that the site was down, so he investigated and discovered that an OpenAI bot had attempted to scrape the entire Triplegangers site, downloading all of its hundreds of thousands of pieces of material.

They quickly determined that the site was down due to an OpenAI bot, but they said the bot was using 600 different IP addresses and they still don't know when it started scraping.

'OpenAI's crawler took down our site, it was essentially a DDoS attack,' Tomczuk said.

In addition, Triplegangers' terms of use page states that 'you may not collect data through indexing, scraping, data mining, or other methods using bots or search apps without our prior consent.' However, TechCrunch points out that the robots.txt file, which tells crawler bots whether they can index or crawl, did not contain appropriate information about OpenAI's bots.

OpenAI announced 'GPTBot,' a web crawler for collecting data to improve large-scale language models, in August 2023 and also published how to block it.

OpenAI announces 'GPTBot', a web crawler for improving future AI models, and also publishes a blocking method to prevent unauthorized learning by AI - GIGAZINE

On the other hand, TechCrunch also points out that 'OpenAI's bots include ChatGPT-User and OAI-SearchBot,' and that 'robots.txt is not absolute.'

For example, the search engine Perplexity has been accused of '

crawling while ignoring robots.txt .' Perplexity CEO Aravind Srinivas explained that 'it's not that our crawlers are ignoring it, but that we also use third-party crawlers.'

In response to the criticism that 'Perplexity's AI ignores robots.txt that blocks crawlers,' CEO claims that 'we don't ignore it, but we rely on third-party crawlers' - GIGAZINE

In addition, there have been reports of an AI company called Anthropic receiving 1 million hits in 24 hours because its anti-crawler protection measures were out of date.

Perplexity, a generative AI search engine, ignores crawler-preventing 'robots.txt' to extract information from websites - GIGAZINE

By the way, TechCrunch reached out to OpenAI about this matter but did not receive a response.

Related Posts:

Jan 11, 2025 23:50:00 in Web Service, Posted by logc_nt