Websites are moving to block 'GPTBot', the web crawler OpenAI uses to collect content from the Internet



In August 2023, OpenAI, the developer of the conversational AI ChatGPT, published details about 'GPTBot', a web crawler that collects from the Internet the datasets needed to train large language models. The online documentation for GPTBot also explains how to prevent it from collecting content, and some websites have reportedly begun blocking GPTBot right away.

Now you can block OpenAI's web crawler - The Verge
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai



OpenAI launches web crawling GPTBot, sparking blocking effort by website owners and creators | VentureBeat
https://venturebeat.com/ai/openai-launches-web-crawling-gptbot-sparking-blocking-effort-by-website-owners-and-creators/

Sites scramble to block ChatGPT web crawler after instructions emerge | Ars Technica
https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/

Large language models such as GPT-4, which underpin generative AI services, need large datasets for training. These datasets include a wide variety of content collected from the Internet. For example, the open dataset provided by the non-profit organization Common Crawl, which was used to train GPT-3.5, is said to consist of a total of 45 TB of text collected from the Internet since 2008.

The problem is that these datasets include copyrighted content, paywalled articles that cannot be read without a paid subscription, and personal information about members of the public. In June 2023, the California-based Clarkson Law Firm filed a class action lawsuit against OpenAI, alleging that the dataset used to train ChatGPT violated people's copyrights and privacy.

In response to these issues, AI developers are under pressure to take measures such as using datasets with cleared copyrights for AI training. In August 2023, OpenAI published online documentation detailing 'GPTBot', the web crawler used to improve large language models such as GPT-4 and the future GPT-5.

OpenAI announces a web crawler 'GPTBot' for improving future AI models, and at the same time also publishes a blocking method to prevent unauthorized learning by AI - GIGAZINE



According to OpenAI, web pages crawled by GPTBot may be used to improve future language models, and they are filtered to exclude paywalled content, content containing personal information, and content with text that violates its policies. OpenAI adds that allowing GPTBot to crawl a site can help AI models become more accurate and improve their general capabilities and safety.

In addition, the online documentation for GPTBot describes how to block crawling by GPTBot. To block GPTBot's access, a site owner only needs to add two lines to the 'robots.txt' file in the site's root directory, and some websites acted immediately after this method was published.
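
For reference, the two lines given in OpenAI's documentation are a `User-agent: GPTBot` line followed by `Disallow: /`. The minimal Python sketch below holds those two lines in a string and checks their effect with the standard library's robots.txt parser. The helper name `is_allowed`, the crawler name `ExampleBot`, and the `example.com` URL are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule that blocks GPTBot from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    # "ExampleBot" is a made-up crawler name with no matching rule.
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", page))
```

In this sketch GPTBot is reported as disallowed while the unrelated crawler is still allowed, because the rule names only the GPTBot user agent. Note that robots.txt is a request that well-behaved crawlers honor, not an access control mechanism.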

For example, the technology news site The Verge has already added the lines that block GPTBot to its 'robots.txt', and the online science fiction magazine Clarkesworld announced in a post on X (formerly Twitter) that it had blocked GPTBot.



Note that blocking GPTBot only prevents future data scraping and does not affect content that has already been collected. It also has no effect on datasets collected by scrapers other than OpenAI's, so content from websites that block GPTBot could still be used to train AI not affiliated with OpenAI.

in Software, Web Service, Posted by log1h_ik