In response to criticism that Perplexity's AI ignores robots.txt files that block crawlers, the CEO claims: 'We don't ignore it, but we rely on third-party crawlers'
It has been pointed out that Perplexity's AI extracts information from websites whose robots.txt blocks crawlers. In response, Perplexity CEO Aravind Srinivas claims that the company does not ignore the protocol, but instead relies on third-party crawlers.
Perplexity AI CEO Aravind Srinivas on plagiarism accusations - Fast Company
https://www.fastcompany.com/91144894/perplexity-ai-ceo-aravind-srinivas-on-plagiarism-accusations
Perplexity's AI Chatbot: Why Wired Magazine Calls It a 'BS Machine'
AI companies are reportedly still scraping websites despite protocols meant to block them
https://www.engadget.com/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them-132308524.html
Exclusive: Multiple AI companies bypassing web standards to scrape publisher sites, licensing firm says | Reuters
https://www.reuters.com/technology/artificial-intelligence/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21/
Basically, search engines such as Google and Bing, as well as generative AI services, use programs called crawlers to gather vast amounts of information from the Internet, which is used to build search results and train AI. Websites, in turn, use a text file called robots.txt to control crawling: administrators can block crawlers by adding the appropriate directives to robots.txt.
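The mechanism is cooperative: a well-behaved crawler fetches robots.txt and checks each URL against it before downloading. The sketch below uses Python's standard library to show how such a check works; the robots.txt contents and the site URL are hypothetical, and "PerplexityBot" is the user-agent name Perplexity documents for its crawler.

```python
# Minimal sketch of how a compliant crawler consults robots.txt
# before fetching a page, using Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block PerplexityBot entirely,
# block everyone else only from /private/.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# PerplexityBot is disallowed from the whole site...
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))

# ...while other crawlers are only disallowed from /private/.
print(parser.can_fetch("OtherBot", "https://example.com/article"))
print(parser.can_fetch("OtherBot", "https://example.com/private/x"))
```

Nothing technically prevents a crawler from skipping this check, which is exactly the behavior the reports below describe: robots.txt is a convention, not an enforcement mechanism.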
However, previous research has shown that Perplexity was able to extract information from blog posts whose robots.txt prohibited crawling, generating detailed summaries, and that its crawlers used headless browsers to scrape content while ignoring robots.txt.
In response to this behavior by Perplexity, users have voiced opinions such as: 'Crawling by generative AI search engines like Perplexity reduces the number of users who access websites directly, which creates various disadvantages.'
Perplexity, a generative AI search engine, ignores crawler-preventing 'robots.txt' to extract information from websites - GIGAZINE
On the other hand, Perplexity CEO Srinivas responded, 'We don't ignore protocols like robots.txt. However, we rely on third-party crawlers in addition to our own.' According to Srinivas, while he could not disclose the provider's name due to a non-disclosure agreement, the crawler in question is owned by a third-party provider of web crawling and indexing services.
TollBit, a startup that brokers licensing agreements between publishers and AI companies, pointed out that AI crawlers ignore the robots.txt protocol when retrieving content from more than 50 websites. Although TollBit did not name the companies involved, an investigation by the news outlet Business Insider found that OpenAI, the developer of ChatGPT, and Anthropic, the developer of Claude, also ignore the robots.txt protocol.
In addition, Anthropic states that 'Anthropic's crawler respects industry-standard robots.txt directives and honors the "do not crawl" signal from site operators,' and publishes instructions for blocking its crawler.
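Blocking a specific crawler comes down to naming its user agent in robots.txt. Assuming "ClaudeBot" is the user-agent name Anthropic documents for its crawler, a site operator's robots.txt entry might look like this:

```
# Hypothetical robots.txt entry telling Anthropic's crawler
# not to crawl any path on this site.
User-agent: ClaudeBot
Disallow: /
```

Whether this actually stops crawling depends entirely on the crawler choosing to honor it, which is the point of contention in the reports above.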
Srinivas acknowledged that 'we also use crawlers owned by companies like TollBit,' but argued that 'protocols like robots.txt that control crawlers are not legally binding. Publishers and technology companies need to build new relationships.'
When asked whether Perplexity could immediately contact its third-party crawler provider and tell it to stop crawling content protected by robots.txt, Srinivas said only, 'This is a complex issue.'
in Software, Posted by log1r_ut