In response to criticism that Perplexity's AI ignores robots.txt files that block crawlers, the CEO claims: 'We don't ignore it, but we rely on third-party crawlers'
It has been pointed out that Perplexity's AI extracts information from websites whose robots.txt blocks crawlers. In response, Perplexity CEO Aravind Srinivas claims that the company does not ignore the protocol, but instead relies on third-party crawlers.
Perplexity AI CEO Aravind Srinivas on plagiarism accusations - Fast Company
https://www.fastcompany.com/91144894/perplexity-ai-ceo-aravind-srinivas-on-plagiarism-accusations
Perplexity's AI Chatbot: Why Wired Magazine Calls It a 'BS Machine'
AI companies are reportedly still scraping websites despite protocols meant to block them
https://www.engadget.com/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them-132308524.html
Exclusive: Multiple AI companies bypassing web standards to scrape publisher sites, licensing firm says | Reuters
https://www.reuters.com/technology/artificial-intelligence/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21/
Basically, search engines such as Google and Bing, as well as generative AI services, use programs called crawlers to gather vast amounts of information from the Internet, which is used to build search results and train AI. Websites, in turn, use a text file called robots.txt to control crawling: administrators can block crawlers by adding the appropriate directives to robots.txt.
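The mechanism is cooperative: a well-behaved crawler fetches robots.txt and checks each URL against it before downloading. The sketch below uses Python's standard library to show how such a check works; the robots.txt contents and the site URL are hypothetical, and "PerplexityBot" is the user-agent name Perplexity documents for its crawler.

```python
# Minimal sketch of how a compliant crawler consults robots.txt
# before fetching a page, using Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block PerplexityBot entirely,
# block everyone else only from /private/.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# PerplexityBot is disallowed from the whole site...
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))

# ...while other crawlers are only disallowed from /private/.
print(parser.can_fetch("OtherBot", "https://example.com/article"))
print(parser.can_fetch("OtherBot", "https://example.com/private/x"))
```

Nothing technically prevents a crawler from skipping this check, which is exactly the behavior the reports below describe: robots.txt is a convention, not an enforcement mechanism.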
However, previous research has shown that Perplexity was able to extract information from blog posts whose robots.txt prohibited crawling, generating detailed summaries, and that its crawlers used headless browsers to scrape content while ignoring robots.txt.
In response to this behavior by Perplexity, users have voiced opinions such as: 'Crawling by generative AI search engines like Perplexity reduces the number of users who access websites directly, which creates various disadvantages.'
Perplexity, a generative AI search engine, ignores crawler-preventing 'robots.txt' to extract information from websites - GIGAZINE
On the other hand, Perplexity CEO Srinivas responded, 'We don't ignore protocols like robots.txt. However, we rely on third-party crawlers in addition to our own.' According to Srinivas, while he could not disclose the provider's name due to a non-disclosure agreement, the crawler in question is owned by a third-party provider of web crawling and indexing services.
TollBit, a startup that brokers licensing agreements between publishers and AI companies, pointed out that AI crawlers ignore the robots.txt protocol when retrieving content from more than 50 websites. Although TollBit did not name the companies involved, an investigation by the news outlet Business Insider found that OpenAI, the developer of ChatGPT, and Anthropic, the developer of Claude, also ignore the robots.txt protocol.
In addition, Anthropic states that 'Anthropic's crawler respects industry-standard robots.txt directives and honors the "do not crawl" signal from site operators,' and publishes instructions for blocking its crawler.
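Blocking a specific crawler comes down to naming its user agent in robots.txt. Assuming "ClaudeBot" is the user-agent name Anthropic documents for its crawler, a site operator's robots.txt entry might look like this:

```
# Hypothetical robots.txt entry telling Anthropic's crawler
# not to crawl any path on this site.
User-agent: ClaudeBot
Disallow: /
```

Whether this actually stops crawling depends entirely on the crawler choosing to honor it, which is the point of contention in the reports above.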
Srinivas acknowledged that 'we also use crawlers owned by companies like TollBit,' but argued that 'protocols like robots.txt that control crawlers are not legally binding. Publishers and technology companies need to build new relationships.'
When asked whether Perplexity could immediately contact its third-party crawler provider and tell it to stop crawling content protected by robots.txt, Srinivas said only, 'This is a complex issue.'
in Software, Posted by log1r_ut