Aug 30, 2024 12:03:00

It has been revealed that several major news sites are blocking the crawlers used to train Apple's personal AI 'Apple Intelligence'

The content used to train generative AI is often scraped from the web by bots, a practice that has repeatedly come under scrutiny. Apple also scrapes web content to train its AI, and it has come to light that several news sites are blocking its crawler.

Many of the biggest websites opted out of Apple Intelligence training
https://9to5mac.com/2024/08/29/apple-intelligence-training-opt-outs/

Why top publishers are opting out of Apple Intelligence AI data scraping | iThinkDifferent
https://www.ithinkdiff.com/apple-intelligence-data-scraping-publishers/

Websites Increasingly Tell Apple and AI Companies to Stop Scraping - MacStories
https://www.macstories.net/linked/websites-increasingly-tell-apple-and-ai-companies-to-stop-scraping/

Top Media Outlets Block Apple's AI Data Collection • iPhone in Canada Blog
https://www.iphoneincanada.ca/2024/08/29/news-outlets-block-apple-ai-data-collection/

Apple wants to scrape content for Apple Intelligence training — but few publishers have agreed terms to let it happen | iMore
https://www.imore.com/apple/apple-wants-to-scrape-content-for-apple-intelligence-training-but-few-publishers-have-agreed-terms-to-let-it-happen

Apple blocked from training Apple Intelligence on several publishing websites — here's what we know | Tom's Guide
https://www.tomsguide.com/ai/apple-blocked-from-training-apple-intelligence-on-several-publishing-websites-heres-what-we-know

Websites opt out of Apple AI scraping, signaling 'conflict zone' | Cult of Mac
https://www.cultofmac.com/news/websites-opt-out-of-apple-ai-scraping

New York Times and more block Apple Intelligence training
https://appleinsider.com/articles/24/08/29/big-name-publishers-are-refusing-to-let-apple-intelligence-train-on-data

Apple's AI training faces backlash as major publishers opt out - PhoneArena
https://www.phonearena.com/news/apple-ai-training-publishers-opt-out_id162000

Generative AI is trained on content scraped from the web, a practice that has come under scrutiny because much of that content is copyrighted.

Apple's personal AI, Apple Intelligence, is also trained on content scraped from the web, but publishers can explicitly opt out of having their content used by adding instructions to a robots.txt file.
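As a rough illustration of what such an opt-out looks like, the sketch below parses a robots.txt file with Python's standard urllib.robotparser module. The robots.txt text and the example.com URL are hypothetical and not copied from any real publisher; only the 'Applebot-Extended' and 'Applebot' user-agent names come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (an assumption for
# illustration, not taken from any real site).
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical article URL on a placeholder domain.
url = "https://example.com/news/story"

# The AI-training opt-out agent is disallowed everywhere...
extended_allowed = parser.can_fetch("Applebot-Extended", url)
# ...while the regular Applebot crawler falls through to the "*" rule.
applebot_allowed = parser.can_fetch("Applebot", url)

print("Applebot-Extended allowed:", extended_allowed)
print("Applebot allowed:", applebot_allowed)
```

In this sketch the rule targets only the opt-out agent, so ordinary crawling under the wildcard rule is unaffected.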

Apple announced this opt-out mechanism, 'Applebot-Extended,' in May 2024. It is documented on the following page, which describes 'Applebot,' the crawler Apple uses to collect content from the web. Applebot was introduced in 2015 and was originally used to power the voice assistant Siri and the Spotlight search feature. More recently, Apple has also used the data Applebot collects to train Apple Intelligence.

About Applebot - Apple Support
https://support.apple.com/en-us/119829

It has been revealed that this opt-out is being used by major social networking sites operated by Meta, such as Facebook and Instagram, as well as by major news sites such as The New York Times and The Atlantic.

Anyone can check whether a site has opted out by looking at its publicly available robots.txt file. According to a survey by WIRED, Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, USA Today, Condé Nast, and others block Applebot-Extended. WIRED reports that 'about 6-7% of high-traffic websites block Applebot-Extended.'

In addition, data journalist Ben Welsh's own investigation found that 294 of 1,167 English-language media outlets based in the United States (about a quarter) block Applebot-Extended. By comparison, about 53% of the outlets block OpenAI's crawler and about 43% block Google's AI crawler.
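A survey of this kind can be sketched in a few lines of Python. The helper names (fetch_robots_txt, blocks_agent, share_blocking), the '.example' domains, and the sample robots.txt contents below are all hypothetical; 'GPTBot' is the user-agent name of OpenAI's crawler. Treating "disallowed at the site root" as "blocked" is a simplifying assumption and is not necessarily how WIRED or other surveys counted.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(site: str, timeout: float = 10.0) -> str:
    """Download https://<site>/robots.txt and return it as text."""
    with urlopen(f"https://{site}/robots.txt", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def blocks_agent(robots_txt: str, agent: str) -> bool:
    """Return True if the robots.txt text disallows `agent` at the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def share_blocking(robots_by_site: dict[str, str], agent: str) -> float:
    """Fraction of sites in the sample whose robots.txt blocks `agent`."""
    blocked = sum(blocks_agent(text, agent) for text in robots_by_site.values())
    return blocked / len(robots_by_site)


# Hypothetical sample data; a real survey would fill this dictionary with
# fetch_robots_txt(site) for each site of interest.
SAMPLE = {
    "news-a.example": "User-agent: Applebot-Extended\nDisallow: /\n",
    "news-b.example": "User-agent: GPTBot\nDisallow: /\n",
    "news-c.example": "User-agent: GPTBot\nDisallow: /\n",
    "news-d.example": "",
}

for agent in ("Applebot-Extended", "GPTBot"):
    print(f"{agent}: {share_blocking(SAMPLE, agent):.0%} of sample sites block it")
```

Keeping the download step separate from the parsing step means the counting logic can be run on saved robots.txt files without touching the network.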

Apple has also reportedly signed contracts with multiple media outlets to train its AI, so it is speculated that the companies and media outlets blocking the crawler are those that have no such contract.

It turns out that Apple has discussed multi-year deals worth over $50 million with various media outlets to train generative AI on news articles - GIGAZINE

'Many of the world's largest publishers are clearly taking a strategic approach,' said Jon Gillham, founder of Originality AI, a company that develops tools for detecting AI-generated text and plagiarism. 'I think in some cases it has to do with business strategy, such as withholding data until a partnership agreement is signed.' In other words, he suggested that some companies may be blocking Apple's crawler in order to secure payment from Apple.

It had also been reported that Apple used YouTube video subtitles to train its AI, but Apple responded that they are not used to train its production AI, including Apple Intelligence.

Apple refutes reports that YouTube subtitles were used for AI training, saying they are not used in production AI including 'Apple Intelligence' - GIGAZINE

Aug 30, 2024 12:03:00 in AI, Web Service, Posted by logu_ii