Nov 28, 2024 12:18:00

Bluesky itself will not train AI on user posts, but third parties can: a dataset of 1 million posts collected via Bluesky's API was published on Hugging Face

X (formerly Twitter) updated its terms of use in November 2024 to make clear that posts will be used for AI training. In response, many users moved to Bluesky, a rival social network that has stated it will not use posts for AI training. However, a dataset of 1 million posts obtained via Bluesky's API was then published on Hugging Face.

Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research'
https://www.404media.co/someone-made-a-dataset-of-one-million-bluesky-posts-for-machine-learning-research/

Bluesky may not train AI on your posts, but others can, and users are furious - Neowin
https://www.neowin.net/news/bluesky-may-not-train-ai-on-your-posts-but-others-can-and-users-are-furious/

Bluesky, AI, and the battle for consent on the open web
https://werd.io/2024/bluesky-ai-and-the-battle-for-consent-on-the-open

Bluesky posted on its official account on November 15, 2024 that it would not use user content to train generative AI. However, because all posts on Bluesky are public by design, there were concerns that AI training by third parties could not be prevented.

Unlike X (formerly Twitter), Bluesky has stated that it will not use posts to train AI - GIGAZINE

Meanwhile, engineer Daniel van Strien announced on November 26, 2024 that 'a dataset of 1 million posts from Bluesky has been made available on Hugging Face.' Van Strien said about this dataset, 'It can be used for training and testing language models on social media content, analyzing social media posting patterns, studying conversation structure and reply networks, studying social media content moderation, and natural language processing tasks using social media data.'

First dataset for the new @huggingface.bsky.social @bsky.app community organization: one-million-bluesky-posts

1M public posts from Bluesky's firehose API
Includes text, metadata, and language predictions
Perfect to experiment with using ML for Bluesky

huggingface.co/datasets/blu...

— Daniel van Strien ( @danielvanstrien.bsky.social ) November 26, 2024 22:50

In the post, Van Strien explains that the dataset was created using Bluesky's Firehose API. Firehose is an API that streams all public posts in real time, and third parties are free to consume that stream.
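
To illustrate what consuming such a stream involves, here is a minimal sketch that picks post text out of a single firehose-style event. It makes no network connection. The JSON event shape, the field names (`did`, `kind`, `commit`, `collection`, `record`), and the sample values are assumptions made for illustration; they are not taken from Van Strien's code or from this article.

```python
import json
from typing import Optional

# Hypothetical sample event. The field names are assumptions modeled on a
# JSON rendering of a firehose commit; they do not come from the article.
SAMPLE_EVENT = json.dumps({
    "did": "did:plc:example123",
    "kind": "commit",
    "commit": {
        "operation": "create",
        "collection": "app.bsky.feed.post",
        "record": {"text": "Hello, Bluesky!", "langs": ["en"]},
    },
})


def extract_post(raw_event: str) -> Optional[dict]:
    """Return the author and text of a newly created post, else None."""
    event = json.loads(raw_event)
    if event.get("kind") != "commit":
        return None  # not a repository commit (e.g. an identity update)
    commit = event.get("commit") or {}
    if commit.get("operation") != "create":
        return None  # ignore deletions and updates
    if commit.get("collection") != "app.bsky.feed.post":
        return None  # ignore likes, follows, and other record types
    record = commit.get("record") or {}
    return {"author": event.get("did"), "text": record.get("text", "")}


if __name__ == "__main__":
    print(extract_post(SAMPLE_EVENT))
```

A collector built this way only needs to read events and keep the ones it wants, which is why a public stream makes third-party datasets easy to assemble.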

However, the dataset drew criticism from some users. One user harshly criticized Van Strien for even suggesting machine learning training on this data, pointing out that Bluesky itself had said it would never train on it and that many users had come to Bluesky precisely to get away from AI scraping on X.

You are the absolute worst for even suggesting ML training off this data while BSky themselves said they'd never train on it and MANY came here to leave Muskrat's stupid ML/AI scraping.
— Dix ( @dixonij.bsky.social ) November 28, 2024 6:23

In response to these criticisms, Van Strien removed the dataset from the Hugging Face repository on November 27, 2024. 'While I wanted to support the development of the platform's tools, I realized that this approach violated the principles of transparency and consent in data collection. I apologize for this mistake,' Van Strien said.

I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.

— Daniel van Strien ( @danielvanstrien.bsky.social ) November 27, 2024 11:19

After the dataset was made public, Bluesky posted on its official account that it is working on a mechanism that lets users 'explicitly indicate whether or not they consent to their data being used for AI training.'

Brief update on our ongoing efforts to allow users to specify consent (or not) for AI training: 🧵
— Bluesky (@bsky.app) 2024-11-27T01:52:05.788Z

The mechanism under consideration would work much like 'robots.txt' does for websites. However, Bluesky says that it will be up to outside developers to decide whether or not to respect users' consent settings.

For example, this might look like a setting that allows Bluesky users to specify whether they consent to outside developers using their content in AI training datasets

Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings
— Bluesky ( @bsky.app ) November 27, 2024 11:11
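
The robots.txt comparison can be made concrete with a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot`, and the URL are invented for illustration, and this shows how robots.txt works for websites, not how Bluesky's eventual setting will work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Report whether the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", url))
    print("SomeBrowser allowed:", may_fetch("SomeBrowser", url))
```

Note that `may_fetch` only reports what the site owner asked for. Nothing in the check stops a crawler that ignores the answer, which is the same limitation Bluesky describes for its proposed consent setting.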

Bluesky also said, 'We are continuing discussions with our engineers and lawyers and will provide an update soon.'

We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!
— Bluesky ( @bsky.app ) November 27, 2024 11:18


Nov 28, 2024 12:18:00 in Software, Web Service, Posted by darkhorse_log