One Million Public Bluesky Posts Gathered for AI Training


Bluesky, the social media platform marketed as an alternative to X (formerly known as Twitter), is already facing its first major controversy over AI data scraping. The issue has arisen even though the platform’s owners assured users that their data would *not* be used to train generative AI models.

On November 26, **404Media** revealed that one million public posts from Bluesky, containing identifiable user information, were scraped and uploaded to the AI platform Hugging Face. This dataset was created by machine learning librarian Daniel van Strien to aid in the development of language models, tools for natural language processing, and the analysis of social media trends, content moderation, and posting behaviors. The dataset not only included decentralized identifiers (DIDs) of users but also incorporated a search function to find posts from particular users.

The description of the dataset indicates that it “contains 1 million public posts gathered from Bluesky Social’s firehose API (Application Programming Interface), aimed at machine learning research and experimentation with social media data. Each post encompasses text content, metadata, and details regarding media attachments and reply relationships.”

Bluesky users were not asked for permission to have their content used this way, although the platform does not outright prohibit it. The firehose API, which supplies a real-time, aggregated stream of all public data from the network (including posts, likes, follows, and handle changes), is publicly accessible by design. Together with Bluesky’s decentralized Authenticated Transfer (AT) Protocol, this openness gives third-party developers access to the platform’s content, and developers are a group Bluesky has actively tried to attract, according to **404Media**.
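To make concrete how little stands between a public event stream and a dataset of posts, here is a minimal sketch in Python. It does not connect to Bluesky. The real firehose delivers binary-encoded frames over a WebSocket, while this example assumes the events have already been decoded into JSON lines. The event layout (`kind`, `commit`, `collection`, `record`), the sample DIDs, and the `extract_posts` helper are illustrative assumptions, not the method used to build the dataset in question.

```python
import json

# Collection name used for posts in Bluesky's data model.
POST_COLLECTION = "app.bsky.feed.post"


def extract_posts(lines):
    """Yield (did, text) for each newly created post in a stream of JSON events.

    Assumes each line is one JSON object shaped like the samples below;
    that shape is a simplification chosen for illustration.
    """
    for line in lines:
        event = json.loads(line)
        if event.get("kind") != "commit":
            continue  # skip identity/account events such as handle changes
        commit = event.get("commit") or {}
        if commit.get("operation") != "create":
            continue  # skip updates and deletions
        if commit.get("collection") != POST_COLLECTION:
            continue  # skip likes, follows, and other record types
        record = commit.get("record") or {}
        yield event.get("did"), record.get("text", "")


# Hypothetical sample events: one post, one like, one identity event.
SAMPLE_EVENTS = [
    json.dumps({
        "did": "did:plc:example1",
        "kind": "commit",
        "commit": {
            "operation": "create",
            "collection": "app.bsky.feed.post",
            "record": {"text": "Hello, Bluesky"},
        },
    }),
    json.dumps({
        "did": "did:plc:example2",
        "kind": "commit",
        "commit": {
            "operation": "create",
            "collection": "app.bsky.feed.like",
            "record": {},
        },
    }),
    json.dumps({"did": "did:plc:example3", "kind": "identity"}),
]

posts = list(extract_posts(SAMPLE_EVENTS))
print(posts)
```

Each result pairs a post's text with its author's DID, which is why a dataset assembled this way can be searched by user.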

The incident has raised concerns among Bluesky’s rapidly growing user base, which includes people who left X after that platform adopted a contentious AI training policy. In response to **404Media’s** inquiry, a Bluesky representative said: “Bluesky is an open and public social network, similar to many websites on the Internet. Just as robots.txt files do not always stop outside companies from crawling those sites, the same is true in this case. We are looking for ways for Bluesky users to communicate their consent to outside organizations/developers, and for those organizations to honor user consent, and we are actively discussing how to make this happen.”
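The robots.txt comparison in that statement can be illustrated with a short sketch using Python's standard-library `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration. The point is that a robots.txt file only states a site's preferences: a well-behaved crawler checks them before fetching, but nothing in the mechanism itself blocks a crawler that skips the check.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A cooperative crawler asks before fetching each URL.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/public/page")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/page")

print(allowed_public)   # the rules permit this path
print(allowed_private)  # the rules ask crawlers to skip this path
```

Honoring the answer is voluntary, which is the gap the Bluesky statement describes: a consent signal is only as effective as the willingness of outside developers to check and respect it.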

After the report was published, the dataset was swiftly removed from Hugging Face. Van Strien offered an apology in a follow-up [Bluesky post](https://bsky.app/profile/danielvanstrien.bsky.social/post/3lbvih4luvk23?ref=404media.co), stating: “I have deleted the Bluesky data from the repository. While my intention was to support development tools for the platform, I acknowledge that this approach infringed on principles of transparency and consent in data collection. I sincerely apologize for this error.”

This event underscores the difficulties Bluesky encounters as it attempts to maintain a balance between transparency and user privacy, particularly as it strives to set itself apart from its rivals.