an incomplete list of fediverse instances scraped by meta to train AI

cm0002@piefed.world · 17 days ago

FaceDeer@fedia.io · 16 days ago

I don’t see why everyone’s surprised about this. The Fediverse is running on ActivityPub, an open protocol whose purpose is to broadcast the content we post here to anyone who wants it. Of course it’s being used to train AI, why wouldn’t it?

OpenStars@piefed.social · 16 days ago

Except iirc, they aren’t scraping “properly” (read: efficiently at least, setting aside morality for the sake of discussing this component in isolation), and are causing traffic troubles. If only they took the time to install an actual instance themselves then nobody would care in the slightest (again, ignoring the morality part, for now).

TLDR: they are being dicks about it, bc offering everything we have for free is not enough for them.

MrKaplan@lemmy.world · 16 days ago

of all the scrapers we see, the requests identified as originating from Meta seem to be well behaved overall. they appear to (mostly) be respecting robots.txt where present and their request volume to Lemmy.World is only averaging slightly above 5 requests per minute over the last 2 weeks. they also don’t spoof their user agents to pretend to be web browsers, or at least I have not seen credible accusations of this happening.
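For anyone wondering what "respecting robots.txt" looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` name, and the `example.social` URLs are all made up for illustration; they are not Lemmy.World's actual rules or Meta's actual user agent. The last line just turns the roughly 5 requests per minute figure into a two-week total.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /api/
Crawl-delay: 12
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before every fetch and skips anything disallowed.
blocked = parser.can_fetch("ExampleBot", "https://example.social/post/1")
allowed = parser.can_fetch("OtherBot", "https://example.social/post/1")
api_allowed = parser.can_fetch("OtherBot", "https://example.social/api/v3/post")

# Crawl-delay (in seconds) for agents that fall under the wildcard rule.
delay = parser.crawl_delay("OtherBot")

# About 5 requests per minute, sustained for 14 days.
requests_in_two_weeks = 5 * 60 * 24 * 14

print(blocked, allowed, api_allowed, delay, requests_in_two_weeks)
```

Note that none of this is enforced by the server: robots.txt only works when the crawler chooses to run a check like this and sends an honest user agent string.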

scytale@piefed.zip · 16 days ago

But if they do it the “proper” way, they won’t be able to grab the data if instances defederate from them, right? And that’s what the majority of instances will do.

FaceDeer@fedia.io · 16 days ago

Assuming you know which instances are the ones they’re collecting data from. It could be any instance.

Excrubulent@slrpnk.net · 17 days ago

We need to implement tarpits: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

shnizmuffin@lemmy.inbutts.lol · 17 days ago

Those tasteless frauds!

kbal@fedia.io · 17 days ago

I see that shitposter.club is on the list. Good to know they’re using only the highest-quality training material.

ms.lane@lemmy.world · 14 days ago

They’re training on Hexbear

That’s… amusing.

Cris@lemmy.world · 16 days ago

This is only a loosely related thought, but are there any new foss licenses or anything that prohibit ai usage? I know it’ll be ignored but it feels like explicitly disallowing things could be important in opening the door to successful legal challenges to ai scraping and theft…

FaceDeer@fedia.io · 16 days ago

Case law is still pretty young in this area, but it’s looking like training AI on copyrighted content doesn’t actually infringe copyright. It’s not something that a license can restrict, because the trainers can simply reject the license and carry on training under the basics of what the law allows them to do anyway.

Open source licenses only have power because they grant permissions that people normally wouldn’t have and put conditions on those permissions. If you don’t need those permissions then you don’t have to be bound by those conditions.

Cris@lemmy.world · 16 days ago

Ahhh, that sucks ass :(

Thank you for expanding my understanding of the problem!

:pona_plush: #FediPact :pona_plush: (@FediPact@cyberpunk.lol)