AO3 data scraping
Apr. 25th, 2025 06:37 pm
I heard about this on tumblr, and went digging a little to see what I could verify. Here's what I've got.
Earlier this year, someone scraped AO3 for a dataset, seemingly intended for AI training. Data scrapes aren't new, but this one was large and the person who did it has been sort of flippant about takedown requests. AO3 wasn't alone; several other sites were also scraped.
This lays out the progression of events pretty comprehensively, but the tl;dr version is that the dataset was uploaded to HuggingFace, then to a few other websites. At some point, the scraper also set up their own website to host it. After receiving DMCA notices, most sites have taken down or deleted the data. However, the scraper filed a counter-notice.
I reached out to the OTW (my full message is below the cut at the end of the post) to ask whether they've pursued legal action over data scraping before and whether they intend to this time. I'll post an update when I get a response.
The takeaways:
- The dataset is currently down in most locations where we can realistically expect it to be taken down. PaperDemon advises against visiting the scraper's personal website and I'm inclined to agree; this person has already proven themselves to not adhere to the same ethical values we hold. It also was apparently still up on a site called datafish as of 6-7 hours ago; I'm not sure if that's the personal website in question or something else, although frankly I haven't spent much time trying to look into that piece. You probably don't need to worry about filing DMCAs at this point.
- That said, the dataset might go back up. At 10:29 GMT on April 11, the scraper commented that they had filed a counter-notice. From the date of receipt of the counter-notice, the copyright holder usually has 10-14 business days to file a lawsuit. If no legal action is taken, the data can go back up. Today is the 11th business day since the counter-notice.
- HuggingFace's post says the DMCA was from AO3 (actually, "from the representatives of Transformative Works"), so presumably the counter-notice was also sent to AO3. It's possible that they're referring to AO3 when they actually mean an individual, but if we take their statements at face value that means the ball's in the OTW's court.
- Things are a little heated on HuggingFace. AO3 users are using strong language, and in response some of the people on HuggingFace have started talking about things like torrents, private websites, etc.
- So far, the OTW has been silent. That isn't hugely unusual; this is far from the first data-scraping incident, and a lot of this stuff gets handled behind the scenes. But this incident has attracted significant attention, on tumblr at least, so I was hoping they would have some kind of statement.
- Not all AO3 data scraping is unethical. There are plenty of people who scrape AO3 for legitimate fannish purposes, from gathering fandom stats to Auto AO3 (that website people use to look up gift exchange requests). If in some future world we decide to Do Something about data scraping, I think it's important to bear in mind the distinction between ethical vs unethical data scraping. The scraper in this incident has said that they aren't doing this for profit, but to me this absolutely feels like the same thing as a web novel website copying things from AO3 to try to sell.
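For anyone who wants to check the business-day count in the counter-notice bullet above, here's a rough sketch in Python. It assumes the clock starts on April 11 (the date of the scraper's comment, which may not be the actual date of receipt), counts only Monday through Friday, and ignores holidays. The function name is mine, purely for illustration.

```python
from datetime import date, timedelta


def business_days_elapsed(start: date, end: date, count_start: bool = True) -> int:
    """Count Monday-Friday days from start through end (end included).

    Assumption: weekends are Saturday/Sunday and holidays are ignored.
    """
    day = start if count_start else start + timedelta(days=1)
    count = 0
    while day <= end:
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            count += 1
        day += timedelta(days=1)
    return count


counter_notice = date(2025, 4, 11)  # assumed start of the window
today = date(2025, 4, 25)           # date of this post

print(business_days_elapsed(counter_notice, today))                     # 11
print(business_days_elapsed(counter_notice, today, count_start=False))  # 10
```

Counting April 11 itself gives 11; starting the count the day after gives 10. So the exact number depends on the counting convention, but either way we're inside the 10-14 business day window.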
Right now, it feels like many of us rely on word of mouth and the occasional OTW press release for information. I'm not criticizing OTW's press releases, but like any organization they tend to be slow and careful in putting things out, only wanting to write after things are resolved and they can provide solid answers. But sometimes we need to know there are questions to be asked.
Maybe somebody's already doing this and I'm just unaware of it (lbr, if it exists it's probably here on dreamwidth), so if you know of something like this please drop it in the comments.
Here's what I sent OTW:
( Message to OTW )
I'll let you know what kind of response I get.