thisaintbc: (Default)
[personal profile] thisaintbc
I heard about this on tumblr, and went digging a little to see what I could verify. Here's what I've got.

Earlier this year, someone scraped AO3 for a dataset, seemingly intended for AI training. Data scrapes aren't new, but this one was large and the person who did it has been sort of flippant about takedown requests. AO3 wasn't alone; several other sites were also scraped.

This lays out the progression of events pretty comprehensively, but the tl;dr version is that the dataset was uploaded to HuggingFace, then to a few other websites. At some point, the scraper created their own website to host it as well. After receiving DMCA notices, most sites have taken down or deleted the data. However, the scraper filed a counter-notice.

I reached out to the OTW (my full message is below the cut at the end of the post) asking if they can let me know if they've pursued legal action against data scraping before & whether they intend to this time. I'll post an update when I get a response.

The takeaways: 
  • The dataset is currently down in most locations where we can realistically expect it to be taken down. PaperDemon advises against visiting the scraper's personal website and I'm inclined to agree; this person has already proven themselves to not adhere to the same ethical values we hold. It also was apparently still up on a site called datafish as of 6-7 hours ago; I'm not sure if that's the personal website in question or something else, although frankly I haven't spent much time trying to look into that piece. You probably don't need to worry about filing DMCAs at this point.
  • That said, the dataset might go back up. At 10:29 GMT on April 11, the scraper commented that they had filed a counter-notice. From the date of receipt of the counter-notice, the copyright holder usually has 10-14 business days to file a lawsuit. If no legal action is taken, the data can go back up. Today is the 11th business day since the counter-notice.
  • HuggingFace's post says the DMCA was from AO3 (actually, "from the representatives of Transformative Works"), so presumably the counter-notice was also sent to AO3. It's possible that they're referring to AO3 when they actually mean an individual, but if we take their statements at face value that means the ball's in the OTW's court.
  • Things are a little heated on HuggingFace. AO3 users are using strong language, and in response some of the people on HuggingFace have started talking about things like torrents, private websites, etc. 
  • So far, the OTW has been silent. That isn't hugely unusual; this is far from the first data-scraping incident, and a lot of this stuff gets handled behind the scenes. But this incident has attracted significant attention, on tumblr at least, so I was hoping they would have some kind of statement.
  • Not all AO3 data scraping is unethical. There are plenty of people who scrape AO3 for legitimate fannish purposes, from gathering fandom stats to Auto AO3 (that website people use to look up gift exchange requests). If in some future world we decide to Do Something about data scraping, I think it's important to bear in mind the distinction between ethical vs unethical data scraping. The scraper in this incident has said that they aren't doing this for profit, but to me this absolutely feels like the same thing as a web novel website copying things from AO3 to try to sell.
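
For anyone who wants to check the counter-notice math themselves, here's a rough sketch of the business-day count in Python. It assumes the counter-notice was received on April 11, 2025 (the year is my assumption, and the scraper's comment only tells us when they said they filed, not when it was received), and it only skips weekends, not holidays. The function name is just something I made up for illustration.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Return the date n business days after start.

    Counts Monday through Friday only. Holidays are NOT handled,
    which is a simplification.
    """
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


# Assumption: counter-notice received April 11, 2025.
counter_notice = date(2025, 4, 11)

# The "10-14 business days" window mentioned in the takeaways.
window_opens = add_business_days(counter_notice, 10)
window_closes = add_business_days(counter_notice, 14)

print("Earliest the data could go back up:", window_opens.isoformat())
print("End of the 14-business-day window:", window_closes.isoformat())
```

If the actual receipt date was later than the comment, the whole window shifts later by the same amount, so treat the output as an estimate rather than a deadline.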
This incident also highlights for me something I've been saying in private for a while: We need someone doing some kind of AO3 news reporting. The Daily Dot was sort of doing this for a while, but their fandom reporting seems to have fallen off - and they really only focused on major issues and interesting features. What we need is something more comparable to local journalism, that helps us understand the AO3 equivalents of water rates and pot hole repairs.

Right now, it feels like many of us rely on word of mouth and the occasional OTW press release for information. I'm not criticizing OTW's press releases, but like any organization they tend to be slow and careful in putting things out, only wanting to write after things are resolved and they can provide solid answers. But sometimes we need to know there are questions to be asked.

Maybe somebody's already doing this and I'm just unaware of it (lbr, if it exists it's probably here on dreamwidth), so if you know of something like this please drop it in the comments.

Here's what I sent OTW:

Hi - this is probably a question for legal advocacy, but I'm reaching out via communications as a sort of "media inquiry" in the sense that I'd like to share the answers I get.

As a mod of the due South/C6D Big Bang and someone involved in the running of a few other challenges, I'm concerned about the rising frequency of what seem to be for-profit data scrapes and considering the possibility of beginning to advise participants of the risks around this issue. In order to better understand what the OTW is doing in response, I have two main questions:

First, and forgive my ignorance here, but has any kind of legal action been taken by the OTW in regards to for-profit/non-fannish data scrapes at this point?

Second, and probably more pressing, in regards to the recent Hugging Face data scrape by nyuuzyou I understand that OTW received a counternotice, and the dataset will be public again soon unless OTW takes legal action. Do you intend to do so?

Thanks for all that you do!


I'll let you know what kind of response I get.




Page generated Jan. 7th, 2026 02:44 am