Dr. Casey Fiesler
@cfiesler.bsky.social
Hi, so I've spent the past almost-decade studying research uses of public social media data, e.g. ML researchers using content from Twitter, Reddit, and Mastodon.
Anyway, buckle up this is about to be a VERY long thread with lots of thoughts and links to papers. 🧵
First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋
📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗
huggingface.co/datasets/blu...
Daniel van Strien
@danielvanstrien.bsky.social
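To make the dataset description above concrete, here is a toy sketch of working with records that have text plus a language prediction. The field names ("text", "langs") are my assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical post records with a predicted-language field; filter by language.
def filter_by_language(posts, lang):
    """Keep only posts whose predicted languages include `lang`."""
    return [p for p in posts if lang in p.get("langs", [])]

sample = [
    {"text": "hello bluesky", "langs": ["en"]},
    {"text": "bonjour", "langs": ["fr"]},
    {"text": "hi again", "langs": ["en"]},
]

english = filter_by_language(sample, "en")
print(len(english))  # 2
```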
Researchers have been using social media content without your consent for a LONG time. Not just AI/ML research, of course, all kinds. There is a non-zero chance that one of your tweets or reddit comments is quoted in a research paper somewhere. www.howwegettonext.com
Scientists Like Me Are Studying Your Tweets–Are You OK With That? - How We Get To Next
As researchers, we have a responsibility to acknowledge that factors like the type of data, the creator of that data, and our intended use for the data are important when it comes to using public info...
And Twitter for a long time was absolutely the biggest source of social media data for research. @zey.bsky.social once called Twitter the “model organism” of social media research: researchers used Twitter because like the fruit fly, the platform and its users were just so easy to study.
In 2016 my collaborator @profprof.bsky.social and I surveyed Twitter users about how they felt about researchers using their tweets. And one of the findings was that most of them had no idea this was happening. But when they found out... they cared. journals.sagepub.com
In part because of how U.S. IRBs work, a common ethical heuristic for "is it ok to use data for research without consent" is "is it public?" Often the ONLY ethical heuristic.
For example, speaking of datasets, remember when someone scraped all of OK Cupid? www.wired.com
OkCupid Study Reveals the Perils of Big-Data Science
The data of 70,000 OKCupid users is now searchable in a database. Ethicist Michael T Zimmer explains why it doesn't matter that it was "already public."
But it turns out that actual humans aren't just blanket okay with their social media posts being part of a research paper somewhere. Context matters! Like, um... who are you? What are you going to do with it? What's the research about? What was the tweet about? Could it be traced back to me?
For example, Brianna Dym and I talked to folks in fandom about how they felt about both researchers and journalists using their content without consent. And a significant concern was amplification of their work beyond its intended audience. journal.transformativeworks.org
Ethical and privacy considerations for research using online fandom data | Transformative Works and Cultures
And Shamika Klassen interviewed people who were part of Black Twitter about how they felt about researchers using their tweets. We found that things like positionality really matter here, especially in the context of historical harms. journals.sagepub.com
And the idea that "is it public" is all that matters, and not what you DO with people's content, is absurd.
Like you cannot possibly suggest with a straight face that this example of using transgender YouTubers' videos to train facial recognition is 100% fine.
www.theverge.com
Transgender YouTubers had their videos grabbed to train facial recognition software
In the race to train AI, researchers are taking data first and asking questions later
And to be clear nope I am not against use of public data in research. I've done it myself. (For example, this paper about research ethics that uses comments on news articles as a data source: cmci.colorado.edu)
BUT. There should be a lot of ethical consideration beyond "publicness."
For example, my collaborators @michaelzimmer.bsky.social @profprof.bsky.social @sarahagilbert.bsky.social & Naiyan Jones published a paper earlier this year with suggestions for best practices for using Reddit data in research. cfiesler.medium.com
How to remember the human: Recommendations for ethical Reddit research
The first rule of etiquette on the social platform Reddit is “Remember the human.” This rule should apply to researchers, too.
Last week I gave a talk titled "When Data Is People: Ethics, Privacy and Ownership in Research & AI Uses of Public Data" that tied together this research ethics work with generative AI training data and also copyright. Here's one of the final slides.
Notice I haven't said anything about what's "allowed" yet. I think ethical issues for this kind of thing are often more profound than law.
We'll be fighting about copyright & AI for a while, but I think for e.g. artists, "you used my work to build tech to replace me" is probably more important.
There are three major candidates for why collecting public data for research or AI training might not be allowed (all of which I saw referenced in the replies to the post about the Bluesky dataset):
(1) IRB
(2) Terms of Service
(3) Copyright
(Spoiler: None of these unambiguously apply.)
IRBs govern research ethics for U.S. universities (though there are some similar bodies in other countries) and not e.g. companies.
But also, they only govern human subjects research which means (a) interacting with a human or (b) collecting identifiable *private* information.
So IRBs typically consider collection of public social media data to not be human subjects research & therefore not under their purview. @jesspater.bsky.social @michaelzimmer.bsky.social & I actually published some speculative fiction about the limits of IRBs. dl.acm.org
No Humans Here: Ethical Speculation on Public Data, Unintended Consequences, and the Limits of Institutional Review: Proceedings of the ACM on Human-Computer Interaction: Vol 6, No GROUP
Many research communities routinely conduct activities that fall outside the bounds of traditional human subjects research, yet still frequently rely on the determinations of institutional review boards (IRBs) or similar regulatory bodies to scope ...
So what about TOS? A few years ago @brianckeegan.com & Nathan Beard and I analyzed data scraping provisions on social media sites. Honestly, they're pretty useless for thinking about ethics, and I don't even think it's inherently unethical to violate them. cfiesler.medium.com
Spiders and crawlers and scrapers, oh my! Law and ethics of researchers violating terms of service
Last week, a federal court ruled that researchers violating a website’s terms of service (TOS) in order to conduct research aimed at…
That said, there is nothing in Bluesky's TOS that prohibits scraping. But also... they have an API. This is the same reason Twitter became the fruit fly of social media research. There's no reason to think that Bluesky WANTS to prohibit use of the data here by third parties like researchers.
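As a sketch of how low the barrier is: Bluesky's public AppView exposes simple HTTP endpoints for public data. The host and endpoint name below are my assumptions based on Bluesky's public AppView; check the atproto docs before relying on them:

```python
# Build a request URL for an account's public posts via Bluesky's public
# AppView (assumed host/endpoint; an HTTP GET on this URL would return JSON).
from urllib.parse import urlencode

def author_feed_url(actor, limit=50):
    """Construct the XRPC request URL for a public author feed."""
    base = "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"
    return f"{base}?{urlencode({'actor': actor, 'limit': limit})}"

print(author_feed_url("bsky.app", limit=10))
```

The point being: no login, no developer agreement, just a URL.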
I think that people misunderstood Bluesky's statement that (unlike Musk/X) THEY will not be training AI based on your content here to mean that they would be actively protecting your content from others.
That said, apparently Bluesky is exploring methods for consent: bsky.app
Brief update on our ongoing efforts to allow users to specify consent (or not) for AI training: 🧵
Bluesky
@bsky.app
My collaborators (@brianckeegan.com @mattnicholson.bsky.social @blakeley.bsky.social @jojuliz.bsky.social) & I as part of a broader project replicated the Twitter ethics survey on Mastodon and found that over there this kind of opt-in consent for research was desirable (e.g. at the instance level).
So yes, I think that if Bluesky is able to create a mechanism for consent for AI training (though tbh I think if they're going to do it, it should be broader than that) that would be great, but I'm a little doubtful and I'm also not sure how enforceable it could be.
Okay, copyright.
Yes, you own your Bluesky content. Their copyright license looks basically the same as every other social media site (and I should know! caseyfiesler.com).
And are your posts copyrightable? Eh, probably. Some of them anyway.
(Fun fact before I go into this next part: My dissertation was literally about fair use! caseyfiesler.com I raise this because for reasons that elude me, fair use is the #1 topic I get mansplained to about on social media. Even more than AI.)
Dissertation
You can download a copy of my dissertation (defended June 15, 2015) here: The Role of Copyright in Online Creative Communities: Law, Norms, and Policy. This work is licensed under a CC-BY-NC license.
I feel pretty confident that academic research uses of public social media data are almost always fair use. Ticks the right boxes.
Noncommercial/academic AI models... less clear, but still pretty good chance.
Commercial AI like those in the slew of pending copyright cases? Maybe. We'll see.
However, I still think that the ethical/normative issues around content re-use are more interesting than copyright. For example, if you're curious why many fanfiction writers (despite relying on fair use) are so anti-AI this paper I wrote has all the answers: cmci.colorado.edu
Here's another slide from the talk I gave last week that tied together research ethics and copyright stuff re: use of public data.
So to sum up this last bit: At this moment, there is definitely nothing unambiguously *not allowed* about collecting data from Bluesky and using it however you want, including for AI training or creating a secondary dataset.
It's just another way this platform is similar to what Twitter was.
Though finally, a note on datasets.
Releasing a dataset of social media posts presents a different set of ethical issues. The most obvious one being that the user loses control over their content. I can delete a post. Once it's in a dataset, I can't. (Unless the dataset maintainers provide a mechanism to request removal.)
Also, you probably don't even know when you're in a dataset, so you couldn't request removal even if you wanted to. @morganklauss.bsky.social led this paper a few years ago about this traceability problem (and what we might do about it). medium.com
From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision…
This blog post summarizes the paper “From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets”…
And in case you haven't already seen the conclusion: the Bluesky dataset has already been removed, which I think was the right call in the context of the backlash. I also think it was important that this whole thing happened, to start these conversations. bsky.app
I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
Daniel van Strien
@danielvanstrien.bsky.social
And I almost hate to bring this to folks' attention, but... three weeks ago PLOS ONE published a dataset of "the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts." journals.plos.org
This is not surprising or new.
“I’m in the Bluesky Tonight”: Insights from a year worth of social data
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly...
Here are some recommendations from me:
(1) Anyone who collects social media content for any reason should be doing an ethical analysis based on the kinds of things raised in this thread. If they're writing a paper, it's in there. If they're releasing a dataset, it's in the documentation.
(2) You just need to know that public social media content is likely to be used in all these ways. Whether that's training an AI system or being quoted in a scientific article about flu trends or mental health or social media use. This has been the case for many years and is unlikely to change.
(3) I would love to see granular privacy settings on Bluesky. e.g., mutuals-only, followers-only (this would be my pick), or logged-in-accounts-only. This could in theory serve as a post-by-post form of consent, because non-public posts would be... not public.
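A toy sketch of how such per-post visibility checks might work. This is entirely hypothetical (Bluesky has no such settings today), and the setting names are mine:

```python
# Hypothetical per-post visibility check for the settings suggested above:
# "public", "logged_in", "followers", or "mutuals". A viewer is a dict like
# {"logged_in": bool, "follows_author": bool, "followed_by_author": bool};
# None represents a logged-out visitor.
def can_view(visibility, viewer):
    if visibility == "public":
        return True
    if viewer is None or not viewer.get("logged_in"):
        return False  # every non-public level requires being logged in
    if visibility == "logged_in":
        return True
    if visibility == "followers":
        return viewer.get("follows_author", False)
    if visibility == "mutuals":
        return (viewer.get("follows_author", False)
                and viewer.get("followed_by_author", False))
    return False  # unknown setting: fail closed

print(can_view("followers", {"logged_in": True, "follows_author": True}))  # True
print(can_view("mutuals", None))  # False
```

The consent property falls out naturally: a scraper without a logged-in, authorized session simply never sees non-public posts.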
Wow, I was right that this was very long. I could have just filmed a YouTube video about this in the 2 hours I spent on this thread. :) Please share so I feel like this was worth it!
I'd love to talk to journalists about social media data and research/AI! I would also love to talk to @bsky.app. :)
P.S. Someone asked if my slides from the keynote I gave last week on this topic were available, so here they are! The talk is titled "When Data Is People: Ethics, Privacy and Ownership in Research and AI Uses of Public Data." drive.google.com
DataIsPeople2024_Fiesler.pdf