Pretty much everyone wants better information about how platforms propagate and govern online speech. We are probably about to get it. A major pending EU law will require detailed public transparency reporting, and give some researchers access to large platforms’ internally-held information. Bills including the Platform Accountability and Transparency Act (PATA) make similar proposals in the U.S. As I discussed in a prior post, I think this is all great news. But I also worry that we will waste this window of legislative opportunity if we don’t put more thought into the details of proposed transparency measures.
This post is about what I consider one of the hardest questions, particularly under laws that create special data-access regimes for researchers. What data are platforms supposed to share, and what personal information will it disclose about Internet users? This question pits privacy goals against data-access and research goals. A strongly pro-privacy answer will curtail research into questions of great public importance. A strongly pro-research answer will limit users’ privacy rights. In between lie a lot of difficult calls and complex trade-offs.
Few of the proposed laws really address this, though they do contain other privacy provisions. The “what data are we talking about” question is largely left for resolution under future codes of conduct, or by future regulators. That makes some sense, since regulators will have more technical expertise and ability to assess tradeoffs on a case-by-case basis. But it also leaves the laws themselves as something of a Rorschach test. Researchers can read the bills and envision their projects going forward, while privacy advocates envision the same work being prohibited.
A lot of the current uncertainty comes from language that means different things to different people. User “data,” for example, might include the actual content of people’s posts, like the words or images they shared in a Facebook group – or it might not. Calling data “anonymized” may suggest perfect privacy protection to some, while others will recall the many times when supposedly anonymized data has turned out to identify individual people. “Aggregate” data has the same problem. In principle aggregate data isn’t supposed to provide details about any individual. In practice it very often does. Until we know which kind of data we are talking about, and the privacy risks it entails, we can’t make informed decisions about the balance between privacy and research goals.
This isn’t just a question that matters for privacy advocates. It matters for people whose priority is better access to information, too. One of my big worries is that lawmakers who find privacy questions too difficult will give up on public access to data, and focus only on limited-access regimes for academics. That would be a huge loss, forfeiting the diverse viewpoints and research agendas that come from outside the academy. Some of the best information we have today about platforms and online content comes from non-academic sources. In the disinformation space, that includes journalists and civil society groups. In the content moderation space, it includes empirical research built on publicly archived material in Harvard’s Lumen database. Passing new laws without working through difficult privacy questions and trying to replicate those successes would be a real missed opportunity.
What This Post Isn’t About: Who Gets Data and How They Use It
Many experts would say that questions about what data gets disclosed depend on who gets to see the data and how the data will be used. These “who” and “how” questions have taken up the bulk of the discussion I have personally seen, and they are addressed in draft laws like PATA. These are big, meaty questions. They just aren’t the same as the “what data” topic that I will discuss here. To frame the issue, though, here is a rough rundown of those “who” and “how” concerns:
- Should the law establish separate tiers of data access, with some data being disclosed to the public and other data only to authorized researchers?
- Which researchers get access – is this only for academics, or also for other groups including journalists or civil society groups? Who counts as a journalist? Who counts as an academic, for that matter? Would the university researchers behind the Cambridge Analytica scandal get access?
- Does it matter what platforms are doing with the data currently? In other words, if we think outside researchers would do a more credible job of analysis than platforms themselves already do, or if we think that users don’t have meaningful privacy with their data in platform hands in the first place, does that affect our thinking about data disclosures?
- Will law enforcement, national security agencies, and other government entities get access to people’s data – explicitly or as an unintended consequence of platform transparency laws? If politically motivated government actors like Texas Attorney General Ken Paxton are getting access anyway, how does that shape our thinking about letting researchers see data, too?
- Who will serve as the gatekeeper, deciding which researchers, or which projects, can go forward? How do we keep the gatekeeper from using that power to advance its own policy preferences?
- What research goals are important enough to justify special access to data? Is it just for things like public health or protection of democratic systems, or can researchers also study things like consumer preferences? (That couldn’t be commercial research, of course. Just the not-at-all-rare kind of academic research that happens to be useful for commerce.)
- How does this relate to important debates about encryption – should platforms reduce privacy protections for user communications in order to facilitate research?
- How can we ensure that researchers don’t inadvertently create new privacy risks – like by storing data insecurely, or inadvertently publishing data that can be used to identify people? Is after-the-fact liability enough?
These are important questions, and rightly the subject of ongoing debate. But it is hard to answer them without knowing more about what data is at stake. The remaining sections of this post will review four major categories of data that raise distinct privacy issues: content that discloses private information, privately shared information, aggregate data, and data that tracks user behavior over time.
Content that Discloses Private Information
Many Internet research questions can only be answered by looking at the words, pictures, or other content that users posted. We can’t know how many people posted about vaccine hesitancy, for example, without seeing the posts. Platforms might report their own aggregate data on questions like this, but researchers assessing errors or patterns of bias in platforms’ enforcement need to see the content for themselves.
Giving researchers access to users’ posts can raise real privacy issues, though – regardless of whether the researcher knows who those users are. A Facebook post that interests researchers because of its factual claims about vaccines, for example, might also include deeply personal information or photos related to real children’s health problems. Does personal content of this sort become fair game as long as the parent who posted it failed to use Facebook’s “private” setting? What if the parent’s post was private, but someone else re-shared images of her or her children? Or the parent’s post was initially public, but she later deleted it or changed its privacy settings? Some researchers report a real interest in seeing deleted posts, in order to track bad actors who deliberately post and then delete disinformation. Others note that previous research can’t be checked or replicated without access to the same data set – including content users subsequently deleted. Those are serious concerns. On the other hand, undoing ordinary platform users’ ability to retract posts would be a major step away from practical privacy protections that exist today.
Beliefs about the right balance between privacy and freedom of information in situations like this vary dramatically. American laws and journalistic norms, for example, often treat anything that has been published once as public forever. European laws and norms more often prioritize privacy. In my experience from years of working on Right to Be Forgotten laws, most people’s preferences fall somewhere in the middle. Many thoughtful observers would prefer to have different rules for politicians than for private figures, for example; or more stringent rules for sensitive information like an individual’s HIV status or sexual history. But applying such nuanced or context-specific rules to large data sets used in research may be functionally impossible.
Privately Shared Information
Content that has serious public consequences is often shared over private channels. Falsehoods spread via private chat apps have contributed to violence against Muslims in India, for example. Earlier technologies also spread problematic content, like the 1990s-era glut of email messages conveying unsubstantiated and often false rumors about celebrities, politicians, and companies.
Should the contents of private communications like SMS or email messages ever be disclosed to researchers – even in aggregate? What if the private messages are simply distributing publicly available material – like if thousands of people share the same meme, or send links to the same public webpage? Should there be different rules for researchers who want to see the content of a message, as opposed to non-content metadata, like a sender’s location or the time when a message was sent? Should researcher access vary based on the nature of the service and the user’s expectations of privacy – such that a one-to-one email, for example, might be treated differently from content shared in a 5,000-person Facebook group? Do users have any expectation of privacy in communications that are technically public, but were intended for small audiences – like Twitter posts from a student with a few dozen followers?
If these questions look familiar, it may be because their analogs have been debated and litigated for years in the context of surveillance and government access to data. The content of private messages, for example, is generally protected from police review without a search warrant. Police also need warrants for some arguably public information, like location-tracking data showing where a person traveled on public streets. And while government agents can and do look at public social media posts, not everyone believes they should. Civil liberties groups recently condemned the FBI’s acquisition of software for monitoring public social media accounts, for example. In past years, they criticized the Department of Homeland Security’s practice of inspecting individuals’ social media posts at the border.
It's not clear how debates or legal precedent about users’ legitimate expectation of privacy from the police should affect our thinking about privacy vis-à-vis researchers. Some might say that a user’s privacy is equally compromised no matter who reads her private mail, or profiles her public online activity. Others might say that government surveillance is more harmful, and should be more tightly restricted. In that case, it matters what researchers do with the information they find. If a researcher spots evidence of a crime in private messages, for example, can she report it to the police? Does that affect our thinking about letting government agencies decide which research projects to fund or authorize in the first place? Whatever the answers, it seems clear at a minimum that surveillance experts should have a seat at the table as societies craft rules for researcher access to data.
Identifying Individual Users in Aggregate Data Sets
One of the most common methods for preserving user privacy while disclosing other information about platform usage is to publish aggregate data sets. Platforms’ public transparency reports, for example, often list the total number of items of content removed under particular policies, such as rules against pornography or bullying. Since this data does not identify any individual person, or even any individual piece of content, disclosing it should raise no privacy concerns. Researchers investigating things like the spread of disinformation online, however, often have a strong interest in knowing more about the users represented by such aggregate information. They may want to know how many people in a particular age group, geographic region, or education or income level shared a particular news story, for example.
Data like this – even in aggregate form – can effectively disclose the real identity of individual users. For example, one study by Dr. Latanya Sweeney found that 87% of U.S. residents can be personally identified by a researcher who knows only their ZIP code, birth date, and gender. The more categories of information are included in an “anonymized” aggregate data set, the greater the risk of disclosing the personally identifying information that underlies it. But if very few categories of information are disclosed, important research may become impossible.
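To make the re-identification risk concrete, here is a minimal Python sketch that counts how many records in a data set become unique as more categories are included. The records, the `project` helper, and the `unique_fraction` function are all hypothetical illustrations, not anything drawn from Dr. Sweeney's study or from a real platform data set.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_date, gender). No names attached.
records = [
    ("02138", "1961-07-04", "F"),
    ("02138", "1961-07-04", "F"),
    ("02138", "1975-03-12", "F"),
    ("02138", "1975-03-12", "M"),
    ("02139", "1988-11-30", "M"),
    ("02139", "1990-01-15", "M"),
]


def project(rows, columns):
    """Keep only the chosen columns (by index) of each row."""
    return [tuple(row[i] for i in columns) for row in rows]


def unique_fraction(rows):
    """Fraction of rows whose combination of attributes appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


# Compare how exposed the same people are as more categories are released.
for label, columns in [
    ("zip only", [0]),
    ("zip + gender", [0, 2]),
    ("zip + birth date + gender", [0, 1, 2]),
]:
    share = unique_fraction(project(records, columns))
    print(f"{label}: {share:.0%} of records are unique")
```

A record that is unique on these attributes can be matched to a named person by anyone who holds an outside list, such as a voter roll, with the same attributes. In this toy data, nobody is unique by zip code alone, but most records become unique once birth date and gender are added.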
Measures exist to reduce these privacy risks, but none avoid the need to make tradeoffs. One approach is to collect data only from subjects who provide informed consent. This greatly improves privacy protection, but leaves researchers with data that may not provide a representative sample of the larger population. Other approaches involve limiting the data disclosed. Most simply, data sets could exclude users from smaller, more readily identified groups – for example, by not including data about users in a particular zip code unless there are at least ten of them. Or users in those smaller groups could be aggregated into larger ones, like by combining residents of several adjacent zip codes into a single group. The resulting data is more privacy protective, but leaves researchers with little visibility into unique attributes of those smaller groups (or of people in sparsely populated regions, in the zip code example). There is also a risk that compilers of aggregate data may inadvertently disclose information about smaller groups to anyone willing to do a little arithmetic.
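The "little arithmetic" problem can be sketched in a few lines of Python. This example assumes a hypothetical publisher that suppresses zip codes with fewer than ten users but also reports an overall total. The counts, the `MIN_GROUP_SIZE` threshold, and the function names are illustrative assumptions.

```python
MIN_GROUP_SIZE = 10  # threshold taken from the zip code example in the text

# Hypothetical counts of users per zip code who shared a particular story.
counts_by_zip = {"02138": 140, "02139": 95, "02140": 3}


def suppress_small_groups(counts, threshold=MIN_GROUP_SIZE):
    """Publish a count only if the group has at least `threshold` members."""
    return {key: value for key, value in counts.items() if value >= threshold}


def recover_suppressed(published_total, published_counts):
    """Subtract the published cells from the published total."""
    return published_total - sum(published_counts.values())


published_counts = suppress_small_groups(counts_by_zip)
published_total = sum(counts_by_zip.values())  # the total is released as well

# Anyone can now work out the size of the group that was meant to be hidden.
hidden = recover_suppressed(published_total, published_counts)
print(f"Published cells: {published_counts}")
print(f"Recovered suppressed count: {hidden}")
```

The suppression rule works as intended for each published cell, yet the total gives the hidden group away. Avoiding this generally means suppressing or rounding additional cells, which costs researchers still more detail.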
Some more sophisticated methods for preserving privacy in aggregate data sets exist. One relatively novel one works by spreading data sets across two servers, which do not communicate with each other. Another, known as differential privacy, works by adding “noise” in the form of false or “synthetic” information to data sets. The idea is to keep the data accurate in aggregate, while making inferences about any individuals unreliable. Differential privacy encompasses a range of practices – some very privacy protective, some less so. For researchers, it can also create some downsides. For one thing, some have suggested that it might create barriers for researchers who do not have the statistical expertise to work with this kind of modified information. For another, it can meaningfully impair research that focuses on smaller user groups. If there are just five Hungarian people in a large data set that identifies users by national origin, for example, adding false data could meaningfully distort researchers’ conclusions about Hungarians. But failing to add false data could create privacy risks for them.
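The small-group problem can be illustrated with one common form of differential privacy, the Laplace mechanism, which adds random noise to a count before releasing it. This is a sketch under stated assumptions: the choice of mechanism, the `epsilon` value of 0.5, and the group sizes are mine, and real deployments vary widely.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered on zero."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1/epsilon. Smaller epsilon means more noise.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


def mean_relative_error(true_count, epsilon, trials, rng):
    """Average size of the noise, as a fraction of the true count."""
    total_error = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total_error / trials / true_count


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for group_size in (5, 50_000):  # hypothetical small and large groups
    error = mean_relative_error(group_size, epsilon=0.5, trials=20_000, rng=rng)
    print(f"group of {group_size}: typical error is {error:.3%} of the true count")
```

The noise is the same size in absolute terms for both groups. That makes it negligible for the large group and a large fraction of the true figure for the group of five, which is the distortion described above.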
Tracking User Behavior Over Time
Researchers often have very legitimate questions about individual users’ patterns of behavior over time (or “longitudinal” data). For example, one important study assessed the impact of different responses to Twitter users who post racist remarks – a topic that required tracking their behavior over time. Other studies might use longitudinal data to assess how Facebook users’ exposure to misinformation in private groups affects those users’ subsequent posts or interests.
The most obvious way to enable such research while preserving some user privacy is to disclose the sequence of a user’s posts or behavior over time, but replace potentially identifying information (like a platform user ID or an email address) with “anonymized” information, like a randomly assigned number. Unfortunately, this approach to privacy protection has failed dramatically in a number of well-documented cases. AOL famously published 20 million user search queries “anonymized” in this manner in 2006, for example. Reporters quickly tracked down an individual user from the data set, by comparing her queries to public records about things like real estate purchases. Similarly, when Netflix released data showing how users had rated movies over a six-year period, researchers demonstrated that anyone with modest additional information about people’s movie-watching history – gleaned from publicly available sources – could identify those individuals 99% of the time.
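The failure mode in these cases can be sketched in Python. The example below replaces user IDs with random tokens and then shows how someone with a little outside knowledge can link a token back to a name. All of the data, names, and function names are hypothetical. This illustrates the general linkage idea, not the specific methods used in the AOL or Netflix cases.

```python
import secrets


def pseudonymize(events):
    """Replace each user ID with a random token, keeping each user's events linked."""
    tokens = {}
    released = []
    for user_id, item in events:
        if user_id not in tokens:
            tokens[user_id] = secrets.token_hex(8)
        released.append((tokens[user_id], item))
    return released


def link(released, outside_knowledge):
    """Match tokens to names when exactly one token fits what is known about a person."""
    activity = {}
    for token, item in released:
        activity.setdefault(token, set()).add(item)
    matches = {}
    for name, known_items in outside_knowledge.items():
        candidates = [t for t, items in activity.items() if known_items <= items]
        if len(candidates) == 1:  # an unambiguous match re-identifies the person
            matches[name] = candidates[0]
    return matches


# Hypothetical platform data: (user ID, movie rated).
events = [
    ("u1", "movie A"), ("u1", "movie B"), ("u1", "movie C"),
    ("u2", "movie A"), ("u2", "movie D"),
    ("u3", "movie B"), ("u3", "movie D"),
]

# Hypothetical outside knowledge, for example from public reviews.
outside_knowledge = {
    "Alice": {"movie A", "movie B"},  # only one user rated both
    "Bob": {"movie D"},               # two users rated this, so no match
}

released = pseudonymize(events)
matches = link(released, outside_knowledge)
for name, token in matches.items():
    history = sorted(item for t, item in released if t == token)
    print(f"{name} is token {token}; full history exposed: {history}")
```

The random token hides the original ID, but it also ties all of a person's activity together. Once two known items single out one token, everything else under that token is exposed, including "movie C" in this example.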
Longitudinal data may, like aggregate data, be amenable to improved privacy protection using techniques like differential privacy. Making this kind of data truly anonymous sounds like a bigger challenge to me, but this privacy challenge is not as widely discussed in the non-technical literature and I can’t claim deep expertise. In any case, it seems safe to assume that technical fixes won’t solve everything. Ultimately, we will have to choose between legitimate but competing goals of privacy and information access.
Research, information access, and public understanding of the forces that shape public discourse are important. Sometimes they are more important than privacy, in my opinion. Sometimes they aren’t. Sometimes we can find technical measures to enable research while protecting privacy; other times we can’t. Certain tradeoffs between privacy and research goals will be impossible to avoid.
My goal in this post is not to pick a side in those debates. I don’t have enough information to do that, and I don’t think much of anyone else really does, either. Rather, I am hoping to support informed analysis. This is one of many transparency-related topics that would benefit from a broadened discussion, with researchers and privacy experts comparing notes, developing potential principles, and supporting informed policymaking.