Some Practical Postulates About Platform Data

Beliefs and expectations about what data platforms have at their fingertips vary wildly. That is about to matter a great deal, once new rules in the EU allowing researchers to access data held by platforms come into effect. Relationships between researchers, platforms, and regulators are likely to be very bumpy — and important research is likely to be delayed — until expectations become more aligned. 

This blog post is an attempt to explain the experience of many platform employees trying to get data from within their own companies. It draws on my own experience and conversations with others. Its goal is to help shorten that inevitable bumpy adjustment period as researchers, platforms, and regulators figure out how to work with one another. 

Public discussion about transparency too often starts from the assumption that platforms are well-functioning surveillance machines. That’s sort of true. The platforms most likely to be affected by transparency laws generally collect and store a lot of data. But the data is kind of a mess. It is not pre-sorted into the categories that researchers are interested in, or even the categories that employees wish they could see. Often, it could be. But that requires both work and decision-making — often very consequential decision-making — about exactly what data is being pulled, and how it is defined. Researcher access regimes can succeed by anticipating and planning around this critical stage in the process.  

Some Postulates

Here are some things I think I have learned from my own experience and from talking to people at well over a dozen platforms. This is definitely unscientific. But I don’t know of any public source discussing these issues, so I will share what I have seen. 

1.     There is rarely just one way to count most things we care about. This could be because of hard-to-define categories — what counts as “political” content, for example, in an attempt to count posts on political topics? But it can also be for much more fine-grained reasons, like what counts as “Italian” — something posted from Italian territory, seen in Italy, hosted on a .it domain, written in the Italian language? What counts as having happened within a specified time frame? In content moderation, for example, any given cutoff date for annual data sets might fall midway through the handling of a notice. A notice alleging that a post violates the platform’s rules might arrive December 30, be resolved by the platform January 2, and go through an appeal process after that. Another classic problem is counting “notices” — does that mean individual communications, or the number of items listed in those communications? There’s often no one right answer to these questions. It’s just that data sets, expectations, and interpretations get skewed when not everyone is working with the same answer. (More examples here.) 
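
To make this concrete, here is a minimal sketch in Python (the records, field names, and definitions are all invented for illustration) showing how four defensible definitions of “Italian” content produce different counts from the same underlying data:

```python
# Hypothetical records; real platform data would be far messier.
posts = [
    {"id": 1, "posted_from": "IT", "viewed_in": ["IT", "FR"], "domain": "example.it",  "lang": "it"},
    {"id": 2, "posted_from": "DE", "viewed_in": ["IT"],       "domain": "example.com", "lang": "it"},
    {"id": 3, "posted_from": "IT", "viewed_in": ["US"],       "domain": "example.com", "lang": "en"},
]

# Four defensible operational definitions of "Italian" content.
definitions = {
    "posted_from_italian_territory": lambda p: p["posted_from"] == "IT",
    "seen_in_italy":                 lambda p: "IT" in p["viewed_in"],
    "hosted_on_it_domain":           lambda p: p["domain"].endswith(".it"),
    "written_in_italian":            lambda p: p["lang"] == "it",
}

for name, matches in definitions.items():
    print(name, sum(1 for p in posts if matches(p)))

# Output: 2, 2, 1, 2 -- same data, four different "Italian" counts.
```

None of these definitions is wrong; the problem arises when a data set built on one definition is interpreted as if it were built on another.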

2.     If you want to pull data no one has pulled before, it will ALWAYS take a couple of hours to a couple of weeks of engineering time to get it. The “couple of weeks” presumably does not mean someone working the whole time; rather, it reflects tasks that require coordination and handoffs between people, which take time. The lower-hanging fruit is data that someone else with clout already wanted to have tracked for their own purposes. Here is a list, which is partially speculative, of data that might already be sorted and available at many platforms:

a.     Data already tracked for ads purposes (including for sharing information with advertisers and for internal ad targeting mechanisms) 

b.     Data routinely requested by law enforcement (at companies that have handled enough requests to standardize data collection processes)

c.     Information that is formally tracked in Trust and Safety tools for purposes of later retrieval by members of that team. (This might be as simple as using assigned ticket numbers in email headers to make them searchable, using unique and recurring language in messages to users about particular topics, etc.)

d.     Data already collected as part of the public product offering, like the number of times a particular post on Twitter was retweeted.

e.     High-level data collected for internal understanding of usage trends or revenue expectations (like monthly active users, number of daily posts, etc.). Some companies are reputed to already track and make internally available much more data of this sort (cough cough, Meta) than others. 

3.     When specifying in advance what data to pull or what tracking capacities to build into your tools, you’ll likely ask for the wrong thing at first. Maybe you just didn’t think of something. (Like, you asked for posts that mention “women” without also asking for synonyms.) Maybe your theory of the problem or how the product works is wrong. (Like, you asked for an algorithmic ranking signal’s “weight,” assuming it to be a single static number — then had to revise your project to account for signals’ varying “weight” in any given newsfeed or on any given day.) Maybe you phrased the data request ambiguously and someone invested time gathering the wrong data as a result. Of all the postulates I’m listing here, this is the one that gets the most winces of recognition when I mention it in the company of people who have done this work. 
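
As a toy illustration of the synonym problem (the posts and term lists here are invented, not drawn from any real request), the first formulation of a query can undercount badly:

```python
import re

# Invented example posts.
posts = [
    "New policy affects women in tech",
    "Threats against female journalists are rising",
    "Report on moms returning to the workforce",
]

def count_matching(posts, terms):
    # Count posts containing any of the given terms, case-insensitively.
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    return sum(1 for p in posts if pattern.search(p))

# First attempt: just the keyword from the research question.
print(count_matching(posts, ["women"]))                                       # 1

# Revised request, after realizing the first pull missed relevant posts.
print(count_matching(posts, ["women", "woman", "female", "mom", "mother"]))   # 3
```

In real projects the revision is rarely this tidy; the point is that the iteration loop is normal and should be budgeted for.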

4.     The data you pull at first will turn out to have flaws. There are a million versions of this. Meta’s helpful log of transparency report corrections provides great day-to-day examples; my very unscientific Twitter survey suggests that problems like these are very hard to avoid. Probably some involve malfeasance, but what I’ve seen is honest mistakes by people who are legitimately trying to work together. If you ask for every example of X, you may eventually hear “Sorry, we thought every item on this list was X, but the list turns out to also include Y because of an imperfectly formed query or a logging error;” “Uh oh, the list is incomplete because it turns out there is another whole repository of X maintained by another team;” or “Oops, the old system that we stopped using 18 months ago tagged X differently, or lumped it together with Y in an undifferentiated category it called Z.” This can all happen even if there’s an agreed meaning of X. As an earlier post explained, that’s often not the case.
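
A hypothetical validation pass (invented tags and records, not any platform’s actual taxonomy) shows how these flaws tend to surface only after the data is pulled:

```python
# Records as they came back from an (imperfect) query for category "X".
pulled = [
    {"id": 101, "tag": "x"},
    {"id": 102, "tag": "x"},
    {"id": 103, "tag": "z_legacy"},  # old system lumped X and Y together as "Z"
    {"id": 104, "tag": "y"},         # slipped in via a badly formed query
]

AGREED_X = {"x"}
LEGACY_UNDIFFERENTIATED = {"z_legacy"}

clean = [r for r in pulled if r["tag"] in AGREED_X]
needs_manual_review = [r for r in pulled if r["tag"] in LEGACY_UNDIFFERENTIATED]
out_of_scope = [r for r in pulled if r["tag"] not in AGREED_X | LEGACY_UNDIFFERENTIATED]

print(len(clean), len(needs_manual_review), len(out_of_scope))  # 2 1 1
```

The categories here stand in for whatever the real taxonomy turns out to be; the lesson is that verification work belongs in the plan, not just the pull.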

5.     Asking for anything has costs; asking for everything has infinite costs. Particularly for smaller companies, transparency costs are real, and the work comes at the expense of other content moderation efforts. Building out tracking systems in predefined ways also risks locking in today’s content moderation systems and policies, by making change costlier. That’s especially likely if companies optimize for compatibility with tools from third-party vendors or resources shared by larger platforms. Getting transparency on literally everything is impossible. There is no map the size of the territory. Setting priorities is unavoidable and important.  

How This Should Affect Planning for Transparency

These complications can affect transparency efforts of all kinds. They are probably most relevant, however, for proposals that empower researchers, regulators, or auditors to request specific data from platforms. The problems I described above will be vexing for everyone involved if the system is not designed to anticipate and correct for them. 

Broadly, I think there are two ways to address these problems. One is for platforms and outside experts to work together to define structured and understood data sets in advance. Researchers who request data from these known sets should be able to proceed relatively smoothly. Projects like Social Science One were intended to work this way. Mechanisms proposed in Europe to address privacy and data protection issues could also provide the right expert forum for establishing data sets of this sort, and generally better aligning expectations between platforms and researchers. Even systems that involve other, more bespoke data access provisions may benefit from standardized data sets that serve many researchers’ needs. In rolling out new transparency regimes, an initial phase focused on these standardized data sets could provide regulators and platforms with a valuable test case — allowing them to work out the kinks in the system before attempting more complex data collection and disclosure. 
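
For a sense of what defining data sets in advance could mean in practice, here is a deliberately simplified, entirely hypothetical sketch: each field in a standardized data set ships with an agreed operational definition, so that the counting questions raised in the postulates above are answered once, up front, for everyone.

```python
from dataclasses import dataclass

@dataclass
class FieldSpec:
    """One field in a pre-agreed, standardized research data set."""
    name: str
    definition: str  # the operational definition everyone has agreed to

# Entirely hypothetical example of a standardized moderation-notices data set.
STANDARD_NOTICE_DATASET = [
    FieldSpec("notice_id",   "One row per communication received, not per item listed in it"),
    FieldSpec("country",     "Country in the reporting user's account settings, not IP geolocation"),
    FieldSpec("received_at", "UTC timestamp when the notice entered the intake queue"),
    FieldSpec("resolved_at", "UTC timestamp of the first decision, before any appeal"),
]

for field in STANDARD_NOTICE_DATASET:
    print(f"{field.name}: {field.definition}")
```

The specifics are invented; the point is that settling definitions like these once, in an expert forum, spares every individual research request from re-litigating them.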

The second way to address these problems, for researchers who want access to data that is not part of these known sets, is to design a much, much more high-touch process before research projects are defined or approved. Figuring out what data to provide in this scenario would involve a very detailed and technical conversation between platforms, researchers, regulators, and perhaps other expert bodies. Through this process, researchers could iterate on their requests and arrive at formulations that collect the right data to serve legitimate research goals. In legal terms, the discussion would be less like a litigation discovery fight, and more like wary collaboration on a settlement. Having that conversation, and doing the work to actually collect and verify the data, would be seriously time-consuming and expensive for all concerned — including any regulators tasked with understanding projects and their risks. High-touch processes of this sort cannot be justified for every research project. 

I’ve talked about the risks and downsides of transparency measures elsewhere (including in Senate testimony). These are important, not because they should prevent platform transparency mandates, but because they should shape those mandates. Transparency about platforms and online speech is too important to get wrong. We need to wrangle intelligently with the foreseeable problems. Beyond the logistical questions raised in this post, those problems include hard-to-spot privacy risks; potential reduction of Internet users’ protection from state surveillance; risks that transparency demands will harm competition or drive even more de facto standardization and homogenization of online speech rules; and First Amendment concerns for both Internet users and platforms themselves. We should also think deeply about the basic project of making data about Internet users even more trackable. Critics of surveillance capitalism may be rightly wary of laws that make platforms build even more internal tools to slice and dice and analyze data about our behavior.

All of these concerns should be prompts for discussions — waystations along the way to sound transparency laws. And happily, some very easy, low-hanging-fruit transparency improvements that don’t require wrangling with the hardest questions are well within our grasp. Researchers who “scrape” content displayed to users on apps and webpages, for example, have provided critical insights about platforms already. Their work does not require the complex technical negotiations about data described in this post. But it is hindered today by the threat of lawsuits by platforms. Lawmakers in the U.S. and elsewhere could easily fix that, unleashing more and better scraping-based research. Platform transparency could also improve dramatically with reliable research APIs for access to already-public data. Not every transparency measure is as complicated as the projects I’ve described in this post. Simple measures, involving already-public data, can yield valuable results.

Better transparency is coming. Researcher access to platform data is coming. The benefit we get from these changes will depend, in significant part, on sweating the details. Policymakers can and should take the time to plan well for the practical realities of data collection. 
