Empirical Evidence of “Over-Removal” by Internet Companies under Intermediary Liability Laws

NOTE: I periodically update this post. Last update was May 8, 2020.

The "Over-Removal" Issue

Most intermediaries offer legal “Notice and Takedown” systems – tools for people to alert the company if user-generated content violates the law, and for the company to remove that content if necessary. Twitter does this for tweets, Facebook for posts, YouTube for videos, Google for search results, local news sites for user comments, etc. National law varies in terms of what content must be removed, but some version of Notice and Takedown exists in every major market. Companies receive a remarkable mix of requests – from those identifying serious and urgent problems, to those attempting to game the Notice and Takedown system as a means to silence speech they disagree with, to those stating wildly imaginative claims under nonexistent laws.

What do companies do with these removal requests? Many of the larger companies make a real effort to identify bad faith or erroneous requests, in order to avoid removing legal user content. (I worked on removals issues for Google for years, and can attest to the level of effort there.) But mistakes are inevitable given the sheer volume of requests – and the fact that tech companies simply don’t know the context and underlying facts for most real-world disputes that surface as removal requests.

And of course, the easiest, cheapest, and most risk-avoidant path for any technical intermediary is simply to process a removal request and not question its validity. A company that takes an “if in doubt, take it down” approach to requests may simply be a rational economic actor. Small companies without the budget to hire lawyers, or those operating in legal systems with unclear protections, may be particularly likely to take this route.

Much of the publicly available information about over-removal by intermediaries is anecdotal. But empirical evidence of over-removal – through error or otherwise – keeps trickling in from academic studies. This data is important to help policy-makers understand what intermediary liability rules work best to protect the free expression rights of Internet users, as well as the rights of people with valid claims to removal. This post lists the studies I have seen.

These studies were mostly conducted by academics or advocates with a particular interest in protecting user free expression and ensuring that legal content remains available online. One day I hope we will see more data from the other side – advocates for rightsholders, defamation plaintiffs, or other groups harmed by online content that violates their legal rights. That could help build a more complete picture of the over-removal issue as well as any related under-removal problem – intermediaries failing to remove content when notified, even though applicable law requires removal.

The Studies

Urban et al.'s 2016 research, "Notice and Takedown in Everyday Practice": This report is a treasure trove of qualitative and quantitative info on DMCA operations. A key finding is the divergence, documented and quantified in the study, between "classic" DMCA practice and new tools like robonotices and ContentID. The new tools are used by major players and dominate public discussion, but manual DMCA processing by small rightsholders and OSPs didn't go away. The report is long and well worth reading; my summary of key findings is here. Urban et al.'s detailed (and to me very persuasive) response to criticisms of the report, published in 2017, is here.
Jon Penney’s 2019 study of takedowns and counternotice: This study has an unusual survey component, and offers some unsettling indications about the chilling effect of takedowns and the role of gender in users’ decisions to challenge improper takedowns. Penney surveyed 1,296 panelists with hypothetical scenarios about receiving, or hearing that a friend received, notice that their online content had been removed based on a DMCA-like complaint. Respondents broadly reported being less likely in future not only to share the same content again, but also to share content they themselves had created (72%), “speak or write about certain topics online” (75%), or continue submitting queries to search engines in the same way (59%). Only 34% said they would counternotice or challenge a takedown they believed was wrong or mistaken. Female respondents and those who had particularly high concerns about privacy were meaningfully less willing to challenge erroneous takedowns. The study also reviewed 1000 Twitter and Google Blogger takedowns, reporting on questions like the apparent role of automation or whether users subsequently closed entire accounts, but not attempting to assess the validity of the initial notice.
French Ministry of Culture Report: It's incredibly hard to study the experiences of ordinary users affected by content takedowns, so this January 2020 publication from the French government is intriguing. Unfortunately, I don't read French, so for now I'm relying on this summary from the European research and advocacy organization Communia: "Based on a survey conducted among a representative sample of French internet users above 15, the report shows that 33% of all French internet users have shared audio or video material from third parties on platforms. Of these 13% have had at least one upload blocked. 58% of those who have had an upload blocked have challenged the last blocking decision and 56% of these challenges have been successful and have led to the reinstatement of the uploaded content. In absolute numbers this means that more than 700.000 French internet users (1.4% of the total) have been at the receiving end of an unjustified blocking decision."
Sharon Bar-Ziv and Niva Elkin-Koren's 2018 study of Israeli sites affected by DMCA requests to Google web search is the first work I’ve seen showing how a U.S.-based platform’s DMCA removals affect a specific non-U.S. country. The authors extracted from the Lumen Database a 9,890-URL data set of requests targeting webpages on Israel’s .il domain. This set proved idiosyncratic in two ways. First, 66% of requests – most of them from a single abusive requester – did not involve copyright infringement, but instead concerned reputational damage. Second, of the actual copyright-based requests, the majority came from software rightsholders, rather than the music or film rightsholders who often predominate. The authors found that at least 88% of those requests identified clearly infringing uses, and only 5% identified likely non-infringing uses, with the remainder uncertain.

Jennifer Urban and Laura Quilter’s 2006 review of copyright-based removals from Google’s services under the US Digital Millennium Copyright Act (DMCA): Relying on information released to the Chilling Effects (now called Lumen) database by the company about processed removals (i.e. the ones where the company agreed to remove, not the ones it declined), the authors found that 55% of notices involved disputes between competitors, and 31% presented significant issues regarding the validity of the copyright infringement claim. (Daniel Seng’s more recent work with a similar but much larger data set has great detailed statistics on DMCA removal trends, but his published conclusions do not include analysis of the validity of the claims processed.)
The 2004 Brennan Center study on removals and free expression: Reviewing a data set of 320 copyright and trademark-based removal requests, the authors concluded that 47% stated weak claims or involved speech with important fair use or free expression legal defenses.
Rishabh Dara’s detailed experiment and study of over-removals by Indian intermediaries: Dara submitted increasingly unreasonable removal requests to various intermediaries, and carefully documented the responses. His results show considerable over-removal, including removal based on clearly invalid legal claims and removal of content not targeted by the requests.
The 2004 Bits of Freedom study of Dutch ISPs: The group created accounts with ten Dutch ISPs and used them to post copies of a famous, public domain, 19th-century political essay. It then used different contact information to send copyright “infringement notices” to the ISPs, under Dutch law implementing the eCommerce Directive. Of the ten ISPs, seven removed the content despite its age and public domain status.
Oxford Program in Comparative Media Law and Policy’s smaller experiment with UK and US ISPs: Researchers posted John Stuart Mill’s 1859 discussion of media freedom from “On Liberty” – which is in the public domain. They then used different accounts to request its removal via UK and US ISPs. The UK ISP removed the essay without question, while the US ISP responded by requiring the requester to comply with the more formal requirements of the US DMCA, including “good faith belief” and “penalty of perjury” attestations.
Company transparency reports: Transparency reports from Twitter, Google, Yahoo, Facebook, Microsoft and others offer some data about removal requests. The data is valuable for other purposes, but usually not great for sussing out the validity or even volume of complaints. Most reports list only requests from government sources, which represent a small minority of legally-based content removals. Some show the overall percentage of requests accepted and rejected, or include anecdotal examples. In some cases, particularly for Google’s “Right to Be Forgotten” (RTBF) removals, this data is supplemented by news reports. For example, coverage last summer suggested that most RTBF claims come from non-public figure requesters. It is also reported that relatively few requesters take their claims to Data Protection Agencies when Google rejects their removal requests, and that when they do, the Agencies often agree with Google's decision.
The Lumen database and other research drawing on it: Lumen, formerly known as Chilling Effects, maintains a remarkable database containing millions of removal requests made public by companies and other contributors. Significant academic research has been carried out using the database, much of it relevant for over-removal questions. An overview of this literature as of 2010 is in the Chilling Effects amicus brief from Perfect 10 v. Google.
Judith Townend's research on removals by journalists and bloggers: This survey concerns removals by publishers rather than intermediaries, but contains interesting data on the frequency of requests and compliance, as well as availability of legal counsel to those receiving removal requests.
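A quick check of the arithmetic behind the Communia summary of the French Ministry of Culture report: the "1.4% of the total" figure appears to be the product of the four survey shares quoted above (shared third-party material, had an upload blocked, challenged the block, won the challenge). The sketch below simply multiplies them. The function name is mine, and the implied number of French internet users is my own back-of-the-envelope inference from the quoted 700,000 figure, not a number taken from the report.

```python
def unjustified_block_share(shared, blocked, challenged, reinstated):
    """Multiply chained survey shares to get the fraction of all internet
    users who had a blocked upload reinstated after challenging it."""
    return shared * blocked * challenged * reinstated


# Shares as quoted in the Communia summary of the French report.
share = unjustified_block_share(
    shared=0.33,      # shared third-party audio or video on platforms
    blocked=0.13,     # of those, had at least one upload blocked
    challenged=0.58,  # of those, challenged the last blocking decision
    reinstated=0.56,  # of those challenges, content was reinstated
)

# Fraction of all French internet users, as a percentage.
print(f"Share of all internet users: {share:.2%}")

# Assumption: back out the user base implied by "more than 700.000" users.
# This is an inference for illustration, not a figure from the report.
implied_base = 700_000 / share
print(f"Implied internet user base: {implied_base / 1e6:.1f} million")
```

The product comes to roughly 1.4%, matching the summary. It counts only users who challenged a block and won, so it says nothing about unjustified blocks that were never challenged.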

More studies and data sources surely exist, or will exist. I have heard in particular of one from Pakistan, but have not found it so far. If you know of other sources, please let me know or post them in the Comments section so this page can become a more useful resource for people seeking this information.