Problems with Filters in the European Commission's Platforms Proposal

This is the second of three posts about the Commission's Communication on Tackling Illegal Content Online. Post One addresses problems with relying on counter-notice to protect lawful content, and Post Three addresses dystopian aspects of the Communication.

* * *

Earlier I posted about the European Commission’s Communication on Illegal Content Online. I said it dangerously over-estimated the power of procedural mechanisms like counter-notice to get legal speech back online once platforms take it down. This post is about why platforms will over-remove (even more than they do now) if they do what the Commission wants and assume expansive monitoring and filtering obligations for user-generated content. The Commission’s discussion of filtering technology is at best magical thinking, and at worst cynically disingenuous.

The communication buys in wholeheartedly to the idea that expression can and should be policed by algorithms. For example, it says that

This human-in-the-loop principle [i.e., the idea that humans should review machines’ decisions to erase expression] is, in general, an important element of automatic procedures that seek to determine the illegality of a given content[.]

But then again,

Fully automated deletion or suspension of content can be particularly effective and should be applied where the circumstances leave little doubt about the illegality of the material, e.g. in cases of material whose removal is notified by law enforcement authorities[.]

The Commission’s faith in machines or algorithms as arbiters of fundamental rights is not shared by technical experts. Nick Feamster of the Princeton Computer Science Department and Evan Engstrom of Engine recently wrote in detail about why filters often don’t work. In principle, filters are supposed to detect when one piece of content – an image or a song, for example – is a duplicate of another. In practice, they sometimes can’t even do that. More fundamentally, when content that is illegal in one context appears in a new one, they don’t understand the difference. Algorithms won’t, for example, distinguish between Syrian war footage used by ISIS and the same footage used by human rights advocates. Or an image used in a car advertisement and that same image used to criticize the car company. This problem exists in some form for every technology the report discusses and some it doesn’t – including hashing, PhotoDNA, ContentID, and AudibleMagic. This factual backdrop makes statements like this one especially bizarre:

Automatic stay-down procedures should allow for context-related exceptions.

Automatic stay-down procedures cannot, by definition, “allow for context-related exceptions.” That’s the whole problem with them.
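To make the point concrete, here is a minimal sketch of a stay-down filter, assuming the simplest possible design: an exact SHA-256 match on file bytes. The class and variable names are hypothetical, and real tools such as PhotoDNA or ContentID use fuzzier matching than this. The structural problem is the same, though: the matcher sees only the file, never the use.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


class StayDownFilter:
    """Toy exact-match filter (an assumption, not any vendor's design)."""

    def __init__(self) -> None:
        self.blocked = set()  # fingerprints of files named in a notice

    def notify(self, data: bytes) -> None:
        """Record a file as illegal so that later uploads of it are blocked."""
        self.blocked.add(fingerprint(data))

    def allows(self, data: bytes, context: str) -> bool:
        # `context` is accepted but cannot change the result:
        # the fingerprint is computed from the bytes alone.
        return fingerprint(data) not in self.blocked


footage = b"hypothetical war footage bytes"  # stand-in for a video file
stay_down = StayDownFilter()
stay_down.notify(footage)

propaganda_ok = stay_down.allows(footage, context="recruitment video")
advocacy_ok = stay_down.allows(footage, context="human rights documentation")
# One appended byte stands in for re-encoding or trimming the file.
altered_ok = stay_down.allows(footage + b"\x00", context="recruitment video")
```

In this sketch both `propaganda_ok` and `advocacy_ok` come out `False`, because the two uploads are byte-identical, while `altered_ok` comes out `True`. The filter over-removes and under-removes at the same time, and nothing in its input would let it do otherwise.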

It gets weirder. After 16 pages exhorting platforms to “proactively detect, identify and remove illegal content” using algorithms, the report gets to a shorter, separate section on “preventing the re-appearance of illegal content.” That implies all the previous pages were about something other than duplicate detection – that the Commission is talking about technology that actually knows which expression is illegal, without any human assistance, on the very first encounter. It is hard to credit that a sophisticated body like the Commission could believe such a thing exists, much less center much of the communication on it.

The problems with filters are not news to anyone who has been following the twisted path of the EU’s draft Copyright Directive. It also includes a monitoring provision, and EU civil society organizations have explained its problems in detail. Filters over-remove (take down legal content) and they under-remove (leave up illegal content). They are shockingly expensive – YouTube’s ContentID had cost Google $60 million as of several years ago – so only incumbents can afford them. Start-ups forced to build them won’t be able to afford it, or will build lousy ones with high error rates. Filters address symptoms and leave underlying problems to fester – like, in the case of radical Islamist material, the brutal conflict in Syria, global refugee crisis, and marginalization of Muslim immigrants to the US and Europe. All these problems make filters incredibly hard to justify without some great demonstrated upside – but no one has demonstrated such a thing.

I wrote about these problems, as well as problems with using tech employees as monitors, in more detail with Annemarie Bridy in our 2016 filing with the US Copyright Office. I’m reproducing that part of our filing below. (If you’re not interested in the more US-specific discussion, I suggest starting a few paragraphs into Question 10.)

The filtering discussion in the Commission’s proposal is deeply troubling. Much more so than the mistakes about counter-notice (which the US Congress also made in the DMCA), these reflect a serious failure of the democratic process. There are not good excuses for a proposal so disconnected from reality – or indifferent to collateral damage to Internet users’ rights – at this late stage in the discussion. We are all starved for principled and fact-based political leadership these days. It appears we aren’t going to get it from Brussels, at least not on this topic.

From Annemarie Bridy and Daphne Keller’s submission to the U.S. Copyright Office 512 Study, April 1, 2016:

Question 9. Please address the role of both “human” and automated notice-and-takedown processes under section 512, including their respective feasibility, benefits, and limitations.

The DMCA moves literally billions of disputes about online speech out of courts and into the hands of private parties. Congress had good reason to create this privatized system of adjudication, given the sheer scale of online copyright infringement. Other effective choices are hard to imagine given current numbers. As discussed above in response to Question 5, the DMCA strikes a wise balance of rights and obligations between copyright owners and intermediaries. But it has meaningful and often underappreciated costs to Internet users. Malicious or erroneous over-removal silences legitimate speech. As discussed below in response to Question 12, this is no speculative harm; it happens all the time. And the due process rights that courts provide to an accused party are lost when private companies effectively adjudicate infringement accusations without that party’s involvement.^[1]

The DMCA’s correctives for these problems were supposed to come from its procedural protections.^[2] As discussed below in response to Questions 16 and 29, those protections have had mixed success. But one meaningful protection, historically, has come from the seriousness of lodging a DMCA complaint. A person signing a DMCA notice must state a good faith belief that the use is not authorized, declare her authority to act under penalty of perjury, and risk damages for misrepresentation under section 512(f).^[3]

That source of protection has not technically disappeared, but its value is largely lost when notices are generated not by a person, but by a machine. The rise of “robo-notices” moves the DMCA one step further away from “real” adjudication. This shift is happening for a compelling reason: Copyright owners’ works are massively reproduced across the Internet, so they need a way to identify infringement and submit notices at scale. But taking a human accuser out of the equation comes at an obvious cost. Automated notices are notoriously flawed, often targeting content that has nothing to do with the copyrighted work. In a famous example, an agent of Columbia Pictures sent takedown notices for numerous Vimeo uploads—including the short film that originally inspired Columbia’s Pixels movie—simply for having the word “pixels” in their names.^[4] The rights of wrongly accused Internet users, and of those who would read or watch their work, suffer as a result.
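The failure mode is easy to sketch. The following is a minimal illustration, assuming a notice generator that does nothing more than a case-insensitive keyword search over upload titles. The actual matching logic used by any enforcement agent is not public, and the function name and titles below are invented.

```python
def robo_notice_targets(titles, keyword):
    """Return every title containing the keyword, case-insensitively."""
    needle = keyword.lower()
    return [title for title in titles if needle in title.lower()]


# Hypothetical upload titles, invented for illustration.
uploads = [
    "Pixels (2015) full movie",
    "Pixels - original short film",
    "Love Pixels: a student art project",
    "Holiday in Lisbon",
]

flagged = robo_notice_targets(uploads, "pixels")
```

Here `flagged` contains the first three titles. Only the first could plausibly infringe, but a title match cannot tell a pirated copy from an unrelated work that shares a word with it.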

The rise of automated notices without a corresponding fortification of users’ rights and protections is troubling. Proposals to require intermediaries themselves to institute similar automated review, discussed in response to the next question, are even more so.

Question 10. Does the notice-and-takedown process sufficiently address the reappearance of infringing material previously removed by a service provider in response to a notice? If not, what should be done to address this concern?

Although some copyright owners advocate interpretations of the DMCA that would require OSPs to be more proactive in their efforts to enforce third-party copyrights, the DMCA is quite clear that active monitoring for infringing content is not a burden that Congress saw fit to allocate to service providers when it balanced the need to make the Internet safe for copyright owners against the need to promote growth and innovation in online services. That allocative choice was reasonable in 1998, and it remains reasonable in 2016. The Internet has grown exponentially in size since the DMCA was enacted, but we should not forget that the problem of large-scale infringement was an expected development—and one that the safe harbors were specifically designed to manage.

A “notice and staydown”—a.k.a. monitoring—requirement would radically alter the existing balance in the safe harbors. It would repeal section 512(m) and effectively impose a 24-7-365 monitoring obligation on all OSPs, no matter their size or available resources. Remaking the DMCA in such a way would shift almost the entire burden and cost of enforcement from copyright holders to OSPs. Such a change would be no mere adjustment to the DMCA’s balance of burdens; it would be closer to a demolition of the DMCA’s existing structure. Once a copyright owner identified a single appearance of a work on a platform, it would never again need to engage with DMCA processes for that work, except in the rare cases of counter-notice. From that point forward, the OSP would no longer just cooperate with the copyright owner, but effectively subsidize and replace it in the ongoing enforcement of its rights.

Whatever one thinks of this drastic shift in burden between intermediaries and copyright owners (often meaning between two large companies), the impact on Internet users could only be bad. Their expressive rights would be predictably compromised by both “human” and “automated” monitoring. Sections 512(f) and (g) indicate deep Congressional concern with such implications for ordinary users, even under the more protective notice-and-takedown system. With monitoring, users could easily find themselves caught between overreaching copyright owners on the one hand and overly-risk-averse OSPs on the other.^[5]

Because of the clear prohibition in section 512(m), courts in the United States have not had reason to assess what intermediary monitoring would mean for the rights of Internet users. In other countries, however, high courts have expressed great concern. The European Court of Human Rights recently ruled that Hungary violated the European Convention’s Article 10, the analog of the First Amendment, by requiring a host to monitor its users’ comments for defamation.^[6] The highest court of the EU has never accepted monitoring obligations for intermediaries, and cited concerns about users’ rights to seek and impart information in overturning two injunctions that would have required automated filtering.^[7] The highest courts of both Argentina and India reached similar conclusions in opinions that drew extensively on US law.^[8]

The threat of over-removal under a monitoring regime would not arise solely from intermediaries’ fear of liability. It would also stem from the near elimination of an important participant in the DMCA’s well-choreographed dance: the copyright owner. Copyright owners are the people best able to identify infringement, and their judgment is key to the DMCA’s balanced process. Without the copyright owner, an OSP that has once been notified of an infringement would thereafter act as prosecutor, judge, and executioner for all uses of the same work. Given that the copyright fair use analysis is highly contextual, treating all uses of a given work as presumptively unfair is problematic. Even complete copies of files can be fair use in some cases.^[9] The task of making judgments in individual cases would at best fall to the OSP’s employees, whose knowledge—for example, of authorization or of how much of a work has been borrowed—can only be poorer than the copyright owner’s. At worst, and most plausible for large companies, even this inexpert human judgment would be replaced by a less discriminating filtering algorithm. With automated filtering, every step in assessing and deleting Americans’ online speech would be handed off to software. Even when weighed against the substantial need for efficient copyright enforcement, the risk of algorithmically implemented over-removals is too great.

The sections below explore specific issues with “human” and “automated” monitoring regimes.

“Human” Monitoring

Content monitoring by OSP employees would inevitably be inaccurate. It is simply unrealistic to expect rank and file employees to identify infringing content without a significant error rate. This includes both false positives—removing lawful content—and false negatives—leaving infringing content up. Of course, liability concerns would give OSPs great economic incentive to err on the side of false positives and remove legal online speech. Studies show this problem exists already under notice and takedown, and it would only grow worse under a monitoring system.^[10]

One obvious problem with OSP employee review is the difficulty identifying fair use—a task with which lawyers and judges struggle. Another is that employees who grew up in other countries may not recognize older movies or TV shows known to many Americans. But perhaps the most significant and under-recognized problem is that tech employees simply don’t know about copyright owners’ licensing and marketing operations. As the chief IP counsel of Etsy explained, she has no way to know “who’s using licensed material in an appropriate way, and who has a license. . . . That’s why the DMCA exists.”^[11]

The problem is vividly illustrated by DMCA notices made public in the course of Viacom International, Inc. v. YouTube, Inc., in which Viacom identified numerous videos uploaded by its own marketing department as infringing.^[12] As YouTube’s chief counsel stated:

For years, Viacom continuously and secretly uploaded its content to YouTube, even while publicly complaining about its presence there. It hired no fewer than 18 different marketing agencies to upload its content to the site. It deliberately “roughed up” the videos to make them look stolen or leaked. It opened YouTube accounts using phony email addresses. It even sent employees to Kinko's to upload clips from computers that couldn't be traced to Viacom. And in an effort to promote its own shows, as a matter of company policy Viacom routinely left up clips from shows that had been uploaded to YouTube by ordinary users.^[13]

If a large copyright owner itself cannot identify which uses of its works are authorized, it makes little sense to expect those outside the company to know. Uncertainty about the licensing status of movie clips and other commercial content goes far beyond Viacom; it is a pervasive issue for platforms operating under the DMCA. But it is easy to predict what would happen if intermediaries bore the burden of finding infringing content, under threat of liability and statutory damages. Few would take the risk of leaving up potentially licensed, fair use, or public domain content. Erring on the side of removal would be the order of the day, and Americans’ ability to share legal content online would suffer.

“Automated” Monitoring

Algorithmic monitoring poses similar risks of suppressing legal speech. Part of the problem comes from inevitable failures and inaccuracies of filtering technologies themselves. Such failures arise because algorithms lack human judgment and cannot adequately assess context. A content filter intended to find copyrighted images would not know when those images appear in a critical essay, for example, or be capable of assessing the four fair use factors.^[14] An algorithm that could replace expert fair use analysis would be a marvel of technology.

Accidental filtering of legal information can also arise from purely technical errors, when algorithms misjudge whether one file is a copy of another. Developing monitoring tools that do not over-filter (remove more information than intended) or under-filter (remove less) is a significant, perhaps asymptotic technical challenge. Even YouTube’s Content ID, the product of over 50,000 hours of engineering work and some 60 million dollars’ investment by Google,^[15] is routinely faulted by Internet users for misidentifying and removing lawful content.^[16]
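The trade-off between over-filtering and under-filtering can be illustrated with a toy model. This sketch assumes a fuzzy matcher that reports a numeric distance between an upload and a reference work and declares a match below some threshold. The distances and labels are invented for illustration and describe no real system.

```python
def error_counts(samples, threshold):
    """Return (false_positives, false_negatives) for a given match threshold.

    A match is declared when the distance is at or below the threshold.
    """
    false_positives = sum(
        1 for dist, is_copy in samples if dist <= threshold and not is_copy
    )
    false_negatives = sum(
        1 for dist, is_copy in samples if dist > threshold and is_copy
    )
    return false_positives, false_negatives


# Invented toy data: (distance to the reference work, whether it truly is a copy).
samples = [
    (0, True), (3, True), (9, True), (14, True),
    (6, False), (11, False), (20, False), (30, False),
]

strict = error_counts(samples, threshold=2)   # few matches declared
middle = error_counts(samples, threshold=8)
loose = error_counts(samples, threshold=15)   # many matches declared
```

On this toy data the strict setting removes no lawful uploads but misses three copies, the loose setting catches every copy but removes two lawful uploads, and the middle setting does some of both. Where altered copies and unrelated works overlap in distance, as they do here, no threshold eliminates both kinds of error.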

Requiring similarly Herculean monitoring efforts from smaller Internet intermediaries would have foreseeable consequences. Some would go out of business. Some would never attract investment and be able to launch in the first place. Both of these results would serve to entrench incumbent companies with existing, expensive (yet still-flawed) filtering tools. Some small companies would build whatever filters their skill and budget permitted—meaning significantly less accurate ones, with significantly more false positives. And some would move back to the 1990s “walled garden” version of the Internet, accepting only monitored and curated content. The Internet’s current array of open platforms for free expression, commerce, and public participation would look very different if monitoring users for copyright infringement were a condition for launch.

Proponents of monitoring requirements sometimes argue that voluntarily-developed tools such as YouTube’s ContentID show that OSPs are capable of effective monitoring, and that this in turn is a reason the law should require them to monitor.^[17] This reasoning is unsound on both pragmatic and policy grounds. As a reason to repeal section 512(m) and require all OSPs to monitor as best they can, it runs into the problems discussed above: decreased investment in new OSPs, clumsy monitoring efforts, and removal of users’ lawful content. As a reason to impose monitoring requirements on individual companies that build duplicate-detection tools, it is equally problematic, because it would deter competitors from developing similar technologies. A platform would think long and hard before investing in the next ContentID if it knew that doing so would only lead to additional legal liabilities and obligations.

This is exactly the issue that Congress recognized in enacting the United States’ other core intermediary liability law, the Communications Decency Act (“CDA”).^[18] That law responded to a 1990s case holding an OSP liable for user-generated defamation, on the basis of its failed undertaking to monitor for such content.^[19] To preclude further rulings of this sort, Congress in CDA section 230 immunized OSPs’ “private blocking and screening of offensive material.”^[20] As the CDA’s Policy section states, the legislation served “to remove disincentives for the development and utilization of blocking and filtering technologies” by ensuring that such efforts would not increase an OSP’s liability or subsequent obligations.^[21] The same reasoning pertains to copyright. A law that effectively penalized technical innovation in this area would mean lost opportunities for both OSPs and copyright owners, for whom ContentID provides payments totaling hundreds of millions of dollars each year.^[22] The Copyright Office should view any proposal to amend section 512(m) on the basis of “improved” filtering technology with great skepticism.

^[1] Thoughtful assessments of Constitutional harms from this privatization of judicial function appear in several articles. See, e.g., Jack Balkin, Old-School/New-School Speech Regulation, 127 Harv. L. Rev. 2296 (2014); Derek Bambauer, Orwell’s Armchair, 79 U. Chi. L. Rev. 863 (2012); Seth Kreimer, Censorship by Proxy: The First Amendment, Internet Intermediaries, and the Problem of the Weakest Link, 155 U. Penn. L. Rev. 11 (2006).

^[2] 17 U.S.C. § 512(g); S. Rep. No. 105-190 at 50 (1998) ("The put back procedures were added as an amendment to this title in order to address the concerns of several members of the Committee that other provisions of this title established strong incentives for service providers to take down material, but insufficient protections for third parties whose material would be taken down.").

^[3] 17 U.S.C. § 512(c)(3)(A) (detailing requirements for a DMCA notice).

^[4] Michelle Starr, Videos Taken Down From Vimeo For Using the Word 'Pixels,' CNET, Aug. 9, 2015, http://www.cnet.com/news/videos-taken-down-from-vimeo-for-using-the-word....

^[5] See supra note 2 and accompanying text.

^[6] MTE v. Hungary, ECtHR 22947/13 (2016). The case distinguished defamatory content, for which monitoring was not permitted, from more damaging hate speech and threats of violence, for which an earlier case permitted monitoring obligations to be imposed on a news platform’s comments forum.

^[7] Case C-70/10 Scarlet Extended SA v Société belge des auteurs, compositeurs et éditeurs SCRL (SABAM) [2011] ECLI:EU:C:2011:771 (Para. 52); Case C-360/10 SABAM v. Netlog [2012] ECLI:EU:C:2012:85.

^[8] Corte Suprema [Supreme Court of Argentina], Civil, Rodriguez M. Belen c/Google y Otro s/ daños y perjuicios, R.522.XLIX., Oct. 29, 2014; Supreme Court of India, Criminal, Shreya Singhal v. Union of India, No. 167/2012, Mar. 24, 2015.

^[9] See, e.g., Online Policy Group v. Diebold, 337 F. Supp. 2d 1195 (N.D. Cal. 2004) (holding that it was fair use to post entire copies of company email messages).

^[10] See response to Question 29 infra.

^[11] Sarah Lai Stirland, Etsy’s Sarah Feingold on Small Business and Copyright Compliance, Project DisCo Blog, July 23, 2014, http://www.project-disco.org/intellectual-property/072314-etsys-sarah-fe....

^[12] See, e.g., Notice of Dismiss. of Specified Clips With Prejudice, Viacom International, Inc., et al. v. YouTube, Inc., et al., 2010 WL 2532404 (S.D.N.Y 2010) (referring to the hundreds of video clips that Viacom had initially identified as “infringing” but which were subsequently withdrawn from the list of works in suit); Mem. in Supp. of Def.’s Mot. for Summ. J., Viacom Intern. Inc. v. YouTube, Inc., 718 F. Supp. 2d 514, S.D.N.Y (Mar. 11 2010), http://static.googleusercontent.com/external_content/untrusted_dlcp/www.....

^[13] Zahavah Levine, Broadcast Yourself, YouTube Official Blog, Mar. 18, 2010, https://youtube.googleblog.com/2010/03/broadcast-yourself.html.

^[14] See 17 U.S.C. § 107 (setting forth the factors that must be considered when assessing fair use).

^[15] See Section 512 of Title 17: Hearing Before the Subcomm. on Courts, Intellectual Prop., & the Internet of the H. Comm. on the Judiciary, 113th Cong. 2 (2014) (statement of Katherine Oyama) (noting that “[w]hen YouTube became a part of Google, we really injected a huge amount of effort, so more than $60 million, more than 50,000 engineering hours went into building this system.”).

^[16] See, e.g., Glen Hoban, YouTube ContentID Fails, Medium, Jan. 7, 2015, https://medium.com/@glenhoban/youtube-content-id-fails-volume-i-6ae2c9ca570b#.57kvzho2v (reporting hundreds or thousands of erroneous ContentID matches for classical sound recordings); Mike Masnick, How Google's ContentID System Fails at Fair Use & the Public Domain, TechDirt, Aug. 8, 2012, https://www.techdirt.com/articles/20120808/12301619967/how-googles-conte....

^[17] For example, one European plaintiff successfully argued that courts should compel Google to detect and suppress images in his case, because the company had already built a “filter [that] works very well when it comes to child pornography.” See, e.g., Susanne Amann and Isabell Hülsen, Max Mosley: Google Is So ‘Arrogant They Do Whatever They Like,’ Spiegel Online, Jan. 29, 2014, http://www.spiegel.de/international/zeitgeist/max-mosley-discusses-his-f.... That case, which later settled, involved a privacy claim—which in the E.U. falls under the same notice-and-takedown laws as copyright. See also IFPI Denmark et al. v. Tele 2, Copenhagen Dist. Ct. F1-15124/2006 (2006) (ordering ISP to block www.piratebay.org based in part on existence of blocking mechanism for child pornography).

^[18] 47 U.S.C. § 230.

^[19] Stratton Oakmont, Inc. v. Prodigy Servs. Co., 1995 WL 323710 (N.Y. Sup. Ct. May 24, 1995). See also Zeran v. America Online, 129 F.3d 327, 331 (4th Cir. 1997) (discussing the legislative history of the CDA).

^[20] Id.

^[21] 47 U.S.C. § 230(b)(4).

^[22] See GOOGLE, HOW GOOGLE FIGHTS PIRACY 3 (2014), available at https://drive.google.com/file/d/0BwxyRPFduTN2NmdYdGdJQnFTeTA/view.