A Glossary of Internet Content Blocking Tools

This piece is excerpted from the Law, Borders, and Speech Conference Proceedings Volume, where it appears as an appendix. The terminology it explains is relevant to Intermediary Liability and content regulation issues generally, not only to issues that arise in the jurisdiction or conflict-of-law context. The full conference Proceedings Volume contains other relevant resources and is Creative Commons licensed.

*******

Conversations about unlawful online content and the responsibilities of Internet intermediaries have become more heated in recent years. Participants in these discussions often lack common terminology or understanding of technological options for online content control.

This problem is not entirely new—there has never been a single agreed set of terms, and people have often used the same terms to mean different things. But miscommunications become more consequential as governments expand legal mandates for intermediaries. Different blocking technologies lead to different outcomes, which can include under-blocking unlawful content, over-blocking lawful content, or disrupting service to users. They can also place different burdens on intermediaries, and make it easier or harder for users to circumvent the blocks or for researchers to detect them.

This document briefly lists key terms as the author has seen them most commonly used. It also lists common sources of confusion.^{^[1]}

I. Common Terms

Intermediaries: Entities that “give access to, host, transmit and index content originated by third parties or provide Internet-based services to third parties.”^{^[2]} There are many kinds of intermediaries, but for purposes of content blocking or removal they can generally be clustered into two groups with different capabilities.^{^[3]}

Network intermediaries, which provide technological connections between two endpoints, can sever that connection. (Examples: ISPs, mobile carriers, content delivery networks, and DNS providers.)
Hosting intermediaries, which store user content on their servers, can remove content or restrict access to it.^{^[4]} (Examples: consumer-facing hosts such as Facebook, back-end hosting providers such as Amazon Web Services.)

Content Providers: “[T]hose individuals or organizations who are responsible for producing information in the first place and posting it online.”^{^[5]}

Remove or take down: To erase or restrict access to online content, in whole or in part.

Block: To prevent a user from accessing content, without taking the content itself offline.

Variations in the blocking target: Sometimes intermediaries block particular content (like when an ISP stops all its users from going to a website or using an app). Sometimes they block particular users (like when a website blocks all users with IP addresses from a certain country). Sometimes they do both at once (like when Twitter prevents users in a particular country from seeing a particular tweet—which they call withholding content).
Variations in the blocking implementation: An intermediary can block content completely, or can do more subtle or complex things like degrading service (as can happen to foreign websites passing through the “Great Firewall of China”), demoting content visibility (as Google web search has done on DMCA grounds), removing content but notifying users who try to access it that it was removed (as WordPress does for DMCA removals), warning users before they choose to view content (as the Blogger platform does for adult content), or even supplementing offensive content with additional context or counter-speech (as Google did in response to the 2004 “jew watch” controversy, and Jigsaw has done with newer tools).
Variations in the means used to identify information for blocking: In order to block users or content, an intermediary must have a way for machines to identify which Internet communications are to be blocked.

o Users may be blocked based on identifying information such as an account, or location information such as an IP address.

o Content is most commonly blocked based on its location. Intermediaries can block based on a web URL (like www.example.com for an entire site or www.example.com/page for a single page)^{^[6]} or an IP address (like 216.3.128.12).^{^[7]} In some cases they can disrupt elements of the Domain Name System in order to prevent a URL from resolving to the correct IP address.^{^[8]} Location-based blocking can be over-inclusive (like by blocking all content on an IP address, when only some of it is unlawful) and under-inclusive (like by blocking one instance of an MP3 file, when identical copies exist at other locations).

o Intermediaries can also block based on technical specifications (such as blocking a port to prevent use of VoIP).

o Most ambitiously, intermediaries may block by building software capable of recognizing specific content. See “Filter”, below.
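
The difference between URL-based and IP-based blocking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hosting table, the blocklists, and the function names are all hypothetical, and the IP addresses come from ranges reserved for documentation. It is not a description of how any particular intermediary implements blocking.

```python
from urllib.parse import urlsplit

# Hypothetical hosting table: hostname -> IP address. Two unrelated sites
# share one address, as is common with shared hosting.
HOST_TO_IP = {
    "www.example.com": "203.0.113.10",
    "www.example.org": "203.0.113.10",
    "www.example.net": "203.0.113.20",
}

# Hypothetical blocklists, both aimed at the same page on www.example.com.
URL_BLOCKLIST = {"www.example.com/page"}
IP_BLOCKLIST = {"203.0.113.10"}


def _split(url: str):
    """Parse a URL, tolerating input that has no scheme."""
    return urlsplit(url if "://" in url else "http://" + url)


def blocked_by_url(url: str) -> bool:
    """Block only when host and path match a blocklist entry exactly."""
    parts = _split(url)
    key = (parts.hostname or "") + parts.path.rstrip("/")
    return key in URL_BLOCKLIST


def blocked_by_ip(url: str) -> bool:
    """Block everything served from a blocklisted IP address."""
    host = _split(url).hostname
    return HOST_TO_IP.get(host) in IP_BLOCKLIST


# The URL rule catches the targeted page but misses a copy posted elsewhere
# on the same site (under-inclusive).
print(blocked_by_url("http://www.example.com/page"))
print(blocked_by_url("http://www.example.com/copy-of-page"))

# The IP rule also catches an unrelated site on the same address
# (over-inclusive).
print(blocked_by_ip("http://www.example.org/"))
```

In this sketch the URL rule is narrow and easy to evade, while the IP rule is broad and sweeps in the unrelated www.example.org.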

Websites that protect their users through SSL encryption (indicated by “HTTPS” in the browser address bar) may suffer unintended consequences if network intermediaries attempt to block content on the site. With SSL in place, an ISP monitoring user traffic may only be able to identify the site being accessed (www.example.com)—not the individual page (www.example.com/page) or any of its content. As a result, an ISP’s only options may be to block an entire site, including huge sites like youtube.com or wikipedia.org, or to block none of it.^{^[9]}
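
A rough way to picture this is to ask which part of a URL a network intermediary can observe. The sketch below rests on a simplifying assumption: for HTTPS traffic only the hostname is exposed, and for unencrypted HTTP the hostname and path are both exposed. The function name is hypothetical, and real traffic analysis is more involved than this.

```python
from urllib.parse import urlsplit


def visible_to_network(url: str) -> str:
    """Return the part of a URL an on-path observer can typically see.

    Simplifying assumption: with HTTPS only the hostname is exposed;
    with plain HTTP the path is exposed as well.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme == "https":
        return host
    return host + parts.path


print(visible_to_network("http://www.example.com/page"))
print(visible_to_network("https://www.example.com/page"))
```

Because the second call yields only the hostname, a blocking rule written against that observation cannot distinguish one page from another on the same site.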

Monitor: To review online information with the goal of identifying specific, usually objectionable content. Automated monitoring tools look for particular content, such as an image or a phrase.

Filter: To take “action against material identified through monitoring in order to then block access to it or remove it[.]”^{^[10]} Tools that hosting intermediaries can use to filter content include keyword blocklists,^{^[11]} PhotoDNA for duplicate images,^{^[12]} AudibleMagic for duplicate audio tracks,^{^[13]} and YouTube’s ContentID for duplicate video.^{^[14]} They can also use human monitors, or a combination of technical and human monitoring. For ISPs, it is sometimes possible to identify and block content using Deep Packet Inspection (DPI), but this is computationally expensive. More generally, content-based blocking is often costly. The risk of over- or under-inclusion (blocking too much or too little) varies substantially depending on the kind of technology, content, and legal claim at issue.^{^[15]}
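
The simplest of these tools, the keyword blocklist, shows how both kinds of error arise. The Python sketch below is illustrative only: the blocklist entry and the function names are hypothetical, and production filters are considerably more elaborate.

```python
import re

# Hypothetical one-word blocklist.
BLOCKLIST = {"cialis"}


def substring_match(text: str) -> bool:
    """Flag text if any blocklisted string appears anywhere inside it."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def whole_word_match(text: str) -> bool:
    """Flag text only if a blocklisted string appears as a complete word."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return not BLOCKLIST.isdisjoint(words)


# Over-blocking: "specialist" happens to contain the blocklisted string.
print(substring_match("Ask a heart specialist"))
print(whole_word_match("Ask a heart specialist"))

# Under-blocking: a trivial misspelling slips past the stricter rule.
print(whole_word_match("Buy Cialis now"))
print(whole_word_match("Buy C1alis now"))
```

Tightening the match reduces false positives of the kind described in the Scunthorpe Problem entry cited in the footnotes, but it does nothing about deliberate misspellings.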

Geolocate: To determine the location of a device, typically a user’s computer or mobile phone. This is typically done using IP address, GPS, WiFi network identification, or other technical information.

Geoblock: To use geolocation data to block particular devices or users (like Reddit blocking Russian users from certain pages).
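
IP-based geolocation and geoblocking can be sketched as a lookup followed by a policy check. Everything specific here is an assumption made for illustration: the range-to-country table is hand-written, the address ranges are ones reserved for documentation, and "XA" and "XB" are placeholder codes rather than real countries. Real services rely on large commercial or open geolocation databases.

```python
import ipaddress

# Hypothetical mapping of address ranges to placeholder country codes.
RANGE_TO_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "XA",
    ipaddress.ip_network("198.51.100.0/24"): "XB",
}

# Hypothetical policy: users who appear to be in "XB" are blocked.
BLOCKED_COUNTRIES = {"XB"}


def geolocate(ip: str):
    """Return the country code for an IP address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, country in RANGE_TO_COUNTRY.items():
        if addr in network:
            return country
    return None


def is_geoblocked(ip: str) -> bool:
    """Apply the blocking policy to the apparent location of an address."""
    return geolocate(ip) in BLOCKED_COUNTRIES


print(geolocate("192.0.2.7"), is_geoblocked("192.0.2.7"))
print(geolocate("198.51.100.9"), is_geoblocked("198.51.100.9"))
print(geolocate("203.0.113.5"), is_geoblocked("203.0.113.5"))
```

The check sees only the apparent address, so a user whose traffic arrives from a different range is treated as being wherever that range maps to.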

II. Sources of Confusion

Miscommunication about removal issues often involves one of the following questions.

1. Is the intermediary a network intermediary, or a hosting intermediary?

This matters, because network intermediaries can block the channel of transmission, preventing users from reaching content (example: ISP blocking an IP address). Hosting intermediaries, on the other hand, can take content offline completely (example: YouTube removing a video based on a DMCA request). See “Intermediaries” definition, above.

2. How is the intermediary identifying information to block?

The technological means for blocking content are rarely perfect. Many blocking mechanisms foreseeably lead to specific types of over- or under-blocking. Blocking an IP address, for example, prevents users from accessing any lawful material that shares an IP address with the targeted content. Blocking a specific webpage may be ineffective if the webmaster merely re-posts the same content on a different part of the site. Filtering tools like ContentID that identify duplicate content may fail to recognize modified copies on the one hand, or erroneously remove satire and other lawful use on the other. See “Variations in the means used to identify information for blocking,” above.
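
The problem of modified copies is easiest to see in its most extreme form: exact-match hashing. To be clear about the substitution, this is not how ContentID or PhotoDNA work; those tools are designed to tolerate some modification. The Python sketch below uses a plain SHA-256 hash, with hypothetical sample data and function names, purely to show why matching on identity alone is brittle.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Exact-match fingerprint: any change to the bytes changes the hash."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database of fingerprints for files already identified.
KNOWN_FINGERPRINTS = {fingerprint(b"example audio bytes")}


def is_known_copy(data: bytes) -> bool:
    """Report whether this exact file has been seen before."""
    return fingerprint(data) in KNOWN_FINGERPRINTS


# An identical copy is recognized.
print(is_known_copy(b"example audio bytes"))

# A copy with a single extra byte is not.
print(is_known_copy(b"example audio bytes "))
```

More tolerant fingerprints narrow this gap, at the cost of a greater chance of matching content that is similar but lawful.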

3. Is “bad” content completely deleted, or does something else happen?

Removal can be partial and incomplete in various ways. For example, intermediaries can deny access to content for just some users (based on location, age, etc.), or some user activities (such as searching for certain specific names on Google under “Right to Be Forgotten” laws). An intermediary can also demote certain content (putting it lower in search results or a news feed), degrade connection speed (making it hard to load a page or watch a video), or otherwise deter users from accessing it (such as through a malware warning or fake news label). To my knowledge, there is no good umbrella term that encompasses all these options. See “Variations in the blocking implementation,” above.

III. Other Sources of Information

https://www.internetsociety.org/doc/internet-content-blocking
Excellent and up-to-date re network intermediaries; not comprehensive re hosting intermediaries.

https://tools.ietf.org/html/draft-hall-censorship-tech-04
Excellent and recent, great citations, somewhat technical. (Per IETF practice, “expired” as of January 2017 and not to be cited, but a future version may be released as an RFC.)

/content/files/wp-content/uploads/2011/12/accessdenied-chapter-3.pdf; https://opennet.net/about-filtering
Technically good but probably drafted ten years ago and some terminology does not match current normal use. (Project was launched in 2004, shut down in 2014.)

^{^[1]} Joe Hall and Jim Greer kindly checked this for errors. If I introduced any after their review, it’s my fault.

^{^[2]} OECD, The Economic and Social Role of Internet Intermediaries, (Apr. 2010) https://www.oecd.org/internet/ieconomy/44949023.pdf at 4.

^{^[3]} End-to-end design principles would suggest moving blocking capabilities away from these intermediaries and toward the edges of the network—for example, by enabling blocking at the level of an individual user’s mobile phone or browser. See Larry Lessig, The Future of Ideas (2001) pp. 34-39; Cyberspace’s Architectural Constitution (1999) /content/files/works/lessig/www9.pdf.

^{^[4]} For purposes of content blocking and removal, a search engine or other entity providing links to content functions like a host. Removing a link typically means removing hosted HTML. For search engines, it may include the page title, snippet text, the link itself, and cached copies of the webpage.

^{^[5]} Article 19, Internet Intermediaries: Dilemma of Liability (2013) at 6, https://www.article19.org/data/files/Intermediaries_ENGLISH.pdf. For some purposes, such as copyright, the law must also distinguish between original authors and those who merely re-post information.

^{^[6]} “URL-based blocking compares the website requested by the user with a pre-determined “blacklist” of URLs of objectionable websites selected by the intermediary imposing the blocking. URLs (or uniform resource locators, otherwise known more colloquially as “web addresses”) are character strings that constitute a reference (an address) to a resource on the internet and that are usually displayed inside an address bar located at the top of the user interface of web browsers.” Angelopoulos et al, Study of fundamental rights limitations for online enforcement through self-regulation (2016) https://www.ivir.nl/publicaties/download/1796, at 7.

^{^[7]} “This operates in a similar manner to URL blocking, but uses IP (Internet Protocol) addresses, i.e., the numerical labels assigned to devices, such as computers, that participate in a network that uses the internet protocol for communication. IP-based blocking has a higher chance of resulting in unintended ‘over-blocking’ than targeted URL blocking as a result of IP sharing, as a given unique IP address may correspond to multiple URLs of different websites hosted on the same server.” Angelopoulos et al at 7.

^{^[8]} There are many variants on DNS disruption, ranging from DNS seizures (which break DNS resolution for users globally) to DNS disruption by ISPs, which affect only their users.

^{^[9]} The distinction between “location” and “content” can be fuzzy – very much as the distinction between “metadata” and “content of communications” is fuzzy in the surveillance context. For example, a URL designates location, but can also tell you something about the content of the page. (Example: www.example.com/donaldtrump.htm).

^{^[10]} “Monitoring tools such as content control software can be placed at various levels in the internet structure: they can be implemented by all intermediaries operating in a certain geographical area or only by one or some of those intermediaries; they can be applied to all of the customers of an intermediary or only to some of them (for example only to customers originating from country X); they can look only for certain content which is commonly transmitted through specific services (such as illegal file sharing through peer-to-peer networks) or indiscriminately to all content.” Angelopoulos et al at 6. In order to effectively catch specific content, monitoring must be “systematic, universal, and progressive.” AG Cruz Villalón in SABAM Opinion, quoted in Angelopoulos et al.

^{^[11]} Keyword blocking typically involves identifying words or strings in static online content. Over-removal issues with keyword blocklists of this sort are well illustrated in the Wikipedia entry for the Scunthorpe Problem, https://en.wikipedia.org/wiki/Scunthorpe_problem. Intermediaries can also keyword block text submitted by users – for example, a search engine might show no results if a user searches for “Tiananmen.”

^{^[12]} https://en.wikipedia.org/wiki/PhotoDNA

^{^[13]} https://www.audiblemagic.com/

^{^[14]} https://support.google.com/youtube/answer/2797370?hl=en

^{^[15]} See, e.g., http://www.engine.is/the-limits-of-filtering.