Click the local Home Depot ad and your email address gets handed to a dozen companies monitoring you. Your web browsing, past, present, and future, is now associated with your identity. Swap photos with friends on Photobucket and clue a couple dozen more into your username. Keep tabs on your favorite teams with Bleacher Report and you pass your full name to a dozen again. This isn't a 1984-esque scaremongering hypothetical. This is what's happening today.
[Update 10/11: Since several readers have asked – this study was funded exclusively by Stanford University and research grants to the Stanford Security Lab. It was not supported by any advocacy organization.]
Background on Third-Party Web Tracking and Anonymity
In a post on the Stanford CIS blog two months ago, Arvind Narayanan explained how third-party web tracking is not at all anonymous.
In the language of computer science, clickstreams – browsing histories that companies collect – are not anonymous at all; rather, they are pseudonymous. The latter term is not only more technically appropriate, it is much more reflective of the fact that at any point after the data has been collected, the tracking company might try to attach an identity to the pseudonym (unique ID) that your data is labeled with. Thus, identification of a user affects not only future tracking, but also retroactively affects the data that's already been collected. Identification needs to happen only once, ever, per user.
Arvind noted five ways in which a user's identity may be associated with third-party web tracking data.
- A third party is also a first party, e.g. Facebook, Twitter, or Google+.
- A first party hands off ("leaks") identifying information to a third party.
- A third party buys identifying information from a "matching service."
- A third party exploits a security vulnerability to learn a user's identity.
- A third party "deanonymizes" its data by matching it against identified data.
This post is an empirical study of identifying information leakage from first-party websites to third-party websites.[1]
Web Information Leakage
Leakage most often occurs when a first-party website stuffs information into a URL. For example, suppose Example Website sends users after they register to:
http://example.com/register?
username=GoCardinal
&name=Leland%20Stanford
&email=leland%40stanford.edu
&...
Third parties embedded in the page will receive the URL in a referrer header or equivalent [2] – and therefore Leland Stanford's username, name, and email.
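To make the mechanics concrete, here is a minimal Python sketch of what a recipient can read out of a referring URL like the one above. The function name `fields_from_referrer` is hypothetical, and the elided `&...` parameters are left out; this illustrates only the parsing step, not any particular tracker's code.

```python
from urllib.parse import urlsplit, parse_qs

def fields_from_referrer(referrer: str) -> dict:
    """Return the query parameters a recipient can read from a referring URL."""
    query = urlsplit(referrer).query
    # parse_qs percent-decodes values, so %40 becomes "@" and %20 becomes a space.
    return {key: values[0] for key, values in parse_qs(query).items()}

# The example registration URL from the text, minus the elided parameters.
referrer = ("http://example.com/register?"
            "username=GoCardinal&name=Leland%20Stanford&email=leland%40stanford.edu")
leaked = fields_from_referrer(referrer)
print(leaked)
```

No special access is needed: any embedded third party that receives the URL can decode it with a standard library call.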
Another common form of leakage is through the page title. Suppose a website's landing page includes a title tag of:
Welcome, Leland Stanford!
Embedded third-party scripts often report back with the page title; in this case, they'd include Leland Stanford's name.
[Update 10/11: The original version of this post conflated the information OkCupid provides to Lotame and BlueKai. In the interest of complete accuracy, and in response to both a deluge of questions on OkCupid's intentional leakage and a note from BlueKai seeking clarification, I have updated this section with per-company intentional leakage. I have also included the results of a leakage test (with the methodology described below) on OkCupid. My apologies to BlueKai for the incorrect implication that it collects the same sensitive profile data that Lotame does. The ambiguous discussion was solely my error.]
Leakage, in common parlance, implies unintentionality. In computer security, leakage is a term of art for an information flow – some instances of leakage are entirely intentional. For example, OkCupid, a free online dating website, appears to sell user profile information to the data providers BlueKai and Lotame. To learn which profile information OkCupid leaks, I modified each field of a profile and observed how values sent to the two companies changed. Here's what the companies appeared to receive:
Age - Both
Cats - Both
Children - Both
Country - Both
Dogs - Both
Drinking Frequency - Lotame
Drug Use Frequency - Lotame
Education - Both
Ethnicity - Lotame
Gender - Both
Income - Both
Job Sector - Both
Language Proficiencies - BlueKai
Relationship Status - Lotame
Religion - Lotame
Smoking Frequency - Lotame
State - Both
ZIP Code - Both
(I also ran the leakage test described below on OkCupid. The username was sent to 27 third-party PS+1s (defined below), including crwdcntrl.net (Lotame) and bluekai.com (BlueKai). Since OkCupid does not limit who can see a profile – a user can only require that visitors be logged in – a username provides access to a user's entire profile.)
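The field-by-field test described above amounts to a differential comparison: change one profile field, capture the request again, and see which transmitted values moved. A rough Python sketch follows. The parameter names (`a`, `g`, `z`) and the captured values are invented for illustration and are not the parameters OkCupid, BlueKai, or Lotame actually use.

```python
def changed_params(before: dict, after: dict) -> set:
    """Names of request parameters whose values differ between two captures."""
    keys = before.keys() | after.keys()
    return {k for k in keys if before.get(k) != after.get(k)}

# Hypothetical captures of the parameters sent to one third party,
# before and after editing a single profile field (age 30 -> 31).
baseline = {"a": "30", "g": "m", "z": "94305"}
after_age_edit = {"a": "31", "g": "m", "z": "94305"}

print(changed_params(baseline, after_age_edit))
```

A parameter that changes only when one profile field is edited is a strong hint that it carries that field, even when the value is encoded rather than sent in the clear.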
In a series of groundbreaking studies Balachander Krishnamurthy, Craig Wills, and Konstantin Naryshkin have demonstrated that information leakage is a pervasive problem (1, 2, 3). In their most recent paper, the authors examined signup and interaction with 120 popular sites for information leakage to third parties. They found that 56% leaked some form of private information, and 48% leaked a user identifier.
We roughly followed the same methodology as Krishnamurthy, Wills, and Naryshkin, with 1) a focus on identifying information leakage, 2) a greater number of sites, and 3) a public dataset.
Usernames as Identifying Information
Given the sizeable role usernames play in web information leakage, it's worth taking a moment to note how a username is identifying information. In some cases a username is just a user's name – for example, @jonathanmayer on Twitter. Even when it isn't the user's name, a username is often more than adequate for identifying a user.
First, a username is likely sufficient to link accounts across websites. Users routinely reuse their usernames – after all, who's going to remember a new login for each site they use? In a paper at PETS 2011, Daniele Perito et al. examined a sample of public data from Google, eBay, and other sites to estimate how linkable usernames are. They found that the vast majority of usernames in their sample had high entropy, and that simple algorithms for linking usernames could achieve pairwise precision and recall of over 70%. (For further discussion of using usernames to link social profiles, see Arvind's blog posts "The Linkability of Usernames" and "Lendingclub.com: A De-anonymization Walkthrough," as well as "Modeling Unintended Personal-Information Leakage from Multiple Online Social Networks" and "Large Online Social Footprints - An Emerging Threat" by Danesh Irani et al.) Some companies are already linking usernames in their products, including social matching services (e.g. Infochimps), scraped profiles (e.g. Spokeo), and automated social network linkage (e.g. Google Social Search).[3]
Second, combining data from multiple accounts often provides a sufficiently comprehensive mosaic to identify an individual.[4] Arvind, for example, usually goes by the username "randomwalker." The first page of a Google search turned up his yCombinator Hacker News account, which includes his job and links to his personal website, blog, and Twitter account.
Some websites (e.g. Quantcast) have responsibly recognized that a username is identifying information and have included username in their legal definition of "personally identifiable information" (PII).
Methodology
We examined each website in the Quantcast top 250, checking whether it
- offered a sign up,
- did not require a purchase or other qualification to sign up, and
- did not include so many features as to be impractical for study.
For each of the 185 websites that met all three criteria, we used the FourthParty web measurement platform to create an account and interact with the site.[5] We emphasized exploring content that dealt with a user's identity, such as profile and settings pages. After collecting data, we searched Request-URIs and Referrer headers for known personal information. We treated each public suffix + 1 (PS+1) as an independent entity, and we considered any PS+1 different from a first party's to be a third party.[6]
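The matching step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FourthParty analysis code: the function names are hypothetical, `PUBLIC_SUFFIXES` is a tiny stand-in for the full public suffix list, and the request to `ads.example.net` is invented.

```python
from urllib.parse import urlsplit, unquote_plus

# Tiny stand-in for the public suffix list; a real analysis would load the full list.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}

def ps_plus_one(host: str) -> str:
    """Return the public suffix plus one label, e.g. 'a.b.example.com' -> 'example.com'."""
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host.lower()

def find_leaks(first_party_host, request_url, referrer, persona):
    """Return (third-party PS+1, leaked field names), or None for first-party requests."""
    first = ps_plus_one(first_party_host)
    recipient = ps_plus_one(urlsplit(request_url).hostname or "")
    if recipient == first:
        return None
    # Decode once so that %40 matches "@"; doubly encoded values would be missed.
    haystack = unquote_plus(request_url + " " + (referrer or "")).lower()
    leaked = sorted(f for f, v in persona.items() if v.lower() in haystack)
    return recipient, leaked

# Known values for the test persona (taken from the example URL earlier in the post).
persona = {"username": "GoCardinal", "email": "leland@stanford.edu"}
result = find_leaks(
    "www.example.com",
    "http://ads.example.net/pixel?ref=http%3A%2F%2Fexample.com%2Fu%2FGoCardinal",
    "http://example.com/register?email=leland%40stanford.edu",
    persona,
)
print(result)
```

Plain substring matching of this kind only finds values sent in the clear, which is why obfuscated or encrypted identifiers fall outside what the search can detect.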
Results
A complete spreadsheet of results is available in Excel format. We encourage interested readers to examine the results for themselves. [Update 10/22: Before consulting the spreadsheet, please be sure to read Footnote 6 to understand the limitations of our methodology.] Please email if you would like FourthParty logs for a specific site.
The most frequent type of leakage was a username or user ID.[7] We identified username or user ID leakage to a third party on 113 websites, 61% of the websites in our sample. The top five PS+1 recipients of username and user ID leakage were:
- scorecardresearch.com (comScore), on 81 (44%) of the websites in our sample
- google-analytics.com (Google Analytics), on 78 (42%) of the websites in our sample
- quantserve.com (Quantcast), on 63 (34%) of the websites in our sample
- doubleclick.net (Google Advertising), on 62 (34%) of the websites in our sample
- facebook.com (Facebook), on 45 (24%) of the websites in our sample
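The percentages above follow directly from the 185-site sample. As a quick arithmetic check in Python (the variable names are ours, the counts are those reported in the text):

```python
SAMPLE_SIZE = 185  # websites that met all three criteria

# Number of sites on which each recipient received a username or user ID.
counts = {
    "any third party": 113,
    "scorecardresearch.com": 81,
    "google-analytics.com": 78,
    "quantserve.com": 63,
    "doubleclick.net": 62,
    "facebook.com": 45,
}

# Share of the sample, rounded to the nearest whole percent.
shares = {name: round(100 * n / SAMPLE_SIZE) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```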
Some websites leaked the username or user ID to dozens of third parties. For example, popular photo sharing website Photobucket embeds username in many of its URLs, and includes advertising on most of its pages; we observed the username get sent to 31 third-party PS+1s.
Other identifying information leaked in a number of instances. A sample:
- Viewing a local ad on the Home Depot website sent the user's first name and email address to 13 companies.
- Entering the wrong password on the Wall Street Journal website sent the user's email address to 7 companies.
[Update 10/11: A number of readers have written in noting that the Wall Street Journal leak is not in our spreadsheet. We identified the Wall Street Journal leak in a different browsing session from the one reported in the spreadsheet – and by accident. In the interest of consistency – we did not test logging out and logging back in on other sites, nor logging in with the wrong password – we decided to discuss the leak in our post but not our spreadsheet.]
- Changing user settings on the video sharing site Metacafe sent first name, last name, birthday, email address, physical address, and phone numbers to 2 companies.
- Signing up on the NBC website sent the user's email address to 7 companies.
- Signing up on Weather Underground sent the user's email address to 22 companies.
- The mandatory mailing list page during CNBC signup sent the user's email address to 2 companies.
- Clicking the validation link in the Reuters signup email sent the user's email address to 5 companies.
- Interacting with Bleacher Report sent the user's first and last names to 15 companies.
- Interacting with classmates.com sent the user's first and last names to 22 companies.
Implications
From a legal perspective, identifying information leakage is a debacle. Many first-party websites make what would appear to be incorrect, or at minimum misleading, representations about not sharing PII. Here are some examples.
Personal Information Disclosure: The Home Depot will not trade, rent or sell your personal information, without your prior consent, except as otherwise set out herein. [Does not describe sharing with third-parties for advertising or analytics.]
We will not sell, rent, or share your Personal Information with these third parties for such parties' own marketing purposes, unless you choose in advance to have your Personal Information shared for this purpose. Information about your activities on our Online Services and other non-personally identifiable information about you may be used to limit the online ads you encounter to those we believe are consistent with your interests. Third-party advertising networks and advertisers may also use cookies and similar technologies to collect and track non-personally identifiable information such as demographic information, aggregated information, and Internet activity to assist them in delivering advertising on our Online Services that is more relevant to your interests.
Metacafe's Privacy Policy is to share personal information only with the owner's informed consent.
Likewise, a number of third-party trackers disclaim collection of personally identifiable information.[8]
Scorecard Research (comScore):
Does your beacon collect or store any personally identifiable information about me?
The tagging used by ScorecardResearch is unable to identify the user visiting a page.
We don't collect or serve ads based on personally identifying information without your permission.
The better practice for all first-party and third-party websites would be to acknowledge that identifying information leakage is a fact of life on the web, and that identifying information may be shared with third parties.
As for policy, some strands of the Do Not Track debate echo a sentiment of "it's all anonymous," and so, "where's the harm?" We believe there is now overwhelming evidence that third-party web tracking is not anonymous. It is a legitimate policy question whether, on balance, Do Not Track should be enforced by law. But the difficult weighing of competing privacy risks and economics can't be short-circuited by claims of anonymity.
Thanks to Arvind Narayanan for comments on a draft.
[1] For purposes of this post, "identifying information" is information that with moderate probability and moderate effort can be used to identify a user. This post does not use a formulaic legal definition of "personally identifiable information" (PII), an approach that has been discredited by a growing body of computer science research. The Federal Trade Commission staff notably rejected the notion of PII in its draft privacy report last year.
[2] Some third parties encode the referring URL into their Request-URI.
[3] A username isn't, of course, all a third party has to go on. IP geolocation is another trivial source of information, and can help disambiguate when several individuals use similar usernames. How many Jonathan Mayers are there in Palo Alto, CA? Using the Stanford University network? This is a possible area for future research.
[4] While it is quite clear that in practice a username can often be used to discern a user's identity, confirmatory empirical research would be valuable.
[5] We used a fictional persona with unique biographical traits to minimize false positives.
[6] For readers who engage in detail with our data, we wish to emphasize several caveats to our methodology.
- We did not study – and cannot study – what companies do when they receive personal information. It is likely that many of the information leaks we identified were logged. Some third parties may take precautions to prevent logging of identifying information, and we certainly laud such efforts. But for policy purposes, there is a tremendous difference between a tracking ecosystem that is anonymous and a tracking ecosystem that is suffused with identity but promises to ignore it.
- Since some websites host content from multiple PS+1s (e.g. amazon.com and amazonaws.com), our definition of a third party introduces some false positives. That said, our findings appear to be quite robust. For example, thresholding for leakage at more than three third parties still leaves 84 websites (45%) leaking a username or user ID.
- We did not examine POST request bodies or cookies, nor did we attempt to identify obfuscated or encrypted personal information.
- Our interaction with websites was neither comprehensive nor representative of what the average user might do. We may have missed information leaks, and some of the information leaks we identified may have affected only a minority of users.
- In the course of a user's browsing, identifying information for other users might leak. We did not gauge how easily a third party could identify which information was the user's. In most cases it appeared such a determination would be straightforward.
- The regular expressions we used for matching birth year, birthday, gender, and last name had a not insignificant number of false positives. We recommend against relying solely upon those fields.
- We did not explicitly take note of which stage of signup a leak occurred at.
- We did not use a single sign-on (SSO) provider unless required. Where an SSO was mandatory, we manually labeled PS+1s associated with the SSO provider as first-party. Measuring information leakage when SSOs are used is a promising avenue for future research.
- We did not attempt to discover third parties that have been CNAMEd into a first-party PS+1 (dubbed "hidden third-parties" in some papers).
[7] User IDs were, in our testing, almost always sufficient to locate at least a username, and sometimes additional identifying information. For example, with a Causes.com user ID, anyone can attain a link to a user's Facebook profile – which in turn provides a name, photo, and possibly more.
[8] Please note: we are not claiming any company has breached its self-regulatory commitments. The Digital Advertising Alliance (DAA) online advertising self-regulation imposes lax restrictions on personally identifiable information. First, personally identifiable information is defined to only include information that is used to identify a user.
Personally Identifiable Information is information about a specific individual including name, address, telephone number, and email address—when used to identify a particular individual.
Second, the DAA principles only require noting the use of PII in a privacy policy and getting consent to retroactively use PII before the privacy policy change.
PII is a term used primarily in two areas in the Principles and Commentary. First, PII is used in the Transparency principle so that consumers are informed specifically about the collection and use of PII for Online Behavioral Advertising purposes. Second, PII is used in this Commentary to describe a specific example of a "material" change that would require Consent from the consumer under Principle V.