Tracking the Trackers: Where Everybody Knows Your Username

Click the local Home Depot ad and your email address gets handed to a dozen companies monitoring you. Your web browsing, past, present, and future, is now associated with your identity. Swap photos with friends on Photobucket and clue a couple dozen more into your username. Keep tabs on your favorite teams with Bleacher Report and you pass your full name to a dozen again. This isn't a 1984-esque scaremongering hypothetical. This is what's happening today.

[Update 10/11: Since several readers have asked – this study was funded exclusively by Stanford University and research grants to the Stanford Security Lab. It was not supported by any advocacy organization.]

Background on Third-Party Web Tracking and Anonymity

In a post on the Stanford CIS blog two months ago, Arvind Narayanan explained how third-party web tracking is not at all anonymous.
 

In the language of computer science, clickstreams – browsing histories that companies collect – are not anonymous at all; rather, they are pseudonymous. The latter term is not only more technically appropriate, it is much more reflective of the fact that at any point after the data has been collected, the tracking company might try to attach an identity to the pseudonym (unique ID) that your data is labeled with. Thus, identification of a user affects not only future tracking, but also retroactively affects the data that's already been collected. Identification needs to happen only once, ever, per user.

Arvind noted five ways in which a user's identity may be associated with third-party web tracking data.
 

  • A third party is also a first party, e.g. Facebook, Twitter, or Google+.
  • A first party hands off ("leaks") identifying information to a third party.
  • A third party buys identifying information from a "matching service."
  • A third party exploits a security vulnerability to learn a user's identity.
  • A third party "deanonymizes" its data by matching it against identified data.

This post is an empirical study of identifying information leakage from first-party websites to third-party websites.1

Web Information Leakage

Leakage most often occurs when a first-party website stuffs information into a URL. For example, suppose Example Website sends users after they register to:
 

http://example.com/register?
username=GoCardinal
&name=Leland%20Stanford
&email=leland%40stanford.edu
&...

Third parties embedded in the page will receive the URL in a referrer header or equivalent2 – and therefore Leland Stanford's username, name, and email.
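
To make this concrete, here is a minimal sketch (not part of the original study's tooling) of how little work it takes for a third party to recover those fields once it has the referring URL. It uses only Python's standard library; the function name is an illustrative assumption.

```python
from urllib.parse import urlsplit, parse_qs

def fields_visible_in_referrer(referrer: str) -> dict:
    """Return the query parameters readable from a referring URL.

    Hypothetical helper for illustration only.
    """
    query = urlsplit(referrer).query
    # parse_qs percent-decodes each value and returns a list per key;
    # we keep the first value for each key.
    return {key: values[0] for key, values in parse_qs(query).items()}

# The example registration URL from the text above.
referrer = (
    "http://example.com/register?"
    "username=GoCardinal&name=Leland%20Stanford&email=leland%40stanford.edu"
)
leaked = fields_visible_in_referrer(referrer)
for field, value in leaked.items():
    print(field, "=", value)
```

Any embedded third party that logs the referring URL can run the equivalent of this at any later time, which is why a single leaky URL is enough.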

Another common form of leakage is through the page title. Suppose a website's landing page includes a title tag of:
 

Welcome, Leland Stanford!

Embedded third-party scripts often report back with the page title; in this case, they'd include Leland Stanford's name.

[Update 10/11: The original version of this post conflated the information OkCupid provides to Lotame and BlueKai. In the interest of complete accuracy, and in response to both a deluge of questions on OkCupid's intentional leakage and a note from BlueKai seeking clarification, I have updated this section with per-company intentional leakage. I have also included the results of a leakage test (with the methodology described below) on OkCupid. My apologies to BlueKai for the incorrect implication that it collects the same sensitive profile data that Lotame does. The ambiguous discussion was solely my error.]

Leakage, in common parlance, implies unintentionality. In computer security, leakage is a term of art for an information flow – some instances of leakage are entirely intentional. For example, OkCupid, a free online dating website, appears to sell user profile information to the data providers BlueKai and Lotame. To learn which profile information OkCupid leaks, I modified each field of a profile and observed how values sent to the two companies changed. Here's what the companies appeared to receive:

Age - Both
Cats - Both
Children - Both
Country - Both
Dogs - Both
Drinking Frequency - Lotame
Drug Use Frequency - Lotame
Education - Both
Ethnicity - Lotame
Gender - Both
Income - Both
Job Sector - Both
Language Proficiencies - BlueKai
Relationship Status - Lotame
Religion - Lotame
Smoking Frequency - Lotame
State - Both
ZIP Code - Both

(I also ran the leakage test described below on OkCupid. The username was sent to 27 third-party PS+1s (defined below), including crwdcntrl.net (Lotame) and bluekai.com (BlueKai). Since OkCupid does not limit who can see a profile – a user can only require that visitors be logged in – a username provides access to a user's entire profile.)
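
The field-by-field test described above amounts to a diff: record the request sent to a data provider, change one profile field, record the request again, and see which parameters moved. The following is a rough sketch of that comparison under assumed inputs. The host `tracker.example` and the parameter names `seg` and `g` are hypothetical and do not reflect what OkCupid, BlueKai, or Lotame actually send.

```python
from urllib.parse import urlsplit, parse_qsl

def changed_parameters(before_url: str, after_url: str) -> dict:
    """Map each query parameter whose value differs to (before, after)."""
    before = dict(parse_qsl(urlsplit(before_url).query))
    after = dict(parse_qsl(urlsplit(after_url).query))
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(before.keys() | after.keys())
        if before.get(key) != after.get(key)
    }

# Hypothetical requests captured before and after editing the profile's age.
before_url = "http://tracker.example/px?seg=age_30&g=m"
after_url = "http://tracker.example/px?seg=age_31&g=m"
diff = changed_parameters(before_url, after_url)
print(diff)
```

A parameter that changes in step with a single profile field is strong evidence that the field is being passed along, even when the parameter name itself is opaque.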

In a series of groundbreaking studies, Balachander Krishnamurthy, Craig Wills, and Konstantin Naryshkin have demonstrated that information leakage is a pervasive problem (1, 2, 3). In their most recent paper, the authors examined signup and interaction with 120 popular sites for information leakage to third parties. They found that 56% leaked some form of private information, and 48% leaked a user identifier.

We roughly followed the same methodology as Krishnamurthy, Wills, and Naryshkin, with 1) a focus on identifying information leakage, 2) a greater number of sites, and 3) a public dataset.

Usernames as Identifying Information

Given the sizeable role usernames play in web information leakage, it's worth taking a moment to note how a username is identifying information. In some cases a username is just a user's name – for example, @jonathanmayer on Twitter. Even when it isn't the user's name, a username is often more than adequate for identifying a user.

First, a username is likely sufficient to link accounts across websites. Users routinely reuse their usernames – after all, who's going to remember a new login for each site they use? In a paper at PETS 2011, Daniele Perito et al. examined a sample of public data from Google, eBay, and other sites to estimate how linkable usernames are. They found that the vast majority of usernames in their sample had high entropy, and that simple algorithms for linking usernames could achieve pairwise precision and recall of over 70%. (For further discussion of using usernames to link social profiles, see Arvind's blog posts "The Linkability of Usernames" and "Lendingclub.com: A De-anonymization Walkthrough," as well as "Modeling Unintended Personal-Information Leakage from Multiple Online Social Networks" and "Large Online Social Footprints - An Emerging Threat" by Danesh Irani et al.) Some companies are already linking usernames in their products, including social matching services (e.g. Infochimps), scraped profiles (e.g. Spokeo), and automated social network linkage (e.g. Google Social Search).3

Second, combining data from multiple accounts often provides a sufficiently comprehensive mosaic to identify an individual.4 Arvind, for example, usually goes by the username "randomwalker." The first page of a Google search turned up his Y Combinator Hacker News account, which includes his job and links to his personal website, blog, and Twitter account.

Some websites (e.g. Quantcast) have responsibly recognized that a username is identifying information and have included username in their legal definition of "personally identifiable information" (PII).

Methodology

We examined each website in the Quantcast top 250, checking whether it

  • offered a sign up,
  • did not require a purchase or other qualification to sign up, and
  • did not include so many features as to be impractical for study.

For each of the 185 websites that met all three criteria, we used the FourthParty web measurement platform to create an account and interact with the site.5 We emphasized exploring content that dealt with a user's identity, such as profile and settings pages. After collecting data, we searched Request-URIs and Referrer headers for known personal information. We treated each public suffix + 1 (PS+1) as an independent entity, and we considered any PS+1 different from a first party's to be a third party.6
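
The search step can be sketched roughly as follows. This is an illustration of the approach under simplifying assumptions, not the FourthParty analysis code: the tiny `PUBLIC_SUFFIXES` set stands in for the full Public Suffix List, the function names are hypothetical, the third-party hosts are made up, and only plain and percent-encoded matches are detected.

```python
from urllib.parse import urlsplit, unquote_plus

# Assumption: a small stand-in for the real Public Suffix List.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}

def ps_plus_one(host: str) -> str:
    """Return the public suffix plus one label, e.g. www.example.com -> example.com."""
    labels = host.lower().rstrip(".").split(".")
    # Scanning left to right finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host.lower()

def find_leaks(first_party_host, requests, known_values):
    """Return (third-party PS+1, label) pairs where a known value appears.

    requests is an iterable of (request_url, referrer) pairs.
    """
    first_party = ps_plus_one(first_party_host)
    leaks = set()
    for request_url, referrer in requests:
        recipient = ps_plus_one(urlsplit(request_url).hostname or "")
        if recipient == first_party:
            continue  # same PS+1 as the first party: not a third party
        # Decode percent-encoding so encoded referrers inside a Request-URI match too.
        haystack = unquote_plus(request_url + " " + referrer).lower()
        for label, value in known_values.items():
            if value.lower() in haystack:
                leaks.add((recipient, label))
    return leaks

# Hypothetical log entries for a fictional persona.
requests = [
    ("http://beacon.tracker-example.net/b?ref=http%3A%2F%2Fexample.com%2Fuser%2FGoCardinal",
     "http://example.com/user/GoCardinal"),
    ("http://static.example.com/logo.png",
     "http://example.com/register?email=leland%40stanford.edu"),
    ("http://stats.analytics-example.org/hit.gif",
     "http://example.com/register?email=leland%40stanford.edu"),
]
known = {"username": "GoCardinal", "email": "leland@stanford.edu"}
leaks = find_leaks("www.example.com", requests, known)
print(sorted(leaks))
```

Matching on known values is why a persona with unusual biographical details helps: a distinctive username or email address is unlikely to appear in a URL by coincidence.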

Results

A complete spreadsheet of results is available in Excel format. We encourage interested readers to examine the results for themselves. [Update 10/22: Before consulting the spreadsheet, please be sure to read Footnote 6 to understand the limitations of our methodology.] Please email if you would like FourthParty logs for a specific site.

The most frequent type of leakage was a username or user ID.7 We identified username or user ID leakage to a third party on 113 websites, 61% of the websites in our sample. The top five PS+1 recipients of username and user ID leakage were:
 

  1. scorecardresearch.com (comScore), on 81 (44%) of the websites in our sample
  2. google-analytics.com (Google Analytics), on 78 (42%) of the websites in our sample
  3. quantserve.com (Quantcast), on 63 (34%) of the websites in our sample
  4. doubleclick.net (Google Advertising), on 62 (34%) of the websites in our sample
  5. facebook.com (Facebook), on 45 (24%) of the websites in our sample
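
The percentages above are each count divided by the 185 websites in the sample, rounded to the nearest whole percent. A quick check of that arithmetic, using only the counts reported in this post:

```python
SAMPLE_SIZE = 185  # websites that met all three criteria

# Websites on which each recipient was observed receiving a username or user ID.
counts = {
    "scorecardresearch.com": 81,
    "google-analytics.com": 78,
    "quantserve.com": 63,
    "doubleclick.net": 62,
    "facebook.com": 45,
}

shares = {host: round(100 * n / SAMPLE_SIZE) for host, n in counts.items()}
overall = round(100 * 113 / SAMPLE_SIZE)  # any third-party username or user ID leak

for host, pct in shares.items():
    print(f"{host}: {pct}%")
print(f"overall: {overall}%")
```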

Some websites leaked the username or user ID to dozens of third parties. For example, popular photo sharing website Photobucket embeds username in many of its URLs, and includes advertising on most of its pages; we observed the username get sent to 31 third-party PS+1s.

Other identifying information leaked in a number of instances. A sample:
 

  • Viewing a local ad on the Home Depot website sent the user's first name and email address to 13 companies.
  • Entering the wrong password on the Wall Street Journal website sent the user's email address to 7 companies.
    [Update 10/11: A number of readers have written in noting that the Wall Street Journal leak is not in our spreadsheet. We identified the Wall Street Journal leak in a different browsing session from the one reported in the spreadsheet – and by accident. In the interest of consistency – we did not test logging out and logging back in on other sites, nor logging in with the wrong password – we decided to discuss the leak in our post but not our spreadsheet.]
  • Changing user settings on the video sharing site Metacafe sent first name, last name, birthday, email address, physical address, and phone numbers to 2 companies.
  • Signing up on the NBC website sent the user's email address to 7 companies.
  • Signing up on Weather Underground sent the user's email address to 22 companies.
  • The mandatory mailing list page during CNBC signup sent the user's email address to 2 companies.
  • Clicking the validation link in the Reuters signup email sent the user's email address to 5 companies.
  • Interacting with Bleacher Report sent the user's first and last names to 15 companies.
  • Interacting with classmates.com sent the user's first and last names to 22 companies.

Implications

From a legal perspective, identifying information leakage is a debacle. Many first-party websites make what would appear to be incorrect, or at minimum misleading, representations about not sharing PII. Here are some examples.

The Home Depot:

Personal Information Disclosure: The Home Depot will not trade, rent or sell your personal information, without your prior consent, except as otherwise set out herein. [Does not describe sharing with third-parties for advertising or analytics.]

The Wall Street Journal:

We will not sell, rent, or share your Personal Information with these third parties for such parties' own marketing purposes, unless you choose in advance to have your Personal Information shared for this purpose. Information about your activities on our Online Services and other non-personally identifiable information about you may be used to limit the online ads you encounter to those we believe are consistent with your interests. Third-party advertising networks and advertisers may also use cookies and similar technologies to collect and track non-personally identifiable information such as demographic information, aggregated information, and Internet activity to assist them in delivering advertising on our Online Services that is more relevant to your interests.

Metacafe:

Metacafe's Privacy Policy is to share personal information only with the owner's informed consent.

Likewise, a number of third-party trackers disclaim collection of personally identifiable information.8

Scorecard Research (comScore):

Does your beacon collect or store any personally identifiable information about me?
The tagging used by ScorecardResearch is unable to identify the user visiting a page.

Quantcast:

We do not tie the information gathered by Quantcast Tags to the personally identifiable information of visitors to a Web site.
. . .
We do not link Log Data to any other Personally Identifiable Information about you or otherwise attempt to discover your identity.

Google Advertising:

We don't collect or serve ads based on personally identifying information without your permission.

The better practice for all first-party and third-party websites would be to acknowledge that identifying information leakage is a fact of life on the web, and that identifying information may be shared with third parties.

As for policy, some strands of the Do Not Track debate echo a sentiment of "it's all anonymous," and so, "where's the harm?" We believe there is now overwhelming evidence that third-party web tracking is not anonymous. It is a legitimate policy question whether, on balance, Do Not Track should be enforced by law. But the difficult weighing of competing privacy risks and economics can't be short-circuited by claims of anonymity.
 


Thanks to Arvind Narayanan for comments on a draft.

[1] For purposes of this post, "identifying information" is information that with moderate probability and moderate effort can be used to identify a user. This post does not use a formulaic legal definition of "personally identifiable information" (PII), an approach that has been discredited by a growing body of computer science research. The Federal Trade Commission staff notably rejected the notion of PII in its draft privacy report last year.

[2] Some third parties encode the referring URL into their Request-URI.

[3] A username isn't, of course, all a third party has to go on. IP geolocation is another trivial source of information, and can help disambiguate when several individuals use similar usernames. How many Jonathan Mayers are there in Palo Alto, CA? Using the Stanford University network? This is a possible area for future research.

[4] While it is quite clear that in practice a username can often be used to discern a user's identity, confirmatory empirical research would be valuable.

[5] We used a fictional persona with unique biographical traits to minimize false positives.

[6] For readers who engage in detail with our data, we wish to emphasize several caveats to our methodology.

  • We did not study – and cannot study – what companies do when they receive personal information. It is likely that many of the information leaks we identified were logged. Some third parties may take precautions to prevent logging of identifying information, and we certainly laud such efforts. But for policy purposes, there is a tremendous difference between a tracking ecosystem that is anonymous and a tracking ecosystem that is suffused with identity but promises to ignore it.
  • Since some websites host content from multiple PS+1s (e.g. amazon.com and amazonaws.com), our definition of a third party introduces some false positives. That said, our findings appear to be quite robust. For example, thresholding for leakage at more than three third parties still leaves 84 websites (45%) leaking a username or user ID.
  • We did not examine POST request bodies or cookies, nor did we attempt to identify obfuscated or encrypted personal information.
  • Our interaction with websites was neither comprehensive nor representative of what the average user might do. We may have missed information leaks, and some of the information leaks we identified may have affected only a minority of users.
  • In the course of a user's browsing, identifying information for other users might leak. We did not gauge how easily a third party could identify which information was the user's. In most cases it appeared such a determination would be straightforward.
  • The regular expressions we used for matching birth year, birthday, gender, and last name had a not insignificant number of false positives. We recommend against relying solely upon those fields.
  • We did not explicitly take note of which stage of signup a leak occurred at.
  • We did not use a single sign-on (SSO) provider unless required. Where an SSO was mandatory, we manually labeled PS+1s associated with the SSO provider as first-party. Measuring information leakage when SSOs are used is a promising avenue for future research.
  • We did not attempt to discover third parties that have been CNAMEd into a first-party PS+1 (dubbed "hidden third-parties" in some papers).

[7] User IDs were, in our testing, almost always sufficient to locate at least a username, and sometimes additional identifying information. For example, with a Causes.com user ID, anyone can obtain a link to a user's Facebook profile – which in turn provides a name, photo, and possibly more.

[8] Please note: we are not claiming any company has breached its self-regulatory commitments. The Digital Advertising Alliance (DAA) online advertising self-regulation imposes lax restrictions on personally identifiable information. First, personally identifiable information is defined to only include information that is used to identify a user.

Personally Identifiable Information is information about a specific individual including name, address, telephone number, and email address—when used to identify a particular individual.

Second, the DAA principles only require noting the use of PII in a privacy policy and obtaining consent before a privacy policy change is applied retroactively to PII collected earlier.

PII is a term used primarily in two areas in the Principles and Commentary. First, PII is used in the Transparency principle so that consumers are informed specifically about the collection and use of PII for Online Behavioral Advertising purposes. Second, PII is used in this Commentary to describe a specific example of a "material" change that would require Consent from the consumer under Principle V.

Comments

What about all the traffic coming in and out of your computer, which goes through your ISP that has all your personal details, including real name and address?
What about your good old credit card that processes your soul from bottom up, including location and time of everything you buy with it? Do they sell any of these informations to third parties?
What about your e-Buddy app, which retrieves your personal emails from remote accounts using your very private usernames and passwords?
What about Skype, MSN, iChat, FaceTime (you name it), which may store a snapshot of your face during video calls? Would it be technically possible? I guess so.
Does anyone think that his or her informations, habits or whatever are not tracked by credit card companies, ISPs or even simple loyalty/membership cards?
If you can't beat them, confuse them.
p.s. the email address and name I've entered here are all wrong.

Really interesting read Jonathan. I will have to start visiting more websites on llama farming and watermelon seed spitting to really help all of these companies watching us build a hilariously inaccurate dossier on me.

I love the excel spreadsheet that was shared. And, I am glad websites are sharing my personal information with each other. You know why? Because now I can stuff them with junk data. All you have to do is every time you create an account with a first-party like cbsnews.com, enter bogus information about birthdate, address etc but keep your name unique. Let them share it with some advertiser. Now, go to another site like nba.com and do the same - all bogus info except name. Again, they share it with the advertiser. Now the advertiser has two profiles for the same unique name. You can see where this is going. The beauty of automated data collection and sharing is you can turn it against them easily.
Remember, unless the site/service needs to verify your personal info against some government issues id, always enter false information but never the same in two places. And, make sure you always login to these first-party sites and enable all their bling - cookies/whatever. The more false information they collect and share, the more messed up your profile will get in their databases.

Researcher Mayer writes this report with a foregone conclusion. This conclusion is evident from his use of terms like "lax restrictions" and ""identifying information" is information that with moderate probability and moderate effort can be used to identify a user."
I find it shocking that a school like Stanford would allow such research and conclusions without fully defining terms and methodologies. Failures include the reason for the exclusion of sites because "did not include so many features as to be impractical for study" (what does this mean?).
Also lacking is whether or not, upon sign up at the sites, Mr. Mayer agreed to the terms of use on each site that state the information would be shared.
Did Mr. Mayer go directly to the site or did he click on ads from different locations? If so did he take into account cookie-stuffing at the ad level?
Also as in all of Mr Mayer's published "research" what was the impact if the user deleted cookies, browser history or cache?
Seems Mr Mayer is trying to make himself relevant on faulty research that Stanford would be wise to fully review before allowing their name to be associated with.

Interesting that even this blog is "leaking" data about my browsing habits to Google Analytics (http://www.google.com/analytics/) and TweetMeme (http://tweetmeme.com/).
Even though you might only be sending the referring URL to these companies, there is nothing stopping them from indexing the page at a later time and learning what I'm interested in.
It's how the web works. Nothing is "free". If you're not paying for something then you're the something being sold.
However, it is important for sites to make sure they don't expose usernames and email addresses, as this is a security flaw. All reputable data collection companies don't collect or store this data. I wish this discussion would focus on this type of data leak instead.

TOR + Randomly generated user names stored in a decentralized hash table (along with their hashed passwords) would be a tweak to this reality.
This just simply needs to be done in a user friendly intuitive way that I have not seen yet.

Via FT.com:
Home Depot said it was still "researching carefully to determine if anything unusual occurred" but it believed it had not contravened its stated privacy policy, which was designed "to improve our product and service offerings and to enhance and personalise our customers' shopping experiences."

I'm also quite curious about your source for the claim that OKCupid appears to be selling "drug use frequency" data and other personal info to BlueKai and Lotame.

I was combing your spreadsheet for more on that juicy nugget you dropped in your blog post about okcupid. I cannot find okcupid (or lotame) in your spreadsheet. am I just missing something? where did that particular data point come from?
thanx
dan tynan

How do I protect myself from the tracking detailed here?

The only way to protect yourself 100% is get off the internet. No wait, that isn't full protection either :(
What you can do is create false information to confuse the databases. Different user names and logins, different birthdays etc. Have multiple throw away email addresses. Block cookies ...and if you have to accept them to use a service, mess with the data in the cookies :) and then delete the cookies when you're done. Use proxies to hide your IP when browsing.
And for goodness sake don't use sites like Facebook where you have to disclose REAL personal information. That's just dumb.
Remember that offline data goes online too. So don't use loyalty cards (or better still use other people's cards!). Don't deal with companies that ask for too much personal information.
But I'm not an expert in this. If the author of the article above wants to put together a more detailed guide I'd be delighted to read it.

You can disable your cookies and remove connections like twitter and facebook. Some sites will not allow you access if you disable your cookies.
Coming from a marketer's POV, we strive to make sure that data used for ad tracking, campaign management, and other purposes is used internally only.
Although what Jonathan has revealed here is surprising, it should be remembered that the technology is still pretty new, and hopefully there will be some regulations that hinder websites from infringing on their visitors' right to privacy. The UK is already passing laws that prohibit sites from collecting information without stating the purpose for collecting it.
-JM
