A 1993 New Yorker cartoon famously proclaimed, "On the Internet, nobody knows you're a dog." The Web is a very different place today; you now leave countless footprints online. You log into websites. You share stuff on social networks. You search for information about yourself and your friends, family, and colleagues. And yet, in the debate about online tracking, ad networks and tracking companies would have you believe we're still in the early 90s — they regularly advance, and get away with, “anonymization” or “we don’t collect Personally Identifiable Information” as an answer to privacy concerns.
In the language of computer science, clickstreams — browsing histories that companies collect — are not anonymous at all; rather, they are pseudonymous. The latter term is not only more technically appropriate, it is much more reflective of the fact that at any point after the data has been collected, the tracking company might try to attach an identity to the pseudonym (unique ID) that your data is labeled with. Thus, identification of a user affects not only future tracking, but also retroactively affects the data that’s already been collected. Identification needs to happen only once, ever, per user.
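To make the retroactive effect concrete, here is a minimal Python sketch. The log format, the pseudonym IDs, the example URLs, and the function names are all hypothetical and are not drawn from any tracker's actual schema. It only illustrates that a single identification event exposes everything previously logged under that pseudonym.

```python
from collections import defaultdict

# Hypothetical pseudonymous log: (pseudonym_id, url) pairs, with no names anywhere.
clickstream_log = [
    ("id-35c1", "news.example/local-election"),
    ("id-9f02", "shop.example/cart"),
    ("id-35c1", "health.example/symptoms"),
    ("id-35c1", "jobs.example/search"),
]

def history_by_pseudonym(log):
    """Group logged URLs under the pseudonym they were recorded with."""
    histories = defaultdict(list)
    for pseudonym, url in log:
        histories[pseudonym].append(url)
    return dict(histories)

def identify(histories, pseudonym, identity):
    """Attach an identity to one pseudonym and return that person's full past history."""
    return {identity: histories.get(pseudonym, [])}

histories = history_by_pseudonym(clickstream_log)

# One identification event, at any later time, labels every earlier record.
exposed = identify(histories, "id-35c1", "jdoe@email.com")
print(exposed)
```

Nothing about the stored log has to change for this to work, which is why the pseudonym offers no protection once the mapping exists.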
Will tracking companies actually take steps to identify or deanonymize users? It’s hard to tell, but there are hints that this is already happening: for example, many companies claim to be able to link online and offline activity, which is impossible without identity.
Regardless, what I will show you is that if they’re not doing it, it’s not because there are any technical barriers. Essentially, then, the privacy assurance reduces to: “Trust us. We won’t misuse your browsing history.” Maybe you’re OK with those terms, maybe you’re not; but it is important to know that those are exactly the terms you’re getting. [1]
Here are five concrete ways in which your identity can be attached to data that was initially collected without identifying information.
1. The third party is sometimes a first party
Most of the companies with the biggest reach in terms of third-party tracking, such as Google and Facebook, are also companies that users have a first-party relationship with. When you visit these sites directly, you’re giving them your identity, and there is no technical barrier to them associating your identity with your clickstream that they’ve collected in the third-party context. In some cases you don’t even have to visit their sites directly. Many embeddable widgets like Disqus allow you to log in seamlessly on pages in which they are embedded.
Note that some third-party functionality such as personalized social widgets necessarily requires your identity, which means there isn’t even a question of collecting the data anonymously. However, at least for buttons such as ‘Like’ or ‘+1’, it is possible to avoid tracking unless and until the user clicks on it, as outlined in the Do Not Track cookbook and implemented by the ShareMeNot extension. Unfortunately, the companies in question have not adopted this practice.
The problem is exacerbated by the “Facebook loophole”: the company argues that this is what users actually expect, i.e., that it is fair to track them across the web because they have a first-party relationship with Facebook. This flawed argument has made its way to Washington, and is reflected in the Kerry-McCain bill, for example.
Google says it keeps “some information” for “usually less than two weeks” from ‘+1’ logs, and Facebook says it deletes the data after 90 days. While I appreciate these policies (especially Google’s), I must reiterate the point I made at the beginning: these privacy assurances boil down to trust. There are no technical obstacles to persistent tracking of identified users via these widgets.
2. Leakage of identifiers from first-party to third-party sites
In a paper published just a few months ago, Balachander Krishnamurthy, Konstantin Naryshkin and Craig Wills exposed the various ways in which users’ information can and does leak from first parties to third parties. Fully three-quarters of sites leaked sensitive information or user IDs. There are at least four mechanisms by which identity is leaked:
- Email address or user ID in the Referer header
- Potentially identifying demographic information (gender, ZIP, interests) in the Request-URI
- Identifiers in shared cookies resulting from “hidden third-party” servers
- Username or real name in page title
There are quite a few technical details here, so to simplify, let’s look at an example from their paper, illustrating the first category above:
GET http://ad.doubleclick.net/adj/...
Referer: http://submit.SPORTS.com/...?email=jdoe@email.com
Cookie: id=35c192bcfe0000b1...
The user is browsing a page on sports.com which contains the user’s email address in the URL. To fetch an ad embedded on the page, the browser sends a request to the Doubleclick (Google) server, containing both the email address (as part of the referrer), and the “anonymous” Doubleclick identifier (as part of the cookie), thus creating an association between the two.
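The association takes very little work on the receiving end. The following Python sketch uses the header values from the example request, including its elisions. The function name and the assumption that the address arrives in a query parameter called "email" are mine, for illustration only; this is not a description of what any ad server actually runs.

```python
from urllib.parse import urlsplit, parse_qs

def link_cookie_to_email(headers):
    """Return (cookie_id, email) if one request exposes both, else None."""
    # The email address rides along in the query string of the Referer URL.
    query = parse_qs(urlsplit(headers.get("Referer", "")).query)
    emails = query.get("email", [])

    # The "anonymous" identifier is the value of the tracker's own cookie.
    cookies = dict(
        part.strip().split("=", 1)
        for part in headers.get("Cookie", "").split(";")
        if "=" in part
    )
    cookie_id = cookies.get("id")

    if emails and cookie_id:
        return cookie_id, emails[0]
    return None

# Header values copied from the example above (elisions left as they appear).
request_headers = {
    "Referer": "http://submit.SPORTS.com/...?email=jdoe@email.com",
    "Cookie": "id=35c192bcfe0000b1...",
}
link = link_cookie_to_email(request_headers)
print(link)
```

A single request is enough: once the pair is recorded, every other request carrying the same cookie can be attributed to the same email address.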
There are two important points to note. First, this is an extremely deep-rooted problem and largely a result of ignorance and carelessness on the part of first-party sites. Unlike, say, cross-site scripting, there are neither good technical tools to detect instances of identifier leakage through referrer headers and other means, nor a widespread realization that the problem even exists. [2] While many ad networks take steps to sanitize referer headers, this brings us back to the fact that privacy comes down to trust.
Second, site user IDs by themselves, even without email addresses or real names, are increasingly equivalent to identities in the modern web environment. This is because it is possible to hop from identities on one site to another using cross-site identity mapping databases and APIs. [3]
3. The third party buys your identity
Ever seen one of those “Win a free iPod!” surveys? The business model for many of these outfits, going by the euphemism “lead-generation sites,” is to collect and sell your personal information. Increasingly, these sites have ties with tracking companies. Other types of companies like direct-marketing lists and consumer data exchanges/aggregators could also play the role described in this section, but for simplicity I will focus on survey sites.
When you reveal your identity to a survey site, there are two ways in which it could get associated with your browsing history. First, the survey site itself could have a significant third-party presence on other sites you visit. When you visit the survey site and sign up, they can simply associate that information with the clickstream they've already collected about you. Later on, they can also act as an identity provider to sites on which they have a third-party presence.
Alternately, they could pass on your identity to trackers that are embedded in the survey site (via the methods listed in the previous section), allowing the tracker to link your identifying information with their cookie, and in turn associate it with your browsing history. In other words, the tracker has your browsing history, the survey site has your identity, and the two can be linked via the referrer header and other types of information leakage.
4. Hacks
In previous articles, I’ve described how a variety of browser and server-side bugs can be exploited to discover users’ social identities: via a bug in Firefox’s error object, a bug in Google spreadsheets, “history stealing” (a.k.a. “history sniffing”), history stealing again, and bugs in Facebook Instant Personalization partner sites. The known bugs have all been fixed, but computer security is a never-ending process of finding and fixing bugs.
Indeed, research out of CMU Silicon Valley subsequent to the above articles has shown how users can be identified by exploiting ‘Likejacking’ bugs. While this does require tricking the user into making a single click, the types of bugs involved are fundamentally harder to fix.
One might wonder if tracking companies will resort to exploiting software bugs, but recent revelations by Jonathan Mayer and his team have shown that this is not outside the realm of possibility.
5. Deanonymization
So far I’ve talked about identifying a user when they interact with the third party directly or indirectly. However, if the mountain of deanonymization research that has accumulated in the last few years has shown us one thing, it is that the data itself can be deanonymized by correlating it with external information, specifically facets of users’ browsing history that they occasionally choose to reveal publicly. I’ve explained this attack in a slightly different context, but it is even easier for a tracking company sitting on a database of clickstreams.
The logic is straightforward: in the course of a typical day, you might comment on a news article about your hometown, tweet a recipe from your favorite cooking site, and have a conversation on a friend’s blog. By these actions, you have established a public record of having visited these three specific URLs. How many other people do you expect will have visited all three, and at roughly the same times that you did? With a very high probability, no one else. This means that an algorithm combing through a database of anonymized clickstreams can easily match your clickstream to your identity.
And that’s in the course of a single day. Don’t forget that tracking logs usually stretch to months and years.
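The matching logic described above can be sketched in a few lines of Python. Everything here is hypothetical: the pseudonyms, the URLs, the timestamps (hours into a single day), and the one-hour tolerance. It is also a deliberately simplified exact-match version. The published deanonymization research uses statistical matching that tolerates noise and missing records, which this sketch does not attempt.

```python
# Hypothetical pseudonymous clickstreams: pseudonym -> {url: visit time in hours}.
clickstreams = {
    "id-a1": {
        "news.example/hometown": 9.0,
        "cooking.example/recipe": 13.0,
        "blog.example/post": 20.0,
    },
    "id-b2": {"news.example/hometown": 9.5, "sports.example/scores": 12.0},
    "id-c3": {"cooking.example/recipe": 2.0, "blog.example/post": 21.0},
}

# One person's public trail: URLs they visibly interacted with, and when.
public_trail = {
    "news.example/hometown": 9.1,    # commented on a news article
    "cooking.example/recipe": 13.2,  # tweeted a recipe
    "blog.example/post": 20.5,       # conversation on a friend's blog
}

def matching_pseudonyms(clickstreams, public_trail, tolerance_hours=1.0):
    """Return pseudonyms whose logs contain every public visit at about the same time."""
    matches = []
    for pseudonym, visits in clickstreams.items():
        if all(
            url in visits and abs(visits[url] - when) <= tolerance_hours
            for url, when in public_trail.items()
        ):
            matches.append(pseudonym)
    return matches

# Each additional public URL shrinks the candidate set.
matches = matching_pseudonyms(clickstreams, public_trail)
print(matches)
```

With only the news article as evidence, two of the three pseudonyms remain candidates; with all three public actions, a single pseudonym is left, and the rest of that clickstream now has a name attached.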
Conclusion
As a computer scientist, I find it unfortunate that the misconception that non-collection of “PII” renders clickstreams safe to collect and store is still being peddled in policy circles, long after the technical community has concluded that this is not the case. It’s time we stopped accepting this excuse and started a more honest discussion of the privacy implications of online tracking.
[1] When you decide to trust a company’s policies, you’re also trusting that those policies will not be subverted by a rogue employee and that the company’s systems will not be hacked, to say nothing of the possibility of government entities demanding access to the data.
[2] For an example of research on technical tools in the mobile context, see TaintDroid.
[3] To be perfectly clear, identity mapping is not the culprit here; it’s just a technology that has both positive and negative uses.
Big thanks to Ashkan Soltani for sharing his expertise on survey sites and to Jonathan Mayer and Ryan Calo for comments on a draft. Any errors are my own.