The Trouble with ID Cookies: Why Do Not Track Must Mean Do Not Collect

Co-authored with Jonathan Mayer.

The debate over the meaning of Do Not Track has raged for well over a year now. The primary forum is the W3C Tracking Protection Working Group, with frequent sparring in the press and capitals worldwide. There are, broadly, two Do Not Track proposals: one chiefly backed by the ad industry, and another advanced by privacy advocates [1]. These proposals reflect vastly different visions for Do Not Track with vastly different practical consequences. The two sides have unsurprisingly been at loggerheads, with scant movement towards resolution of the key issues.

The ad industry position is, and has been for over a decade, that data collection and retention should be largely unfettered so long as they are associated with a permitted business use [2]. At present these permitted-use exceptions totally swallow the rule, in practice barring little more than behavioral advertisement targeting (1, 2). (Critics often deride the status quo as “Do Not Target.”) A recent proposal by Yahoo! would add, in our view, only modest transparency requirements to the industry position.

But suppose the advertising industry were to meaningfully tighten its permitted uses and retention periods. Would privacy advocates, academics, and policymakers continue to object?

Yes. The industry approach to Do Not Track entirely misses the most serious privacy concerns associated with tracking, including:

Sensitive information. A user’s browsing history can include remarkably sensitive information, such as medical conditions and financial challenges (e.g. 1, 2). Individual users are often identified or easily identifiable (1, 2, 3).

Lack of consumer control. Users are generally unaware of who’s tracking them and how. Existing consumer control tools are difficult to discover and use, and they vary significantly in effectiveness.

Lack of market pressure. Since consumers are unaware of and lack control over tracking, third-party websites are under limited pressure to implement adequate security and privacy protections. Furthermore, many third parties are small, young, growth-oriented companies; security and privacy are not priorities.

Surveillance. Government requests for data stored in the cloud are becoming a regular occurrence, and many companies hand over data in response to such requests without informing users. If ad companies’ claims about the inferential power of tracking data are correct, then the potential for surveillance is correspondingly worrisome.

A toughened version of the industry’s position would also have significant practical shortcomings.

Fragile. Many systems are configured for comprehensive logging by default. It takes only the slightest oversight to begin unintentionally amassing data.

Unverifiable. There is no straightforward way to externally test whether a company is limiting its information retention and use [3].

Lock-in. As the online economy and its technology infrastructure change, use-based definitions are likely to become dated. A rigid use-based approach could lock in current advertising business practices, stifling innovation, or motivate some companies to bend the rules and justify tracking for an ever-expanding set of uses.

The privacy advocates’ definition of Do Not Track takes a much different tack: it would allow (just about) any third-party business practice, so long as it does not impose the privacy risk of collecting a user’s browsing history. A cookie that remembers a language preference would be allowed, for example, while a unique ID cookie would not be allowed [4].
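To make the distinction concrete, here is a minimal Python sketch of how one might tell the two kinds of cookie apart from the outside. It assumes a hypothetical measurement setup: the same page is loaded in several fresh browser profiles, and the cookies a third party sets in each are recorded. The function name, the threshold, and the sample cookie names are illustrative assumptions, not part of either proposal. A real classifier would need to be considerably more careful.

```python
from collections import defaultdict


def classify_cookies(profiles, max_distinct_fraction=0.5):
    """Label each cookie name as low-entropy or a possible unique ID.

    profiles: list of dicts, one per fresh browser profile, mapping
              cookie name -> value set by the third party (hypothetical data).
    max_distinct_fraction: assumed cutoff; a cookie whose values differ
              across more than this fraction of profiles is flagged.
    """
    values_by_name = defaultdict(list)
    for cookies in profiles:
        for name, value in cookies.items():
            values_by_name[name].append(value)

    labels = {}
    for name, values in values_by_name.items():
        if len(values) < 2:
            # One observation cannot show whether values vary per user.
            labels[name] = "insufficient data"
            continue
        distinct_fraction = len(set(values)) / len(values)
        if distinct_fraction > max_distinct_fraction:
            labels[name] = "possible unique ID"
        else:
            labels[name] = "low-entropy"
    return labels


# Hypothetical observations from four fresh profiles.
observed = [
    {"lang": "en", "uid": "a81f03c2"},
    {"lang": "en", "uid": "77c2e910"},
    {"lang": "fr", "uid": "0b9e44d7"},
    {"lang": "en", "uid": "f3d0a6b5"},
]
print(classify_cookies(observed))
```

A language preference takes only a handful of values across all users, so it reveals little about any one of them. A value that differs in every profile is what allows a browsing history to be linked to one browser.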

The advocates’ solution avoids the shortcomings of the ad industry approach, and is particularly elegant for two reasons.

Privacy-preserving alternatives. There are simple technological solutions to implement most or all current advertising ecosystem functionality, as we have detailed in the “Tracking Not Required” series (overview talk, frequency capping, behavioral targeting, measurement). Shifting to these architectures would involve switching costs, and in some use cases they would underperform current implementations. That said, we believe it’s quite reasonable for ad companies to incur these minor burdens in exchange for the significant privacy benefits.

Verifiable. Tracking carried out in violation of this interpretation of DNT is externally detectable. This is a crucial point. Some tracking techniques store a unique ID in a user’s device (“supercookies”); others read attributes from a user’s device that, in combination, become unique (“fingerprinting”). Both approaches require accessing browser functionality in a manner that is, in principle, detectable.

It would also be detectable in practice: a “Web Privacy Measurement” community has sprung up that has the tools and motivation to police the web for DNT violations. Automated external detection will never achieve 100% accuracy, but it has proven highly effective at flagging possible privacy-violating information flows for manual inspection by analysts. In the worst case, it provides a basis of suspicion for regulators to conduct audits, whereas with the use-based approach audits would essentially have to be conducted blindly. As long as there is a significant chance that violators will be caught, external policing will have a strong deterrent effect. Companies will be both disincentivized from intentionally gaming DNT and incentivized to institutionalize practices that ensure compliance [5].
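As a rough illustration of what such automated flagging might look like, the Python sketch below scans a hypothetical crawl log for third parties that receive the same long value from a single browser profile across multiple first-party sites. The log format, function name, and thresholds are assumptions made for this example and do not describe any particular measurement tool. The output is a list of candidates for manual inspection, not a finding of non-compliance.

```python
from collections import defaultdict


def flag_cross_site_ids(requests, min_sites=2, min_length=8):
    """Flag values that one third party receives on several first-party sites.

    requests: iterable of (first_party, third_party, value) tuples recorded
              while one browser profile, sending DNT, visits many sites
              (hypothetical log format).
    min_sites: assumed number of distinct first parties needed to flag.
    min_length: assumed crude filter that skips short values such as
              language codes, which cannot be unique per user.
    """
    sites_seen = defaultdict(set)
    for first_party, third_party, value in requests:
        if len(value) >= min_length:
            sites_seen[(third_party, value)].add(first_party)

    return sorted(
        (third_party, value, sorted(first_parties))
        for (third_party, value), first_parties in sites_seen.items()
        if len(first_parties) >= min_sites
    )


# Hypothetical crawl log.
crawl_log = [
    ("news.example", "ads.example", "9f8a7b6c5d4e"),
    ("shop.example", "ads.example", "9f8a7b6c5d4e"),
    ("news.example", "cdn.example", "en"),
    ("shop.example", "cdn.example", "en"),
    ("news.example", "widgets.example", "abcdef123456"),
]
for hit in flag_cross_site_ids(crawl_log):
    print(hit)
```

In this sample only the value sent to ads.example on both sites is flagged. An analyst would then check whether it is in fact a per-user identifier, for instance by repeating the crawl with a fresh profile and comparing values.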

In conclusion, the Do Not Track negotiations are nearing an impasse, while third-party tracking continues at unprecedented scale. If advertising companies and other third parties don’t step up to the plate, browser vendors and regulators will likely turn to heavy-handed alternatives. We reiterate our belief that a collection-based definition of Do Not Track combined with a deployment of client-side functionality is the ideal outcome for all stakeholders.

[1] The proposal is co-authored by Jonathan Mayer, who is also one of the authors of this post.

[2] The paper “Third-Party Web Tracking: Policy and Technology” includes an expanded explanation of industry self-regulatory initiatives.

[3] This CMU CyLab study is one of many demonstrating widespread non-compliance with stated policies.

[4] Protocol information (including IP address and User-Agent string) could still be collected and retained for a short duration. This assuredly introduces some privacy risk, but it is much smaller than the risk associated with long-term retention of uniquely identifying information.

[5] Some smaller players, especially those located in jurisdictions where there is no potential legal liability for non-compliance, might simply ignore DNT. The dynamics of the online advertising market mitigate the privacy risks associated with these companies; reputable first-party websites are unlikely to deploy these services. Furthermore, some technical countermeasures (i.e. blocking) are possible against non-compliant companies. The more privacy-forward browser vendors might even choose to enable countermeasures by default.