People Like You

Cross-posted from Yale Journal of Law & Technology.

For more than a decade, the policy debates around informational privacy have focused on the fickle notion of identifiability. Companies and government agencies sought to collect and use personal information to deliver services, improve products and conduct research, while at the same time protecting individuals’ privacy by de-identifying (anonymizing) their data. Surely, by reliably unlinking personal information from individual identities, organizations could reduce the privacy impact of their actions. Alas, once viewed as a silver bullet allowing organizations to have their cake and eat it too, de-identification has come under increasing pressure from scientists who have demonstrated clever ways to re-identify data, to the point that critics now regard it as largely discredited.

The de-identification debate is anchored in a perception of privacy focused on protecting identity. As such, it has diverted policymakers’ attention from a central privacy problem of the big data age: the ability of organizations to draw highly sensitive conclusions about you without exposing your identity, by mining information about “people like you.” In fact, vendors in data-rich industries such as ad tech and financial services frequently assert that they do not seek to know any specific individual’s identity; rather, they aim to target goods and services at individuals or groups satisfying a certain set of characteristics, say, 25-to-34-year-old white males living in zip code 10012, who make more than $140,000 per year and enjoy binge-watching comedy shows and eating mochi.

In this environment of algorithmic decision making driven by data mining, the main privacy issue is not identity but rather inference. When we see an individual walking into an AA meeting, we assume he or she has a drinking problem. Can we compel a computer to “un-know” what’s known to prevent it from drawing a similar conclusion?

It’s therefore important to crystallize the specific problems raised by inference-led automated decision making:

Privacy

In an influential Yale Law Journal article from 1980, Ruth Gavison explained that modern-day privacy infringements manifest in the commodification of individuals. Arguably, the grouping of individuals into nameless profiles, which are bought, sold and traded on automated data exchanges, inflicts no less of a dignitary harm than identifying individuals by name. Moreover, inference is reductionist by nature. We are, or at least would like to think we are, more than just a collection of attributes (relationship status, geo-location map, music and movies “liked”), more than “human machines” whose behavior can easily be analyzed, predicted and categorized into neat profiles. As one critic wrote, algorithmic inferences, which often underlie online behavioral advertising, at best derive “a bad theory of me.”

Due process

According to a recent story in the New York Times, Ben Bernanke, fresh from two four-year terms as chairman of the Federal Reserve, was denied a loan to refinance his home mortgage. Reported to be earning $250,000 for giving a single speech, Bernanke was turned down for a loan on an $800,000 home because the credit reporting system flagged him as someone who had just changed jobs. To be sure, Bernanke could probably remedy his personal predicament by identifying himself to his bank’s loan officer, but other borrowers who did not serve as chairman of the Fed may not be so lucky. Recognizing the risk of arbitrary, opaque, and potentially misguided automated decision making, scholars have called for the institution of digital due process rights, including transparency of decisional criteria, access to underlying information, and maintenance of audit trails.

Discrimination

As Solon Barocas and Andrew Selbst recently explained, “By definition, data mining is always a form of statistical discrimination. Indeed, the very purpose of data mining is to provide a rational basis upon which to distinguish between individuals and to reliably confer to the individual the qualities possessed by those who seem statistically similar.” Unfortunately, such discrimination “could reproduce existing patterns of discrimination in society; inherit the prejudice of prior decision makers; or simply reflect the widespread biases and inequalities that persist in society.” The authors note that perversely, data mining could result in “exacerbating existing inequalities by suggesting that historically disadvantaged groups actually deserve less favorable treatment.”

Consider Latanya Sweeney’s research demonstrating that Google searches for black-sounding names were more likely to return contextual advertisements for arrest records than similar searches for white-sounding names. This disturbing result was an artifact of existing social biases, which were reflected in the learning process of the Google search algorithm.

Similarly, the selection of training data – the data of known cases used to create an algorithm – can have a negative impact. For example, Boston’s adoption of Street Bump, an app using the motion-sensing capabilities of smartphones to automatically report road potholes to the city, led to the unintended consequence of diverting resources from poor to wealthier neighborhoods. The discrepancy resulted from the unequal distribution of smartphones and app usage across the population: wealthier neighborhoods had more smartphone and app users than poorer ones, and so generated more pothole reports.

The new technological landscape requires a recalibration of privacy policy: a shift from laws and regulations focused on whether data processes identify specific individuals to policies recognizing that, even without identification, machine-made inferences pose risks to societal values of privacy, fairness and equality.