What You Don't Say About Data Can Still Hurt You

Publication Type: Other Writing
Publication Date: November 21, 2013

Cross-posted from Forbes.

Written by Woodrow Hartzog and Evan Selinger.

Big data generates big myths. To help society set realistic expectations, the right kind of skepticism is needed.

Kate Crawford, Principal Researcher at Microsoft Research and Visiting Professor at MIT’s Center for Civic Media, does a fantastic job of explaining why folks are too optimistic about the promise of what big data can offer. She rightly argues that too much faith in it inclines us to misunderstand what data reflects, overestimate the political efficacy of information, and become insensitive to civil rights concerns.

Other forms of big data skepticism, however, go too far. This is particularly true in cases where extreme doubt is cast on the concept having any value. Prominent data scientists who believe that the novelty of big data has been exaggerated recommend that we should stop talking about it.

Take Harper Reed, the Chief Technology Officer for President Obama’s 2012 re-election campaign. He recently called big data “bullshit”, alleging it has become a buzzword that generates hype for analytics platforms that don’t actually parse legitimately big volumes of information.

Then there’s Rayid Ghani, Chief Scientist of the Obama for America 2012 campaign. Dubious about big data conversations, he remarked that “no one in the computational world talks about big data.” His point is that computer scientists recognize that “nothing fundamentally has changed in the past ten years in data analysis”. They also can appreciate basic computational truths, like the fact that he currently has “more hard disks in his apartment than the [Obama] campaign had data”.

Outside of elite computational circles, others are also skeptical. They’re advising us to accept the fact that the “term ‘big data’ is absolutely meaningless”, and to reorient conversations by focusing on specifics related to other terms: “smart data,” “data science,” “predictive analytics,” and “NewSQL”.

This cautioning is sensible, at least to some extent. Vendors can overcharge for so-called big data goods and services by capitalizing on terminological imprecision, technical misunderstanding, and fevered interest in all things big data related. As technology writer Nicholas Carr observes, given anxiety about decision-making, even useless big data analysis can be marketed as essential.

Social commentators also can cash in on big data trendiness. By invoking the term, they can add rhetorical spice to otherwise bland analysis. Satirizing the sad state of affairs, technology critic Evgeny Morozov offers the following tip: “If you have a trove of unpublishable, crappy papers, just add the words ‘Big Data’ to their titles and see them go viral”.

Ok. Let’s grant that “big data” isn’t a standardized concept and that undue authority can come from appealing to it. There’s still good reason not to be swayed by the linguistic policing. For if the skeptic’s logic were pushed to the limit and people stopped talking about “big data”, society would lose something important. Our privacy discourse would be impoverished.

Invoking “big data” is a good way to get public conversations started about privacy. Due to the massive media coverage of Snowden and the NSA, “big data” has become closely associated with “Big Brother”. Hence, the term makes it easy to sensitize folks to how privacy problems are created by increasing storage capacity and processing power, coupled with expanding access to sensitive data points that can be analyzed for hidden connections and surprising correlations.

Yes, at a certain level of abstraction, “big data” has a familiar feel. But too much emphasis on this point makes it easy to overlook the fact that legal and social change is driven by increased vulnerability, not necessarily the novelty of threats. Remember, photographs and journalists existed before Warren and Brandeis wrote their seminal article, “The Right to Privacy.” Still, the privacy revolution wasn’t triggered until handheld cameras became prevalent and the press grew increasingly curious.

Big data references also help keep privacy conversations going. Indeed, “big data” is excellent shorthand for referring to the big vulnerabilities and big problems that arise when the law progresses more slowly than modern data sets, data inputs, and data analysis techniques.

While the lack of a set definition limits what we can do by using “big data” as shorthand, it’s better than a recommended alternative, sticking to the term “data”—a decision that would sever an immediate association with the modern industrial data complex. Someone’s name and address is data. But that’s a far cry from the millions of pieces of data required for IBM’s Watson to diagnose cancer or the detailed dossiers on virtually everyone created and maintained by commercial data brokers.

Now, given the importance of “big data” for privacy conversations, it might not be a coincidence that some of the linguistic skeptics have said dismissive things about privacy.

Ghani has no qualms admitting he personally doesn’t care about privacy. Because he’s got nothing to hide, he contends that worrying about surveillance would only get in the way of enjoying the benefits of being lazy and connected. But as Daniel Solove and others have argued, this way of looking at things is overly individualistic and shortsighted. Privacy is about much more than hiding personal information. Specific and complex privacy-related concepts such as “predictive inferences,” “power differentials,” and “lack of transparency” can be quickly evoked in the mind of a listener by using the big data frame.