Big Data and the Perceptual Divide

There's an old joke that goes like this: “There are only 10 types of people in the world: those who understand binary and those who don't.” Like most old jokes, it's built around a kernel of truth. If you cram enough training in mathematics and science into a person's brain, it changes not just how they think, but how they see the world. It's hard to overstate just how deep this shift goes, but it's akin to the “overview effect” experienced by astronauts during spaceflight, in which suddenly seeing the planet from a different perspective induces a profound sense of oneness and connection. But for engineers and other types of data scientists, I suspect that the effect goes in the opposite direction. It seems like there's an inclination among some who work with large bodies of data, be they NSA cryptologists or Facebook researchers, to view their data as something separate from the individual citizens and consumers that those data points represent. And I believe that this disconnection goes a long way towards explaining the tensions in the modern big data world.

Last weekend, the Washington Post published a barnburner of a story analyzing the contents of 160,000 actual communications intercepted by the NSA under §702 of the FISA Amendments Act. The piece's headline focused on the fact that, according to the Post's calculations, only about 10% of the communications were sent by the targets of NSA surveillance. The rest had been sent by others who were not the first-order targets of the NSA's surveillance. Moreover, it seems like this 10% number actually might be a few orders of magnitude too high. That is, the NSA might gather tens or hundreds or thousands of times more communications from those one “hop” away from a target than they do from the target himself. And with the Post's cache representing only a minuscule fraction of the communications collected by the NSA, the true scale of government surveillance is quite literally beyond human capacity to visualize. There is nothing in our daily experience that lets us imagine what tens or hundreds of billions of communications look like. It's no wonder that the continuing revelations have left civil libertarians gobsmacked by the government's behavior. The base biological and neurological architecture of our minds almost requires that we feel this way in the face of such enormity.

But let's step back for a moment and try to see it from another angle, one in which the observer has had her brain rewired by spending most of her adult life studying mathematics and computer science. First, understand that terrorism in general is an incredibly rare activity. More Americans are crushed to death by their own furniture than are killed by terrorist attacks. And of the few terrorist attacks that there have been, effectively no two follow the same narrative. There's no characteristic pattern that's universal to terrorist attacks, which is why the idea of “connecting the dots” is more often fantasy than fact. This difficulty doesn't just apply to counterterrorism; it applies to all of the problems in which we employ the tools of the intelligence community. Yet despite these astronomical odds, and even within the limited cache the Washington Post analyzed, there was information that was of undeniable national security value:

Among the most valuable contents — which The Post will not describe in detail, to avoid interfering with ongoing operations — are fresh revelations about a secret overseas nuclear project, double-dealing by an ostensible ally, a military calamity that befell an unfriendly power, and the identities of aggressive intruders into U.S. computer networks.

From an NSA analyst's point of view, that this system works at all is proof of its efficacy and necessity. Being able to cull any useful information from the massive, turbulent sea of global communication should be unbelievably difficult, like neutrino detection. But somehow it works, and it works repeatedly. Looked at from this perspective, the scale of the net falls away, and the few fish it catches take on disproportionate importance. And if you're the kind of person who works on these problems, and you spend all day surrounded by other people who think like this, it should be no wonder to find that your eyes have adapted to this rarefied frequency of light.

The effects of this shifted perspective appear beyond the walls of Fort Meade. A couple of weeks ago, the story broke that Facebook manipulated the News Feed display algorithms of nearly 700,000 users in an effort to alter their emotional states. For my money, one of the most interesting aspects of the subsequent uproar is just how confused Facebook employees appeared to be by the scale of the backlash. As a former Facebook data scientist (who was not connected with the study) put it:

That being said, all of this hubbub over 700k users like it is a large number of people is a bit strange coming from the inside where that is a very tiny fraction of the user base (less than 0.1%), and even that number is likely inflated to include a control group. It truly is easy to get desensitized to the fact that those are nearly 1M [sic] real people interacting with the site, and it is something that people constantly are trying to remind everyone of when working on the product.

When looked at as a proportion of the total number of Facebook users, yes, 700,000 is a tiny number. Approximately 11% of all humans on planet Earth are active on Facebook every day, and something like 1.2 billion people (1/6 of the world's population) have accounts. But 700,000 people is also roughly equivalent to the population of Detroit, Michigan. This is not a small number in real terms, even if it is small compared to the total number of lines in one of Facebook's databases.

This is where the perceptual divide comes in. Statistically speaking, the types of people who work for the NSAs and Facebooks of the world have received the kind of advanced technical training that rewires the way they think about information. This training is in many ways a good thing. It gives them the intellectual tools to work with data in all the ways that have helped to create the modern world. But in a metaphorical and practical sense, it also might serve to separate them a bit, to enable them (for better and worse) to ignore the individuals in favor of the raw data that they produce. This might account for some of the outcry that we have witnessed against the NSA's and Facebook's activities, and, more broadly, against “big data” in general.

When we talk about “big data,” however we choose to define the phrase, we're really talking about power. The whole driving philosophy behind big data is that given a large enough data set, it's possible to extract valuable insights that would be otherwise unknowable. This value translates into power, be it geopolitical power in the hands of the intelligence community, or economic power in the hands of Silicon Valley tech companies. But the trouble with the big data approach is that this power only exists when data is concentrated. By definition, this is power that is inaccessible to individuals.

Now, put these two ideas together. Take an irreducible power imbalance, and put the majority of that power in the hands of people who fundamentally view the world differently from most of the people around them. Think about what that world might look like, how the many might feel about the choices made by the few, and the outsized impact that those choices might have upon the lives of the many. It looks a little familiar, doesn't it?