One of the most challenging problems for national security is predicting and stopping terrorist attacks before they happen. The government proposes that data mining is a useful tool for finding terrorists. By using database technology, statistical analysis and modeling, the government says it can search our email, phone calls, shopping habits, and educational records to find the needle (terrorists) in the haystack (the general population). One has to know a bit about the science and statistics behind data mining to evaluate this claim.
The debate over data mining is often cast as a trade-off between security and the privacy of individuals. But the real problem with national security data mining is that there is no trade-off. There's an invasion of privacy, but no corresponding uptick in security. Why is this so?
Meanwhile, the city of Philadelphia is trying to use data mining to predict which ex-cons will become murderers.
Why might data mining help Philly find murderers, but not help the United States find terrorists?
First, Philly is analyzing a discrete population: people on probation. This narrows the ratio of killers to subjects to one in one hundred. In contrast, the ratio of terrorists to subjects in the United States is one in millions.
Second, though it's a relatively rare offense, there have been a lot of murders, and so we have a lot of information about the characteristics of people who kill. We know what the indicators are that incline someone toward violence. Similarly, with consumer behavior, identity theft and credit card fraud, the models for suspicious activity are based on hundreds of thousands of known examples. Terrorism, in contrast, has no broad-based model. As the Cato report says, with "a relatively small number of attempts every year and only one or two major terrorist incidents every few years—each one distinct in terms of planning and execution—there are no meaningful patterns."
Data mining raises two main public policy questions. First, does it help with resource allocation? In a world of scarce resources, you can't check every lead. Does data mining narrow the leads effectively, or does it generate so many false leads that it exacerbates the resource allocation problem?
Second, what are the costs of false positives, and given the number that data mining will generate, can we bear those costs?
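To see why the base rate matters so much for both questions, here is a rough back-of-the-envelope sketch in Python. Only the one-in-one-hundred ratio for the probation population comes from the discussion above. Everything else is an assumption chosen for illustration: the population sizes, a national base rate of one in a million standing in for "one in millions", and a hypothetical screening model that catches 99 percent of real targets while wrongly flagging 1 percent of everyone else. No real system is claimed to perform this way.

```python
def screening_outcomes(population, base_rate, sensitivity, false_positive_rate):
    """Return (true_positives, false_positives, precision) for one screening pass.

    precision = share of flagged people who are actual targets.
    """
    targets = population * base_rate
    non_targets = population - targets
    true_positives = targets * sensitivity
    false_positives = non_targets * false_positive_rate
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical model quality (assumed, not taken from any real program).
SENSITIVITY = 0.99
FALSE_POSITIVE_RATE = 0.01

# Probation population: size is assumed; the 1-in-100 base rate is from the text.
probation = screening_outcomes(50_000, 1 / 100, SENSITIVITY, FALSE_POSITIVE_RATE)

# National population: size and the 1-in-1,000,000 base rate are both assumed.
national = screening_outcomes(300_000_000, 1 / 1_000_000, SENSITIVITY, FALSE_POSITIVE_RATE)

for label, (tp, fp, precision) in (("probation", probation), ("national", national)):
    print(f"{label}: {tp:,.0f} real hits, {fp:,.0f} false leads, "
          f"{precision:.4%} of flags are real")
```

Under these assumptions, roughly half of the people flagged in the probation population are real hits, while in the national population only about one flag in ten thousand is real and the false leads number in the millions. The same model that narrows the leads in one setting swamps investigators in the other.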
Philadelphia is able to better its odds with data mining because it's dealing with a relatively high rate of murderers within a finite, tested population and because it has a model based on known data. It will be interesting to see how well the program works. But the results cannot be extrapolated to the hunt for terrorists.