Kalev Leetaru recently mused, while comparing the level of protest surrounding the Arab Spring with that of earlier periods, whether we could measure the level of global protest activity at any given time. He answers in the affirmative, suggesting that the project he directs, the Global Database of Events, Language, and Tone (or GDELT), may be able to give some insight into this. GDELT is based on a huge database of news media reports, cataloging some 2.4 million events. Leetaru uses these data to suggest that the 1980s were more turbulent than the post-Arab Spring era, and that the most contentious period of worldwide protest in the past 35 years was the controversy around the 2006 anti-Islam Danish cartoon.
Scholars of social movements may puzzle over this. It seems unlikely that an isolated incident in Denmark could trump watershed events such as the fall of the Soviet bloc and the events that took place in Tiananmen Square 25 years ago. It raises the question: what is going on at the “cutting edge” of protest event data?
GDELT, like most event data, is drawn from newspaper and media reports of protest. Municipal and governmental records of protests are spotty, uneven, and difficult to obtain. Because of this, the most consistent source of event data has been media reports about protest. We see this strategy adopted as far back as Sorokin (1937) and in the foundational works of Gurr (1970) and Tilly (1978). But GDELT, like its predecessors, suffers from the biases inherent in news reporting of protest. Reports on protest are uneven. They suffer from both selection (what gets picked up) and description (how it’s described) bias (for recent reviews, see Earl et al. 2004 and Ortiz et al. 2005).
Unlike many of its predecessors, GDELT uses automated methods borrowed from the subfield of natural language processing to extract the date and location of each protest event. This adds another layer of complexity to the pipeline. Although we can dispense with the team of human coders painstakingly assessing hundreds of articles by hand, we add noise by not hand-checking every single article. We add even more noise when we draw on more than one media source.
What we get in the end with GDELT is a somewhat noisy but possibly more comprehensive catalog of protest events. However, contrary to what some “big data” enthusiasts suggest, more doesn’t necessarily mean better. Making claims about discrete events that span both decades and the entire globe on the basis of media data is a dangerous game. News media coverage is lumpy, varying with location and time period. Little of GDELT has been validated against other data sets, and where it has been, it has fared quite poorly. And these data are still mediated through news media. There is no “pot of gold” when it comes to event data: we cannot have unfettered access to the reality of protest events. FiveThirtyEight writer Mona Chalabi, for instance, recently attempted to use GDELT to estimate the level of kidnapping in Nigeria in the wake of Boko Haram’s abduction of 234 girls, but got it really wrong.
Similarly, we should be highly skeptical of Leetaru’s claim that the protests surrounding the Danish cartoon were the largest of the past 35 years worldwide. We live in a media environment that is biased towards English-language stories written by reporters in large urban bureaus, and in a post-9/11 world where news of “Muslim rage” makes fantastic clickbait. On top of that, the raw number of available news stories has risen greatly with the emergence of the Internet, so a larger count of reported events need not mean more protest.
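The point about rising story volume can be made concrete with a small calculation. The sketch below is only an illustration: the function name `protest_share` is hypothetical, and the counts are invented for the example rather than taken from GDELT. It shows how a period can have more coded protest events in absolute terms while protest makes up a smaller share of everything coded.

```python
def protest_share(protest_counts, total_counts):
    """Return protest events as a fraction of all coded events, per period."""
    shares = {}
    for period, protests in protest_counts.items():
        total = total_counts[period]
        if total <= 0:
            raise ValueError(f"no coded events for period {period!r}")
        shares[period] = protests / total
    return shares


# Invented numbers for illustration only; these are NOT real GDELT counts.
protests = {"1989": 2_000, "2006": 9_000}
all_events = {"1989": 100_000, "2006": 900_000}

shares = protest_share(protests, all_events)
for period in sorted(shares):
    # The later period has more protest events but a smaller protest share.
    print(period, f"{shares[period]:.1%}")
```

Normalizing in this way only adjusts for the sheer volume of coverage. It does nothing about selection or description bias, which affect which events are reported in the first place.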
Any future protest event data set or system is not exempt from the principles that we’ve used for years within the social sciences. In a recent take on big data, Lazer et al. (2014) remind us that “quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data.” And regarding event data, long-time practitioner Erin Simpson puts it succinctly: “Learn the data generating process. Learn the coding rules. Match it against some real world reporting. THEN publish.” It’s critical to understand the processes that generate protest event data, both those of media sources and those of social science researchers.
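As a rough illustration of what matching event data against other reporting might look like, the sketch below compares a set of machine-coded events with a hand-coded set for the same period. Everything here is an assumption made for the example: the function name `compare_to_handcoded`, the representation of an event as a (date, place) pair, and the placeholder records are hypothetical, and matching on exact equality is a simplification that real validation work would need to relax.

```python
def compare_to_handcoded(machine_events, hand_events):
    """Compare machine-coded events with hand-coded ones.

    Events are (date, place) pairs and count as matched only if both
    fields are identical. Returns (precision, recall), treating the
    hand-coded set as the reference.
    """
    machine = set(machine_events)
    hand = set(hand_events)
    matched = machine & hand
    # Share of machine-coded events that a human coder also recorded.
    precision = len(matched) / len(machine) if machine else 0.0
    # Share of hand-coded events that the automated system picked up.
    recall = len(matched) / len(hand) if hand else 0.0
    return precision, recall


# Placeholder records for illustration only; not real event data.
machine_coded = [
    ("2011-01-25", "City A"),
    ("2011-01-25", "City B"),
    ("2011-01-26", "City A"),
    ("2011-01-27", "City C"),
]
hand_coded = [
    ("2011-01-25", "City A"),
    ("2011-01-25", "City D"),
    ("2011-01-26", "City A"),
]

precision, recall = compare_to_handcoded(machine_coded, hand_coded)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision would point to events the automated coder produced that human coders did not record, and low recall to events it missed. Neither number says anything about events that no news source reported at all.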