By Alex Hanna
One of the longstanding issues with social movements research is the availability of reliable, timely, and comprehensive protest event data. Ideally, we would like to cover multiple movements with adequate temporal and spatial variation. However, generating protest event data has usually meant many human hours of hand-coding, typically by farms of social science undergraduates. But the wide availability of electronic sources and advances in natural language processing – in a word, “big data” – have the potential to push the boundaries of our field.
There have been a number of efforts to automate this process using tools from computer science and statistics. Several projects, including PETRARCH and SPEED, are working on generating data for political and conflict events writ large. The EMBERS project is focusing on forecasting protest events from social media. And our team at the University of Wisconsin-Madison is in its first year developing a system called MPEDS, the Machine-Learning Protest Event Data System. Our aim is two-fold: first, to identify protest-related articles in a database of news articles; and second, to identify a number of salient attributes about the protest, including protest form, issue, target, location, size, and social movement organizations involved. We do this by training a set of machine learning classifiers on both existing protest event data and new examples from a variety of news sources beyond the New York Times, which has several well-documented biases.
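To make the first step concrete, here is a toy sketch of training a text classifier to flag protest-related articles. This is not the MPEDS pipeline itself – just a minimal bag-of-words Naive Bayes in plain Python, trained on invented snippets standing in for hand-coded articles – to illustrate what "training a set of machine learning classifiers" on labeled text means.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase an article and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with Laplace smoothing over
    labeled snippets (labels: 'protest' or 'other')."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior for the class
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                # Laplace-smoothed log likelihood of each token
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training snippets standing in for hand-coded news articles
training = [
    ("thousands of demonstrators marched downtown demanding reform", "protest"),
    ("workers went on strike and picketed outside the factory gates", "protest"),
    ("activists rallied and chanted slogans against the new policy", "protest"),
    ("the city council approved the annual budget on tuesday", "other"),
    ("the team won the championship game in overtime", "other"),
    ("quarterly earnings exceeded analyst expectations this year", "other"),
]

clf = NaiveBayesClassifier()
clf.train(training)
print(clf.classify("protesters marched and rallied against the policy"))
```

A production system would of course use richer features and far more training data, but the shape of the task – labeled examples in, a protest/not-protest decision out – is the same.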
These automated systems may alleviate some well-known issues with protest event data. For instance, the ability to code news articles quickly makes incorporating new sources (which use the same language to describe protests) trivial. Incorporating more sources may reduce selection biases related to geographical proximity and size. It will also make the generation of new datasets fast and replicable by cutting out the undergraduate coders (who can then dedicate their time to more intellectually stimulating tasks!). Lastly, these systems have the potential to generate data in real time, which has major implications for forecasting and for testing hypotheses about mobilization. By forecasting, I mean making claims about events, or trends in events, that have not happened yet. Jay Ulfelder, for instance, has argued consistently that social science forecasting can be used as a means of testing competing theories. Sociology as a field, however, has been resistant to this approach.
However, even with all this machinery, many existing biases will remain. First and foremost, if no one is there to report an event, then for the researcher it doesn’t exist. News bureaus are clustered in urban areas, the number of available correspondents is contracting, and the consolidation of broadcast news media compounds the problem. Social media sources, meanwhile, suffer from a strong urban and youth bias, and their penetration is highly variable across countries. This reporting problem is particularly stark in conflict zones that are hostile to journalists and in authoritarian states where media is tightly controlled. But this may not be as damning as we think. Michael Biggs has recently argued that social movement scholars constructing protest event datasets should focus on very intense protests instead of trying to collect every single event. Still, a protest with tens of thousands of people in Cairo has a much higher probability of being reported than one in Homs.
Second, the proliferation of sources introduces the potential for many records to refer to the same event. The tale of 538’s coverage of Boko Haram that I mentioned in a previous Mobilizing Ideas post highlights how duplicate records can dramatically alter event counts. There may be promise, however, in using record-linkage techniques to resolve these duplicates. For instance, the Human Rights Data Analysis Group, which estimates the number of victims of political violence, uses record linkage to match multiple lists of victims. But this is no trivial task, and the problem becomes harder still when events have multiple fields on which we can match.
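The basic record-linkage idea can be sketched in a few lines. The sketch below uses invented event records, blocks candidate pairs on the exact date, and merges records whose location strings exceed an arbitrary similarity threshold; real systems (including HRDAG’s) use far more sophisticated matching and probabilistic models.

```python
import difflib
from collections import defaultdict

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def deduplicate(events, threshold=0.8):
    """Collapse records that likely describe the same protest event.

    Blocking: candidate matches must share an exact date.
    Matching: location strings above `threshold` similarity merge."""
    blocks = defaultdict(list)
    for event in events:
        blocks[event["date"]].append(event)

    unique = []
    for same_day in blocks.values():
        merged = []  # representative records for this date
        for event in same_day:
            match = next(
                (rep for rep in merged
                 if similarity(event["location"], rep["location"]) >= threshold),
                None,
            )
            if match is None:
                merged.append(event)
            else:
                # keep the larger reported crowd size as the estimate
                match["size"] = max(match["size"], event["size"])
        unique.extend(merged)
    return unique

# Three invented reports; the first two describe the same event
reports = [
    {"date": "2014-07-01", "location": "Tahrir Square, Cairo", "size": 10000},
    {"date": "2014-07-01", "location": "Tahrir Square Cairo",  "size": 12000},
    {"date": "2014-07-01", "location": "Alexandria",           "size": 500},
]
events = deduplicate(reports)
print(len(events))  # distinct events after linkage
```

Note how even this toy version forces a modeling choice (which crowd size to keep); with more fields to match on – form, issue, organizations – the number of such choices, and the difficulty of the problem, grows quickly.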
Lastly, reliance on news reports has forced researchers interested in generating event data to gather news articles from document databases such as Lexis-Nexis and Factiva. Although it may be easy to gather data in real time for new events using web scrapers and RSS feeds, historical data is more difficult to obtain. These databases are typically available only to libraries that have purchased access, and even then they erect significant obstacles to downloading news articles at scale. The legal status of using news articles in event data research is thorny, and these databases understandably take an aggressive posture to limit access to the intellectual property they store. There may, however, be alternatives to this model of access for researchers. The Linguistic Data Consortium, for instance, provides member institutions with free access to text repositories. And the Open Event Data Alliance is an attempt to develop similar infrastructure for event data researchers.
Despite these issues, the promise of more source data and automated methods is to improve the quality and strengthen the claims of social movement research across multiple cases. This potential means more comprehensive datasets without large costs in human labor, more transparency through replication, and the possibility of forecasting future events. Contrary to some popular conceptions that big data will result in mindless empiricism or the “death of theory,” the integration of these data and methods has the potential to open new doors to theory building and theory verification.