Text as Data: A Call to Standardize Access and Training

By Laura K. Nelson

Data come in all shapes and sizes, but in the past ten years we have seen huge leaps in the amount of data readily available in the form of unstructured or semi-structured text. This presents both opportunities and challenges for social science researchers, including social movements scholars.

Two sources of text-as-data have long been staples in social movements research: newspapers and organizational literature. Newspapers have been used as the basis for event counts (e.g., here, here, and here), as a measurement of movement frames, and as a movement outcome (e.g., here and here). Organizational literature is often used to show that the way social movements themselves express ideas is critical to their success (e.g., here, here, and here).

Luckily for scholars who use these types of data, the ways of collecting, storing, and analyzing text are rapidly improving. In this essay I will discuss how the increasing availability of these two types of data in digitized formats enables us to better answer questions that have long been at the center of social movement research. To fully utilize these new approaches to both collecting and analyzing data, however, I argue that disciplines in the social sciences need to formally institutionalize 1) access to these data and 2) access to the training required to manage and analyze them.

The Digitization of Old Data

We now have digitized access to more sources of news than ever before, including many local and regional news sources. Private databases such as Factiva, LexisNexis, and EBSCO provide aggregated access to tens of thousands of global news sources, including blogs, television and radio transcripts, and traditional print news. LexisNexis has over 10,000 sources in its database, EBSCO’s newspaper coverage has over 400 sources, and Factiva has access to over 40,000 sources. Given these resources, we should no longer have to rely on one, or even a few, major newspapers for data when analyzing movements, as most research has done.

In addition to the digitization of news sources, we are also seeing increasing access to writing produced by organizations themselves, including political statements, opinion pieces, and things like organizational calendars and event reports. These data include more recent online writing, but also digitized historical archives. If certain literature is not digitally available, increasingly accurate optical character recognition (OCR) software allows us to digitize almost any literature of interest.

New Ways of Addressing Old Issues

The availability of these data can push research on social movements forward in many ways, but I will focus on just two here: 1) the problem of newspaper bias, and 2) the difficulty in measuring strategic frames, cultural institutions, and ideology.

1) We have long known that newspapers are biased, but we can now better test exact biases in a range of news sources. Amenta et al.’s 2009 paper, for example, explored which social movement organizations received attention in The New York Times and The Washington Post over time. We can now expand what they did to include TV, radio, and online sources as well as regional newspapers, both to test whether their claims are generalizable and to test whether different types of social movement organizations (SMOs) receive attention from different sources. This would greatly expand our knowledge of SMO mentions in the media, deepen our understanding of various source-based biases, and enable us to better interpret studies that use newspapers as sources of data about movements.

We can also now test whether what newspapers say movement organizations do is similar to what movements think they do. For example, Brayden King and I are currently using over 30,000 newspaper articles about Environmental Non-Governmental Organizations to construct an exhaustive list of tactics and strategies used by these organizations, and a list of issues these organizations have addressed over time. We could extend our study to include the literature produced by organizations themselves, both online and in print, to examine how the tactics, strategies, and issues reported by the news are similar to or different from those reported by SMOs. This would further help us understand how newspapers report on social movements, and the biases involved.
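As a rough illustration of the source-by-source comparison described under point 1, here is a minimal Python sketch that counts how many articles from each type of source mention each organization. The article list, the source labels, and the organization names are all made up for illustration, and simple name matching is an assumption that would miss abbreviations and alternate spellings.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (source_type, text) pairs standing in for
# articles exported from a news database. This is not real data.
ARTICLES = [
    ("national newspaper", "The Example Environmental Alliance staged a rally."),
    ("regional newspaper", "Members of the Sample Rivers Coalition met Tuesday."),
    ("regional newspaper", "The Example Environmental Alliance filed a lawsuit."),
    ("radio transcript", "We spoke with the Sample Rivers Coalition today."),
]

# Hypothetical organization names to search for.
ORGANIZATIONS = ["Example Environmental Alliance", "Sample Rivers Coalition"]


def count_mentions(articles, organizations):
    """Count articles that mention each organization, grouped by source type."""
    patterns = {
        org: re.compile(r"\b" + re.escape(org) + r"\b", re.IGNORECASE)
        for org in organizations
    }
    counts = defaultdict(Counter)
    for source_type, text in articles:
        for org, pattern in patterns.items():
            # Each article counts at most once per organization.
            if pattern.search(text):
                counts[source_type][org] += 1
    return counts


if __name__ == "__main__":
    for source_type, org_counts in count_mentions(ARTICLES, ORGANIZATIONS).items():
        print(source_type, dict(org_counts))
```

Comparing these counts across source types is only a first step. Any real analysis would also need to account for how many articles each source publishes overall.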

2) The relationship between frames, logics (or institutions), and ideology, and how these together affect social movements, is not settled in the literature. One reason it is difficult to disentangle frames and ideology, and to study ideas in general, is that ideas are hard to measure directly, especially over time. The typical way to study frames has been content analysis, but content analysis is slow, it is often difficult to get two people to code the same text the same way, and some have critiqued it for being opaque and thus misleading.
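The coder-agreement problem mentioned above is commonly quantified with Cohen's kappa, which compares the agreement two coders actually reach with the agreement expected by chance. The sketch below is a minimal Python version of that statistic; the function name and the example labels are hypothetical, and it handles only the simple case of two coders assigning one label per text.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who each labeled the same texts."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must label the same non-empty set of texts.")
    n = len(codes_a)
    # Share of texts on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, given each coder's label frequencies.
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical frame labels assigned by two coders to four texts.
    coder_1 = ["rights", "rights", "justice", "justice"]
    coder_2 = ["rights", "justice", "justice", "justice"]
    print(cohens_kappa(coder_1, coder_2))
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance.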

New developments in Natural Language Processing (NLP) and the availability of digitized data can help us with some of the problems inherent in traditional content analysis. NLP and other statistical techniques are faster, they are more reliable, and if you publish your code along with your now-digitized data, they are fully reproducible and transparent. While these methods are not magic, with them we can now potentially better measure frames, logics, and, dare I say, ideology, and we can do so across a wide range of writing. For example, I use NLP techniques to identify political logics guiding women’s movement organizations in New York City and Chicago over time, showing that what were initially thought of as arbitrary political differences are actually the result of the institutionalization of city-level political logics. Applying these techniques to a wide range of digitized data will help us better measure, and understand, the role of ideas in social movements.
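To give a flavor of what a transparent, reproducible text measure can look like, here is a deliberately simple Python sketch that finds the words one set of documents uses relatively more often than another. It is not the method used in the study described above; relative word frequency is a stand-in chosen for brevity, and the function names and sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(documents):
    """Return each word's share of all tokens in a set of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def distinctive_words(docs_a, docs_b, top_n=5):
    """Words used relatively more in docs_a than in docs_b, most distinctive first."""
    freq_a = relative_frequencies(docs_a)
    freq_b = relative_frequencies(docs_b)
    vocabulary = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}
    # Sort by largest difference, breaking ties alphabetically for stable output.
    ranked = sorted(diffs, key=lambda w: (-diffs[w], w))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical snippets standing in for organizational literature.
    city_one = ["We demand equal rights under the law.", "Rights for all women."]
    city_two = ["Our community builds power together.", "Community first."]
    print(distinctive_words(city_one, city_two, top_n=3))
```

Because every step is in the code, another researcher with the same texts can rerun it and get the same ranking, which is the reproducibility point made above.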

Ongoing Challenges

These data are available, and they can potentially be of great utility for social movement scholars, but they are unfortunately not simple to access. Most news databases, for example, allow access to a few articles but are wary of allowing access to their entire database, even for text-mining purposes. The subscriptions universities have now are based on the assumption that researchers want to read a few articles on a subject, not use articles as primary data. Universities need to be able to subscribe to these large troves of digitized newspapers not for reading purposes, but for the purpose of cutting-edge text-mining research. This means access to thousands, or tens of thousands, of articles at one time, and it will take a new type of agreement between universities and newspaper databases.

I thus support the letter written by Dalton Conley et al., who call for creating a mechanism so researchers can access these new troves of data. Included in this should be access, en masse, to the news data collected by agencies like Factiva and LexisNexis, to enable comparable and reproducible research across institutions and studies.

As we work on gaining standardized access to data, we also need to teach the tools necessary to deal with these data to every graduate student in every department. Classes like James Evans’ class at the University of Chicago are a great place to start, as are institutions like D-Lab at Berkeley. These efforts should be reproduced in all departments, and teaching these tools should become part of the standard curriculum.

Even after standardizing access and teaching, the availability of these new sources of data does not magically eliminate standard methodological concerns. As always, those of us who analyze new data with these new methods will still need strong research designs, will need to understand the potential biases in our data, and will still need a strong grounding in theory. If we do not include these new sources of data and new tools in the social science toolkit, however, we will certainly be left behind.
