When Internet use began to grow in the 1990s, a now decades-old debate started over whether the Internet would bring vast improvements to society, social relations, and individuals or instead lead to greater inequality, more anomie, and a much thinner civic core. As time wore on, many scholars studying information and communication technologies (ICTs) and society were influenced by earlier work in science and technology studies (STS), which suggested that technologies themselves have no direct impact on society; rather, their impact depends on how the technologies are used (and misused). And, after watching conflicting findings on the impacts of Internet usage roll in for about a decade, most researchers in this area came to support a much milder conclusion: Internet usage would produce some social benefits and likely some social difficulties, and the mix and appearance of those would depend on how the Internet was used.
I argue that the debate we face over the impact of “big data” and “computational sociology” (i.e., sociological analyses that require computer programming to either collect or process data, whether those data are qualitative or quantitative) has some notable parallels to this earlier debate on ICTs and society. There are scholars who embrace with wild abandon the transition to big data and computational methods, and there are scholars who argue we will rue the day that our disciplines collectively took this turn. I want to put forward a compromise position: there will be notable advantages to big data and computational methods, and there will also be notable problems. The impact of these assets and tools will ultimately turn on how individuals and entire fields use them. So, our best course of action is to be deliberate and thoughtful about the long-term consequences of our individual and collective research decisions.
Of course, my view on this is partly predicated on my beliefs as a researcher, and so let me acknowledge some of those core beliefs as a starting place. First, and foremost, I believe that data is what keeps me honest as a social scientist. At the very core of my job is an obligation to try to prove my own and my scholarly community’s ideas wrong and to collect the best data I can to do so. Only when I fail to prove these ideas wrong after taking my best shot can I have any confidence that the ideas might be right. In that way, I see data – big and small – as the antidote to theorizing gone wrong. And, given that many scholars and public intellectuals have felt confident about wading into debates on an area I am quite familiar with – the study of online protest – with little direct data but a great deal of opinionated argumentation (e.g., Gladwell’s “Small Change” essay in The New Yorker: http://www.newyorker.com/magazine/2010/10/04/small-change-3), I can see the very positive appeal of data-rich studies. Patience, quality research, and data will eventually resolve some important debates and reveal which ideas birthed from a proverbial armchair prove supportable and which do not.
That said, I also think that data for data’s sake is as likely to create dead ends and false arguments as data-free argumentation. Over the years, I have (negatively) reviewed countless articles and grant proposals where the availability of a massive dataset and the computational tools to analyze that dataset are positioned as the principal merits of a project, without much (or sometimes any) thought being given to what those data really mean, if anything, about the real world they purport to measure. In areas I work in, this has often come in the form of massive, network-driven studies of online activism, where the real empirical and theoretical meaning of hyperlinks, and the etiology behind their creation, are often ignored as the siren’s song of massive data lures us toward potentially false operationalizations of connections. I have also seen this in studies of other kinds of massive networks where all ties are presumed to be meaningful and little to no effort is spent thinking about what those ties really mean and about the ways in which a network diagram built on one kind of interaction is necessarily partial.
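To make the partiality point concrete, here is a toy Python sketch with entirely invented actors, hyperlinks, and events (none drawn from real data): the same four organizations yield two barely overlapping networks depending on whether a “tie” is operationalized as a hyperlink or as co-participation in an event.

```python
# Toy illustration (hypothetical data): the same actors produce different
# networks depending on which interaction is treated as a "tie".
from itertools import combinations

# Invented interaction records for four activist organizations.
hyperlinks = {("A", "B"), ("B", "C")}                        # who links to whom
events = {"rally": {"A", "C", "D"}, "petition": {"C", "D"}}  # co-participation

# Operationalization 1: a tie is a hyperlink (direction dropped for simplicity).
link_ties = {frozenset(pair) for pair in hyperlinks}

# Operationalization 2: a tie is co-attendance at the same event.
event_ties = {frozenset(pair)
              for members in events.values()
              for pair in combinations(sorted(members), 2)}

# The two tie definitions share no edges: each diagram is partial.
print(sorted(tuple(sorted(t)) for t in link_ties))   # [('A', 'B'), ('B', 'C')]
print(sorted(tuple(sorted(t)) for t in event_ties))  # [('A', 'C'), ('A', 'D'), ('C', 'D')]
print(link_ties & event_ties)                        # set()
```

A researcher who sees only one of these diagrams would draw quite different conclusions about who is connected to whom, which is precisely the risk of presuming that any single kind of tie captures the network.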
So, I start a discussion of big data and computational sociology from an ambivalent place: I think there is great promise in large-scale data and computational techniques to provide leverage over troublesome theoretical issues (as many of the other contributions to this dialogue have already pointed out), but only, as with any piece of research, when cases are carefully selected, concepts thoughtfully operationalized, and data collected and analyzed with great care. Because it is easy to falter at each one of these steps, I see as much risk as I see promise.
But, that said, I think it is inevitable that our field will collect and analyze larger and larger datasets and use more computationally intensive skills over time. The promise, or siren’s song, depending on your vantage point, is simply too strong. So, I want to spend the rest of my space thinking about what, given the core commitments to theoretically meaningful methodology that I just discussed, we could do to improve our collective future. First, we can require our students and ourselves to immerse ourselves in literatures that problematize our data sources. I study online protest, and I would not dream of doing so without having really engaged the interdisciplinary literature on ICTs and society and, increasingly, the interdisciplinary literature on political communication as well. That is, how can I assume that protest is complicated but Internet usage is not? How can I presume that I must be trained to understand social movement development and its likely consequences on other fields, but not assume that other institutions and areas of social life have their own internal dynamics that will influence how protest and those areas of social life interact? So, a first step is to find the people who study the kind of data you are collecting and learn what they know about those data and the social life of those data before you think about the role of protest.
Second, we need to build computational skills into our graduate, and eventually our undergraduate, curricula. Understanding basic programming in languages like Python is eventually going to be a core research competency in the same way that understanding how to read a regression table or how to think about methods of similarity and difference is now. That is, we are going to have to stop assuming that motivated students will teach themselves and start structuring curricula that treat programming as a skill that can be performed in better and worse ways.
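As a hypothetical illustration of the kind of baseline competency at issue, consider a few lines of Python that tally mobilization-related terms in a small batch of (invented) text records; even a task this simple can be done in better and worse ways, and the idiomatic version is what structured training would teach.

```python
# Hypothetical example: counting word frequencies across text records,
# the sort of basic data-processing skill at issue in curricula.
from collections import Counter
import re

documents = [
    "Join the rally downtown! Bring friends to the rally.",
    "Sign the petition and share the petition link.",
]

# Idiomatic approach: tokenize lazily, let Counter do the bookkeeping,
# rather than hand-rolling nested loops over a dict of running totals.
tokens = (word
          for doc in documents
          for word in re.findall(r"[a-z]+", doc.lower()))
counts = Counter(tokens)

print(counts.most_common(1))  # [('the', 4)]
print(counts["rally"], counts["petition"])  # 2 2
```

The point is not this particular snippet but that there is a craft here, with recognizable good and bad form, that can and should be taught rather than left to self-instruction.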
Third, we are going to have to figure out how to work with private industry and non-profits to gain responsible access to their datasets. Right now, MoveOn, Change.org, and Facebook have better data than scholars do on what frames work for mobilization, what individual engagement trajectories look like, and what predicts micro-mobilization, among other topics, drawn from controlled comparisons among micro-samples of their memberships. We face a real risk of organizations learning more about mobilization than we as a scholarly community do if we do not figure out how to work with them to gain access to their data and to the results of the controlled comparisons in which they routinely engage.
If I had more space, there are about a dozen other things that I think we could collectively do to ensure that big data, computational intensity, and theory harmonize over time. But I think the basic starting place is clear: there is both promise and peril in big data, and the switchman between those outcomes is our usage of these assets and tools. When we remember to strenuously couple data with theory and vice versa, dedicate ourselves to engaging research on the social life of our data (e.g., there is a literature on Facebook and Twitter outside of their use at protests that social movement scholars should be reading), and pay concerted attention to our research techniques, we at least begin to sail in the right direction.