By Fabio Rojas
“Big data” sounds fun and exciting, but it has also been heavily criticized. It’s time to step back and treat “big data” as we would treat any other form of data. We should identify its strengths and weaknesses and ask how it can help us with our own specific research goals. So let’s start with an obvious, but under-appreciated, point about empirical research: there is no such thing as perfection in data. Every method for generating and collecting data has strengths and weaknesses. Thus, we should be interested in data collection methods where the positive points outweigh the negative points. For example, experimental data has a great virtue: those who receive the treatment are randomly selected, which eliminates selection bias. Experimental data also has a serious drawback. Experimental settings may not reflect “real world” processes, and the results are often not generalizable. This is a serious problem for biomedical research, for example. A drug tested in a highly controlled environment may work differently in the actual setting of a hospital. Yet, we value experiments because they do one thing exceptionally well – they eliminate selection bias and address the issue of confounding variables.
We can go through this exercise with all other forms of data and identify those imperfect data that nonetheless possess virtue. So let’s define big data, consider its strengths and weaknesses, and ask how it applies to the study of movements and collective behavior. Roughly speaking, “big data” is the data generated in real time by large groups of people as they use electronic communication systems. Examples include Google search data, Reddit discussion threads, email archives, Twitter streams, and eBay auction records.
What are the positive features of this data?
• Massive size – we have data for millions of people, not the thousands, or hundreds, found in most conventional surveys.
• Candor – people will say anything or reveal themselves in interesting ways. For example, people may not admit to having racist attitudes in a survey but may type racial slurs into a tweet or a Google search.
• Constant updating – the data set doesn’t stop. The Internet keeps collecting data while you sleep.
• Complexity – Big data is very raw and contains enormous amounts of information.
• Low cost – once a business (like Google) sets up its system, the cost of obtaining and distributing data is relatively small.
• Access – some forms of big data are made available to the public, and any researcher with basic programming skills can access them.
• Intrinsic importance – the Internet is one of humanity’s most important inventions. We should care about what happens on Facebook because it has more users than most countries have people.
What are the problems of this data?
• Limited generalizability – Internet users, or users of specific platforms, are not a random sample of the population. Thus, one can’t infer public opinion from a sample of online users. At best, one has to show, through extra analysis, how one data set might correlate with “real world data.”
• Access – Some forms of data will be pulled from public use. Thus, it is difficult to reproduce analyses for many data sets.
• Instability – As with panel surveys, the type of data collected may vary over time. Certain bits of information may be dropped for technical, financial, or legal reasons.
In other words, “big data” is a bit like archival data – it is important and massive, but not a random sample. That doesn’t mean that we throw it out. We accept it because it has value and we curtail our claims. We don’t claim that it represents the rest of the world unless we take the time to demonstrate an association with observed behavior.
Big data has many uses for the social movement researcher. For example, social movements are often difficult to survey and map out. A movement like Occupy Wall Street has no clear social or geographical boundaries. Thus, conversational data from Twitter, for example, might indicate where the movement is located in physical space. Since big data is often uncensored, we might also be able to observe conversations that would be hard to observe for any researchers except the most dedicated ethnographer. Some movements exist primarily in “virtual space.” Big data might be useful for observing movements that have no meeting place or that do not require the co-presence of members, such as Anonymous, the hacker group.
My own interest in big data is in using it as a “smoke signal,” a correlate of real-world events. In my research, I was initially interested in whether talk about a Congressional candidate might be a signal that the candidate is doing well in an election. I think big data might have similar uses in movement research. For example, is it possible to use online talk, or Google searches, to reconstruct a “typical” life course for a protest? Are there certain emotions or events online that often precede a protest or social conflict?
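To make the “smoke signal” idea concrete, here is a minimal sketch of the simplest possible check: does an online measure move together with an offline outcome? It computes a Pearson correlation between a candidate’s share of online mentions and that candidate’s vote share. The variable names and all of the numbers are hypothetical and invented purely for illustration; they are not taken from any actual study, and a real analysis would need many more cases and appropriate controls.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL, made-up values for six imaginary races (illustration only).
# mention_share: fraction of online mentions naming the candidate.
# vote_share: fraction of the vote the candidate received.
mention_share = [0.62, 0.48, 0.55, 0.30, 0.71, 0.44]
vote_share = [0.58, 0.47, 0.53, 0.39, 0.64, 0.49]

r = pearson(mention_share, vote_share)
print(f"correlation between online talk and outcome: r = {r:.2f}")
```

A high correlation in a sketch like this would only say that the online signal tracks the offline outcome in the cases examined. It would not show that online users are representative of the public.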
Another avenue for research is framing and counter-movement activity. An important issue in movement research is the presence of frames, or shared understandings of events that motivate movement participants. Typically, one relies on statements by leaders, publications, or interviews with activists. But now there is additional data – one can use online talk from millions to parse out framings. Similarly, the study of counter-movements is difficult because it takes a while for these groups to emerge. Now, it is almost certain that they can be observed online as they appear.
To summarize, no data is perfect and “big data” certainly has problems. But it also has numerous positive features that make it attractive to social movement researchers.