By Neal Caren
Hand in hand with the rise of “big data” in the social sciences is an enthusiasm for incorporating new methods to analyze these data. Most prominent among these are topic models for analyzing text data and random forests for modeling categorical outcomes. Just as the rise of new, large-scale, real-time data sets presents challenges and opportunities to social movement researchers, many of the standard methods used to analyze these new data hold promise for scholars, but adopting them won’t necessarily be easy.
One large potential stumbling block to incorporating machine learning, the subfield of computer science where many data scientists come from, into social movements research is that computer scientists don’t care about hypothesis testing. Whether or not the effect of a key variable is statistically significant is central to the hypothesis testing that goes on in much of our quantitative research, but this isn’t the case for computer scientists and many other data scientists.
As an outsider, it appears to me that the primary goal of machine learning is to develop algorithms that are really good at prediction. Most famously, the Netflix prize was awarded based on how well an algorithm predicted user ratings. The website Kaggle runs similar competitions all the time for much smaller prizes, and placing well in those competitions is a mechanism for credentialing yourself as a data scientist. The winning model is the one that does the best job correctly predicting the outcome variable on a fraction of the dataset that was held out from developing the model. That is, researchers come up with their best models using, say, 80% of the data (called the training set), and then are judged based on how well their model fits the other 20% of the data (the test set). In competitions, you know the values of the independent variables for the test set, but not the values of the dependent variable.
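The train/test logic described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the data are simulated, the model is a one-variable least-squares line, and the names `train_test_split`, `fit_line`, and `mean_squared_error` are hypothetical helpers written for this example, not any library's API.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = a + b * x on a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_squared_error(model, rows):
    """Average squared prediction error of the fitted line on the given rows."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)


# Simulated data (an assumption for illustration): y = 2 + 0.5 * x + noise.
data_rng = random.Random(42)
data = [(x, 2.0 + 0.5 * x + data_rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)      # 80% training set, 20% test set
model = fit_line(train)                   # the model only ever sees the training set
test_mse = mean_squared_error(model, test)  # judged on the held-out 20%

print(f"intercept={model[0]:.2f} slope={model[1]:.2f} test MSE={test_mse:.2f}")
```

Note what is being scored: not the slope or its standard error, but the error on rows the model never saw.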
In these public competitions, the winning models often involve the aggregation of several different models. Random forest models, for example, work by running multiple decision tree models and then averaging the results. Despite the fact that each decision tree model involves a small selection of the possible independent variables, aggregating the results produces a more accurate prediction than putting all the variables together in one model. A winning model would actually be more likely to average the results of a random forest model with those of several other estimation techniques, such as support vector machines or nearest neighbors.
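The averaging idea can be illustrated with a toy version, again in Python. This is a deliberately simplified sketch, not a real random forest: each "tree" here is a single-split stump fit to a bootstrap sample using one randomly chosen variable, whereas real implementations grow much deeper trees. The data are simulated, and all function names are hypothetical helpers for this example.

```python
import random
import statistics


def fit_stump(rows, feature):
    """One-split regression tree: split one feature at its median, predict the mean y on each side."""
    threshold = statistics.median([x[feature] for x, _ in rows])
    overall = sum(y for _, y in rows) / len(rows)
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return feature, threshold, left_mean, right_mean


def predict_stump(stump, x):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if x[feature] <= threshold else right_mean


def fit_forest(rows, n_trees, n_features, rng):
    """Fit each stump on a bootstrap sample, using one randomly chosen feature."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # resample rows with replacement
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest


def predict_forest(forest, x):
    """The ensemble prediction is the plain average of the stumps' predictions."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


def mse(predict, rows):
    return sum((y - predict(x)) ** 2 for x, y in rows) / len(rows)


# Simulated data (an assumption for illustration): y is the sum of five features plus noise.
rng = random.Random(1)
n_features = 5
rows = []
for _ in range(500):
    x = [rng.gauss(0, 1) for _ in range(n_features)]
    rows.append((x, sum(x) + rng.gauss(0, 1)))
train, test = rows[:400], rows[400:]

forest = fit_forest(train, n_trees=50, n_features=n_features, rng=rng)
forest_mse = mse(lambda x: predict_forest(forest, x), test)
mean_single_mse = sum(
    mse(lambda x, s=s: predict_stump(s, x), test) for s in forest
) / len(forest)

print(f"average single-stump test MSE: {mean_single_mse:.2f}")
print(f"averaged-ensemble test MSE:    {forest_mse:.2f}")
```

For squared error, the averaged prediction can never do worse than the average of its members' errors, which is part of why aggregation is so attractive. Notice also that nothing in the output resembles a coefficient.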
What gets lost in this sea of models is coefficients. Most data scientists spend little to no time thinking about the relationship between a specific feature and the outcome. Standard machine learning implementations rarely report, or are even able to report, a parameter estimate, standard errors, or even whether or not the presence of one variable increases the likelihood of the outcome happening. Partially this is because these quantities are often not estimable based on the ensemble techniques employed. Partially this is because with a large n, almost everything has at least one statistically significant effect, especially when you have a fully interactional model. But I think this is also because the people who develop and use these methods don’t care about the relationships; they usually aren’t interested in whether an increase in X is associated with an increase in Y, but rather, given all our Xs, in how well we can predict some new Y.
This inability to see whether a variable has a statistically significant impact means that it isn’t that easy for sociologists to plug these big data methods into their normal scientific routine. We care about theory testing. In practice this typically means sorting out whether our key independent variable has a statistically significant effect on the dependent variable net of control variables. We aren’t that concerned about overfitting, and, in fact, reviewer 3 would probably like you to add two more control variables.
The one major exception to this trend in the data science world is what our friends in the industry would call A/B testing. This is when companies run experiments, randomly assigning customers or users into either the treatment or control group. Facebook, for example, has open-source software for running online field experiments. The statistical techniques involved in analyzing these data are the same as those traditionally used in social scientific or medical experiments, like ANOVAs. This is incredibly useful for some kinds of social movements questions, but is of little use for the more common observational studies.
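A rough sketch of what analyzing such an experiment looks like, in Python and with simulated data: users are randomly assigned to two groups, and a one-way ANOVA F statistic compares the group means. Because the standard library has no F distribution, the p-value here comes from a permutation test, which is a substitution of mine rather than the textbook ANOVA table. The outcome variable and the function names are hypothetical.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_permutations, rng):
    """Share of random relabelings whose F is at least as large as the observed F."""
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_groups, start = [], 0
        for size in sizes:
            shuffled_groups.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled_groups) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


rng = random.Random(7)

# Randomly assign 400 hypothetical users: half to treatment, half to control.
users = list(range(400))
rng.shuffle(users)
treated = set(users[:200])

# Simulated outcome (an assumption): treatment raises the mean from 10 to 11.5.
control = [rng.gauss(10, 2) for u in range(400) if u not in treated]
treatment = [rng.gauss(11.5, 2) for u in range(400) if u in treated]

observed_f, p_value = permutation_p_value([control, treatment], 500, rng)
print(f"F = {observed_f:.1f}, permutation p-value = {p_value:.3f}")
```

Unlike the prediction contests, the output here is exactly the kind of quantity sociologists are used to reporting: a test statistic and a p-value for one treatment effect.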
What this means is that the path to a Big Data sociology isn’t obvious. Incorporating data collection techniques, like web scraping or accessing the Twitter API, into our research paradigm seems pretty straightforward, since this is just adding new sources of data that can be relevant to testing our hypotheses. However, using most of the new techniques won’t be easy unless we radically shift what counts as a contribution, or, more modestly, shift how we evaluate our theories. For example, there is a great deal to be said for the value of prediction, particularly out-of-sample prediction, in evaluating model fit.
Finally, it also means that graduate students interested in tenure-track jobs in sociology shouldn’t necessarily be in a rush to teach themselves all the cool stuff that’s happening in machine learning. I’m entirely convinced that those folks are better at prediction than we are. But I’m also not sure how that gets you into ASR or Moby.