Yesterday, with J. Nathan Matias, Marianne Aubin Le Quere and Charles Ebersole, I published a "Data Descriptor" in Nature: Scientific Data. We invite you to join the hundreds of scientists who have already been working with the dataset, the entirety of which is now available. It is, I believe, the largest dataset of social science experiments currently in the public domain.
This template was not originally designed for social science lol
I'm indulging in a blog post about this publication because it represents the culmination of many of the concepts I've been working through over the past year and a half on Never Met a Science. This is a practical result of my at-times wild theorizing about the future of social science. And I think it is among my most significant contributions to date.
We think of this contribution as a "knowledge windfall." The cost of running 32,000 experiments in the usual fashion---PI gets grants, designs a single intervention, conducts the experiment, analyses the data with one-off code---in terms of both time and money would be astronomical. There are transaction costs at every stage of that process that make it unlikely that a single professor could run more than a few hundred experiments in their entire career.
In contrast, our approach "produced" 32,000 experiments with only a few years of work by a small team. We (all the credit to Nathan for making this happen; I deeply admire his commitment to openness and collaboration, and his willingness to experiment with new forms of scientific work) contacted Upworthy and their parent company Good Media, who were open to sharing the data with us.
Many thanks also to the engineers and editors who worked at Upworthy during the period covered by the dataset. A key hurdle in producing a knowledge windfall like this one is a lack of process knowledge. When we got the .csv from Upworthy, we could not fully understand it. From simple things like the precise operationalization of a given continuous variable to more complicated details about the experimental design, we would not have been able to credibly learn from the .csv without an accurate map from the numbers in the columns to the social scientific processes that created them.
This is usually the easy part of social science. If I design and conduct an experiment, I know exactly what everything means and what I did. (It gets trickier with teams; a common problem is that the PI does not, in fact, necessarily know exactly what their junior collaborator did.) But interpreting the results of experiments conducted by other people, in a different context, for different purposes, is far from trivial. We hope to encourage the spread of best practices like the "technical validation" component of Nature: Scientific Data publications to increase the rigor of the process of finding knowledge windfalls.
In this blog post, I'll skip over the substantive contribution of the Archive: what we can learn about media, attention, economics, psychology and industrial organization. There are over 50 teams of researchers who are currently working with the full dataset after having pre-registered their analysis on the exploratory 10% of the data we previously released. I can't wait to see what they can teach us!
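For readers who want to try the same exploratory-then-confirmatory workflow on their own data, here is a minimal Python sketch of one way to carve out a fixed 10% exploratory sample at the level of whole experiments. This is not a description of how the Archive's exploratory sample was actually drawn; the function names, the salt string and the made-up experiment IDs are all assumptions for illustration.

```python
import hashlib


def is_exploratory(experiment_id: str, fraction: float = 0.10,
                   salt: str = "hypothetical-salt") -> bool:
    """Deterministically assign an experiment to the exploratory sample.

    Hashing the ID (rather than drawing random numbers at analysis time)
    means the split is reproducible and cannot drift between runs.
    The salt is a made-up value, not anything used by the Archive.
    """
    digest = hashlib.sha256(f"{salt}:{experiment_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


def split_experiments(experiment_ids, fraction: float = 0.10):
    """Return (exploratory, confirmatory) lists of experiment IDs."""
    exploratory = [e for e in experiment_ids if is_exploratory(e, fraction)]
    confirmatory = [e for e in experiment_ids if not is_exploratory(e, fraction)]
    return exploratory, confirmatory


# Hypothetical IDs; real data would supply its own experiment identifiers.
ids = [f"experiment_{i}" for i in range(32000)]
exploratory, confirmatory = split_experiments(ids)
print(len(exploratory), len(confirmatory))
```

The point of splitting by experiment rather than by row is that every arm of a given test lands on the same side, so nothing learned while exploring leaks into the confirmatory analysis.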
Nature: Scientific Data is an excellent meta-scientific intervention. The format of the journal publication (the .pdf that explains the contents of the .csv) is the "Data Descriptor." This has several desirable properties.
First, this .pdf is "interoperable" with existing academic structures. It is a document, with a DOI, that can be cited and aggregated just like any other. N:SD thus avoids the pitfall of being so bold as to be illegible to practicing academics.
Second, the data descriptor institutionalizes a novel set of standards for what documents like this should look like, what kind of information they should include. Establishing this format is important because it allows practitioners to learn from example rather than casting about at random. Establishing rigorous baselines also protects authors from potentially unlimited objections about what they *didn't* include.
Third, and most importantly: everything about this document, from the name to the section headings, is dedicated to the idea that knowledge is stored in the .csv. This helps legitimate a form of scientific contribution that centers the data, rather than treating the data as simply a means to an end.
Ultimately, of course, all social science data are a means to an end: the end is to better understand and/or predict human behavior. A major goal of meta-science, and this blog in particular, is to argue that the *format* of scientific production is an essential input to scientific output. The way we bundle (for example) data, code, and theory in a journal article has evolved more or less arbitrarily. This form has only loosely and slowly adapted to the frontiers of knowledge production made possible by contemporary communication technology.
Consider how the data descriptor differs from the replication file. Replication files have been a significant meta-scientific advance over the dark ages of "data available upon request." But they are clearly second-fiddle to the paper itself: they are optimized to ensure that the precise analysis presented in the paper can be replicated, not to archive or publicize data that might be used for meta-analysis or future research.
And the replication file will never see the light of day unless the paper itself is published. That is, researchers might have collected a fantastic dataset that others would love to learn from, but we never see it unless editors and reviewers agree that the associated .pdf contains a sufficient theoretical contribution.
(This is part of our motivation for founding the Journal of Quantitative Description: Digital Media. Although we retain the format of the academic journal article, we are able to publish useful descriptions based on datasets that may otherwise have lingered on someone's desktop. This is less a knowledge windfall than knowledge scavenging, but the impulse is the same: lowering the marginal cost of knowledge production and thus increasing the aggregate amount of knowledge.)
The Problem of Site-Selection Bias
Site-selection bias is a serious problem for the field experimentalist paradigm. Field experiments are run in a given location (and at a given time) or set of locations; ideally, these locations would be selected as-if randomly from all relevant locations. But the practical realities of running experiments ensure that this is almost never the case. Allcott (2015) describes a set of experiments to investigate the effects of a public service campaign to reduce household energy consumption. The experiments do everything right: “large samples totaling 508,000 households, 10 replications spread throughout the country, and a useful set of individual-level covariates to adjust for differences between sample and target populations.”
But when the campaign is rolled out to the entire country, the average treatment effects estimated by the experiment are much too high. The problem is that communities with governments who are willing to participate in an experiment to reduce energy consumption are themselves composed of individuals who are unusually likely to respond to that experiment.
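The mechanism is easy to see in a toy simulation. The Python sketch below is not a reconstruction of Allcott's analysis; every number in it is invented, and the key assumption (a site's willingness to host an experiment is positively correlated with how strongly it responds to treatment) is simply written into the code.

```python
import random
import statistics


def simulate_site_selection(n_sites: int = 1000, n_pilot: int = 10, seed: int = 0):
    """Compare the average effect in volunteer sites with the population average.

    All parameters are hypothetical. Each site gets a latent
    'responsiveness' that drives both its true treatment effect and
    (noisily) its willingness to take part in an experiment.
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        responsiveness = rng.gauss(0.0, 1.0)
        true_effect = 1.0 + 0.5 * responsiveness            # site-level average effect
        willingness = responsiveness + rng.gauss(0.0, 1.0)  # correlated with the effect
        sites.append((willingness, true_effect))

    # What a nationwide rollout would actually deliver.
    population_effect = statistics.mean(effect for _, effect in sites)

    # Experiments run only where partners are most willing.
    volunteers = sorted(sites, reverse=True)[:n_pilot]
    volunteer_effect = statistics.mean(effect for _, effect in volunteers)

    # The ideal: sites chosen as-if randomly.
    random_sites = rng.sample(sites, n_pilot)
    random_effect = statistics.mean(effect for _, effect in random_sites)

    return population_effect, volunteer_effect, random_effect


population, volunteer, randomized = simulate_site_selection()
print(f"population: {population:.2f}")
print(f"volunteer sites: {volunteer:.2f}")
print(f"random sites: {randomized:.2f}")
```

Under these assumptions, the volunteer sites overstate the population effect even though each individual experiment is perfectly randomized within its site. No amount of within-site rigor fixes a bias that enters at the stage of choosing sites.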
There's no easy solution to this problem, but that doesn't mean we should ignore it in the context of knowledge windfalls.
The Upworthy windfall was only possible because the knowledge was actually valuable to the company; they spent a lot of time and money developing and using the experimentation architecture. More generally, the site-selection biases resulting from corporate donations mean that certain types of knowledge---the most immediately profitable---will be over-represented.
No research paradigm can avoid these problems entirely, but through researcher awareness and methodological pluralism, we can better triangulate the truth.
Of course, there is zero risk that donated knowledge windfalls come to comprise the entirety of social science practice; there aren't nearly enough relevant datasets out there. But while this approach cannot scale indefinitely, social scientists should take as much advantage as possible.
There are unresolved ethical and scientific challenges involved in collaborating with powerful institutions capable of large-scale interventions or experiments, from Facebook to the Democratic Party. Both the ethical stakes for the researcher and the privacy/public relations risks for the entity are significantly reduced, however, if the collaboration takes the form of a knowledge windfall, with a delay of several years.
For some research questions, temporal validity is of the utmost importance, and this delay is costly. These pressing policy-relevant questions are also the least likely to find willing collaborators in tech companies. However, the most important long-term scientific goal (I believe) is to create lengthy time series data. In this case, we always have to wait decades to get a time series that spans decades, and the cost of a few years' delay is negligible.
To any corporate/NGO/PAC/government tech workers who know of internal datasets (especially related to experiments!) that are just sitting around experiencing bit decay: get in touch with us! We have a playbook for turning that lead into gold, in a way that will benefit you, your organization, social scientists, and (most importantly) the public.
Thanks to GOOD/Upworthy's trailblazing example---combined, if necessary, with important advances in differential privacy---we could soon see an explosion in certain areas of social science knowledge production. At the risk of being slightly cheeky: the single most important action the US government could take to spur social science knowledge production would be to allow corporate entities to treat data donations as tax-deductible. The data in the Upworthy Research Archive cost millions of dollars to create, and while they're not perfect, I would happily pro-rate their scientific value, at a reasonable percentage.