Against the Labor Theory of Science
Two recent working papers gesture towards a new social scientific process. This process is far more efficient: it eliminates redundant effort by social scientists and, potentially revolutionarily, removes low-dimensional human text processing from the loop.
(For background on what I’m talking about, check out previous posts in this series.)
These papers point towards a “post-textual social science.” The technology of the written word has been central to elite-level human information processing for more than 500 years. Everything about how we (my esteemed reader and I) understand expert knowledge and even our own cognition is shaped by the technology of the written word. This is deeply *unnatural*. We only got here through decades of intensive training, beginning in preschool. Few humans have ever been as invested in reading/writing as we are. But as the “Gutenberg Parenthesis” comes to a close, we need to grapple with the fact that our cognitive technology is becoming obsolete.
I'm not arguing anything so vulgar as “the end of theory” or “big data means never having to say causal inference.” My argument is derived from my attempts to model the entire system of social science knowledge production and identify the highest-leverage points for improvement. This is Metascience, a subfield of political methodology.
To the papers. Molly Offer-Westort, Leah Rosenzweig, and Susan Athey test around 40 different interventions designed to reduce the Coronavirus “infodemic” in sub-Saharan Africa.
Rather than comb the literature and derive what they think is the most effective intervention from theory, they *just test them all.* The “economies of scale” from this move are obvious. Imagine instead that 10 different teams had each tested 4 different interventions.
First are the efficiency gains on the “production side.” These 10 teams would have had to get 10 times as many IRB approvals, carry 10 times as much overhead (time and money) running the experiments themselves, and then produce 10 times as many papers. That means 10 different introductions, theory sections, and conclusion sections. And then they would have had to go through 10 times as many submission processes! And taken up the time of 10 times as many editors and peer reviewers! (I'm getting upset even thinking about that last part).
Then there are the efficiencies on the “consumption side.” The reader only has to read one paper instead of 10. More importantly, we can be sure that the evidence from the tests run in the centralized process is *commensurable*, that ceteris really was paribus. If the work had been decentralized, the 10 teams would have made different implementation choices that would decrease our confidence that observed differences in results were driven by the stimuli themselves. (See “The Theory of the Academic Firm”).
Not satisfied with these massive efficiency gains, the authors squeeze out another 50% or so through an adaptive experimental design that avoids wasting time testing treatment arms that are likely to be duds. These gains matter (a lot) if you’re Facebook or another entity running digital experiments with N > 100,000 on the daily, but in my view, 99% of social science is years away from being able to take advantage of them. So, despite how cool the adaptive experimentation technology is, I see the centralization of knowledge production as the more important development.
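To fix ideas, here is a minimal sketch of the kind of adaptive design I mean: a Thompson-sampling bandit that reallocates subjects toward arms that look promising. Everything here (the number of arms, the invented “true” success rates, the binary outcome) is made up for illustration, and the actual paper’s design is more sophisticated than this.

```python
# Minimal sketch of adaptive assignment via Thompson sampling.
# The 40 arms stand in for candidate interventions; the "true" success
# rates below are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_arms = 40
true_rates = rng.uniform(0.10, 0.20, size=n_arms)  # hypothetical effect of each arm
alpha = np.ones(n_arms)  # Beta-posterior successes + 1
beta = np.ones(n_arms)   # Beta-posterior failures + 1

for subject in range(20_000):
    # Draw a plausible success rate for each arm from its posterior,
    # then assign this subject to the arm with the highest draw.
    draws = rng.beta(alpha, beta)
    arm = int(np.argmax(draws))

    # Observe a (simulated) binary outcome and update that arm's posterior.
    outcome = rng.random() < true_rates[arm]
    alpha[arm] += outcome
    beta[arm] += 1 - outcome

# Arms that looked like duds early on end up with very few subjects,
# which is where the efficiency gain comes from.
subjects_per_arm = (alpha + beta - 2).astype(int)
print("subjects assigned to the worst arm:", subjects_per_arm.min())
print("subjects assigned to the best arm: ", subjects_per_arm.max())
```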
The second paper (now forthcoming at the APSR) is by Luke Hewitt, David Broockman, Alex Coppock, Ben Tappin, and four affiliates of the political experimentation firm Swayable. The paper presents the results of a large number of media effects experiments comprising 681 stimuli and around half a million total subjects. In this case, the stimuli were the actual political ads tested by campaigns during the 2018 and 2020 US elections. This is a *remarkable* dataset. This is not necessarily scalable -- it represents something like a “knowledge windfall” -- but we should absolutely be searching for and exploiting these windfalls.
The authors emphasize the comprehensiveness of the dataset: this is *all* of the ads that the firm tested, so we should not be concerned about publication bias or the file drawer problem. This is valuable, but as with adaptive experimentation, I believe it is a second-order issue relative to the absolute magnitude of the scientific evidence brought to bear.
From the perspective that knowledge is stored in the CSV and not the PDF, this project contains an appreciable fraction of all the knowledge ever produced through the academic, quantitative study of media effects.
This may make some readers uncomfortable. From a labor theory of knowledge, it’s absurd: there have been millions of person-hours invested in running those experiments and in reading and writing those papers. Dozens of august careers have been based on less (of this kind of) knowledge than the current dataset.
That kind of thinking is based on the flattening of knowledge enforced by the technology of the written word. Consider citing this paper and then citing one hundred other papers. Each citation, within the flow of the text, is just (Proper Name et al., YEAR). This is a ridiculous medium in which to do quantitative knowledge synthesis.
In contrast to the view that treatment effect heterogeneity doesn’t exist, these results demonstrate that it *does* exist, at magnitudes that are substantively important -- but that “theory” is incapable of explaining more than a tiny percentage of the variance. Furthermore, the theoretical predictions that are borne out in the context of the 2018 election are not at all predictive in the context of the 2020 election. Since these are not theories that are intended to be particularly temporally constrained, this suggests that temporal validity is a first-order source of predictive error.
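To make the temporal-validity point concrete, here is a hedged sketch of the exercise I have in mind (not the authors’ actual analysis): fit a model of per-ad treatment effects on hand-coded “theory” features using one cycle’s ads, then see how much variance it explains in the next cycle’s. All the data below are simulated, and the per-cycle ad counts are arbitrary.

```python
# Sketch of a temporal-validity check: predict per-ad treatment effects
# from hand-coded theory features, training on one election cycle and
# testing on the next. All data are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)

def simulate_cycle(n_ads, coefs):
    X = rng.normal(size=(n_ads, len(coefs)))   # hand-coded features
    noise = rng.normal(scale=1.0, size=n_ads)  # idiosyncratic variation
    return X, X @ coefs + noise                # features, per-ad effects

# Suppose the feature-effect relationship drifts between cycles.
X18, y18 = simulate_cycle(300, coefs=np.array([0.30, -0.20, 0.10]))
X20, y20 = simulate_cycle(380, coefs=np.array([0.05, 0.10, -0.25]))

# Fit on the 2018-style cycle, evaluate on the 2020-style cycle.
beta, *_ = np.linalg.lstsq(X18, y18, rcond=None)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("in-cycle R^2:    ", round(r_squared(y18, X18 @ beta), 3))
print("out-of-cycle R^2:", round(r_squared(y20, X20 @ beta), 3))
```

When the relationship drifts, the out-of-cycle R^2 collapses (it can even go negative), which is exactly what it means for temporal validity to be a first-order source of predictive error.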
To connect these papers to the heady theorizing up top, let’s start with an easy one:
What is “theory”?
There is significant disagreement on this question across disciplines. But I want to focus on the academic paper as a form of media: specifically, the genre of paper that proposes a theory and provides an empirical test of it. Through this lens, we can see that a theory is “the words that come before the hypothesis”. A good theory is “words that come before the hypothesis of which the reviewers approve.”
But the *hypothesis* is often the crux of these papers, what guides us in understanding the empirical test. And what is a hypothesis?
A hypothesis is a sentence.
That is, a hypothesis is the crucial bridge between the hundreds of words comprising the theory and the empirical operationalization of the stimuli. It is thus, from the perspective of information theory, a compression of that theory into a lower-dimensional space. The empirical test of the hypothesis is, analogously, a low-dimensional projection of the overflowing social reality of the phenomena of interest.
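One rough way to formalize this (my notation, not anything taken from these papers): a theory ranges over a high-dimensional treatment space, a hypothesis compresses it into a one-bit claim about a single contrast, and the experiment estimates only that contrast.

```latex
% A theory ranges over a high-dimensional treatment space:
\[
  f : \mathcal{D} \subseteq \mathbb{R}^{k} \to \mathbb{R}, \qquad k \gg 1 .
\]
% A hypothesis compresses f into a one-bit claim about a single contrast:
\[
  H_{1}:\; \mathbb{E}\!\left[Y \mid d = d_{1}\right] - \mathbb{E}\!\left[Y \mid d = d_{0}\right] > 0 .
\]
% The empirical test projects social reality onto that same contrast,
% estimating nothing but \( \hat{\tau} = \bar{Y}_{1} - \bar{Y}_{0} \).
```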
A fundamental weakness of the academic paper as media is that the textual medium is unable to communicate variations in dimensionality. A hypothesis is a sentence regardless of the dimensionality of the space in the theoretical world to which it refers.
Return to the practice of in-line citations to illustrate the problem. The theory section contains claims followed by parenthetical citations. These claims and citations serve a fundamentally binary function: paper shows X (XXXX), paper shows Y (YYYY). Within this space, the *amount of knowledge* contained within each of the papers thereby cited (or rather, perhaps, the amount of knowledge contained in the CSVs used to create those papers) is rendered artificially equivalent: one unit of knowledge apiece.
In-line citations, at present, are a serious but potentially fixable problem. And I am not drawing a strong distinction between linear text and printed graphs here. A picture speaks a thousand words, sure, and a well-designed graph kicks up the informational value maybe one more order of magnitude. Including graphs definitely increases the density of information in an academic paper.
The fundamental constraint is the rate at which the human brain can take in and process information. Try to conceive of the amount of information encoded in the 681 political ads tested in Hewitt et al.: how many attributes does each video have? How does each of the half million subjects understand these videos?
These are all “bundled treatments,” in the language of the potential outcomes framework. Research on “latent treatments” from high-dimensional stimuli suggests that it is difficult for academics to identify what exactly is being manipulated in, say, a set of real-world political ads.
The standard scientific method involves human “translators” at every stage in the process: take the ads, code them on theoretically relevant dimensions, and write a paper that summarizes the ads and findings within this low-dimensional “theory space”; then other humans read the paper, cogitate, and go on to test new ads or develop new theories.
Each of these acts of translation is prone to error, is an expensive use of highly educated human labor, and is limited by human cognitive and perceptive constraints.
Here, then, is the reason I find the adaptive experimental approach so interesting from the perspective of Metascience: the machine learning algorithm is able to extract high-dimensional knowledge directly from the stimulus/subject interaction and then design and execute “the next experiment” (that is, a refined distribution of the original stimuli) without requiring a detour through human-readable theory.
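To make “without requiring a detour through human-readable theory” concrete, here is a hedged sketch of one way such a loop could work: a linear (“LinUCB”-style) bandit operating directly on opaque embeddings of the stimuli, reallocating each new wave of subjects without anyone ever naming a feature. The embedding step, the dimensions, and the simulated responses are all placeholders, not a description of any real system.

```python
# Sketch of a closed experimentation loop over opaque stimulus embeddings:
# no human ever names a theoretical dimension. The embeddings stand in for
# whatever video/text representation is available; responses are simulated.
import numpy as np

rng = np.random.default_rng(2)

n_stimuli, dim = 681, 64
embeddings = rng.normal(size=(n_stimuli, dim))  # stand-in for embed(ad_i)
true_weights = rng.normal(size=dim) * 0.1       # unknown to the algorithm

A = np.eye(dim)     # ridge-regression state: X'X + I
b = np.zeros(dim)   # ridge-regression state: X'y

for wave in range(10):
    theta = np.linalg.solve(A, b)   # current high-dimensional effect model
    A_inv = np.linalg.inv(A)
    # Optimism: predicted effect plus an uncertainty bonus for each stimulus.
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", embeddings, A_inv, embeddings))
    scores = embeddings @ theta + bonus
    # "The next experiment": concentrate the next wave on promising stimuli.
    for i in np.argsort(scores)[-50:]:
        x = embeddings[i]
        y = x @ true_weights + rng.normal()   # simulated subject response
        A += np.outer(x, x)
        b += y * x

theta_hat = np.linalg.solve(A, b)
print("correlation of learned vs. true weights:",
      round(float(np.corrcoef(theta_hat, true_weights)[0, 1]), 2))
```

Whether or not the real systems look anything like this, the point is that the loop closes (data in, refined design out) without a sentence-sized hypothesis ever being written down.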
To conclude, let me try a few formulations of a key question:
How much of the variance in the high-dimensional world of real human behavior is it possible to explain in a sentence?
What is the highest-dimensional treatment for which there is minimal treatment effect heterogeneity?
Given the answer to these questions, then, how should we organize social science?