Note: this is the third of four posts discussing the 2020 Meta Election research partnership and the resulting papers, surrounding a keynote panel at IC2S2 on the topic last Thursday July 18; here are the first and the second.
We’re almost two decades into the modern era of social science reforms. “Credibility Revolution”, “Causal Revolution”, “Replication Crisis.” Wake up, Boomers: Science 2 just dropped.
Terminology aside, we can safely say that the practice of quantitative and especially experimental social science has changed dramatically. So, we might ask:
How’s it going?
Or more specifically:
Have the costs been worth the benefits?
Ironically, given the hegemonic status of cost-benefit analysis in political, technological and even personal decision-making, the framework seems somehow profane in the context of Science. Pre-reform social scientists were using methods that we now believe to be invalid. How can we use methods that have been identified as causing bias?
The sociology of “methodology” within quant psychology/political science/econ is fundamentally mystical. We dispatch our priestly class to train in the arcane arts of statistics and math, in the hopes that they might ascend the mountaintop to glimpse the face of God (physics), and deliver back to us commandments about which practices are forbidden, permissible, or mandatory.
Until recently, this appeared to work; my account above is just a hyperbolic description of the division of academic labor. But given the development of academic blogs and especially Twitter, we are beginning to realize that our priests have been climbing different mountains. Their commandments conflict, a classic double bind. There can be no cost-benefit analysis when it comes to commandments.
In practice, these contradictions are resolved by subtly ignoring the older commandments, and the ones promoted by priests with weak PR chops. Rather than a coherent effort to advance towards a goal, social science methodology is like playing Whac-A-Mole. Like trying to sculpt a balloon with our bare hands — we can squeeze more rigor into a new area, but in so doing we let slack back in where rigor had been emphasized before. The circular progress of the whirlpool.
Meta-science reform in the current environment privileges the simplest and strictest commandments — the ones which can be tweeted about, the ones which can be subsumed into a checklist, the ones for which you can get a merit badge. More fundamentally, meta-science reform seems always to be more *sciencey*. Science 2 isn’t a sequel — it’s a gritty reboot.
Given our culture and our raison d’être vis-à-vis other forms of knowledge about humans and society, it is much easier to coordinate on a quantitative social science reform with the aesthetic of moving us towards physics, rather than towards the practices of qualitative methods or even the humanities.
My meta-scientific impulse, elaborated in Temporal Validity as Meta-Science, is to put a fine point on this contradiction. If we’re going to do science reform in this culturally positivist way, bending always towards public and verifiable displays of rigor, we need to be rigorous about how we do so. We cannot do science reform by commandment. We need actual cost-benefit analysis — we need metascience. This means theorizing, qualitative data, quantitative data, experiments, sure, but it also means deciding on what we’re doing.
Most of the time, we’re really not doing anything at all. The vast majority of quantitative social science work is simply not serious, in Adam Mastroianni’s incisive framework.
The Meta2020 experiments are a valuable “model organism” for metascientific reflection because they represent such a large, coordinated effort. They are serious. But are they good enough?
The following sentence does not inspire confidence. The largest RCT of Facebook and Instagram use ever conducted finds that:
“[the effect on] self-reported vote choice is large enough to be meaningful in a close election, although it is not significant at our preregistered threshold”
The estimated effect was “large enough to be meaningful in a close election” — but in strict compliance with the commandments of science reform, the study is unable to distinguish this effect from zero.
The most ambitious social science of social media to date was underpowered to detect an effect large enough to change the result of a US Presidential election.
From what I can tell, the problem emerged on one side from the following two commandments:
Thou shalt pre-register all of your hypotheses
Thou shalt conduct statistical adjustments for multiple comparisons
This combination makes perfect sense on its own, as a plausible correction to an undeniable problem with social science practice. We were getting way too many false positives from researcher degrees of freedom and from simply running too many tests. This is what led to the “Replication Crisis.” So I argue that the vibe of the “Credibility Revolution” coming out of psychology (and, before that, from medicine and clinical trials) is deductivist. The associated philosophy of statistics is frequentist, producing the hegemonic and flamboyantly rigorous practice of Null Hypothesis Significance Testing.
This vibe is in tension with that of the emerging RCT-conducting community. This experimentalist community, coming out of development econ and poli sci, is fundamentally inductivist. They’re fucking around and finding out lots of cool stuff — hopefully one day it’ll all hang together and they’ll figure out how things work. This is the spirit of the “Causal Revolution.” There is one commandment above all others: causal identification is mandatory. Otherwise, it’s a bit ad hoc. The associated philosophy of statistics is Bayesian, with the accompanying (and far less rigorous) practice of “updating your priors.”
This tension, I think, is what caused the problem. The experimentalists had the practical insight that it’s far cheaper at the margin to add more outcome variables. It’s the worst feeling in the world to run a years-long expensive RCT and then be asked by a reviewer “Why did you measure the effect on X instead of Y?” The ratio of the cost of measuring both X and Y to the cost of conducting the field experiment is very low; so why not just measure as many outcome variables as you can?
Well, the reason not to measure (and, as commanded, preregister) more outcome variables is that you’re taking a “penalty” in terms of Effective Sample Size/power due to the commanded multiple comparisons adjustments.
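To put a rough number on that penalty, here is a minimal sketch in Python of how a Bonferroni-style family-wise correction erodes statistical power as the count of preregistered outcomes grows. Everything here is assumed for illustration: the two-arm design, the sample size, the 0.02 SD effect, and the choice of Bonferroni itself; it is not the Meta2020 team’s actual design or adjustment procedure.

```python
# Hypothetical illustration (not the Meta2020 analysis plan): how splitting a
# family-wise alpha across more preregistered outcomes erodes per-test power.
from scipy.stats import norm

n_per_arm = 20_000      # participants per arm (assumed)
effect_sd = 0.02        # true effect in standard-deviation units (assumed)
alpha_family = 0.05     # family-wise error rate the commandments demand

se = (2 / n_per_arm) ** 0.5   # standard error of the difference in means, in SD units
z_effect = effect_sd / se     # expected z-statistic for the true effect

for n_outcomes in (1, 5, 20, 100):
    alpha_per_test = alpha_family / n_outcomes   # Bonferroni split of the family alpha
    z_crit = norm.ppf(1 - alpha_per_test / 2)    # two-sided critical value
    power = 1 - norm.cdf(z_crit - z_effect)      # approximate power for one outcome
    print(f"{n_outcomes:>3} outcomes: per-test alpha = {alpha_per_test:.4f}, power = {power:.2f}")
```

Under these made-up numbers, power falls from roughly 50% with a single preregistered outcome to under 10% with a hundred. The exact figures don’t matter; the direction of the trade-off is the point.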
This tension, between the science reforms coming out of different communities, is the source of the problem. More fundamental, I think, is a disagreement about what an experiment is for—but that’s a topic for another time.
The contradiction in this case becomes more severe given the industrial organization of the enterprise. This is “Big Science”: it’s simply not possible to conduct research at this scale without large teams of collaborators. This has most commonly been implemented in the “lab model,” where a PI or small team of co-PIs is in charge of a collaboration with grad students and postdocs. This hierarchical model has some advantages in reducing the costs of coordination and consensus.
In contrast, the Meta2020 collaboration involved nearly three dozen quite established academics and industry partners. I can only imagine how much time it took to reach consensus — even the “consensus” of delegating authority to the “lead author” on each paper to dictate the terms. This form of “Big Science” thus shades into “science by committee,” which takes time and can produce incoherent results.
The incoherence might be sorted out by a “marketplace of ideas,” if one were to exist. Independent scientists have very different intuitions about what makes good science — they each do their own thing and whichever approach produces the best knowledge wins out. By centralizing the knowledge production, however, there’s no guarantee that even one coherent program will be implemented.
Readers will know that my intuitions are much more aligned with the inductivist, experimentalist vibe, especially in a context like the Meta2020 experiments. No one has ever done anything at this scale and depth before, and obviously it had never been done during the 2020 US Presidential Election before. The idea that we should have very strong theoretical expectations about what would happen strikes me as deeply misguided. The research team didn’t even know how to implement the treatments beforehand — how could they have very strong beliefs about what those treatments would do?
My guess is that many members of the research team shared my sentiments. And every one of them absolutely had their pet theories to test, their one or two outcome variables to toss into the mix. Given a flat organizational structure and a team of 32, it’s easy to see how the number of outcome variables would proliferate. As, in my opinion, it should — if you’re running a multi-million dollar experiment, you should try to learn as much as you can!
But the presence of even one or two deductivist hardliners on the committee, combined with the strong coordination mechanism of a commandment to follow the more science-y practice, would be enough to ensure that every one of those outcome measures was preregistered and adjusted for.
To be clear: I’m not 100% sure that either of these vibes and its associated practices is the correct way to do science. I have my preferences, others have their own; that’s totally natural.
But I am 100% certain that there’s no one making sure that all of this Science 2 makes sense. I think that these vibes are incoherent, the product of science reforms emerging from distinct methodological communities and clashing in practice. The more ambitious, the more interdisciplinary, the greater the risk of incoherence.