madscientist

I know this is a technology blog but today, let’s talk about science.

When I’m not theorizing digital media and technology, I moonlight as an experimental social psychologist. The Reproducibility Project, which ultimately found that results from most psychological studies cannot be reproduced, has therefore weighed heavily on my mind (and featured prominently in over-excited conversations with my partner/at our dogs).

The Reproducibility Project is impressive in its size and scope. In collaboration with the authors of original studies and volunteer researchers numbering in the hundreds, project managers at the Open Science Framework replicated 100 psychological experiments from three prominent psychology journals. Employing “direct replications” in which protocols were recreated as closely as possible, the Reproducibility Project found that out of 100 studies, only 39 produced the same results. That means over 60% of published studies did not have their findings confirmed.

In a collegial manner, the researchers temper the implications of their findings by correctly explaining that each study is only one piece of evidence and that theories with strong predictive power require robust bodies of evidence. Failure to confirm, therefore, is not necessarily a product of sloppy design, statistical manipulation, or dishonesty, but an example of science as an iterative process. The solution is more replication: each study acts as its own data point in the larger scientific project of knowledge production. From the conclusion of the final study:

As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation… Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims.

This is an important point, and replication is certainly valuable for the reasons the authors state. The point is particularly pertinent given an incentive structure that rewards new and innovative research and statistically significant findings far more than research that confirms what we already know or yields null results.

However, in its meta-inventory of experimental psychology, the Reproducibility Project suffers from a fatal methodological flaw: its use of direct replications. This methodological decision, premised on accurate mimicry of the original experimental protocols, misunderstands what experiments do: they test theories. The Reproducibility Project replicated empirical conditions as closely as possible, while the original researchers treated empirical conditions as instances of theoretical variables. Because it was premised on empirical rather than theoretical conditions, the Reproducibility Project did not test what it set out to test.

Experiments are sometimes critiqued for their artificiality. This critique is based in misunderstanding. Like you, experimentalists don’t actually care how often college students agree with each other during a 30-minute debate, or how quickly they solve challenging puzzles. Instead, they care about things like how status affects cooperation and group dynamics, or how stereotypes affect performance on relevant tasks. That is, experimentalists care about theoretical relationships that pop up in all kinds of real-life social situations. But studying these relationships is challenging. The social world is highly complex and contains infinite contingencies, making theoretical variables difficult to isolate. The artificial environment of the lab helps researchers isolate their theoretical variables of interest. The empirical circumstances of a given experiment are created as instances of these theoretical variables, and those instances necessarily change across time, context, and population.

For example, diffuse status characteristics, a commonly used social psychological variable, are defined as observable personal attributes with two or more states that are differentially evaluated, where each state is culturally associated with general and specific performance expectations. Examples in the contemporary United States include race, gender, and physical attractiveness. We know that any of these may eventually cease to be diffuse status markers (hence the goal of social justice activism). Similarly, we can be sure that definitions of “physical attractiveness” vary by population.

Experimentalists are meticulous (hopefully) in designing circumstances that instantiate their variables of interest, be they status, stereotypes, or, as in the case below, decision making.

One of the “failed replications” came from a study that originated at Florida State University. This study asked students to choose between housing units: small but close to campus, or larger but further from campus. The purpose of the study was to test conditions that affect decision making processes (in this case, sugar consumption). For FSU students, the housing choice was a difficult decision. At the University of Virginia, where the study was replicated, the decision was easy. Florida State is largely a commuter school and UVA is not, so living close to campus was the only reasonable option for the replication population. Unsurprisingly, the findings from Florida didn’t translate to Virginia. This is not because the original study was poorly designed, statistically massaged, or a fluke, but because in Florida the housing choice was an instance of a “difficult choice” while in Virginia it was not. The theoretical variable of interest did not translate, and so the replication study failed to replicate the theoretical test.

Experimentalists would not expect their empirical findings to replicate in new situations. They would, however, expect new instances of the theoretical variables to produce the same results, even if those instances look very different.

Therefore, the primary concern of a true replication study is not the empirical research design itself, but how that design represents social processes that persist outside of the laboratory. Of course, because culture shifts slowly, empirical replication is both useful and common in recreating theoretical conditions. However, a true replication is one that captures the spirit of the original study, not necessarily one that copies it directly. In contrast, the Reproducibility Project is actively atheoretical. Footnote 5 of their proposal summary states:

Note that the Reproducibility Project will not evaluate whether the original interpretation of the finding is correct.  For example, if an eligible study had an apparent confound in the design, that confound would be retained in the replication attempt.  Confirmation of theoretical interpretations is an independent consideration.

It is unfortunate that the Reproducibility Project contains such a fundamental design error, despite its laudable intentions. Not only did the project consume substantial resources, but it also takes an important and valid point (we need more replication) and undermines it with weak evidence. The Reproducibility Project proposal concludes with a compelling statement:

Some may worry that discovering a low reproducibility rate will damage the image of psychology or science more generally.  It is certainly possible that opponents of science will use such a result to renew their calls to reduce funding for basic research.  However, we believe that there is a much worse alternative: having a low reproducibility rate, but failing to investigate and discover it.  If reproducibility is lower than acceptable, then we believe it is vitally important that we know about it in order to address it.  Self-critique, and the promise of self-correction, is why science is such an important part of humanity’s effort to understand nature and ourselves.

I wholeheartedly agree. We do need more replication, and with the move toward electronic publishing models, there is more space than ever for this kind of work. Let us be careful, however, to conduct replications with the same scientific rigor that we expect of the studies’ original designers. And in the name of scientific rigor, let us always be sure to understand the connection between theory and design.


Jenny L. Davis is on Twitter @Jenny_L_Davis
