How Scientists Massage Results Through ‘P-Hacking’


The process of scientific discovery is designed to find significance in a maze of data. At least, that’s how it’s supposed to work.

By some accounts, that façade began to crumble in 2010, when Daryl Bem, a social psychologist at Cornell University, published a decade-long analysis in the prestigious Journal of Personality and Social Psychology, demonstrating with widely accepted statistical methods that extrasensory perception (ESP), essentially a “sixth sense,” was an observable phenomenon. Bem’s colleagues couldn’t replicate the paper’s results, and they soon blamed what we now call “p-hacking”: the practice of massaging and overanalyzing your data in search of statistically significant, and publishable, results.


To support or refute a hypothesis, the goal is to establish statistical significance by recording a “p-value” of less than 0.05, explains Benjamin Baer, a postdoctoral researcher and statistician at the University of Rochester whose current work seeks to address this issue. The “p” in p-value stands for probability: the value measures how likely it would be to see a result at least as extreme as the one observed if the null hypothesis were true, that is, if only chance were at work.

For example, if you wanted to test whether all roses are red, you would count the red roses and the roses of other colors in your sample and run a hypothesis test comparing the two counts. If that test yields a p-value of less than 0.05, you have statistically significant grounds for claiming that only red roses exist, even though evidence outside your sample of flowers suggests otherwise.
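To make the mechanics concrete, here is a minimal sketch in Python of that kind of test. The rose counts and the choice of a one-sided binomial test are illustrative assumptions, not details from the article.

```python
# Minimal sketch of a simple hypothesis test; the counts below are hypothetical.
from scipy.stats import binomtest

red_roses = 60      # red roses observed in the sample (made-up number)
other_roses = 40    # roses of other colors (made-up number)
n = red_roses + other_roses

# Null hypothesis: red roses are no more common than other colors (p = 0.5).
result = binomtest(red_roses, n=n, p=0.5, alternative="greater")

print(f"p-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Statistically significant at the 0.05 level")
else:
    print("Not statistically significant")
```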

The misuse of p-values to support the idea that ESP exists may be relatively harmless, but when the practice creeps into medical trials, it can have very serious consequences, Baer says. “I think the biggest risk is the wrong decision,” he explains. “There’s this big debate going on in science and statistics, trying to figure out how to make sure that this process can happen more smoothly and that the decisions are actually based on what they should be.”

Baer was first author on a paper published in late 2021 in the journal PNAS, along with his former Cornell mentor, professor of statistics Martin Wells, that looked at how new statistics could improve the use of p-values. The metric they examine is called the fragility index, and it is designed to complement and improve on p-values.

The measure describes how fragile a result is if some data points were to flip from a positive to a negative outcome, for example, if a patient recorded as benefiting from a drug had actually felt no effect. If changing just a few such data points is enough to drag the result below statistical significance, the finding is considered fragile.
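To make the flipping procedure concrete, here is a rough Python sketch of the original 2×2 fragility-index calculation (the Walsh-style version discussed below, not the generalization in Baer and Wells’ paper). The trial counts are hypothetical.

```python
# Sketch of the basic fragility index for a 2x2 trial table; numbers are hypothetical.
from scipy.stats import fisher_exact

def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
    """Count how many outcome flips in group A it takes to lose significance."""
    flips = 0
    while True:
        table = [[events_a, total_a - events_a],
                 [events_b, total_b - events_b]]
        _, p = fisher_exact(table)
        if p >= alpha:
            return flips            # result is no longer statistically significant
        if events_a >= total_a:
            return flips            # nothing left to flip
        events_a += 1               # flip one non-event to an event in group A
        flips += 1

# Hypothetical trial: 5/100 events on the drug vs. 15/100 on placebo.
print(fragility_index(events_a=5, total_a=100, events_b=15, total_b=100))
```

A small fragility index means that flipping only a handful of patient outcomes would erase the statistical significance of the trial.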

[Figure: p-value curve]

Physician Michael Walsh originally proposed the fragility index in 2014 in the Journal of Clinical Epidemiology. In that paper, he and his colleagues applied the fragility index to about 400 randomized controlled trials with statistically significant results and found that one in four had a low fragility score, meaning their findings may not actually be very reliable or robust.

However, the fragility index has yet to gain much traction in clinical trials. Critics of the approach have emerged, such as Rickey Carter of the Mayo Clinic, who say it is consistent with p-values without offering substantial improvements. “The irony is that the fragility index was a p-hacking approach,” says Carter.


To improve the fragility index and answer earlier criticisms, Baer, Wells, and their colleagues focused on two key elements: allowing only sufficiently likely modifications to the data, and generalizing the approach to work beyond binary 2×2 tables (which represent positive or negative results for the control and experimental groups).

Despite the uphill battle the fragility index has faced so far, Baer says he still believes it is a useful metric for medical statisticians, and he hopes the improvements in his recent work will help convince others of the same.

“Talking to a victim’s family after a failed operation is a very different [experience] than statisticians sitting at their desks doing math,” says Baer.
