Your hypothesis test sucks inside the Garden of Forking Paths

Trigger warning: I don’t do a lot of editing, since it’s my least favorite thing, so I apologize for the stream-of-consciousness flow of this essay. Also, at nearly 5000 words, the essay is long. And, if I do say so myself, I think sections I and II are weaker than III and IV.

 

I.

Andrew Gelman, a statistician and social science researcher at Columbia (but a statistician by training – check his CV), doesn’t like hypothesis testing. After reading some of what he has to say on the subject, coupled with my own experiences, I am inclined to agree.

When I took undergraduate mathematical statistics, the course primarily consisted of defining some notions from probability theory (such as expectation, distribution functions, and convergence in probability, which is just a fancy way of saying convergence in measure, with a particular measure), then giving a slightly too vague definition of an estimator, showing what consistency, (un)biasedness, and sufficiency were, and finally deriving a bunch of simple statistical tests: the two-sample t-test, confidence intervals, some non-parametrics, etc.

I also remember mostly being fascinated with non-parametric statistics. Why? Few assumptions were needed in order to apply them. Consider the Mann-Whitney U-test (hereafter the rank-sum test) or the Wilcoxon signed-rank test. The Mann-Whitney test in particular has paper-thin assumptions about your data. You need independent observations from your groups (every statistical test ever), drawn from an underlying continuous distribution (no problem, measurements are often of a continuous nature), you need ordinal observations (immediate from the assumption of a continuous distribution), and you need a well-defined null and alternative hypothesis, just like in every other statistical test you would use.

The rank-sum test is closely related to the independent-sample t-test. Not only that, it is actually more *robust* than the t-test when the assumptions of the t-test are not adequately met. The rank-sum test can also be used to test the homogeneity of distributions, because it is sensitive to a difference in variances between samples. While this information is perhaps less precise than a simple analysis of means between groups, it can be quite telling: if the distributions are significantly different between samples, then it’s likely that the “treatment” had an effect. This test seemed so magical at the time that I couldn’t help but wonder: Why doesn’t everyone use this test? Why is the t-test so popular? You don’t need assumptions of normality, and you don’t need a “sufficiently large sample” for the central limit theorem (CLT) to hold – whatever that means.
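To make this concrete, here is a minimal sketch in Python (scipy) of the kind of comparison I had in mind back then. Everything in it – the sample sizes, the lognormal “treatment” shift – is invented for illustration; the point is just how little scaffolding the rank-sum test needs.

```python
# Toy comparison of the rank-sum test and the t-test on skewed data.
# The sample sizes and the lognormal "treatment" shift are made up
# purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.lognormal(mean=0.0, sigma=1.0, size=40)  # skewed, not normal
treated = rng.lognormal(mean=0.5, sigma=1.0, size=40)  # shifted up a bit

t_stat, t_p = stats.ttest_ind(control, treated)        # Student's t-test
u_stat, u_p = stats.mannwhitneyu(control, treated,     # rank-sum test
                                 alternative="two-sided")

print(f"t-test:   p = {t_p:.4f}")
print(f"rank-sum: p = {u_p:.4f}")
```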

[NB: “Sufficiently large” actually does have a meaning, but, at my level of training, I haven’t seen a formal argument for it. Here is a (the?) heuristic one: the CLT, in order to hold, requires that your sample size n \rightarrow \infty. This is actually *never* true, because we do not have infinite sample sizes. The big joke in statistics is that, as a rule of thumb, once n > ~30 or so, that’s large enough for the CLT to hold, and you can just apply tests that require it. Does no one see the madness behind this? Moreover, the CLT approximation is not very robust to *highly skewed* data, which means the “rule of thumb” often doesn’t even work if your data set is, I dunno, from any actual population. If your data comes from real-life people, or real observations, chances are it will not be nice and normally distributed. Skew hurts every test, including the t-test, and if it’s bad enough, no amount of praying to the central limit theorem will fix that.]
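You can check the rule of thumb yourself. Here is a quick simulation sketch, with choices (a lognormal population, n = 30, a two-sided one-sample t-test of the known true mean) that are entirely mine: if the CLT heuristic were doing its job, the false-rejection rate would sit at the nominal 5%.

```python
# Check the "n > 30 is enough for the CLT" rule of thumb on skewed data.
# Draw samples of size 30 from a lognormal distribution whose true mean is
# known, t-test that true mean, and count how often we falsely reject.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims = 30, 20_000
true_mean = np.exp(0.5)  # mean of a lognormal(0, 1) distribution

rejections = 0
for _ in range(n_sims):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=true_mean)
    rejections += p < 0.05

print(f"empirical type I error: {rejections / n_sims:.3f} (nominal: 0.050)")
```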

Okay, so the argument is that if you *do* have normally distributed samples (ha!) then the t-test is the better hypothesis test. And if you actually do have a decent amount of data, the CLT approximation is “pretty good”, which works in your favor when it comes to power.

Oof, yeah. Turns out that non-parametrics have a fatal flaw: many of them have low power. Power is, informally, a test’s ability to detect an effect of a given size: the more power, the smaller the effects you can reliably detect. Make the effect large enough, and even statistical tests with shit-tier levels of power will pick up on it. Think of this as hearing. Someone who is hard of hearing has low “power” in the domain of detecting sound waves, but even a completely deaf person will “hear” a bomb going off down the street – the reverberations will rattle their bones. In particular, the rank-sum test is only consistent (its power tends to 1 as the sample size grows) against alternatives in which observations from one group genuinely tend to be larger than observations from the other. And in the case that your samples actually are normally distributed, the two-sample t-test remains more powerful. In fact, it has a nice property: it is the uniformly most powerful unbiased test when its assumptions are met.

In general, more power is good, but there is such a thing as having too much power, which often occurs when dealing with massive data sets. Typically this won’t happen in academic research, where funding is limited and you can’t hook up the entire country to your MRI machine. But in the era of big data, sample sizes that come from web scraping, or even from ordinary business analytics, actually do approximate the critical property n \rightarrow \infty, and this can push power all the way to 1. Essentially this means that your hypothesis test *will* reject any null hypothesis, no matter how small the effect size is. This is a bad thing. Even minute differences in measurement quality between samples will produce a rejected hypothesis test with an arbitrarily small p-value. Again, this is generally not going to be an issue in academic research: your data sets will simply not be large enough, and very likely your data will be so noisy that your power will be closer to 0 than to 1 (t. every social science study ever).

However, this has practical implications for anyone who works with “big data”. I probably need to read more, but inference on massive data sets is very tricky. *Every* effect will flag as significant, and when the number of features p exceeds the sample size n, good luck fitting a regression and having it make sense. “Feature selection, bro.” Sure, but the moment you start selecting features, your p-values will fail to achieve their nominal value and your estimates will be biased. This is made even worse if you use a bad feature selection technique, such as stepwise regression or pretty much any univariate feature selection. You can throw out features using domain expertise and just model what “makes sense”, but this still seems wrong to me: you have information that is not being used. Maybe that’s beside the point when it comes to causal modeling. (Can someone educate me here?)
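Here is a sketch of the too-much-power problem. The 0.01-standard-deviation “effect” and the five-million-per-group sample size are invented to be practically meaningless; the p-value disagrees.

```python
# With a big enough sample, even a trivial difference earns a tiny p-value.
# The 0.01-sd "effect" below is deliberately negligible in practical terms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000_000
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.01, scale=1.0, size=n)  # a 0.01-sd "effect"

t_stat, p = stats.ttest_ind(a, b)
print(f"t = {t_stat:.1f}, p = {p:.2e}")       # rejects decisively
```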

I digress; there are massive tomes written on non-parametric statistics – I know because I spend too much time in the library – but they seem to be seldom used by researchers. Perhaps the boost in power you get from a t-test with a moderate sample size is worth accepting, despite the flimsy assumption checking. I am not totally sure, but if you open up PubMed and read some studies, you will hardly ever see “non-parametric” or “bootstrapping”. I am not sure if I am lacking in education regarding experimental design, or if I am actually missing something really big. Why do we focus on hypothesis tests anyway, when, if your data is sufficiently large, or noisy, or doesn’t meet the assumptions of the test, the results can’t be trusted?

II.

This post isn’t really about non-parametrics vs normal-theory backed tests.

I never thought that a single undergraduate-level class could be transformative. But here I am. Transformed. It was a data analysis class, and part of the reason it was so interesting was that it was in the math department but open to anyone who knew a little bit of R or Python and some basic rules of probability. The class was rigorous in a sense: proofs were shown, derivations were made. But what made it different was that, for the first time for me, we stopped talking so much about how a test works under perfect theoretical assumptions and instead just started working with real data. Old data, yes, but real data, from real studies. With missing values, non-symmetric distributions, unequal sample sizes between groups, you name it. While the learning still took place in an academic safe space, the data and the class itself lacked structure, and that is what made it so interesting.

Here was the skeleton of the class: You received some data, you were given a list of a couple things you *had* to do that would earn you a B or a B+ or whatever, if you did them all perfectly. If you wanted a higher grade, you had to go above and beyond and ask additional questions about the data, and perform tests. This class, in the end, turned out to be the most illuminating class of my (short) statistical career. Since I only knew a couple things, and I wanted to impress the professor, I started reading a bit more. And, as the cliche goes, the more reading I did, the less I knew.

(NB: Anyone who is familiar with research might be aware that broadening your hypotheses as you explore the data is actually a bad habit, unless the intent of the data analysis is, in fact, exploration. This type of analysis is called an exploratory analysis and is used to generate hypotheses for further study – i.e., replication. This was, in fact, the title of the class. There is a reason I am noting this; it comes up later.)

I remember one particularly frustrating assignment, in which we had two groups with some continuous trait that I wanted to compare. The groups were pretty close to homogeneous, except in the observation of what I later learned to be a kind of treatment effect (in a loose sense): one group consisted of smokers, and the other did not. This was not the problem. The problem was that the non-smoking group had many more samples than the smoking group (iirc, this might be flip-flopped), and the variances were obviously not equal. While the distributions of the two groups *looked* normal (I mean statistically normal), there were enough differences in even the histograms that one could not reasonably be sure they came from the same distribution. I asked my professor if this was a problem: “It is.” Later in lecture, she showed a simulation study revealing that the t-test doesn’t like the sample sizes of the groups to differ too much when the variances are also unequal. See also this link. Before you start yelling: yes, the t-test is remarkably robust to unequal variances, but only when the samples are approximately equal in size. And even if you have equal sample sizes, how sure are you that the variances are equal? My (and probably your) statistics textbook recommended an F-test for equality of variances prior to the t-test. This is actually a bad thing to do: whenever you predicate a statistical test on the results of another statistical test, you adversely affect the nominal p-value of the subsequent test and bias your result. In the end, I had to do some reading outside of the assigned textbooks in order to figure out the correct procedure.
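Here is a sketch of the kind of simulation she showed, with numbers I made up: both groups share the same population mean, one group is small with a large variance, the other is large with a small variance, and the pooled-variance Student’s t-test false-alarms far more often than the 5% it advertises.

```python
# How Student's (pooled-variance) t-test behaves under the null when a small,
# high-variance group is compared against a large, low-variance group.
# The specific sample sizes and variances are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, alpha = 20_000, 0.05

rejections = 0
for _ in range(n_sims):
    small = rng.normal(0, 3.0, size=20)    # small group, large variance
    large = rng.normal(0, 1.0, size=200)   # large group, small variance
    _, p = stats.ttest_ind(small, large)   # equal_var=True by default
    rejections += p < alpha

print(f"empirical type I error: {rejections / n_sims:.3f} (nominal: {alpha})")
```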

(The solution: there are two. You can bootstrap subsamples from the larger group down to the size of the smaller group, compute a t-test statistic, then repeat the process; do this many times and compute an estimated p-value from your bootstrapped t-test statistics. Or just go simple and do a bootstrapped difference-in-means test, which tests essentially the same thing and is easier to implement – though you assume the variances are equal in the latter method, which doesn’t really solve the original problem. If you draw without replacement, you have a permutation test, which will reject the null if the variances are unequal *or* the means are unequal. I actually like this test a lot – but it’s a little old fashioned. The second approach is the standard correct answer: Welch’s t-test. In fact, some people recommend *always* using Welch’s t-test. When the assumptions of Student’s t-test are met, the two perform about the same; when they are not met, Welch is clearly better. Allegedly it still has a problem with vastly different sample sizes, though some would disagree with that point.)
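Both fixes are a few lines of Python. The smoker/non-smoker numbers below are invented stand-ins for the assignment data; Welch’s test is a single keyword argument in scipy, and the permutation test is an explicit loop over shuffled group labels.

```python
# Two ways to compare groups with unequal sizes and variances:
# (1) Welch's t-test, (2) a permutation test on the difference in means.
# The "smoker" / "non-smoker" data here is simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
smokers = rng.normal(5.0, 3.0, size=25)      # small group, large variance
nonsmokers = rng.normal(4.0, 1.0, size=250)  # large group, small variance

# 1) Welch's t-test: just don't pool the variances.
t_welch, p_welch = stats.ttest_ind(smokers, nonsmokers, equal_var=False)

# 2) Permutation test: shuffle the group labels, recompute the difference in
#    means, and see where the observed difference falls.
observed = smokers.mean() - nonsmokers.mean()
pooled = np.concatenate([smokers, nonsmokers])
n_small, n_perm = len(smokers), 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_small].mean() - shuffled[n_small:].mean()
p_perm = np.mean(np.abs(perm_diffs) >= np.abs(observed))

print(f"Welch:       p = {p_welch:.4f}")
print(f"permutation: p = {p_perm:.4f}")
```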

Two weeks later, while reading a research paper written by actual researchers, I noticed that they had committed the *exact* same mistake I almost did – but an order of magnitude worse. They performed a two-sample t-test between two groups in which one group had only around 50 observations and the other had well over a thousand. Variances between groups were evidently *not* equal when I looked at the data myself (they made it public, thankfully). What this means is that the type I error rate can be vastly inflated or deflated; their statistically significant result effectively cannot be trusted. I remember being somewhat perplexed by this. The textbook I was learning out of recommended a procedure that doesn’t actually work (F-test before t-test), and even professional researchers seemed to make the same kinds of mistakes.

Around this time, I started to wonder why we perform hypothesis tests at all if we can never fully meet their assumptions. We could almost just do something more primitive: plot the data. If you have two groups of samples, and you plot the histograms of the groups, and there is a severe shift in location based on just the plot, then of course a t-test will reject the null. Of course pretty much every other test for center of location will reject as well. “But how do you know what you’re observing is actually significant?” If you overlay two histograms and they’re almost disjoint, then it’s significant. I wish I could formalize this. If you’re testing for a difference in means, then what you’re really doing is testing a difference in center of location, and if the center of location is so obviously different to the naked eye, your test will reject. However, good luck getting a couple of plots into an academic journal without an accompanying statistical test.
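In the spirit of that more primitive approach, here is a sketch (on simulated data with a location shift you can see from across the room) of the overlay-the-histograms routine. When the two piles barely touch, every reasonable location test agrees with your eyes.

```python
# Overlay the two histograms first; if they barely overlap, the formal tests
# are a formality. The data and the size of the shift are simulated.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
group_a = rng.normal(0.0, 1.0, size=200)
group_b = rng.normal(4.0, 1.0, size=200)  # an obvious location shift

plt.hist(group_a, bins=30, alpha=0.5, label="group A")
plt.hist(group_b, bins=30, alpha=0.5, label="group B")
plt.legend()
plt.show()

print("t-test p:  ", stats.ttest_ind(group_a, group_b).pvalue)
print("rank-sum p:", stats.mannwhitneyu(group_a, group_b,
                                        alternative="two-sided").pvalue)
```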

I can’t help but wonder: what is the point of applying hypothesis tests, carefully checking assumptions that never hold anyway, and then reporting a resultant p-value that you expected the moment you plotted the data? If performing an F-ratio test prior to a t-test affects the nominal p-value of the t-test itself, then what does it mean when you plot some data, look at it, and then decide on a hypothesis test?

Enter the garden of forking paths.

III.

From the abstract of the aforementioned paper: “Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.”

See also this.

Also from the paper: “Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y; \phi(y)), where the function \phi(\cdot) is observed in the observed case.” Here T refers to a statistical test, and \phi(\cdot) is just some function of the data – e.g., the choice of which variables to control for in a study.

Gelman more or less argues that whenever your choice of hypothesis test is contingent on the data, you are going to get p-values that are overconfident. Overconfident, because you would never intentionally choose to analyze something you expect to show no difference. (This is a problem with null hypothesis testing in general: we technically assume the null in order to apply the test, but we never actually believe in that null; we chose the analysis because we expect a rejection.) This *can* include plotting the data, looking at it, then deciding to test a hypothesis based on what you see.

For example, say group A is a control group and group B takes some drug, and we want to measure the effectiveness of that drug on some affliction – specifically, how long someone is sick with or without the drug. Even if the hypothesis you care about is just the number of sick days in each group, there are many, many ways that hypothesis can be tested. Even if a researcher works very hard not to dredge their data, merely plotting the distribution of sick days between groups can lead to overconfident results if the choice of test depends on that plotting. Say you plot the distributions and see that the centers of location are about the same, but that one group has fatter tails. Deciding to test the variance after the fact will naturally tend to give a p-value that rejects, and an overconfident result. Likewise, deciding to test the center of location *after* seeing that the centers of location do not appear to be the same will produce the same overconfidence. As Gelman himself says, the researchers in his examples were trying to do a good job (they weren’t intentionally trying to lie to us), but their data analysis procedures were not well defined a priori – before seeing the data. The analysis ends up depending on the data, and the p-values lose their interpretability.

The more correct procedure, then, is to gather the data with a specific hypothesis in mind and a specific testing procedure: “We suspect that the drug will reduce, on average, the number of days a person is sick. To investigate this, we will perform a test for the center of location of the data.” That means you gather your data and you test only that. If you also report a change in variance because the plot suggested it, or report that people feel more optimistic taking the drug because, after the fact, something seemed to be going on there, then you are in the garden of forking paths. The reason for the ominous-sounding name is that at any moment a researcher can apply one of various tests; their paths are “forked”. If you navigate this maze by choosing tests that “make sense given the data” (read: are dependent on the data), then your results are probably biased.
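To see what that forking costs, here is a simulation sketch of exactly the drug-trial fork described above. Both groups are drawn from the same distribution, so every rejection is a false positive; the “analyst” looks at the data and then runs whichever of the two candidate tests looks more promising (proxied here, crudely, by taking the smaller of the two p-values).

```python
# The garden of forking paths, simulated: the null is true in every run, but
# the test is chosen after looking at the data (here: whichever of "compare
# means" or "compare variances" looks more promising).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sims, alpha = 20_000, 0.05

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, size=50)
    b = rng.normal(0, 1, size=50)           # identical distributions: null true
    p_means = stats.ttest_ind(a, b).pvalue  # "the centers look different"
    p_vars = stats.levene(a, b).pvalue      # "one group has fatter tails"
    false_positives += min(p_means, p_vars) < alpha  # the data-driven fork

print(f"type I error with a data-contingent test: {false_positives / n_sims:.3f}")
print(f"nominal level: {alpha}")
```

Each test on its own holds its 5% false-positive rate; letting the data pick between them roughly doubles it.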

See the aforementioned paper by Gelman for more examples of this actually happening. My favorite example is the claim that conservative-leaning men are physically stronger than left-leaning men, from an analysis by Michael Bang Petersen and colleagues. Essentially, the researchers wanted to look at the relationship between political stance and physical strength in men, but did not actually measure strength – they measured arm circumference. And they did not use men in general; they used male college students, so we are off to a bad start already. Gelman claims that arm circumference is actually a proxy for age, and so when age is included in the analysis, the association between “strength” and socioeconomic status evaporates. Not only that: since no main effect was found, the authors reported interactions, and those interactions did not take this age-as-a-proxy-for-arms quibble into account. The result is that we get to read a statistically significant p-value, but it’s meaningless. While it may yet be true that political leaning has some association with strength, this study doesn’t actually show what it purports to.

Okay, fine. You’re a smart researcher. You’re critical, and well-trained, and rigorous. You know that a p-value is not the probability that the null is true. You can figure out how to do research without accidentally dredging or wandering into the garden of forking paths. But can you do this? From one of Gelman’s posts:

“In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?”

Answer: I have no idea. Allegedly *none* of the statistics PhD students in Gelman’s class got this right either (this is at Columbia, by the way). In fact, they couldn’t even start it. If statistics PhD students, who have extensive training in statistical theory and methodology, couldn’t get this question right, then what hope do social science researchers have? “Psychology is a rigorous science! We use Bayesian statistics!” I am unmoved.
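The closest I can get is brute force. Under my reading of the setup – work in units of the main effect’s standard error, so that 80% power at the 5% level puts the true main effect at about 2.8 of those units; the interaction is half that size, and its standard error is twice as large, since each sex-specific effect is estimated from half the sample – a simulation at least produces numbers:

```python
# A brute-force attempt at Gelman's question, under the normal approximation.
# Units: the main effect's standard error. 80% power at the 5% level implies a
# true main effect of about 2.8 units; the interaction is half that (1.4) and
# is estimated with twice the standard error (each subgroup uses half the data).
import numpy as np

rng = np.random.default_rng(6)
n_studies = 100_000

true_interaction = 1.4   # half of 2.8
interaction_se = 2.0     # twice the main effect's standard error

estimates = rng.normal(true_interaction, interaction_se, size=n_studies)
significant = np.abs(estimates / interaction_se) > 1.96

print(f"power for the interaction:          {significant.mean():.3f}")
print(f"significant out of 1000 studies:    {1000 * significant.mean():.0f}")
print(f"mean |estimate| / truth, if signif: "
      f"{np.mean(np.abs(estimates[significant])) / true_interaction:.2f}")
```

If I have set it up correctly, the power is dismal, and the estimates that do come out significant exaggerate the true effect several-fold, which is Gelman’s point about Type M errors.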

I want to prevent this post from getting too long, so here is the main takeaway: hypothesis testing requires a metric ton of assumptions about your data. It requires very strict and careful methodology, and it is often extremely difficult to get right. There is a mistaken view among people (including myself when I was younger) that statistical tests are like a calculator that spits out a yes-no or significant-insignificant answer in response to whatever data you feed it. This isn’t true. Statistics isn’t computer science, unfortunately (ba-dum-tss), and it is difficult to get right.

While there are some things that can be done to prevent dredging your data all over the place, they aren’t ideal. Splitting your data into exploratory and confirmatory sets will always hurt your statistical power, and in many situations, power isn’t that high to begin with. While there are problems with arbitrarily high power in a test, the opposite problem is far more common in actual real-world research. I am also skeptical of simply applying a correction to p-values. Anyone who has done data analysis has probably seen p-values so small that the software just reports them as a flat 0.000. What correction can you apply to that? “Isn’t an arbitrarily small p-value actually more evidence of a statistically significant result, dredging or no?” Maybe. It’s true that extremely small p-values are “more convincing”, but in the face of dredging, I am not so sure. If you gave me a data set, any data set, I am sure that I could dig up an arbitrarily small p-value given enough time and beer. P-values can never actually be zero, since there is always a positive probability that your observations were due to chance. And a p-value is, technically, a random variable. That means it has a distribution, and even your very small, very pleasing p-value that got you published could be entirely due to chance.
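That last point is easy to see by simulation: for a test whose assumptions hold, the p-value is uniformly distributed on (0, 1) under the null, so small values turn up by chance alone.

```python
# Under the null, p-values are uniform on (0, 1): every rejection below is a
# false alarm, and some of the p-values are still impressively small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])  # both groups always come from the same distribution

print(f"fraction of p-values below 0.05: {np.mean(p_values < 0.05):.3f}")
print(f"smallest p-value observed:       {p_values.min():.5f}")
```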

 

IV.

“If there are so many problems with p-values, what do we do instead?” This is an answer I can’t readily give, because it remains a topic of hot debate within the statistical community. One suggestion is simply lowering the “standard” threshold for significance to \alpha = 0.005, but not everyone finds this a useful substitution. Bayesians, of course, advocate for Bayesian statistics instead, though that approach is not without its own problems. As researchers, there is the more immediate problem of our employers and PIs expecting p-values. Old habits die hard, and in order to get published, you have to reject the hell out of some hypotheses. It is not good enough to say “We thought maybe these two things were associated, but it turns out they are not. P > 0.05.” You won’t get published over such a claim, even though it is fallacious to think that the non-rejection of a null gives no information. Add in the fact that scientific research is built on “discovery”, and you have a recipe for accidental data dredging in pursuit of significant results.

However, here are a couple of ideas for non-statisticians who like to read studies. If a study hasn’t been replicated many times, just treat it as exploratory. The fact of the matter is, most studies fall into this realm: the hypotheses weren’t well defined in advance; someone just had some data and wanted to see if it said anything interesting. Results like these shouldn’t be trusted beyond piquing your interest that an association might be there. As I was explaining earlier, exploratory studies are designed to generate ideas for more rigorous testing later, and their results need replication before they can be trusted. We can be reasonably sure, for example, that smoking causes cancer, since that has been studied again and again, but any time you read an article titled “Why science says you shouldn’t…”, just ignore it – especially if it’s a new study being posted on Huffingtonpost or Buzzfeed or some other clickbait site.

The more I learn about statistics, the more I feel that someone like Nassim Nicholas Taleb has the right idea: heuristics that have been handed down for generations are useful, because their existence and propagation were empirical. Maybe mom really does know best. I’m not saying to abandon statistics and reading studies as an avenue for self-learning, but studies in themselves are pretty useless to a layperson unless you have domain knowledge in either the field of interest or in the methodology, because you have to be able to sift through the garbage. I hesitate to make this a necessary and sufficient condition, but maybe it can be a heuristic itself: if you don’t know the field, then you shouldn’t be relying on studies in the field. The first step to acquiring domain knowledge involves textbooks, not studies. If you want to know how the brain works – buy a human physiology textbook! Don’t read stuff on PubMed until you know how to differentiate the bullshit from the novel. By extension, be careful of blogs, including this one. There is usually only one incomplete and flawed person behind the screen, and errors abound.

For the statisticians and researchers: I am fascinated by the idea of statistics without hypothesis testing. Visualization is a simple and extremely useful tool for understanding what is going on with your data. Parameter estimation is also useful, and has a robust grounding in theory. While confidence intervals suffer many of the same problems as hypothesis tests, they are, I think, very useful for describing what’s going on. Classical confidence intervals have some problems, but using the bootstrap to obtain confidence intervals is a pretty good procedure: it’s entirely non-parametric, the coverage will be pretty good, and if the resulting intervals are disjoint, you can still claim some kind of hypothesis test for your employer/PI. A good quote from a random commentator on Gelman’s blog: “Just say something reasonable and useful about your data.” Understandably, researchers already think they are doing this, but for every statistically significant result, what is the significance to the reader? Can we go from p < 0.05 to a decision? I think maybe not – but with causal modeling, probably yes.
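For what it’s worth, a percentile-bootstrap interval is only a few lines. This sketch (on made-up, skewed data) builds one for a difference in means; if the interval excludes zero, you still have something test-like to report.

```python
# Percentile-bootstrap confidence interval for a difference in means.
# No normality assumptions; the exponential "data" is simulated for the sketch.
import numpy as np

rng = np.random.default_rng(8)
group_a = rng.exponential(scale=1.0, size=80)
group_b = rng.exponential(scale=1.5, size=120)

n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    a_star = rng.choice(group_a, size=len(group_a), replace=True)
    b_star = rng.choice(group_b, size=len(group_b), replace=True)
    boot_diffs[i] = b_star.mean() - a_star.mean()

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the difference in means: ({lo:.3f}, {hi:.3f})")
```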

The idea of simply “modeling the problem directly” is attractive. While I don’t know much about this methodology other than that it involves “multi-level models” (a kind of fancy linear regression), my (limited) statistical training is about as classical as you can get, and so the idea is exciting to me. It is very interesting that computer/data scientists focus almost entirely on model building and are able to achieve extremely accurate predictions while violating almost every assumption a classically trained statistician relies on. Even statisticians use predictive accuracy as an assessment of model fit, so what does it mean that computer scientists are so dominant at this? Yeah, the models are biased, but they have ROC AUC scores above 0.90 on the hold-out set. “Machine learning is just a black box!” Maybe, but there is work being done in that direction. See also this. And causality aside, prediction is the other half of the reason people want to hire statisticians. “But how do we actually learn from the data if we are just focusing on out-of-sample predictions?” I’m not sure. But accurate out-of-sample predictions are very, very useful, and marketable. This seems like a good thing to learn how to do well.
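For anyone who hasn’t seen that workflow, here is its minimal form: synthetic data, a logistic regression, and an AUC computed only on observations the model never saw during fitting. Nothing here is specific to any particular study; it’s just the shape of the routine.

```python
# The basic out-of-sample evaluation loop: fit on a training split, score ROC
# AUC on a held-out split. The classification data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"hold-out ROC AUC: {auc:.3f}")
```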
