Useless P-Values: The Twitter Bot
In this post, we will discuss the ideas and code behind a Twitter bot (called “Useless P-Values”) that I built recently. The bot simulates 2-condition experiments (with parameters similar to real-world scenarios in the social sciences) and posts a p-value of the between-group comparison to Twitter. Every 8 hours there’s a new simulation, comparison, and p-value.
Background
There’s been lots of thinking in many areas of science (but particularly in the social sciences) about the lack of replicability of well-publicized research findings. See, for example, this widely publicized article which aims to estimate the reproducibility of psychological research.
There are many possible reasons that experimental social science findings tend not to replicate. This subject is not the focus of this blog post, but for an easily readable introduction I recommend this excellent blog post by Andrew Gelman or this Wikipedia page on the replication crisis.
The Consumption of P-Values
In this post, we’ll focus on the consumption of quantitative research. Both scientific and non-scientific audiences are confronted with a nearly impossible challenge when making sense of significance tests in published articles (or anywhere else one might happen to come upon them). Typically, published research articles focus on one or more hypothesis tests that compare an estimated parameter (e.g. the difference between group means, the size of a regression coefficient) against zero (the default null hypothesis).
Conventionally, a p-value of less than .05 is taken as evidence against the null hypothesis (i.e. as evidence that the true difference between groups is not zero), and furthermore (despite the break in logic) as support for the researcher’s alternative hypothesis.
Lots of error can creep into the computation of p-values in published research, from p-hacking (conducting many different statistical tests and only reporting those that yield statistically significant results) to straight-up data fabrication. However, even when the tests are conducted and reported properly, they are often hard to interpret.
The Useless P-Values Twitter Bot is a meta-commentary on three important issues surrounding the typical use of p-values in published research:
 Statistical power (the probability that a statistical test will yield a significant result, given the effect size, the sample size, and the significance level) is often low. Therefore, “real” differences that are small in magnitude are unlikely to be statistically significant, using conventional significance levels (e.g. alpha = .05) and sample sizes common in experimental social science research.
 This is the situation in our simulations, in which we observe 50 observations for each of our two groups (for a total sample size of 100). With 50 observations per group, we only have 80% power to detect standardized effect sizes of about .56 or greater. Because the effect sizes in our simulations never exceed .5, every simulated test has less than 80% power.
 We never know the **true** effect size (but we implicitly assume that it is large). For any given study in the literature, it is rare that anyone knows a priori what effect size to expect. This is normal, in a way. Scientific research often takes place at the bleeding edge, with the explicit goal of extending knowledge into unknown territory. Indeed, if we already knew the answer before we started, we probably wouldn’t conduct the study (especially because rewards in the academy go to novel research that publishes surprising and counterintuitive findings). However, by using small sample sizes, researchers make an implicit assumption about the effect sizes in their studies. This unstated assumption is that effect sizes are rather large, and therefore likely to yield statistically significant effects, given the sample size, significance level, study design and subject matter.
 This is the situation in our simulations. Each significance test starts with the selection of an effect size for the difference between the two groups. This effect size can take on any value between 0 and .5 standard deviations (a reasonable assumption, in my opinion, given the relatively small effects that are yielded by well-powered studies in the social sciences). However, the true effect size that serves as the basis for the simulations remains hidden from us as observers.
 We focus on the statistical significance of a null hypothesis significance test. The crux of most empirical arguments (especially in the experimental literature) is the significance test and subsequent reporting of p-values. This is despite the fact that effect sizes (along with domain knowledge and the potential gain or loss implied by applying the results to a real-world situation) are a better metric for judging the meaning of an observed effect.^{1}
 This situation is reflected in our simulations. The Twitter bot only posts the p-value and whether or not it is statistically significant, mirroring the most common way that statistical analyses are consumed by many readers.
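To make the power figures above concrete, here is a minimal sketch of a power calculation for a two-sided, two-sample t-test. It is not part of the bot's code. The helper name two_sample_power is hypothetical, and the calculation assumes equal variances and equal group sizes (the classic Student's t setup), using the noncentral t distribution from SciPy.

```python
import numpy as np
from scipy import stats


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-variance two-sample t-test (hypothetical helper)."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter for two equal-sized groups with unit variance.
    ncp = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |t| exceeds the critical value when the effect is real.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


power_at_056 = two_sample_power(0.56, 50)
power_at_025 = two_sample_power(0.25, 50)
print(f"Power at d = 0.56: {power_at_056:.2f}")
print(f"Power at d = 0.25: {power_at_025:.2f}")
```

Comparing the two printed values shows how quickly power drops as the effect size moves from .56 toward the middle of the 0 to .5 range used in the simulations.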
A Deep Dive into the Code
We will now go through the code, explaining how it all works.
(Many thanks to Felix Schönbrodt and Stella Bollmann, whose presentation was very helpful in writing the code below. Their simulation examples were written in R, but I used Python because it was easier to set up the Twitter bot using GCP.)
Step 1: Load the Necessary Libraries
For this simulation, we will use the NumPy library to simulate the data and the SciPy stats library to conduct the t-tests. Let’s first load the libraries.
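A minimal sketch of the imports for this step might look like the following. The aliases np and stats are conventional choices assumed here, not necessarily the ones used in the bot.

```python
# NumPy for simulating the data, SciPy's stats module for the t-test.
import numpy as np
from scipy import stats
```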
Step 2: Set the Sample Size and the Effect Size
In our simulation, we compare two groups (which we can think of as experimental conditions), each with 50 observations. The number of observations per group stays constant across simulations and is set in the sample_size variable in the code below:
We then randomly select an effect size for the difference between the two groups. The effect size is taken from the uniform distribution, with a lower bound of zero and an upper bound of .5. In other words, the effect size for each simulation is equally likely to be any value in between 0 and .5. This seems like a fairly reasonable assumption for an experiment of an unknown phenomenon in the social sciences. We store the randomly generated effect size in a variable called effect_size in the code below:
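As a sketch of this step, the two variables could be set as follows. The variable names sample_size and effect_size come from the text above; the use of np.random.uniform is an assumption about how the draw is made.

```python
import numpy as np

# Number of observations in each of the two groups.
sample_size = 50

# Draw the "true" effect size uniformly between 0 and .5 standard deviations.
# (np.random.uniform samples from the half-open interval [low, high).)
effect_size = np.random.uniform(low=0.0, high=0.5)
```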
Step 3: Generate 50 Observations for Each Group, Based on the Effect Size
We next generate 50 observations for each of the two groups; we draw these observations from normal distributions. The first group’s mean (loc or location in the code below) is zero, with a standard deviation of 1 (scale in the code below). The second group’s mean is set to the effect size chosen above; this distribution also has a standard deviation of 1. Because both of the groups have standard deviations of 1, the effect size (i.e. the difference between the “true” group means) is expressed in standard deviations.
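The data generation can be sketched as below. The loc and scale arguments match the description above; the names group_1 and group_2 are placeholders chosen for illustration, and the first lines repeat the setup from Step 2 so the snippet runs on its own.

```python
import numpy as np

# Setup from Step 2, repeated so this snippet is self-contained.
sample_size = 50
effect_size = np.random.uniform(low=0.0, high=0.5)

# Group 1: mean 0, standard deviation 1.
group_1 = np.random.normal(loc=0.0, scale=1.0, size=sample_size)

# Group 2: mean shifted by the (hidden) effect size, standard deviation 1.
group_2 = np.random.normal(loc=effect_size, scale=1.0, size=sample_size)
```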
Step 4: P-Values (Testing the Difference Between the Two Groups)
Finally, we conduct an independent-groups t-test to compare the two samples’ means. (We use Welch’s t-test, which doesn’t assume equal variances between the two groups, a sensible default in most real-world situations.) Our null hypothesis is that the difference between the two groups’ population means is zero. Rather than reporting the observed means, standard deviations, or the estimated effect size, we simply return the p-value and report whether it is less than .05.
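A sketch of the test itself is shown below. In SciPy, Welch’s t-test is obtained by passing equal_var=False to stats.ttest_ind. The variable names and the printed message are illustrative assumptions rather than the bot’s actual output, and the data generation is repeated so the snippet runs on its own.

```python
import numpy as np
from scipy import stats

# Setup from the earlier steps, repeated so this snippet is self-contained.
sample_size = 50
effect_size = np.random.uniform(low=0.0, high=0.5)
group_1 = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
group_2 = np.random.normal(loc=effect_size, scale=1.0, size=sample_size)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(group_1, group_2, equal_var=False)

# Report only the p-value and whether it clears the conventional .05 threshold.
is_significant = p_value < 0.05
print(f"p = {p_value:.3f}; significant at .05: {is_significant}")
```

Note that effect_size is never printed: as in the bot, the true effect stays hidden from the reader.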
Summary and Conclusion
The Useless P-Values Twitter bot simulates what it’s like to consume social science research. Sample sizes are low, effect sizes are at best small-to-medium (but also sometimes zero!), and we are asked to interpret p-values from an underpowered null hypothesis significance test. Under these conditions, it becomes extremely difficult to know what to do with conclusions based on such p-values.
There’s no nice, neat conclusion here, unfortunately. Let’s end with an amusing statistics Futurama meme:
Coming Up Next
In the next post, we’ll move away from statistical inference and towards data visualization. Specifically, we’ll learn how to make word clouds that are fit for both data scientists and senior management.
Stay tuned!

1. It must be acknowledged, however, that increasing importance is being placed on the reporting of effect sizes, and that these are increasingly common in published articles. On a more negative note, sample sizes of the kind often found in the literature are too small to yield precise estimates of effect size.