Pase et al. (2017) recently published an article in the journal *Stroke* entitled “Sugar- and Artificially Sweetened Beverages and the Risks of Incident Stroke and Dementia”. This publication was quickly picked up by the mainstream media, and it was off to the races from there. News article titles included:

*Daily dose of diet soda tied to triple risk of deadly stroke*

*Diet sodas may be tied to stroke, dementia risk*

*Diet sodas tied to dementia and stroke*

*Diet Sodas May raise risk of dementia and stroke, study finds*

*Diet soda can increase risk of dementia and stroke, study finds*

With headlines like these at the fingertips of millions of people, it is important to retain a healthy dose of scientific skepticism, especially with suggestions like these:

“if you’re partial to a can of Pepsi Max at lunch, or enjoy a splash of Coke Zero with your favorite rum – you might want to put that drink back on ice”

The driving point from this article was that drinking artificially sweetened drinks was associated with roughly a 3x increase in the risk of incident stroke and dementia. Now, to be fair, while some news outlets may have overstated some of the results from Pase et al., others actually included fair points regarding the article, including that this connection was found only for artificially sweetened beverages, and that no link was found for sugar-sweetened beverages (i.e. sugar-sweetened sodas, fruit juice). Some news articles rightly made the classic “correlation does not equal causation” point. The lead author of the paper even pointed out in an interview that “*In our study, 3% of the people had a new stroke and 5% developed dementia, so we’re still talking about a small number of people developing either stroke or dementia*”. With 2,888 participants analyzed for incident stroke, and 1,484 observed for new-onset dementia, that translates into roughly 87 people who suffered a stroke and roughly 74 people who developed new-onset dementia.
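Those last two figures are easy to sanity-check by applying the percentages quoted in the interview to the analytic sample sizes reported in the paper:

```r
# Applying the quoted percentages to the analytic sample sizes
round(0.03 * 2888)  # roughly 87 incident strokes
round(0.05 * 1484)  # roughly 74 new-onset dementia cases
```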

Pase et al. (2017) did adjust for age, sex, caloric intake, education, diet, exercise, and smoking, among other things. Interestingly, however, they did not control for multiple comparisons, which is the bigger point I would like to raise in this post. Whenever a researcher runs multiple tests on the same dataset (for instance, when a treatment has 3 or more levels), runs extra analyses on a subset of the data, or runs extra tests on variables that weren’t previously considered, the risk of Type I error increases. A Type I error, a.k.a. a “false positive”, occurs when you find an effect in your sample when no effect actually exists in the population. Running multiple tests yields more chances for an effect to be found, thus increasing the risk of a Type I error. An easy solution is to divide the alpha criterion (usually .05) by the number of tests being run (i.e. the Bonferroni correction, a very popular Type I error correction because it is easy to calculate manually). For instance, if you are running 10 *t*-tests on the same dataset, you could adjust alpha to .005 (.05/10). Controlling for multiple comparisons this way, an effect would be deemed significant only if the *p*-value fell below .005, not the typical .05.
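In *R*, this correction can be applied either by shrinking alpha by hand or, equivalently, by inflating the *p*-values with the built-in `p.adjust()` function. A minimal sketch, using ten made-up *p*-values:

```r
# Ten hypothetical p-values from ten tests on the same dataset
p <- c(.001, .004, .011, .020, .030, .041, .049, .060, .220, .450)

# Option 1: divide alpha by the number of tests
alpha_bonf <- .05 / length(p)  # .005
sum(p < .05)         # 7 tests "significant" at the usual criterion
sum(p < alpha_bonf)  # only 2 survive the corrected criterion

# Option 2 (equivalent): multiply the p-values instead, via p.adjust()
p_adj <- p.adjust(p, method = "bonferroni")  # p * 10, capped at 1
sum(p_adj < .05)     # the same 2
```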

How much does this actually matter? Would adjusting for multiple comparisons yield any meaningful changes in statistical interpretation? To investigate this, I simulated 100 datasets, each with a sample size of 100, assuming a medium effect size. Data were generated as Likert-type data ranging from 1 to 7, with one five-level factor (means of 2.0, 2.3, 2.6, 2.9, 3.2). *Post-hoc* *t*-tests were then run for all pairwise comparisons, both with no *p*-value adjustment and with a Bonferroni correction. The number of significant *p*-values was then logged in both cases for each dataset. The *R* code used for this simulation can be found at the end of this post.

Simulation results revealed, probably not to anyone’s surprise, that yes, there is a difference in the number of significant *p*-values found depending on whether one controls for multiple comparisons. Out of 10 possible comparisons, post-hoc *t*-tests revealed, on average, more significant *p*-values when you don’t control for multiple comparisons (*M* = 5.60, *SD* = 1.33) than when you do (*M* = 3.64, *SD* = 1.38), *t*(99) = 17.25, *p* < .001, *d*~avg~ = 1.45, 95% CI [1.16, 1.73].

Turning back to Pase et al. (2017), what effect would this have had on their conclusions, considering they did not control for multiple comparisons? Below is a snapshot of beverage intake and the risk of stroke from Pase et al.

Apologies if the table is a bit blurry (it might be better to look up the article directly), but the top two-thirds of this table show that neither total sugary beverages nor sugar-sweetened soft drinks had any significant effect on the risk of stroke, as the *p*-values are above what we assume is their alpha criterion of .05. The bottom third of the table shows artificially sweetened soft drinks. Looking at just the results from model 3, which controlled for the most potentially confounding factors, we see that 6 out of the 8 *p*-values are significant. However, applying the simple adjustment of .05/8 tests, our new alpha criterion is .00625. Using this criterion, none of the reported *p*-values would fall below the significance threshold.

Considering dementia, total sugary beverages and sugar-sweetened soft drinks both remained non-significant. For artificially sweetened drink intake under Model 3 (the authors’ most conservative model), none of the *p*-values were significant even before controlling for multiple comparisons. If we argue that the regression models already account for the other predictors (i.e. the *df* lost to including many covariates), we could reduce the number of comparisons to 4 (recent intake/cumulative intake by stroke type), making the corrected alpha .05/4 = .0125. Given that the table reports only two decimals, it is difficult to tell whether the .01 values would still be considered significant. By not employing this simple adjustment, the authors increased their risk of Type I error, and once the adjustment is applied, the conclusions of the paper change drastically, from finding significant effects to finding none. By making sure we control for multiple comparisons, we can avoid problems such as false positives and make for better, more robust science.

Given the large sample size and the large number of models employed here (24 regressions!), we must be careful in our interpretation – especially given that we are predicting very rare outcomes, which is always difficult. The use of an effect size in the table is encouraging, especially with confidence intervals. These confidence intervals tell an even more revealing story – many of them include, or come very close to, a hazard ratio of 1 (i.e. no difference in risk between those who drink these beverages and those who don’t), and they are quite wide, indicating we don’t yet have a good picture of the relationship between these drinks and stroke.
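To see why an interval that includes 1 matters, here is a quick sketch with a hypothetical hazard ratio and standard error (illustrative values, not taken from the paper). The 95% CI for a hazard ratio is computed on the log scale and then exponentiated back, and an interval that crosses 1 means the data are still compatible with no group difference at all:

```r
# Hypothetical hazard ratio and standard error of log(HR) -- NOT from the paper
hr <- 2.0
se_loghr <- 0.38

# 95% CI computed on the log scale, then exponentiated back
ci <- exp(log(hr) + c(-1, 1) * qnorm(.975) * se_loghr)
round(ci, 2)  # a wide interval, roughly [0.95, 4.21]

# The interval crosses 1, so a hazard ratio of 1 (no difference in risk
# between groups) is still compatible with the data
ci[1] < 1 & ci[2] > 1
```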

R Syntax

```r
# Set up ----
library(mvtnorm)     # rmvnorm() for simulating multivariate normal data
library(reshape)     # melt() for wide-to-long reshaping
library(data.table)
library(devtools)
# devtools::install_github("doomlab/MOTE")
library(MOTE)        # effect size for dependent samples
library(ggplot2)     # note: mean_cl_normal below also requires Hmisc

options(scipen = 999)

Means = c(2.0, 2.3, 2.6, 2.9, 3.2)  # one factor, five levels
N = 100                             # sample size (and number of datasets)
round = 0

sig_data = data.table(no = 1:N,
                      yes = 1:N)

for (i in 1:N) { # start loop

  # Create data: five uncorrelated variables, each with variance 3
  sigma = matrix(c(3, 0, 0, 0, 0,
                   0, 3, 0, 0, 0,
                   0, 0, 3, 0, 0,
                   0, 0, 0, 3, 0,
                   0, 0, 0, 0, 3), nrow = 5, ncol = 5)
  dataset = as.data.table(rmvnorm(N, Means, sigma))

  # Round to whole numbers and clip to the 1-7 Likert range
  dataset = round(dataset, digits = 0)
  dataset[dataset < 1] = 1
  dataset[dataset > 7] = 7
  dataset$partno = as.factor(1:nrow(dataset))
  longdataset = melt(dataset,
                     id = "partno")

  # Pairwise comparisons
  round = round + 1

  # All 10 pairwise t-tests with no p-value adjustment
  noadjust = pairwise.t.test(longdataset$value,
                             longdataset$variable,
                             paired = T,
                             p.adjust.method = "none")
  x = unname(c(noadjust$p.value[c(1:4)],
               noadjust$p.value[c(2:4), 2],
               noadjust$p.value[c(3:4), 3],
               noadjust$p.value[4, 4]))
  x = x < .05
  sig_data$no[round] = sum(x == TRUE)

  # The same comparisons with a Bonferroni correction
  yesadjust = pairwise.t.test(longdataset$value,
                              longdataset$variable,
                              paired = T,
                              p.adjust.method = "bonferroni")
  y = unname(c(yesadjust$p.value[c(1:4)],
               yesadjust$p.value[c(2:4), 2],
               yesadjust$p.value[c(3:4), 3],
               yesadjust$p.value[4, 4]))
  y = y < .05
  sig_data$yes[round] = sum(y == TRUE)

} # end loop

## Differences in number of significant p-values?
t.test(sig_data$no, sig_data$yes, paired = TRUE)

m1 = mean(sig_data$no)
sd1 = sd(sig_data$no)
n = length(sig_data$no)
m2 = mean(sig_data$yes)
sd2 = sd(sig_data$yes)
d.dept.avg(m1 = m1, m2 = m2, sd1 = sd1, sd2 = sd2, n = n, a = .05, k = 2)

# Plot mean number of significant p-values, with and without correction
sig_data$partno = 1:nrow(sig_data)
noout = melt(sig_data, id = "partno")

theme = theme(panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank(),
              axis.line.x = element_line(colour = "black"),
              axis.line.y = element_line(colour = "black"),
              legend.key = element_rect(fill = "white"),
              text = element_text(size = 15))

ggplot(noout, aes(variable, value)) +
  stat_summary(fun.y = mean,
               geom = "point", size = 2, fill = "gray", color = "gray") +
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar", position = position_dodge(width = 0.90),
               width = 0.2) +
  theme + xlab("Controlling for Multiple Comparisons") +
  ylab("Number of Significant p-values") +
  scale_x_discrete(labels = c("None", "Bonferroni"))
```