Page by Sophie E. Hill | @sophieehill.bsky.social


Sophie E. Hill | September 5, 2025 at 12:45 PM EDT

"Sweeteners can harm cognitive health equivalent to 1.6 years of ageing, study finds" Or does it? Let's take a look at this "study"... theguardian.com


Disclaimer: I'm a social scientist not a health researcher. I'm going to comment on the research design, statistical analysis, and interpretation. Here's the study - just published in Neurology. The authors claim that consumption of LNCS (low- and no-calorie sweeteners) "was associated with an accelerated rate of cognitive decline during 8 years of follow-up". www-neurology-org.ezp-prod1.hul.harvard.edu

Their data comes from the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil), a study of ~15,000 current and retired civil servants aged 35-75 at inception, with 3 study waves (2008–10, 2012–14, and 2017–19).

🚩 Red Flag #1: Correlation/Causation 🚩

This paper finds an *association* between two variables. But you know the saying: association is not causation! Yet the authors continually slip into causal language ("harm" "effects"), qualified with a few weasel words ("possibility", "suggest"):

This paper provides *ZERO* evidence of the negative effects of sweeteners, because that is a causal claim and there is no strategy for causal inference in this paper: no statement of assumptions required to give their results a causal interpretation, no consideration of alternative explanations.

🚩 Red Flag #2: No Theory 🚩

There is no theory in this paper. The authors just tell us they "hypothesized a priori that higher consumption of LNCSs would be associated with faster cognitive decline".

One reason why they can't give us an actual theory is that they're going to study a whole bunch of different sweeteners with different metabolic pathways:
- artificial sweeteners (aspartame, saccharin, acesulfame-K)
- sugar alcohols (sorbitol, xylitol, erythritol)
- a rare sugar (tagatose)

🚩 Red Flag #3: Multiple Comparisons 🚩

No theory, no pre-registration = nothing to tie the researchers' hands. The result is a LOT of different regressions, any one of which they could claim as evidence to support the broad conclusion that sweeteners are linked to cognitive decline.

Why is this bad? Because sometimes you find spurious associations by chance. The more regressions you run, the more likely this is. This is the problem of "multiple comparisons".

Multiple comparisons:
1) Two sets of controls (demographic + health)
2) Obesity, diabetes, and healthy diet as moderators (only with health controls)
3) 7 sweeteners in aggregate and separately
4) Stratified by age (<60 or >=60)

That's 2 x 2.5 x 8 x 2 = 80 variations!

That final modification - stratifying by age - is explicitly described as an ad hoc modification: i.e. the authors had no reason to think this association would vary by age, they just saw ("visual inspection") something in the data.
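To get a feel for why 80 specifications matter, here is a rough sketch of the family-wise error rate, i.e. the chance of at least one false positive at the 5% level. It assumes the tests are independent, which these regressions are not (they share the same data), so treat it as an illustration rather than an estimate for this paper. The function name and the Bonferroni threshold are my own additions, not anything the authors report.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests.

    Assumes independent tests, which is a simplification here.
    """
    return 1 - (1 - alpha) ** n_tests


# The count of specifications from the thread: 2 x 2.5 x 8 x 2
n_specs = int(2 * 2.5 * 8 * 2)

fwer = familywise_error_rate(n_specs)

# Bonferroni: the per-test threshold that keeps the family-wise rate at 5%
bonferroni_alpha = 0.05 / n_specs

print(f"{n_specs} specifications")
print(f"family-wise error rate (independent tests): {fwer:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.6f}")
```

Under that independence assumption, at least one spurious "significant" result is close to guaranteed unless the threshold is adjusted.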

This is textbook p-hacking! I'm not making any assertions about the intent of the authors. They state in the text that this was an ad hoc adjustment to the specification based on looking at the results. There's no other way to describe this.

Now I'm not saying that all their results can be explained away as false positives. Rather, the red flag here is that the authors do not even acknowledge this problem. No theory or pre-registration to restrict researcher degrees of freedom, no statistical adjustment for multiple comparisons.

🚩 Red Flag #4: Incorrect Confidence Intervals 🚩

Can you spot the problem here? "consumption of combined LNCSs in the highest tertile was associated with a faster decline in verbal fluency (second tertile: β = −0.016, 95% CI −0.040 to −0.008"

A standard confidence interval from a linear model should be centred around the estimate. The centre of this CI (-0.040, -0.008) is -0.024. But the reported estimate is -0.016.

Perhaps it's just a typo in the text? (0.016 is half the width of the CI.) Let's check the full table...

In eTable 6, we have the same estimate and confidence interval. And now we also get a p-value that is obviously inconsistent with the reported confidence interval! (A 95% CI that almost crosses 0 should have a p-value close to but just below 0.05.)

And I've found another one:
- 9th row: -0.008 (-0.032; 0.024), p=0.668
- CI centre is -0.004
- p=0.57 based on reported values
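These back-of-the-envelope checks can be sketched in a few lines. The sketch assumes a symmetric normal-approximation 95% CI (estimate ± 1.96 × SE), which is my assumption about how the intervals were built; the function name `check_ci` is hypothetical. It backs out the standard error from the CI width and recomputes a two-sided p-value from the reported estimate.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def check_ci(estimate, lower, upper):
    """Sense-check a reported estimate against its 95% CI.

    Assumes a symmetric normal-approximation interval.
    """
    centre = (lower + upper) / 2
    se = (upper - lower) / (2 * Z_95)          # SE implied by the CI width
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
    return {
        "centre": centre,
        "se": se,
        "p": p,
        "symmetric": math.isclose(centre, estimate, abs_tol=1e-9),
    }


# The two rows quoted in the thread
first = check_ci(-0.016, -0.040, -0.008)
second = check_ci(-0.008, -0.032, 0.024)

for label, result in [("first", first), ("second", second)]:
    print(label, {k: round(v, 4) if isinstance(v, float) else v
                  for k, v in result.items()})
```

For the second row the recomputed p-value lands close to the 0.57 quoted above and well away from the reported 0.668. A t-distribution with the study's degrees of freedom would shift it only slightly.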

This is what I found after about 5 minutes of basic sense checks and back-of-the-envelope calculations. I can only imagine a more systematic analysis of all tables would find more.

🚩 Red Flag #5: EV Measured Only at Baseline 🚩

The main explanatory variable - consumption of sweeteners - is measured only ONCE at baseline. So they are using sweetener consumption in 2010 to predict cognitive decline from 2010-2018. Can you see any problems with this?

Diet changes over time, often *in response to* health problems. This can create "reverse causality": in a cross-sectional analysis, you may find that obese people consume more sweeteners. Did the sweeteners make them obese? No. Being obese made them consume more sweeteners.

The authors address this point only once, in a single sentence: "In addition, diet was assessed only at baseline, which may not reflect longitudinal diet changes and may lead to an underestimation of the associations between LNCS and cognition"

It's true that random measurement error in diet will downwardly bias estimates of the association between diet and cognitive health ("regression dilution"). But what about *non-random* measurement error?

The authors do admit that "some groups may consume more LCNSs because of their lifestyle and clinical history", but at no point do they consider how this could affect their interpretation of the association between sweeteners and subsequent cognitive decline.

🚩 Red Flag #6: Missing Data for Under-55s in W2 🚩

Participants who were under 55 at wave 2 were NOT tested for cognitive function in that wave. So the cognitive health data looks something like this:

You might think that they would just restrict their sample to those who were 55+ at wave 2, right? Then they would have cognitive health data at W1, W2, and W3 for the whole sample. But that's not what they did! Actually, I'm really not sure what they did. They discuss using inverse probability weighting to account for attrition, but this obviously cannot help with this type of *deterministic* attrition.

I am guessing they just left these values as missing in the main analysis, though that makes interpreting the results much harder, since they are looking at the coefficient of [Tertile of sweetener consumption] * [Time], where "Time" is actually just the age of the participant in each wave. They also do a robustness check using "next observation carried backward" imputation to fill in the missing data. This part at least I can understand!

Think about what this imputation does. Consider our 54 year old Brazilian. We have his cognitive scores in Wave 1 and 3, but not Wave 2, say:
100 -> ? -> 90

The authors will impute the data like this:
100 -> 90 -> 90

Is that a "conservative" imputation strategy? Not really! They've created an artificially steep slope from W1 to W2 and an artificially flat slope from W2 to W3. Even a very simplistic linear interpolation would seem more reasonable to me:
100 -> 95 -> 90

This is not a minor detail. The participants who were under 55 at Wave 2 account for at least 45% of the sample (5,869 / 12,772). And recall that the study ONLY finds a relationship between sweeteners and cognitive decline among the under 60s.
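The 100 -> ? -> 90 example can be made concrete with a small sketch comparing the two fill-in rules and the wave-to-wave slopes they imply. The helper names are hypothetical, and the interpolation treats the three waves as equally spaced, which is a simplification (the real gaps between waves differ). This is not the authors' code.

```python
def nocb(scores):
    """Next observation carried backward: fill each gap with the next observed value."""
    filled = list(scores)
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]
    return filled


def linear_interp(scores):
    """Fill interior gaps on a straight line between the nearest observed neighbours.

    Assumes equally spaced waves and an observed value on both sides of each gap.
    """
    filled = list(scores)
    for i, value in enumerate(scores):
        if value is None:
            left = max(j for j in range(i) if scores[j] is not None)
            right = min(j for j in range(i + 1, len(scores)) if scores[j] is not None)
            weight = (i - left) / (right - left)
            filled[i] = scores[left] + weight * (scores[right] - scores[left])
    return filled


def slopes(scores):
    """Change in score between consecutive waves."""
    return [b - a for a, b in zip(scores, scores[1:])]


observed = [100, None, 90]  # the made-up scores from the thread

print("NOCB:  ", nocb(observed), "slopes", slopes(nocb(observed)))
print("Linear:", linear_interp(observed), "slopes", slopes(linear_interp(observed)))
```

NOCB puts the whole 10-point drop into the first interval and none into the second, while interpolation spreads it evenly.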

So why didn't they just restrict the sample to those participants who had cognitive health data measured in all 3 waves, i.e. a "complete case" analysis? Answer: because they didn't find any statistically significant associations there!

🚩 Red Flag #7: Changing Measures of Verbal Fluency 🚩

Verbal fluency is one component of cognitive health, but it was measured differently in W2 vs W1 & W3.
- Semantic fluency is measured by naming animals or vegetables
- Phonemic fluency is measured by naming words beginning with A or F

The categories were changed to attenuate "learning effects"... though I don't find it very plausible that naming animals in 2010 would help me with the same task in 2014 or 2018! Learning effects here (if they exist) are a small problem in comparison to the gigantic headache of measurement error.

Side note: it's probably not ideal that one measure of their outcome relates to knowledge of vegetables, which is plausibly linked to many things affecting both the explanatory variable and the outcome variable!

Naming animals is easier than naming vegetables. Naming words beginning with A is easier than naming words beginning with F. So to get a consistent measure of verbal fluency, we need to "grade on a curve".

The authors use equipercentile equating, a method where you equate scores across 2 tests based on percentiles. (e.g. if 50% of students got at least 20/100 on a hard test and at least 80/100 on an easy test, then we could "equate" the scores of 20 and 80)
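As a toy illustration of the idea, here is a bare-bones empirical version of equipercentile equating on made-up test scores. Real implementations typically smooth and interpolate the score distributions, and nothing here reflects how the authors actually did it. The function names and data are hypothetical.

```python
def percentile_rank(scores, x):
    """Share of scores at or below x."""
    return sum(s <= x for s in scores) / len(scores)


def equate(x, from_scores, to_scores):
    """Map score x on the 'from' test to the 'to' score with the same percentile rank."""
    p = percentile_rank(from_scores, x)
    ranked = sorted(to_scores)
    # Smallest 'to' score whose percentile rank reaches p
    for s in ranked:
        if percentile_rank(ranked, s) >= p:
            return s
    return ranked[-1]


# Made-up scores for the same kind of students on a hard and an easy test
hard = [10, 15, 20, 25, 30]
easy = [70, 75, 80, 85, 90]

print(equate(20, hard, easy))  # a middling score on the hard test
```

The mapping depends entirely on the two score distributions being drawn from comparable groups. If the groups differ (say, in age), the "equated" scores absorb that difference too.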

But you can probably see the issue here: the change in the measurement of verbal fluency exactly coincides with a huge change in the age of the participants in their sample, and age is obviously correlated with verbal fluency! So how did they do the equipercentile equating? There's no info in the text or appendix other than a citation of Bertola et al. (2021), a paper published in the Brazilian Journal of Psychiatry by some of the same authors of this paper (Lotufo, Benseñor, Caramelli, Barreto, Suemoto). scielo.br

So we can assume they are using the same method as Bertola et al. 2021. But here's the problem: that paper has even MORE red flags!

The Bertola paper constructs a mapping using an "equating sample" of 260 participants (55-65, college-educated, white, stable memory scores). And then uses that to create equated scores for their full sample of over 55s. OK, sounds reasonable...

The equated scores are higher than the raw scores, as expected (the W2 tests were slightly harder):
phonemic fluency goes up a bit: 11.63 ➡️ 11.80
but semantic fluency goes up a LOT: 17.01 ➡️ 19.01

That change is so large that average semantic fluency is now INCREASING from W1 to W2!

An *increase* in semantic fluency is strongly contrary to expectations here, since the participants in this sample are 55+ and the average time from baseline to follow-up is about 4 years. But the authors don't mention this at all! So let's find out what's going on. Figure 3 provides a graph of the raw vs equated scores for each measure. However, on this graph, the adjustment for *semantic* fluency looks pretty minimal - it's basically a 45-degree line. While the adjustment for *phonemic* fluency is much more funky!

Table 2 and Figure 3 seem inconsistent to me. Fig 3 suggests that the semantic fluency scores were barely changed by the equating. That would be more consistent with the mean changes reported for *phonemic* fluency in Table 2. So did they get the variables mixed up? Table 3 shows t-tests for all cognitive tests from baseline to follow up, using the equated scores for the two measures of verbal fluency. Here we find that semantic fluency declines (t = -11) as expected and in line with the other tests, but phonemic fluency increases (t = +12).

If I had only seen Table 3, I might assume that +12 was a typo and it should be -12. But combined with Table 2 and Figure 3 it looks more like a variable mix-up. And here's another red flag: check out the Pearson correlation coefficients there! All < 0.2 and some as low as 0.04! Even over a 4 year time period, we would expect measures of cognitive health to be moderately correlated. This is really odd.

The correlations are so low I wondered if I misinterpreted what they were. Here's the text description. Note the non-standard phrasing ("Pearson r effect size"). Obviously could just be a language issue but it's adding to the confusion.

One complication here is that the authors switch between raw and equated scores expressed in the original units (counts of words), and standardized as z-scores (mean = 0, sd = 1). And it's not always clear when the data comes from the equating sample (n=260) vs the full sample (n=5,949). Table 1 shows means and SDs for "standardized" cognitive scores. But clearly they didn't standardize using the W1 sample of 5,949 or else the first two columns should be 0 and 1 exactly. Perhaps they used the full W1 sample including under-55s?

These inconsistencies are compounded by the fact that the text of the article is very vague and sometimes completely incoherent. For example: "Despite significant differences between baseline and follow-up mean cognitive scores, cognitive parameters at follow-up showed major stability over time according to the small effect sizes for repeated measurement comparison"

This is just word salad. And I don't think it can be explained as a language/translation issue. It is just fundamentally confused.

The fact that the authors never mention the discrepancies in how the semantic and phonemic scores are adjusted (whichever way round is correct!) is baffling.

So, I would say the Bertola et al. (2021) paper is very low quality. Why does that matter for the paper on sweeteners?
1) there is a substantial overlap in authorship
2) a component of their outcome variable depends on methodology in Bertola et al. (2021)

So, this study linking sweeteners to cognitive decline:
🚩 confuses correlation + causation
🚩 has no theory
🚩 shows evidence of p-hacking
🚩 reports inconsistent CIs / p-values
🚩 measures diet only once
🚩 has a big missing data problem
🚩 adjusts outcome opaquely

theguardian.com


There's honestly a ton more red flags I could go over (including the skewed distribution of sweetener consumption, the way they define "normal cognitive ageing", etc.) But I had 2 diet cokes this morning and the cognitive decline is setting in... Investigation continues here: bsky.app

Sophie E. Hill | 09/05/25

As noted by @carlmc.bsky.social, one odd pattern in these results is that ALL the estimates and confidence interval bounds are multiples of 0.008 What could explain that? 🤔
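The pattern is easy to check on the handful of values quoted earlier in this thread. This is only a sketch: the tolerance is an arbitrary choice of mine to absorb floating-point error, and the full tables would need to be transcribed to test the claim properly.

```python
def is_multiple(value, step=0.008, tol=1e-9):
    """True if value is (numerically) an integer multiple of step."""
    ratio = value / step
    return abs(ratio - round(ratio)) < tol


# Estimates and CI bounds quoted in this thread
reported = [-0.016, -0.040, -0.008, -0.032, 0.024]

for v in reported:
    print(f"{v:+.3f} -> {round(v / 0.008):+d} x 0.008, multiple: {is_multiple(v)}")
```

By contrast, a value like -0.004 (the centre of the second CI) fails the check, so the pattern is not something every number on a three-decimal grid satisfies automatically.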

Full write-up of the errors and inconsistencies in the results tables (including estimates outside their confidence intervals, sign errors, and many duplicate values 😳): github.com

GitHub - sophieehill/sweeteners: A blog post on errors and inconsistencies in quantiative results reported in Gonçalves et al. (2025), a paper recently published in Neurology, claiming to find an asso...

