Backstage story: The Oct 2023 Correction to Pekar et al
When formal peer-review fails
Introduction:
The preprint publication of Pekar et al at the end of February 2022 was accompanied by an aggressive global media push. That push presented the preprint conclusions as ‘dispositive’, and the preprint itself as a fine example of clever work. All for a piece of complex modelling that had not yet been peer-reviewed. The preprint was followed by a peer-reviewed publication in Science on 26 July 2022, with essentially the same content.
This Medium article explains how some undeterred quantitative-minded individuals decided to check the modelling and its assumptions, shortly after the publication of the preprint. This was done in a dedicated DRASTIC Twitter group created in March 2022, where eventually one of them (Nod, @nizzaneela) triggered a substantial correction to Pekar et al in Science in October 2023, by posting detailed modelling issues, their effects, and their corrections on PubPeer, a post-publication peer-review site.
After going over the modelling issues and their corrections, I will show how they substantially affect the market double-jump hypothesis. I will then show that the amended text of Pekar et al largely downplays the significance of these corrections, and draw some conclusions on the limitations of publication peer review.
1. Modeling vs. Shoe-leather epidemiology
Following the preprint release of Pekar et al and of its companion piece, Worobey et al, on 26 February 2022, a problem that normally would have required a classical shoe-leather epidemiological investigation seemed to have been resolved by the sheer power of ‘clever’ modelling [1]. All from 1,000s of miles away. All based on the limited data that had been made available. And all coming out of the double-barrelled gun of a pair of interlinked preprints [2].
An attentive read convinced me that the modelling of Pekar et al did not seem entirely right. That modelling looked way too complex and strangely exculpatory of some fundamental issue with the data and its possible interpretations. To me and others within and outside DRASTIC [3], the modelling, whatever its sparkles, seemed unable to provide the kind of definite answer that the media push had imbued it with.
I have been working for close to 30 years in one way or another with models, quite often on sparse data, and experience has taught me that unwarranted complexity is usually indicative of either a weak hypothesis or weak skills (or both). It is also the perfect way to make mistakes, since very few people have either the time or the skills to do a thorough model check. Worse, it is the perfect way to distance oneself from trying to collect better data, or from understanding the existing data better, tasks that should be the anchor of any serious epidemiological investigation.
Raising my doubts further, the main modeller on that paper (Niema Moshiri) seemed particularly confident of his work. When challenged, he sounded rather dismissive of any possible flaws in what was, after all, still a preprint. So, I decided to get a few people to have a good look at the modelling. Here is a screenshot of the very beginnings, on 21 March 2022. (I am the one in blue, acting as the facilitator.)
Two days after creating that chat group, Nod (@nizzaneela) was suggested as a good person to have in the group. I got in touch with him and promptly added him to the conversation.
Fast forward to August 2023:
After quite a lot of work trying to make sense of the model, Nod first raised an issue on the GitHub repository of Pekar et al for a basic modelling bug. This was quickly followed by a comment on PubPeer, an informal and open scientific post-publication review platform. Over the following weeks, Nod pointed out a few more issues in the modelling of Pekar et al. He also provided revised code on GitHub, with the help of Adrian J (@humblesci).
This triggered the publication of a correction in Science on 12 October 2023, without mentioning Nod by name, with just an indirect reference to his PubPeer entry.
2. The modelling issues and their corrections
2.1 What are the modelling issues?
The main issue was posted on PubPeer by Nod on 1st Aug 2023. It all comes down to a programming error likely due to a bad cut-and-paste.
The key consequence of that error is not in the small change in the support for the AB topology (from 0.5% to 3%) within the phylogenetic structures arising from a single introduction of SARS-CoV-2. It is instead in the next step of the Pekar et al analysis, namely in the calculation of the support for multiple introductions versus a single one.
That support is encapsulated in some Bayes factors, using Bayesian Hypothesis Testing techniques that go back to Jeffreys and that have become much more mainstream over the last 20 years. There, that single error has dramatic effects that bring the support down into the ‘moderate’ category (to use modern labels, instead of the ones proposed by Jeffreys back in 1961).
Following his first post on PubPeer, Nod soon pointed out two more issues with the modelling, which further degraded the Bayes factors.
2.2 What are the corrections to the modelling?
The revised version of Pekar et al eventually included corrections for the three issues raised by Nod, without giving him any credit at all.
The Erratum also mistakenly mentions only ‘an error in the code’, when in fact all three errors pointed out by Nod were corrected.
3. Significance of the modelling corrections
3.1 Significance of the revised Bayes Factors
The Bayes factors have effectively lost one order of magnitude in the correction. On the face of it, the corrected Bayes factors are now rather tepid, on the low side of ‘moderate’ throughout (with one point falling further to ‘anecdotal’).
This is best seen by considering Table S5 in the Supplement. That table gives the support of multiple jumps versus a single jump, expressed via the Bayes factors. It also includes some sensitivity analysis, looking for instance at alternate doubling times (my annotations in colour below):
As the revised Table S5 shows, the probabilistic support for a double jump at the market hovers at the low end of ‘moderate support’ to use modern labels (Lee & Wagenmakers, Cambridge University Press, 2013).
With a slight increase in the doubling time, that support even drops down to 3, which is ‘anecdotal’ in modern parlance (‘not worth more than bare mention’ in the original vocabulary introduced by Jeffreys).
Given the residual data, conceptual and parsimony issues, this practically means that the Bayes factors of 3 to 5 are now very much at the level of data and modelling noise.
When these Bayes factors were at ~60, the authors could argue that they would safely pull away the posteriors from these issues. But at 3 to 5, the anecdotal to weak Bayes factors are not pulling anything away any more. Those posteriors are now going nowhere.
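To make this concrete, here is a minimal Python sketch of how a Bayes factor moves a prior probability to a posterior one (posterior odds = prior odds × Bayes factor). The prior probabilities for multiple introductions used below are purely illustrative assumptions, not values taken from Pekar et al; the Bayes factors of 60 and 4.3 are the approximate pre- and post-correction values discussed above.

```python
def posterior_probability(prior_probability: float, bayes_factor: float) -> float:
    """Update a prior probability with a Bayes factor via the odds form of Bayes' rule."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors for 'multiple introductions' (assumptions for illustration only).
illustrative_priors = (0.05, 0.20, 0.50)

for prior in illustrative_priors:
    before = posterior_probability(prior, 60.0)  # approximate pre-correction Bayes factor
    after = posterior_probability(prior, 4.3)    # corrected primary-analysis Bayes factor
    print(f"prior={prior:.2f}  posterior at BF=60: {before:.2f}  posterior at BF=4.3: {after:.2f}")
```

Under these assumptions, anyone starting from a sceptical prior ends up with a posterior that a Bayes factor of about 4 barely lifts, which is the sense in which residual data and parsimony issues can no longer be brushed aside.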
3.2 Significance to the logic of the paper
In the end, one may doubt that the authors would have been able to publish their article with such support for their double jump hypothesis, especially after such complex modelling, so many necessary assumptions, and the limitations of their data.
To understand that better, let’s unroll the logic of the paper more precisely:
- The Bayes factors supposedly support at least 2 separate introductions at the market.
- The authors assume that independent introductions at the market can only be zoonotic.
Conclusion: That’s two or more market zoonotic jumps.
This way, the authors can turn a ‘market superspreader event in Dec 2019’ into ‘multiple market zoonotic events in Nov/Dec 2019’ (allowing for earlier cryptic cases).
Based on their zoonosis conclusion with its putative timing, the authors then feel justified in
- Ignoring ample contrary evidence for many more non-cryptic early cases that would otherwise have wreaked havoc with their analysis, and
- taking at face value some totally implausible data points provided by China.
Problem: if the Bayesian support for the two independent introductions at the market goes, the whole house of cards collapses. And then the authors need to have a proper look into the limitations of their data, as their timing for human jumps in late November to early December 2019 falls apart. That in turn weakens their model even further.
Basically, the modelling falls into a tailspin once the support for multiple independent introductions is gone. Too much modelling, not enough care for the data and methodology: it’s a lethal combination in a non-experimental field.
3.3 Renewed need to pay proper attention to the data
A good example of taking at face value some totally implausible data points provided by China is the mention in the paper of the supposedly all-negative 67 retrospectively identified suspected cases for Oct 1 to Dec 10, 2019.
Given an urban background positivity rate of around 4.4% back in early 2020, the chance of such a result (following testing in late 2021) is about 6%, even if none of these infections date back to Oct 1 to Dec 10, 2019. The only way around that is to suppose that the testing happened too late to detect any infection in early 2020, which in any case means that that data point is useless.
Even less probable is the extremely low number of 92 retrospectively identified suspected cases across 233 medical facilities in Wuhan, during that WHO review of late 2021 to January 2022. The chance of identifying only 92 suspected cases while doing a proper job, especially given the important flu peak at the time, is effectively zero. Which again means that that data point is useless.
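The estimate for the 67 all-negative cases can be sketched as a simple binomial calculation. This is only a rough check under stated assumptions: each of the 67 individuals is treated as an independent draw with the 4.4% background positivity quoted above, and test sensitivity is ignored. The variable names are mine.

```python
# Background positivity quoted in the text (early 2020, urban setting).
background_positivity = 0.044
# Number of retrospectively identified suspected cases reported as all negative.
n_cases = 67

# Probability that every one of n independent individuals tests negative.
p_all_negative = (1.0 - background_positivity) ** n_cases

print(f"Chance of {n_cases} negatives out of {n_cases}: {p_all_negative:.1%}")
```

Under these simplifying assumptions the result lands in the same few-percent range as the figure quoted above; the exact value depends on the positivity rate and independence assumptions used.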
For contrary evidence for many more non-cryptic early cases, see this presentation made to the WHO SAGO for instance.
4. The corrections to the text of Pekar et al
4.1 The Erratum:
As mentioned earlier, there are already issues with the Erratum for Pekar et al in the sense that it does not credit Nod, and also incorrectly states that there was only one error corrected (when actually all three issues pointed out by Nod were corrected).
4.2 Revised texts for the paper and supplement materials:
Side-by-side comparisons of the text before and after correction, for the paper and the supplementary materials, are available in this folder: https://drive.google.com/drive/folders/1JVDqekRUUmjQ8YkeiApaEzQt3PNjIPz0?usp=sharing.
Here are some of the key differences in the paper:
Here are some of the key differences in the Supplementary Materials:
5. Issues with the revised text:
5.1 Obfuscations
The text of the corrected Pekar et al is not exactly forthcoming in its representation of the importance of the corrections.
For instance, this is what the revised text says:
Phylodynamic rooting methods, coupled with epidemic simulations, reveal that these lineages were most probably the result of at least two cross-species transmissions into humans.
[…]
There was substantial support for two introductions with our primary analysis (BF=4.3 and BF=4.2 with the recCA and unconstrained rooting, respectively; see Methods), as well as with sensitivity analysis with varying transmission and ascertainment cases (Table S5).
when an accurate wording would be:
Phylodynamic rooting methods, coupled with epidemic simulations, give moderate support to the hypothesis that these lineages were the result of at least two cross-species transmissions into humans.
[…]
There was moderate support for two introductions with our primary analysis (BF=4.3 and BF=4.2 with the recCA and unconstrained rooting, respectively; see Methods), as well as with sensitivity analysis with varying transmission and ascertainment cases, except when increasing the doubling time of the Primary Analysis (from 3.47 to 4.45), which resulted in mere anecdotal support (Table S5).
5.2 Choice of vocabulary
One issue is that Pekar et al uses an outdated label for the key [3.2 (or 3), 10] significance interval. It uses ‘substantial’ instead of the more modern ‘moderate’.
The corrected version of Pekar et al states that ‘Bayes factor significance cutoffs from Kass and Raftery (1995) are now used throughout’.
In fact Kass and Raftery (1995) itself mentions that the cutoff levels and their descriptions are from Jeffreys (1961), the original developer of the Bayesian approach to hypothesis testing. In other words, the vocabulary that Pekar et al chose for the BF levels (‘not worth more than a bare mention’, ‘substantial’, ‘strong’, etc) is now more than 60 years old. Since then, Lee & Wagenmakers (2013, Cambridge University Press) have introduced a well-received revised vocabulary. In particular, ‘substantial’ has been revised to ‘moderate’, as it was felt that ‘substantial’ exaggerates the importance of the evidence at that BF level.
So the dated choice of labels obscures the fact that the support has substantially shifted: there does not seem to be too much of a difference between ‘strong’ and ‘substantial’, while using ‘moderate’, the modern label for ‘substantial’, would make this very clear. [4]
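The effect of the choice of vocabulary can be illustrated with a small Python sketch that labels the same Bayes factor under both schemes. The cut-offs and labels below are my transcription of the commonly tabulated Jeffreys-style scale reproduced by Kass and Raftery (3.2, 10, 100) and of the Lee & Wagenmakers scale (3, 10, 30, 100); treat them as assumptions of this sketch and check them against the sources before relying on them.

```python
from bisect import bisect_right

# (cut-offs, labels) pairs; both tables are assumptions transcribed for illustration.
JEFFREYS_VIA_KASS_RAFTERY = (
    [3.2, 10, 100],
    ["not worth more than a bare mention", "substantial", "strong", "decisive"],
)
LEE_WAGENMAKERS = (
    [3, 10, 30, 100],
    ["anecdotal", "moderate", "strong", "very strong", "extreme"],
)


def label(bayes_factor: float, scheme) -> str:
    """Return the verbal label for a Bayes factor >= 1 under the given scheme."""
    if bayes_factor < 1:
        raise ValueError("Bayes factor below 1 favours the other hypothesis; invert it first.")
    cutoffs, labels = scheme
    return labels[bisect_right(cutoffs, bayes_factor)]


# Approximate pre-correction value and the corrected primary-analysis value.
for bf in (60, 4.3):
    print(bf, "|", label(bf, JEFFREYS_VIA_KASS_RAFTERY), "|", label(bf, LEE_WAGENMAKERS))
```

With the older labels the correction reads as a step from ‘strong’ to ‘substantial’; with the modern ones it reads as a drop from ‘very strong’ to ‘moderate’.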
5.3 Bad typo in the Bayes factors significance intervals:
You can see below that the original version of the paper used 10, 10², 10³ as cut-offs for the significance levels, thus ignoring the tepid [3.2, 10] bucket, which is normally called ‘moderate support’.
After correction, the Bayes factors lost one order of magnitude, so the paper had to shift all the key cut-offs:
From [3.2,] 10, 100, 1000
….. to 3.2, 10, 100, [1000]
In doing so, the revised Table 1 also got a typo (!!): 32 for 3.2,
[Credit to @danwalker9999 in our Twitter Pekar et al group for quickly spotting it].
June 2024 update: R.I.P.
After further corrections for inconsistencies in the modelling [5], while trying hard to remain generous, nizzaneela found a Bayes factor for multiple introductions of around 0.25 to 0.30.
That’s at least 3 to 1 against multiple introductions.
In other words, the very model of Pekar et al, using the very data it decided to use (which itself is another point of contention), actually goes against the stated conclusion of the paper, once corrected for the cut-and-paste error and a series of incorrect Bayesian calculations.
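The ‘3 to 1 against’ reading follows from inverting the Bayes factor: a factor below 1 for multiple introductions is a factor above 1 for a single introduction. A minimal Python check, using only the 0.25 to 0.30 range quoted above (the function name is mine):

```python
def bayes_factor_for_alternative(bayes_factor: float) -> float:
    """Invert a Bayes factor so that it expresses support for the competing hypothesis."""
    return 1.0 / bayes_factor


# Range reported for multiple introductions after the further corrections.
for bf_multiple in (0.25, 0.30):
    bf_single = bayes_factor_for_alternative(bf_multiple)
    print(f"BF(multiple)={bf_multiple:.2f} -> BF(single)={bf_single:.2f}")
```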
There is no other way to put it: Pekar et al is effectively dead. Some, in desperation, have tried to argue that yet another modelling attempt may corroborate the results of Pekar et al, or that Worobey et al supports their conclusion, so that Pekar et al should still be considered valid.
That logic is totally flawed. First, papers such as Worobey et al already have their own set of criticisms. Secondly, Pekar et al must be judged not on the conclusion it reaches, but on the way it reaches it. Unfortunately Pekar et al is all about a complex model, and that model has now been found totally wrong.
6. Conclusions
There comes a time when, whatever your original stance on Pekar et al was, you just look at all the elements and latest findings in front of you and, without the slightest effort, in a stroke of immediate clarity, you suddenly see it for what it always was: an oversold, rather poor piece of modelling.
The recently disclosed modelling errors of Pekar et al, and the amplitude of the subsequent corrections, which we have discussed at length above, will likely be such a moment for many observers. Still, a good number of scientists had already determined that Pekar et al did not exactly pass the smell test, right after its publication as a preprint at the end of Feb 2022.
The preprint pretended to have found a ‘smoking gun’ for a market origin, with the pristine probabilistic proof of multiple zoonotic introductions in the market over just a few weeks. A proof delivered via some state-of-the-art modelling and expert usage of Bayesian Hypothesis Testing. Case closed.
Still, some warning signs were clear:
- An extremely complex modelling that can easily obscure both conceptual and coding mistakes.
- A long list of 29 authors, with likely only one or two with a real understanding of the model implementation.
- A high-stakes game, with at least four key authors (Holmes, Andersen, Rambaut and Garry) very likely feeling in strong need of vindication, after having claimed a near-certain zoonosis origin in Proximal Origin without strong data points to back it up.
- A rather dubious publication dynamic, with a well organised media push.
This all seemed like the perfect setting for some blind spots and overconfident conclusions.
Unfortunately peer review will typically not pick up intricate modelling errors, especially in a domain where the modelling should normally play a supporting role, not the main one. A situation that is made worse in this case by the absence of any experimental setup to help validate the modelling, as this is not an experimental field.
No peer reviewer is going to spend at least two weeks trying to understand the model of Pekar et al, then spend a few more weeks trying to rerun it, before ideally spending yet more time testing its robustness. If such a person had the time and experience, they would most likely be in the authors list anyway. DRASTIC was an exception, and I made sure that we had a little group dedicated to a proper modelling review of Pekar et al. That group, thanks largely to Nod, took 17 months but eventually delivered.
Still, it is rather surprising that we seem to be the only ones to have seriously investigated that piece of modelling. For sure, given the expertise, time and effort it took, there must be very little incentive for any scientists to do what we did. But I also now suspect that there is a strong incentive not to: one is most likely to make some serious enemies in academia by taking on one of the most cited scientific papers of the past few years. Incidentally, to this day Nod prefers to remain anonymous.
Stepping back further, it seems to me that the publication peer-review model has increasingly obvious deficiencies, and that the usual players in the publishing industry have little incentive to address these issues. After all, their legitimacy and profits are largely based on maintaining the aura of the effective publication gating they supposedly offer. Whether the publication peer review system that we have now, imperfections included, is the best of all possible options, or whether some suggested alternatives will succeed, is not yet clear. In the meantime, a few essential independent scientists, armed with common sense and grit, will keep doing their best to tackle the worst cases of failed publication peer review.
Notes:
[1] The origin of a cholera epidemic in Haiti provides us with a good recent case where a straight shoe-leather investigation quickly zoomed in on the actual origin, while more theoretical experts who had not set foot in the country strongly sided with a local zoonosis. It is worth noting that at least one of the key players at the time on the zoonosis side (a member of the board of EcoHealth Alliance) has again taken similar positions on the origin of SARS-CoV-2.
[2] The limitations of Worobey et al have been discussed at length elsewhere. See for instance the work of Sung Nok Chiu and Dietrich Stoyan on the modelling issues, or Courtier and Ribera and my own work on the data quality and biases issues.
[3] Experts such as Alexander Kekulé, Jesse Bloom or Francois Balloux have explained in simple words for the general public some of the parsimony issues with the double-jump hypothesis. Kumar et al (2021), using some modelling on a larger dataset, came to different conclusions as to the likely timing of the outbreak and the relation between clade A and B.
[4] Please note that while it is important to at least use a more appropriate vocabulary to designate the various confidence levels, this discussion should not distract from the fact that the whole idea of associating a given precise vocabulary with clear-cut thresholds, especially at the margin of significance, needs to be handled very carefully. To suppose that there is some kind of mechanical vocabulary to use, without paying attention to the whole setup (including the handling of the data), would be just a repetition of some of the worst abuses of Null Hypothesis Significance Testing.
[5] For the latest PubPeer results and discussions, see https://pubpeer.com/publications/3FB983CC74C0A93394568A373167CE
For the GitHub code, see https://github.com/nizzaneela/multi-introduction/tree/corrected_with_relative_size_and_separation_conditions