Gustavo's Newsletter

The general-purpose models beat the clinical tools. Read the footnotes before you switch.

Gustavo Monnerat PhD — Tue, 16 Jun 2026 14:36:03 GMT

Hi,

This week I read a new Nature Medicine Brief piece that pits two specialized clinical AI tools against three frontier models, and the result is the kind of thing that gets screenshotted without the caveats.

The headline: general-purpose models (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) beat the clinical tools (OpenEvidence and UpToDate Expert AI) on every test. The clinical tools landed in the same tier as Google’s AI Overview, the thing that shows up above your search results.

If you sell, buy, or use clinical AI, that comparison is key. But the strength of the evidence is very different across the three tests, and that’s the whole story.

What happened

Three stages: 500 MedQA exam questions, 500 HealthBench items, and a real clinical queries (RCQ) benchmark of 100 de-identified physician questions pulled from live care.
MedQA accuracy: Gemini 97.4%, GPT 94.2%, Claude 90.2%, then OpenEvidence 89.6% and UpToDate 88.4%.
HealthBench: GPT 88.0, Gemini 79.3, Claude 77.0, with both clinical tools far back at 62.6 and 61.3.
RCQ (the good part): 12 US clinicians, blinded and randomized, scored 6 systems across correctness, completeness, safety, and clarity on a 1 to 4 scale. That’s 1,800 annotations, 3 raters per response.
RCQ results split into two clean tiers. Frontier models on top (Gemini 3.62, GPT 3.54, Claude 3.52), then the clinical tools and Google AI Overview (OpenEvidence 3.24, Google 3.27, UpToDate 3.17). No significant gaps inside either tier.
On safety, nobody won and nobody lost. No model produced more harmful content (P = 0.55) or more hallucinations (P = 0.42) than any other.

Key aspects beyond these tests

MedQA and HealthBench are public benchmarks: the models may have seen these questions during training.

HealthBench has a second problem. It was built by OpenAI, and the top scorer on it was GPT-5.2. The authors flag the benchmark-developer overlap themselves.

So the real evidence is the RCQ. Real physician questions, no contamination, blinded human raters. And it still favors the frontier models, by 0.36 to 0.44 points on the 1 to 4 scale after adjustment.

One detail worth sitting with: OpenEvidence scored lowest on clarity (2.84), not on correctness. Its problem was communication, not knowledge. And UpToDate refused 19% of queries, far more than anyone else. A tool that won’t answer one in five questions loses on a benchmark even if its answers are fine.

Here’s the part that decides how much this generalizes. The clinical tools have no public API, so the authors had to drive them through the browser, by hand. That caps the sample size and bakes in whatever hidden prompts and retrieval quirks the web interface uses. So you’re measuring the consumer dashboard, quirks and all, rather than the engine underneath. Frontier models, meanwhile, ran through clean APIs.

What this doesn’t show

It doesn’t show that frontier models are better in real application scenarios. It shows they scored higher on three benchmarks.

It says nothing about latency or citation quality, key implementation metrics. For a lot of us, “can it open the full text and cite the trial” matters more than two-tenths of a point on a rating scale.

And it’s one snapshot of a fast-moving field, with 12 US clinicians judging questions from one US health system. It doesn’t tell you how any of this holds up with residents versus senior clinicians, or in settings with a different disease burden.

Why it matters

If you run a clinic, a research group, or a health system, the takeaway is that “specialized clinical AI” stays a marketing category until proven otherwise, and the burden of independent evidence sits with the vendor. Don’t switch tools off one paper.

The safety finding is the quietly reassuring one. All six systems hallucinated at the same low rate. The frontier models didn’t buy their accuracy with extra danger.

What I tried

A personal note on the RAG angle, because the paper hints at it: retrieval can hurt as easily as help when the wrong material gets pulled in. The way I get reliable retrieval with clean citations is to build a project (on Claude) around exactly the references I need and instruct the model to use only those. The trade-off is real: my token use is high, which is why I’ve started running the same setup with local references and other tools such as Codex. Curated retrieval beats open retrieval, and that’s the lever the clinical tools should be pulling.

Headlines

Claude shipped a new model, Fable 5, but it still isn’t available outside the US and carries heavy restrictions on healthcare applications. Also, it makes for quite an interesting AI governance and regulation case.

My feed

A frontier without an ecosystem is not stable. The model is one piece. Access, integration, citations, and trust are the rest.
Plasma proteomic signatures of cellular aging predict human disease. Proteomic aging clocks linked to disease. Association, not causation, but a nice example of signal pulled from messy biology.

To test

Next time you see an AI benchmark result, ask one question before you believe the gap: was the test public? Then ask what the real-world evidence and the implementation metrics are.

Written in a personal capacity. Public evidence only. Not medical advice.

Vishwanath et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nature Medicine, 2026. doi:10.1038/s41591-026-04431-5

How to use AI to evaluate your Research - Lessons from 497 desk rejections

Gustavo Monnerat PhD — Fri, 10 Apr 2026 16:28:29 GMT

Last month, the International Conference on Machine Learning desk-rejected 497 papers after catching 398 reciprocal reviewers breaking the AI-use rules they had explicitly agreed to follow. The detection method: organizers embedded invisible watermarks in submitted PDFs that, when fed into a large language model, instructed the LLM to include telltale phrases in the review text. Every flagged case was manually verified.

Can AI Review Its Own Evidence? A Deep Dive into LLM-Assisted Systematic Reviews

Gustavo Monnerat PhD — Fri, 06 Mar 2026 14:30:50 GMT

A systematic review just published in Nature Medicine used GPT-5 to screen and classify 4,609 studies evaluating large language models in clinical medicine. The results: only 19 randomized trials exist, LLMs outperform clinicians just 33% of the time on real patient data, and at least a quarter of studies have sample sizes under 30. But beyond the results, this paper raises a deeper question, can we trust AI to evaluate AI? And what does this mean for the future of evidence synthesis?

The Method: How GPT-5 is Supporting Reviewers

The study built a three-stage automated pipeline. First: GPT-5 screened 12,894 deduplicated abstracts from PubMed, Embase, and Scopus against predefined inclusion criteria, filtering down to 4,609 studies. Second: each included study was assigned to an evidence tier: Tier S for prospective RCTs of deployed systems, Tier I for real patient data, Tier II for simulated scenarios, and Tier III for exam-style knowledge tasks. Third: GPT-5 extracted metadata from every abstract: model names, specialties, comparator types, sample sizes, and whether the LLM outperformed humans.

The critical question is: how do we know GPT-5 got this right?

The authors designed a rigorous validation framework. For screening, 500 randomly sampled studies were independently reviewed by two groups of five human reviewers each, with an independent tiebreaker for disagreements. GPT-5 achieved a Cohen’s κ of 0.820 against the tiebroken human consensus, which actually exceeded the inter-human agreement. For tiering, 250 studies were validated the same way, with GPT-5 reaching κ of 0.695 (inter-human κ was 0.645). When the LLM made errors in tiering, 84.8% were off by only one tier.
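For readers who have not computed Cohen’s κ by hand, here is a minimal sketch of the statistic in plain Python. It is the textbook two-rater formula, not the authors’ code, and the include/exclude labels are toy data I made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Toy data (invented, not from the paper): 10 screening decisions.
human_consensus = ["include"] * 4 + ["exclude"] * 6
llm_decisions = ["include"] * 3 + ["exclude"] * 7
kappa = cohens_kappa(human_consensus, llm_decisions)
print(f"kappa = {kappa:.3f}")
```

The point of κ over raw agreement is the chance correction: in the toy data the two raters agree on 9 of 10 items, but because both mostly say “exclude”, a good share of that agreement would happen by chance, so κ comes out lower than 0.9.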

They then used a Bayesian hierarchical Dirichlet-multinomial model to propagate classification uncertainty into the final counts. Instead of reporting raw numbers, they produced credible intervals: an estimated 1,048 Tier I studies (95% CI 847–1,252), 1,857 Tier II (95% CI 1,427–2,280), and 1,704 Tier III (95% CI 1,273–2,134). This quantifies how much the automation might be wrong.
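To make the idea of propagating classification uncertainty concrete, here is a deliberately simplified sketch. It is not the authors’ hierarchical model: it assumes a flat Dirichlet prior on each row of the validation confusion matrix, samples with the standard library instead of MCMC, and uses invented counts (only the totals of 250 validated and 4,609 included studies echo the paper).

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def credible_intervals(validation, assigned, n_draws=4000, seed=0):
    """validation[i][j]: validated studies the LLM put in tier i that humans put in tier j.
    assigned[i]: studies the LLM put in tier i across the whole corpus.
    Returns a (low, high) 95% interval for the true count in each tier."""
    rng = random.Random(seed)
    n_tiers = len(assigned)
    samples = [[] for _ in range(n_tiers)]
    for _ in range(n_draws):
        totals = [0.0] * n_tiers
        for i, row in enumerate(validation):
            # Posterior for P(true tier | LLM tier i) under a flat Dirichlet(1, ..., 1) prior.
            probs = sample_dirichlet([count + 1 for count in row], rng)
            for j, p in enumerate(probs):
                totals[j] += assigned[i] * p
        for j in range(n_tiers):
            samples[j].append(totals[j])
    intervals = []
    for tier_samples in samples:
        tier_samples.sort()
        low = tier_samples[int(0.025 * n_draws)]
        high = tier_samples[int(0.975 * n_draws) - 1]
        intervals.append((low, high))
    return intervals

# Hypothetical numbers, chosen only so the sketch runs.
validation_counts = [[50, 8, 2], [6, 90, 9], [1, 10, 74]]  # 250 validated studies
llm_assigned = [1000, 1900, 1709]                          # 4,609 studies in total
intervals = credible_intervals(validation_counts, llm_assigned)
for name, (low, high) in zip(["Tier I", "Tier II", "Tier III"], intervals):
    print(f"{name}: {low:.0f} to {high:.0f}")
```

Each draw asks “if the LLM’s error rates were this, how many studies truly belong in each tier?”, and the spread across draws becomes the interval. A small validation sample gives wide intervals, which is the honest answer.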

Strengths of this Approach

Scale that manual review cannot match. Screening thousands of abstracts with two independent reviewer groups would require thousands of person-hours. The automated pipeline did it programmatically and validated a meaningful subsample to quantify error. For a field generating many papers per day, this approach is critical.

The LLM vs. human inter-rater agreement. Human screeners disagreed on 12.8% of studies, while the LLM disagreed with the human consensus on only 8.2%. For well-defined inclusion criteria applied to abstracts, the LLM may be very consistent.

Bayesian uncertainty quantification. Many LLM-assisted reviews report sensitivity and specificity only. By modeling the full confusion matrix probabilistically and using MCMC sampling, the authors produce results that carry explicit uncertainty bounds.

The tiering framework. Independent of the automation, a standardized evidence hierarchy, from board-exam benchmarks through simulated scenarios to real patient data and RCTs, gives the field a more mature way to evaluate evidence, and it may be adapted in future evaluations.

Potential Limitations

The abstract-only info. The entire analysis was performed on titles and abstracts, not full texts. Abstracts may omit critical details about study design, blinding, study size and location, data quality, and statistical methods.

The metadata extraction was not validated by humans. Screening and tiering were validated, but the extraction of model names, specialties, comparator types, sample sizes, and outperformance outcomes was automated without that check. This is a risk, because extraction from abstracts requires interpretation, not just classification.

Single-model. The entire review was conducted with GPT-5. The authors acknowledge that benchmarking multiple frontier models was prohibitively expensive, but this means every result is conditioned on GPT-5’s specific capabilities and potential errors and bias.

Prompts. The screening and tiering prompts shape the entire review, so every result depends on how they were written. They are, however, available on GitHub, which is important for transparency.

Using an LLM to evaluate research about LLMs creates a potential feedback loop. GPT-5 may interpret studies about GPT models systematically differently from studies about other LLMs, for instance because it is more familiar with the terminology and framing used in OpenAI-related studies, or because its training data includes earlier versions of some of these papers. This doesn’t invalidate the work, but it’s worth registering.

Possible applications of AI for Evidence Synthesis

Screening and study selection: semi-automated tools use active learning to reduce workload by 50–90% while keeping humans in the loop. Full LLM-based screening replaces human screening entirely but requires rigorous validation against human ground truth.

Evidence tiering and classification: LLMs can assign studies to predefined evidence hierarchies, enabling rapid stratification of large bodies of literature by methodological rigor.

Data extraction: LLM-based extractors can pull structured metadata from abstracts or full texts: study design, PICO elements, sample sizes, outcomes. Works best for well-defined fields, still struggles with nuanced or conditional results.

Risk of bias assessment: Automated tools can partially assess risk of bias across standard domains.

Quality of reporting assessment: LLMs can evaluate adherence to reporting guidelines such as CONSORT, PRISMA, and TRIPOD-LLM, flagging incomplete or non-compliant manuscripts at scale.

Living systematic reviews: Continuously updated reviews where AI monitors new publications, screens against inclusion criteria, extracts data, and flags studies that might change existing conclusions. Arguably the most important frontier for fast-moving fields like clinical AI.

Search strategy development: LLMs can help generate and refine database search strings, suggest indexing terms, and identify gaps in search coverage across databases.

Deduplication and record management: Automated matching of duplicate records across databases using title, DOI, and metadata similarity.
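Of the applications above, deduplication is the easiest to sketch without any model at all. The example below is a minimal illustration using only Python’s standard library; the record fields, the 0.95 title threshold, and the database names are my assumptions, not a description of any specific tool.

```python
import re
from difflib import SequenceMatcher

def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())

def is_duplicate(rec_a, rec_b, title_threshold=0.95):
    """Duplicates if both DOIs are present and equal, else if titles are near-identical."""
    doi_a = (rec_a.get("doi") or "").strip().lower()
    doi_b = (rec_b.get("doi") or "").strip().lower()
    if doi_a and doi_b:
        return doi_a == doi_b
    ratio = SequenceMatcher(None, normalize_title(rec_a["title"]),
                            normalize_title(rec_b["title"])).ratio()
    return ratio >= title_threshold

def deduplicate(records):
    """Keep the first occurrence of each record (simple O(n^2) pass)."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept) for kept in unique):
            unique.append(record)
    return unique

records = [
    {"source": "PubMed", "doi": "10.1038/s41591-026-04229-5",
     "title": "LLM-assisted systematic review of large language models in clinical medicine"},
    {"source": "Embase", "doi": "10.1038/S41591-026-04229-5",
     "title": "LLM-assisted systematic review of large language models in clinical medicine"},
    {"source": "Scopus", "doi": None,
     "title": "LLM-Assisted Systematic Review of Large Language Models in Clinical Medicine."},
    {"source": "PubMed", "doi": "10.1038/s41591-026-04431-5",
     "title": "General-purpose large language models outperform specialized clinical AI tools on medical benchmarks"},
]
unique_records = deduplicate(records)
print(len(unique_records), "unique records")
```

The design choice worth noting: a DOI match is treated as decisive, and fuzzy title matching is only the fallback when a DOI is missing. The pairwise loop is fine for a demo but would need blocking or indexing for tens of thousands of records.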

What This Means for Clinical AI Governance

This paper makes a compelling case that AI-assisted evidence synthesis is feasible for certain types of large-scale bibliometric work. The validation framework they built, combining human ground truth, bootstrapped confidence intervals, and Bayesian modeling, can be a relevant approach for future studies and frameworks.

The field needs to develop standards for when LLM-assisted evidence synthesis is appropriate and what level of validation is required. We need reporting guidelines that specify how to document prompt design, model selection, validation procedures, and sensitivity analyses. And we need to be honest about the limitations.

Bottom Line

The Chen et al. paper is simultaneously a sobering audit of clinical AI evidence and a framework for AI-assisted evidence synthesis.

The deeper implication is that evidence synthesis itself needs to evolve, not just to accommodate the volume of clinical AI research, but to match the speed at which the technology changes. Living systematic reviews powered by AI, with robust validation and transparent methodology, may be the path forward.

Ref: Chen, S.F., Alyakin, A., Seas, A. et al. LLM-assisted systematic review of large language models in clinical medicine. Nat Med (2026). https://doi.org/10.1038/s41591-026-04229-5

AI Literacy for Medical Research & Publishing

Gustavo Monnerat PhD — Tue, 17 Feb 2026 10:59:01 GMT

The U.S. Department of Labor released a national AI Literacy Framework: a set of baseline AI competencies it believes workers need, as a continuation of the 2025 AI Action Plan.

I found it both interesting and useful, but it’s also broad (intentionally). What it doesn’t do is address the specific stakes we face in medical research and publishing: patient safety, evidence integrity, regulatory requirements, and the reproducibility standards that underpin everything we do.

Over the past year, I’ve been increasingly involved in AI training and lectures for medical researchers, PhD students, and healthcare professionals. What I’ve seen consistently is a gap between general AI literacy content and what clinicians and researchers actually need to work with these tools responsibly. So when the framework came out, I went through it with that lens: what does each of these competencies look like when your work involves clinical trials, systematic reviews, and peer-reviewed publishing?

The framework is built on five foundational content areas:

Understand AI Principles: How these systems actually work. For us, that means understanding why LLMs generate probabilistic outputs, why hallucinations are a structural feature (not a bug), and why the same prompt can produce different outputs.
Explore AI Uses: Where AI is already being applied. In medical research, that spans literature screening for systematic reviews, protocol drafting, statistical code generation, regulatory intelligence through ClinicalTrials.gov, and patient communication materials.
Direct AI Effectively: Prompting as a skill. The single biggest lever for output quality is grounding your prompt in source data (attaching the actual paper) rather than asking the model to generate from memory. Add role context, specify output format, and iterate.
Evaluate AI Outputs: Critical appraisal for AI-generated content. Verify every statistic, every citation, every claimed source against originals. AI will fabricate references with complete confidence. Your domain expertise is the primary quality control mechanism.
Use AI Responsibly: Data protection, disclosure, and accountability. Never input identifiable patient data. Know your journal’s AI policy. ICMJE guidelines are clear: AI cannot be an author, and use must be disclosed.

The framework also includes seven delivery principles for how AI literacy training should be designed:

Enable Experiential Learning: Hands-on practice with real tasks, not abstract lectures
Embed Learning in Context: Teach AI through the lens of each specialty and workflow
Build Complementary Human Skills: AI amplifies critical thinking, writing, and statistical reasoning
Address Prerequisites: Not everyone starts from the same digital literacy baseline
Create Pathways for Continued Learning: Foundational literacy is the starting point, not the destination
Prepare Enabling Roles: PIs, editors, and mentors need specific training for their oversight responsibilities
Design for Agility: Tools change every few months; principles don’t

I recorded a free video lecture walking through each of these areas with concrete examples from clinical research workflows — including prompting demonstrations, a hallucination taxonomy specific to medical literature, and practical verification checklists you can use immediately.

The core message: AI literacy is not about replacing your expertise. It’s about amplifying it responsibly. The framework gives us the structure. Our training gives us the judgment. Together, they make us far more capable than either alone.

Source: https://www.dol.gov/sites/dolgov/files/ETA/advisories/TEGL/2025/TEGL%2003-25/TEGL%2003-25.pdf

How Do You Prove a Drug Doesn't Cause Harm?

Gustavo Monnerat PhD — Tue, 10 Feb 2026 14:12:49 GMT

Last week, The Lancet published what may be the most comprehensive assessment of statin adverse effects to date. The Cholesterol Treatment Trialists’ (CTT) Collaboration analyzed individual participant data from 123,940 people across 19 double-blind randomized trials to determine whether the long list of side effects on statin drug labels is actually supported by rigorous evidence.

Their conclusion: most of it isn’t.

Of 66 outcomes listed as potential undesirable effects in statin product labels, only four showed a statistically significant excess after controlling for multiple testing: abnormal liver transaminases, other liver function test abnormalities, urinary composition alteration, and oedema. Conditions like cognitive impairment, depression, sleep disturbance, sexual dysfunction, and peripheral neuropathy showed no significant results.

But this newsletter isn’t about whether statins are safe. It’s about how this team built the methodological approach to answer that question, and what every researcher, clinician, and medical affairs professional can learn from their approach.

Why Individual Participant Data Changes Everything

Most meta-analyses work with aggregate data, the summary statistics published in each trial’s paper. You extract the hazard ratio, the confidence interval, the sample size, and you pool them. This is useful, but it has real constraints. It is challenging to standardize outcome definitions across trials, reclassify events, or examine subgroups that weren’t reported in the original publications.

The CTT Collaboration did something fundamentally different. They obtained the raw, individual-level records for every participant in every trial. This meant access to every adverse event reported by every patient.

The practical implications are enormous. When Trial A codes a liver enzyme elevation as “hepatic dysfunction” and Trial B codes the same finding as “transaminase increased,” an aggregate meta-analysis treats these as different outcomes. An IPD meta-analysis can remap both to the same Medical Dictionary for Regulatory Activities (MedDRA) preferred term and analyze them together. The CTT team processed over 38 million records and more than 800 datasets to achieve this.
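A toy version of that remapping step, to show why participant-level records matter. The lookup table and the harmonized label here are hypothetical stand-ins: they are not taken from MedDRA and not from the CTT pipeline, which also involved clinician review.

```python
from collections import Counter

# Hypothetical lookup: verbatim trial terms -> one harmonized outcome label.
# Illustrative only; these entries are not copied from MedDRA.
TERM_MAP = {
    "hepatic dysfunction": "liver enzyme elevation",
    "transaminase increased": "liver enzyme elevation",
}

def harmonize(events, term_map):
    """Split events into (mapped, unmapped) after normalizing the verbatim term."""
    mapped, unmapped = [], []
    for event in events:
        key = event["term"].strip().lower()
        if key in term_map:
            mapped.append({**event, "outcome": term_map[key]})
        else:
            unmapped.append(event)  # left for a clinician to map by hand
    return mapped, unmapped

def count_by_arm(mapped_events):
    """Count harmonized events per (outcome, treatment arm)."""
    return Counter((e["outcome"], e["arm"]) for e in mapped_events)

# Invented participant-level records from two trials with different coding habits.
events = [
    {"trial": "A", "arm": "statin", "term": "Hepatic dysfunction"},
    {"trial": "B", "arm": "statin", "term": "transaminase increased"},
    {"trial": "B", "arm": "placebo", "term": "Transaminase increased "},
    {"trial": "A", "arm": "placebo", "term": "elbow pain after fall"},
]
mapped, unmapped = harmonize(events, TERM_MAP)
counts = count_by_arm(mapped)
print(counts)
print(len(unmapped), "event(s) need manual mapping")
```

With only published summaries, Trial A and Trial B would each report a different outcome name and the two statin events could never be pooled. With the raw records, they land in the same bucket.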

This is the difference between asking “what did each trial report?” and asking “what actually happened to each patient?”

The Outcome Mapping Pipeline: From Label to Analysis

One interesting aspect of this study is how the outcomes were defined. The team didn’t start with a hypothesis about what statins might cause. They started with what regulators already claim statins cause: the Summaries of Product Characteristics (SmPCs) for five statins (atorvastatin, fluvastatin, pravastatin, rosuvastatin, and simvastatin).

For each statin type, researchers reviewed SmPCs for at least one low-to-moderate-intensity and one high-intensity formulation, extracting every term listed under the undesirable effects section. These terms were then consolidated into a single list, duplicates were removed, and each term was mapped to MedDRA.

When a direct MedDRA match wasn’t available, a clinician performed the mapping to the closest available term. Clinically related terms were then grouped, for example, nausea and vomiting, which are listed separately in SmPCs, were merged into one composite outcome. Terms indicating clearly non-drug causes (post-procedural diarrhea, traumatic arthritis, congenital anaemia) were excluded.

The final result: 66 composite outcomes encompassing 555 MedDRA preferred terms, organized into 15 body system classes.

Why does this matter? Because the regulatory anchor gives the analysis a concrete, actionable question: are the specific claims on these drug labels supported by the best available evidence? This is different from (and more useful than) the open-ended question of whether statins cause any adverse effects at all.

The Double-Blind Requirement: Eliminating the Nocebo Problem

The decision to restrict the analysis exclusively to double-blind trials deserves special attention, because it addresses one of the most relevant problems in adverse event assessment: the nocebo effect.

When patients know they’re taking a statin, and have read the product label listing dozens of potential side effects, or seen media reports about statin harms, they are more likely to attribute any symptom they experience to the drug.

In unblinded studies, this may create systematic inflation of adverse event rates in the treatment group. But in a double-blind trial, neither the patient nor the clinician knows whether the patient is receiving statin or placebo. Any symptom attribution bias operates equally in both groups, and the between-group comparison remains uncontaminated.

The CTT Collaboration went further: they required at least 1,000 participants and at least 2 years of scheduled treatment, excluding small or short trials that would contribute statistical noise without meaningful signal. These inclusion criteria ensured that only trials with sufficient power and duration to detect real adverse effects contributed to the analysis.

Controlling for Multiple Testing: The FDR Approach

Testing 66 outcomes simultaneously creates a serious statistical problem. At a conventional p<0.05 threshold, you would expect roughly three false positives by pure chance (66 × 0.05 ≈ 3.3) even if statins caused none of these effects. If you then report these as “significant findings,” you manufacture evidence of harm that doesn’t exist.

The CTT team addressed this using the Mehrotra and Adewale double false discovery rate (FDR) method. The FDR approach controls the expected proportion of false discoveries among all findings declared significant. At a 5% FDR level, if you identify ten significant results, you would expect no more than one of them (on average) to be a false positive.

This is a meaningful middle ground between two extremes. The Bonferroni correction may be too conservative, potentially missing real adverse effects. Using p<0.05 without correction would be too liberal, generating false alarms. The FDR method preserves sensitivity to true effects while controlling the rate of false discoveries.

An important nuance: the paper reports nominal (uncorrected) p-values and 95% confidence intervals throughout. This means that some results with p<0.05 and confidence intervals excluding 1.0 are not FDR-significant.
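To see how a result can be nominally significant yet fail the FDR test, here is a small sketch. One caveat up front: the paper used the Mehrotra and Adewale double FDR method, and this example swaps in the simpler, single-stage Benjamini-Hochberg procedure as a stand-in. The p-values are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

n_outcomes = 66
# Expected chance findings at p<0.05 if no outcome had a real effect.
expected_false_positives = n_outcomes * 0.05

# Invented p-values: three small ones plus 63 unremarkable ones.
p_values = [0.0001, 0.0004, 0.03] + [0.2 + 0.01 * i for i in range(63)]
nominal = sum(p < 0.05 for p in p_values)
rejected = benjamini_hochberg(p_values)
print(f"expected by chance: {expected_false_positives:.1f}")
print(f"nominally significant: {nominal}, FDR-significant: {sum(rejected)}")
```

In this toy set, the outcome with p = 0.03 clears the conventional threshold but not the FDR cutoff, which is the same situation as the grey-zone results in the paper: a confidence interval that excludes 1.0, yet no FDR-significant finding.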

The Dose-Response Layer: Strengthening Causal Inference

The CTT team built in an additional layer of evidence: the four trials comparing more intensive versus less intensive statin therapy (30,724 participants).

The logic is straightforward. If statins truly cause an adverse effect through a pharmacological mechanism, you would expect more of that effect at higher doses. If a finding appears in the statin-versus-placebo comparison but shows no dose-response in the intensive-versus-less-intensive comparison, the causal inference is weaker.

This is exactly what the data showed. Liver transaminase elevations were significant in both comparisons, with a higher rate on more intensive than on less intensive therapy, a clear dose-response.

This dual-layer design, requiring both a signal against placebo and a dose-response gradient, creates a hierarchy of causal plausibility that is far more informative than either analysis alone.

The Innovative Visualization

The circular radar-style plots in Figures 1 and 2 deserve recognition as a visualization innovation. Displaying 66 outcomes on a conventional forest plot would require a figure spanning multiple pages. The radial design places body system categories around the circumference and effect sizes on the radial axis, allowing the reader to see the overall pattern immediately: a sea of null results with a handful of exceptions clustered in the hepatobiliary and renal/urinary sectors.

The visual distinction between FDR-significant results (black circles), nominally significant but not FDR-significant results (grey circles), and outcomes excluded from FDR testing (white circles for muscle and diabetes, which were previously reported) is elegant and informative. These figures communicate the paper’s central message, the absence of most of the adverse effects, more powerfully than any table could.

Where the Evidence Has Limits

No study is without limitations, and the CTT analysis has several that matter for clinical interpretation.

Follow-up duration. The median follow-up of 4.5 years in placebo-controlled trials captures short- and medium-term adverse effects but cannot address outcomes that develop over decades. This is particularly relevant for cognitive outcomes, because neurodegenerative processes unfold over much longer timeframes than these trials captured. The null finding for cognitive impairment is reassuring but not definitive for very long-term use.

Population representativeness. The trial populations were composed mostly of white men, and some trials didn’t report race or ethnicity data. The generalizability to other populations is still to be investigated. Real-world statin prescribing now reaches populations that were underrepresented in these trials.

Variable data collection depth. Not all trials collected all adverse events. Some collected only serious adverse events plus selected non-serious events.

Run-in phases. Several trials used run-in periods where all participants received active treatment or placebo before randomization, potentially excluding statin-intolerant individuals.

Adverse event reports versus biochemical data. The analysis relied exclusively on investigator-reported adverse events rather than systematic laboratory data. For liver function tests specifically, this likely underestimates the true incidence of asymptomatic transaminase elevations.

Reversibility. Due to heterogeneous recording of treatment discontinuation across trials, the team could not reliably assess whether adverse effects resolved after stopping therapy, or what happened with treatment rechallenge.

The SmPC constraint. By anchoring the analysis to SmPC-listed terms, the study doesn’t address outcomes that aren’t currently listed on drug labels.

What This Means for How We Think About Drug Safety

This CTT analysis illustrates a fundamental tension in pharmacovigilance: the system for adding adverse effects to drug labels has problems. Effects may get added based on observational signals and post-marketing surveillance data, all of which are susceptible to confounding, reporting bias, and the nocebo effect. But they are almost never removed, even when higher-quality evidence fails to support them.

The result is product labels that list dozens of potential side effects, many of which have never been validated in double-blind trials. These labels are then read by patients and clinicians, creating expectations that generate more case reports, which further reinforce the labeling, a self-sustaining cycle of unvalidated attribution.

The consequences are not abstract. After misleading claims that a drug causes side effects, many patients stop their therapy. This is the clinical cost of unreliable safety information: patients abandoning effective therapy based on adverse effects that rigorous evidence does not support.

The Broader Lesson: Why Methodology Matters

For researchers designing trials, this paper demonstrates the value of systematic, pre-specified adverse event collection with standardized coding. The harmonization effort that made this meta-analysis possible, converting heterogeneous adverse event data from many trials spanning decades into a common MedDRA-coded format, took years. Future trials can reduce this burden by adopting standardized adverse event collection from the outset.

For clinicians counseling patients, the findings provide evidence-based information about statin safety. The confirmed adverse effects (muscle symptoms, diabetes, liver enzyme elevations) can be discussed with appropriate context about their magnitude and clinical significance. The long list of unconfirmed effects that currently populate drug labels can be placed in proper perspective.

For medical affairs professionals and regulatory scientists, the paper raises a direct challenge: if the best available evidence does not support the inclusion of specific adverse effects on drug labels, those labels should be revised. The CTT team’s call for regulatory action is unusually direct, but the evidence supporting it is robust.

And for anyone who reads, evaluates, or creates medical evidence, this study is a masterclass in how to do a meta-analysis right, not just pooling numbers, but building an entire analytical infrastructure designed to answer a specific, consequential question with the highest possible degree of rigor.

Reference: Cholesterol Treatment Trialists’ (CTT) Collaboration. Assessment of adverse effects attributed to statin therapy in product labels: a meta-analysis of double-blind randomised controlled trials. Lancet 2026. Published online February 5, 2026.

AI Can Write Your Paper. But Can It Get It Published?

Gustavo Monnerat PhD — Thu, 05 Feb 2026 19:43:03 GMT

Real-time collaboration, LaTeX formatting, citation management, all powered by GPT-5.2.

AI genuinely accelerates certain tasks. Translating your manuscript to English. Improving sentence clarity. Formatting references. Converting your messy whiteboard sketch into a proper figure.

But after years of reviewing manuscripts and working with researchers, I keep seeing the same pattern:

The bottleneck was never typing speed. It’s knowing what to write.

Most rejected papers aren’t poorly written. They’re poorly structured. The argument doesn’t flow. The contribution isn’t clear. The Discussion repeats the Results. The Abstract buries the lead.

No AI tool fixes this. The problem isn’t only execution. It’s also strategy.

Today I’ll share the framework I use for structuring scientific papers, the mistakes I see most often, and where AI actually helps versus where it can hurt you.

The Abstract: Your Paper’s First and Often Only Chance

Most reviewers and editors form their initial opinion from your abstract. Many readers will never go beyond it.

Yet most abstracts I review fail at the basics.

The common mistakes:

Vague (or missing) objectives. “We aimed to investigate the relationship between X and Y” tells me nothing about your actual hypothesis or what’s new.

Missing methods specifics. What population? What design? What primary outcome? Readers need this to evaluate whether your study is relevant to them.

Results without numbers. “We found a significant association” doesn’t cut it. How significant? What was the effect size? What’s the confidence interval?

Conclusions that don’t match results. I see this constantly. Authors claim implications their data simply doesn’t support.

What actually works:

Open with one or two sentences of context and a clear objective. State the gap and what specific question you asked.

For methods, include your study design, population, setting, and primary outcome. Be specific. “Randomized controlled trial of 450 adults with type 2 diabetes across 12 centers” beats “We conducted a study in diabetic patients.”

For results, report your key findings with actual numbers. Effect sizes, confidence intervals, p-values for primary outcomes. Not everything you found. The findings that answer your research question.

Your conclusion should state what this means practically and what’s genuinely new. And it must follow logically from your results. No overclaiming.

AI can help you condense a draft abstract, improve clarity, and check that all required elements are present. But AI cannot decide what your key contribution is, choose which results matter most, or ensure your conclusion matches your evidence. That’s your job.

The Introduction: Setting Up the Gap

The Introduction has a main job: convince the reader your study needed to exist.

The common mistakes:

Literature reviews that read like a book. Comprehensive but unfocused. Readers don’t need a complete history of your field. They need to understand why your study matters.

Burying the research question. If readers have to hunt for your objective, you’ve lost them.

No clear gap. Lots of “X is important” but no “we don’t know Y.” Without a gap, there’s no justification for your study.

What actually works:

Start broad, then narrow quickly. By the end of your first paragraph, readers should know the general territory. By the end of the introduction, they should understand what we currently know, what we don’t know (this is the gap), why filling this gap matters, and exactly what you did to address it.

The literature review isn’t a comprehensive survey. It’s a curated argument that leads inevitably to your research question. Every paper you cite should serve that narrative.

AI can summarize literature, find papers you might have missed, and improve flow between paragraphs. But AI cannot identify the actual gap in knowledge, craft the narrative arc, or decide which of 200 relevant papers actually matter for your argument. Those decisions require your expertise.

Methods: Transparency With Your Reader

Methods is where trust is built or broken. Reviewers read this section asking this question: Could I reproduce this study?

The common mistakes:

Vague population descriptions. “Patients were recruited from a hospital” raises more questions than it answers. Which hospital? What inclusion criteria? What timeframe? How many were screened versus enrolled?

Missing ethical details. IRB approval and consent procedures aren’t optional information.

Statistical analysis as an afterthought. “Data were analyzed using SPSS” tells me nothing about your analytical approach.

What actually works:

State your study design explicitly upfront. Randomized trial? Prospective cohort? Cross-sectional survey? Don’t make readers guess.

Describe your population with enough detail for replication: inclusion and exclusion criteria, recruitment setting, timeframe, and sample size justification.

Include ethics committee approval, consent procedures, and trial registration if applicable.

Define your variables precisely. How were exposures measured? How were outcomes defined? Be specific enough that someone could replicate your measurements in a different setting.

For statistical analysis, explain which tests you used and why. How did you handle missing data? What was your significance threshold? Were your analyses pre-specified or exploratory?

AI can help ensure you haven’t forgotten standard elements, maintain formatting consistency, and improve clarity of technical descriptions. But AI cannot make methodological decisions for you, know what level of detail your specific field expects, or identify potential biases in your design.

Results: Show, Don’t Interpret

Results should present findings. Not interpret them. Not discuss implications. Just report what you found.

Sounds simple. Most authors still get it wrong.

The common mistakes:

Narrating every number in the tables. If it’s in the table, you don’t need to repeat all of it in the text. Highlight what matters and let the table carry the details.

Missing measures of uncertainty. Point estimates without confidence intervals or measures of variability are incomplete reporting.

Selective reporting. Highlighting favorable findings while burying unfavorable ones. Reviewers notice this, and it damages your credibility.

Interpreting in Results. “This significant finding suggests...” belongs in the Discussion. Results is for reporting, not explaining.

Figures and tables that can’t stand alone. Readers should understand your figures without hunting through the text for explanations.

What actually works:

Follow a logical flow. Usually this means: participant description, then primary outcome, then secondary outcomes, then sensitivity or subgroup analyses.

Report numbers with context. Effect sizes, confidence intervals, p-values. Measures of variability and dispersion. Move beyond just labeling findings as “significant” or “not significant.”

Design tables and figures that work independently. Clear titles, labeled axes, defined abbreviations, complete legends.

Report all pre-specified outcomes, whether they support your hypothesis or not.

AI can check statistical reporting consistency, help format tables, describe figures clearly, and catch missing elements. But AI cannot choose what to emphasize, design figures that tell the right story, or know which sensitivity analyses actually matter for your question.

Discussion: Your Interpretation, In Context

This is where your expertise matters most. And where AI is least useful.

The common mistakes:

Repeating Results. “We found that X was associated with Y” just restates what you already reported. Discussion is for interpretation, not summary.

Ignoring contradictory evidence. Only citing studies that agree with your findings signals either poor scholarship or intellectual dishonesty.

Limitations as an afterthought. Two generic sentences at the end (“our sample size was small”) don’t demonstrate serious engagement with your study’s weaknesses.

Overclaiming. “This study proves...” Calibrate your language to your evidence.

No practical implications. After reading your Discussion, readers should know what clinicians, policymakers, or researchers should do differently. If you can’t articulate this, why did the study matter?

What actually works:

Start with your key findings, but interpret them rather than restating them. What do your results mean? How do they compare with existing literature? Where do they fit in the broader scientific conversation?

Engage with contradictory evidence directly. If your findings differ from previous studies, explain why. Different population? Different methods? Different context? This demonstrates intellectual honesty and strengthens your argument.

Discuss real limitations specific to your study, not generic ones. Explain how these limitations might affect interpretation of your findings.

State implications for clinical practice, policy, or future research. Be specific but proportionate to your evidence.

AI can help find related literature to reference, improve clarity of complex arguments, and ensure you’ve covered standard discussion elements. But AI cannot provide your interpretation. That’s the entire point of this section. Your expertise, your perspective on what the findings mean, your understanding of the field’s debates and nuances. A general-purpose AI doesn’t have access to any of this.

Structure Before Speed

Most researchers get AI writing tools backwards.

They try to use AI to write faster before they’ve figured out what to write.

The result: polished manuscripts that are rejected because the core argument is weak.

The workflow that actually works:

First, clarify your contribution. Before writing anything, answer these questions: What’s new? Why does it matter? What’s the one thing readers should remember? If you can’t answer these clearly, you’re not ready to write.

Second, outline each section. Know what each paragraph needs to accomplish before you draft it. This is where most of the intellectual work happens.

Third, write the skeleton. Get the structure right, even if the prose is rough.

Fourth, use AI. To polish, clarify, translate, format. Not to think for you.

AI is a multiplier. If your structure is solid, AI makes you faster. If your structure is weak, AI helps you produce polished garbage more efficiently.

Want the Implementation?

This newsletter gave you the framework: what each section needs to accomplish and where authors typically fail.

In the premium version, I share the specific prompts I use for each section, from literature review to discussion strengthening. I walk through my actual workflow on video, applying this framework to a real manuscript. And I include the checklists and templates I use.

If that’s useful to you, you can upgrade here:

Subscribe now

AI-Assisted Scientific Writing: An Implementation Guide

Gustavo Monnerat PhD — Thu, 05 Feb 2026 19:38:50 GMT

The Core Principle

AI accelerates execution. It cannot replace thinking.

The workflow: clarify your contribution first, structure your argument second, write third, polish last. AI helps with steps three and four. Steps one and two remain yours.

This guide gives you prompts and tools for each section of a scientific manuscript. Every prompt assumes you’ve …

Designing and Interpreting Clinical AI Implementation Studies

Gustavo Monnerat PhD — Sun, 01 Feb 2026 17:57:37 GMT

Most clinical AI papers still lead with model performance: AUROC, sensitivity, specificity. But in deployed healthcare systems, models don’t create impact. Workflows do. Adoption does. Downstream capacity does.

The most informative recent AI studies are no longer algorithm evaluations. They are implementation trials and real-world evidence studies. They test what happens when AI is embedded into real clinical environments with real constraints, real clinicians, and real patients.

When you read these studies together, a clear pattern emerges:

Clinical AI fails or succeeds at the level of workflow, behavior, and system capacity, not at the level of the model.

This issue breaks down what the strongest recent implementation studies teach us, and how to design and interpret studies beyond performance metrics.

Case 1: AI Mammography Screening at National Scale

Study: PRAIM — Prospective real-world implementation study
Setting: Nationwide German mammography screening program; ~463,000 screening exams across multiple sites and radiologists
Design: Prospective observational implementation comparing AI-supported reading vs standard double reading, with discretionary AI use by radiologists

Key Findings

Higher cancer detection rate observed in AI-supported reads compared with standard double reading (approximately 6.7 vs 5.7 per 1,000 exams, model-adjusted analyses)
Recall rate non-inferior under AI-supported workflow
Substantially shorter reading time for exams classified as low-risk/AI-normal
AI safety mechanisms identified additional cancers that were initially missed by both human readers

Critical Design Feature

AI use was not randomized. Radiologists chose when to use the AI viewer versus the standard viewer. This created adoption and selection bias, including preferential AI use in exams already appearing low risk.

Analytical Controls Used

Propensity score overlap weighting to balance case mix
Placebo intervention analyses to test for residual bias
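To make overlap weighting concrete, here is a minimal sketch in Python. It is not the PRAIM analysis code. It assumes propensity scores (the estimated probability that a radiologist opens the AI viewer for a given exam) have already been fitted elsewhere, and every number and function name below is hypothetical. AI-read exams are weighted by 1 minus the propensity and standard reads by the propensity, so exams that could plausibly have gone either way count the most.

```python
def overlap_weights(treated, propensity):
    """Overlap weights: 1 - e(x) for AI-read exams, e(x) for standard reads."""
    return [1.0 - e if t else e for t, e in zip(treated, propensity)]


def weighted_rate(outcomes, weights):
    """Weighted mean of a 0/1 outcome (for example, cancer detected)."""
    return sum(o * w for o, w in zip(outcomes, weights)) / sum(weights)


def overlap_weighted_difference(treated, propensity, outcomes):
    """Weighted outcome rate with AI minus weighted rate without AI."""
    weights = overlap_weights(treated, propensity)
    ai = [(o, w) for t, o, w in zip(treated, outcomes, weights) if t]
    std = [(o, w) for t, o, w in zip(treated, outcomes, weights) if not t]
    rate_ai = weighted_rate([o for o, _ in ai], [w for _, w in ai])
    rate_std = weighted_rate([o for o, _ in std], [w for _, w in std])
    return rate_ai - rate_std


# Hypothetical toy data: 1 = AI viewer used; propensity = estimated P(AI use | case mix)
treated = [1, 1, 1, 0, 0, 0]
propensity = [0.9, 0.6, 0.5, 0.5, 0.4, 0.1]
outcomes = [0, 1, 1, 0, 1, 0]

difference = overlap_weighted_difference(treated, propensity, outcomes)
print(f"Overlap-weighted difference in outcome rate: {difference:.2f}")
```

The exam with propensity 0.9 in the AI group and the exam with propensity 0.1 in the standard group barely contribute, which is the point: the comparison leans on cases where adoption was genuinely uncertain.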

Interpretation

This is a strong example of real-world AI implementation science with causal adjustment. When AI use is discretionary, adoption bias may become the primary confounder, and causal modeling is required for valid effect estimation.

Eisemann, Nora, et al. "Nationwide real-world implementation of AI for cancer detection in population-based mammography screening." Nature medicine 31.3 (2025): 917-924.

Case 2: AI Stethoscope in Primary Care

Study: TRICORDER: Pragmatic cluster-randomized implementation trial
Setting: 205 NHS primary care practices; ~1.55 million registered patients
Design: Practices randomized to receive AI-enabled stethoscope access vs usual care; outcomes derived from routine clinical records

Key Findings

In the cluster-randomized implementation trial of the AI-enabled stethoscope:

Intention-to-treat analysis: Heart failure detection rate ratio ≈ 0.94 (95% CI ~0.86–1.02), indicating no statistically significant increase in population-level detection after device deployment.
Per-protocol (device actually used): Heart failure detection rate ratio ≈ 2.33 (95% CI ~1.3–4.3), indicating substantially higher detection among patients who were examined with the device, though this estimate is exposure-dependent and subject to selection bias.
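For readers who want to see how a rate ratio and its confidence interval are read, here is a minimal sketch using a Wald interval on the log scale. The counts are hypothetical, not the TRICORDER data, and the calculation ignores clustering, so it would give intervals that are too narrow for a real cluster-randomized trial. It only illustrates why an interval that includes 1 is read as "no statistically significant difference."

```python
import math


def rate_ratio_ci(events_a, persontime_a, events_b, persontime_b, z=1.96):
    """Rate ratio (arm A vs arm B) with a Wald interval computed on the log scale."""
    rr = (events_a / persontime_a) / (events_b / persontime_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts chosen for illustration only
rr, lower, upper = rate_ratio_ci(
    events_a=470, persontime_a=100_000,
    events_b=500, persontime_b=100_000,
)
includes_one = lower <= 1 <= upper
print(f"Rate ratio {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}); includes 1: {includes_one}")
```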

Interpretation

This trial cleanly separates:

Device capability vs system-level effectiveness under real adoption

The tool improved detection when actually used, but optional, workflow-friction-dependent uptake limited real-world exposure, resulting in no ITT benefit.

Kelshiker, Mihir A., et al. "Triple cardiovascular disease detection with an artificial intelligence-enabled stethoscope (TRICORDER) in the UK: a cluster-randomised controlled implementation trial." The Lancet (2026).

Case 3: AI Diabetic Retinopathy Screening in Resource-Limited Settings

Study: Trial investigating AI-supported diabetic retinopathy screening
Setting: Lower-resource health system clinics in Rwanda.
Design: Randomized comparison of immediate AI-supported screening feedback vs delayed standard grading feedback

Key Findings

Primary endpoint: Attendance at referral services within defined follow-up window
Immediate AI-supported feedback increased referral attendance rates compared with delayed results
Many patients still did not complete referral within the follow-up window despite improved initiation

Implementation Constraints Identified

Travel and geographic barriers
Specialist access limitations
Cost and time burden
Follow-up loss

Interpretation

Here the limiting factor is care pathway capacity and adherence, not detection performance. AI improves early pathway steps but does not eliminate structural access barriers.

Mathenge, Wanjiku, et al. “Impact of artificial intelligence assessment of diabetic retinopathy on referral service uptake in a low-resource setting: the RAIDERS randomized trial.” Ophthalmology Science 2.4 (2022): 100168.

What Strong Implementation Trials Measure

The best studies report outcomes across four layers:

1. Clinical Outcomes

Mortality, complications, admissions

2. Process Outcomes

Detection rates, time-to-treatment, referral completion, diagnostic yield

3. Implementation Outcomes

Adoption rate, sustained use over time, workflow burden, interaction frequency

4. Equity Outcomes

Subgroup uptake, urban vs rural access, pathway completion by demographics

If a paper reports only clinical outcomes or only model metrics, it is not a complete implementation study.

Trial Design Patterns for Implementation Questions

Cluster randomized controlled trials (Cluster RCTs) are best suited for practice-level interventions and workflow tools where individual randomization would cause contamination. They answer the question: “What happens if we deploy this across real clinical units or practices?” These designs require a sufficient number of clusters and proper estimation of the intraclass correlation coefficient (ICC) to ensure valid power and inference.
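As a rough illustration of why the ICC matters for power, the sketch below applies the standard design effect, 1 + (m − 1) × ICC, where m is the average cluster size. It assumes equal cluster sizes, and the ICC and sample sizes are hypothetical values chosen only to show the arithmetic.

```python
import math


def design_effect(mean_cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(n_total, mean_cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_total / design_effect(mean_cluster_size, icc)


def clusters_needed(n_individual, mean_cluster_size, icc):
    """Clusters per arm, given the per-arm n an individually randomized trial would need."""
    inflated_n = n_individual * design_effect(mean_cluster_size, icc)
    return math.ceil(inflated_n / mean_cluster_size)


# Hypothetical inputs: 100 patients per practice, ICC of 0.05
deff = design_effect(mean_cluster_size=100, icc=0.05)
n_eff = effective_sample_size(n_total=10_000, mean_cluster_size=100, icc=0.05)
k = clusters_needed(n_individual=400, mean_cluster_size=100, icc=0.05)
print(f"Design effect {deff:.2f}, effective n {n_eff:.0f}, clusters per arm {k}")
```

Even a modest ICC shrinks 10,000 enrolled patients to well under 2,000 effective observations in this toy case, which is why the number of clusters usually matters more than the number of patients.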

Stepped-wedge designs are useful for system-wide rollouts when there are ethical or operational constraints on withholding the intervention. They answer: “What is the effect as adoption spreads over time?” They require careful modeling of period effects and time trends to avoid bias from secular changes.

Pragmatic cohort studies may be used for EHR-embedded tools and behavior-dependent AI exposures. They address: “Does actual use lead to better outcomes?” These studies require rigorous exposure definitions (based on real interaction or use), along with strong confounding control strategies.

Real-world implementation studies with causal modeling are typical in screening programs and policy or program rollouts where adoption is not randomized. They answer: “What is the effect under routine conditions?” They require causal inference methods such as propensity score approaches and multiple sensitivity analyses to address selection and adoption bias.

High-Resource vs Lower-Resource Settings: Different Failure Modes

High-Resource Settings

Potential bottlenecks:

Workflow friction and documentation burden
EHR integration gaps
Unclear accountability for AI outputs

Lower-Resource Settings

Potential bottlenecks:

Referral and specialist capacity
Geographic and financial access barriers
Workforce limitations
Follow-up and pathway completion

Note: These categories oversimplify. Many “high-resource” systems have pockets of fragile pathways; many “lower-resource” settings have strong community health infrastructure. Implementation assessment should be context-specific, not assumption-driven.

How to Interpret ITT vs Per-Protocol in AI Trials

Intention-to-Treat (ITT)

Includes randomized units regardless of actual use
Answers: “Does deployment change outcomes?”
Policy-relevant, adoption-sensitive
Often null in early implementations with low uptake

Per-Protocol / As-Treated

Restricted to those who actually used the intervention
Answers: “Does it help when used correctly?”
Shows capability, but may be selection-biased
Not a deployment guarantee

Both are useful, but they answer different questions.

Comparing the two can tell us whether the technology works but implementation failed. When that is the case, the policy response should target implementation, not just the algorithm.
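A small sketch of the dilution arithmetic behind that gap. It assumes, unrealistically, that patients examined with the tool have the same baseline risk as everyone else (no selection bias) and that unexposed patients in the deployed arm see no effect. The uptake levels and the rate ratio of 2.0 are hypothetical.

```python
def expected_itt_ratio(uptake, ratio_when_used):
    """Expected ITT rate ratio when only a fraction `uptake` of the arm is exposed.

    Simplifying assumptions (labeled, not from any trial): exposed and unexposed
    patients share the same baseline rate, and unexposed patients get no effect.
    """
    return uptake * ratio_when_used + (1 - uptake)


# Hypothetical scenario: the tool doubles detection when it is actually used
for uptake in (0.02, 0.10, 0.50, 1.00):
    itt = expected_itt_ratio(uptake, ratio_when_used=2.0)
    print(f"uptake {uptake:>4.0%} -> expected ITT rate ratio {itt:.2f}")
```

At 2% uptake, a tool that doubles detection when used moves the deployment-level ratio only to about 1.02, which most trials could not distinguish from no effect.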

Reading Clinical AI Implementation Evidence

Key points when evaluating a clinical AI study:

Is this a model validation study or an implementation study?
How is exposure defined, by model output or clinician behavior?
Are adoption and sustained use metrics reported?
Is downstream care pathway capacity measured?
Are ITT and per-protocol results both presented and distinguished?
Is adoption bias addressed analytically?
Are equity outcomes reported?

Where do you think most clinical AI tools fail today, model performance, workflow integration, or downstream care capacity?

AI for Clinical Trials: Research, Summarization, and Drafting

Gustavo Monnerat PhD — Thu, 22 Jan 2026 14:55:29 GMT

Clinical trials generate an overwhelming volume of structured and unstructured information: registry entries, protocols, SAPs, amendments, publications, press releases, and regulatory documents. AI can reduce the friction of navigating that evidence, but only if we design workflows that keep humans accountable and outputs auditable.

A useful way to frame the moment: the biggest risk is not that AI is “too powerful.” It’s that teams adopt it as if better text generation automatically means better evidence.

Working principle: AI is strongest at search, extraction, and comparison for human review. Humans must lead on scientific judgment, feasibility, ethics, and sign‑off.

3 Applications of AI for Clinical Trials

1: Research and retrieval (finding the right trials and documents)

Best for: landscape scans, competitive intelligence, eligibility comparisons, endpoint precedent.

AI can help you:

Build and refine search strategies for ClinicalTrials.gov and publications
Identify similar trials (population, intervention, comparator, outcomes, design)
Extract structured fields (phase, status, endpoints, timelines, enrollment)

Key point: outputs should always include traceable identifiers (e.g., NCT IDs, DOI/PMID, FDA application references).

2: Summarization and synthesis (turning many sources into a usable view)

Best for: internal briefings, medical affairs summaries, cross-trial comparisons.

AI can help you:

Create trial summaries (one-page structured summaries)
Produce comparison tables (eligibility, endpoints, safety monitoring, visits)

Key point: “If it’s not in the source, AI must say NOT AVAILABLE.”

3: Drafting (documents and content that will be reviewed)

Best for: first drafts and outlines, not final scientific claims.

AI can help you draft:

Protocol section outlines (background, rationale, objectives) using provided source text
Eligibility criteria rewrite for clarity (without changing meaning)

Critical rule: draft ≠ decision. AI tools can help you with insights and trends, but drafting must not become authorship or decision making. Human specialists should remain responsible.

Where AI is genuinely useful in clinical trials

Below are valuable use cases that tend to be “safe” when paired with verification.

1) Trial disambiguation (nicknames → official records)

Trials often circulate internally with nicknames, acronyms, or drug code names.

Goal: resolve “GLORY‑3” / “SUNRISE‑2” / “Drug‑123” into a verified list of candidate NCT records.

What to require:

List multiple candidates when uncertain
Explain why each candidate matches
Never guess a single “correct” trial without evidence

2) Competitive landscape mapping

Goal: “Show all Phase 2–3 completed trials in condition X since 2020, include sponsor, endpoints, and results availability.”

What to require:

Results availability and links
“missingness map” (what fields are absent)
A list of trials excluded and why
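A “missingness map” can be as simple as a table of which required fields are absent for which trial. The sketch below assumes trial records have already been exported as plain dictionaries. The field names and identifiers are hypothetical placeholders, not a real registry schema or real trials.

```python
def missingness_map(trials, required_fields):
    """Return {field: [trial IDs where that field is absent or empty]}."""
    missing = {field: [] for field in required_fields}
    for trial in trials:
        for field in required_fields:
            if trial.get(field) in (None, "", []):
                missing[field].append(trial["nct_id"])
    return missing


# Placeholder records for illustration only (not real registry entries)
trials = [
    {"nct_id": "NCT_EXAMPLE_A", "sponsor": "Sponsor A",
     "primary_endpoint": "Endpoint 1", "results_url": ""},
    {"nct_id": "NCT_EXAMPLE_B", "sponsor": "Sponsor B",
     "primary_endpoint": None, "results_url": None},
]

report = missingness_map(trials, ["sponsor", "primary_endpoint", "results_url"])
for field, ids in report.items():
    print(f"{field}: missing in {len(ids)} of {len(trials)} trials {ids}")
```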

3) Endpoint benchmarking

Goal: learn what endpoints have precedent in similar programs.

What to require:

Endpoint taxonomy (primary/secondary, timepoints)
Frequency counts
Context notes: design, population, comparator

Note: benchmarking is not endorsement.

4) Regulatory intelligence (precedent, not prediction)

Goal: connect trials to approvals and labels.

What to require:

Cross‑reference: ClinicalTrials.gov + FDA labels/approval docs + publications

Note: registry data ≠ approval package

5) Patient eligibility screening (triage, not determination)

Goal: Shortlist recruiting trials for a patient profile (condition + biomarker + prior therapy + geography), and surface key exclusion risks.

What to require:

NCT IDs + links for every trial
Recruiting status + locations (site filter)
Key inclusion/exclusion
Unknowns list (labs/ECOG/washouts often need protocol)
Final step: site confirmation + protocol review

Note: AI supports triage, not final eligibility decisions.

The problem

When AI summaries fail in clinical trials, it may be due to:

Data incompleteness: registry fields missing or inconsistent
Results lag: many trials never post results or do so late
Ambiguity: interventions and outcomes described differently across sources
Hallucination: AI fills gaps with plausible‑sounding text

A practical safeguard: require the model to output:

NCT IDs for every claim about a trial
A “NOT AVAILABLE” response for missing fields
A short “checklist” (what to check manually)
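These three requirements can be checked mechanically before a human ever reads the summary. The sketch below is one possible shape, assuming the model returns each trial as a dictionary. The field names are hypothetical, and `NCT00000000` is an obvious placeholder. Note that the format check (the letters NCT followed by eight digits) only confirms that an identifier is well formed. It does not confirm that the record exists, which still requires a lookup in the registry.

```python
import re

NCT_PATTERN = re.compile(r"^NCT\d{8}$")
NOT_AVAILABLE = "NOT AVAILABLE"


def check_trial_claim(claim, required_fields):
    """Return a list of problems found in one AI-generated trial record."""
    problems = []
    nct_id = str(claim.get("nct_id") or "")
    if not NCT_PATTERN.match(nct_id):
        problems.append(f"missing or malformed NCT ID: {nct_id!r}")
    for field in required_fields:
        value = claim.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"{field}: empty, expected a value or {NOT_AVAILABLE!r}")
    return problems


def manual_checklist(claim, required_fields):
    """Fields the model marked NOT AVAILABLE, which a human should check at the source."""
    return [f for f in required_fields if claim.get(f) == NOT_AVAILABLE]


fields = ["phase", "primary_endpoint"]
good = {"nct_id": "NCT00000000", "phase": "Phase 2", "primary_endpoint": NOT_AVAILABLE}
bad = {"nct_id": "GLORY-3", "phase": ""}

print(check_trial_claim(good, fields), manual_checklist(good, fields))
print(check_trial_claim(bad, fields))
```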

Human-in-the-loop

AI excels at

Searching large datasets quickly
Extracting and structuring trial fields
Comparing trials and generating tables
Drafting templates and shells

Humans must lead

Protocol authorship decisions and scientific rationale
Feasibility and operational judgment
Regulatory strategy defensibility
Final eligibility decisions and clinical judgment
Accountability and sign‑off

Risk controls: the checklist I recommend

Before

Define the decision boundary: what AI can and cannot do
Require traceability (NCT IDs / citations)
Choose a verification sampling plan

During

Force “NOT AVAILABLE” for missing fields
Run a deliberate failure test (e.g., fake NCT)
Keep a changelog of edits and approvals

After

Spot‑check key claims against primary sources
Archive inputs/outputs for auditability

Never

Let AI author a protocol or endpoints without human ownership
Use AI output as evidence when source is absent
Treat a fluent summary as validated truth

Want the full workflow + prompts?

In the premium version, I walk through a complete end-to-end demo (research → structured summaries → drafting with guardrails), and I share an optimized prompt pack you can use immediately: trial cards, cross-trial comparison tables, endpoint benchmarking, verification checklists, and drafting insights.

Upgrade to Premium to access:

The video walkthrough (live examples)
The prompt library
Subscribe now

AI for Clinical Trials: Using the ClinicalTrials.gov Connector

Gustavo Monnerat PhD — Thu, 22 Jan 2026 14:53:26 GMT

This video is a practical demonstration of how to use AI for clinical trials work without losing traceability, scientific judgment, or accountability.

I walk through real use cases, using the ClinicalTrials.gov connector as the primary data source, and show how to pair it with verification guardrails and free alternatives when paid tools aren’t availabl…

How to Evaluate Digital Health: A Practical Guide

Gustavo Monnerat PhD — Thu, 15 Jan 2026 19:26:38 GMT

First: get the categories right

Before judging any digital health solution, you need to answer two basic questions:

Who is it for?

Patient-facing
Clinician-facing
System-facing

What does it do?

Care delivery
Monitoring
Decision support
Population health

A meditation app, a tele-ICU system, and an AI triage model may all be called “digital health” but they do not require the same evidence, metrics, or validation.

One-size-fits-all evaluation is the fastest way to get it wrong.

Different applications require different evidence

Here’s where most hype collapses.

A digital therapeutic claiming outcome improvement should be supported by clinical trials or strong real-world outcome data.
A wearable should demonstrate data accuracy, stability over time, and robustness to missing data.
A decision-support tool must show safety when wrong—not just average performance.
Telemedicine should be compared to in-person care, including quality, access, and equity.
Population-level analytics must prove stability and transportability across settings.

If the evidence doesn’t match the claim, the claim is not credible.

Common reasons digital health studies fail

Most failures are not about bad technology. They’re about weak evaluation.

Watch out for:

Selection bias (early adopters ≠ real-world users)
Short follow-up that captures novelty, not sustainability
Weak or undefined “usual care” comparators
Observational results framed as causal impact
Non-representative populations
Small samples with unstable estimates
Lack of external validation

If you’ve read a “breakthrough” digital health paper that felt underwhelming on closer inspection, this is usually why.

A simple evaluation checklist

Before trusting a digital health claim, ask:

Does it solve a real clinical or operational problem?
Is the contribution genuinely new, or just incremental?
Is the population large and representative?
Are the methods appropriate for the claim?
Are results transparent, including uncertainty and limitations?
Is it ready for real-world use, considering workflow, safety, equity, and scale?

If the answer is “no” early on, no amount of model performance will save it.

Why metrics alone are not enough

Accuracy numbers without context are misleading.

Meaningful evaluation includes:

Discrimination and calibration
False positives and false negatives
Decision-relevant thresholds
Uncertainty and confidence intervals
Subgroup performance
Sensitivity analyses

The key question is always:
Do these metrics reflect how the tool will be used in practice?
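A short sketch of why threshold and setting matter. The first function computes sensitivity, specificity, and PPV at a chosen decision threshold. The second shows how PPV shifts with prevalence even when sensitivity and specificity stay fixed. All scores, labels, and prevalence values are hypothetical.

```python
def metrics_at_threshold(y_true, scores, threshold):
    """Sensitivity, specificity, and PPV when scores >= threshold count as positive."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and truth:
            tp += 1
        elif flagged:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None  # None when the metric is undefined

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV implied by Bayes' rule for a given prevalence in the target population."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical toy data
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
print(metrics_at_threshold(y_true, scores, threshold=0.5))

# Same 90% sensitivity and specificity, two different settings
for prevalence in (0.01, 0.30):
    print(f"prevalence {prevalence:.0%}: PPV {ppv_at_prevalence(0.9, 0.9, prevalence):.1%}")
```

A tool with 90% sensitivity and 90% specificity produces mostly false alarms at 1% prevalence, and mostly true positives at 30%. The headline accuracy is identical in both cases.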

Want the full framework?

This newsletter is the high-level map.

For premium subscribers, I recorded a full video class where I walk through:

The complete digital health taxonomy
How to match application type to evidence requirements
Real examples of strong vs weak studies
Validation strategies (retrospective, prospective, external, real-world)
Which metrics matter for which use cases
How editors, reviewers, and regulators actually think

The video is designed to help you:

Read papers faster and more critically
Avoid being misled by hype
Make better decisions in research, product, or policy

If digital health is part of your work, or soon will be, this framework will save you time and mistakes.

👉 Premium members can access the full video class inside Evidence Decoded.

How to Evaluate Digital Health: A Practical Guide

Gustavo Monnerat PhD — Thu, 15 Jan 2026 19:19:06 GMT

Digital health is not one thing. It is an umbrella covering very different technologies, risks, and evidence needs.
This guide summarizes how to classify digital health solutions, understand what kind of evidence they require, and critically evaluate studies and claims.

The goal is not to decide whether a technology is “innovative,” but whether it is safe…

2025’s most impactful papers in Digital Health and AI

Gustavo Monnerat PhD — Fri, 26 Dec 2025 10:26:11 GMT

A lot of AI-in-health research still lives in the sandbox: impressive benchmarks, limited deployment reality. But 2025 produced a different class of papers, studies where the unit of analysis wasn’t just the algorithm, but the workflow, the scale, and the operational consequences.

Below are five studies I consider impactful because they moved digital health and AI closer to routine clinical infrastructure: national screening, bedside ultrasound, equity-oriented remote specialty care, generative AI in therapeutics, and a randomized trial of a generative AI mental health intervention.

1) National real-world AI in mammography screening: performance and implementation at scale

Paper: Nationwide real-world implementation of AI for cancer detection in population-based mammography screening (Nature Medicine, 2025)

Why it matters: This is the type of evidence that changes policy conversations: large-scale real-world implementation.

Key results

463,094 women screened across 12 sites, involving 119 radiologists; AI used in 260,739 screens.
Cancer detection rate: 6.7/1,000 with AI vs 5.7/1,000 control (+17.6%).
Recall rate: 37.4/1,000 with AI vs 38.3/1,000 control (lower; noninferior).
Precision improved (PPV of recall 17.9% vs 14.9%; PPV of biopsy 64.5% vs 59.2%).
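The arithmetic behind these headline numbers is worth doing once by hand. The sketch below recomputes the relative and absolute differences from the rounded rates quoted above. The rounded inputs give roughly +17.5% for detection, and I am assuming the small gap to the reported +17.6% comes from the paper using unrounded rates.

```python
def relative_change(new, old):
    """Relative difference of `new` versus `old`, as a fraction."""
    return (new - old) / old


# Rounded rates per 1,000 screens, as quoted above
cdr_ai, cdr_control = 6.7, 5.7
recall_ai, recall_control = 37.4, 38.3

cdr_relative = relative_change(cdr_ai, cdr_control)
cdr_absolute = cdr_ai - cdr_control  # extra cancers detected per 1,000 screens
recall_relative = relative_change(recall_ai, recall_control)

print(f"Detection: {cdr_relative:+.1%} relative, {cdr_absolute:+.1f} per 1,000 absolute")
print(f"Recall: {recall_relative:+.1%} relative")
```

Reading both scales matters: a double-digit relative gain corresponds here to about one additional cancer detected per 1,000 women screened.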

Takeaway: The headline isn’t “AI detects more cancers.” It’s: AI can improve detection without inflating recalls under real-world constraints.

2) AI + point-of-care ultrasound: opportunistic screening years before diagnosis

Paper: Artificial intelligence-guided detection of under-recognised cardiomyopathies on point-of-care cardiac ultrasonography: a multicentre study (The Lancet Digital Health, 2025)

Why it matters: This is what scalable clinical AI looks like: use a tool already spreading everywhere (POCUS) and upgrade it from “acute question” imaging to opportunistic screening.

Key results

Two large health systems; 78,054 eligible POCUS videos (YNHHS) and 13,796 (MSHS).
Strong discrimination (AUC ~0.90) for hypertrophic cardiomyopathy and transthyretin amyloid cardiomyopathy across sites/views.
Clinically meaningful lead time: 58% of hypertrophic cardiomyopathy and 46% of transthyretin amyloid cardiomyopathy cases would have screened positive ~2 years before diagnosis.
Prognostic signal: in those without known cardiomyopathy, higher AI probabilities were associated with higher mortality risk (adjusted HR 1.17 and 1.39 for top vs bottom quintile).

Takeaway: A plausible pathway to earlier detection using routine bedside data, reducing diagnostic delay for treatable disease.

3) Digital neonatal neurocritical care at national scale: tele-monitoring as equity infrastructure

Paper: Digital neonatal neurocritical care in Brazil: a retrospective multicentre cohort study of over 11,000 remotely monitored infants (The Lancet Regional Health – Americas, 2025)

Why it matters: Digital health impact also lies in the operational ability to deliver specialist-grade care across geography. This paper describes a large-scale network across 79 NICUs (2017–2024) supported by 24/7 specialist availability, training, and remote multimodal monitoring.

Key results

11,333 neonates, 727,858 hours of remote brain monitoring, and 124,967 interactions between monitoring centers and bedside teams.
Electrographic seizures identified in 18.4%; single antiseizure medication achieved control in 56.1%.

Takeaway: This is digital health as care delivery architecture, explicitly framed as feasible for resource-limited settings and equity-driven scale-up.

4) Generative AI in drug development: an AI-discovered therapy tested in phase 2a

Paper: A generative AI-discovered TNIK inhibitor for idiopathic pulmonary fibrosis: a randomized phase 2a trial (Nature Medicine, 2025)

Why it matters: It puts to the test the claim that GenAI can compress parts of discovery into timelines that reach human trials.

Key results

Phase 2a multicenter double-blind RCT; 12 weeks; 71 patients across three dosing arms vs placebo.
Primary safety endpoint (treatment-emergent adverse events) similar across arms.
Lung function signal: highest dose mean forced vital capacity change +98.4 ml vs −20.3 ml placebo.
Program timeline claim: candidate nomination streamlined to 18 months, and phase 0/1 completion to under 30 months from initiation of target discovery.

Takeaway: AI-originated discovery reaching a controlled phase 2a test, with a measurable physiological signal and concrete safety considerations.

5) RCT of a generative AI therapy chatbot: clinical symptom reductions with engagement metrics

Paper: Randomized Trial of a Generative AI Chatbot for Mental Health Treatment (NEJM AI, 2025)

Why it matters: Randomized trial supporting effectiveness of a GenAI mental health chatbot, including explicit measurement of engagement and therapeutic alliance.

Key results

Adults N=210 with clinically significant symptoms (MDD, GAD, or CHR-FED); 4-week intervention vs waitlist; outcomes at 4 and 8 weeks.
Depression (PHQ-9): mean change −6.13 vs −2.63 at 4 weeks; −7.93 vs −4.22 at 8 weeks; effect sizes d ≈ 0.85–0.90.
Anxiety: mean change −2.32 vs −0.13 at 4 weeks; −3.18 vs −1.11 at 8 weeks; d ≈ 0.79–0.84.
Use exceeded 6 hours on average; alliance reported as comparable to human therapist norms.
Governance caveat: exclusions included active suicidality/mania/psychosis, with multiple safety guardrails and post-transmission human supervision.

Takeaway: It establishes a plausible pathway to evidence-based, scalable mental health care, with engagement and alliance treated as outcomes.

The 2025 pattern: what “impact” looks like now

Across these five papers, “impact” isn’t a bigger AUC. It’s:

Scale (population screening, multi-hospital POCUS, national NICU network)
Workflow integration (human-in-the-loop, training/monitoring, real-world constraints)
Actionability (earlier detection, better yield, expanded access, trial-stage therapeutic testing)

Clinical Trials that will shape medicine in 2026

Gustavo Monnerat PhD — Sat, 20 Dec 2025 13:38:41 GMT

Nature Medicine recently published a list of the clinical trials most likely to shape medical practice in the coming years.

Among all of them, these are the ones I personally found most interesting:

1. Next-generation tuberculosis vaccine (M72/AS01E)

This large phase 3 trial, enrolling 20,000 participants across Africa and Asia, aims to prevent progression to active pulmonary TB.

Why this matters:

The current BCG vaccine works reasonably well in young children, but its protection fades in adolescents and adults, precisely the population responsible for TB transmission. This mismatch has been one of the quiet failures of TB prevention for decades.

M72/AS01E is designed to do something different:
→ prevent progression to active pulmonary TB in adolescents and adults, including people with prior TB exposure and those living with HIV.

Early phase data already showed ~50% reduction in progression to disease. The phase 3 trial is testing whether this effect holds at scale.

If successful, this would represent the first meaningful shift in TB prevention strategy in nearly a century, not by replacing BCG in infants, but by finally addressing the epidemiological core of transmission.

2. Cardiovascular disease beyond cholesterol: inflammation as a causal target

The second trial that stood out focuses on IL-6 inhibition with ziltivekimab.

For years, cardiovascular prevention has been dominated by lipids, with a heavy focus on LDL cholesterol and statin treatment.

However, biology has been telling us a more complicated story. Residual inflammatory risk, often measured by high-sensitivity C-reactive protein, persists even in patients with optimal lipid control.

Ziltivekimab directly targets IL-6, a key inflammatory mediator. If this trial is positive, cardiovascular medicine will no longer be able to justify treating cholesterol while ignoring inflammation.

3. Autoimmune and rare diseases: from suppression to correction

The third area that stood out is where things start to feel like the future. Two approaches deserve attention:

• mRNA CAR-T therapy for autoimmune disease

Instead of permanently editing DNA, these therapies temporarily reprogram immune cells using mRNA. The goal is precision immune reset without long-term genomic modifications. Early trials in diseases like myasthenia gravis show durable symptom control after a short treatment course.

• Prime editing of autologous stem cells in rare diseases

Prime editing allows highly precise correction of genetic defects. In early trials for rare immunodeficiencies, patients’ own stem cells are edited ex vivo and reinfused, avoiding graft rejection and graft-versus-host disease.

Early signals suggest functional cures, even in ultra-rare conditions with limited commercial incentives.

Oral GLP-1 Therapy: Promise, Evidence, and Limits

Gustavo Monnerat PhD — Tue, 16 Dec 2025 15:14:23 GMT

GLP-1 receptor agonists act on a core gut–brain–pancreas axis. They slow gastric emptying, increase satiety via central appetite pathways, enhance glucose-dependent insulin secretion, and suppress glucagon release. The clinical result is a coordinated effect on energy intake, glycaemic control, and cardiometabolic risk. This biology explains both their effectiveness and a consistent pattern seen across the class: weight loss is attenuated in people with established type 2 diabetes compared with those without.

One limitation of GLP-1 therapies, however, lies not in their mechanism but in how they are delivered. Most approved agents are injectable peptides, requiring cold-chain logistics, repeated injections, and healthcare infrastructure that is unevenly distributed worldwide. These barriers affect uptake, persistence, and equity.

Orforglipron represents a structural shift. It is a non-peptide, small-molecule GLP-1 receptor agonist, taken orally once daily, without refrigeration. From a population-health perspective, this difference has direct implications for access and long-term adherence.

The NEJM trial (ATTAIN-1): obesity without diabetes

The ATTAIN-1 trial, published in The New England Journal of Medicine, evaluated once-daily oral orforglipron in adults with obesity without diabetes over 72 weeks. More than 3,100 participants were randomized to orforglipron 6 mg, 12 mg, or 36 mg, or placebo, alongside lifestyle intervention.

At week 72, mean weight loss reached 11.2% in the 36 mg group, compared with 2.1% with placebo. Importantly, weight-loss thresholds were clinically meaningful: more than half of participants on the highest dose achieved ≥10% weight reduction, over one-third achieved ≥15%, and nearly one in five reached ≥20%. Treatment was also associated with improvements in waist circumference, blood pressure, and lipid parameters. Adverse events were predominantly gastrointestinal and led to treatment discontinuation in a minority of patients, more often than with placebo.

Interpretation: In people with obesity without diabetes, oral orforglipron delivers double-digit mean weight loss at higher doses, approaching the lower range of efficacy seen with injectable GLP-1 therapies. Ref: N Engl J Med 2025;393:1796-806.

The Lancet trial (ATTAIN-2): obesity with type 2 diabetes

The ATTAIN-2 trial, reported in The Lancet, studied the same oral agent in adults with obesity or overweight and established type 2 diabetes, again over 72 weeks. A total of 1,613 participants were randomized to orforglipron or placebo as an adjunct to lifestyle modification.

As expected for this population, weight loss was more modest. Mean reductions reached 9.6% with the highest dose, compared with 2.5% with placebo. However, this trial highlighted the drug’s dual metabolic effect: glycated haemoglobin fell by up to 1.66%, and a substantial proportion of participants achieved standard glycaemic targets (<7% and ≤6.5%). Cardiometabolic risk markers, including waist circumference, blood pressure, and lipids, also improved. Gastrointestinal adverse events remained the main tolerability issue, with slightly higher discontinuation rates than placebo.

Interpretation: In people with type 2 diabetes, oral orforglipron provides clinically relevant weight loss together with robust glycaemic improvement, consistent with class biology. Ref: Horn, Deborah B et al. The Lancet 2025

What remains unanswered

Superiority over the most potent injectable GLP-1 therapies has yet to be demonstrated. Weight-loss efficacy is consistently attenuated in diabetes, reflecting disease biology rather than drug failure. Gastrointestinal intolerance remains a key issue. Critically, these studies do not address long-term cardiovascular outcomes, durability after treatment discontinuation, or real-world adherence outside controlled trial settings.

Why this matters

At scale, the impact of obesity and diabetes therapies is shaped not only by efficacy, but by access, acceptability, and persistence. Oral GLP-1 receptor agonists remove several structural barriers associated with injectables, particularly in resource-constrained health systems. If oral administration translates into improved adherence and broader uptake, the population-level impact of these drugs could be substantial.

How GBD 2023 Rewires Our Map of Global Health

Gustavo Monnerat PhD — Sun, 30 Nov 2025 11:59:27 GMT

The new GBD 2023 cycle rewires some of the machinery underneath the numbers everyone quotes for deaths, DALYs and risk factors. That has real implications for how we read trends, design policies, and judge whether a program is working or not.

In this issue, I’ll walk you through:

The core metrics GBD uses (and how not to misinterpret them)
The main methods behind GBD 2023
What really changed compared with previous rounds
How these choices affect policy, health‑system planning and evaluation
Caveats and failure modes – where you should be cautious when using GBD results

1. The key metrics

1.1 Mortality

Number of deaths.
Cause‑specific death rate: deaths per 100 000 population.
Age‑standardised death rate: death rates reweighted to a standard global age structure.
- Why it matters: this is a way to compare a younger country with an ageing country. Without age‑standardisation, older countries always look “worse” just because they’re older.
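Direct age-standardisation fits in a few lines of code. The sketch below uses made-up age bands, rates and population shares (none of this is GBD data), and the helper names are mine. Two hypothetical countries share identical age-specific death rates and differ only in age structure.

```python
def weighted_rate(age_specific_rates, population_shares):
    """Weight each age-specific rate by the share of people in that age band."""
    total_share = sum(population_shares.values())
    return sum(
        age_specific_rates[band] * share / total_share
        for band, share in population_shares.items()
    )


# Hypothetical deaths per 100,000 by age band, identical in both countries.
rates = {"0-39": 50.0, "40-69": 400.0, "70+": 3000.0}

# Hypothetical age structures: a young country, an ageing one, and a standard.
young_structure = {"0-39": 0.75, "40-69": 0.20, "70+": 0.05}
old_structure = {"0-39": 0.45, "40-69": 0.35, "70+": 0.20}
standard_structure = {"0-39": 0.60, "40-69": 0.30, "70+": 0.10}

# Crude rates use each country's own age structure.
crude_young = weighted_rate(rates, young_structure)
crude_old = weighted_rate(rates, old_structure)

# The age-standardised rate uses the same standard structure for both.
standardised = weighted_rate(rates, standard_structure)

print(f"Crude rate, young country: {crude_young:.1f}")
print(f"Crude rate, old country:   {crude_old:.1f}")
print(f"Age-standardised (both):   {standardised:.1f}")
```

The crude rates differ almost threefold even though the underlying risk at every age is the same. After standardisation, both countries get the same number.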

1.2 YLLs, YLDs and DALYs

GBD is built around a simple but powerful idea: not all deaths and not all illnesses are equal.

Years of Life Lost (YLLs)
- Take each death and ask: how many years short of an ideal life expectancy did this person die?
- That “ideal” life expectancy is based on the lowest observed mortality rates globally, by age.
- A death at 20 and a death at 80 no longer count the same.
Years Lived with Disability (YLDs)
- Prevalence of each health state × a disability weight (0 = full health, 1 = equivalent to death).
- YLDs capture the loss of healthy life from non‑fatal conditions.
Disability‑Adjusted Life Years (DALYs)
- DALY = YLL + YLD.
- One DALY = one year of life lost to early death or to living in less‑than‑full health.
- This is GBD’s main metric for burden comparisons across diseases, age groups and countries.
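To make the definitions concrete, here is a toy calculation. Every number is invented for illustration: the remaining life expectancies are not the GBD reference life table, and the disability weight is not an official GBD weight.

```python
def years_of_life_lost(deaths_by_age, remaining_life_expectancy):
    """YLL: for each death, count the reference years of life still remaining."""
    return sum(
        count * remaining_life_expectancy[age]
        for age, count in deaths_by_age.items()
    )


def years_lived_with_disability(prevalent_cases, disability_weight):
    """YLD: prevalence multiplied by a disability weight between 0 and 1."""
    return prevalent_cases * disability_weight


# Hypothetical reference life table: remaining years at age of death.
remaining_le = {20: 68.0, 80: 10.0}

# Ten deaths at age 20 and ten at age 80.
deaths = {20: 10, 80: 10}

yll = years_of_life_lost(deaths, remaining_le)
yld = years_lived_with_disability(prevalent_cases=1000, disability_weight=0.2)
daly = yll + yld

print(f"YLL:  {yll:.0f}")
print(f"YLD:  {yld:.0f}")
print(f"DALY: {daly:.0f}")
```

The ten deaths at age 20 contribute 680 of the 780 YLLs, which is exactly the "a death at 20 and a death at 80 no longer count the same" idea.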

1.3 Risk‑attributable DALYs

GBD doesn’t stop at “how big is the problem?” – it also asks “how much of this could, in principle, be prevented?”

For each risk factor (e.g. blood pressure, smoking, air pollution), GBD estimates:
- The exposure distribution in the population.
- The relative risk curve linking exposure to specific outcomes.
- A theoretical minimum risk level (TMREL): the counterfactual “optimal” exposure.
Then it calculates the population‑attributable fraction (PAF): the fraction of burden that would disappear if everyone were at the TMREL.
Risk‑attributable DALYs are DALYs × PAF for each risk–outcome pair.
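A simplified version of the PAF logic can be sketched with a handful of exposure categories. Treat it as an illustration only: GBD works with continuous exposure distributions and a far more involved pipeline, and the exposure shares, relative risks and DALY total below are all hypothetical.

```python
def population_attributable_fraction(exposure_shares, relative_risks):
    """Categorical PAF, with the TMREL category as the reference (RR = 1).

    PAF = sum(p_i * (RR_i - 1)) / (1 + sum(p_i * (RR_i - 1)))
    """
    excess = sum(
        share * (relative_risks[level] - 1.0)
        for level, share in exposure_shares.items()
    )
    return excess / (1.0 + excess)


# Hypothetical population shares by exposure level (sum to 1.0).
shares = {"at_tmrel": 0.5, "moderate": 0.3, "high": 0.2}

# Hypothetical relative risks versus the TMREL category.
rrs = {"at_tmrel": 1.0, "moderate": 1.5, "high": 3.0}

paf = population_attributable_fraction(shares, rrs)

# Hypothetical total DALYs for the outcome linked to this risk.
total_dalys = 10_000.0
attributable_dalys = total_dalys * paf

print(f"PAF: {paf:.1%}")
print(f"Risk-attributable DALYs: {attributable_dalys:.0f}")
```

If everyone already sat at the TMREL, the excess term would be zero and so would the PAF. That is the counterfactual the metric is built on.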

These are the numbers behind statements like “X% of cardiovascular DALYs in this country are attributable to high blood pressure”.

1.4 Uncertainty intervals

Quantifying uncertainty is a central part of GBD’s modeling strategy. Every GBD estimate comes with a 95% uncertainty interval (UI).

They’re derived from hundreds of random draws through the entire modelling chain.
For users: if two countries’ UIs heavily overlap, treat any rank differences with suspicion.
For communicators: always show UIs – not just single numbers – especially in policy debates.
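The mechanics are easy to illustrate. The sketch below simulates draws for two hypothetical countries, takes the 2.5th and 97.5th percentiles as the 95% UI, and checks whether the intervals overlap. The draws come from a simple normal distribution that I chose for illustration. Real GBD draws come out of the full modelling chain, and pairing draws across countries as done here treats them as independent, which is a simplification.

```python
import random
import statistics


def uncertainty_interval_95(draws):
    """Return the 2.5th and 97.5th percentiles of a list of draws."""
    cuts = statistics.quantiles(draws, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


def intervals_overlap(first, second):
    """True if two (lower, upper) intervals share any values."""
    return first[0] <= second[1] and second[0] <= first[1]


rng = random.Random(42)  # fixed seed so the example is reproducible

# Hypothetical draws of a death rate per 100,000 for two countries.
country_a = [rng.gauss(120.0, 10.0) for _ in range(500)]
country_b = [rng.gauss(125.0, 10.0) for _ in range(500)]

ui_a = uncertainty_interval_95(country_a)
ui_b = uncertainty_interval_95(country_b)
overlap = intervals_overlap(ui_a, ui_b)

# Share of paired draws in which country A has the lower rate.
share_a_lower = sum(a < b for a, b in zip(country_a, country_b)) / len(country_a)

print(f"Country A 95% UI: {ui_a[0]:.1f} to {ui_a[1]:.1f}")
print(f"Country B 95% UI: {ui_b[0]:.1f} to {ui_b[1]:.1f}")
print(f"Intervals overlap: {overlap}")
print(f"Share of draws with A below B: {share_a_lower:.2f}")
```

Country A has the lower point estimate, yet in a large share of draws the order flips. That is why a rank difference inside heavily overlapping UIs deserves suspicion.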

2. Inside GBD 2023: how the numbers are generated

At a high level, GBD 2023 runs a four‑step pipeline:

Demography: estimate all‑cause mortality and life tables.
Cause‑of‑death modelling: split that data into specific causes.
Non‑fatal modelling: estimate incidence, prevalence, sequelae and YLDs.
Comparative risk assessment: attribute pieces of that burden to specific risk factors.

2.1 Demography: OneMod – a new unified mortality engine

The biggest technical change in GBD 2023 is in demography: fragmented life-table approaches are replaced by OneMod, a unified engine that directly models age-specific mortality across time and geography using sociodemographic drivers, pandemic effects and risk exposure.

2.2 Causes of death: CODEm + COVID refinements

Cause of death estimates now combine ensemble machine-learning models with a dedicated COVID misclassification correction and systematic garbage-code redistribution for full cross-country comparability.

2.3 Non‑fatal outcomes: DisMod‑MR and DisMod‑AT

Non-fatal burden is reconstructed through Bayesian disease models that enforce consistency between incidence, prevalence and mortality, with a new generation model capturing cohort effects in rapidly evolving conditions.

2.4 Risk factors: updated evidence, same core logic

GBD 2023 quantifies avoidable burden for 88 risks by combining updated exposure data, meta-analytic relative risks and population-attributable fractions under a theoretical minimum-risk framework.

3. What’s new compared with earlier GBD rounds?

If you’ve been using older GBD numbers, what changed under your feet?

3.1 Life expectancy trends and YLLs for some countries

OneMod directly models age‑specific mortality without imposing external life‑table shapes.
This allows the model to detect new patterns (e.g. mortality reversals, age‑specific shocks) more flexibly.

Implication: life expectancy trends and YLLs for some countries – especially in sub‑Saharan Africa and parts of Asia – are not just extrapolated; they’re more tightly anchored in newly available birth‑history, sibling‑history and surveillance data.

3.2 More granular causes and subnational detail

Across rounds, GBD has:

Expanded from ~100 causes in the 1990s to >370 diseases and injuries.
Reached 660 subnational units in 20 countries for full burden estimation.

GBD 2023 adds new causes and refines some previous groupings. This matters if you follow, for example, CKD, inflammatory bowel disease or specific endocrine conditions – some will now have their own cause codes rather than being hidden in “other” buckets.

3.3 New disease models and updated risk evidence

Shift to more flexible non‑fatal modelling.
CKD and other exemplar conditions in GBD 2023 use more sophisticated staging, aetiology splits and sequelae.
Risk factor evidence has been refreshed with thousands of new data sources and dozens of new systematic reviews.

4. Why this matters for policy and health‑system decisions

You don’t need every technical detail to use GBD intelligently. But you do need to understand how the methods shape the answers you get.

4.1 Priority‑setting: DALYs and risk attribution

DALYs and risk‑attributable DALYs let you:

Rank diseases by total health loss in your country or region.
Decompose that loss into what is probably preventable via modifiable risks.
Compare where you are now to where you “should” be given your Socio-Demographic Index (SDI) (observed vs expected burden).

Good uses:

Identifying under‑prioritised conditions (e.g. mental health disorders, CKD, musculoskeletal conditions) that generate substantial YLDs but relatively few deaths.
Targeting prevention policies toward high‑impact risks (e.g. blood pressure, tobacco, diet, air pollution) where the evidence is strong and attributable DALYs are large.

4.2 Monitoring systems and SDGs

Because GBD 2023:

Re‑estimates the entire time series back to 1990 with new data and methods, and
Provides age‑standardised rates,

…you can use it to:

Track long‑term trends (e.g. premature NCD mortality, under‑5 mortality, maternal deaths) on a consistent basis.
See where your country is an outlier relative to others at similar SDI.
Monitor whether health shocks (COVID‑19, conflict, economic crises) have left persistent impacts on mortality or specific causes.

4.3 Subnational equity and program design

Subnational estimates in GBD 2023 allow more precise questions, such as:

Which states/regions drive national TB, CKD or hypertensive burden?
Are risk patterns (e.g. obesity, alcohol, ambient air pollution) clustering in specific urban corridors?
How does mean age at death differ between regions beyond what their demography would predict?

These insights help align:

Where you deploy primary care expansion, specialist services, or screening programs.
How you sequence policies (e.g. air‑quality regulations vs urban trauma care vs mental health services) to local realities.

4.4 Evaluation and accountability

Because GBD regularly updates with new data and methods, it can function as a mirror for health systems:

If your flagship program claims big mortality reductions, but GBD shows flat age‑standardised death rates and DALYs with narrow UIs, you should question the program, or your data.
If GBD shows accelerated improvements where you scaled a specific intervention (e.g. vaccination, salt reduction policy, tobacco tax), that supports the plausibility of impact, though it’s not a causal proof by itself.

5. Caveats, limitations and how to use GBD safely

Every large global model has limitations. Here are the main pitfalls, and how to minimise them.

5.1 Data gaps and model dependence

Many countries and subnational regions still have

Weak civil registration and vital statistics (CRVS),
Sparse survey or registry data for non‑fatal conditions, and
Limited direct measurements for key risks (especially diet and environmental exposures).

GBD fills gaps using borrowing of strength across time, space and covariates. That’s a feature, not a bug – but it means:

Estimates for data‑poor settings are more model‑driven, and UIs may still understate true uncertainty.

Practical advice:

Always check how many local data sources exist.
For data‑sparse countries, treat exact ranks and small differences with scepticism; focus on orders of magnitude and broad patterns.

5.2 Single underlying cause of death

For mortality, GBD assigns one underlying cause per death, even though multiple causes often coexist.

This is necessary to prevent double-counting of deaths. But it means complex deaths (e.g. multimorbidity in older adults) are forced into a single category.

Practical advice:

Use cause‑specific death rates for high‑level patterns, not for fine‑grained clinical debates about individual pathways.
For multimorbidity questions, lean more on YLDs and comorbidity‑adjusted analyses than on underlying‑cause counts alone.

5.3 Risk attribution is not causal impact evaluation

Risk‑attributable DALYs answer a specific question:

“If we could move everyone to an optimal exposure level, and if the relative risks are truly causal and correctly specified, how much burden would disappear?”

They do not tell you:

What any specific policy will achieve in practice.
How quickly risk reductions translate into outcome changes.
What happens when multiple policies interact.

Practical advice:

Use risk‑attributable DALYs to prioritise directions of impact (e.g. “blood pressure control should be a top priority”), not to set numeric targets (e.g. “this program will save exactly 10 000 DALYs”).

5.4 Continuous revisions

GBD 2023 replaces previous cycles. As methods and data improve, historical estimates are revised.

6. How to bring GBD 2023 into your own work

A few practical ways to operationalise all this:

In teaching and communication
- Use YLL/YLD examples to explain why death counts are not enough.
- Show age‑standardised rates and UIs by default.
- Walk through GBD Compare to explore a specific country or disease.
In policy briefs and strategy documents
- Anchor priority‑setting sections in DALYs and risk‑attributable DALYs, but clearly label them as potentially preventable rather than guaranteed gains.
- Use observed vs expected burden (by SDI) and subnational maps to highlight inequities and outliers.
In program design and evaluation
- Combine GBD trends with local administrative data and study‑level evidence.
- Use GBD to check whether your evaluation findings are broadly consistent with national and regional patterns.

Final thought

GBD 2023 doesn’t just give us new numbers; it gives us a better‑calibrated lens on where humans are losing years of life and years of health.

Used well, that lens helps us ask sharper questions:

Where is the real avoidable burden in this system?
Which risks and conditions are being systematically ignored?
Are we closing gaps in a way that matches the evidence, or chasing noise?

References:

Mark, Patrick B., et al. “Global, regional, and national burden of chronic kidney disease in adults, 1990–2023, and its attributable risk factors: a systematic analysis for the Global Burden of Disease Study 2023.” The Lancet (2025).

Schumacher, Austin E., et al. “Global age-sex-specific all-cause mortality and life expectancy estimates for 204 countries and territories and 660 subnational locations, 1950–2023: a demographic analysis for the Global Burden of Disease Study 2023.” The Lancet 406.10513 (2025): 1731-1810.

Naghavi, Mohsen, et al. “Global burden of 292 causes of death in 204 countries and territories and 660 subnational locations, 1990–2023: a systematic analysis for the Global Burden of Disease Study 2023.” The Lancet 406.10513 (2025): 1811-1872.

I Analyzed the 100 Most-Cited Medical Papers of the Last 3 Years. Here’s the Pattern

Gustavo Monnerat PhD — Mon, 24 Nov 2025 15:44:58 GMT

Every year, thousands of medical papers are published. A few become globally influential. Most disappear.

I wanted to understand why, so I did something simple:
I downloaded the 100 most-cited medical papers indexed in Scopus (2022–2025) and analyzed the patterns.

I expected to see breakthrough trials and innovative AI models. But here’s what I found: The papers dominating global medical influence today are not discoveries — they are reference points.

1. The Most-Cited Papers Are Not Experiments, They Are Maps of Reality

Of the top 100 papers, about two-thirds are not experimental studies at all. They are:

Global burden & mortality reports
GLOBOCAN 2022; cancer statistics 2023–2025; GBDs, global dementia, stroke, RSV, HF, CKD, AMR, and COVID burden.
Clinical guidelines & standards
ESC guidelines, ADA Standards of Care, KDIGO CKD guideline, AHA heart-disease statistics, EULAR, AASLD, ESHRE.
Disease atlases & fact sheets
Alzheimer’s Facts & Figures, IDF Diabetes Atlas, CBTRUS brain tumor statistics.

They don’t introduce new mechanisms.
They don’t test new drugs.
They don’t use cutting-edge AI.

But they shape medical practice, funding decisions, regulatory agendas, and clinical guidelines, so they dominate citations.

If you want to know what the world actually uses to make decisions, it’s these:
maps, frameworks, standards, and burden-of-disease dashboards.

2. Three Disease Areas Dominate

The top-100 list is not evenly distributed.

1. Cancer (~30%)

Global cancer statistics, national cancer reports, breast cancer, colorectal cancer, cervical cancer, liver cancer, China cancer trends.

2. Cardiovascular (~25–30%)

AHA heart disease & stroke statistics, ESC guidelines (ACS, HF, cardiomyopathies, CCS), global HF burden.

3. Metabolic (~15–20%)

Diabetes (IDF Atlas, ADA standards), obesity projections, NAFLD prevalence, CKD burden.

After these:

Mental health & dementia (~10–15%)
Infectious diseases & AMR (~15–20%)

3. AI Papers are Getting Massive Citations

AI/LLM papers are absolutely present in the top-100 dataset:

AI in health and medicine
GPT-4 in medical decision-making
ChatGPT in healthcare education & research

They are hitting 1,000–1,800+ citations within 12–24 months, an insane velocity.

4. High-Impact Papers Follow a Similar Structure

After reviewing all 100, the most influential papers share a single narrative formula, even though their topics are wildly different.

Step 1. Define a global problem

Burden, incidence, mortality, trends.

Step 2. Quantify it with credible data

GBD, WHO, national registries, multi-country cohorts.

Step 3. Interpret the meaning

What does this trend mean for clinical decisions? Resource allocation? Surveillance? Therapeutic gaps?

Step 4. Offer a framework or standard

Guidelines, classifications, staging systems, care pathways, diagnostic algorithms.

Step 5. Make it accessible

Most top-100 papers are open access.

This structure is the real engine of medical influence in 2025.

5. The Most-Cited Papers Are “Public Goods”

The papers that dominate global influence do so for one simple reason:

They solve a shared problem for millions of people.

A guideline is useful to every cardiologist.
A global cancer map is useful to every epidemiologist, policymaker, and oncologist.
A diabetes atlas is useful to every health system.

RCTs are important, but they serve a narrower audience.
AI papers are flashy, but their implementation is still uneven.

But burden maps, guidelines, and classifications?

They’re used everywhere.

This is why they are cited endlessly.
This is why they anchor the literature.
This is why they set the agenda.

The Big Lesson

After analyzing the 100 most-cited papers of 2022–2025, the pattern is undeniable:

Modern medical influence is built on clarity, structure, and population relevance

If you want your work (or your writing, or your AI research) to have outsized impact, focus on:

answering widely shared questions
summarizing complex realities
offering stable frameworks
clarifying uncertainty
making insights usable to large audiences

When Good Science Gets Lost: The Intersection of Timing, Systems, and Communication

Gustavo Monnerat PhD — Fri, 21 Nov 2025 09:54:20 GMT

In 1992, Nature rejected a manuscript by Peter Ratcliffe.
27 years later, the same discovery earned him the Nobel Prize.

When I shared this story on LinkedIn, the post unexpectedly exploded, drawing dozens of comments from clinicians, scientists, statisticians, medical writers, professors, industry leaders, and journal editors.

And the most striking part wasn’t the engagement.
It was the diversity of interpretations of what the rejection meant.

Some readers saw it as evidence of a flawed publishing system.
Some as proof that paradigm-shifting ideas take time.
Others emphasized clarity, communication, and visibility.

After reading every comment and pairing them with what I’ve seen across thousands of articles in top-tier publishing, one conclusion emerged:

Many rejections are not just about brilliance, but also about barriers.
And many of those barriers are avoidable.

This newsletter is about those barriers.

1. The Publishing System Has Limitations, And They Matter

One commenter put it bluntly:
“This just shows how ridiculous the publishing world has become.”

Peer review is human.

Structural limits that influence decisions

• Reviewers are unpaid and overloaded
• Editors receive far more strong papers than they can publish
• Time pressure can lead to scanning, not deep reading
• Interpretation varies widely between reviewers

Even excellent work can be misunderstood, delayed, or overshadowed.

But, and this is important, these are structural realities.

2. Paradigm-Shifting Ideas Arrive Before Their Time

Several experts highlighted a second truth:

Some ideas are too early for the field.

This came up repeatedly in the comments:
“Some ideas are simply too early for their time.”
“Novel ideas often get rejected and scrutinised.”

History supports this:

• Barry Marshall’s Helicobacter pylori work was rejected as a poster.
• Rosalyn Yalow’s radioimmunoassay paper was rejected before earning a Nobel.

As another commenter put it:

“Great ideas don’t become accepted until the field is ready to see them.”

Paradigm shifts require conceptual scaffolding the community may not yet have.

But there is a third dimension, the one authors can influence most.

3. Communication Is Part of Scientific Quality

This was the strongest theme across senior researchers and publication experts:

Clarity is not cosmetic, it is foundational.

One comment captured it well:
“Impact begins with readability.”
And another:
“A brilliant idea poorly communicated is indistinguishable from a weak one.”

Across my editorial work, this pattern is consistent.

When reviewers cannot quickly understand:

• the contribution
• the evidence
• the novelty
• the implications

…they default to caution.
And caution often looks like rejection.

Typical communication problems that trigger rejection

1. The contribution is not explicitly stated.
If reviewers must infer the novelty, they will assume it’s incremental.

2. The data story is buried.
Reviewers scan figures first; if the main contrast is unclear, confidence drops fast.

3. Methods are under-explained or not reproducible.
Lack of transparency is interpreted as lack of rigor.

4. The discussion becomes speculative.
Overstating claims is the fastest way to activate reviewer skepticism.

5. The narrative wanders.
If the thread is unclear, reviewers assume the reasoning is unclear too.

None of these reflect the quality of the idea, only its visibility.

4. The Intersection: Why Good Papers Get Rejected

Most commenters were right, but in different ways.

The truth is not A or B or C.
It’s the intersection:

A. The system has limitations.

Peer review is overworked and imperfect.

B. Paradigm-shifting ideas take time.

New frameworks take years, sometimes decades, to be accepted.

C. Communication is part of scientific quality.

A strong idea with weak clarity becomes invisible.

As I wrote in the follow-up post:

A brilliant idea poorly communicated looks weak.
A brilliant idea ahead of its time still needs evidence readers can follow.
And a brilliant idea reviewed in a fragile system needs every advantage it can get.

These realities coexist.

5. The Part You Can Control: Clarity as a Competitive Advantage

This is why I wrote the original post.

Not to defend peer review.
Not to blame authors.
Not to claim that writing fixes systemic issues.

But because in a world where:

• scientific output doubles every 9–12 years
• reviewers often have minutes, not hours
• AI now produces more text than humans
• novelty is abundant, attention is scarce

clarity has become a competitive advantage in science.

Not the only factor.
Not the magic solution.
But one of the very few that is entirely within an author’s control.

Clarity does not guarantee acceptance.

How Reviewers Interpret It

“This is exploratory or opportunistic.”

Fix

Directly map results to the original research question.
Anything exploratory belongs in secondary analyses.

VI. Overclaiming in the Discussion

Why This Happens

Authors want to emphasize importance; reviewers see that as exaggeration.

How Reviewers Interpret It

“If they are overstating impact, maybe the data are weak.”

Explicitly state what changes because of your findings:
• practice
• policy
• theory
• research priorities
• mechanisms
• modeling assumptions

Impact must be named.

Putting It All Together

Even in imperfect systems, even with delayed recognition, many rejections stem from issues that authors can address with strategy, clarity, and transparency.

The goal is not to eliminate rejection.
The goal is to eliminate avoidable rejection.

Because every manuscript is competing for attention in an environment where:

• reviewers have minutes, not hours
• cognitive load is high
• AI accelerates the volume of text
• clarity has become a competitive advantage

And competitive advantages matter.

7,000 Steps, Causality, and the interpretation danger

Gustavo Monnerat PhD — Mon, 17 Nov 2025 09:53:38 GMT

This month, a new systematic review and dose–response meta-analysis in The Lancet Public Health tried to answer a seemingly simple question:

How many steps per day are actually associated with better health outcomes?

The headline number – “about 7,000 steps per day” – exploded on social media. Underneath my own post, the comments section turned into an impromptu journal club: epidemiologists, clinicians, HEOR people, statisticians, methodologists.

The themes were very consistent:

“Are people healthy because they walk, or walking because they’re healthy?”
“Where is the control for endogeneity?”
“How do we get from an observational HR of 0.53 to the prescription ‘7,000 steps delivers large, clinically meaningful benefits’?”
“What about intensity? What about people who can’t walk 7,000 steps?”

This newsletter is my attempt to decode the evidence for you.

We are going to walk through:

What the study actually did (and did not do).
What the numbers mean, technically.
Why reverse causation and residual confounding are real, not pedantic.
How to reconcile “strong observational signal” with the cautionary tale.
How I would actually use this evidence in practice and in public communication.

1. The study in one paragraph:

The authors did a systematic review and dose–response meta-analysis of prospective cohorts where:

Exposure = device-measured daily steps (accelerometers, pedometers, wearables) in free-living adults (≥18 years).
Outcomes included:
- all-cause mortality
- cardiovascular disease (incidence and mortality)
- cancer (incidence and mortality)
- type 2 diabetes
- dementia and other cognitive outcomes
- depressive symptoms
- physical function
- falls

Key features:

Literature search from 2014–Feb 2025.
57 studies from 35 cohorts in the systematic review;
31 studies from 24 cohorts in the meta-analyses.
Step counts treated as a continuous exposure.
They tested several shapes (linear, splines, quadratic, cubic) and chose the best fit via Bayesian Information Criterion (BIC).
Reference set at 2,000 steps/day (a low but realistic value for older adults).
Hazard ratios (HRs) pooled using one-stage random-effects dose–response meta-analysis.
Risk of bias: Newcastle–Ottawa Scale (most studies scored high).
Certainty: GRADE, starting from “low” (observational cohort) and upgrading where dose–response and consistency were strong.

This is at the upper end of what you can reasonably do with heterogeneous cohort data. It does not turn an observational association into a trial.
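
To make the shape-selection step concrete, here is a minimal sketch of picking a curve by BIC. It is not the authors' one-stage random-effects model: it fits plain least-squares polynomials to made-up points, and the data, the `polyfit` helper, and the `bic` helper are all illustrative assumptions.

```python
import math

# Made-up points for illustration only (NOT data from the paper):
# daily steps in thousands, and a log hazard ratio that flattens at high volumes.
steps_k = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
log_hr = [-0.9 * (1 - math.exp(-(s - 2) / 3.0)) for s in steps_k]

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    m = degree + 1
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]

def bic(xs, ys, coefs):
    """Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n). Lower is better."""
    n = len(xs)
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coefs))) ** 2
              for x, y in zip(xs, ys))
    return n * math.log(rss / n) + len(coefs) * math.log(n)

scores = {name: bic(steps_k, log_hr, polyfit(steps_k, log_hr, degree))
          for name, degree in [("linear", 1), ("quadratic", 2), ("cubic", 3)]}
best_shape = min(scores, key=scores.get)
print(scores, best_shape)
```

The point of the penalty term `k*ln(n)` is that a more flexible curve only wins if it improves the fit by more than its extra parameters cost.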

2. What did they actually find?

I’ll simplify the numbers: 7,000 vs 2,000 steps/day.

Across the pooled analyses:

All-cause mortality
- HR ≈ 0.53 (47% lower risk) at 7,000 vs 2,000 steps.
- Clear inverse, non-linear association: big gains between 2,000 and ~6,000–7,000, then the curve flattens but keeps going down up to ~12,000.
Cardiovascular disease incidence
- HR ≈ 0.75 (25% lower risk) at 7,000 vs 2,000.
- Non-linear, with inflection around 7,800 steps overall.
- Older adults showed a lower “elbow” (~5,400 steps).
Cardiovascular disease mortality
- HR ≈ 0.53 at 7,000 vs 2,000, but:
- only three cohorts; substantial heterogeneity; result sensitive to one large study.
- Certainty: low.
Cancer
- Incidence: small, imprecise association (HR 0.94; CI crosses 1).
- Mortality: stronger association (HR 0.63), but again few cohorts.
Type 2 diabetes
- HR ≈ 0.86 at 7,000 vs 2,000; roughly linear trend -> more steps, lower diabetes incidence.
Dementia
- HR ≈ 0.62 at 7,000 vs 2,000, with a non-linear curve and inflection closer to ~8,800 steps.
- Only two cohorts, but consistent direction and dose–response.
Depressive symptoms
- HR ≈ 0.78 at 7,000 vs 2,000; linear association.
Falls & physical function
- Falls: inverse non-linear association in pooled model, but very low certainty; deleting one large cohort made the association fragile and possibly U-shaped.
- Physical function: heterogeneous studies, no pooled HR; broadly consistent with “more steps, less functional decline,” but evidence is patchy.
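
For readers who want the arithmetic behind "HR 0.53 means 47% lower risk", here is a small sketch. The hazard ratios are the ones quoted in this newsletter; the function name `percent_lower_risk` is my own, and the output is a relative reduction in hazard, not an absolute risk difference.

```python
# Pooled hazard ratios at 7,000 vs 2,000 steps/day, as quoted in this newsletter.
hazard_ratios = {
    "all-cause mortality": 0.53,
    "CVD incidence": 0.75,
    "CVD mortality": 0.53,
    "cancer incidence": 0.94,
    "cancer mortality": 0.63,
    "type 2 diabetes": 0.86,
    "dementia": 0.62,
    "depressive symptoms": 0.78,
}

def percent_lower_risk(hr):
    """Relative reduction in hazard implied by an HR below 1, in whole percent."""
    return round((1 - hr) * 100)

for outcome, hr in hazard_ratios.items():
    print(f"{outcome}: HR {hr:.2f} -> {percent_lower_risk(hr)}% lower (relative)")
```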

On GRADE, they rated certainty as moderate for:

all-cause mortality
CVD incidence
cancer mortality
type 2 diabetes
dementia
depressive symptoms

and low or very low for cancer incidence, CVD mortality, physical function, and falls.

So far so good. But now we hit the hard questions.

3. The elephant in the room: reverse causation and residual confounding

From comments on my LinkedIn post:

“Sick people can’t take as many steps.”

That is not a side note; it is central.

3.1 What the cohorts did try to do

Most primary studies:

excluded people with major disease at baseline or did sensitivity analyses excluding early events;
adjusted for:
- age and sex
- BMI
- smoking
- self-reported health
- sometimes education, income, etc.

The meta-analysis only included prospective cohorts (exposure measured before outcomes), which is the minimum requirement to talk about temporal sequence.

3.2 What they cannot solve

But there are at least three layers of problems they cannot fully adjust away:

Subclinical or under-documented disease
A patient can look “healthy” in a registry and still have:
- early heart failure,
- COPD,
- cancer,
- neurodegenerative disease.
  These conditions reduce step counts long before they are coded.
Frailty and functional reserve
Two 75-year-olds with identical ICD codes can have dramatically different:
- muscle mass,
- balance,
- sarcopenia,
- falls risk.
  Those factors drive both the ability to walk and the risk of death, but are rarely measured well.
Lifestyle clustering
The 7,000-step walker is not just walking more. On average, they are also more likely to:
- eat differently,
- sleep differently,
- engage in other forms of physical activity,
- have different jobs and stress profiles,
- be more health-conscious overall.

Even with careful adjustment, these domains are incompletely captured. So the HR of 0.53 at 7,000 vs 2,000 is not the isolated “effect of 5,000 extra steps” in an otherwise unchanged life.

It is a composite contrast between two real-world phenotypes:

“people who habitually live around ~2,000 steps/day”
vs
“people who habitually live around ~7,000 steps/day.”

Those phenotypes differ on much more than step counts.
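
A toy simulation shows how strong this effect can be. In the sketch below, an unmeasured "health" variable drives both step counts and death, and steps have no causal effect at all. Every parameter is invented for illustration, and the output is a crude risk ratio, not a hazard ratio from any cohort.

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def simulate(n=50_000):
    """Cohort where latent health drives both steps and death; steps have NO causal effect."""
    low_steps = [0, 0]   # [deaths, people] under 4,000 steps/day
    high_steps = [0, 0]  # [deaths, people] at 7,000+ steps/day
    for _ in range(n):
        health = random.gauss(0, 1)  # unmeasured frailty / functional reserve
        steps = 5500 + 2000 * health + random.gauss(0, 1000)
        # Death probability depends on health only, never on steps.
        p_death = 1 / (1 + math.exp(2.0 + 1.0 * health))
        died = random.random() < p_death
        if steps < 4000:
            group = low_steps
        elif steps >= 7000:
            group = high_steps
        else:
            continue
        group[0] += died
        group[1] += 1
    risk_low = low_steps[0] / low_steps[1]
    risk_high = high_steps[0] / high_steps[1]
    return risk_low, risk_high, risk_high / risk_low

risk_low, risk_high, risk_ratio = simulate()
print(f"risk low steps: {risk_low:.3f}, high steps: {risk_high:.3f}, ratio: {risk_ratio:.2f}")
```

The high-step group shows a much lower death risk even though walking does nothing in this simulated world. Real cohorts adjust for measured proxies of health, which shrinks this bias but cannot remove what was never measured.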

3.3 Why this matters

Another comment invoked the hormone replacement therapy (HRT) story, and that is a good analogy to keep in mind.

Observational data once suggested HRT reduced cardiovascular risk.
Women who used HRT were healthier, wealthier, more educated – the “healthy user effect.”
RCTs later showed no such benefit, and in some cases, harm.

Physical activity is not HRT – we have mechanistic and experimental evidence that exercise improves multiple risk factors. But the logic holds:

If we treat a large observational association as the causal effect size of a behavior, we are vulnerable to being misled.

So, for this paper, we should avoid language like:

“Adding 5,000 steps reduces mortality by 47%.”

and prefer something like:

“People who accumulate ~7,000 steps/day have substantially lower risk than those around 2,000, even after adjustment – consistent with a strong protective role for higher movement, plus healthier underlying status.”

That may sound less impactful, but it is more accurate.

4. Why I still take this paper seriously

Given all that, why not just shrug and say “correlation ≠ causation, case closed”?

Because in this specific domain, we have three additional pillars:

Decades of causal-leaning evidence on physical activity and cardiorespiratory fitness (CRF)
- Exercise trials demonstrate improvements in blood pressure, insulin sensitivity, lipids, functional capacity.
- Higher CRF is one of the most powerful predictors of mortality we have.
Biological plausibility
- We are not inventing a novel exposure with unknown mechanisms.
- The pathways from movement → cardiovascular and metabolic health are well characterized.
Consistency and dose–response across outcomes and cohorts
- The same basic pattern appears for mortality, CVD, diabetes, dementia, depressive symptoms.
- Multiple cohorts, different countries, device types, age groups.

So I interpret the paper like this:

The exact magnitude of risk reduction is uncertain and likely inflated at the very low end by reverse causation.
The direction (more steps = lower risk, within reasonable limits) is extremely robust and biologically plausible.
The shape – steep gains from very low levels up to roughly 6–8,000 steps and then diminishing returns – is very similar to classic physical activity dose–response curves (e.g. for moderate-to-vigorous minutes).

In other words:
This paper does not create a causal story from scratch; it quantifies, in step counts, a story we already believed based on stronger evidence about movement and CRF.

5. What about the magic number: is 7,000 steps “real”?

Let’s deconstruct “7,000” carefully.

5.1 Where does 7,000 come from?

It is not a biological breakpoint. It is a pragmatic anchor where:

For most outcomes, HRs at 7,000 vs 2,000 are already substantially lower.
Above 7,000, the dose–response curve continues downward, but further gains are smaller.
The full range up to ~12,000 shows no obvious harm signal in general adults.

The inflection points they estimate are:

All-cause mortality: ~5,400 steps.
CVD incidence: ~7,800 steps overall, lower in older adults.
Dementia and falls: ~8,800 steps (with high uncertainty for falls).

When they tabulate 1,000-step increments, 7,000 emerges as a point where:

the curves have clearly “bent” away from the steepest risk,
the absolute risk reduction vs very low steps is large,
the incremental benefit of pushing everyone to 10–12k becomes smaller (not zero, but smaller).

So 7,000 is best read as:

A policy-friendly level where feasibility meets substantial benefit for many adults, especially those currently around 2–4,000.

Not as:

“The true optimal number of steps for all humans.”

5.2 What about 10–12k steps? “No upper limit”?

Several people correctly pointed out that:

For all-cause mortality, going from 7,000 to 12,000 in the model yields an additional ~20% relative risk reduction vs 7,000 (depending on how you compute it).
The pooled curves do not show increased risk at high step volumes in typical general-population cohorts.
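
The "depending on how you compute it" caveat is worth spelling out. The sketch below uses only two numbers quoted in this newsletter (HR 0.53 at 7,000 vs 2,000, and roughly 20% further reduction vs 7,000). The implied HR at 12,000 vs 2,000 is my own arithmetic from those two figures, not a value reported by the paper.

```python
hr_7k_vs_2k = 0.53          # all-cause mortality, as quoted above
extra_relative_drop = 0.20  # "~20% ... vs 7,000", as quoted above

# Changing the reference level: HR(12k vs 2k) = HR(12k vs 7k) * HR(7k vs 2k)
hr_12k_vs_7k = 1 - extra_relative_drop
hr_12k_vs_2k = hr_12k_vs_7k * hr_7k_vs_2k  # implied by the two inputs, not reported

# Two ways to describe the same move from 7,000 to 12,000 steps:
relative_to_7k = (1 - hr_12k_vs_7k) * 100                # % lower than at 7,000
points_on_2k_scale = (hr_7k_vs_2k - hr_12k_vs_2k) * 100  # points on the 2,000-reference scale

print(f"HR 12k vs 2k (implied): {hr_12k_vs_2k:.3f}")
print(f"Relative to 7,000: {relative_to_7k:.0f}% lower")
print(f"On the 2,000-reference scale: {points_on_2k_scale:.1f} percentage points")
```

Same curve, two honest descriptions: about 20% lower than at 7,000, or about 11 points further down on the scale anchored at 2,000. Which one gets quoted changes how large the extra benefit sounds.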

So a more nuanced message is:

If you are highly sedentary, getting to 4–5,000 is already a big win.
For many adults, 7,000 is a realistic intermediate target with clear associated benefits.
For those who enjoy higher levels and tolerate them well, 10–12,000 very likely adds further protection; the slope is simply less steep.

There is no conflict between:

“7,000 is a realistic benchmark with strong evidence,” and
“there is no clear upper limit; more is better up to at least ~10–12k for most people.”

Both are true, depending on who you are talking to.

6. Volume vs intensity: are all steps equal?

One repeated concern in the comments:

“1,000 steps wandering in a small shop is not the same as 1,000 steps in a brisk 60-minute walk.”

The paper actually tackles this in a secondary analysis on cadence (step rate):

They looked at metrics like peak 30-min step cadence.
Findings:
- Higher peak 30-min cadence is associated with lower all-cause mortality.
- But when they adjust for total step volume, many cadence associations become non-significant.

Translation:

In these datasets, how much you move (total steps) is the dominant signal.
Intensity adds something, but is strongly correlated with total volume and harder to disentangle.

For practice, my synthesis is:

For population messaging, you can safely emphasize total daily step volume: it is simple, objective, and strongly predictive.
For clinical prescriptions and CRF improvement, you still want bouts of higher-intensity / brisk walking – steps plus “getting out of breath” is what moves VO₂max.

Steps are a good monitor of movement behaviour, not a replacement for thinking about intensity.

7. What about people who cannot walk 7,000 steps?

Several clinicians raised an important point:

“What about patients with severe knee complications, heart failure, neurological disease? Or people whose main exercise is cycling, rowing, or swimming?”

This is not a limitation of the data only; it is also a communication risk.

A few points to keep straight:

The cohorts included various special populations (diabetes, lung disease, liver disease, HF), but they are a minority.
Step counts are a poor metric for:
- cycling, rowing, swimming,
- many wheelchair users,
- people who primarily do upper-body training.

So:

For the general ambulatory adult population, daily steps are a practical and useful metric.
For people who cannot accumulate many steps:
- you should translate the same dose of movement into other modalities (e.g. minutes of cycling, water exercise, resistance training),
- and not enforce step targets that are unrealistic or unsafe.

Guidelines still rest on total moderate-to-vigorous physical activity, of which walking is just one mode.

8. How I would actually communicate this to different audiences

8.1 For clinicians and health professionals

A formulation I’m comfortable with:

“This meta-analysis pooled data from more than 20 prospective cohorts and found a strong, consistent association: people who accumulate more daily steps have substantially lower risks of premature death, cardiovascular disease, diabetes, dementia and depression.
The biggest gains happen when you move from very low levels (around 2,000 steps/day) up to about 6–8,000. Beyond that, risk continues to fall, but more gradually.
It is observational, so some of that effect is because healthier people can walk more. But taken together with trial and mechanistic evidence on physical activity and fitness, it strongly supports a simple, pragmatic message:
If you can, move more than you do today. For many adults, working gradually toward ~7,000 steps/day, and beyond if tolerated, is both realistic and clinically meaningful.”

…followed by nuance for the individual patient:

joint disease, frailty, comorbidities, preferences, non-walking exercise, etc.

8.2 For policy and public health

I would not frame this as “replace 150 minutes/week with 7,000 steps/day and we’re done.”

Instead:

Integrate steps as a parallel metric in guidelines:
- e.g. “150–300 minutes/week of moderate intensity OR roughly 7–10k steps/day on most days, for those who can walk.”
Use 7,000 as a realistic target in population messaging, especially for highly sedentary groups, while still celebrating higher volumes.
Make explicit that any increase from a very low baseline is beneficial.

8.3 For researchers and methodologists

The paper itself raises some interesting opportunities:

Harmonised re-analyses with causal inference frameworks (e.g. marginal structural models, G-formula) across cohorts, focusing on:
- time-varying confounding (illness → activity → outcomes),
- competing risks in older adults.
More precise evaluation of:
- device type and wear location,
- step-derived thresholds by age, frailty, and baseline health.
Trials that use step counts as both target and adherence measure for physical activity interventions, connecting change in steps → change in intermediate outcomes → hard endpoints.

For now, however, the meta-analysis is the best “big picture” we have on steps specifically.

9. My bottom line for “Evidence Decoded”

If I had to compress this paper into three sentences for you:

This Lancet Public Health meta-analysis offers a strong, observational dose–response signal: more daily steps are associated with lower risks of mortality, CVD, diabetes, dementia and depressive symptoms, with large gains as you move from ~2,000 to ~7,000 steps/day.
The numbers are not causal effect sizes; they are shaped by reverse causation, residual confounding, and selection bias. We should not claim that “prescribing 5,000 extra steps halves mortality” for any given individual.
When interpreted in the context of decades of evidence on physical activity and cardiorespiratory fitness, the study supports a pragmatic, honest message:

“If you can walk, every extra 1,000 steps from a low baseline is likely to help. For many adults, gradually working toward around 7,000 steps/day – and beyond if you enjoy it – is a realistic and clinically meaningful goal, alongside other forms of movement, diet, sleep and risk-factor control.”

Reference: Ding, Ding, et al. “Daily steps and health outcomes in adults: a systematic review and dose-response meta-analysis.” The Lancet Public Health 10.8 (2025): e668-e681.

Reporting Guidelines for Medical AI

Gustavo Monnerat PhD — Thu, 28 Aug 2025 12:10:01 GMT

Artificial intelligence is rapidly reshaping medical research, but one challenge persists: most studies are still reported inconsistently. Poor reporting makes it hard to reproduce findings, assess clinical validity, or even trust that an algorithm will work outside the study setting.

To close this gap, several peer-reviewed guidelines have been published (or extended) specifically for AI in healthcare. They provide structured checklists for authors, reviewers, and editors, covering all relevant points from clinical trials to diagnostic accuracy studies, prediction models, and early implementation pilots.

The major frameworks now in use include:

CONSORT-AI and SPIRIT-AI — for randomized trials and trial protocols involving AI.
TRIPOD-AI — for diagnostic and prognostic prediction model studies.
PROBAST-AI — for assessing risk of bias in AI-based prediction models.
DECIDE-AI — for early-phase, real-world clinical evaluations.

Together, these guidelines aim to bring order and transparency to a fast-moving field. Below is a concise overview of each, with their scope, key checklist elements, and what we know so far about adoption.

CONSORT-AI (2020): Randomized Trials with AI

Published in Nature Medicine, BMJ, and Lancet Digital Health in 2020, CONSORT-AI is an official extension of CONSORT 2010. Its checklist was designed for randomized controlled trials (RCTs) where the intervention includes an AI system.

What’s new?
CONSORT-AI adds specific items. These cover:

a clear description of the AI system and intended use,
how input data and outputs are handled,
the role of human oversight,
required user expertise or training, and
how errors or system failures were analyzed.

Why it matters: Early AI trials often missed some of these details, making replication and appraisal difficult. CONSORT-AI pushes authors to spell out how the AI actually worked in the trial context.

Adoption so far: The guideline is widely cited, and leading journals now reference it in instructions for authors. However, a recent analysis found that only a small share of AI RCTs explicitly mentioned CONSORT-AI, and details like error analyses or human-AI interaction were often missing.

SPIRIT-AI (2020): Trial Protocols for AI

Published in BMJ, The Lancet Digital Health and Nature Medicine in 2020, SPIRIT-AI complements CONSORT-AI by focusing on protocols — the “what you plan to do” before the trial starts.

Key additions:

Define the AI intervention and its integration into workflow.
Specify data inputs, training, or tuning to be done during the trial.
Outline how human-AI interactions will occur.
Pre-plan how errors or updates to the algorithm will be handled.

Adoption so far: SPIRIT-AI is gaining adoption, but protocols often miss AI-specific details, like how outputs will guide decisions.

TRIPOD-AI (2024): Prediction Model Studies

TRIPOD+AI was published in BMJ in 2024 as an update to the original 2015 TRIPOD guideline. It applies to any study developing or validating diagnostic or prognostic prediction models, whether using classical statistics or machine learning.

What’s new?
TRIPOD-AI reorganizes the original items, adding detail for AI methods:

Clear description of data sources and preprocessing.
How hyperparameters were tuned and models validated.
Model interpretability (what inputs were used, whether it’s explainable).
Subgroup analyses to identify bias.
Transparency around fairness, trustworthiness, and open science (e.g. code sharing).

Why it matters: Prediction models are among the most common AI studies in healthcare, but reporting has frequently been poor. TRIPOD-AI sets a higher bar, addressing both regression and machine learning methods.

Adoption so far: TRIPOD 2015 is one of the most cited reporting guidelines in medicine, and TRIPOD-AI is expected to follow that trajectory. Journals are already updating instructions for authors to align with TRIPOD-AI.

PROBAST-AI (2025): Risk of Bias in AI Models

PROBAST-AI allows all stakeholders (eg, model developers, AI companies, researchers, editors, reviewers, healthcare professionals, guideline developers, and policy organisations) to examine the quality, risk of bias, and applicability of any type of prediction model in the healthcare sector. It’s an update to the widely used PROBAST (2019) tool, published in BMJ in 2025.

What’s new?
The core four domains — Participants, Predictors, Outcomes, and Analysis — are expanded with AI-specific criteria:

Was the dataset representative, or prone to bias?
Was there data leakage between training and test sets?
Was validation independent and robust?
Did the model perform fairly across demographics?
Was algorithm complexity or updating properly described?
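
One of these questions, data leakage, can be partly checked in a few lines. The sketch below tests one common form only: the same patient appearing in both splits. PROBAST-AI does not prescribe this code; the function name and the IDs are hypothetical.

```python
def patient_overlap(train_ids, test_ids):
    """Return patient IDs present in both splits; a non-empty result signals leakage."""
    return sorted(set(train_ids) & set(test_ids))

# Hypothetical patient identifiers, for illustration only.
train_ids = ["p001", "p002", "p003", "p004"]
test_ids = ["p004", "p005", "p006"]

leaked = patient_overlap(train_ids, test_ids)
print(leaked)  # ['p004']
```

An empty result does not prove the split is clean. Leakage also happens through preprocessing fitted on the full dataset, or through repeat visits stored under different identifiers.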

Why it matters: Many AI prediction models look good on paper but fail in real-world practice due to poor validation or hidden biases. PROBAST-AI equips systematic reviewers, regulators, and journal editors with sharper tools to flag high-risk studies.

Adoption so far: Given PROBAST’s popularity, PROBAST-AI is likely to become standard in AI-focused evidence syntheses.

DECIDE-AI (2022): Early Clinical Evaluations

Most AI tools fail not in development, but when tested in clinical workflows. DECIDE-AI, published in Nature Medicine and BMJ in 2022, tackles this gap by focusing on early, exploratory studies (the “Phase 1–2” equivalent for AI).

Checklist focus:

Human factors: how clinicians used the AI, and what training was needed.
Workflow integration: setting, version control, and system updates.
Monitoring for drift: how the AI performed outside its training data.
Safety and bias: tracking errors, subgroup performance, and equity.
Decision-making: how AI outputs actually influenced clinical judgments.

Why it matters: DECIDE-AI treats AI tools as complex interventions, highlighting usability and safety before large-scale RCTs.

Adoption so far: Broadly cited and increasingly referenced in pilot implementation studies. Regulators also recognize it as a way to build early clinical evidence. DECIDE-AI is closing a critical reporting gap between algorithm development and large clinical trials.

Emerging and Specialized Guidelines

Alongside these core frameworks, niche checklists are filling field-specific needs:

CLAIM (updated 2024): Reporting of AI in Medical Imaging studies.
TRIPOD-LLM (2025): Reporting of studies that are developing, tuning, prompt engineering or evaluating a large language model (LLM).

These specialized frameworks echo the same principles: transparency about data, algorithms and validation, but tailored to each field’s unique challenges.

Big Picture: Why This Matters

The explosion of AI in healthcare has been matched by a wave of guidelines designed to standardize reporting.

CONSORT-AI / SPIRIT-AI: Randomized trials and their protocols.
TRIPOD-AI / PROBAST-AI: Prediction models and bias assessment.
DECIDE-AI: Early, real-world evaluations.
Specialty extensions: Imaging, LLMs, and beyond.

Collectively, these frameworks set a higher bar for transparency, reproducibility, and fairness. Adoption is growing, journals are beginning to require them, and researchers are citing them, but compliance remains imperfect.

The takeaway: reporting guidelines are only as powerful as their adoption. For AI to move from hype to trusted clinical practice, researchers, reviewers, and journals need to treat these checklists not as optional add-ons, but as essential scientific infrastructure.