OpenAI o3 aids 18 rare childhood genetic disease diagnoses

An AI reasoning model helped physicians diagnose 18 previously unsolved cases of rare genetic disease in children, according to a joint study from Boston Children’s Hospital, Harvard, and OpenAI published June 18, 2026 in NEJM AI. The OpenAI o3 Deep Research tool delivered a 4.8% additional diagnostic yield after reanalyzing 376 long-unsolved cases collected by Boston Children’s Hospital’s Manton Center for Orphan Disease Research.

OpenAI o3 Deep Research workflow design for rare disease reanalysis

Rare disease genomic reanalysis is as much a maintenance and resource allocation problem as a scientific one. A patient’s genome remains static over time, but new gene-disease associations, variant reclassifications, and published case reports can convert previously inconclusive genomic results into solvable diagnostic cases.

An estimated 50% of individuals with rare diseases receive no diagnosis even after full genomic sequencing and specialist evaluation, per the study, due to fragmented clinical records, millions of potential genetic variants, and rapidly expanding scientific literature that can obscure relevant diagnostic signals.

For the study, researchers built a custom workflow that integrated the o3 Deep Research model as a reasoning layer operating atop existing genomic analysis pipelines. Rather than outputting only a ranked list of candidate genes, the model was prompted to synthesize clinical features, inheritance patterns, variant evidence, and relevant peer-reviewed literature into a human-readable justification for each proposed diagnosis.

Each case input packet included standardized Human Phenotype Ontology terms describing the patient’s clinical presentation, de-identified clinician notes, age and gender metadata, and a filtered variant table logging each variant’s rarity, predicted protein effect, ClinVar classification, and segregation signal quality across family members. Most packets contained genomic data from the affected child and both biological parents.

For example, a case packet for a patient with undiagnosed neurodevelopmental delay included HPO terms for intellectual disability and speech delay, alongside filtered variants including a TSC2 variant reclassified from uncertain significance to likely pathogenic in ClinVar in 2024.
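The packet layout described above can be pictured as a small record type. The following is only an illustrative sketch: the class names, field names, and encodings are assumptions made for this example and are not taken from the study, and fields the article does not specify for the example case are left empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    """One row of the filtered variant table (hypothetical field names)."""
    gene: str
    clinvar_classification: str
    # The article lists these attributes but not how they are encoded.
    population_frequency: Optional[float] = None   # rarity
    predicted_protein_effect: Optional[str] = None
    segregation_quality: Optional[str] = None      # signal across family members


@dataclass
class CasePacket:
    """Input bundle for one case (hypothetical structure)."""
    hpo_terms: List[str]              # standardized phenotype terms
    clinician_notes: str              # de-identified free text
    age_years: Optional[float] = None
    gender: Optional[str] = None
    variants: List[Variant] = field(default_factory=list)
    includes_both_parents: bool = True  # most packets had child plus both parents


# The neurodevelopmental example from the article, with unknown fields left unset.
packet = CasePacket(
    hpo_terms=["Intellectual disability", "Speech delay"],
    clinician_notes="(de-identified notes omitted)",
    variants=[Variant(gene="TSC2", clinvar_classification="likely pathogenic")],
)
print(packet.variants[0].gene, packet.variants[0].clinvar_classification)
```

Keeping unspecified attributes as `None` rather than guessing values mirrors how the article presents the case: only the phenotype terms and the reclassified TSC2 variant are stated.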

Validation accuracy across diverse rare disease cohorts

Before testing the 376 unsolved cases, the team refined the prompt and review protocol using cases with already confirmed diagnoses to validate workflow reliability. In duplicate test runs across 51 cases covering a range of rare genetic conditions, the workflow correctly identified the causative gene and variant in 48 cases, a 94.1% accuracy rate for this validation set.

For a separate cohort of 57 rare neuromuscular disease cases, the workflow returned the correct diagnosis in 45 cases across duplicate runs, a 78.9% accuracy rate for that group.

The 94.1% accuracy rate for the general rare genetic condition cohort was 15.2 percentage points higher than the 78.9% rate for the neuromuscular disease subgroup, which the team attributed to the more heterogeneous symptom presentation of neuromuscular conditions.
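The validation percentages follow directly from the reported counts. As a quick check, assuming 48 correct out of 51 for the general cohort and 45 correct out of 57 for the neuromuscular cohort (the helper name `accuracy_pct` is made up for this sketch):

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correctly solved cases, rounded to one decimal place."""
    return round(100 * correct / total, 1)


general = accuracy_pct(48, 51)        # general rare genetic condition cohort
neuromuscular = accuracy_pct(45, 57)  # neuromuscular cohort
gap = round(general - neuromuscular, 1)  # difference in percentage points

print(general, neuromuscular, gap)  # prints: 94.1 78.9 15.2
```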

Testing on 15 long-read genome cases produced the strongest gene-level results: the workflow correctly identified the causative gene in all 15 cases, and detected both disease-causing alleles in 12 of those 15 cases, an 80% full biallelic detection rate. Across all validation tests, the model’s self-reported confidence scores tracked accuracy: the mean minimum confidence score for consistently correct diagnostic calls was 85.6, versus 42.1 for incorrect or unconfirmed calls.

The research team explicitly noted these confidence scores were not calibrated probabilities, and were not used as a substitute for genetic evidence or clinical adjudication. Instead, they functioned as a prioritization tool to help expert reviewers focus on the most promising candidate diagnoses for further investigation.
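One way to picture confidence scores used for prioritization rather than decision-making is a simple review queue. This is a hypothetical sketch, not the study's tooling: the `Candidate` type, the case IDs, the gene placeholders, and the scores are all invented for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    case_id: str
    gene: str
    confidence: float  # model self-reported score, NOT a calibrated probability


def review_queue(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates for expert review, highest confidence first.

    Nothing is accepted or discarded here; the score only decides
    which candidates reviewers look at first.
    """
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


# Invented example data.
queue = review_queue([
    Candidate("case-001", "GENE_A", 40.0),
    Candidate("case-002", "GENE_B", 90.0),
    Candidate("case-003", "GENE_C", 65.0),
])
print([c.case_id for c in queue])  # prints: ['case-002', 'case-003', 'case-001']
```

Sorting instead of thresholding keeps every candidate visible to reviewers, which fits the team's caution that the scores are not probabilities.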

4.8% additional diagnostic yield across 376 long-unsolved cases

The team applied the finalized workflow to four distinct cohorts of long-unsolved cases: pediatric patients with neurodevelopmental disorders, individuals with rare neuromuscular conditions, children and adolescents with early-stage psychosis, and pediatric cases of sudden unexpected death. None of these cases were new to analysis: nearly all had already been run through multiple commercial and institutional genomic pipelines and reviewed by multidisciplinary specialist teams, with no diagnosis identified in any prior assessment.

For every candidate diagnosis output by the model, at least two members of the study team independently reviewed the supporting evidence, with any disagreements resolved via group consensus.

A model-generated candidate was only counted as a confirmed diagnosis after the relevant variant was classified as pathogenic or likely pathogenic per ACMG/AMP variant classification standards, a CLIA-certified laboratory confirmation was obtained, and the patient’s clinical care team formally returned the result to the family.
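The counting rule above is a conjunction of three conditions, which can be expressed as a small predicate. The sketch below assumes hypothetical names (`CandidateStatus`, `is_confirmed_diagnosis`) and a plain-string encoding of the variant classification; it illustrates the logic only and is not the study's code.

```python
from dataclasses import dataclass

# Classifications that qualify under the rule described in the article.
CONFIRMING_CLASSES = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateStatus:
    acmg_classification: str   # ACMG/AMP class assigned to the variant
    clia_confirmed: bool       # CLIA-certified laboratory confirmation obtained
    returned_to_family: bool   # care team formally returned the result


def is_confirmed_diagnosis(status: CandidateStatus) -> bool:
    """True only when all three conditions hold."""
    return (
        status.acmg_classification.strip().lower() in CONFIRMING_CLASSES
        and status.clia_confirmed
        and status.returned_to_family
    )


print(is_confirmed_diagnosis(CandidateStatus("Likely pathogenic", True, True)))   # prints: True
print(is_confirmed_diagnosis(CandidateStatus("Likely pathogenic", True, False)))  # prints: False
```

A candidate with a strong variant classification but no laboratory confirmation, or one not yet returned to the family, stays out of the count.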

The 4.8% additional diagnostic yield, 18 confirmed diagnoses out of 376 cases, works out to roughly 1 new diagnosis for every 21 cases reanalyzed, for patients who have waited years for answers. The yield is notable because these cases had already undergone extensive expert analysis without a diagnosis.
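The yield figure and the "1 in 21" framing both come from the same two numbers, 18 confirmed diagnoses and 376 reanalyzed cases. A short check (variable names are arbitrary):

```python
confirmed = 18    # confirmed new diagnoses
reanalyzed = 376  # long-unsolved cases reanalyzed

yield_pct = round(100 * confirmed / reanalyzed, 1)    # additional diagnostic yield
cases_per_diagnosis = round(reanalyzed / confirmed)   # cases reanalyzed per new diagnosis

print(yield_pct, cases_per_diagnosis)  # prints: 4.8 21
```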

Safeguards ensuring clinical validity of AI-generated diagnoses

The o3 Deep Research model was not authorized to issue independent diagnostic determinations. Its function was limited to surfacing evidence-backed candidate diagnostic explanations for specialist review, with all 18 confirmed diagnoses validated through ACMG/AMP variant classification standards, CLIA-certified laboratory confirmation, and formal sign-off from the patient’s clinical care team before results were disclosed to families.

The study frames the o3 Deep Research model as a scalability solution for a well-documented bottleneck in rare disease care, not a replacement for clinical judgment.

As new gene-disease associations and variant reclassifications are published at a steady pace, the gap between inconclusive genomic test results and actionable diagnoses will continue to widen without more efficient reanalysis workflows, per the research team.

Last updated: Jun 20, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.