A peer-reviewed study published in NEJM AI on June 18, 2026, finds that OpenAI’s o3 Deep Research reasoning model helped clinicians confirm 18 previously unsolved rare childhood genetic disease diagnoses in a backlog of 376 long-unsolved cases, a 4.8% additional diagnostic yield over prior expert analysis. The research was conducted by a joint team from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI, per OpenAI’s official study announcement.
All 376 de-identified cases had undergone multiple rounds of standard genomic analysis and multidisciplinary specialist review with no confirmed diagnosis prior to AI-assisted reanalysis. Per the study’s authors, roughly 50% of patients with rare diseases never receive a clear genetic diagnosis even after extensive genomic sequencing and specialist review, making this long-unsolved backlog a high-impact target for AI-assisted analysis.
OpenAI o3 Deep Research Workflow Design for Rare Childhood Genetic Diagnosis
The research team designed the o3 Deep Research workflow to act as an explanation-first reasoning layer atop standard genomic pipelines, rather than returning only a ranked list of candidate genes, per the peer-reviewed NEJM AI study. For each of the 376 de-identified cases, the model was fed a structured packet including standardized Human Phenotype Ontology terms describing the patient’s clinical features, demographic metadata, and a filtered variant table capturing each variant’s rarity, predicted protein impact, ClinVar classification, and family segregation signal.
Most packets included genomic data from the affected child and both biological parents. The workflow did not issue independent diagnoses, instead producing evidence-linked candidate explanations for clinical review, with confirmed diagnoses only finalized after validation by a clinical laboratory and sign-off from a multidisciplinary care team.
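The article lists what each packet contained but not how it was laid out. A minimal sketch of one possible representation follows. All class names, field names, the example values, and the frequency cutoff are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    """One row of the filtered variant table (field names are hypothetical)."""
    gene: str
    population_frequency: float             # rarity, expressed as an allele frequency
    predicted_impact: str                   # predicted protein impact
    clinvar_classification: Optional[str]   # None when ClinVar has no entry
    segregation: Optional[str]              # family segregation signal


@dataclass
class CasePacket:
    """De-identified input packet for one case (layout is an assumption)."""
    case_id: str
    hpo_terms: List[str]                    # Human Phenotype Ontology term IDs
    demographics: dict
    variants: List[Variant] = field(default_factory=list)
    has_parental_data: bool = True          # most, but not all, packets included both parents


def rare_variants(packet: CasePacket, max_frequency: float = 0.001) -> List[Variant]:
    """Return variants at or below a frequency cutoff (the cutoff is illustrative)."""
    return [v for v in packet.variants if v.population_frequency <= max_frequency]


# Placeholder data only; HP:0001250 is the HPO term for seizure.
packet = CasePacket(
    case_id="CASE-0001",
    hpo_terms=["HP:0001250"],
    demographics={"age_years": 6},
    variants=[
        Variant("GENE_A", 0.00001, "missense", None, "de novo"),
        Variant("GENE_B", 0.02, "synonymous", "Benign", None),
    ],
)
print([v.gene for v in rare_variants(packet)])  # prints ['GENE_A']
```

Keeping the phenotype terms and the variant table in one object mirrors the article's description of a single structured packet per case. The actual schema used by the research team is not described in the source.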
Pre-Release Validation and Confidence Score Performance
Prior to testing on the unsolved case pool, the team fine-tuned the workflow on 51 cases with confirmed rare disease diagnoses. The model recovered the correct gene and associated variant in duplicate test runs for 48 of those 51 cases, a 94.1% success rate on the fine-tuning set.
It also returned correct diagnoses for 45 of 57 neuromuscular cases in the validation set, and identified the correct disease gene in all 15 long-read genome test cases, including both disease-causing alleles in 12 of those 15 samples.
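For readers who want the validation counts as percentages, the arithmetic can be reproduced as follows. The counts come from the figures above. The helper name and the labels are illustrative.

```python
def rate(successes: int, total: int) -> float:
    """Percentage of successes, rounded to one decimal place."""
    return round(100 * successes / total, 1)


# (successes, total) pairs as reported for the validation runs.
validation_counts = {
    "confirmed-diagnosis cases, gene and variant recovered": (48, 51),
    "neuromuscular cases, correct diagnosis": (45, 57),
    "long-read cases, correct gene": (15, 15),
    "long-read cases, both alleles found": (12, 15),
}

for label, (hits, total) in validation_counts.items():
    print(f"{label}: {hits}/{total} = {rate(hits, total)}%")
```

This gives 94.1%, 78.9%, 100.0%, and 80.0% respectively.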
The model’s self-reported confidence scores aligned with accuracy across these validation runs: the mean minimum score for consistently correct calls was 85.6, compared with 42.1 for incorrect or unconfirmed calls. Study researchers emphasized that these scores were not calibrated probabilities and were used only to prioritize candidate explanations for expert review, not as a replacement for clinical decision-making.
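One way to use such scores purely for prioritization is to rank candidates by their lowest score across repeated runs. The sketch below assumes that approach. The function name, input layout, and example scores are hypothetical, and no cutoff is applied because the source does not report one.

```python
from typing import Dict, List, Tuple


def rank_for_review(run_scores: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Order candidates by their lowest score across repeated runs, highest first.

    The scores are treated as a sort key only, not as probabilities.
    Candidates with no recorded scores are skipped.
    """
    minimums = {cand: min(scores) for cand, scores in run_scores.items() if scores}
    return sorted(minimums.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores from two runs per candidate explanation.
queue = rank_for_review({
    "candidate_A": [91.0, 88.0],
    "candidate_B": [70.0, 40.0],
    "candidate_C": [60.0, 65.0],
})
print(queue)  # prints [('candidate_A', 88.0), ('candidate_C', 60.0), ('candidate_B', 40.0)]
```

Ranking on the minimum rather than the mean favors candidates the model scores highly on every run, which matches the article's emphasis on consistently correct calls.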
4.8% Additional Diagnostic Yield for Unsolved Rare Childhood Genetic Cases
The 376 unsolved cases were pulled from four separate diagnostic groups: pediatric patients with neurodevelopmental disorders, people living with rare neuromuscular conditions, children and teens with early psychosis, and pediatric cases of sudden unexpected death. Every case had already been analyzed by multiple commercial or institutional genomic pipelines and discussed by multidisciplinary specialist teams, with no confirmed diagnosis identified prior to the AI-assisted reanalysis.
The 18 confirmed diagnoses, out of 376 cases, represent a 4.8% additional diagnostic yield over earlier expert analysis. The workflow can cross-reference fragmented clinical records, variant databases, and published literature in minutes, rather than the hours or days required for manual expert review. That speed means clinical teams can systematically revisit older unsolved cases as new gene-disease associations are published.
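Revisiting older cases as new associations appear can be pictured as a simple set intersection. The following is a sketch under that assumption and is not a description of the study's method. The function name, case IDs, and gene names are placeholders.

```python
from typing import Dict, List, Set


def cases_to_revisit(unsolved: Dict[str, Set[str]], new_genes: Set[str]) -> List[str]:
    """Return IDs of unsolved cases with a variant in a newly associated gene."""
    return sorted(case_id for case_id, genes in unsolved.items() if genes & new_genes)


# Genes with filtered variants, keyed by unsolved case ID (placeholder data).
unsolved_cases = {
    "CASE-0001": {"GENE_A", "GENE_B"},
    "CASE-0002": {"GENE_C"},
    "CASE-0003": {"GENE_B", "GENE_D"},
}
newly_published = {"GENE_B"}

print(cases_to_revisit(unsolved_cases, newly_published))  # prints ['CASE-0001', 'CASE-0003']
```

A filter like this only selects which cases to send back for reanalysis. Any resulting candidate would still go through the expert review described in the article.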
Clinical Guardrails and Implementation Protocol
Per the study’s clinical protocol, all model outputs were classified as hypotheses, not formal diagnoses. At least two members of the research team reviewed every candidate explanation using the ACMG/AMP variant classification framework, the standard for clinical genetic testing.
Reviewer disagreements were resolved by consensus, and no result was counted as a confirmed diagnosis until a variant was classified as pathogenic or likely pathogenic, verified by a CLIA-certified clinical laboratory, and communicated to the patient’s care team.
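The confirmation criteria above can be summarized as a conjunction of checks. The sketch below encodes them under the assumption that each criterion is tracked as a simple field. The class and field names are hypothetical, and the real protocol involves expert judgment that a boolean check does not capture.

```python
from dataclasses import dataclass

# ACMG/AMP tiers that the protocol counts toward a confirmed diagnosis.
REPORTABLE = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateReview:
    """Review status for one model-generated hypothesis (fields are hypothetical)."""
    reviewer_count: int
    consensus_classification: str    # ACMG/AMP tier agreed by the reviewers
    clia_lab_verified: bool
    communicated_to_care_team: bool


def is_confirmed_diagnosis(review: CandidateReview) -> bool:
    """True only when every criterion described in the protocol is met."""
    return (
        review.reviewer_count >= 2
        and review.consensus_classification.lower() in REPORTABLE
        and review.clia_lab_verified
        and review.communicated_to_care_team
    )


pending = CandidateReview(2, "Likely pathogenic", clia_lab_verified=False,
                          communicated_to_care_team=False)
print(is_confirmed_diagnosis(pending))  # prints False: lab verification is still missing
```

Requiring all conditions together reflects the article's point that a model output stays a hypothesis until review, laboratory verification, and communication to the care team are complete.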
The study’s strict guardrails reflect the high stakes of pediatric genetic diagnosis, where false positive results can lead to unnecessary medical interventions and false negative results can delay critical care for years. Study researchers emphasize the workflow is not a replacement for clinical judgment, but a tool to surface high-priority candidate diagnoses for teams already stretched thin by heavy rare disease caseloads.
OpenAI Health AI Expansion Context
The study’s release coincides with OpenAI’s broader push to expand its health AI capabilities. According to OpenAI’s June 18 health AI product update, changes to ChatGPT’s health intelligence features reduced flagged factuality issues in health responses by 71% in the two months before June 18, 2026. OpenAI currently collaborates with more than 260 physicians across 60 countries and 26 medical specialties to evaluate and refine its health AI tools.
Feedback from these physicians is used to refine prompt design, clinical guardrails, and output formatting for health-specific use cases.
