A peer-reviewed study published in the June 2026 issue of NEJM AI details a collaboration between the Manton Center for Orphan Disease Research at Boston Children’s Hospital, Harvard University, and OpenAI to test the o3 Deep Research reasoning model on long-unsolved rare pediatric genetic cases.
The study found that the model helped clinicians establish 18 new diagnoses in a pool of 376 cases that specialist teams had already reviewed without reaching a diagnosis. That is an additional diagnostic yield of 4.8% (18 of 376) over prior human-only analysis of the same case set (OpenAI).
The o3 model was not designed to issue standalone clinical diagnoses. Instead, it functioned as an explanation-first reasoning support layer, cross-referencing de-identified patient genomic data, standardized Human Phenotype Ontology (HPO) clinical phenotype records, and current peer-reviewed scientific literature to produce evidence-linked candidate diagnostic hypotheses for specialist review. All candidate outputs required formal clinical validation before being shared with patients or care teams (OpenAI).
Workflow Design for AI Diagnosis of Rare Childhood Genetic Diseases Prioritizes Explainability
To align with clinical safety requirements, the research team designed the AI workflow to act as a support layer for existing institutional genomic pipelines, rather than a standalone diagnostic tool. For each of the 376 unsolved cases included in the study, research analysts built anonymized data packets containing standardized Human Phenotype Ontology (HPO) terms to capture each patient’s clinical symptoms and traits (OpenAI).
These packets also included existing clinician notes, demographic information, and a curated variant list noting each variant’s rarity, predicted impact on protein function, ClinVar classification, and inheritance signal quality across family members. Most packets included genomic data from the affected child and both biological parents. This was a deliberate design choice: it allowed the model to assess variant inheritance patterns, a key factor in distinguishing disease-causing variants from benign variants inherited within a family (OpenAI).
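The packet contents described above can be pictured as a small data structure. The following is a minimal sketch only: the study's actual packet format is not described in the source, so every class, field, and function name here is hypothetical, and the gene and variant values are placeholders. The one external fact it relies on is that HPO identifiers take the form "HP:" followed by seven digits.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# HPO identifiers are written as "HP:" followed by seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


@dataclass(frozen=True)
class CandidateVariant:
    """One entry in the curated variant list (hypothetical field names)."""
    gene: str
    description: str                       # free-text or HGVS-style description
    population_frequency: Optional[float]  # rarity; None if not observed
    predicted_impact: str                  # predicted effect on protein function
    clinvar_classification: Optional[str]  # None if absent from ClinVar
    inheritance: str                       # e.g. "de_novo", "maternal", "paternal"
    inheritance_quality: str               # quality of the signal across family members


@dataclass
class CasePacket:
    """De-identified case packet (hypothetical layout, not the study's format)."""
    case_id: str
    hpo_terms: list[str]
    clinician_notes: str
    demographics: dict[str, str]
    variants: list[CandidateVariant] = field(default_factory=list)
    has_parental_data: bool = False        # True when both parents were sequenced


def malformed_hpo_ids(packet: CasePacket) -> list[str]:
    """Return phenotype entries that are not well-formed HPO identifiers."""
    return [t for t in packet.hpo_terms if not HPO_ID.fullmatch(t)]


# Placeholder example; the gene, variant, and notes are invented for illustration.
example_packet = CasePacket(
    case_id="case-0001",
    hpo_terms=["HP:0001250", "HP:0001252"],
    clinician_notes="(de-identified notes)",
    demographics={"age_band": "0-5"},
    variants=[
        CandidateVariant(
            gene="GENE1",
            description="placeholder variant",
            population_frequency=None,
            predicted_impact="missense",
            clinvar_classification=None,
            inheritance="de_novo",
            inheritance_quality="high",
        )
    ],
    has_parental_data=True,
)
```

Checking identifiers up front, as `malformed_hpo_ids` does, is one simple way to catch free-text symptoms that slipped in where a standardized term was expected.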
The use of standardized HPO terms was also intentional: it reduces the ambiguity in clinical phenotype descriptions that often leads to missed matches across disparate databases and research studies. The model was prompted to propose the most plausible molecular explanation for the patient’s symptoms and show its full reasoning chain, rather than returning only a ranked list of candidate genes without supporting evidence (OpenAI).
This explainability-first design was deliberate: the research team noted that black-box gene rankings are difficult for clinicians to interrogate. A full evidence-linked justification, by contrast, can be cross-checked against existing clinical knowledge and lab results to confirm or rule out candidate diagnoses (OpenAI).
How was the o3 model validated before testing on unsolved cases?
Before running the workflow on unsolved cases, the team validated it on 123 cases with established diagnoses to refine prompts and identify failure modes. In duplicate runs across 51 of these cases, which covered a wide range of rare conditions, the model recovered the correct gene and disease-causing variant 48 times (94.1%). For the 57 neuromuscular cases in the validation set, it returned the correct diagnosis 45 times (78.9%). For a subset of 15 long-read genome cases, it identified the correct causative gene in every run and both disease-causing alleles in 12 of the 15 cases, an 80% rate (OpenAI).
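The validation percentages follow directly from the reported counts. As a quick check, the sketch below recomputes them; the subset labels and the `rate_percent` helper are illustrative names, and only the counts come from the article.

```python
# Reported validation counts: (correct, total) per subset.
validation_subsets = {
    "mixed rare conditions": (48, 51),
    "neuromuscular": (45, 57),
    "long-read, both alleles found": (12, 15),
}


def rate_percent(correct: int, total: int) -> float:
    """Share of correct calls as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


rates = {name: rate_percent(c, t) for name, (c, t) in validation_subsets.items()}
total_cases = sum(t for _, t in validation_subsets.values())

for name, pct in rates.items():
    print(f"{name}: {pct}%")
print(f"validation cases across subsets: {total_cases}")
```

The three subset sizes add up to the 123 validation cases, which suggests the subsets do not overlap.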
The model’s self-reported confidence scores tracked accuracy in these validation runs: the mean minimum score was 85.6 for consistently correct calls, versus 42.1 for incorrect or unverified calls. The research team explicitly noted that these scores are not calibrated probabilities. They were used only to prioritize which candidate explanations reviewers should examine first, not as a substitute for clinical adjudication (OpenAI).
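One way to picture score-based prioritization is to order candidates by the lowest score each received across repeated runs. This is a sketch under assumptions: reading "minimum score" as a minimum across runs is an interpretation, the `prioritize` function is hypothetical, and the example scores are invented. The scores only set review order and are not treated as probabilities.

```python
def prioritize(candidates: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Order candidates for human review by their minimum score across runs.

    `candidates` maps a candidate explanation to the self-reported scores it
    received in repeated runs. Candidates with no scores are skipped.
    The result only determines review order; it does not accept or reject anything.
    """
    ranked = [(name, min(scores)) for name, scores in candidates.items() if scores]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented scores for illustration only.
review_queue = prioritize(
    {
        "candidate A": [91.0, 88.0],
        "candidate B": [95.0, 40.0],  # high in one run, low in another
        "candidate C": [60.0, 62.0],
    }
)
```

Using the minimum rather than the mean pushes candidates that scored well in only one run, such as candidate B here, toward the back of the queue, while still leaving every candidate available for review.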
What was the additional diagnostic yield for unsolved rare childhood genetic disease cases?
The team then applied the validated workflow to four distinct patient cohorts totaling 376 previously unsolved cases: pediatric patients with neurodevelopmental disorders, individuals living with rare neuromuscular conditions, children and teens diagnosed with early-stage psychosis, and instances of sudden unexplained death in pediatric populations. All of these cases had already been analyzed by multiple commercial or institutional genomic pipelines and discussed by multidisciplinary specialist teams, with no diagnosis identified in prior reviews (OpenAI).
Following the model’s candidate generation, expert review, additional testing, and clinical confirmation, physicians established new diagnoses in 18 cases. This delivered a 4.8% additional diagnostic yield over earlier specialist analysis, a meaningful improvement given the exhaustive prior review all cases had received. The research team noted that this indicates the model uncovered leads that human analysts had missed even after extensive multidisciplinary review (OpenAI).
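The 4.8% figure is the number of new diagnoses divided by the number of previously unsolved cases. A minimal check of that arithmetic, with the function name chosen here for illustration:

```python
def additional_yield_percent(new_diagnoses: int, unsolved_cases: int) -> float:
    """New diagnoses as a percentage of cases that earlier review left unsolved."""
    if unsolved_cases <= 0:
        raise ValueError("unsolved_cases must be positive")
    return round(100 * new_diagnoses / unsolved_cases, 1)


study_yield = additional_yield_percent(new_diagnoses=18, unsolved_cases=376)
print(f"additional diagnostic yield: {study_yield}%")
```

Because the denominator contains only cases that earlier review had failed to solve, the figure measures gain on top of prior analysis rather than overall diagnostic accuracy.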
What guardrails prevent the AI from issuing standalone clinical diagnoses?
The research team emphasized that no model output was ever treated as a formal diagnosis during the study. All candidate explanations generated by the model were reviewed by at least two members of the research team, with disagreements resolved by consensus, and evaluated using the same ACMG/AMP framework that clinical laboratories use to classify genetic variants (OpenAI).
A finding was counted as a confirmed diagnosis only after qualified experts classified the relevant variant as pathogenic or likely pathogenic, a CLIA-certified clinical laboratory verified the result, and the patient’s care team returned the result to the family. This multi-step verification process aligns with standard clinical diagnostic workflows and ensures that the AI’s role is limited to generating testable hypotheses rather than making clinical decisions (OpenAI).
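The three conditions for counting a confirmed diagnosis can be expressed as a simple gate. The sketch below is illustrative rather than the study's actual tooling: the `Finding` class and its field names are hypothetical, while the two accepted categories come from the standard five-tier ACMG/AMP scheme (pathogenic, likely pathogenic, uncertain significance, likely benign, benign).

```python
from dataclasses import dataclass

# Only these two ACMG/AMP categories qualify a variant as diagnostic.
ACCEPTED_CLASSIFICATIONS = {"pathogenic", "likely pathogenic"}


@dataclass(frozen=True)
class Finding:
    """Status of one candidate after human review (hypothetical field names)."""
    expert_classification: str  # ACMG/AMP category assigned by expert reviewers
    clia_lab_verified: bool     # verified by a CLIA-certified clinical laboratory
    returned_to_family: bool    # result returned by the patient's care team


def counts_as_confirmed(finding: Finding) -> bool:
    """True only when all three confirmation conditions hold."""
    classification_ok = (
        finding.expert_classification.strip().lower() in ACCEPTED_CLASSIFICATIONS
    )
    return (
        classification_ok
        and finding.clia_lab_verified
        and finding.returned_to_family
    )
```

Note that nothing produced by the model appears as an input to this gate: every field records a decision made by people or by a clinical laboratory.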
All data used in the study was de-identified to comply with patient privacy regulations, and no patient interaction or direct clinical decision-making was performed by the model. The study’s safeguards were designed to integrate with existing clinical safety protocols without requiring workflow changes for participating care teams (OpenAI).
Bottom line: For medical genetics teams working with long-unsolved rare pediatric genetic cases, the June 2026 NEJM AI study demonstrates that deploying OpenAI’s o3 Deep Research model as a hypothesis-generation support tool, paired with mandatory dual human review and standard ACMG/AMP variant classification, can deliver a 4.8% additional diagnostic yield across 376 previously unsolved cases without requiring changes to existing clinical safety protocols.
