Released as a June 2026 arXiv preprint (identifier 2606.28406), SciDraw-Bench is the first standardized benchmark designed to evaluate how well AI systems generate usable scientific figures, a use case that existing image-generation evaluation tools do not address. The full suite comprises 32 tasks covering 8 scientific figure types across 10 academic disciplines. 1
SciDraw-Bench Addresses a Long-Unmet Need for Scientific Figure Evaluation
Widely used image-generation benchmarks, including GenEval, T2I-CompBench, and DPG-Bench, measure only natural-image properties such as compositionality, object counting, and photorealism. They do not assess what a functional scientific figure requires: accurate, readable text labels, precise representation of entities and their relationships, logically organized diagram layouts, and compliance with field-specific drawing standards. 1
For example, a standard neuroscience diagram of synaptic transmission requires precise neuron labeling, directional arrow conventions for signal flow, and consistent symbols for neurotransmitters, none of which existing natural-image benchmarks evaluate. That leaves figure usability unmeasured for researchers who rely on AI-generated visuals for publication. 1
Four-Dimensional Protocol Measures Core Scientific Figure Quality
Each of SciDraw-Bench’s 32 tasks pairs a natural-language prompt with machine-checkable specifications defining required labels, entity relations, component parts, disciplinary conventions, and negative constraints to avoid. The framework uses a four-dimensional evaluation protocol to score generated outputs against these specifications. 1
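As a rough illustration of what a prompt paired with machine-checkable specifications could look like, the sketch below models one task as a Python dataclass. The class name, field names, relation format, and example values are all hypothetical and chosen for this illustration; the preprint's actual schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FigureTaskSpec:
    """Hypothetical container for one benchmark task (illustrative field names)."""

    task_id: str
    figure_type: str
    discipline: str
    prompt: str
    required_labels: tuple[str, ...] = ()
    # Each relation is (source label, relation name, target label).
    required_relations: tuple[tuple[str, str, str], ...] = ()
    required_components: tuple[str, ...] = ()
    conventions: tuple[str, ...] = ()
    negative_constraints: tuple[str, ...] = ()

    def validate(self) -> list[str]:
        """Return a list of internal-consistency problems (empty if none)."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt is empty")
        labels = set(self.required_labels)
        for source, _, target in self.required_relations:
            for endpoint in (source, target):
                if endpoint not in labels:
                    problems.append(
                        f"relation endpoint {endpoint!r} is not a required label"
                    )
        return problems


# Made-up example task, loosely based on the synaptic transmission diagram.
spec = FigureTaskSpec(
    task_id="neuro-001",
    figure_type="mechanism diagram",
    discipline="neuroscience",
    prompt="Draw a labeled diagram of chemical synaptic transmission.",
    required_labels=("Presynaptic neuron", "Synaptic cleft", "Postsynaptic neuron"),
    required_relations=(
        ("Presynaptic neuron", "signals_to", "Postsynaptic neuron"),
    ),
    required_components=("vesicles", "receptors"),
    conventions=("arrows point in the direction of signal flow",),
    negative_constraints=("no decorative 3D shading",),
)

print(spec.validate())
```

Keeping the specification as structured data rather than free text is what would let each scoring dimension check its own fields independently.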
The four dimensions are Text Fidelity, measured via OCR-based label recall and character error rate; Semantic Correctness, judged by a vision-language model against the task specification; Structural Quality; and Convention Adherence. Each dimension is scored independently to provide granular feedback on model performance gaps for developers and researchers. 1
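The two Text Fidelity measures named above can be sketched in a few lines. This is a minimal version under stated assumptions: OCR output is taken as an already-extracted list of strings, labels match only after case and whitespace normalization, and character error rate is the Levenshtein edit distance divided by the reference length. The function names and the matching rule are illustrative, not the benchmark's own implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def label_recall(required: list[str], detected: list[str]) -> float:
    """Fraction of required labels found exactly among the OCR-detected strings."""
    if not required:
        return 1.0
    detected_set = {normalize(d) for d in detected}
    hits = sum(1 for r in required if normalize(r) in detected_set)
    return hits / len(required)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between normalized strings, divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


# Made-up OCR output with one misread character in the third label.
required = ["Presynaptic neuron", "Synaptic cleft", "Postsynaptic neuron"]
detected = ["presynaptic neuron", "Synaptic cleft", "Postsynaptlc neuron"]

print(label_recall(required, detected))
print(character_error_rate(required[2], detected[2]))
```

Reporting both numbers is useful because they fail differently: exact-match recall drops a whole label for a single misread character, while character error rate shows how close the rendered text actually was.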
The package also includes a meta-evaluation protocol that quantifies how closely automated benchmark scores align with ratings from domain expert judges. The preprint reports a preliminary inter-judge reliability analysis to validate initial scoring consistency.
Full human-rating validation of all four evaluation dimensions is scheduled for the full public benchmark release, per the preprint’s outlined roadmap. 1
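One common way to quantify alignment between automated scores and human ratings is a rank correlation. The sketch below computes Spearman's rho in plain Python, using average ranks for ties. The choice of statistic is an assumption made for illustration, since the specific statistic used by the preprint is not stated here, and the scores are made-up values.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: automated scores vs. expert ratings for five figures.
automated = [0.82, 0.40, 0.65, 0.91, 0.30]
expert = [4, 2, 4, 5, 1]

print(round(spearman(automated, expert), 3))
```

A rank-based statistic only asks whether the automated metric orders figures the way experts do, which suits the case where the two scales (for example, a 0 to 1 score and a 1 to 5 rating) are not directly comparable.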
Pilot Tests Confirm Domain-Specific Models Outperform General-Purpose Baselines
In a pilot test spanning all 8 figure types included in the initial benchmark release, the domain-specific SciDraw AI system substantially outperformed general-purpose text-to-image baselines on every evaluation dimension and figure type tested. 1
The largest performance gaps appeared in the Semantic Correctness and Convention Adherence categories. Text Fidelity, specifically the generation of correct, legible text labels, remained the most challenging dimension for all tested systems, including the domain-specific SciDraw AI model. This aligns with widely documented limitations of general-purpose text-to-image models, which frequently produce illegible or incorrect text in generated images, a critical flaw for scientific use cases where label accuracy is required for reproducibility. 1
Planned Code-to-Figure Extension Will Broaden Benchmark Use Cases
The preprint also outlines a planned code-to-figure baseline extension, which will evaluate systems that generate figures directly from scientific code or data rather than from natural-language prompts. 1
For teams building AI tools for scientific writing and publication, the benchmark fills a critical gap in evaluation infrastructure. As text-to-image models are increasingly integrated into academic writing workflows, measuring text accuracy and adherence to disciplinary conventions becomes as important as measuring photorealism for real-world usability, per the preprint's analysis of current gaps in scientific AI evaluation tools. 1
