A new 32-task benchmark for scientific figure generation shows general-purpose AI falls short on discipline-compliant research diagrams, with a domain-specific SciDraw AI system outperforming all tested baselines across every usability metric. The SciDraw-Bench evaluation framework, submitted to arXiv on June 24, 2026, highlights a critical gap in current generative AI capabilities for scientific communication.
SciDraw-Bench: 32-Task Benchmark Shows AI Falls Short on Scientific Figures
The SciDraw-Bench evaluation framework tests 32 structured generation tasks spanning 8 distinct figure types and 10 separate scientific disciplines. Tested figure types include molecular mechanism diagrams, chemical reaction schematics, experimental design flowcharts, conceptual frameworks, and graphical abstracts, while covered disciplines range from biochemistry and physics to ecology and social sciences.
The benchmark measures four core usability metrics: text fidelity, assessed via OCR-based label recall and character error rate to evaluate the legibility of axis labels, IUPAC compound names, and other specialized text; semantic correctness, judged by a vision-language model against task-specific specifications for entity relationships and component accuracy; structural quality, measuring coherence of diagram layout and element alignment; and convention adherence, scoring compliance with field-specific drawing rules such as standard arrow types for biochemical pathways or axis labeling norms for physics graphs.
None of these four metrics are covered by existing image generation benchmarks, which focus exclusively on natural image photorealism, compositionality, and object counting. The protocol includes a meta-evaluation layer to assess judge consistency, with preliminary inter-judge reliability testing completed and full human-rating validation currently ongoing.
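As an illustration of what a judge-consistency check can look like, the sketch below computes Cohen's kappa between two judges' verdicts on the same set of figures. The choice of Cohen's kappa is an assumption made here for illustration; the source does not say which reliability statistic the meta-evaluation layer uses, and the function name and the pass/fail labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two judges rating the same items.

    Assumption: Cohen's kappa is used here only as a common example of an
    inter-judge reliability statistic, not as the benchmark's actual method.
    """
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must rate the same non-empty set of items")

    n = len(judge_a)
    # Observed agreement: fraction of items where the two verdicts match.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Expected agreement if each judge assigned labels independently,
    # following their own label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both judges used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two judges on six generated figures.
verdicts_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
verdicts_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(verdicts_a, verdicts_b)
```

A kappa near 1 indicates the judges agree far more often than chance would predict, while a value near 0 indicates agreement no better than chance.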
General AI Models Fall Short Across All Test Dimensions
In pilot testing, the domain-specific SciDraw AI system substantially outperformed representative general-purpose text-to-image and multimodal baselines on every metric and across all 8 figure types, with the largest performance gaps observed on semantic correctness and convention adherence. For example, general models frequently misrepresented molecular bonding structures, used non-standard arrow types for signaling pathways, or misplaced graph axis labels, while SciDraw AI consistently produced compliant, accurate outputs aligned with disciplinary norms.
Text fidelity, measured via OCR-based label recall and character error rate, was the hardest dimension for all tested models. General systems frequently produced garbled or missing specialized terminology such as IUPAC compound names or statistical notation across all 10 tested disciplines, rendering outputs unusable for research purposes without extensive manual correction.
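The two text-fidelity measurements can be sketched in a few lines. The example below is a minimal illustration, assuming the OCR output is already available as a list of strings; the function names, the whitespace-only normalization, and the sample labels are hypothetical and are not taken from the benchmark's own code.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the reference label."""
    if not reference:
        raise ValueError("reference label must be non-empty")
    return levenshtein(reference, ocr_text) / len(reference)


def normalize(label: str) -> str:
    # Assumption: only whitespace is normalized; case is kept because it
    # can carry meaning in chemical and statistical notation.
    return " ".join(label.split())


def label_recall(expected: List[str], ocr_labels: List[str]) -> float:
    """Fraction of expected labels that OCR recovered exactly."""
    if not expected:
        raise ValueError("expected label list must be non-empty")
    found = {normalize(label) for label in ocr_labels}
    hits = sum(1 for label in expected if normalize(label) in found)
    return hits / len(expected)


# Hypothetical example: OCR reads the digit "1" as the letter "l".
expected_labels = ["2-methylpropan-1-ol", "ATP", "p < 0.05"]
ocr_labels = ["2-methylpropan-l-ol", "ATP", "p  <  0.05"]

recall = label_recall(expected_labels, ocr_labels)
cer = character_error_rate(expected_labels[0], ocr_labels[0])
```

In this example a single misread character is enough to make the compound name count as missed for recall, even though its character error rate stays small, which is one reason the two measures are useful together.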
Implications for Researchers and AI Tool Builders
The benchmark highlights a critical blind spot in current generative AI evaluation: existing standards prioritize natural image photorealism and compositionality, neither of which maps to the strict, rule-bound requirements of scientific communication. For individual researchers, this means that using general-purpose AI image generators to create figures for papers or presentations carries a documented risk of inaccurate labels, incorrect structural representations, or violations of field-specific drawing conventions, any of which could undermine the credibility of published work or require hours of manual correction.
For teams building AI tools for research, education, or scientific publishing, the results suggest off-the-shelf text-to-image models are not yet viable for high-stakes diagram generation without domain-specific fine-tuning or custom output validation guardrails tailored to scientific use cases.
The paper’s authors list a code-to-figure baseline as a planned extension in the arXiv submission. It would evaluate models that generate diagrams directly from scientific code or data pipelines rather than from natural language prompts alone, covering the common real-world case in which researchers produce figures from analysis scripts using tools such as Python’s Matplotlib or R’s ggplot2.
