ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Tencent Hunyuan Team

Introduction

ArtifactsBench is the first automated multimodal benchmark for evaluating LLM-generated visual artifacts. It renders dynamic outputs, assesses their fidelity and interactivity with MLLM judges guided by fine-grained checklists, and achieves over 94% correlation with human preferences across 1,825 diverse tasks.

Dataset Statistics


The ArtifactsBench benchmark comprises 1,825 high-quality, challenging queries organized into nine distinct categories: Game Development, SVG Generation, Web Applications, Simulations, Data Science, Management Systems, Multimedia Editing, Quick Tools, and Others. This structure ensures broad coverage of practical application domains.
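To inspect the category distribution yourself, a minimal sketch like the one below should suffice. Note that the file path dataset/artifacts_bench.json and the per-task class field are assumptions about the released format; adjust them to match the files actually shipped in the repository.

```python
import json
from collections import Counter

# Hypothetical path and field name; adjust to the released dataset files.
with open("dataset/artifacts_bench.json", encoding="utf-8") as f:
    tasks = json.load(f)

# Count how many tasks fall into each of the nine categories.
category_counts = Counter(task["class"] for task in tasks)
print(f"{len(tasks)} tasks across {len(category_counts)} categories")
for category, count in category_counts.most_common():
    print(f"{category:20s} {count}")
```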

Dataset Construction Pipeline


We organize ArtifactsBench creation as an eight-stage pipeline: Extraction & Filtering, Manual and LLM-based Rewrite & Polish, Classification and Difficulty Filtering, Small Sample Annotation, CheckList Generation, Model Generation, Manual QA Checking and Quality Control, and Final Data Consolidation. This structured process ensures the generation of diverse, high-quality tasks for robust evaluation of visual code generation.

Multi-Stage Automated Evaluation


To ensure the validity of our evaluation framework, we first rigorously assess the MLLM-as-judge approach by measuring its pairwise scoring agreement with human experts on a carefully curated task subset. After confirming its high reliability (achieving >90% agreement), we then deploy this automated judge for large-scale evaluation across the entire benchmark. The scoring process follows a structured three-stage pipeline: (1) Code Extraction, (2) Dynamic Rendering and Capture, and (3) MLLM-as-Judge Assessment.
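To make these three stages concrete, the sketch below walks a single model response through them. It is illustrative only: the extraction regex, the Playwright-based capture helper, and the judge stub are assumptions made for exposition, not the repository's actual implementation.

```python
import re
import tempfile
from pathlib import Path

from playwright.sync_api import sync_playwright  # pip install playwright


def extract_html(model_response: str) -> str:
    """Stage 1: pull an HTML document out of the raw model output.
    Falls back to the full response if no <html>...</html> block is found."""
    match = re.search(r"<!DOCTYPE html>.*?</html>|<html.*?</html>",
                      model_response, re.DOTALL | re.IGNORECASE)
    return match.group(0) if match else model_response


def render_and_capture(html: str, shots: int = 3, interval_ms: int = 1000) -> list[bytes]:
    """Stage 2: render the artifact in a headless browser and take screenshots at
    several time points so animations and state changes are visible to the judge."""
    frames: list[bytes] = []
    with tempfile.TemporaryDirectory() as tmp:
        page_path = Path(tmp) / "artifact.html"
        page_path.write_text(html, encoding="utf-8")
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1280, "height": 720})
            page.goto(page_path.as_uri())
            for _ in range(shots):
                page.wait_for_timeout(interval_ms)
                frames.append(page.screenshot())
            browser.close()
    return frames


def judge_artifact(task: str, checklist: list[str], frames: list[bytes]) -> dict:
    """Stage 3: send the task, its fine-grained checklist, and the captured frames
    to an MLLM judge (e.g., Gemini-2.5-Pro) and parse per-criterion scores.
    Left as a stub here; the actual prompts live in the evaluation scripts."""
    raise NotImplementedError
```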

🚀 Latest Updates & Release Notes

October 27, 2025 🔥🔥🔥

  • 2025.10.27 | MiniMax-M2 scores 66.8 on ArtifactsBench (DeepSeek-V3.2: 55.8). 🎉🎉🎉 Link: MiniMax-M2 model card; scores as reported on the linked page.
  • 2025.10.23 | ReLook (paper): vision-grounded RL for agentic web coding with an MLLM visual critic, zero reward for invalid renders, Forced Optimization, and critic-free inference; it outperforms strong baselines. 🎉🎉🎉 Link: arXiv:2510.11498.
  • 2025.10.08 | Ling-1T (1T parameters) scores 59.31 on ArtifactsBench (DeepSeek-V3.1-Terminus: 43.29). 🎉🎉🎉 Link: Ling-1T model card; scores as reported on the linked page.

July 30, 2025 🔥🔥🔥

  • 🆕 Model Coverage Expansion: Added evaluation of GLM-4.5, expanding our coverage of state-of-the-art language models and broadening our benchmarking insights.
  • 📊 Enhanced Visualization: Introduced a new analysis chart, artifactsbench_vs_model_infer.png, which plots model inference scores against response lengths for deeper insight into model behavior.

July 25, 2025 🔥🔥🔥

We're excited to announce important updates to ArtifactsBench that significantly improve reproducibility, expand model coverage, and enhance evaluation stability:

Key Updates:

  • 🔧 Unified Judge Model: Migrated from Gemini-2.5-Pro-Preview-0605 (now deprecated) to the stable Gemini-2.5-Pro for all evaluations, ensuring consistent reproducibility for the research community.
  • 🆕 Expanded Model Coverage: Added evaluation of the latest high-quality open-source code models to keep pace with rapid developments in the field.
  • 📊 Enhanced Transparency: Released intermediate reasoning results and evaluation data to strengthen confidence in our results and enable full reproducibility.

🎯 Full Open-Source & Complete Reproducibility:

  • 🔓 100% Data Open-Source: All evaluation data, model outputs, judge reasoning, and intermediate results are completely open-sourced - no proprietary data withheld.
  • ♻️ Complete Paper Reproducibility: Every result in our paper can be fully reproduced using the provided data and scripts - we guarantee 100% reproducibility.
  • πŸ” Full Transparency: From raw model outputs to final scores, every step of our evaluation pipeline is transparent and auditable.

Data Release:

All intermediate model results, judge model inference results, and reasoning chains from this update are available at dataset/release_data_20250725/ for complete transparency and reproducibility.

Updated Results Overview:


Figure: Analysis of model inference scores versus response lengths on ArtifactsBench, revealing the relationship between model performance and output verbosity patterns.


Figure: Latest ArtifactsBench results (July 2025) with expanded model coverage and unified Gemini-2.5-Pro evaluation.

For comparison, the previous results table is available in the Leaderboard section.

Model Performance on ArtifactsBench


Our evaluation covers more than 30 state-of-the-art Large Language Models (LLMs), spanning both open-source and proprietary systems. The open-source cohort comprises 23 models from several influential families, including the Qwen2.5, Qwen3, and Gemma3 series, along with notable standalone models such as Seed-Coder-8B-Instruct and the DeepSeek series. On the proprietary side, we evaluate 8 leading systems, including Gemini-2.5-Pro-Preview-05-06, Claude 3.7 Sonnet (20250219), Claude 4 Sonnet (20250514), GPT-4o-2024-11-20, o3-mini-2025-01-31, Seed-thinking-1.5, and Hunyuan-TurboS-preview.

Scores across different difficulty levels


We organize the benchmark into three tiers of increasing difficulty. Even the best-performing models struggle to surpass 50 points on the most challenging subset, indicating that our benchmark remains far from saturation. Notably, the models' relative rankings remain consistent across all difficulty levels, and each tier maintains strong discriminative powerβ€”demonstrating our benchmark's ability to reliably differentiate model capabilities at every level of challenge.

Ranking correlation between ArtifactsBench and WebDev Arena


We find that the model rankings from ArtifactsBench exhibit a remarkably high correlation with WebDev Arena, achieving a 94.4% consistency score. This score is calculated using the normalized Footrule metric, which quantifies the agreement between two ranked lists.
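For reference, the sketch below shows one way to turn Spearman's footrule distance between two rankings into a consistency score in [0, 1]; the exact normalization used in the paper may differ, so treat this as an illustration rather than the official metric.

```python
def footrule_consistency(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Consistency between two rankings of the same items via Spearman's footrule.

    Each dict maps an item (e.g., a model name) to its rank (1 = best). The
    footrule distance is the sum of absolute rank differences; dividing by its
    maximum possible value and subtracting from 1 gives a score in [0, 1],
    where 1 means the two rankings agree perfectly.
    """
    assert rank_a.keys() == rank_b.keys()
    n = len(rank_a)
    distance = sum(abs(rank_a[m] - rank_b[m]) for m in rank_a)
    # Maximum footrule distance for n items (fully reversed ranking): floor(n^2 / 2)
    max_distance = n * n // 2
    return 1.0 - distance / max_distance


# Toy example with three models (illustrative ranks only):
benchmark = {"model_x": 1, "model_y": 2, "model_z": 3}
arena = {"model_x": 1, "model_y": 3, "model_z": 2}
print(f"{footrule_consistency(benchmark, arena):.3f}")  # 0.500 for this toy case
```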

BibTeX

@article{zhang2025artifactsbench,
  title={ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation},
  author={Zhang, Chenchen and Li, Yuhang and Xu, Can and Liu, Jiaheng and Liu, Ao and Hu, Shihui and Wu, Dengpeng and Huang, Guanhua and Li, Kejiao and Yi, Qi and others},
  journal={arXiv preprint arXiv:2507.04952},
  year={2025}
}