ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Tencent Hunyuan Team

Introduction

ArtifactsBench is the first automated multimodal benchmark for evaluating LLM-generated visual artifacts. It renders dynamic outputs, assesses their fidelity and interactivity with MLLM judges guided by fine-grained checklists, and achieves over 94% correlation with human preferences across 1,825 diverse tasks.

Dataset Statistics


The ArtifactsBench benchmark comprises 1,825 high-quality, challenging queries organized into nine distinct categories: Game Development, SVG Generation, Web Applications, Simulations, Data Science, Management Systems, Multimedia Editing, Quick Tools, and Others. This structure ensures broad coverage of practical application domains.
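To inspect the category distribution yourself, a minimal sketch like the one below should suffice. Note that the file path dataset/artifacts_bench.json and the per-task class field are assumptions about the released format; adjust them to match the files actually shipped in the repository.

```python
import json
from collections import Counter

# Hypothetical path and field name; adjust to the released dataset files.
with open("dataset/artifacts_bench.json", encoding="utf-8") as f:
    tasks = json.load(f)

# Count how many tasks fall into each of the nine categories.
category_counts = Counter(task["class"] for task in tasks)
print(f"{len(tasks)} tasks across {len(category_counts)} categories")
for category, count in category_counts.most_common():
    print(f"{category:20s} {count}")
```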

Dataset Construction Pipeline


We organize ArtifactsBench creation as an eight-stage pipeline: Extraction & Filtering, Manual and LLM-based Rewrite & Polish, Classification and Difficulty Filtering, Small Sample Annotation, CheckList Generation, Model Generation, Manual QA Checking and Quality Control, and Final Data Consolidation. This structured process ensures the generation of diverse, high-quality tasks for robust evaluation of visual code generation.

Multi-Stage Automated Evaluation


To ensure the validity of our evaluation framework, we first rigorously assess the MLLM-as-judge approach by measuring its pairwise scoring agreement with human experts on a carefully curated task subset. After confirming its high reliability (achieving >90% agreement), we then deploy this automated judge for large-scale evaluation across the entire benchmark. The scoring process follows a structured three-stage pipeline: (1) Code Extraction, (2) Dynamic Rendering and Capture, and (3) MLLM-as-Judge Assessment.
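To make these three stages concrete, the sketch below walks a single model response through them. It is illustrative only: the extraction regex, the Playwright-based capture helper, and the judge stub are assumptions made for exposition, not the repository's actual implementation.

```python
import re
import tempfile
from pathlib import Path

from playwright.sync_api import sync_playwright  # pip install playwright


def extract_html(model_response: str) -> str:
    """Stage 1: pull an HTML document out of the raw model output.
    Falls back to the full response if no <html>...</html> block is found."""
    match = re.search(r"<!DOCTYPE html>.*?</html>|<html.*?</html>",
                      model_response, re.DOTALL | re.IGNORECASE)
    return match.group(0) if match else model_response


def render_and_capture(html: str, shots: int = 3, interval_ms: int = 1000) -> list[bytes]:
    """Stage 2: render the artifact in a headless browser and take screenshots at
    several time points so animations and state changes are visible to the judge."""
    frames: list[bytes] = []
    with tempfile.TemporaryDirectory() as tmp:
        page_path = Path(tmp) / "artifact.html"
        page_path.write_text(html, encoding="utf-8")
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1280, "height": 720})
            page.goto(page_path.as_uri())
            for _ in range(shots):
                page.wait_for_timeout(interval_ms)
                frames.append(page.screenshot())
            browser.close()
    return frames


def judge_artifact(task: str, checklist: list[str], frames: list[bytes]) -> dict:
    """Stage 3: send the task, its fine-grained checklist, and the captured frames
    to an MLLM judge (e.g., Gemini-2.5-Pro) and parse per-criterion scores.
    Left as a stub here; the actual prompts live in the evaluation scripts."""
    raise NotImplementedError
```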

🚀 Latest Updates & Release Notes

October 27, 2025 🔥🔥🔥

  • 2025.10.27 | MiniMax-M2 scores 66.8 on ArtifactsBench (DeepSeek-V3.2: 55.8). 🎉🎉🎉 Link: MiniMax-M2 model card; scores as reported on the linked page.
  • 2025.10.23 | ReLook (paper): vision-grounded RL for agentic web coding with an MLLM visual critic, zero reward for invalid renders, Forced Optimization, and critic-free inference; it outperforms strong baselines. 🎉🎉🎉 Link: arXiv:2510.11498.
  • 2025.10.08 | Ling-1T (1T parameters) scores 59.31 on ArtifactsBench (DeepSeek-V3.1-Terminus: 43.29). 🎉🎉🎉 Link: Ling-1T model card; scores as reported on the linked page.

July 30, 2025 🔥🔥🔥

  • 🆕 Model Coverage Expansion: Added evaluation of GLM-4.5, expanding our coverage of state-of-the-art language models and broadening our benchmarking insights.
  • 📊 Enhanced Visualization: Introduced a new analysis chart, artifactsbench_vs_model_infer.png, which plots model inference scores against response lengths for deeper insight into model behavior.

July 25, 2025 🔥🔥🔥

We're excited to announce important updates to ArtifactsBench that significantly improve reproducibility, expand model coverage, and enhance evaluation stability:

Key Updates:

  • 🔧 Unified Judge Model: Migrated from Gemini-2.5-Pro-Preview-0605 (now deprecated) to the stable Gemini-2.5-Pro for all evaluations, ensuring consistent reproducibility for the research community.
  • 🆕 Expanded Model Coverage: Added evaluation of the latest high-quality open-source code models to keep pace with rapid developments in the field.
  • 📊 Enhanced Transparency: Released intermediate reasoning results and evaluation data to strengthen confidence in our results and enable full reproducibility.

🎯 Full Open-Source & Complete Reproducibility:

  • 🔓 100% Data Open-Source: All evaluation data, model outputs, judge reasoning, and intermediate results are completely open-sourced - no proprietary data withheld.
  • ♻️ Complete Paper Reproducibility: Every result in our paper can be fully reproduced using the provided data and scripts - we guarantee 100% reproducibility.
  • πŸ” Full Transparency: From raw model outputs to final scores, every step of our evaluation pipeline is transparent and auditable.

Data Release:

All intermediate model results, judge model inference results, and reasoning chains from this update are available at dataset/release_data_20250725/ for complete transparency and reproducibility.

Updated Results Overview:


Figure: Analysis of model inference scores versus response lengths on ArtifactsBench, revealing the relationship between model performance and output verbosity patterns.


Figure: Latest ArtifactsBench results (July 2025) with expanded model coverage and unified Gemini-2.5-Pro evaluation.

For comparison, the previous results table is available in the Leaderboard section.

Model Performance on ArtifactsBench


Our evaluation covers more than 30 state-of-the-art Large Language Models (LLMs), spanning both open-source and proprietary systems. The open-source cohort comprises 23 models from several influential families, including the Qwen2.5, Qwen3, and Gemma3 series, along with notable standalone models such as Seed-Coder-8B-Instruct and the DeepSeek series. On the proprietary side, we evaluate 8 leading systems, including Gemini-2.5-Pro-Preview-05-06, Claude 3.7 Sonnet (20250219), Claude 4 Sonnet (20250514), GPT-4o-2024-11-20, o3-mini-2025-01-31, Seed-thinking-1.5, and Hunyuan-TurboS-preview.

Scores across different difficulty levels


We organize the benchmark into three tiers of increasing difficulty. Even the best-performing models struggle to surpass 50 points on the most challenging subset, indicating that our benchmark remains far from saturation. Notably, the models' relative rankings remain consistent across all difficulty levels, and each tier maintains strong discriminative powerβ€”demonstrating our benchmark's ability to reliably differentiate model capabilities at every level of challenge.

Ranking correlation between ArtifactsBench and WebDev Arena


We find that the model rankings from ArtifactsBench exhibit a remarkably high correlation with WebDev Arena, achieving a 94.4% consistency score. This score is calculated using the normalized Footrule metric, which quantifies the agreement between two ranked lists.
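For reference, the sketch below shows one way to turn Spearman's footrule distance between two rankings into a consistency score in [0, 1]; the exact normalization used in the paper may differ, so treat this as an illustration rather than the official metric.

```python
def footrule_consistency(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Consistency between two rankings of the same items via Spearman's footrule.

    Each dict maps an item (e.g., a model name) to its rank (1 = best). The
    footrule distance is the sum of absolute rank differences; dividing by its
    maximum possible value and subtracting from 1 gives a score in [0, 1],
    where 1 means the two rankings agree perfectly.
    """
    assert rank_a.keys() == rank_b.keys()
    n = len(rank_a)
    distance = sum(abs(rank_a[m] - rank_b[m]) for m in rank_a)
    # Maximum footrule distance for n items (fully reversed ranking): floor(n^2 / 2)
    max_distance = n * n // 2
    return 1.0 - distance / max_distance


# Toy example with three models (illustrative ranks only):
benchmark = {"model_x": 1, "model_y": 2, "model_z": 3}
arena = {"model_x": 1, "model_y": 3, "model_z": 2}
print(f"{footrule_consistency(benchmark, arena):.3f}")  # 0.500 for this toy case
```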

BibTeX

@article{zhang2025artifactsbench,
  title={ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation},
  author={Zhang, Chenchen and Li, Yuhang and Xu, Can and Liu, Jiaheng and Liu, Ao and Hu, Shihui and Wu, Dengpeng and Huang, Guanhua and Li, Kejiao and Yi, Qi and others},
  journal={arXiv preprint arXiv:2507.04952},
  year={2025}
}