WebMMU is a comprehensive benchmark designed to push the boundaries of AI for the web. It challenges models to answer questions about websites, edit real HTML/CSS/JS code, and generate web layouts from mockups—across four languages and 20+ domains. Whether you're building smarter web agents or testing the limits of multimodal models, WebMMU is your go-to testbed.

- 🌐 Multilingual: English, Spanish, German, French
- 🧩 Three Core Tasks: WebQA, Code Editing, Mockup2Code
- 🖥️ Real-World Data: 20+ website domains
- 🔍 Fine-Grained Evaluation: Web Understanding & Reasoning, Agentic UI Actions, and Code Generation
- 🤝 Open & Human-Annotated: Expert-verified, high-quality samples

|  | English | Spanish | German | French | Total |
|---|---|---|---|---|---|
| Website Images | 392 | 133 | 130 | 131 | 786 |
| WebQA | 1476 | 484 | 379 | 456 | 2795 |
| Mockup2Code | 180 | 93 | 85 | 78 | 436 |
| Code Editing | 165 | 75 | 67 | 68 | 375 |
| Total | 2213 | 785 | 661 | 733 | 4392 |
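
The per-language subsets can be iterated with the Hugging Face `datasets` library. The sketch below is minimal and hedged: the repository id `WebMMU/webmmu` and the config/split names are placeholders, not the official identifiers from the release.

```python
# Minimal sketch of loading the benchmark with Hugging Face `datasets`.
# NOTE: "WebMMU/webmmu", the per-language config names, and the "test" split
# are placeholders; substitute the identifiers from the official release.
from datasets import load_dataset

LANGS = ["en", "es", "de", "fr"]  # English, Spanish, German, French

for lang in LANGS:
    ds = load_dataset("WebMMU/webmmu", name=lang, split="test")
    print(lang, len(ds))
```
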
WebQA
Answer questions about real website screenshots—test your model's ability to reason, ground, and understand UI elements and content.
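
As a rough illustration of how WebQA predictions might be scored, the sketch below computes normalized exact-match accuracy. The field names `answer` and `prediction` are assumptions about the sample schema, and the official evaluation protocol may differ.

```python
# Hypothetical WebQA scoring sketch: normalized exact-match accuracy.
# Field names and the matching rule are assumptions, not the official metric.
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def webqa_accuracy(samples: list[dict]) -> float:
    """Fraction of samples whose prediction matches the reference answer."""
    hits = sum(normalize(s["prediction"]) == normalize(s["answer"]) for s in samples)
    return hits / max(len(samples), 1)

print(webqa_accuracy([{"answer": "Sign in", "prediction": "sign in."}]))  # 1.0
```
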
Mockup2Code
Turn hand-drawn or digital mockups into working HTML/CSS code. Evaluate how well your model translates design into code.
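
One simple proxy for judging Mockup2Code outputs is to compare the DOM structure of the generated page against the reference. The sketch below uses only the Python standard library and is an illustrative heuristic, not the benchmark's official metric.

```python
# Rough structural-similarity sketch for Mockup2Code outputs.
# Illustrative proxy only; not the benchmark's official evaluation.
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the sequence of opening tags in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def dom_similarity(generated_html: str, reference_html: str) -> float:
    """Similarity ratio between the two documents' tag sequences."""
    gen, ref = TagCollector(), TagCollector()
    gen.feed(generated_html)
    ref.feed(reference_html)
    return SequenceMatcher(None, gen.tags, ref.tags).ratio()

print(dom_similarity("<div><p>Hi</p></div>", "<div><p>Hello</p><img></div>"))
```
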
Code Editing
Edit real website code based on user instructions. Can your model make precise, functional changes to HTML, CSS, or JS?
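
For a quick sanity check of Code Editing outputs, the model's edited file can be diffed against the gold edited file. The sketch below uses `difflib` for an approximate, line-level comparison; it is not the functional evaluation used in the benchmark.

```python
# Hypothetical Code Editing check: how close is the model's edited file
# to the gold edited file? Approximate, line-level; not the official metric.
import difflib

def edit_similarity(model_code: str, gold_code: str) -> float:
    """Line-level similarity ratio between model and gold edited files."""
    return difflib.SequenceMatcher(
        None, model_code.splitlines(), gold_code.splitlines()
    ).ratio()

gold  = "<button class='btn btn-primary'>Send</button>"
model = "<button class='btn btn-primary'>Send</button>"
print(edit_similarity(model, gold))  # 1.0 when the edit matches exactly
```
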
Web VQA Performance
Model accuracy (%) by question type and language: 🧠 Reasoning, ⚙️ Functional, 🔎 Understanding. Best and runner-up models per size category are bold and underlined. Model names are color-coded by size: blue (<8B params), orange (8–12B), green (>12B), gray (proprietary).

| Model | English 🧠 | English ⚙️ | English 🔎 | French 🧠 | French ⚙️ | French 🔎 | German 🧠 | German ⚙️ | German 🔎 | Spanish 🧠 | Spanish ⚙️ | Spanish 🔎 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | 44.3 | 1.2 | 59.2 | 41.6 | 9.0 | 52.8 | 18.2 | 12.8 | 29.1 | 46.1 | 12.0 | 36.1 |
| Claude 3.5 Sonnet | 51.4 | 3.7 | 64.1 | 53.0 | 12.7 | 51.2 | 26.9 | 15.6 | 31.6 | 63.8 | 15.9 | 41.9 |
| Phi-3.5-Vision | 8.9 | 1.8 | 31.6 | 2.2 | 6.9 | 39.0 | 8.4 | 13.0 | 23.9 | 3.0 | 10.2 | 32.0 |
| UI-TARS | 19.3 | 8.1 | 47.6 | 7.7 | 8.9 | 47.6 | 7.8 | 14.3 | 28.4 | 20.9 | 14.0 | 38.8 |
| Molmo-7B | 12.3 | 3.8 | 32.9 | 7.0 | 7.5 | 47.6 | 8.3 | 13.7 | 31.9 | 15.1 | 10.3 | 32.0 |
| Qwen-VL-7B | 18.0 | 2.9 | 57.1 | 10.1 | 10.2 | 52.0 | 10.7 | 17.6 | 26.3 | 19.3 | 14.0 | 36.5 |
| Fuyu-8B | 1.6 | 0.4 | 14.3 | 0.0 | 1.3 | 17.5 | 1.0 | 5.6 | 15.7 | 0.7 | 1.5 | 10.9 |
| InternVL2.5-8B | 16.3 | 1.9 | 46.3 | 11.0 | 13.3 | 40.0 | 7.4 | 16.0 | 25.9 | 13.8 | 11.9 | 31.1 |
| GLM-4V-9B | 15.3 | 8.1 | 41.8 | 11.4 | 13.9 | 48.1 | 14.7 | 13.8 | 25.0 | 21.6 | 13.4 | 35.6 |
| Llama-3-Vision | 27.1 | 7.9 | 53.2 | 11.6 | 11.3 | 48.1 | 11.8 | 14.3 | 33.6 | 17.5 | 11.8 | 37.9 |
| Pixtral-12B | 27.1 | 9.2 | 44.9 | 17.7 | 11.3 | 53.4 | 19.5 | 19.3 | 21.7 | 28.7 | 17.8 | 40.2 |
| InternVL2.5-38B | 22.9 | 3.8 | 59.3 | 20.9 | 15.3 | 65.7 | 18.0 | 20.1 | 39.7 | 36.2 | 14.9 | 41.4 |
| Qwen-VL-72B | 23.6 | 4.3 | 53.7 | 16.9 | 13.9 | 54.5 | 15.3 | 17.5 | 36.2 | 29.1 | 12.7 | 41.0 |
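
The per-language, per-question-type numbers above reduce to a simple group-by over scored predictions. The sketch below shows one way to compute them; the record fields `language`, `qtype`, and `correct` are assumptions about how predictions might be stored, not the benchmark's actual output format.

```python
# Sketch: aggregate per-language, per-question-type accuracy from scored
# predictions. Field names are assumptions about the prediction records.
from collections import defaultdict

def accuracy_by_group(records: list[dict]) -> dict[tuple[str, str], float]:
    """Map (language, question type) -> accuracy in percent."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["language"], r["qtype"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {k: 100.0 * hits[k] / totals[k] for k in totals}

records = [
    {"language": "English", "qtype": "Reasoning", "correct": True},
    {"language": "English", "qtype": "Reasoning", "correct": False},
]
print(accuracy_by_group(records))  # {('English', 'Reasoning'): 50.0}
```
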
@inproceedings{awal2025webmmu,
title={WebMMU: A benchmark for multimodal multilingual website understanding and code generation},
author={Awal, Rabiul and Massoud, Mahsa and Li, Zichao and Feizi, Aarash and Wang, Suyuchen and Pal, Christopher and Agrawal, Aishwarya and Vazquez, David and Reddy, Siva and Rodriguez, Juan A and others},
booktitle={ICLR 2025 Third Workshop on Deep Learning for Code},
year={2025}
}