WebMMU: A Multimodal, Multilingual Benchmark for Website Understanding & Code Generation


Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
ServiceNow · Mila · Université de Montréal · McGill University · École de Technologie Supérieure (ETS) · Polytechnique Montréal
About

WebMMU is a comprehensive benchmark designed to push the boundaries of AI for the web. It challenges models to answer questions about websites, edit real HTML/CSS/JS code, and generate web layouts from mockups—across four languages and 20+ domains. Whether you're building smarter web agents or testing the limits of multimodal models, WebMMU is your go-to testbed.
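If the dataset is released on the Hugging Face Hub, loading a task split could look like the sketch below. The dataset ID, config name, and field names are illustrative assumptions, so check the official release for the real ones.

```python
# A minimal loading sketch, assuming the benchmark is published on the
# Hugging Face Hub. The dataset ID "ServiceNow/WebMMU", the config name
# "webqa", and the field names are hypothetical placeholders; consult the
# official release for the actual ones.
from datasets import load_dataset

webqa = load_dataset("ServiceNow/WebMMU", "webqa", split="test")

sample = webqa[0]
print(sample["question"])  # hypothetical field name
print(sample["language"])  # e.g. "en", "fr", "de", "es"
```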

[Figure: WebMMU overview]
Dataset Overview
| | English | Spanish | German | French | Total |
|---|---:|---:|---:|---:|---:|
| Website Images | 392 | 133 | 130 | 131 | 786 |
| WebQA | 1476 | 484 | 379 | 456 | 2795 |
| Mockup2Code | 180 | 93 | 85 | 78 | 436 |
| Code Editing | 165 | 75 | 67 | 68 | 375 |
| Total | 2213 | 785 | 661 | 733 | 4392 |
Tasks

WebQA

Answer questions about real website screenshots—test your model's ability to reason, ground, and understand UI elements and content.
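As a concrete illustration, here is a minimal scoring sketch for free-form WebQA answers using normalized exact match. The sample format and the normalization are assumptions made for illustration, not the benchmark's official evaluation protocol.

```python
# A minimal WebQA scoring sketch, assuming each sample pairs a screenshot
# and question with a short free-form reference answer. Illustrative only.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and extra whitespace for lenient matching."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions matching their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["Sign In", "4 items"], ["sign in", "four items"]))  # 0.5
```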

Mockup2Code

Turn hand-drawn or digital mockups into working HTML/CSS code. Evaluate how well your model translates design into code.
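One way to sanity-check generated layouts is to render both the generated and the reference HTML in a headless browser and compare the screenshots. The sketch below uses Playwright and Pillow as an illustrative stand-in, not WebMMU's official Mockup2Code metric.

```python
# Requires: pip install playwright pillow && playwright install chromium
# Renders two HTML strings with headless Chromium and reports a crude
# pixel-level similarity. Illustrative only.
from playwright.sync_api import sync_playwright
from PIL import Image, ImageChops

def render(html: str, path: str, width: int = 1280, height: int = 800) -> None:
    """Render an HTML string to a PNG screenshot with headless Chromium."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)
        page.screenshot(path=path)
        browser.close()

def pixel_similarity(path_a: str, path_b: str) -> float:
    """Return 1.0 for identical screenshots, approaching 0.0 as they diverge."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB").resize(a.size)
    hist = ImageChops.difference(a, b).histogram()  # 3 channels x 256 bins
    total = 0
    for channel in range(3):
        for value, count in enumerate(hist[channel * 256:(channel + 1) * 256]):
            total += value * count
    max_total = 255 * a.size[0] * a.size[1] * 3  # all pixels maximally different
    return 1.0 - total / max_total

render("<h1>Generated</h1>", "gen.png")
render("<h1>Reference</h1>", "ref.png")
print(pixel_similarity("gen.png", "ref.png"))
```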

Code Editing

Edit real website code based on user instructions. Can your model make precise, functional changes to HTML, CSS, or JS?
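A quick way to inspect an edit is to diff the model's output against the original file and confirm the change is localized to what the instruction asked for. The snippet below is a toy illustration, not the benchmark's scoring procedure.

```python
# A toy sketch of inspecting a code edit; the instruction and HTML below
# are made-up examples, not WebMMU data.
import difflib

# Instruction (example): "Change the submit button text color to red."
original = '<button class="btn" style="color: black;">Submit</button>\n'
edited = '<button class="btn" style="color: red;">Submit</button>\n'

diff = difflib.unified_diff(
    original.splitlines(keepends=True),
    edited.splitlines(keepends=True),
    fromfile="original.html",
    tofile="edited.html",
)
print("".join(diff))
```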

Results

WebQA Performance

Model accuracy (%) by question type and language. Question types: 🧠 Reasoning, ⚙️ Functional, 🔎 Understanding. Columns are abbreviated EN = English, FR = French, DE = German, ES = Spanish. Open models fall into three size categories (<8B, 8–12B, and >12B parameters); Gemini 2.0 Flash and Claude 3.5 Sonnet are proprietary.

| Model | 🧠 EN | ⚙️ EN | 🔎 EN | 🧠 FR | ⚙️ FR | 🔎 FR | 🧠 DE | ⚙️ DE | 🔎 DE | 🧠 ES | ⚙️ ES | 🔎 ES |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Gemini 2.0 Flash | 44.3 | 1.2 | 59.2 | 41.6 | 9.0 | 52.8 | 18.2 | 12.8 | 29.1 | 46.1 | 12.0 | 36.1 |
| Claude 3.5 Sonnet | 51.4 | 3.7 | 64.1 | 53.0 | 12.7 | 51.2 | 26.9 | 15.6 | 31.6 | 63.8 | 15.9 | 41.9 |
| Phi-3.5-VI | 8.9 | 1.8 | 31.6 | 2.2 | 6.9 | 39.0 | 8.4 | 13.0 | 23.9 | 3.0 | 10.2 | 32.0 |
| UI-TARS | 19.3 | 8.1 | 47.6 | 7.7 | 8.9 | 47.6 | 7.8 | 14.3 | 28.4 | 20.9 | 14.0 | 38.8 |
| Molmo-7B | 12.3 | 3.8 | 32.9 | 7.0 | 7.5 | 47.6 | 8.3 | 13.7 | 31.9 | 15.1 | 10.3 | 32.0 |
| Qwen-VL-7B | 18.0 | 2.9 | 57.1 | 10.1 | 10.2 | 52.0 | 10.7 | 17.6 | 26.3 | 19.3 | 14.0 | 36.5 |
| Fuyu-8B | 1.6 | 0.4 | 14.3 | 0.0 | 1.3 | 17.5 | 1.0 | 5.6 | 15.7 | 0.7 | 1.5 | 10.9 |
| InternVL2.5-8B | 16.3 | 1.9 | 46.3 | 11.0 | 13.3 | 40.0 | 7.4 | 16.0 | 25.9 | 13.8 | 11.9 | 31.1 |
| GLM-4V-9B | 15.3 | 8.1 | 41.8 | 11.4 | 13.9 | 48.1 | 14.7 | 13.8 | 25.0 | 21.6 | 13.4 | 35.6 |
| Llama-3-Vision | 27.1 | 7.9 | 53.2 | 11.6 | 11.3 | 48.1 | 11.8 | 14.3 | 33.6 | 17.5 | 11.8 | 37.9 |
| Pixtral-12B | 27.1 | 9.2 | 44.9 | 17.7 | 11.3 | 53.4 | 19.5 | 19.3 | 21.7 | 28.7 | 17.8 | 40.2 |
| InternVL2.5-38B | 22.9 | 3.8 | 59.3 | 20.9 | 15.3 | 65.7 | 18.0 | 20.1 | 39.7 | 36.2 | 14.9 | 41.4 |
| Qwen-VL-72B | 23.6 | 4.3 | 53.7 | 16.9 | 13.9 | 54.5 | 15.3 | 17.5 | 36.2 | 29.1 | 12.7 | 41.0 |
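For a quick read of the table, the snippet below averages the three question-type scores per language for one model, using the numbers from the Claude 3.5 Sonnet row above. Note that the paper may define its aggregate metric differently.

```python
# Average the reasoning, functional, and understanding scores per language
# for Claude 3.5 Sonnet, copied from the table above.
scores = {
    "English": (51.4, 3.7, 64.1),
    "French": (53.0, 12.7, 51.2),
    "German": (26.9, 15.6, 31.6),
    "Spanish": (63.8, 15.9, 41.9),
}
for lang, (reasoning, functional, understanding) in scores.items():
    print(f"{lang}: {(reasoning + functional + understanding) / 3:.1f}")
# English: 39.7, French: 39.0, German: 24.7, Spanish: 40.5
```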
BibTeX
@inproceedings{awal2025webmmu,
  title={WebMMU: A benchmark for multimodal multilingual website understanding and code generation},
  author={Awal, Rabiul and Massoud, Mahsa and Li, Zichao and Feizi, Aarash and Wang, Suyuchen and Pal, Christopher and Agrawal, Aishwarya and Vazquez, David and Reddy, Siva and Rodriguez, Juan A. and others},
  booktitle={ICLR 2025 Third Workshop on Deep Learning for Code},
  year={2025}
}