DL4C Workshop @ ICLR 2025
🌐 First Multilingual Benchmark for Web Agents

WebMMU: A Multimodal, Multilingual Benchmark for Website Understanding & Code Generation

Rabiul Awal¹* Mahsa Massoud¹ Zichao Li¹ Aarash Feizi¹ Suyuchen Wang¹ Christopher Pal² Aishwarya Agrawal³ David Vazquez¹ Siva Reddy¹ Juan A. Rodriguez⁵ Perouz Taslakian¹ Spandana Gella¹ Sai Rajeswar¹
¹ServiceNow ²Mila ³Université de Montréal ⁴McGill University ⁵École de Technologie Supérieure (ETS) ⁶Polytechnique Montréal
WebMMU is a comprehensive benchmark that evaluates AI models' ability to understand and interact with real websites. Unlike existing benchmarks that use synthetic or simplified data, WebMMU uses authentic website screenshots and real-world code, covering three critical tasks: answering complex questions about web interfaces, converting visual mockups into functional code, and making precise code edits. With 4,392 examples across four languages and 20+ website domains, all authored and quality-assured by 127 professionals, WebMMU reveals genuine model limitations that simpler datasets miss. This benchmark is essential for developing AI systems that can truly understand and manipulate web content in real-world scenarios.
*Corresponding: rabiul.awal at mila.quebec
[Figure: WebMMU cover]

About

WebMMU addresses a critical gap in AI evaluation: how well can models understand and manipulate real websites? Current benchmarks often use simplified or synthetic data, masking the true challenges of real-world web interaction.

As AI systems increasingly interact with web interfaces, from automated testing to web development assistance, we need benchmarks that reflect real-world complexity. WebMMU fills this gap by using real website screenshots, authentic HTML/CSS/JavaScript code, and professionally crafted scenarios that mirror actual use cases. By focusing on atomic, visually grounded tasks, WebMMU enables fine-grained diagnosis of model strengths and weaknesses in reasoning, grounding, and code manipulation.
[Figure: WebMMU benchmark overview]

Key Features

  • 📊 Data Sources: Websites sourced from FineWeb (CommonCrawl) and filtered with heuristics toward everyday activities such as shopping, booking appointments, travel, and review sites.
  • 🌐 Multilingual: English, Spanish, German, French, revealing 20–40 point performance drops across languages.
  • 🧩 Three Core Tasks: WebQA (complex, visually-grounded questions), Mockup2Code (hand-drawn to code), Code Editing (novel, real-world code patching with functional verification).
  • 🖥️ Multi-panel Screenshots: Multiple panels per example mimic a realistic browsing experience.
  • 🖥️ Real-World Data: 20+ domains, 4,392 examples, all expert-annotated and quality-assured by 127 professionals.
  • 🔍 Fine-Grained Evaluation: LLM-as-Judge and human scoring (89–91% agreement), with breakdowns by reasoning, agentic action, and code correctness.

Tasks

WebMMU evaluates three fundamental capabilities that AI systems need for real-world web interaction. Each task targets specific skills that are essential for building intelligent web agents and development tools.

WebQA

Ability to understand and reason about website content and functionality through visual analysis.

Example questions: "Which button should a user click to view their order history?" or "Sum the prices of all items in the shopping cart."

Measures: Spatial reasoning, understanding UI hierarchy, and connecting visual elements to functionality.

Mockup2Code

Ability to translate visual designs into functional HTML/CSS code that accurately reproduces the intended layout and styling.

Example: "Generate code for this login page sketch, preserving layout and style."

Measures: Understanding design intent, maintaining visual fidelity, and producing clean, maintainable code.

Code Editing

Ability to make precise, functional modifications to existing website code based on user requirements.

Example: "Add a dark mode toggle to the navbar and ensure all text remains readable."

Measures: Understanding code structure, preserving functionality, and making targeted changes without unintended side effects.

Dataset Overview

WebMMU's dataset is built with rigorous quality standards to ensure it reflects real-world complexity.

Data Collection & Quality Assurance

Our dataset spans diverse real-world scenarios: e-commerce platforms, government websites, educational portals, news sites, and more. This broad coverage ensures that models are tested on the full spectrum of web interfaces they might encounter in practice. Each example undergoes a three-stage quality-assurance process involving 127 professionals: it is authored by an expert annotator with domain knowledge, then reviewed by multiple professionals to ensure accuracy and relevance.

Task             English  Spanish  German  French  Total
Website Images       392      133     130     131    786
WebQA              1,476      484     379     456  2,795
Mockup2Code          180       93      85      78    436
Code Editing         165       75      67      68    375
Total              2,213      785     661     733  4,392

Explore Dataset

Browse real samples from each task directly on Hugging Face to understand the diversity and complexity of the WebMMU benchmark. Each dataset contains authentic examples with expert annotations and quality assurance.

Code Editing Dataset

Explore real-world code editing examples where models must make precise modifications to existing HTML/CSS/JavaScript code based on user instructions.

Mockup2Code Dataset

Browse hand-drawn and digital mockups that models must convert into functional HTML/CSS code, testing visual-to-code translation capabilities.

WebQA Dataset

Explore complex, visually-grounded questions about real website screenshots, testing models' ability to understand and reason about web interfaces.
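
For quick programmatic access, the subsets can be loaded with the Hugging Face `datasets` library. The sketch below is illustrative only: the repository id, configuration names, and split are placeholders, so check the dataset cards linked above for the exact identifiers and field names.

```python
# Minimal sketch (not official docs): load the WebMMU subsets with the
# Hugging Face `datasets` library. The repo id, config names, and split
# below are placeholders; use the identifiers from the dataset cards.
from datasets import load_dataset

REPO_ID = "ServiceNow/WebMMU"  # placeholder repository id

webqa = load_dataset(REPO_ID, name="webqa", split="test")                # assumed config/split
mockup2code = load_dataset(REPO_ID, name="mockup2code", split="test")    # assumed config/split
code_editing = load_dataset(REPO_ID, name="code_editing", split="test")  # assumed config/split

print(webqa[0])  # field names depend on the dataset card
```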

Results & Key Insights

WebMMU's comprehensive evaluation reveals significant gaps in current AI capabilities for real-world web interaction. Our results provide actionable insights for researchers and developers working on multimodal AI systems.

Detailed Results

WebMMU uses a rigorous evaluation protocol combining LLM-as-Judge scoring (validated by human annotators with 89–91% agreement) and automatic metrics (BLEU, TreeBLEU, visual similarity). This multi-faceted approach enables fine-grained analysis of model performance across reasoning, grounding, and code correctness.
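
To make the protocol concrete, here is a minimal sketch of how automatic and judge-based scoring can be combined. It is not the paper's exact harness: `sacrebleu` stands in for the BLEU metric, `call_judge_model` is a placeholder for whatever LLM API serves as the judge, and the prompt wording is illustrative.

```python
# Sketch of a mixed automatic + LLM-as-Judge scoring loop (not the paper's
# exact protocol). `call_judge_model` is a placeholder for an LLM API call;
# sacrebleu provides the surface-level BLEU metric.
import sacrebleu

JUDGE_PROMPT = """You are grading an answer to a question about a website screenshot.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def bleu_score(predictions, references):
    # Corpus-level BLEU between model outputs and reference answers.
    return sacrebleu.corpus_bleu(predictions, [references]).score

def judge_score(question, reference, prediction, call_judge_model):
    # call_judge_model(prompt) -> str is assumed to wrap the judge LLM.
    reply = call_judge_model(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return int(reply.strip()[0])  # naive parse of the 1-5 score
```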

Web VQA Performance

WebQA evaluates models on their ability to answer questions about real website screenshots across three categories: 🧠 Reasoning (complex logical thinking), ⚙️ Agentic UI Actions (navigation and interaction), and 🔎 Content Understanding (basic information extraction). Most models struggle with complex reasoning and agentic UI actions, especially outside English.

Model accuracy (%) by question type and language (🧠 Reasoning, ⚙️ Agentic UI Actions, 🔎 Content Understanding).

Model               English            French             German             Spanish
                    🧠    ⚙️    🔎     🧠    ⚙️    🔎     🧠    ⚙️    🔎     🧠    ⚙️    🔎
Claude 3.5 Sonnet   51.4   3.7  64.1   53.0  12.7  51.2   26.9  15.6  31.6   63.8  15.9  41.9
Gemini 2.0 Flash    44.3   1.2  59.2   41.6   9.0  52.8   18.2  12.8  29.1   46.1  12.0  36.1
QwenVL-72B          23.6   4.3  53.7   16.9  13.9  54.5   15.3  17.5  36.2   29.1  12.7  41.0

Mockup2Code Results

Mockup2Code evaluates how well models can generate HTML/CSS code from hand-drawn or digital web mockups. Models are scored on a 1-5 scale for both visual similarity (how well the generated code matches the visual design) and code quality (cleanliness, maintainability, and best practices). While proprietary models perform well on simple layouts, all models struggle with complex or deeply nested UI structures.
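
As a rough, purely illustrative proxy for the visual-similarity dimension (the benchmark itself relies on 1-5 judge and human ratings), one could render the generated HTML headlessly and compare it to the reference screenshot. Playwright and scikit-image in the sketch below are assumptions for illustration, not the benchmark's tooling.

```python
# Illustrative only: render generated HTML with Playwright and compare it to a
# reference screenshot using SSIM. This is NOT the benchmark's official metric,
# which uses LLM-as-Judge and human ratings on a 1-5 scale.
import numpy as np
from PIL import Image
from playwright.sync_api import sync_playwright
from skimage.metrics import structural_similarity as ssim

def render_html(html: str, path: str, width: int = 1280, height: int = 800) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)                 # load the generated markup directly
        page.screenshot(path=path)
        browser.close()

def visual_similarity(generated_html: str, reference_png: str) -> float:
    render_html(generated_html, "generated.png")
    ref = np.array(Image.open(reference_png).convert("L"))
    gen = np.array(Image.open("generated.png").convert("L").resize(ref.shape[::-1]))
    return ssim(ref, gen)                      # 1.0 means identical renders
```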

[Figure: Mockup2Code performance results]

Code Editing Results

Code Editing tests whether models can make precise, functional changes to real website code based on user instructions. Models are evaluated on correctness (does the code work as intended?) and functionality (does it preserve existing features?). Despite advances in code generation, no model reliably produces correct, ready-to-use code edits—manual fixes are still required for production use.
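
For intuition about what functional verification can involve, the sketch below loads an edited page in a headless browser, listens for uncaught JavaScript errors, and checks that an element implied by the instruction is present. The selector `#dark-mode-toggle` is hypothetical, and this is not the benchmark's official harness.

```python
# Illustrative smoke test for an edited page (not the benchmark's official
# functional-verification harness). It loads the edited HTML headlessly,
# records uncaught JavaScript errors, and asserts that a selector expected
# by the instruction (here, a hypothetical "#dark-mode-toggle") exists.
from playwright.sync_api import sync_playwright

def smoke_test(edited_html: str, required_selector: str = "#dark-mode-toggle") -> bool:
    errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("pageerror", lambda exc: errors.append(str(exc)))  # uncaught JS errors
        page.set_content(edited_html)
        found = page.query_selector(required_selector) is not None
        browser.close()
    return found and not errors
```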

[Figure: Code editing performance results]

Key Insights

Grounding Is the Hardest, Reasoning Comes Next, General Understanding Is the Easiest

WebMMU shows a clear difficulty gap across tasks. Most models handle basic visual understanding (like reading labels and identifying images) fairly well. They do worse at multi-step reasoning, such as performing calculations or combining information from different parts of a web page. But the hardest challenge is grounding — identifying the exact location of elements on a page and reasoning about user actions (e.g., where to click). For example, while many models could list navigation categories correctly, few could pinpoint where to click to open the "About Us" page, with grounding accuracy often falling below 10%. This reflects a gap between recognizing content and understanding how users interact with it.

Simple Layouts Are Fine, But Complex UI Hierarchies Break the Models

When turning design mockups into HTML/CSS code, most models succeed on simple, flat layouts. But as soon as mockups include nested sections, multi-column layouts, or complex styling, the models break down. They often flatten the hierarchy, misalign elements, or miss relationships between components. For example, while they can correctly generate a basic "Contact Us" page, they struggle with a product page featuring sidebars, filters, and product grids. This suggests current models understand basic page layouts but lack deeper comprehension of modern, structured web design.

Code Editing: Models Generate Edits, But Risk Breaking the Site

In code editing tasks, models can follow instructions like "Add a header with search and post buttons," but often produce edits that break the site's structure or behavior. While the syntax of their HTML/CSS/JavaScript is mostly correct, they miss subtle dependencies, like class names or JavaScript functions, that keep the page functional. Even top models cannot yet generate reliable, ready-to-deploy code patches. This makes human review essential for all but the simplest edits.

Open-Source Models Lag Behind Closed-Source Models, Especially on Complex Tasks

Across all tasks, closed-source models like Gemini 2.0 Flash and Claude 3.5 consistently outperform open-source alternatives. They show better grounding, more accurate reasoning, and higher-quality code generation. For example, in the mockup-to-code task, closed-source models score above 4 out of 5 on simple layouts, while open-source models often struggle to score above 3. However, even the best models — closed or open — fail on complex designs, especially with nested layouts and precise spacing. Open-source models also suffer from greater multilingual performance drops, highlighting the training and resource gap between the two categories.

Multilingual Tasks Expose Major Gaps in Cross-Lingual Generalization

WebMMU covers English, Spanish, German, and French. Across all tasks, performance in non-English languages drops significantly — sometimes by more than half. Grounding and reasoning suffer most in these languages. This reveals that despite large training datasets, models haven't yet learned to generalize well to multilingual websites, which often have layout and content differences across languages.

The Big Picture: Real-World Web Automation Remains a Challenge

WebMMU shows that while AI models are progressing, they remain far from automating real-world web development. They can extract basic information and generate simple UI code, but they struggle with reasoning, structured code generation, precise edits, and multilingual scenarios. Closing this gap will require better multimodal reasoning, web-specific model architectures, and stronger cross-lingual capabilities — essential steps toward building truly intelligent web automation agents.

Citation

@inproceedings{awal2025webmmu,
  title     = {WebMMU: A benchmark for multimodal multilingual website understanding and code generation},
  author    = {Awal, Rabiul and Massoud, Mahsa and Li, Zichao and Feizi, Aarash and Wang, Suyuchen and Pal, Christopher and Agrawal, Aishwarya and Vazquez, David and Reddy, Siva and Rodriguez, Juan A and others},
  booktitle = {ICLR 2025 Third Workshop on Deep Learning for Code},
  year      = {2025}
}