Train AI agents on Web UI with WaveUI-25K dataset
Read Time 5 mins | Written by: Cole
LLMs are trained on text, images, and audio. For the next step—agents capable of completing complex tasks—they must learn to navigate user interfaces like humans do. Buttons, clicks, and all the intricacies of UI design are crucial for this.
The WaveUI-25K dataset introduces one of the first and most comprehensive collections for training agents in UI interaction. With 25,000 annotated images focusing on UI elements, it’s invaluable for enhancing ML models that automate and understand UIs.
WaveUI-25k dataset on Hugging Face
The WaveUI-25k dataset contains 25,000 examples of labeled UI elements. It’s a subset of a collection of roughly 80,000 preprocessed examples assembled from three source datasets: WebUI, Roboflow Website Screenshots, and GroundUI-18K.
WaveUI-25k is a filtered compilation of these three datasets: duplicated, overlapping, and low-quality datapoints were removed, along with many text elements that didn’t fit the scope of this work.
Original fields from the source datasets are included in WaveUI-25k, and the following fields were added for each UI element:
- name: A descriptive name of the element.
- description: A long, detailed description of the element.
- type: The type of the element.
- OCR: The OCR text of the element. Set to null if no text is available.
- language: The language of the OCR text. Set to null if no text is available.
- purpose: The general purpose of the element.
- expectation: What is expected to happen when you click the element.
Here’s an example: an “About Us” button from a website, annotated with each of these fields.
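To see what these records look like in practice, here’s a minimal Python sketch that loads WaveUI-25k with the Hugging Face datasets library and prints the fields above for one element. The repo ID agentsea/wave-ui-25k and the exact field names are assumptions based on the Hugging Face listing, so verify them against the dataset card.

```python
from datasets import load_dataset

# ASSUMPTION: the Hub repo ID below is a guess -- confirm it on the dataset card.
ds = load_dataset("agentsea/wave-ui-25k", split="train")

# Each row describes one annotated UI element.
example = ds[0]
for field in ["name", "description", "type", "OCR", "language", "purpose", "expectation"]:
    print(f"{field}: {example.get(field)}")
```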
You can see all the other elements in the dataset here to get an idea of what’s included.
Data sources used for WaveUI-25k
Here’s an overview of each of the datasets that were combined to create the WaveUI-25k dataset.
WebUI dataset – This dataset aims to bridge the gap between web and mobile UI understanding by leveraging the abundance of web data to improve models for mobile interfaces, where labeled data is less available.
- Size: 400,000 rendered web pages
- Content: Web pages with associated automatically extracted metadata
- Source: Crawled from the web
- Purpose: To support visual UI modeling and understanding
- Key features:
  - Large-scale dataset of rendered web pages
  - Automatically extracted metadata for each page
  - Despite some noise in the extracted data, most examples are suitable for visual UI modeling
  - Designed to improve performance of visual UI understanding models, particularly in the mobile domain where labeled data is scarce
Roboflow Website Screenshots – The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1,000 of the world's top websites. The screenshots have been automatically annotated to label the following classes:
- button - navigation links, tabs, etc.
- heading - text that was enclosed in <h1> to <h6> tags.
- link - inline, textual <a> tags.
- label - text labeling form fields.
- text - all other text.
- image - <img>, <svg>, or <video> tags, and icons.
- iframe - ads and 3rd party content.
Each example pairs a screenshot image with bounding-box annotations for these classes.
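If you export the dataset in COCO format (one of the export formats Roboflow offers), a short sketch like this tallies the annotations per class. The file name below follows Roboflow’s usual export convention, but treat it as an assumption and check your actual download.

```python
import json
from collections import Counter

# Load a COCO-format export of the Roboflow Website Screenshots dataset.
# ASSUMPTION: "_annotations.coco.json" is Roboflow's typical export file name.
with open("_annotations.coco.json") as f:
    coco = json.load(f)

# Map category IDs to class names (button, heading, link, label, ...).
id_to_class = {c["id"]: c["name"] for c in coco["categories"]}

# Count bounding boxes per class across all screenshots.
counts = Counter(id_to_class[a["category_id"]] for a in coco["annotations"])
for cls, n in counts.most_common():
    print(f"{cls}: {n} boxes")
```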
You can use these annotated screenshots for robotic process automation (RPA). They’re especially useful for training LLMs to act as agents and automate website tasks. This dataset would cost over $4,000 for humans to label on popular labeling services.
GroundUI-18K – The GroundUI-18K dataset on Hugging Face is a collection of 18,000 annotated images designed for training and evaluating user interface (UI) grounding models.
It includes images, instructions, and bounding box annotations that help models understand and interact with UIs by identifying elements like buttons, text fields, and icons.
The dataset is stored in Parquet format and is used to enhance machine learning models for UI automation tasks.
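Since the data ships as Parquet, you can also read it straight from the Hub with pandas. In this sketch the repo ID agent-studio/GroundUI-18K and the shard path are assumptions; copy the real path from the dataset’s “Files and versions” tab.

```python
import pandas as pd

# Reading Parquet over the hf:// filesystem requires the huggingface_hub package.
# ASSUMPTION: the repo ID and shard file name below are guesses -- verify on the Hub.
df = pd.read_parquet(
    "hf://datasets/agent-studio/GroundUI-18K/data/train-00000-of-00001.parquet"
)

print(df.columns.tolist())        # expect fields like image, instruction, bbox
print(df.iloc[0]["instruction"])  # the natural-language instruction for one element
```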
Imagine a financial analyst who needs to quickly extract insights from hundreds of complex financial reports in spreadsheet format. An agent trained on UI datasets like these could navigate the spreadsheet software itself and pull those insights out automatically.
Check out the whole WaveUI-25k dataset here on Hugging Face.
Want to hire senior engineers to build AI agents with these UI datasets?
You could spend the next 6-18 months recruiting and building an AI team (if you can afford it), but during that time you won’t be shipping any AI capabilities. That’s why Codingscape exists.
We can assemble a senior AI development team to start building internal AI agent tools for you in 4-6 weeks. It’ll be faster to get started, more cost-efficient than internal hiring, and we’ll deliver high-quality results quickly.
Zappos, Twilio, and Veho are just a few companies that trust us to build their software and systems with a remote-first approach.
You can schedule a time to talk with us here. No hassle, no expectations, just answers.
Cole is Codingscape's Content Marketing Strategist & Copywriter.