
AutoGUI: Scaling GUI Grounding with Autonomous Functionality Annotations from LLMs


This repo open-sources the training and evaluation code for AutoGUI, an automatic and scalable GUI annotation pipeline.


AutoGUI pipeline - Revolutionizing Large-Scale GUI Data Annotation

Existing UI annotation methods typically collect data from static UIs, focusing on describing either the visual appearance (e.g., a button beside the navigation bar), element categories (e.g., “menu button”), or brief functions weakly related to the UI context (e.g., “show more information”).

Here, we are thrilled to unveil AutoGUI, a groundbreaking and scalable UI annotation pipeline. AutoGUI can autonomously annotate the contextual functionalities of diverse UI elements at scale, entirely eliminating the need for human experts. This innovation not only accelerates the data collection process but also enhances the depth and accuracy of UI functionality descriptions, opening a new path in the field of UI annotation.


Illustration of AutoGUI Pipeline

AutoGUI starts by collecting interaction trajectories on Common Crawl websites. Each trajectory step captures all interactable elements and the accessibility tree (AXTree), which briefly outlines the UI structure. The content changes in the AXTrees before and after each interaction are then used by an open-source LLM (e.g., Llama-3-70B) to predict functionality annotations for the interacted elements.

This annotation process yields rich functional semantics in the generated annotations, thereby allowing us to curate a GUI dataset that can potentially enhance the GUI understanding capabilities of GUI agents.
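
To make the annotation step concrete, here is a minimal sketch of turning an AXTree diff into a functionality annotation with an LLM served behind an OpenAI-compatible endpoint. The helper name, prompt wording, endpoint, and model id are illustrative assumptions, not the repo's actual code:

# Illustrative sketch only: the prompt wording, endpoint, and model id are
# assumptions, not the pipeline's actual implementation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate_functionality(axtree_before: str, axtree_after: str, element_html: str) -> str:
    """Infer what the interacted element does from the AXTrees before/after the interaction."""
    prompt = (
        "An element on a web page was clicked. Given the accessibility trees "
        "before and after the click, describe the element's functionality.\n\n"
        f"Element: {element_html}\n\n"
        f"AXTree before:\n{axtree_before}\n\n"
        f"AXTree after:\n{axtree_after}"
    )
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content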

Installation

You can install the AutoGUI package by cloning the repository and running the following commands:

git clone https://github.com/BraveGroup/AutoGUI
cd AutoGUI
pip install -e .

Additional Packages

For evaluation, please also install LLaVA, vLLM (==0.4.0), and SGLang (==0.1.14) by following their respective installation instructions.

Note that installing these packages at the pinned versions is important; other releases may conflict with the evaluation code.

AutoGUI Dataset

Training Set

We provide 625k functionality grounding/captioning tasks that are generated by populating task templates with the collected element-functionality pairs. To mitigate the distribution gap across device types, the screenshots are rendered at various resolutions to mimic both web browsers and mobile devices.

Please view the training data here. To reduce the burden of preprocessing the Parquet-format data, we also provide a tar-format data file here.
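
As a quick way to inspect the training set, here is a minimal loading sketch using the Hugging Face datasets library; the data_files path is a placeholder for wherever you downloaded the Parquet files, and the exact field names may differ:

# Sketch: load the Parquet-format training data for inspection.
# The data_files path is a placeholder; adjust it to your download location.
from datasets import load_dataset

ds = load_dataset("parquet", data_files="path/to/autogui_625k/*.parquet", split="train")
print(len(ds))       # expect ~625k samples
print(ds[0].keys())  # inspect the available fields before preprocessing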

A functionality grounding example:

User: In this web page image, please locate the element as I describe it (with point). This element triggers a user registration process, allowing new users to create a PayPal account and gain access to the platform's services.

Assistant: (91,6)

A functionality captioning example:

User: What happens when you tap position (61,73) on the screen?

Assistant: This element serves as an input field for users to provide their birth date, contributing to the registration process by ensuring that users meet the age requirements for creating a Yahoo account.

FuncPred - Functionality Grounding Benchmark

We also curate a 2k split used for evaluating the functionality grounding capabilities of existing vision-language models (VLMs). This split contains 1k samples at web resolution (1280×720) and 1k at mobile resolution (428×746).

Download this test split on Google Drive.

Each test sample contains:

  • image: the GUI screenshot.
  • func: the functionality annotation of a target element on the screenshot.
  • point: the center point (X,Y) of the target element. Note that the coordinates are normalized to the range 0-100 (see the conversion sketch after this list).
  • unnormalized_box: the bounding box of the target element in the image coordinate frame.
  • elem_text: the displayed or alt text of the element.
  • elem_tag: the HTML tag of the element.
  • device: the device type of the screenshot.
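
Since the point coordinates are normalized to 0-100 while unnormalized_box is in pixels, evaluating a predicted point requires a unit conversion. A minimal sketch, assuming the box is laid out as (x0, y0, x1, y1) in pixel coordinates:

# Sketch: check whether a predicted (X, Y) point, normalized to 0-100,
# falls inside the target element's pixel-space bounding box.
# The (x0, y0, x1, y1) box layout is an assumption about the field format.
def point_in_box(point, image_width, image_height, box):
    x = point[0] / 100.0 * image_width
    y = point[1] / 100.0 * image_height
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1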

Finetuning Code

  1. Prepare Data

After downloading the tar-format data, please generate a JSON file that records all samples, using the absolute image paths required by the Qwen-VL model.

For example, the conversations field must start with a user message that looks like "<img>path/to/autogui_625k/1_web.png</img>\n (instruction)", as sketched below.
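
A minimal sketch of generating such a JSON file; the samples variable is assumed to hold (image_path, instruction, answer) triples extracted from the tar-format data, and the record layout follows Qwen-VL's conversation format:

# Sketch: build Qwen-VL-style JSON records with absolute image paths.
# `samples` is assumed to hold (image_path, instruction, answer) triples.
import json, os

records = []
for i, (image_path, instruction, answer) in enumerate(samples):
    records.append({
        "id": f"autogui_{i}",
        "conversations": [
            {"from": "user", "value": f"<img>{os.path.abspath(image_path)}</img>\n{instruction}"},
            {"from": "assistant", "value": answer},
        ],
    })

with open("autogui_625k.json", "w") as f:
    json.dump(records, f, ensure_ascii=False)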

  2. Finetuning Qwen-VL-Chat

Set the data_path in finetune/finetune_autogui_lora.sh and then run it.

Evaluation code

Our evaluation code is adapted from lmms-eval. To evaluate a model on a specific UI grounding benchmark, run this command:

python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model autogui \
    --model_args pretrained=WebAgent/AutoGUI-Qwen-v0.1-LoRA \
    --tasks func_pred_rec \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix autogui_funcpred \
    --output_path ./logs/ \
    --limit 0.01 # Optional: for debugging, evaluate only 1% of the samples

The evaluation tasks used in our paper include: func_pred_rec, screenspot_rec, refexp, motif, vwb.

The supported models include: autogui, qwen_vl_chat, llava_sglang, llava_hf, deepseek_vl_chat, and cogagent. If autogui is used, the pretrained argument can be either a LoRA model path that contains only the adapter or a merged model path.
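
If you prefer to pass a merged model path, one way is to fold the LoRA weights into the base model with peft. A minimal sketch; the output path is a placeholder:

# Sketch: merge the LoRA adapter into the base model so that a single
# merged checkpoint can be passed as the `pretrained` argument.
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "WebAgent/AutoGUI-Qwen-v0.1-LoRA",  # adapter-only checkpoint
    trust_remote_code=True,             # Qwen-VL ships custom modeling code
)
merged = model.merge_and_unload()       # fold the LoRA weights into the base weights
merged.save_pretrained("./autogui-qwen-merged")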

Acknowledgement

Our code is based on Qwen-VL, SeeClick, and lmms-eval. We thank the authors for their open-source work.
