GPT-4V(ision) is a Generalist Web Agent, if Grounded

Code, Dataset, and Demo for the paper "GPT-4V(ision) is a Generalist Web Agent, if Grounded".

See the project website for an overview and demo videos.

Release progress:

  • Dataset
    • Example data for the three element grounding methods
    • Data used in the paper with screenshot images
  • Code
    • Offline Experiments
      • Screenshot generation
      • Code to overlay image annotation
      • BLIP-2 fine-tuning
    • Online Evaluation Tool
  • Models
    • Fine-tuned BLIP-2 Model

Dataset

The dataset is derived from Mind2Web by pairing each HTML document with its rendered webpage screenshot. The screenshot images come from the Mind2Web "Raw Dump with Full Traces and Snapshots", which was captured with Playwright during data annotation.

Screenshot Generation

These scripts collect screenshot images from the Mind2Web raw dump and overlay image annotations on them for action grounding.
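As a rough illustration of the overlay step, the sketch below draws a labeled bounding box on a screenshot. It assumes Pillow is installed; the function name `overlay_box`, the `(left, top, right, bottom)` box format, and the label placement are hypothetical choices for this example and are not taken from the repository's scripts.

```python
from typing import Tuple

from PIL import Image, ImageDraw

# Assumed box format: (left, top, right, bottom) in screenshot pixels.
Box = Tuple[int, int, int, int]


def overlay_box(
    screenshot: Image.Image,
    box: Box,
    label: str,
    color: str = "red",
    width: int = 3,
) -> Image.Image:
    """Return a copy of the screenshot with one labeled bounding box drawn on it."""
    annotated = screenshot.copy()  # leave the original image untouched
    draw = ImageDraw.Draw(annotated)
    # Outline only, so the element inside the box stays visible.
    draw.rectangle(box, outline=color, width=width)
    left, top, _, _ = box
    # Place the label just above the box, clamped to the top edge of the image.
    draw.text((left, max(0, top - 12)), label, fill=color)
    return annotated
```

A call such as `overlay_box(Image.open("page.png").convert("RGB"), (20, 20, 120, 60), "A")` would return the annotated copy; the file name here is only a placeholder.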

Online Evaluation Tool

We developed a new online evaluation tool, built on Playwright, for evaluating web agents on live websites. The tool converts each predicted action into a browser event and executes it on the website.
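The conversion from a predicted action to a browser event could look roughly like the sketch below. It is not the tool's actual code: the `PredictedAction` container, its field names, and the use of CSS selectors are assumptions made for illustration, and the three operations follow Mind2Web's CLICK, TYPE, and SELECT. Only Playwright's `page.locator(...)`, `click()`, `fill()`, and `select_option(label=...)` calls are used.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PredictedAction:
    """Hypothetical container for one agent prediction (not the repo's schema)."""

    operation: str                # "CLICK", "TYPE", or "SELECT"
    selector: str                 # CSS selector of the grounded target element
    value: Optional[str] = None   # text to type, or the option label to select


def execute_action(page: Any, action: PredictedAction) -> None:
    """Dispatch a predicted action to the matching Playwright locator call.

    `page` is expected to behave like a Playwright `Page` object.
    """
    target = page.locator(action.selector)
    operation = action.operation.upper()
    if operation == "CLICK":
        target.click()
    elif operation == "TYPE":
        if action.value is None:
            raise ValueError("TYPE requires a value")
        target.fill(action.value)
    elif operation == "SELECT":
        if action.value is None:
            raise ValueError("SELECT requires a value")
        target.select_option(label=action.value)
    else:
        raise ValueError(f"Unsupported operation: {action.operation}")
```

With Playwright installed, `page` would typically come from `sync_playwright()` via `browser.new_page()`, after which `execute_action(page, PredictedAction("CLICK", "#search"))` would click the element matching the selector. The selector here is only a placeholder.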

We acknowledge Xiang Deng for his initial contribution to this tool.

Contact

Questions or issues? File an issue or contact Boyuan Zheng.

Licensing Information

The code in this repository is released under the MIT License.

Disclaimer

The code was released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potentially harmful use of the data or technology by any party.

Citation Information

If you find this dataset useful, please consider citing our paper:

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
