Code Monkey home page Code Monkey logo

droidbot-gpt's Introduction

demo

DroidBot-GPT

About

DroidBot-GPT is an easy-to-use UI agent framework for Android. It can control a smartphone app automatically based on a natural language task description. It is built upon DroidBot.

Implementing a domain-specific mobile agent is very simple with DroidBot-GPT. You only need to modify two files: droidbot/input_policy.py which specifies how to call LLM based on the task and UI state and droidbot/device_state.py which specifies how to convert the UI state to something that LLM (or other multi-modal models) can understand.

The policy implemented in this repo is a naive baseline for mobile task automation. It simply converts the view list to text, and sends the text to GPT and ask for the next action to perform. A more advanced version (with app-specific memory injection and other optimizations) can be found at AutoDroid.

Technical report of this repo:

Hao Wen, Hongming Wang, Jiaxuan Liu, Yuanchun Li. "DroidBot-GPT: GPT-powered UI Automation for Android"

How to install

Make sure you have:

  1. Python (both 2 and 3 are supported)
  2. Java
  3. Android SDK
  4. Added platform_tools directory in Android SDK to PATH
  5. Access to an LLM/LVM/LMM (e.g. ChatGPT API)

Then clone this repo and install with pip:

git clone https://github.com/MobileLLM/DroidBot-GPT.git
cd DroidBot-GPT/
pip install -e .

If successfully installed, you should be able to execute python start.py -h.

How to use

  1. Prepare:

    • An app to use. Download the .apk file to your host machine.
    • A device or an emulator connected to your host machine via adb.
    • Set the OpenAI URL and API KEY with environment variables GPT_URL and GPT_KEY.
  2. Start DroidBot-GPT:

    python start.py -a <path_to_apk> -o output_dir -task <your_task>
    

    That's it! The options are mostly the same as DroidBot except for the -task option, where you can specify any task you want to complete. For example,

    • Create a contact named Alice with phone number 1234567.
    • Book a table for 4 people on Saterday.
    • Send a message to Sam to have a chat tonight.
    • ...

After execution, the action trace and screenshots can be found in output_dir/index.html.

How to design your own agent

  1. Decide which LLM (or other foundation models) to use. Modify how to query the model in the _query_llm function in droidbot/input_policy.py.

  2. Decide how you want to represent the app UI in droidbot/device_state.py. Each DeviceState instance is a snapshot of current smartphone UI, which include the view hierarchy (the views field), the screenshot (the screenshot_path field), etc.

  3. Design how to query the foundation model with the UI representation. Modify the code in the _get_action_with_LLM function in droidbot/input_policy.py. The function takes the current state (current_state parameter), the previous actions (action_history parameter) and the task description (self.task field) as input, and its job is to decide which action to take in the next step.

To further improve the agent performance, you may consider optimize how you invoke the LLM(s) and/or train/fine-tune the LLM(s).

Enjoy!

droidbot-gpt's People

Contributors

yuanchun-li avatar yzygitzh avatar jasonwangsonghe avatar nastya avatar hannol avatar codeslord avatar connglli avatar wenh18 avatar miniminyi avatar ilian avatar adepasquale avatar rrua avatar saumopal97 avatar ddlucad96 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.