
self-operating-computer's Introduction

Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.

Key Features

  • Compatibility: Designed for various multimodal models.
  • Integration: Currently integrated with GPT-4v, Gemini Pro Vision, and LLaVa.
  • Future Plans: Support for additional models.

Ongoing Development

At HyperwriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click-location predictions.

Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up here.

Demo

Demo video: final-low.mp4

Run Self-Operating Computer

  1. Install the project:
     pip install self-operating-computer
  2. Run the project:
     operate
  3. Enter your OpenAI key: if you don't have one, you can obtain an OpenAI key here.
  4. Give the Terminal app the required permissions: as a last step, the Terminal app will ask for "Screen Recording" and "Accessibility" permissions in the "Security & Privacy" page of macOS's "System Preferences".

Using operate Modes

Multimodal Models -m

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.

Start operate with the Gemini model

operate -m gemini-pro-vision

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

Try Claude -m claude-3

Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.

operate -m claude-3

Try LLaVa Hosted Through Ollama -m llava

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently only supports macOS and Linux.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

ollama pull llava

This downloads the model to your machine; it takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

ollama serve

That's it! Now start operate and select the LLaVA model:

operate -m llava

Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama at its GitHub Repository

Voice Mode --voice

The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

cd into the directory:

cd self-operating-computer

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt

Install the device requirements. For Mac users:

brew install portaudio

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio

Run with voice mode

operate --voice

Optical Character Recognition Mode -m gpt-4-with-ocr

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements keyed by their text, with screen coordinates as values. GPT-4 can decide to click an element by its text, and the code then looks that text up in the hash map to get the coordinates of the element GPT-4 wanted to click.
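As a rough illustration of the idea only (the element names, helper function, and dictionary layout below are illustrative assumptions, not the project's actual code):

# Illustrative sketch of the OCR hash-map lookup, not the project's actual code.
# Assumption: an OCR pass has already produced text snippets with their screen positions.

ocr_elements = {
    # "visible text" -> (x, y) centre of the detected bounding box, as screen-percentage coordinates
    "Sign in": (0.87, 0.06),
    "Search": (0.50, 0.42),
}

def click_coordinates_for(label: str) -> tuple[float, float] | None:
    """Look up the coordinates for the element text the model chose to click."""
    return ocr_elements.get(label)

# The model replies with something like {"action": "click", "text": "Search"};
# the framework then resolves that text back to coordinates:
print(click_coordinates_for("Search"))  # -> (0.5, 0.42)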

Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply run:

operate (operate -m gpt-4-with-ocr will also work)

Set-of-Mark Prompting -m gpt-4-with-som

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: here.

For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
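As a rough sketch of how a swapped-in best.pt could be exercised with the Ultralytics API (assuming the weights are a standard YOLOv8 checkpoint and that the ultralytics package is installed; the screenshot path is illustrative):

# Sketch: run button detection with a YOLOv8 checkpoint (assumes `pip install ultralytics`).
from ultralytics import YOLO

model = YOLO("model/weights/best.pt")      # weights path mentioned above
results = model("screenshot.png")          # illustrative screenshot path

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates of a detected button
    print(f"button candidate at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), conf={float(box.conf):.2f}")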

Start operate with the SoM model

operate -m gpt-4-with-som

Contributions are welcome!

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

Compatibility

  • This project is compatible with macOS, Windows, and Linux (with an X server installed).

OpenAI Rate Limiting Note

The gpt-4-vision-preview model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here

self-operating-computer's People

Contributors

azorianmatt, centopw, daisuke134, eltociear, frityet, gtlyashparmar, haseeb-heaven, horw, joshbickett, jsparson1, justindhillon, klxu03, leekonyu, linusaltacc, michaelhhogue, mshumer, ronnachum11, roywei, shubhexists, trohit20, ubaiidullaah, yash-1511, younesbram


self-operating-computer's Issues

Proposal: Transitioning from Chrome-Exclusive to Universal Browser Compatibility

Problem

Currently, the application defaults to Google Chrome, limiting accessibility and user experience for individuals using alternative browsers. This monolithic approach excludes a significant user base and hinders the platform's adaptability to diverse browser environments.

Proposal

This issue advocates for a transition from Chrome-centric development to a more inclusive approach that supports a broader range of web browsers. The goal is to enhance accessibility, improve user experience, and adhere to web standards that promote compatibility across different platforms.

Proposed Changes

When testing, I realized that on macOS you can open your default browser by just typing into the search bar:

browser

So instead of Google Chrome, you can search for "browser" and press Enter; this opens the default browser without requiring the user to use Google Chrome. Since most browsers have the search bar in the same location, you can still use the default settings for it.
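A minimal pyautogui sketch of this idea (assuming macOS with Spotlight bound to Cmd+Space; the timings and search term are illustrative, not the project's actual code):

# Sketch: open the default browser via macOS Spotlight instead of hard-coding Chrome.
import time
import pyautogui

pyautogui.hotkey("command", "space")       # open Spotlight
time.sleep(0.5)
pyautogui.write("browser", interval=0.05)  # search for the default browser
time.sleep(0.5)
pyautogui.press("enter")                   # launch whatever the user's default browser is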

idea

I played with GPT-4V on other projects and it definitely has a hard time figuring out coordinates. I used another model trained on image identification to find the coordinates of the box drawn around the detected object, and then passed them to GPT-4 to perform an action. For your use case, I just tested this model: https://huggingface.co/foduucom/web-form-ui-field-detection. Far from perfect, but maybe an idea to build on. If your auto-computer can detect and get the proper coordinates of the input fields in an image, it could help, or at least add a level of redundancy to improve accuracy in clicking and inputting things in the right places.

Add downsampled images as context instead of screenshot.png?

Would it be advantageous to keep a collage of downsampled previous images, maybe at 160 px x 90 px, stacked in a line left to right, one after another, and constantly pass this image as additional context for each action, i.e. "here is a timeline of previous states the model has traversed"?
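A minimal Pillow sketch of the proposed collage (the file names are illustrative assumptions; the 160x90 thumbnail size comes from the suggestion above):

# Sketch: stack downsampled previous screenshots left-to-right into one context image.
# Assumes Pillow is installed and the listed screenshot files exist.
from PIL import Image

previous_screenshots = ["step1.png", "step2.png", "step3.png"]  # illustrative paths
thumb_w, thumb_h = 160, 90

collage = Image.new("RGB", (thumb_w * len(previous_screenshots), thumb_h))
for i, path in enumerate(previous_screenshots):
    thumb = Image.open(path).resize((thumb_w, thumb_h))
    collage.paste(thumb, (i * thumb_w, 0))

collage.save("timeline.png")  # passed as extra context alongside the current screenshot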

FYI: I would be happy to draft and code this feature out!

Integrate Set-of-Mark Visual Prompting for GPT-4V

I noticed that you currently seem to apply a grid to the images to assist the vision model:

And mention this in the README:

Current Challenges
Note: GPT-4V's error rate in estimating XY mouse-click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.

I was wondering, have you looked at using Set-of-Mark Visual Prompting for GPT-4V or similar techniques?

See Also

A bit of a link dump from one of my references:

  • https://github.com/microsoft/SoM
    • Set-of-Mark Prompting for LMMs

    • Set-of-Mark Visual Prompting for GPT-4V

    • We present Set-of-Mark (SoM) prompting, simply overlaying a number of spatial and speakable marks on the images, to unleash the visual grounding abilities in the strongest LMM -- GPT-4V. Let's use visual prompting for vision!

    • https://arxiv.org/abs/2310.11441
      • Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

      • We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: this https URL.

    • https://github.com/facebookresearch/segment-anything
      • Segment Anything

      • The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

      • The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.

    • https://github.com/UX-Decoder/Semantic-SAM
      • Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"

      • In this work, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. We have trained on the whole SA-1B dataset and our model can reproduce SAM and beyond it.

      • Segment everything for one image. We output controllable granularity masks from semantic, instance to part level when using different granularity prompts.

    • https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
      • SEEM: Segment Everything Everywhere All at Once

      • [NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"

      • We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. It can also work with any combination of prompts or generalize to custom prompts!

    • https://github.com/IDEA-Research/GroundingDINO
      • Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

    • https://github.com/IDEA-Research/OpenSeeD
      • [ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection"

    • https://github.com/IDEA-Research/MaskDINO
      • [CVPR 2023] Official implementation of the paper "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation"

    • https://github.com/facebookresearch/VLPart
      • [ICCV2023] VLPart: Going Denser with Open-Vocabulary Part Segmentation

      • Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, object parts. In this work, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.

  • https://github.com/ddupont808/GPT-4V-Act
    • GPT-4V-Act: Chromium Copilot

    • AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI

    • GPT-4V-Act serves as an eloquent multimodal AI assistant that harmoniously combines GPT-4V(ision) with a web browser. It's designed to mirror the input and output of a human operator: primarily screen feedback and low-level mouse/keyboard interaction. The objective is to foster a smooth transition between human-computer operations, facilitating the creation of tools that considerably boost the accessibility of any user interface (UI), aid workflow automation, and enable automated UI testing.

    • GPT-4V-Act leverages both GPT-4V(ision) and Set-of-Mark Prompting, together with a tailored auto-labeler. This auto-labeler assigns a unique numerical ID to each interactable UI element.

      By incorporating a task and a screenshot as input, GPT-4V-Act can deduce the subsequent action required to accomplish a task. For mouse/keyboard output, it can refer to the numerical labels for exact pixel coordinates.

  • https://github.com/Jiayi-Pan/GPT-V-on-Web
    • GPT-4 Vision x Vimium = Autonomous Web Agent

    • This project leverages GPT-4V to create an autonomous / interactive web agent. The action space is discretized by Vimium.

Generate multiple responses in polling and select the most popular choice? Particularly for -accurate grid overwrite

I was wondering if there was a reason we only picked the top response, or the 0th one. Instead, what if we asked the model to generate 9 responses, and then use the one that popped up the most frequently as the answer?

There's a possibility this wouldn't work for general actions, but I think it would work particularly well for my -accurate grid overwrite, where, when the model tries to click on something, I simply ask it which grid cell it would like to click in. With 9 responses of a number between 0-15, or 0-3, I can just use whichever number was most popular.
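A minimal sketch of the majority-vote step, assuming several responses have already been sampled (the answers below are illustrative):

# Sketch: sample several responses and keep the most frequent grid-cell answer.
from collections import Counter

# Illustrative: nine sampled answers, each a grid-cell number between 0 and 15.
sampled_answers = [5, 5, 6, 5, 9, 5, 6, 5, 5]

most_common_cell, votes = Counter(sampled_answers).most_common(1)[0]
print(f"clicking grid cell {most_common_cell} ({votes}/{len(sampled_answers)} votes)")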

operate issue

"Traceback (most recent call last):
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\Scripts\operate.exe_main
.py", line 4, in
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\operate\main.py", line 30, in
client = OpenAI()
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\openai_client.py", line 93, in init
raise OpenAIError(
openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable" while entering 'operate' I'm getting this error, anyone help me to solve this

Seeking cooperation and solution sharing

I have been researching computer automation for years.

Topics that you might be interested in that I have dug into:

I have dedicated repositories that you may be interested into:

Other similar projects that I am monitoring:

Apologies for my unorganized code structure. I am trying to improve the development experience with AI-generated documentation and usage demonstrations plus a client-side LLM and semantic search, which may solve this long-standing task across all my previous repositories.

Documentation: Readme

Hi, I guess --voice is non-functional.
Ref: #81

Maybe someone could make a commit to mention something like "(in progress)" next to --voice.
Or maybe remove --voice for now?

Clean up the github repo

There are some ways you can make this github repo look nicer. Press the gear icon in the about section:

Screenshot from 2023-11-29 11-18-07

Right now it looks like this:

Screenshot from 2023-11-29 11-17-56

Let's add a description, and remove the checkmarks because we don't have packages or deployments yet. We can always change this. We can also add tags.

Screenshot from 2023-11-29 11-18-39

Before:

Screenshot from 2023-11-29 11-21-04

After:

Screenshot from 2023-11-29 11-20-49

I made a project exactly like this 6 months ago. Here are some things I learned.

Hi, just dropping a friendly pointer to my project at Tophness/ChatGPT-PC-Controller.
I noticed you're using pyautogui.
I experimented with this for a long time, but it isn't compatible with the latest Windows UI frameworks like UIA and WPF, only the very old Win32 API that's deprecated in Windows 11.

I notice you're just using pyautogui.click() and pyautogui.write() instead of directly finding/reading/editing/triggering windows control elements anyway, but it's much, much more powerful if you do.
GPT-Vision wasn't even available at the time I made it. It just directly knew what to do blindly.

Directly using control elements means it can run in the background without needing to take over the user's mouse and keyboard, or even hide the app it's controlling entirely.
It could just browse the web, crunch some numbers on some data it found and send off emails about it in parallel while you're editing a video, and you wouldn't even notice the difference.
It's also instant (no need to wait for delays between clicks and presses), and there's no room for error.
Even an unexpected window popping up is no issue, since it doesn't need the window to be active to control it, and it can activate it automatically and wait for it to be active if need be.
ChatGPT can also write out whole scripts that do the job from a single response.
For this and many other reasons (like reading pixel rgb values), I recommend using AutoIt.

I started off making an interpreter for Autoit that took in more natural language and wrote the code itself, but it seems ChatGPT is well versed enough in Autoit that you can just directly hook the dll calls using an AST.
I did this before OpenAI's Function Calling API existed, so it would only be that much more powerful now.

Feel free to copy anything I did, and hit me up if you'd like to merge these 2 in some way.
I would say you might as well leave a separate pyautogui mode that works as-is while we merge things like vision over to autoit mode.

API issue

Screenshot (187)
If any new OpenAI API user is getting issues, don't worry; there is some issue on OpenAI's side.

Warning: this project appears to have blatantly ripped off the work of researchers over a year on a new multi-modal model, Atlas-1, and is attempting to scam open source devs into doing that work

Re-opening. @michaelhhogue, are you an official contributor to this project? Could you comment on where the name Agent-1 came from? This appears to have blatantly ripped off the work of researchers who worked hard for over a year.

Any open source contributors should consider that this firm raised millions and is scamming open source devs into stealing work for them. They never responded to our claims they stole this work, even down to the name agent-1, which is incredibly shameful if true and our attorneys would love to hear from an official contributor.

We will be publishing our solution open source as well, and here is Atlas-1, which we've been training for over a year and published last month: https://youtu.be/IQuBA7MvUas

What is Agent-1 and where did the name come from? It appears these guys blatantly ripped off our work and are now scamming open source devs into copying it for them.

Request timed out.

Error parsing JSON: Request timed out.
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot

Enhancing Safety: User Confirmation for Command Execution

Feature Request:

  1. Enhanced Security Confirmation
    • Issue: Add security on user prompts for executing potentially dangerous commands.
    • Description: Implement a robust security feature that ensures user confirmation before executing any potentially harmful actions.
    • Plan:
      • Create a user-friendly dialog box that prompts for confirmation.
      • Allow users to configure settings to override this security feature if needed.
      • Implement this security layer in the codebase to prevent unintended actions (a minimal sketch follows this list).
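A minimal sketch of such a confirmation gate (the action format and the set of "dangerous" action types are illustrative assumptions, not the project's design):

# Sketch: ask the user before executing potentially dangerous actions.
DANGEROUS_ACTIONS = {"delete_file", "run_shell_command"}  # illustrative list

def confirm_and_execute(action: dict, execute) -> None:
    """Require explicit confirmation for risky actions; run everything else directly."""
    if action["type"] in DANGEROUS_ACTIONS:
        answer = input(f"About to perform '{action['type']}' with {action.get('args')}. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            print("Skipped.")
            return
    execute(action)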

Windows 11 installation issue: Poetry could not find a pyproject.toml file in *path*\self-operating-computer or its parents

Trying to install SOC on Windows 11.

I get to the part where it says:

cat requirements.txt | xargs poetry add

This of course won't work on Windows so I used "@(cat requirements.txt) | %{&poetry add $_}" in Powershell, but I get the error mentioned in the title:

"Poetry could not find a pyproject.toml file in..."

Also tried poetry add dependency-name, copy-pasting the dependency name from requirements.txt, but I get the same error.

Ideas, anyone? Now that I think about it, I might be running too new a version of Poetry, but I'm not sure.

Repeat open browser

I installed SOC on my computer, a MacBook Pro with an M1 Pro ARM chip.

Today was my first try using SOC, but it repeatedly opens my browser and doesn't go any further.

I don't know why.

Here is the log; maybe it can help you find what the problem is.

[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
open brave, go to google drive, write a poem about spring.
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
^CTraceback (most recent call last):
File "/Users/chenyibin/self-operating-computer/venv/bin/operate", line 33, in
sys.exit(load_entry_point('self-operating-computer==1.0.0', 'console_scripts', 'operate')())
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 612, in main_entry
main(args.model)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 188, in main
response = get_next_action(model, messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 276, in get_next_action
content = get_next_action_from_openai(messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 340, in get_next_action_from_openai
response = client.chat.completions.create(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_utils/_utils.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 594, in create
return self._post(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 1055, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 834, in request
return self._request(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 858, in _request
response = self._client.send(request, auth=self.custom_auth, stream=stream)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 901, in send
response = self._send_handling_auth(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 929, in _send_handling_auth
response = self._send_handling_redirects(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 966, in _send_handling_redirects
response = self._send_single_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 1002, in _send_single_request
response = transport.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 228, in handle_request
resp = self._pool.handle_request(req)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 268, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 251, in handle_request
response = connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http_proxy.py", line 344, in handle_request
return self._connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 133, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 111, in handle_request
) = self._receive_response_headers(**kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 176, in _receive_response_headers
event = self._receive_event(timeout=timeout)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 212, in _receive_event
data = self._network_stream.read(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 126, in read
return self._sock.recv(max_bytes)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1259, in recv
return self.read(buflen)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1132, in read
return self._sslobj.read(len)
KeyboardInterrupt

Lastly, thanks again for your great work.

Feature: Ollama Support

Hello and thanks for this beautiful repo.

Would you consider adding open source model support, especially with Ollama?

Best,
Orkut

Object detection

Maybe a YOLO object-detection model trained on basic UI elements to get coordinates? Or something like SAM?

I mean, as soon as there is a small model, GPT-4 can check the dataset and add more correct examples to the training set.

Add support for LiteLLM to support API base change.

Refactor the code and use LiteLLM to support changing the API base.

As of now, GPT-4 Vision is expensive to use and GPT-4 usage is high.
Let consumers use their own API base by making the following change:

Example:

import openai

# Set your custom API base URL
openai.api_base = "http://myproxy-gpt.com/chatcompletions"

This change will attract more users and encourage more testing.
People will start testing it more often, which will ultimately contribute to improving this product.
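For reference, a sketch of the same idea with the current (v1.x) OpenAI Python SDK, which takes the base URL at client construction; the proxy URL is illustrative, and LiteLLM exposes a similar api_base option:

# Sketch: point the OpenAI client at a custom API base (openai>=1.0 style).
from openai import OpenAI

client = OpenAI(
    base_url="http://myproxy-gpt.com/v1",  # illustrative proxy URL
    api_key="sk-...",                       # or read from OPENAI_API_KEY
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": "Describe the current screenshot."}],
)
print(response.choices[0].message.content)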

Add a grid of coordinates

Since it tends to misclick a lot, you could either train a model to do image segmentation or, with clever prompt engineering, add a barebones grid and ask it to solve the puzzle "in which coordinate cell can the search button be found?" That should make it more robust, right?
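A minimal Pillow sketch of overlaying such a grid on a screenshot (the grid size and file names are illustrative assumptions):

# Sketch: draw a labelled coordinate grid over a screenshot before sending it to the model.
# Assumes Pillow is installed and screenshot.png exists.
from PIL import Image, ImageDraw

img = Image.open("screenshot.png")
draw = ImageDraw.Draw(img)
cols, rows = 4, 4  # 16 cells, numbered 0-15

cell_w, cell_h = img.width // cols, img.height // rows
for r in range(rows):
    for c in range(cols):
        x, y = c * cell_w, r * cell_h
        draw.rectangle([x, y, x + cell_w, y + cell_h], outline="red", width=2)
        draw.text((x + 5, y + 5), str(r * cols + c), fill="red")

img.save("screenshot_with_grid.png")  # then ask: "in which cell is the search button?"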

Click always off

Can a coordinate grid be added to every screenshot so the clicks are not so off?

wrong coordinate

I asked it to play Spotify and it guessed the play button at x 78%, y 46%, which is wrong.

screenshot_with_grid

Maybe for a more detailed guess we can have more gridlines?
Something like this, maybe:

CleanShot 2023-11-28 at 14 58 46

Does it support multiple monitor setups?

I have three 32" 4K monitors for my Mac Studio and keep getting this error for any command. I'm curious which monitor it selects for the screenshot. I can hear the audible screenshot noise, and then the error appears.

Error parsing JSON: 'ascii' codec can't encode character '\u201c' in position 7: ordinal not in range(128)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot


When Linux?

Hi,

I'm Mr. Cryptic, a very friendly guy.

When Linux? I wouldn't touch a Mac with a 5-feet stick.

Feat: Navigating to Search Bar using `Cmd`/`Ctrl` + `L`

As mentioned in the Readme, Cmd + L would probably be a better way to navigate to the search bar.

I also faced the issue of navigating to the search bar correctly, as different browsers have different locations for their search bar (maybe).
#39 (comment)

If everyone approves, I will go ahead and implement this.
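A minimal pyautogui sketch of the idea (the platform check and the typed URL are illustrative assumptions; assumes pyautogui is installed):

# Sketch: jump to the browser's address/search bar with Cmd+L (macOS) or Ctrl+L (elsewhere).
import platform
import pyautogui

modifier = "command" if platform.system() == "Darwin" else "ctrl"
pyautogui.hotkey(modifier, "l")  # focus the address bar in most browsers
pyautogui.write("github.com/OthersideAI/self-operating-computer", interval=0.05)
pyautogui.press("enter")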

Scrolling up and down not added

I just noticed that the model doesn't have access to scrolling up and down. Is this difficult to implement generally (asking mostly for Linux, but of course interested in Mac, and Windows)?

If so, I may try adding a web mode and leveraging Selenium to scroll.
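For what it's worth, pyautogui (which the framework already uses for clicks and typing, per the discussion above) also exposes a scroll call; a minimal sketch, with the scroll amounts as illustrative values:

# Sketch: scroll actions with pyautogui (positive = up, negative = down).
import pyautogui

pyautogui.scroll(500)   # scroll up
pyautogui.scroll(-500)  # scroll down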

testing web pages

Will it be possible to go to a specific web page, log in, perform test actions, and write the results to another file?

This appears to be very similar to our Atlas-1 model, but with hard coded clicks. Is that correct?

Hey guys we've been training a very similar multi-modal model called Atlas-1, however we don't need to hard-code click positions like it appears here, because we trained our model to find UI-elements directly and solve the hardest problems in automations. With the name Agent-1, it almost seems copied, but I hope that's not correct.

We introduce the idea of "Semantic Targets" which understand the underlying intent of a target, and so it's robust to even future design changes.

You can see in our tutorial published last month, we can also search google and much more because Atlas-1 doesn't need to hard code click positions https://youtu.be/IQuBA7MvUas?si=lSaFpH0WMIKRtYrU

Error parsing JSON: [Errno 2] No such file or directory: 'screencapture'

search for latest news on AI and give top two news
Error parsing JSON: [Errno 2] No such file or directory: 'screencapture'
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot

getting this error on Linux (arch)

Not working on `Ubuntu 20.04`

I tried running this on Ubuntu 20.04. It is getting the commands right in the terminal, but it is not actually navigating to the desired location. For example, I asked it to open Brave and navigate to this repository. It gets the steps right, but it could not open Brave.

Possible Issue

  • I think the method to search for apps is different in macOS (I haven't used a Mac, so I am not aware, though) and on my system.
    Hence, it might not be able to go to the apps menu to actually search for any software. Maybe?

Improve Error Handling for Robustness

This issue aims to enhance the code's reliability by improving error handling in critical functions, such as get_next_action_from_openai and summarize. Currently, the error handling within these functions is basic and might not cover all potential errors that could occur during execution.

OpenAI API error: The model `gpt-4-vision-preview` does not exist or you do not have access to it.

Complete log:

[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
Open the Google Photos inside the Google Chrome
Error parsing JSON: Error code: 404 - {'error': {'message': 'The model `gpt-4-vision-preview` does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
(venv) syzygy@Syzygys-MacBook-Pro self-operating-computer %

I have had GPT-4 since it launched, and I can also use my keys with other tools like sgpt, but it appears that for some reason my account is lacking this model. Any ideas? I have tried the suggested URL, but there isn't much there.

Error parsing JSON: X get_image failed: error 8 (73, 0, 967)

[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
google the word HI
Error parsing JSON: X get_image failed: error 8 (73, 0, 967)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot

what could be the problem?

Error during installation.

Collecting pyobjc-core==10.0 (from -r requirements.txt (line 32))
Using cached pyobjc-core-10.0.tar.gz (921 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
running egg_info
error: PyObjC requires macOS to build
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Add speech to narrate actions.

The feature request is this.

  • Utilize speech synthesis to narrate actions before execution.
    • This will enhance user experience by providing audio cues.
    • Make the bot more interactive and user-friendly.
    • Improve accessibility for users with visual impairments.

Bug in Grid `.png`


Occasionally I notice the grid .png has a bug where it either didn't render the full image and some of it is gray, or the dimensions are wrong.

Proposal for Codebase Refactoring to Enhance Readability and Maintainability

I've been reviewing the project's codebase and noticed that all the logic and functions are currently contained within a single file. This structure, while functional, can make the code challenging to read and maintain. To improve the readability and maintainability of the code, I propose restructuring it by:

  1. Separating functions into different files based on their functionality.
  2. Creating directories to logically categorize these files.
    This approach will not only make the code easier to navigate but also simplify future development efforts by providing a clearer modular structure.

I am eager to contribute to this enhancement and have already started working on a preliminary refactoring plan. My goal is to collaborate with the community to develop a structure that best suits our project's needs. I look forward to hearing your thoughts and suggestions on this proposal.
