Code Monkey home page Code Monkey logo

os-copilot's Introduction

OS-Copilot

How it works:

Custom GPTs

ChatGPT Plus users can now create custom GPTs which support "Actions" (https://platform.openai.com/docs/actions) These are incredibly powerful as they allow a GPT to interact with the world. This project connects a custom GPT to a computer by giving it a set of actions it can choose from. Currently we have mouse clicking, keyboard typing, performing hotkeys (multiple keys at the same time), and an action that simply 'looks' at the screen.

You can use the OS-Copilot GPT on a smartphone or via the website, even using the same computer that you wish to control (although it becomes slightly harder to use the chat interface when it is using another application).

Seeing the screen

All major operating systems have accessibility APIs, these help people with vision impairments to read and use a computer. A challenging part of this project was designing a way to convert the screen reader data into HTML which the GPT can read in text format. Not all apps support screen readers, but other researchers are improving vision-based AI that could replace the need for screen reader support. There are both benefits and drawbacks to using this non-vision-based method, but in future both methods may be combined to support each other dynamically.

Project Future:

Crafting a computer agent robust enough to be genuinely useful is a challenging task, but there are smaller milestones along the way. One useful new feature will be the ability to convert an action sequence into a hotkey - the AI agent will write code that operates on the UI, and this can be triggered by a hotkey or UI condition. This method would avoid the latency of asking an AI, since the AI would only be needed to write the code sequence (and step in if the code didn't work), similar methods are being employed in robotics where the language-model writes code that controls the robot in a precise way.

Fine tuning a smaller model to run locally for this task would be useful to avoid the need to send any data to a 3rd party (currently OpenAI), and will allow for exciting improvements like continuous improvement and learning to use new apps.

Porting to platforms other than OSX is also a long term goal, the system already designed in a modular way - the OSX screen reading and control in a seperate module. However adding other OSs will require a lot of effort and programming expertise.

FAQ

  • Why don't you let it run code via an action? This would make it much more powerful (e.g. like AutoGPT)
    • Seeing the mouse move and words being typed on the screen has a lot of advantages over a command line script being run - it is understandable, you can debug it easier, and you can stop it as soon as you see something going wrong (simply moving the mouse overrides the action)
  • Where is the source code?
    • Source code is not very useful without documentation on how to use it. That takes time to write but I would like to release this if there is enough interest, so let me know if you think you have a use for it!
  • Is this possible to use without ChatGPT plus?
    • In theory, yes. The previous version of this was simpler and capable of using any language model, and using a specialised model would be interesting to investigate! But currently, GPT-4 is the best model for the job since it is able to perform multiple actions in a single chat-reply. However it would not be that difficult to hook this up to another model.

os-copilot's People

Contributors

supersational avatar

Stargazers

 avatar

Watchers

 avatar

os-copilot's Issues

Source Code ๐Ÿ‘€

Sounds very interesting! Would love to check this out. Maybe even help with documentation.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.