infra's Introduction

E2B Infra

Open Source Infrastructure

Powering Cloud Runtime for AI Agents

Visit the e2b-dev/e2b repo for more information about how to start using E2B right now.

What is E2B Infra?

E2B is a cloud runtime for AI agents. Our main repository, e2b-dev/e2b, provides SDKs and a CLI to customize and manage environments and run your AI agents in the cloud.

This repository contains the infrastructure that powers the E2B platform.

Project Structure

In this monorepo, there are several components written in Go and a Terraform configuration for the deployment.

The main components are:

  1. API server
  2. Daemon running inside instances (sandboxes)
  3. Nomad driver for managing instances (sandboxes)
  4. Nomad driver for building environments (templates)

The following diagram shows the architecture of the whole project:

[Figure: E2B infrastructure diagram]

Deployment

The infrastructure is deployed using Terraform and can currently be deployed only on GCP.

Setting the infrastructure up can be a little rough right now, but we plan to improve it in the future.

infra's Issues

Update kernel to 6.1

We want to update the default version that is used for newly built templates to this release after checking if everything is compatible.

Allow other network protocols for communication with services inside sandbox

Users can use only the HTTP protocol when communicating with services they started inside the sandbox. This for example prevents users from communicating with a service they started inside their sandbox using WebSocket.

As far as I know, the biggest blocker in our infra at the moment is the setup of our network proxies. We could explore moving our network proxies to Layer 4 (TCP/UDP), which could enable every protocol out of the box.

This would not only allow other protocols such as WebSocket, but also, for example, running GUI applications and streaming graphics over the VNC protocol. That is important for AI agents that need to control apps, and also for communicating what's happening to users.

SDK requests from browser to envd blocked via CORS

Envd seems to be blocking requests from the browser because of CORS. The first OPTIONS request is successful, but the next request (start process) fails on CORS with no response headers.

We are setting CORS in the envd server, but it looks like there still might be a problem.
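
One thing worth checking is whether the CORS headers are attached to every response or only to the preflight. The following is a minimal sketch, assuming a plain net/http handler chain; the real envd server setup is not shown in this issue, and `corsHeaders` and `withCORS` are hypothetical names with a permissive policy chosen only for illustration.

```go
package main

import "net/http"

// corsHeaders returns the headers to attach to every response.
// Hypothetical helper: it echoes the request origin, or "*" when none is sent.
func corsHeaders(origin string) map[string]string {
	if origin == "" {
		origin = "*"
	}
	return map[string]string{
		"Access-Control-Allow-Origin":  origin,
		"Access-Control-Allow-Methods": "GET, POST, OPTIONS",
		"Access-Control-Allow-Headers": "Content-Type, Authorization",
		"Vary":                         "Origin",
	}
}

// withCORS sets the CORS headers before the wrapped handler runs, so they
// are present on the actual request and on error responses, not only on
// the OPTIONS preflight.
func withCORS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for k, v := range corsHeaders(r.Header.Get("Origin")) {
			w.Header().Set(k, v)
		}
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent) // the preflight ends here
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

Placing a wrapper like this outermost in the middleware chain means a handler that fails early still returns the headers the browser needs to read the response.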

Create public status page

Our users currently don't have a way to monitor health and uptime of our services.

We should create a https://status.e2b.dev page with monitoring of our services, uptime, and outage announcements.

Missing entry in `/etc/hosts`

When running any sudo command, there's this line in the output:

user@fc1574dff44b:~$ sudo ls
> sudo: unable to resolve host fc1574dff44b: Name or service not known
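
sudo prints this warning when the machine's own hostname does not resolve. A common remedy is to map the hostname to a loopback address in `/etc/hosts`. The sketch below assumes the provisioning step can rewrite that file; `ensureHostsEntry` is a hypothetical helper, and the choice of `127.0.0.1` is an assumption.

```go
package main

import "strings"

// ensureHostsEntry returns hosts-file contents that map hostname to
// 127.0.0.1, appending a line only if no existing entry already names it.
func ensureHostsEntry(contents, hostname string) string {
	for _, line := range strings.Split(contents, "\n") {
		// Drop trailing comments, then split into address and names.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		for _, name := range fields[1:] {
			if name == hostname {
				return contents // already resolvable, nothing to do
			}
		}
	}
	if contents != "" && !strings.HasSuffix(contents, "\n") {
		contents += "\n"
	}
	return contents + "127.0.0.1\t" + hostname + "\n"
}
```

The function is idempotent, so it can run on every sandbox start without growing the file.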

Scrape and send proxies' and API logs to Grafana

Right now the logs from proxies and API are only accessible in Nomad (and that is for a limited time).
We can monitor the errors that users get in SDK (WS 502, API errors) better if we can query and browse the logs properly.

This should be achievable by scraping the proxy logs from the otel collector and/or scraping the API logs from Nomad (or changing the log handler in the API to an OTel-compatible one).

Missing process kill information

When a process in the sandbox is killed, we propagate the -1 exit code, but we should also add a message so it is clear that the process was killed and why.
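
In Go, `ExitCode()` on a process state returns -1 when the process was terminated by a signal, which would match the -1 seen here. The following sketch assumes the process is started through os/exec and shows one way to attach a message to that code. `exitMessage` and `runAndDescribe` are hypothetical names, and the sketch only recovers the signal, not the higher-level reason for the kill.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// exitMessage turns the error returned by cmd.Run or cmd.Wait into an
// exit code plus a human-readable explanation.
func exitMessage(err error) (int, string) {
	if err == nil {
		return 0, "process exited normally"
	}
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		// The process never ran (for example, binary not found).
		return -1, "process failed to run: " + err.Error()
	}
	// Sys() holds a syscall.WaitStatus, which tells us whether a
	// signal ended the process and which one.
	if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		sig := ws.Signal()
		return -1, fmt.Sprintf("process was killed by signal %d (%v)", int(sig), sig)
	}
	code := exitErr.ExitCode()
	return code, fmt.Sprintf("process exited with code %d", code)
}

// runAndDescribe runs a command to completion and describes how it ended.
func runAndDescribe(name string, args ...string) (int, string) {
	return exitMessage(exec.Command(name, args...).Run())
}
```

A caller that kills the process itself (timeout, sandbox shutdown, OOM handling) could append its own reason to the message, since the signal alone does not say who sent it.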

Provisioning time and permissions

Provisioning takes a long time because we are setting the permissions for the user so our users don't have to use sudo.

We want to make this fast and not use any "magic" that hides what exactly is happening; it should behave more like Docker.

Spawning sandboxes gets stuck on loading snapshot

Sometimes when spawning a sandbox the process gets stuck on loading the snapshot and times out.

One of the possible causes of this is the network namespace handling in Go: it is possible that the goroutine namespace could be somehow switched because of the way the threads and namespaces are handled.

This bug would cause the request for sandbox to return 500 "Cannot create a environment instance right now" error.
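
Network namespaces belong to OS threads, while the Go scheduler is free to move goroutines between threads. If that is the cause here, one common mitigation is to pin the goroutine that enters a namespace to its thread. The sketch below is an assumption about how this could be structured, not the driver's actual code; `onLockedThread` is a hypothetical helper and the namespace switch itself is left to the callback.

```go
package main

import "runtime"

// onLockedThread runs fn on a fresh goroutine that is pinned to a single
// OS thread for its whole lifetime. fn is where the namespace switch and
// the namespace-dependent work would happen.
func onLockedThread(fn func() error) error {
	done := make(chan error, 1)
	go func() {
		runtime.LockOSThread()
		// Deliberately no UnlockOSThread: when a goroutine exits while
		// still locked, the runtime terminates the thread instead of
		// reusing it, so a thread left in the wrong namespace cannot be
		// handed to another goroutine.
		done <- fn()
	}()
	return <-done
}
```

The cost is one thread per call, which is usually acceptable for a setup step that runs once per sandbox.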

Add health check alert for Nomad and Consul

Right now we health check instances, but because our data plane includes Nomad (placing jobs) and Consul (KV store), we need to ensure these are healthy too.

The restarts of our client also appear to be caused by an unhealthy Nomad.

Proxies template change breaks connections

When we change the template in the client or session proxy, the proxy should just be reloaded without breaking connections.

Right now it looks like Nomad restarts the container and breaks all existing WS connections to sandboxes.

Clock drift on startup

There's a clock drift for the first few hundred milliseconds.

We could probably make an explicit call for a sync when creating the sandbox from env-instance-driver.

Missing Access-Origin-Allow-Host header

Check if headers aren't removed by nginx proxies used for sandbox networking.

I was getting the header inside the sandbox, but it wasn't present when requesting from outside.

Go context used for tracing is not closed properly

Somewhere in our API code, the context that is used when starting an instance/sandbox is sometimes not closed properly, so it continues even though the request itself has ended.

This results in nonsensical reported start times for the instance.

When using some Debian images envd is not correctly started

With some Debian versions used as the base for a template, envd is not started correctly. As a result, the SDK keeps trying to reconnect until the timeout and then exits.

This problem also affects start cmd.

This is an example of a Dockerfile that has this problem:

# This is the default devcontainer base, which is useful for having some common utils
FROM mcr.microsoft.com/devcontainers/typescript-node:0-18

# Install normal utils
RUN apt-get -y update; apt-get -y install curl git jq

Support for multiple clients

Currently, the E2B infra can run only on a single client. We're moving to multi-client support in order to be able to scale the number of running sandboxes.

Add pool of preconfigured networks for sandboxes

When creating a lot of sandboxes at the same time, the biggest bottleneck right now is the network setup, which takes more and more time as the number of networks being created grows. When creating 20+ sandboxes, the additional delay starts to exceed 1 second.

The solution here is to create a pool of pre-created networks that we can then immediately use while refilling the pool.
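
The pool idea can be sketched with a buffered channel that a background goroutine keeps full. This is only an illustration under stated assumptions: `Network`, `Pool`, `NewPool`, and the `create` callback are hypothetical names standing in for the real network setup, and the 100 ms backoff is arbitrary.

```go
package main

import "time"

// Network is a hypothetical handle for a pre-configured sandbox network.
type Network struct{ ID int }

// Pool hands out pre-created networks and refills itself in the background.
type Pool struct{ ready chan Network }

// NewPool starts a goroutine that buffers up to size ready networks.
// create stands in for the slow setup step; closing stop ends the refill loop.
func NewPool(size int, create func() (Network, error), stop <-chan struct{}) *Pool {
	p := &Pool{ready: make(chan Network, size)}
	go func() {
		for {
			n, err := create()
			if err != nil {
				// Back off briefly instead of spinning on repeated failures.
				select {
				case <-time.After(100 * time.Millisecond):
					continue
				case <-stop:
					return
				}
			}
			select {
			case p.ready <- n: // blocks while the buffer is full
			case <-stop:
				return
			}
		}
	}()
	return p
}

// Get returns a ready network, waiting only when the pool is empty.
func (p *Pool) Get() Network { return <-p.ready }
```

With this shape, a burst of sandbox requests drains the buffer at channel speed and only the requests beyond the pool size pay the setup cost, so the pool size would be chosen from the expected burst size.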

Too big docker context can probably cause OOM

Preparing sandbox template building (38490 files in Docker build context). 
Found ./e2b.Dockerfile that will be used to build the sandbox template.
Error: Build API request failed: Service Unavailable
    at zX (/usr/local/npm/lib/node_modules/@e2b/cli/dist/index.js:151:912)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async e.<anonymous> (/usr/local/npm/lib/node_modules/@e2b/cli/dist/index.js:123:1425)

Fixing the order of middlewares + some other limit setup could maybe fix this.

Logs for some sandboxes are missing

Logs for some sandboxes are missing both in Grafana and in our internal Loki when testing this on staging.
The problem should be either in our logs-collector (Vector) setup or in envd's log pushing, which involves MMDS data (we can test this inside sandboxes).

Update Firecracker to 1.7

Firecracker 1.7 was released and we want to update the default version that is used for newly built templates to this release after checking if everything is compatible.

Return better error when the template name is already taken

Currently we return the following error:

Error: Server error: Internal Server Error, Error when inserting alias: failed to reserve env alias 'fuse': models: constraint failed: pq: duplicate key value violates unique constraint "env_aliases_pkey"
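
The `duplicate key value violates unique constraint` text corresponds to PostgreSQL SQLSTATE 23505 (unique_violation), so the insert error could be translated before it reaches the user. The sketch below assumes the driver error exposes its SQLSTATE through a `SQLState() string` method; `translateAliasError`, `ErrAliasTaken`, and `stubError` are hypothetical names, and `stubError` exists only to exercise the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// uniqueViolation is the PostgreSQL SQLSTATE for unique_violation.
const uniqueViolation = "23505"

// sqlStateError is the assumed shape of the driver's error type.
type sqlStateError interface {
	error
	SQLState() string
}

// ErrAliasTaken is the user-facing error for a name that is already in use.
var ErrAliasTaken = errors.New("template name is already taken")

// translateAliasError maps a unique-constraint failure on the alias insert
// to ErrAliasTaken and passes every other error through unchanged.
func translateAliasError(alias string, err error) error {
	var se sqlStateError
	if errors.As(err, &se) && se.SQLState() == uniqueViolation {
		return fmt.Errorf("%w: %q", ErrAliasTaken, alias)
	}
	return err
}

// stubError is a stand-in for a driver error, used only for illustration.
type stubError struct{ code string }

func (e stubError) Error() string    { return "pq: constraint failed" }
func (e stubError) SQLState() string { return e.code }
```

An API handler could then check `errors.Is(err, ErrAliasTaken)` and respond with a conflict status and this message rather than an internal server error.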
