
infra's People

Contributors

cneira, dependabot[bot], georgaberg, jakubno, mlejva, ncode, scar26, strajk, valentatomas


infra's Issues

Go context used for tracing is not closed properly

Somewhere in our API code, the context used when starting an instance/sandbox is sometimes not closed properly, so it keeps running even after the request itself has ended.


This results in nonsensical reported start times for instances.

Update kernel to 6.1

We want to update the default kernel version used for newly built templates to 6.1 after checking that everything is compatible.

Add pool of preconfigured networks for sandboxes

When creating a lot of sandboxes at the same time, the biggest bottleneck right now is the network setup, which takes longer the more networks are being created. When creating 20+ sandboxes, the additional delay starts to exceed 1 second.

The solution here is to create a pool of pre-created networks that we can then immediately use while refilling the pool.

Clock drift on startup

There's a clock drift for the first few hundred milliseconds.

We could probably make an explicit sync call when creating the sandbox from env-instance-driver.

Return better error when the template name is already taken

Currently we return the following error:

Error: Server error: Internal Server Error, Error when inserting alias: failed to reserve env alias 'fuse': models: constraint failed: pq: duplicate key value violates unique constraint "env_aliases_pkey"

Logs for some sandboxes are missing

Logs for some sandboxes are missing both in Grafana and in our internal Loki when testing this on staging.
The problem is probably either in our logs-collector (Vector) setup or in envd's log pushing, which relies on MMDS data (we can test this inside sandboxes).

Allow other network protocols for communication with services inside sandbox

Users can use only the HTTP protocol when communicating with services they started inside the sandbox. This for example prevents users from communicating with a service they started inside their sandbox using WebSocket.

As far as I know, the biggest blocker in our infra at the moment is the setup of our network proxies. We could explore moving our network proxies to Layer 4 (TCP/UDP), which could enable every protocol out of the box.

This would allow not only other protocols like WebSocket but also, for example, running GUI applications and streaming graphics with the VNC protocol. That is important for AI agents that need to control apps and for communicating what's happening to users.

Scrape and send proxies' and API logs to Grafana

Right now the logs from proxies and API are only accessible in Nomad (and only for a limited time).
We can better monitor the errors that users get in the SDK (WS 502, API errors) if we can query and browse the logs properly.

This should be achievable by scraping the proxy logs from the otel collector and/or scraping the API logs from Nomad (or changing the log handler in the API to an OTel-compatible one).

Missing Access-Origin-Allow-Host header

Check if headers aren't removed by nginx proxies used for sandbox networking.

I was getting the header inside the sandbox, but it wasn't present when requesting from outside.

Add health check alert for Nomad and Consul

Right now we health check instances, but because our data plane currently includes Nomad (placing jobs) and Consul (KV store), we need to ensure these are healthy too.

The restarts of our client also appear to be caused by an unhealthy Nomad.

Spawning sandboxes gets stuck on loading snapshot

Sometimes when spawning a sandbox the process gets stuck on loading the snapshot and times out.

One of the possible causes of this is the network namespace handling in Go: it is possible that the goroutine's namespace is somehow switched because of the way threads and namespaces are handled.

This bug would cause the request for a sandbox to return a 500 "Cannot create a environment instance right now" error.

Too big docker context can probably cause OOM

Preparing sandbox template building (38490 files in Docker build context). 
Found ./e2b.Dockerfile that will be used to build the sandbox template.
Error: Build API request failed: Service Unavailable
    at zX (/usr/local/npm/lib/node_modules/@e2b/cli/dist/index.js:151:912)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async e.<anonymous> (/usr/local/npm/lib/node_modules/@e2b/cli/dist/index.js:123:1425)

Fixing the order of middlewares plus some other limit setup could maybe fix this.
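As an illustration of the limit idea, the standard library's `http.MaxBytesReader` can cap the request body before any later middleware or handler reads it. This is a sketch with plain `net/http`; our API may use a different router, and `limitBody`, `upload`, and `statusFor` are hypothetical names. The important part is that the limiting middleware wraps everything that touches the body.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// limitBody caps the request body size. It must wrap every middleware and
// handler that reads the body, otherwise the cap is applied too late.
func limitBody(maxBytes int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
		next.ServeHTTP(w, r)
	})
}

// upload is a hypothetical handler that consumes the uploaded build context.
func upload(w http.ResponseWriter, r *http.Request) {
	_, err := io.Copy(io.Discard, r.Body)
	var tooLarge *http.MaxBytesError
	switch {
	case errors.As(err, &tooLarge):
		http.Error(w, "build context too large", http.StatusRequestEntityTooLarge)
	case err != nil:
		http.Error(w, "could not read body", http.StatusBadRequest)
	default:
		w.WriteHeader(http.StatusNoContent)
	}
}

// statusFor sends an n-byte body through the limited handler and returns
// the resulting HTTP status code.
func statusFor(n int, limit int64) int {
	body := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/build", body)
	rec := httptest.NewRecorder()
	limitBody(limit, http.HandlerFunc(upload)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(10, 1024), statusFor(4096, 1024)) // 204 413
}
```

With a cap in place, an oversized context is rejected with a 413 instead of being read until memory runs out.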

When using some Debian images envd is not correctly started

With some Debian versions used as the base for a template, envd is not started correctly, so the SDK keeps trying to reconnect until the timeout and then exits.

This problem also affects the start cmd.

This is an example of a Dockerfile that has this problem:

# This is the default devcontainer base, which is useful for having some common utils
FROM mcr.microsoft.com/devcontainers/typescript-node:0-18

# Install normal utils
RUN apt-get -y update; apt-get -y install curl git jq

Update Firecracker to 1.7

Firecracker 1.7 was released and we want to update the default version that is used for newly built templates to this release after checking if everything is compatible.

Proxies template change breaks connections

When we change the template in the client or session proxy, the proxy should just be reloaded without breaking connections.

Right now it looks like Nomad restarts the container and breaks all existing WS connections to sandboxes.

Missing process kill information

When a process in a sandbox is killed, we propagate the -1 exit code, but we should also add a message so it is clear that the process was killed and why.

Missing entry in `/etc/hosts`

When running any sudo command, there's this line in the output:

user@fc1574dff44b:~$ sudo ls
> sudo: unable to resolve host fc1574dff44b: Name or service not known
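The warning goes away once the hostname resolves locally, typically through a line in `/etc/hosts` that maps it to a loopback address. A sketch of an idempotent helper that produces the updated file content; `ensureHostEntry` is a hypothetical name, and where this would run (template build or sandbox start) is an open question.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureHostEntry returns hosts-file content that maps hostname to 127.0.0.1.
// It appends a line only when no existing entry already lists the hostname.
func ensureHostEntry(content, hostname string) string {
	for _, line := range strings.Split(content, "\n") {
		// Ignore trailing comments, then compare whole fields.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		for _, name := range fields[1:] { // fields[0] is the address
			if name == hostname {
				return content
			}
		}
	}
	if content != "" && !strings.HasSuffix(content, "\n") {
		content += "\n"
	}
	return content + "127.0.0.1 " + hostname + "\n"
}

func main() {
	fmt.Print(ensureHostEntry("127.0.0.1 localhost\n", "fc1574dff44b"))
}
```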

Provisioning time and permissions

Provisioning takes a long time because we are setting the permissions for the user so our users don't have to use sudo.

We want to make this fast and not use any "magic" that hides what exactly is happening. It should behave more like Docker.

Create public status page

Our users currently don't have a way to monitor health and uptime of our services.

We should create a https://status.e2b.dev page with monitoring of our services, uptime, and outage announcements.

Support for multiple clients

Currently, the E2B infra can run only on a single client. We're moving to multi-client support to be able to scale the number of running sandboxes.
