
flame's People

Contributors

alizain, chrismccord, dannielb, dlederle, grzuy, jeregrine, joeljuca, josevalim, kevinschweikert, mcrumm, pgeraghty, qizot, seanmor5, wojtekmach

flame's Issues

Allow spec'ing backends from pool config or call site

Use case:
I have very asymmetric workloads, and I want to save costs by right-sizing the FLAME instances running the workloads.

As far as I understand, I can currently only specify resources in the backend config, like:

  config :flame, FLAME.FlyBackend,
    token: System.fetch_env!("FLY_API_TOKEN"),
    cpus: 8,
    memory_mb: 8 * 1024

I'd like to be able to do it at the pool level, like this:

      {FLAME.Pool,
       name: MyApp.FFMpegRunner,
       cpus: 8,
       memory_mb: 8 * 1024
      }

or maybe like this?

      {FLAME.Pool,
       name: MyApp.FFMpegRunner,
       backend_opts: [
           cpus: 2,
           memory_mb: 8 * 1024
         ]
      }

Or, alternatively, add some option to call, cast, and place_child.
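
For illustration, a hypothetical call-site override (backend_opts is not part of the current FLAME API, and transcode/1 and input are placeholders):

FLAME.call(MyApp.FFMpegRunner, fn -> transcode(input) end,
  # hypothetical: per-call resource sizing passed through to the backend
  backend_opts: [cpus: 8, memory_mb: 8 * 1024]
)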

Example code for Elixir/fly.io/ffmpeg

I was at ElixirConf EU and was very impressed by FLAME. It would simplify our infrastructure significantly.

Our team is considering moving our PDF conversion over to Elixir/fly.io/FLAME, and getting hold of the example code that calls out to ffmpeg would go a long way :)

FLAME.Pool crashes when two calls are performed in parallel and no existing runners are active

When two FLAME.call invocations are made in parallel, the FLAME.Pool crashes with the following error:

2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] ** (stop) exited in: FLAME.Pool.call(PugNPlayPlatform.FFMpegRunner, #Function<1.122306120/0 in MyApp.some_function/1>, [])
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] ** (EXIT) an exception was raised:
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] ** (KeyError) key :count not found in: nil
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] If you are using the dot syntax, such as map.field, make sure the left-hand side of the dot is a map
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (flame 0.1.6) lib/flame/pool.ex:420: FLAME.Pool.checkout_runner/4
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (flame 0.1.6) lib/flame/pool.ex:358: FLAME.Pool.handle_call/3
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (stdlib 5.0.2) gen_server.erl:1113: :gen_server.try_handle_call/4
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (stdlib 5.0.2) gen_server.erl:1142: :gen_server.handle_msg/6
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (flame 0.1.6) lib/flame/pool.ex:238: FLAME.Pool.exit!/3
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (elixir 1.15.4) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
2023-12-14T04:52:59.627 app[3d8d9d44f5e238] cdg [info] (elixir 1.15.4) lib/task/supervised.ex:36: Task.Supervised.reply/4

The error is in this line:

runner_count == 0 || (min_runner.count == state.max_concurrency && runner_count < state.max) ->

It happens because runner_count is 1 (one pending runner) but min_runner is nil, since map_size(state.runners) is 0.
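
A nil-safe version of the clause would at least avoid the crash (a sketch only, not the shipped fix; the right behavior is probably to queue the second caller behind the pending runner rather than boot another):

# sketch: also match when no runner has finished booting yet
min_runner == nil || runner_count == 0 ||
    (min_runner.count == state.max_concurrency && runner_count < state.max) ->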

This can be reproduced with the following code (when a FLAME.Pool is configured with name MyApp.FlamePoolName and min: 0):

Task.async(fn -> 
  FLAME.call(MyApp.FlamePoolName, fn ->
    IO.puts("1")
  end)
end)

Task.async(fn -> 
  FLAME.call(MyApp.FlamePoolName, fn ->
    IO.puts("2")
  end)
end)

Hot start a new node when a percentage of max_concurrency is reached

Currently FLAME only starts a new node when the maximum concurrency is reached, which can lead to cold starts during heavy load.

Add a new parameter such as :spinup_at_percentage that accepts a value like 0.8, so that a new node starts at 80% capacity to pre-empt load (see the sketch below).

These nodes could have an aggressive idle timer so they stop if that capacity isn't used.

Eventually the pool could perhaps monitor the rate of incoming work, so that new nodes are spun up before it can't cope.
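
For illustration, a hypothetical pool spec with the proposed option (:spinup_at_percentage does not exist in FLAME today):

{FLAME.Pool,
 name: MyApp.FFMpegRunner,
 min: 0,
 max: 10,
 max_concurrency: 5,
 # hypothetical: boot the next runner once existing ones hit 80% of max_concurrency
 spinup_at_percentage: 0.8,
 idle_shutdown_after: 30_000}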

Flame name conflict

This project looks really cool, but it's a bit unfortunate that the name conflicts with another really big and very active open source project with a logo that looks identical: https://github.com/flame-engine/flame

Are there any options for clarifying the name, like "Phoenix Flame" or something?

FlyBackend: support mounting volumes

I'm using FLAME in my project to run an ML workload on Fly. To avoid downloading the 4 GB model data every time a FLAME runner starts, a Fly volume needs to be attached to the instance.

There is currently PR #22 by @benbot; however, it seems to have been abandoned for 3 months.

Taking a brief look, the tricky parts seem to be that:

  • Fly volumes don't auto-scale when creating a new instance,
  • and the API requires the caller to explicitly specify the volume id instead of just the name.

This might become trickier when multiple nodes in the cluster try to start runners at startup, causing a potential race condition.

For now, I have forked the FlyBackend and hardcoded the volume info to make things work in my project, but a proper built-in solution would be very nice.
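
For reference, a rough sketch of the kind of request the forked backend makes against the Fly Machines API (mounts is part of the public Machines API; the volume id, app, fly_token, runner_name, and image are placeholders):

# placeholders: app, fly_token, runner_name, image
body = %{
  name: runner_name,
  config: %{
    image: image,
    # attach a pre-created volume; volumes must exist before machine creation
    mounts: [%{volume: "vol_1234567890", path: "/models"}]
  }
}

Req.post!("https://api.machines.dev/v1/apps/#{app}/machines",
  json: body,
  auth: {:bearer, fly_token}
)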

Fly FLAME failing with timeout

Hi,

I attempted a simple FLAME example in an existing app, but I'm running into an issue where the FLAME instance is shut down almost immediately and the caller sees a timeout.

Versions:

elixir=1.16.2
erlang-otp=25.3.2.10
{:flame, "~> 0.1.12"}

I tried to follow the docs and set up:

# application.ex
        {FLAME.Pool,
         name: MyApp.FFMpegRunner,
         min: 0,
         max: 2,
         max_concurrency: 1,
         idle_shutdown_after: 30_000,
         timeout: :timer.minutes(30)},
# runtime.exs
  pool_size =
    if FLAME.Parent.get() do
      1
    else
      String.to_integer(System.get_env("POOL_SIZE") || "10")
    end

  config :loccal, MyApp.Repo,
    url: database_url,
    pool_size: pool_size,
    socket_options: maybe_ipv6

  config :flame, :backend, FLAME.FlyBackend

  config :flame, FLAME.FlyBackend,
    token: System.fetch_env!("FLAME_FLY_API_TOKEN"),
    cpus: 2,
    memory_mb: 8 * 1024,
    boot_timeout: 120_000

Then run a simple example in a remote Fly shell:

iex(myapp@908055ec2195e8)1> FLAME.call(MyApp.FFMpegRunner, fn -> {:ok, "Foo"} end)
** (exit) exited in: FLAME.Pool.call(MyApp.FFMpegRunner, #Function<43.3316493/0 in :erl_eval.expr/6>, [])
    ** (EXIT) time out
    (flame 0.1.12) lib/flame/pool.ex:268: FLAME.Pool.exit!/3
    iex:1: (file)
iex(myapp@908055ec2195e8)1>

I tried setting the timeouts really high because of what I see in the logs below:
it appears the FLAME instance is spawned and attempts to run, but is then shut down.

One suspicious thing is that the endpoint URL logged by the 9185997ade1258 instance is https://example.com, while in a shell on my app the endpoint host is the real domain:

iex(myapp@908055ec2195e8)3> Application.get_env(:loccal, LoccalWeb.Endpoint)[:url]
[host: "myapp.com", port: 443, scheme: "https"]

(I disabled selectively starting applications in the FLAME instance, to rule out having made a mistake there.)
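
For completeness, here is the kind of runtime.exs guard I would expect to need so the endpoint does not serve traffic on FLAME children (a sketch only; PHX_SERVER is the standard Phoenix release convention, and this is untested in my setup):

# don't run the HTTP server on FLAME children
config :loccal, LoccalWeb.Endpoint,
  server: !FLAME.Parent.get() && System.get_env("PHX_SERVER") == "true"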

Any idea of what I could be missing here?

Logs:

# Logging into a remote shell here, and executing the simple FLAME call
2024-04-23T15:00:20Z app[908055ec2195e8] cdg [info]2024/04/23 15:00:20 New SSH Session - XXXXXX
2024-04-23T15:00:37Z app[908055ec2195e8] cdg [info] WARN Reaped child process with pid: 560 and signal: SIGUSR1, core dumped? false
2024-04-23T15:02:22Z runner[9185997ade1258] cdg [info]Pulling container image registry.fly.io/my-app-8571:deployment-01HW5Q7P881CZB6TQ747W188PK
2024-04-23T15:02:23Z runner[9185997ade1258] cdg [info]Successfully prepared image registry.fly.io/my-app-8571:deployment-01HW5Q7P881CZB6TQ747W188PK (1.058413437s)
2024-04-23T15:02:26Z runner[9185997ade1258] cdg [info]Configuring firecracker
2024-04-23T15:02:26Z app[9185997ade1258] cdg [info][    0.230064] PCI: Fatal: No config space access function found
2024-04-23T15:02:27Z app[9185997ade1258] cdg [info] INFO Starting init (commit: 65db7f7)...
2024-04-23T15:02:27Z app[9185997ade1258] cdg [info] INFO Preparing to run: `/app/bin/server` as nobody
2024-04-23T15:02:27Z app[9185997ade1258] cdg [info] INFO [fly api proxy] listening at /.fly/api
2024-04-23T15:02:27Z app[9185997ade1258] cdg [info]2024/04/23 15:02:27 listening on [fdaa:0:31ed:a7b:ae02:ad7e:20b7:2]:22 (DNS: [fdaa::3]:53)
2024-04-23T15:02:27Z runner[9185997ade1258] cdg [info]Machine created and started in 5.019s
2024-04-23T15:02:33Z app[9185997ade1258] cdg [info]15:02:33.201 [info] Running MyAppWeb.Endpoint with cowboy 2.12.0 at :::4000 (http)
2024-04-23T15:02:33Z app[9185997ade1258] cdg [info]15:02:33.204 [info] Access MyAppWeb.Endpoint at https://example.com
2024-04-23T15:02:33Z app[9185997ade1258] cdg [info] WARN Reaped child process with pid: 382 and signal: SIGUSR1, core dumped? false
2024-04-23T15:02:38Z app[9185997ade1258] cdg [info]15:02:38.178 [info] Tzdata has updated the release from 2021e to 2024a
2024-04-23T15:02:53Z app[9185997ade1258] cdg [info] WARN Reaped child process with pid: 384 and signal: SIGUSR1, core dumped? false
2024-04-23T15:02:53Z app[9185997ade1258] cdg [info] WARN Reaped child process with pid: 385 and signal: SIGUSR1, core dumped? false
2024-04-23T15:03:40Z app[9185997ade1258] cdg [info] INFO Main child exited normally with code: 0
2024-04-23T15:03:40Z app[9185997ade1258] cdg [info] INFO Starting clean up.
2024-04-23T15:03:40Z app[9185997ade1258] cdg [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-04-23T15:03:40Z app[9185997ade1258] cdg [info][   73.750104] reboot: Restarting system
2024-04-23T15:03:40Z runner[9185997ade1258] cdg [info]machine restart policy set to 'no', not restarting
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]15:04:21.071 [error] failed to connect to fly machine within 120000ms
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]15:04:21.072 [error] GenServer #PID<0.6714.0> terminating
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]** (stop) time out
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (flame 0.1.12) lib/flame/fly_backend.ex:249: FLAME.FlyBackend.remote_boot/1
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (flame 0.1.12) lib/flame/runner.ex:270: anonymous fn/4 in FLAME.Runner.handle_call/3
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (stdlib 4.3.1.3) gen_server.erl:1149: :gen_server.try_handle_call/4
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (stdlib 4.3.1.3) gen_server.erl:1178: :gen_server.handle_msg/6
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (stdlib 4.3.1.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]Last message (from #PID<0.6713.0>): {:remote_boot, nil}
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]15:04:21.073 [error] Task #PID<0.6713.0> started from MyApp.FFMpegRunner terminating
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]** (stop) exited in: GenServer.call(#PID<0.6714.0>, {:remote_boot, nil}, :infinity)
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    ** (EXIT) time out
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (elixir 1.16.2) lib/gen_server.ex:1114: GenServer.call/3
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (flame 0.1.12) lib/flame/pool.ex:565: FLAME.Pool.start_child_runner/2
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (elixir 1.16.2) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    (elixir 1.16.2) lib/task/supervised.ex:36: Task.Supervised.reply/4
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]Function: #Function<1.15317371/0 in FLAME.Pool.async_boot_runner/1>
2024-04-23T15:04:21Z app[908055ec2195e8] cdg [info]    Args: []
2024-04-23T15:04:33Z app[908055ec2195e8] cdg [info]15:04:33.720 [info] CONNECTED TO Phoenix.LiveView.Socket in 43µs
2024-04-23T15:04:33Z app[908055ec2195e8] cdg [info]  Transport: :websocket

can't test async: (DBConnection.OwnershipError) cannot find ownership process

Testing FLAME calls that involve the repo gets hung up on the ownership process.

edit: simplified example

defmodule MyAppTest do
  use ExUnit.Case, async: true

  test "flame accessing the db" do
    FLAME.call(MyPool, fn -> Repo.all(MySchema) end)
  end
end

raises

** (EXIT from #PID<0.544.0>) an exception was raised:
         ** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.592.0>.

             (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:910: Ecto.Adapters.SQL.raise_sql_call_error/1
             (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
             (ecto 3.9.6) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
             (ecto 3.9.6) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
             (flame 0.1.7) lib/flame/runner.ex:408: anonymous fn/3 in FLAME.Runner.remote_call/4

This does not happen when async: false.
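
One workaround sketch, assuming the test pool runs on FLAME.LocalBackend (the function then executes in a process on the same node, so it can be allowed onto the test's sandbox connection):

test "flame accessing the db" do
  parent = self()

  FLAME.call(MyPool, fn ->
    # explicitly allow the runner process to use the test owner's connection
    Ecto.Adapters.SQL.Sandbox.allow(Repo, parent, self())
    Repo.all(MySchema)
  end)
end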

Global Singleton Pool

I'm using FLAME in my project to run an ML workload on Fly.

Because initializing a model with Bumblebee can take minutes, I set min to 1, hoping there is always one runner node live.

However, I have 2 clustered web-serving nodes, and during startup both of them tried to create a new Fly machine.

Could we add a :global option to the Pool config to make the Pool global within the cluster? Like:

{FLAME.Pool,
 name: Thumbs.FFMpegRunner,
 min: 1,
 max: 10,
 max_concurrency: 5,
 idle_shutdown_after: 30_000,
 global: true},

Use :peer node for local backend

Some issues to solve: ensuring shared code paths allow executing anonymous functions, and handling the net kernel not being started in distributed mode in dev/test.
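
A minimal sketch of the idea (OTP 25+). Using connection: :standard_io sidesteps the net kernel problem, and copying the code path is what lets functions from the parent's compiled modules run on the peer (MyApp.Worker is a placeholder):

# boot a peer node without requiring distribution on the parent
{:ok, peer, _node} =
  :peer.start_link(%{name: :peer.random_name(), connection: :standard_io})

# share the parent's code paths so parent modules are loadable on the peer
:ok = :peer.call(peer, :code, :add_paths, [:code.get_path()])

# run parent code on the peer
:peer.call(peer, :erlang, :apply, [&MyApp.Worker.run/0, []])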

fly machines don't always have env vars set in time

In my legacy Fly deployment (generated pre-1.6.3), my FLAME-initiated application occasionally crashes on startup when executing FLAME.call.

In my runtime.exs I have e.g. System.fetch_env!("BOX_ENTERPRISE_ID"), and I'll see this in the Fly logs when the FLAME machine boots:

** (System.EnvError) could not fetch environment variable "BOX_ENTERPRISE_ID" because it is not set

This seems to only happen for environment variables I've set in fly.toml. It does not seem to happen for secrets I've set with fly secrets set.

I think the env vars are set late (after runtime.exs looks for them), because if I SSH into the running FLAME machine after the call has crashed, I see the expected env vars.
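
As a workaround sketch for runtime.exs, assuming the variable shows up shortly after boot, one could poll briefly instead of failing immediately:

# retry for up to ~5 seconds before giving up
box_enterprise_id =
  Enum.find_value(1..50, fn _attempt ->
    System.get_env("BOX_ENTERPRISE_ID") || (Process.sleep(100) && nil)
  end) || raise "BOX_ENTERPRISE_ID still unset after 5 seconds"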

Preventing Restarts

Hello! Thanks for all you guys do in the Phoenix community!

Issue

It seems as though a deployment (on Fly, at least) sends a SIGTERM to the FLAME node, which causes the machine to spin down along with the other non-FLAME nodes. Imagine starting a heavy ffmpeg process, only to get spun down 10 seconds later without a chance to finish.

In my testing, the FLAME node does not appear to respect shutdown_timeout when the platform issues a shutdown.

To test this, I ran the following code:

defmodule FFMPEG do
  def start_test_stream do
    # FLAME.cast takes the pool name as its first argument
    FLAME.cast(FFMpegRunner, fn ->
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)
  end
end

iex> FFMPEG.start_test_stream()

This spins up a test stream and sends it to the given URL. This was a quick and dirty way to get ffmpeg into a long-running process (1 hour) so I could have time to trigger and observe a deploy.

In this test I also set the following options:

      {
        FLAME.Pool,
        name: FFMpegRunner,
        min: 0,
        max: 5,
        max_concurrency: 1,
        single_use: false,
        timeout: :timer.hours(1),
        boot_timeout: :timer.minutes(1),
        shutdown_timeout: :timer.hours(1),
        idle_shutdown_after: 10_000
      }

Potential Solutions

  • For the fly_backend specifically, I wonder if we could adjust the POST to the Machines API to spin up a node as part of another app? That way, when application_1 gets a new deployment, the FLAME node inside application_2 is not affected. (I haven't tested this yet; no idea if it's even plausible.)
  • Could we somehow discourage the fly deployment system from restarting our FLAME nodes by setting some sort of metadata or flag?
  • Am I using FLAME wrong?

I'm open to any ideas.
