Comments (13)

chrismccord commented on May 23, 2024

Version 0.1.12 supports passing an explicit link: false option to FLAME.cast|call|place_child, so you don't need to unlink or trap exits yourself. Thanks!
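
A minimal sketch of what that could look like, assuming the option is passed as a trailing keyword list. The StreamJobs module, its function names, and the url argument are hypothetical; MyRunner is the pool name used elsewhere in this thread. Running it requires the flame dependency and a configured pool.

```elixir
defmodule StreamJobs do
  # Hypothetical helper module, not part of FLAME.
  @ffmpeg_args ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv)

  # Fire-and-forget work that should outlive the caller.
  # Assumes link: false is accepted as the last argument (0.1.12 or later).
  def cast_stream(url) do
    FLAME.cast(MyRunner, fn -> System.cmd("ffmpeg", @ffmpeg_args ++ [url]) end, link: false)
  end

  # Same idea for a placed child: no Process.unlink/1 needed afterwards.
  def place_stream(url) do
    FLAME.place_child(
      MyRunner,
      {Task, fn -> System.cmd("ffmpeg", @ffmpeg_args ++ [url]) end},
      link: false
    )
  end
end
```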

from flame.

chrismccord commented on May 23, 2024

@cadebward I believe I found a bug where we don't honor the shutdown timeout in all shutdown cases. Can you try main and report back? I'll be able to take a closer look in a bit. Thanks!

chrismccord commented on May 23, 2024

Check the Termination docs:

flame/lib/flame.ex

Lines 154 to 161 in 3a03d21

## Termination

FLAME runs a termination process to allow remotely spawned functions time to
complete before the node is terminated. This process is started automatically
with the library. The shutdown timeout by default is 30s, but can be configured
in your application configuration, such as `config/runtime.exs`:

    config :flame, :terminator, shutdown_timeout: :timer.seconds(10)

You can configure the terminator's shutdown timeout. Since we only allow a 1:1 named pool, we should eventually be able to pass the pool's shutdown_timeout to the child, but for now it needs to be configured globally:

config :flame, :terminator, shutdown_timeout: :timer.hours(1)

Note that you can also pass it to the pool as an env: argument and then reference that env var in your runtime.exs to get per-pool termination configuration:

config :flame, :terminator, shutdown_timeout: String.to_integer(System.get_env("FLAME_SHUTDOWN_TIMEOUT") || "10000")
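
As a rough sketch, the same env-var lookup can be pulled into a small helper so the default lives in one place. The ShutdownTimeout module name is hypothetical; only System.get_env/2 and String.to_integer/1 from the standard library are used.

```elixir
defmodule ShutdownTimeout do
  # Hypothetical helper: reads the timeout (in milliseconds) from the
  # environment, falling back to 10 seconds when the variable is unset.
  @default_ms "10000"

  def from_env(var \\ "FLAME_SHUTDOWN_TIMEOUT") do
    var
    |> System.get_env(@default_ms)
    |> String.to_integer()
  end
end
```

The config line would then read `shutdown_timeout: ShutdownTimeout.from_env()`, provided the module is available when runtime.exs is evaluated; otherwise the inline expression above does the same job.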

Thanks!

cadebward commented on May 23, 2024

Thanks for the response! I did have that globally configured. It still seems as though the deployment immediately shuts down the FLAME node. I'll do some more testing to be sure.

chrismccord commented on May 23, 2024

Let me know! The started flame machines are completely independent of the app machines (fly infra wise), so a deployment won't tear them down on the fly side. So the only possibility is that the parent node going down triggers the children to terminate per the regular monitoring flow, and either the configured termination time is ignored or a bug in the terminator causes an early exit.

cadebward commented on May 23, 2024

Hmm... something else is going on. After testing this out through a few deploys, it seems as though the FLAME node does not get swept up in a deployment. A new machine comes up, old machines spin down. The FLAME node is not included in these logs. In this example, my FLAME node is 17816409ad1408.

Updating existing machines in 'app-name' with bluegreen strategy

Cleanup Previous Deployment
  No hanging machines from a failed previous deployment

Creating green machines
  Created machine 3d8d74e3a74138 [app]

Waiting for all green machines to start
  Machine 3d8d74e3a74138 [app] - started

Waiting for all green machines to be healthy
  Machine 3d8d74e3a74138 [app] - 1/1 passing

Marking green machines as ready
  Machine 3d8d74e3a74138 [app] now ready

Waiting before cordoning all blue machines
  Machine 17816472fdd268 [app] cordoned

Waiting before stopping all blue machines

Stopping all blue machines

Waiting for all blue machines to stop
  Machine 17816472fdd268 [app] - stopped

Destroying all blue machines
  Machine 17816472fdd268 [app] destroyed

Deployment Complete

That all seems as expected. Then out of nowhere these logs show up indicating that the FLAME node has been sent a SIGTERM and is shutting down.

2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Main child exited normally with code: 0
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Starting clean up.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]2024/03/08 23:38:45 listening on [fdaa:0:44f2:a7b:241:18ab:d029:2]:22 (DNS: [fdaa::3]:53)
[flv @ 0x56420e644f00] Failed to update header with correct duration.12.2kbits/s speed=   1x    e=00:04:59.32 bitrate= 212.7kbits/s speed=   1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct filesize.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]frame= 7880 fps= 25 q=8.6 Lsize=    8161kB time=00:05:15.16 bitrate= 212.1kbits/s speed=   1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]video:8038kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.534419%
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]Exiting normally, received signal 15.
2024-03-08T23:38:46Z app[17816409ad1408] dfw [info][  320.770011] reboot: Restarting system
2024-03-08T23:38:46Z runner[17816409ad1408] dfw [info]machine restart policy set to 'no', not restarting
[ no more logs for 17816409ad1408 after this point]

I am not sure who is sending that signal.

cadebward commented on May 23, 2024

Does the flame pool need to be included in the supervision tree of flame nodes?

chrismccord commented on May 23, 2024

Sorry, I got sidetracked assuming it was a config issue. Currently, processes spawned via cast are not directly supervised, so they are terminated by the VM as soon as the system is shut down, but you can trap exits to avoid being nuked (with caveats):

    FLAME.cast(MyRunner, fn ->
      # Trap exits so this process is not killed as soon as the system shuts down
      Process.flag(:trap_exit, true)
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)
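
To illustrate why trapping exits matters here, the following plain-Elixir sketch (no FLAME involved; the TrapExitDemo module is hypothetical) sends a :shutdown exit signal to a worker and reports whether the worker survived it. A process that traps exits receives the signal as a message instead of being terminated.

```elixir
defmodule TrapExitDemo do
  # Returns true if a worker survives a :shutdown exit signal.
  def survives_shutdown?(trap?) do
    parent = self()

    pid =
      spawn(fn ->
        Process.flag(:trap_exit, trap?)
        send(parent, :ready)

        # Stand-in for long-running work; only :stop ends it.
        receive do
          :stop -> :ok
        end
      end)

    # Wait until the flag is set before sending the signal.
    receive do
      :ready -> :ok
    end

    Process.exit(pid, :shutdown)
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> false
    after
      200 ->
        Process.demonitor(ref, [:flush])
        send(pid, :stop)
        true
    end
  end
end
```

Note that an exit signal with reason :kill cannot be trapped, so this only protects against ordinary shutdown signals.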

The issue here is that we don't control the full lifecycle of the cast; we only monitor it from the terminator. FLAME.place_child, which owns the spawned child, is a better fit. FLAME.place_child will tear down placed processes gracefully, up to the shutdown_timeout, but there's a catch. By default we link to the caller, so you need to explicitly unlink or trap exits for long-term workloads where the remote caller goes away:

  {:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
    System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
  end})
  Process.unlink(pid) # or trap exit on placed child
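
The effect of the link can be seen in isolation with a plain-Elixir sketch (no FLAME involved; the UnlinkDemo module is hypothetical). A short-lived "caller" starts a linked child and then exits abnormally; the child only survives if it was unlinked first.

```elixir
defmodule UnlinkDemo do
  # Returns true if the child is still alive after its caller has exited.
  def child_outlives_caller?(unlink?) do
    observer = self()

    # The "caller": starts a linked child, optionally unlinks, then goes away.
    spawn(fn ->
      child =
        spawn_link(fn ->
          receive do
            :stop -> :ok
          end
        end)

      if unlink?, do: Process.unlink(child)
      send(observer, {:child, child})
      exit(:caller_went_away)
    end)

    child =
      receive do
        {:child, pid} -> pid
      end

    ref = Process.monitor(child)

    receive do
      {:DOWN, ^ref, :process, ^child, _reason} -> false
    after
      200 ->
        Process.demonitor(ref, [:flush])
        send(child, :stop)
        true
    end
  end
end
```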

All that said, we can definitely make this better. Casts should be properly monitored, and shutdown should be blocked up to shutdown_timeout just like for placed children, without you having to worry about it. We could also introduce a :link option on place_child, but I need to think more about it.

chrismccord commented on May 23, 2024

> Does the flame pool need to be included in the supervision tree of flame nodes?

no

chrismccord commented on May 23, 2024

The added nuance here (for myself and others) is that we do a lot to avoid extra copying as part of the messaging/spawning of FLAME work, so easy de facto defaults that take care of this on behalf of the user for cast would require an extra copy. It only matters in the case of long-running work, which is probably best served by placing a worker on another node. Still, I understand just wanting to run a function somewhere without setting up a worker. Having cast trap exits by default and having the terminator force-kill the casts at the end of shutdown_timeout may be the best option.
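
As a rough illustration of that last idea (this is not FLAME's terminator, just a plain-Elixir sketch with a hypothetical DrainSketch module), a process can monitor a set of workers, wait for them up to a deadline, and force-kill whatever is still running:

```elixir
defmodule DrainSketch do
  # Waits up to timeout_ms for all pids to exit.
  # Returns :drained, or {:killed, n} if n stragglers were force-killed.
  def drain(pids, timeout_ms) do
    refs = Map.new(pids, fn pid -> {Process.monitor(pid), pid} end)
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    await(refs, deadline)
  end

  defp await(refs, _deadline) when map_size(refs) == 0, do: :drained

  defp await(refs, deadline) do
    remaining = max(deadline - System.monotonic_time(:millisecond), 0)

    receive do
      {:DOWN, ref, :process, _pid, _reason} when is_map_key(refs, ref) ->
        await(Map.delete(refs, ref), deadline)
    after
      remaining ->
        # Deadline reached: :kill cannot be trapped, so this always wins.
        for {ref, pid} <- refs do
          Process.demonitor(ref, [:flush])
          Process.exit(pid, :kill)
        end

        {:killed, map_size(refs)}
    end
  end
end
```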

cadebward commented on May 23, 2024

I deployed this code:

{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
Process.unlink(pid)

I then started the process and ran fly apps restart app-name. The FLAME node immediately shut down 😭

I then deployed this change instead:

FLAME.place_child(MyRunner, {Task, fn ->
  Process.flag(:trap_exit, true)
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})

The result was the exact same.

> The started flame machines are completely independent of the app machines (fly infra wise) so a deployment won't tear them down on the fly side.

I'm wondering about this. How can we double check that this is the case? I don't know anything about computers, but I have a hunch fly is still forcing this machine offline. A few seconds after I trigger the restart I see the logs:

2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Main child exited normally with code: 0
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Starting clean up.
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-11T17:06:12Z app[784e774a20ed78] dfw [info][   63.496255] reboot: Restarting system
2024-03-11T17:06:12Z runner[784e774a20ed78] dfw [info]machine restart policy set to 'no', not restarting

cadebward commented on May 23, 2024

Correction: I did not notice the subtle differences between your two code snippets. I tried this code:

    FLAME.cast(MyRunner, fn ->
      Process.flag(:trap_exit, true)
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)

And it DOES seem to be working as expected!

UPDATE: I also cancelled the ffmpeg process and the machine spun down as expected.

chrismccord commented on May 23, 2024

Nice. Correction to my previous comment: casts that trap exits won't risk hanging the VM, as it's still the terminator awaiting the calls, casts, and placed children to exit within shutdown_timeout. So provided you have Process.flag(:trap_exit, true) in your cast, the terminator will allow the system shutdown to continue after the shutdown timeout.
