Comments (13)
0.1.12 supports an explicit link: false option to FLAME.cast|call|place_child, so you don't need to unlink/trap yourself. Thanks!
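For reference, a minimal sketch of the new option, reusing the illustrative pool name (MyRunner) and ffmpeg invocation that appear later in this thread:

FLAME.place_child(MyRunner, {Task, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end}, link: false)

With link: false there is no need to call Process.unlink/1 on the placed child, or to trap exits, just to survive the caller going away.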
from flame.
@cadebward I believe I found a bug where we don't honor shutdown timeout in all shutdown cases. Can you try main and report back? I'll be able to take a closer look in a bit. Thanks!
Check the Termination docs (lines 154 to 161 in 3a03d21).
You can configure the terminator's shutdown timeout. For now it needs to be a global config, but since we only allow a 1:1 named pool we should be able to pass the pool's shutdown_timeout to the child. For now you can configure it globally:
config :flame, :terminator, shutdown_timeout: :timer.hours(1)
Note, you can also pass it to the pool as an env: argument, and then in your runtime.exs you could reference the env var to get per-pool termination configuration:
config :flame, :terminator, shutdown_timeout: String.to_integer(System.get_env("FLAME_SHUTDOWN_TIMEOUT") || "10000")
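Putting those two pieces together, a hedged sketch of per-pool configuration (the pool name, env var name, and timeout values here are illustrative, not from the FLAME docs):

# in your supervision tree: pass the env var to the pool
{FLAME.Pool,
 name: MyRunner,
 env: %{"FLAME_SHUTDOWN_TIMEOUT" => "60000"}}

# in runtime.exs: the runner node reads it back
config :flame, :terminator,
  shutdown_timeout: String.to_integer(System.get_env("FLAME_SHUTDOWN_TIMEOUT") || "10000")

Each pool can then set its own FLAME_SHUTDOWN_TIMEOUT value while the terminator config itself stays global.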
Thanks!
Thanks for the response! I did have that globally configured. It still seems as though the deployment immediately shuts down the FLAME node. I'll do some more testing to be sure.
Let me know! The started FLAME machines are completely independent of the app machines (Fly-infra-wise), so a deployment won't tear them down on the Fly side. So the only possibility is that the parent node going down triggers the children to terminate via the regular monitoring flow, but the configured termination time is either ignored or there is a bug in the terminator causing an early exit.
Hmm... something else is going on. After testing this out through a few deploys, it seems as though the FLAME node does not get swept up in a deployment. A new machine comes up, old machines spin down. The FLAME node is not included in these logs. In this example, my FLAME node is 17816409ad1408.
Updating existing machines in 'app-name' with bluegreen strategy
Cleanup Previous Deployment
No hanging machines from a failed previous deployment
Creating green machines
Created machine 3d8d74e3a74138 [app]
Waiting for all green machines to start
Machine 3d8d74e3a74138 [app] - started
Waiting for all green machines to be healthy
Machine 3d8d74e3a74138 [app] - 1/1 passing
Marking green machines as ready
Machine 3d8d74e3a74138 [app] now ready
Waiting before cordoning all blue machines
Machine 17816472fdd268 [app] cordoned
Waiting before stopping all blue machines
Stopping all blue machines
Waiting for all blue machines to stop
Machine 17816472fdd268 [app] - stopped
Destroying all blue machines
Machine 17816472fdd268 [app] destroyed
Deployment Complete
That all seems as expected. Then, out of nowhere, these logs show up indicating that the FLAME node has been sent a SIGTERM and is shutting down.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Main child exited normally with code: 0
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Starting clean up.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]2024/03/08 23:38:45 listening on [fdaa:0:44f2:a7b:241:18ab:d029:2]:22 (DNS: [fdaa::3]:53)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct duration.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct filesize.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]frame= 7880 fps= 25 q=8.6 Lsize= 8161kB time=00:05:15.16 bitrate= 212.1kbits/s speed= 1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]video:8038kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.534419%
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]Exiting normally, received signal 15.
2024-03-08T23:38:46Z app[17816409ad1408] dfw [info][ 320.770011] reboot: Restarting system
2024-03-08T23:38:46Z runner[17816409ad1408] dfw [info]machine restart policy set to 'no', not restarting
[ no more logs for 17816409ad1408 after this point]
I am not sure who is sending that signal.
Does the flame pool need to be included in the supervision tree of flame nodes?
Sorry, I got sidetracked assuming it was a config issue. Currently, processes spawned via cast are not directly supervised, so they are terminated by the VM as soon as the system shuts down, but you can trap exits to avoid being nuked (with caveats):
FLAME.cast(fn ->
  Process.flag(:trap_exit, true)
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end)
The issue here is that we don't control the full lifecycle of the cast; we only monitor it from the terminator. FLAME.place_child, which owns the spawned child, is a better fit. FLAME.place_child will tear down placed processes gracefully up to the shutdown_timeout, but there's a catch: by default we link to the caller, so you need to explicitly unlink or trap exits for long-term workloads where the remote caller goes away:
{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})

Process.unlink(pid) # or trap exits in the placed child
All that said, we can definitely make this better. Casts should be properly monitored, with shutdown blocked up to shutdown_timeout just like placed children, without you having to worry about it. We could also introduce a :link option on place_child, but I need to think more about it.
Does the flame pool need to be included in the supervision tree of flame nodes?
No.
The added nuance here (for myself and others) is that we do a lot to avoid extra copying as part of the messaging/spawning of FLAME work, so easy de facto defaults that take care of this on behalf of the user for cast require an extra copy. And it only matters in the case of long-running work, which is probably best suited by placing a worker on another node. Still, I understand just wanting to run a function somewhere and not setting up a worker. Having cast trap exits by default, with the terminator force-killing the casts at the end of shutdown_timeout, may be the best option.
I deployed this code:
{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
Process.unlink(pid)
I then started the process and ran fly apps restart app-name. The FLAME node immediately shut down 😭
I then deployed this change instead:
FLAME.place_child(MyRunner, {Task, fn ->
  Process.flag(:trap_exit, true)
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
The result was the exact same.
The started flame machines are completely independent of the app machines (fly infra wise) so a deployment won't tear them down on the fly side.
I'm wondering about this. How can we double check that this is the case? I don't know anything about computers, but I have a hunch fly is still forcing this machine offline. A few seconds after I trigger the restart I see the logs:
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Main child exited normally with code: 0
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Starting clean up.
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-11T17:06:12Z app[784e774a20ed78] dfw [info][ 63.496255] reboot: Restarting system
2024-03-11T17:06:12Z runner[784e774a20ed78] dfw [info]machine restart policy set to 'no', not restarting
Correction: I did not notice the subtle differences between your two code snippets. I tried this code:
FLAME.cast(fn ->
  Process.flag(:trap_exit, true)
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end)
And it DOES seem to be working as expected!
UPDATE: I also cancelled the ffmpeg process and the machine spun down as expected.
Nice. Correction to my previous comment: casts with trap_exit won't risk hanging the VM, as it's still the terminator awaiting the calls, casts, and place_child children to exit within shutdown_timeout. So provided you have Process.flag(:trap_exit, true) in your cast, the terminator will allow the system shutdown to continue after the shutdown timeout.