Comments (5)
@th0m I tried to reproduce this with:
- Nomad 1.1.4
- Nomad-driver-containerd: master
However, I am seeing very strange behavior where I completely lose the task after I restart Nomad!
I tried adding some log statements to the RecoverTask method to check whether it is even being called. Nothing was printed, so RecoverTask does not appear to be called at all. This is very strange, since I clearly tested this path in earlier versions (when I restart Nomad, Nomad is able to recover the task and re-attach to the existing container process). Are you sure the container is still running when you restart Nomad?
Can you do one more test for me?
- vagrant destroy (start clean)
- In the Vagrantfile, change the Nomad version to 1.1.4, so the Vagrant VM starts with Nomad 1.1.4.
- vagrant up
- vagrant ssh
- nomad job run example/hello.nomad (Use the docker image shm32/count:1.0 instead. It's a count example, which just prints an increasing count on stdout.)
- systemctl restart nomad
- systemctl status nomad (make sure Nomad is up and running)
After you have done the above steps:
- echo $CONTAINERD_NAMESPACE (This should output nomad. Make sure you are in the nomad namespace, so you can see the images and containers when you run nerdctl commands.)
- nerdctl ps
- nomad status
Does nomad status or nerdctl ps show you the running container? (For me, the container is dead after I restart Nomad.)
root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nomad status
No running jobs
root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nerdctl ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
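The namespace check in the steps above can be wrapped in a small guard, so that an empty nerdctl ps listing is not mistaken for a dead container when the shell is simply in the wrong namespace. This is a minimal sketch: check_namespace is a hypothetical helper name, not part of nerdctl or the driver, and it only inspects the CONTAINERD_NAMESPACE environment variable mentioned in the steps.

```shell
#!/bin/sh
# Hypothetical helper (not part of nerdctl or nomad-driver-containerd):
# confirm that CONTAINERD_NAMESPACE matches the expected namespace.
# The expected namespace defaults to "nomad", as in the steps above.
check_namespace() {
    expected="${1:-nomad}"
    actual="${CONTAINERD_NAMESPACE:-}"
    if [ "$actual" = "$expected" ]; then
        echo "namespace ok: $actual"
    else
        # Report the mismatch on stderr and signal failure to the caller.
        echo "CONTAINERD_NAMESPACE is '$actual', expected '$expected'" >&2
        return 1
    fi
}
```

Typical use would be check_namespace && nerdctl ps, so the listing only runs once the namespace is confirmed.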
from nomad-driver-containerd.
Thanks for looking at this @shishir-a412ed
I had forgotten a step in my reproduction: removing -dev from the Nomad systemd unit file in vagrant/setup.sh.
This change is key because in dev mode Nomad tears down all jobs upon receiving SIGTERM, so the RecoverTask function does not get called when Nomad comes back up (this might be new behavior, not sure).
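For anyone applying the same change by hand instead of using the branch, the edit amounts to dropping the standalone -dev flag from the unit's ExecStart line. The following is a minimal sketch only: strip_dev_flag is a hypothetical helper, the caller supplies the unit file path, and the exact ExecStart line written by vagrant/setup.sh is assumed to have the usual "nomad agent -dev ..." shape.

```shell
#!/bin/sh
# Hypothetical helper: remove the standalone "-dev" flag from the ExecStart
# line of a systemd unit file. A backup copy is left next to it as <file>.bak.
strip_dev_flag() {
    unit_file="$1"
    # Only ExecStart lines are touched. " -dev" is removed when it is followed
    # by a space or the end of the line, so other flags that merely start
    # with "-dev" are left alone.
    sed -E -i.bak '/^ExecStart=/ s/ -dev( |$)/\1/' "$unit_file"
}
```

After editing an installed unit, systemd needs systemctl daemon-reload before systemctl restart nomad picks up the new command line.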
To make things clearer, I pushed a branch with all the required changes and I ran your steps to make sure I was still able to reproduce: th0m@73cd9fc
Here is the output after vagrant destroy, vagrant up and vagrant ssh:
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job run example/hello.nomad
==> 2021-09-02T18:15:34Z: Monitoring evaluation "975d60ce"
2021-09-02T18:15:34Z: Evaluation triggered by job "hello"
==> 2021-09-02T18:15:35Z: Monitoring evaluation "975d60ce"
2021-09-02T18:15:35Z: Evaluation within deployment: "b57e2538"
2021-09-02T18:15:35Z: Allocation "fcb0348a" created: node "2f7a1a27", group "hello-group"
2021-09-02T18:15:35Z: Evaluation status changed: "pending" -> "complete"
==> 2021-09-02T18:15:35Z: Evaluation "975d60ce" finished with status "complete"
==> 2021-09-02T18:15:35Z: Monitoring deployment "b57e2538"
✓ Deployment "b57e2538" successful
2021-09-02T18:15:58Z
ID = b57e2538
Job ID = hello
Job Version = 0
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
hello-group 1 1 1 0 2021-09-02T18:25:56Z
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job status hello
ID = hello
Name = hello
Submit Date = 2021-09-02T18:15:34Z
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
hello-group 0 0 1 0 0 0
Latest Deployment
ID = b57e2538
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
hello-group 1 1 1 0 2021-09-02T18:25:56Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
fcb0348a 2f7a1a27 hello-group 0 run running 35s ago 12s ago
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad alloc logs fcb0348a
Count is: 0
Count is: 1
Count is: 2
Count is: 3
Count is: 4
Count is: 5
Count is: 6
Count is: 7
Count is: 8
Count is: 9
Count is: 10
Count is: 11
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ sudo systemctl restart nomad
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ sudo systemctl status nomad
● nomad.service - nomad client + nomad server + nomad-driver-containerd
Loaded: loaded (/lib/systemd/system/nomad.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2021-09-02 18:16:21 UTC; 8s ago
Docs: https://nomadproject.io
Main PID: 15923 (nomad)
Tasks: 16
CGroup: /system.slice/nomad.service
├─15923 /usr/bin/nomad agent -bind=0.0.0.0 -config=/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd/example/agent.hcl -plugin-dir=/tm
└─15958 /tmp/nomad-driver-containerd/containerd-driver
Sep 02 18:16:27 vagrant nomad[15923]: 2021-09-02T18:16:27.476Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=d141eb1c
Sep 02 18:16:27 vagrant nomad[15923]: Desired Changes for "hello-group": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)"
Sep 02 18:16:27 vagrant nomad[15923]: 2021-09-02T18:16:27.476Z [DEBUG] worker.service_sched: setting eval status: eval_id=d141eb1c-db3a-a9a8-3d78-755097c7
Sep 02 18:16:27 vagrant nomad[15923]: 2021-09-02T18:16:27.480Z [DEBUG] worker: updated evaluation: eval="<Eval "d141eb1c-db3a-a9a8-3d78-755097c79d70" JobI
Sep 02 18:16:27 vagrant nomad[15923]: 2021-09-02T18:16:27.480Z [DEBUG] worker: ack evaluation: eval_id=d141eb1c-db3a-a9a8-3d78-755097c79d70
Sep 02 18:16:28 vagrant nomad[15923]: 2021-09-02T18:16:28.502Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited:
Sep 02 18:16:28 vagrant nomad[15923]: 2021-09-02T18:16:28.502Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=fcb0348
Sep 02 18:16:28 vagrant nomad[15923]: 2021-09-02T18:16:28.502Z [WARN] client.alloc_runner.task_runner.task_hook: logmon shutdown while making request: al
Sep 02 18:16:28 vagrant nomad[15923]: 2021-09-02T18:16:28.502Z [WARN] client.alloc_runner.task_runner.task_hook: logmon shutdown while making request; re
Sep 02 18:16:28 vagrant nomad[15923]: 2021-09-02T18:16:28.502Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=fcb0348a-
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job status hello
ID = hello
Name = hello
Submit Date = 2021-09-02T18:15:34Z
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
hello-group 0 0 1 0 0 0
Latest Deployment
ID = b57e2538
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
hello-group 1 1 1 0 2021-09-02T18:25:56Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
fcb0348a 2f7a1a27 hello-group 0 run running 1m3s ago 40s ago
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad alloc logs -f fcb0348a
Count is: 0
Count is: 1
Count is: 2
Count is: 3
Count is: 4
Count is: 5
Count is: 6
Count is: 7
Count is: 8
Count is: 9
Count is: 10
Count is: 11
Count is: 12
Count is: 13
Count is: 14
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad status
ID Type Priority Status Submit Date
hello service 50 running 2021-09-02T18:15:34Z
root@vagrant:/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd# export CONTAINERD_NAMESPACE=nomad
root@vagrant:/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd# ./nerdctl ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
hello-task-f docker.io/shm32/count:1.0 "/tmp/count.sh" 7 minutes ago Up
from nomad-driver-containerd.
@th0m Aah you are right! (I am able to reproduce the issue now)
This change is key as in dev mode Nomad tears down all the jobs upon receiving SIGTERM so the RecoverTask function does not get called when Nomad comes back up (it might be a new behavior, not sure).
Yeah, seems like a regression. I used to test this code path (Recover task in containerd-driver) in Nomad dev mode in earlier versions, and it used to work nicely. It seems something changed in later versions that broke this! I will raise an issue with hashicorp/nomad.
For this bug, I see this in the logs after systemctl restart nomad (I will keep digging into why it's happening):
Sep 02 18:46:24 vagrant nomad[16214]: 2021-09-02T18:46:24.593Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task error="plugin is shut down"
Sep 02 18:46:24 vagrant nomad[16214]: 2021-09-02T18:46:24.593Z [WARN] client.alloc_runner.task_runner.task_hook: logmon shutdown while making request: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task error="plugin is shut down"
Sep 02 18:46:24 vagrant nomad[16214]: 2021-09-02T18:46:24.593Z [WARN] client.alloc_runner.task_runner.task_hook: logmon shutdown while making request; retrying: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task attempts=1 error="plugin is shut down"
Sep 02 18:46:25 vagrant nomad[16214]: 2021-09-02T18:46:25.605Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task path=/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs/.hello-task.stdout.fifo @module=logmon timestamp=2021-09-02T18:46:25.604Z
Sep 02 18:46:25 vagrant nomad[16214]: 2021-09-02T18:46:25.605Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task path=/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs/.hello-task.stderr.fifo @module=logmon timestamp=2021-09-02T18:46:25.605Z
After I restart Nomad, the stdout/stderr fifos are still there (as expected):
root@vagrant:/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs# ls
hello-task.stderr.0 hello-task.stdout.0
However, either Nomad or containerd-driver has trouble reattaching to them. Need to look into why this is happening!
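One detail worth double-checking here: the logmon lines above give the FIFO paths as .hello-task.stdout.fifo and .hello-task.stderr.fifo, which are dot-prefixed, so a plain ls would not list them, and the two entries it does show look like the rotated log files rather than the pipes. A minimal sketch for listing only named pipes in a directory, hidden ones included, follows. list_fifos is a hypothetical helper name, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
#!/bin/sh
# Hypothetical helper: print every named pipe (FIFO) directly inside the
# given directory. Unlike plain ls, find also reports dot-prefixed names.
list_fifos() {
    # -type p matches named pipes only; -maxdepth 1 avoids descending further.
    find "$1" -maxdepth 1 -type p
}
```

Running list_fifos on the alloc logs directory from the log output, before and after systemctl restart nomad, would show whether the pipes themselves survive the restart.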
from nomad-driver-containerd.
Great! Glad you were able to reproduce it, thank you.
from nomad-driver-containerd.
@th0m Fix should be available in release: https://github.com/Roblox/nomad-driver-containerd/releases/tag/v0.9.2
from nomad-driver-containerd.