
Comments (5)

shishir-a412ed commented on June 8, 2024

@th0m I tried to reproduce this with:

  • Nomad 1.1.4
  • Nomad-driver-containerd: master

However, I am seeing very strange behavior where I completely lose the task after I restart Nomad!
I added some log statements to the RecoverTask method to check whether it is even being called. I didn't see anything printed, so RecoverTask does not appear to be called. This is very strange, since I explicitly tested this path in earlier versions (when I restart Nomad, it should recover the task and re-attach to the existing container process). Are you sure the container is still running when you restart Nomad?

Can you do one more test for me?

  • vagrant destroy (start clean)
  • In the Vagrantfile, change the Nomad version to 1.1.4 so that the Vagrant VM starts with Nomad 1.1.4.
  • vagrant up
  • vagrant ssh
  • nomad job run example/hello.nomad (Use the Docker image shm32/count:1.0 instead. It's a count example that just prints an increasing count to stdout.)
  • systemctl restart nomad
  • systemctl status nomad (Make sure Nomad is up and running)

After you have done the above steps:

  • echo $CONTAINERD_NAMESPACE (This should output nomad. Make sure you are in the nomad namespace, so you can see the images and containers when you run nerdctl commands)
  • nerdctl ps
  • nomad status

Does nomad status or nerdctl ps show you the running container? (For me, the container is dead after I restart Nomad.)

root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nomad status
No running jobs
root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nerdctl ps
CONTAINER ID    IMAGE    COMMAND    CREATED    STATUS    PORTS    NAMES
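
The check above (does nomad status still list the job, and does nerdctl ps still list the container?) can be sketched as a small helper that works on the captured command output. This is only a rough sketch: the function names are made up for illustration, and it assumes the plain-text table layouts shown in this thread, which may differ in other Nomad or nerdctl versions.

```python
import re


def parse_nomad_status(output):
    """Parse the table printed by `nomad status` into a list of dicts.

    Assumes the column order seen in this thread:
    ID, Type, Priority, Status, Submit Date.
    """
    lines = [ln.rstrip() for ln in output.splitlines() if ln.strip()]
    if not lines or lines[0].startswith("No running jobs"):
        return []
    jobs = []
    for row in lines[1:]:  # skip the header row
        parts = row.split(None, 4)
        if len(parts) < 4:
            continue  # not a job row
        jobs.append({"id": parts[0], "type": parts[1],
                     "priority": parts[2], "status": parts[3]})
    return jobs


def parse_nerdctl_ps(output):
    """Parse the table printed by `nerdctl ps` into a list of dicts.

    Columns are split on runs of two or more spaces, so "CONTAINER ID"
    stays one header. Caveat: an empty cell in the middle of a row would
    shift the later cells, so only the leading columns are reliable.
    """
    lines = [ln.rstrip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = re.split(r"\s{2,}", lines[0].strip())
    containers = []
    for row in lines[1:]:
        cells = re.split(r"\s{2,}", row.strip())
        containers.append(dict(zip(header, cells)))
    return containers


def task_survived_restart(nomad_status_output, nerdctl_ps_output):
    """True if a job is still running and a container is still up."""
    job_running = any(job["status"] == "running"
                      for job in parse_nomad_status(nomad_status_output))
    container_up = any(c.get("STATUS", "").startswith("Up")
                       for c in parse_nerdctl_ps(nerdctl_ps_output))
    return job_running and container_up
```

Feeding it the two outputs above ("No running jobs" and an empty container table) returns False, which matches what I am seeing.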

from nomad-driver-containerd.

th0m commented on June 8, 2024

Thanks for looking at this, @shishir-a412ed.
I had forgotten a step in my reproduction: removing -dev from the Nomad systemd unit file in vagrant/setup.sh.
This change is key as in dev mode Nomad tears down all the jobs upon receiving SIGTERM so the RecoverTask function does not get called when Nomad comes back up (it might be a new behavior, not sure).
To make things clearer, I pushed a branch with all the required changes and I ran your steps to make sure I was still able to reproduce: th0m@73cd9fc
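
To illustrate why the -dev flag matters here, below is a toy model of the behavior I described. It is not Nomad's actual code: the class and method names are invented, and it simply encodes my assumption that in dev mode the agent tears down all tasks on shutdown, so nothing is left to recover on the next start.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Stand-in for a task driver; records which tasks it was asked to recover."""
    recovered: list = field(default_factory=list)

    def recover_task(self, task_id):
        self.recovered.append(task_id)


@dataclass
class ToyAgent:
    """Stand-in for the agent; `state` plays the role of persisted task handles."""
    dev_mode: bool
    driver: ToyDriver
    state: dict = field(default_factory=dict)

    def run_task(self, task_id):
        # Remember a handle for the running task.
        self.state[task_id] = {"container": f"{task_id}-container"}

    def shutdown(self):
        # Assumption from this thread: dev mode destroys every task on
        # SIGTERM, so no handles survive the restart.
        if self.dev_mode:
            self.state.clear()

    def start(self):
        # On startup, ask the driver to recover every task that still
        # has a persisted handle.
        for task_id in sorted(self.state):
            self.driver.recover_task(task_id)


def recovered_after_restart(dev_mode):
    """Run one task, restart the toy agent, and return the recovered task IDs."""
    driver = ToyDriver()
    agent = ToyAgent(dev_mode=dev_mode, driver=driver)
    agent.run_task("hello-task")
    agent.shutdown()
    agent.start()
    return driver.recovered
```

In this model, recovered_after_restart(True) returns an empty list (the recovery path is never exercised), while recovered_after_restart(False) returns the one task, which is the case the bug lives in.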

Here is the output after vagrant destroy, vagrant up and vagrant ssh:

vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job run example/hello.nomad                                                          
==> 2021-09-02T18:15:34Z: Monitoring evaluation "975d60ce"                                                                                                     
    2021-09-02T18:15:34Z: Evaluation triggered by job "hello"                                                                                                  
==> 2021-09-02T18:15:35Z: Monitoring evaluation "975d60ce"                                                                                                     
    2021-09-02T18:15:35Z: Evaluation within deployment: "b57e2538"                                                                                             
    2021-09-02T18:15:35Z: Allocation "fcb0348a" created: node "2f7a1a27", group "hello-group"                                                                  
    2021-09-02T18:15:35Z: Evaluation status changed: "pending" -> "complete"                                                                                   
==> 2021-09-02T18:15:35Z: Evaluation "975d60ce" finished with status "complete"                                                                                
==> 2021-09-02T18:15:35Z: Monitoring deployment "b57e2538"                                                                                                     
  ✓ Deployment "b57e2538" successful                                                                                                                           
                                                                                                                                                               
    2021-09-02T18:15:58Z                                                                                                                                       
    ID          = b57e2538                                                                                                                                     
    Job ID      = hello                                                                                                                                        
    Job Version = 0                                                                                                                                            
    Status      = successful                                                                                                                                   
    Description = Deployment completed successfully                                                                                                            
                                                                               
    Deployed                                                                   
    Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline                                                                                        
    hello-group  1        1       1        0          2021-09-02T18:25:56Z     
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job status hello                                                                     
ID            = hello                                                                                                                                          
Name          = hello                                                          
Submit Date   = 2021-09-02T18:15:34Z                                           
Type          = service                                                                                                                                        
Priority      = 50                                                             
Datacenters   = dc1                                                                                                                                            
Namespace     = default                                                                                                                                        
Status        = running                                                        
Periodic      = false                                                          
Parameterized = false                                                                                                                                          
                                                                               
Summary                                                                        
Task Group   Queued  Starting  Running  Failed  Complete  Lost                                                                                                 
hello-group  0       0         1        0       0         0           
                                                                               
Latest Deployment                                                              
ID          = b57e2538                                                         
Status      = successful                                                                                                                                       
Description = Deployment completed successfully                                                                                                                
                                                                                                                                                               
Deployed                                                                       
Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
hello-group  1        1       1        0          2021-09-02T18:25:56Z
                                                                               
Allocations                                                                    
ID        Node ID   Task Group   Version  Desired  Status   Created  Modified
fcb0348a  2f7a1a27  hello-group  0        run      running  35s ago  12s ago
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad alloc logs fcb0348a                                                                  
Count is: 0                                                                    
Count is: 1                                                                    
Count is: 2                                                                    
Count is: 3                                                                                                                                                    
Count is: 4                                                                                                                                                    
Count is: 5                                                                                                                                                    
Count is: 6                                                                                                                                                    
Count is: 7                                                                                                                                                    
Count is: 8                                                                                                                                                    
Count is: 9                                                                    
Count is: 10                                                                                                                                                   
Count is: 11                                                                                                                                                                                              
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ sudo systemctl restart nomad
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ sudo systemctl status nomad                                                                
● nomad.service - nomad client + nomad server + nomad-driver-containerd
   Loaded: loaded (/lib/systemd/system/nomad.service; disabled; vendor preset: enabled)                                                                        
   Active: active (running) since Thu 2021-09-02 18:16:21 UTC; 8s ago
     Docs: https://nomadproject.io                                             
 Main PID: 15923 (nomad)                                                       
    Tasks: 16                                                                  
   CGroup: /system.slice/nomad.service                                         
           ├─15923 /usr/bin/nomad agent -bind=0.0.0.0 -config=/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd/example/agent.hcl -plugin-dir=/tm 
           └─15958 /tmp/nomad-driver-containerd/containerd-driver

Sep 02 18:16:27 vagrant nomad[15923]:     2021-09-02T18:16:27.476Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=d141eb1c 
Sep 02 18:16:27 vagrant nomad[15923]: Desired Changes for "hello-group": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)"     
Sep 02 18:16:27 vagrant nomad[15923]:     2021-09-02T18:16:27.476Z [DEBUG] worker.service_sched: setting eval status: eval_id=d141eb1c-db3a-a9a8-3d78-755097c7 
Sep 02 18:16:27 vagrant nomad[15923]:     2021-09-02T18:16:27.480Z [DEBUG] worker: updated evaluation: eval="<Eval "d141eb1c-db3a-a9a8-3d78-755097c79d70" JobI 
Sep 02 18:16:27 vagrant nomad[15923]:     2021-09-02T18:16:27.480Z [DEBUG] worker: ack evaluation: eval_id=d141eb1c-db3a-a9a8-3d78-755097c79d70                
Sep 02 18:16:28 vagrant nomad[15923]:     2021-09-02T18:16:28.502Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited: 
Sep 02 18:16:28 vagrant nomad[15923]:     2021-09-02T18:16:28.502Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=fcb0348 
Sep 02 18:16:28 vagrant nomad[15923]:     2021-09-02T18:16:28.502Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request: al 
Sep 02 18:16:28 vagrant nomad[15923]:     2021-09-02T18:16:28.502Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request; re 
Sep 02 18:16:28 vagrant nomad[15923]:     2021-09-02T18:16:28.502Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=fcb0348a- 
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad job status hello                                                                     
ID            = hello                                                          
Name          = hello                                                          
Submit Date   = 2021-09-02T18:15:34Z                                           
Type          = service                                                        
Priority      = 50                                                             
Datacenters   = dc1                                                            
Namespace     = default                                                        
Status        = running                                                        
Periodic      = false                                                          
Parameterized = false                                                          

Summary                                                                        
Task Group   Queued  Starting  Running  Failed  Complete  Lost
hello-group  0       0         1        0       0         0

Latest Deployment                                                              
ID          = b57e2538                                                         
Status      = successful                                                       
Description = Deployment completed successfully

Deployed                                                                       
Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
hello-group  1        1       1        0          2021-09-02T18:25:56Z

Allocations                                                                    
ID        Node ID   Task Group   Version  Desired  Status   Created   Modified
fcb0348a  2f7a1a27  hello-group  0        run      running  1m3s ago  40s ago
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad alloc logs -f fcb0348a                                                               
Count is: 0
Count is: 1
Count is: 2
Count is: 3
Count is: 4
Count is: 5
Count is: 6
Count is: 7
Count is: 8
Count is: 9
Count is: 10
Count is: 11
Count is: 12
Count is: 13
Count is: 14
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ nomad status                              
ID     Type     Priority  Status   Submit Date                                                                                                                 
hello  service  50        running  2021-09-02T18:15:34Z
root@vagrant:/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd# export CONTAINERD_NAMESPACE=nomad
root@vagrant:/home/vagrant/go/src/github.com/Roblox/nomad-driver-containerd# ./nerdctl ps
CONTAINER ID    IMAGE                        COMMAND            CREATED          STATUS    PORTS    NAMES
hello-task-f    docker.io/shm32/count:1.0    "/tmp/count.sh"    7 minutes ago    Up


shishir-a412ed commented on June 8, 2024

@th0m Aah you are right! (I am able to reproduce the issue now)

> This change is key as in dev mode Nomad tears down all the jobs upon receiving SIGTERM so the RecoverTask function does not get called when Nomad comes back up (it might be a new behavior, not sure).

Yeah, this seems like a regression. I used to test this code path (RecoverTask in containerd-driver) with Nomad in dev mode in earlier versions, and it worked nicely. It seems something changed in later versions that broke this! I will raise an issue with hashicorp/nomad.

For this bug, I see the following in the logs after systemctl restart nomad (I will keep digging into why it's happening):

Sep 02 18:46:24 vagrant nomad[16214]:     2021-09-02T18:46:24.593Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task error="plugin is shut down"
Sep 02 18:46:24 vagrant nomad[16214]:     2021-09-02T18:46:24.593Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task error="plugin is shut down"
Sep 02 18:46:24 vagrant nomad[16214]:     2021-09-02T18:46:24.593Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request; retrying: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task attempts=1 error="plugin is shut down"
Sep 02 18:46:25 vagrant nomad[16214]:     2021-09-02T18:46:25.605Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task path=/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs/.hello-task.stdout.fifo @module=logmon timestamp=2021-09-02T18:46:25.604Z
Sep 02 18:46:25 vagrant nomad[16214]:     2021-09-02T18:46:25.605Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d task=hello-task path=/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs/.hello-task.stderr.fifo @module=logmon timestamp=2021-09-02T18:46:25.605Z
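
For digging through more of these, here is a rough parser sketch that pulls the level, logger, message and key=value fields out of log lines shaped like the ones above. The function names are made up, and the line format is inferred only from the lines in this thread, so treat it as a starting point rather than a general Nomad log parser. Lines truncated by the terminal will yield truncated values.

```python
import re

# Timestamp, [LEVEL], then the rest of the line ("logger: message: k=v ...").
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s+\[(?P<level>[A-Z]+)\]\s+(?P<rest>.*)$"
)
# key=value or key="quoted value"; keys may start with "@" (e.g. @module).
PAIR_RE = re.compile(r'(?P<key>[@\w.]+)=(?P<value>"[^"]*"|\S+)')


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    rest = match.group("rest")
    fields = {p.group("key"): p.group("value").strip('"')
              for p in PAIR_RE.finditer(rest)}
    first_pair = PAIR_RE.search(rest)
    head = rest[:first_pair.start()] if first_pair else rest
    logger, _, message = head.partition(": ")
    return {
        "ts": match.group("ts"),
        "level": match.group("level"),
        "logger": logger,
        "message": message.rstrip(": "),
        "fields": fields,
    }


def logmon_errors(lines):
    """Collect (alloc_id, task, error) for ERROR lines that mention logmon."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "ERROR" and "logmon" in entry["message"]:
            f = entry["fields"]
            found.append((f.get("alloc_id"), f.get("task"), f.get("error")))
    return found
```

Running logmon_errors over the lines above picks out the single "failed to start logmon" entry for hello-task with error "plugin is shut down".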

After I restart Nomad, the stdout/stderr fifos are still there (as expected):

root@vagrant:/tmp/nomad/alloc/ca5e1680-4ac1-6c3b-0d29-f3b1dc145d5d/alloc/logs# ls
hello-task.stderr.0  hello-task.stdout.0

However, either Nomad or containerd-driver has trouble reattaching to them. I need to look into why this is happening!


th0m commented on June 8, 2024

Great! Glad you were able to reproduce it, thank you.


shishir-a412ed commented on June 8, 2024

@th0m Fix should be available in release: https://github.com/Roblox/nomad-driver-containerd/releases/tag/v0.9.2

