Using a build of master, I'm encountering deadlocks when scaling to 64 spot fleet instances. The stack trace below is what appears in the master log. Once this message is logged, the fleet stops scaling up toward the max cluster size, and some of the workers that were already provisioned and launched never receive queued tasks.
WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#46] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@16e9a6e2 (owned by jenkins.util.Timer [#3]):
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue.maintain(Queue.java:1381)
at hudson.model.Queue$1.call(Queue.java:294)
at hudson.model.Queue$1.call(Queue.java:291)
at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:101)
at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:91)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
at java.lang.Thread.run(Thread.java:745)
, jenkins.util.Timer [#3] locked on com.amazon.jenkins.ec2fleet.EC2FleetCloud@402a23d8 (owned by jenkins.util.Timer [#1]):
at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:38)
at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:15)
at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
at hudson.model.Queue._withLock(Queue.java:1303)
at hudson.model.Queue.withLock(Queue.java:1180)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
, jenkins.util.Timer [#1] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@16e9a6e2 (owned by jenkins.util.Timer [#3]):
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue._withLock(Queue.java:1301)
at hudson.model.Queue.withLock(Queue.java:1180)
at jenkins.model.Nodes.addNode(Nodes.java:133)
at jenkins.model.Jenkins.addNode(Jenkins.java:1985)
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.addNewSlave(EC2FleetCloud.java:354)
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:311)
at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:42)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]]
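From the trace, the two timer threads appear to take the same two locks in opposite orders: IdleRetentionStrategy.check runs inside Queue.withLock (so it holds the Queue lock) and then waits for the EC2FleetCloud monitor, while the CloudNanny / EC2FleetCloud.updateStatus path holds the EC2FleetCloud monitor and then waits for the Queue lock inside Jenkins.addNode. We are synchronizing on the parent object (the EC2FleetCloud instance), but sadly I'm not familiar enough with Java to make meaningful changes. I imagine the parent object should manage its own synchronization, rather than the children each trying to lock it?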
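To make the idea concrete, here is a minimal, self-contained sketch of one way such a cycle could be broken. It is not the plugin's code: `queueLock` and `cloud` are stand-ins for the Jenkins Queue lock and the EC2FleetCloud monitor, and `retentionCheck`, `updateStatus`, `pendingInstances`, and `registeredNodes` are hypothetical names invented for illustration. The assumption is that the status update can collect its changes while holding the cloud monitor, release it, and only then take the queue lock, so no thread ever waits for the queue lock while holding the cloud monitor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingSketch {
    // Stand-in for the lock used by hudson.model.Queue (hypothetical, not the real field).
    static final ReentrantLock queueLock = new ReentrantLock();
    // Stand-in for the EC2FleetCloud instance that is used as a monitor.
    static final Object cloud = new Object();

    static final List<String> pendingInstances = new ArrayList<>(); // guarded by cloud
    static final List<String> registeredNodes = new ArrayList<>();  // guarded by queueLock

    // Mirrors the retention path in the trace: queue lock first, then the cloud monitor.
    static void retentionCheck() {
        queueLock.lock();
        try {
            synchronized (cloud) {
                pendingInstances.size(); // placeholder for reading cloud state
            }
        } finally {
            queueLock.unlock();
        }
    }

    // Status update: snapshot under the cloud monitor, release it, then take the
    // queue lock. The two locks are never held together on this path.
    static void updateStatus(String instanceId) {
        List<String> toAdd;
        synchronized (cloud) {
            pendingInstances.add(instanceId);
            toAdd = new ArrayList<>(pendingInstances);
            pendingInstances.clear();
        }
        queueLock.lock();
        try {
            registeredNodes.addAll(toAdd); // placeholder for registering the nodes
        } finally {
            queueLock.unlock();
        }
    }

    // Runs both paths concurrently and returns how many nodes were registered.
    static int runDemo(int iterations) {
        queueLock.lock();
        try {
            registeredNodes.clear();
        } finally {
            queueLock.unlock();
        }
        Thread retention = new Thread(() -> {
            for (int i = 0; i < iterations; i++) retentionCheck();
        });
        Thread status = new Thread(() -> {
            for (int i = 0; i < iterations; i++) updateStatus("i-" + i); // made-up ids
        });
        retention.start();
        status.start();
        try {
            retention.join(10_000);
            status.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (retention.isAlive() || status.isAlive()) {
            throw new IllegalStateException("threads did not finish; possible deadlock");
        }
        queueLock.lock();
        try {
            return registeredNodes.size();
        } finally {
            queueLock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("registered nodes: " + runDemo(1000));
    }
}
```

Another option would be to take the queue lock first on every path that also needs the cloud monitor. Either way the point is a single, consistent lock order; whether the real updateStatus can be split like this is something I can't judge.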