In a large deployment with 1000 cells (this will likely also occur with fewer cells), we saw apps crash during an update because the policy-server-internal gets overloaded during container creation and requests from the vxlan-policy-agent time out.
During an update with 1000 cells, several hundred apps crashed, and in another scenario even thousands. When apps are drained to a different cell, or more generally on container creation, their policies are requested from the policy-server-internal. Because many of these requests are issued at the same time, the policy-server-internal is overloaded and the requests time out, which crashes the apps. The guardian log for one failed container creation:
{"timestamp":"2018-12-04T16:16:43.381785739Z","level":"error","source":"guardian","message":"guardian.create.external-networker-result","data":{"action":"up","error":"exit status 1","handle":"27921ba0-afe9-4653-6dbb-d1ab","session":"6274","stderr":"cfnetworking: cni up failed: add network list failed: vpa response code: 500 with message: failed to force policy p
oll cycle: get-rules: failed to get policies: http client do: Get https://policy-server.service.cf.internal:4003/networking/v1/internal/policies?id=00ee43e2-cb6b-4bce-8b3d-34af44a4519b,e06649ce-40b7-40b3-8f83-339a832a44a7,83119d81-6269-410c-9c7e-6b0b5cfc2997,313261ae-4107-4f3a-8a99-5c63c2bc157c,165aebcb-cd59-4ab7-829f-a87b08642329,2207420e-9e14-4979-8c3d-eb1f0f8
4d25d,2237cfe8-cf0f-469c-b9cb-84fa8be75be2,61f88739-5ea3-40b7-8ba3-36100bf8665e,1d08fca4-5a11-44d4-85e4-90233fccd976,d5afd404-a250-4b59-8c38-106c765ebf06,2d105c1c-6aff-4f09-b83d-00550fac6245,b206bdb8-23fb-4d90-8a8c-24e2f6c6c4e9,fd886474-47ee-4709-8b06-1c88c954dfb5,75f1693d-7884-424a-9dba-ca09202a2eed,cb121d69-f5e9-4310-8d19-c636d528916f,5fa65bbe-ecea-4acd-8cb5-b
70ef56f87c5,6f3a12c9-b190-4ab8-b677-2786fff7abed,077a31a8-3925-4818-b42e-11adf7ad8772,d590b654-e32d-4f1b-985b-323daf349fb5,8045de6f-97fc-4df9-a964-12f06a4ffb2f,a229c94e-bad5-48e3-99c6-4ffa3df0eff6,03d19d9c-928f-48f6-b099-ce3cb36bc36c,a825d1e0-472e-425c-9bd6-4a344ef8b497,9d27ec24-084c-4b3a-b6dc-36c99807feab,9c31c071-6e28-4678-b8f3-653687e7ca0b,b1bf32c1-a9de-48d6-
8318-e2089ce15ce5,28d5b6a8-bd9b-4099-be30-a446133b0695,147a1af4-5a6c-48df-92b1-553c6585b21d,80ceb97e-5525-443b-8f39-dc785fd1a00d,fcba7369-eae4-4dac-9292-d3f706577472,48d694c2-dd3a-4c14-a09d-58ee14785d23,d31e5d50-5b65-4805-a45b-7e08c0f98e2c,370d95f2-29d0-4d0c-9f16-580be271d688,6aa1a0da-77ab-4669-a87d-27cb40e3dffa,fe326b2e-5b96-4b17-a98a-8e3f8faec489,d0fbf238-09a7
-41ae-8076-3ecd4d6cd4bd,c7c23786-6d46-4eb2-857c-5213564125c7,7ad92303-f5a4-42ba-b98e-0ebaa09daa2b,7b9dcad8-c6ce-485c-9a32-0e51af27f022,51cefe4b-ded6-4f36-9fc1-a6494fe45ae9,2ba95e1a-0b54-4dfe-862e-6be8cbaaccfd,1ec92530-84cd-4244-8414-073c63e9fd64,e1b1974c-ac2f-42f2-ba7d-1d128736ef28,2a1fbd35-6ec2-4cdf-99e3-65968c89a398,f40d6531-b263-43e9-b536-83005b8b5624,8354283
6-dc14-46c7-a6ce-1b485be51239,fe8d196c-2b2a-47cd-9c99-bd0d5851a733,09ae56ca-8a24-4f4a-a1d2-024de05e610c,983649b6-d1bd-4fff-b1f9-adfde60cfe2e,43758557-0568-4615-abbe-3b9926e1d46e: net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n","stdin":"{\"Pid\":3120232,\"Properties\":{\"app_id\":\"fcba7369-eae4-4dac-9292-d3f706577472\",\"org_id\":\
"12d0e6f0-ba0e-45f7-835e-d9304d92c64c\",\"policy_group_id\":\"fcba7369-eae4-4dac-9292-d3f706577472\",\"ports\":\"8080\",\"space_id\":\"2ba95e1a-0b54-4dfe-862e-6be8cbaaccfd\"},\"netout_rules\":[{\"networks\":[{\"start\":\"0.0.0.0\",\"end\":\"9.255.255.255\"}]},{\"networks\":[{\"start\":\"11.0.0.0\",\"end\":\"147.204.7.255\"}]},{\"networks\":[{\"start\":\"147.204.
9.0\",\"end\":\"155.56.54.155\"}]},{\"networks\":[{\"start\":\"155.56.54.159\",\"end\":\"155.56.68.227\"}]},{\"networks\":[{\"start\":\"155.56.68.230\",\"end\":\"155.56.68.235\"}]},{\"networks\":[{\"start\":\"155.56.68.238\",\"end\":\"169.253.255.255\"}]},{\"networks\":[{\"start\":\"169.255.0.0\",\"end\":\"172.15.255.255\"}]},{\"networks\":[{\"start\":\"172.32.0
.0\",\"end\":\"192.167.255.255\"}]},{\"networks\":[{\"start\":\"192.169.0.0\",\"end\":\"255.255.255.255\"}]},{\"protocol\":1,\"networks\":[{\"start\":\"10.0.0.0\",\"end\":\"10.255.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"10.0.0.0\",\"end\":\"10.255.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol
\":1,\"networks\":[{\"start\":\"169.254.0.0\",\"end\":\"169.254.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"169.254.0.0\",\"end\":\"169.254.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":1,\"networks\":[{\"start\":\"172.16.0.0\",\"end\":\"172.31.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]}
,{\"protocol\":2,\"networks\":[{\"start\":\"172.16.0.0\",\"end\":\"172.31.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":1,\"networks\":[{\"start\":\"192.168.0.0\",\"end\":\"192.168.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"192.168.0.0\",\"end\":\"192.168.255.255\"}],\"ports\":[{\"start\":53,\
"end\":53}]}],\"netin\":[{\"host_port\":0,\"container_port\":8080},{\"host_port\":0,\"container_port\":2222}]}","stdout":""}}
The HTTP client timeout seems to be hardcoded to 5s.