Comments (6)
Thanks @Shaked, yes this does make sense, and this is exactly what we would recommend setting up manually.
Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...
One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another.
What do you think?
from clearml-server-helm.
> Thanks @Shaked, yes this does make sense, and this is exactly what we would recommend setting up manually.
Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?
> Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...
Great. Not sure why we faced it, but I added this yesterday:
```yaml
nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```
I haven't experienced any timeouts since, but that might just be because I haven't played with it much.
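For context, a sketch of where those annotations sit; the resource name and structure here are illustrative, assuming a standard nginx-ingress setup:

```yaml
# Illustrative Ingress snippet (name and apiVersion are assumptions for a Jan-2020 cluster)
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: trains-ingress   # hypothetical name
  annotations:
    # Raise all three proxy timeouts to 5 minutes
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```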
> One last remark, I think we should also add the trains-prod-example.com suffix as a parameter, since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another.
> What do you think?
Yeah, that actually makes a lot of sense, so we can support 2 different cases: either developers could use `ingress.host=trains-prod.example.com`, which will automatically be appended to all 3 hosts (`app`, `api` and `files`), or, if for some reason they would rather have different hosts, they could use `ingress.app_host=trains-prod.example.com`, `ingress.api_host=else-prod.example.com` and `ingress.files_host=something-else-prod.example.com`.
Not sure if the 2nd option is even needed, but I don't mind adding it.
What do you think?
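As a sketch, the values layout for the two cases above could look something like this (the parameter names are my assumption, not the chart's actual keys):

```yaml
# values.yaml (hypothetical keys)
ingress:
  # Case 1: a single suffix, automatically prefixed with app. / api. / files.
  host: trains-prod.example.com
  # Case 2 (optional overrides, if someone needs entirely different hosts):
  # app_host: trains-prod.example.com
  # api_host: else-prod.example.com
  # files_host: something-else-prod.example.com
```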
I have an update regarding the timeouts: now I'm seeing other 50x errors, such as 502 and 503 (the 504s have disappeared for now).
Looking into the nginx LB logs shows:

```
ku default logs -f steely-mule-nginx-ingress-controller-74d54f944f-9sxbz --since 20m | grep -v "HTTP/1.1\" 200" | grep -v '.well-known'
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - f863f3c72ec17b139f2074a16f8bff04
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - 5872faf664f5f08d45d2c4a9402f637c
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - 9c1dfc69dca4cfecf6abb150f938c827
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - d09b50fd480c22fafcfa93ecff0f377d
[07/Jan/2020:15:26:39 +0000] TCP 200 0 0 0.000
2020/01/07 15:26:51 [warn] 18318#18318: *137904157 a client request body is buffered to a temporary file /tmp/client-body/0000003046, client: 10.240.0.5, server: api.trains-stage.example.com, request: "GET /v2.1/events.add_batch HTTP/1.1", host: "api.trains-stage.example.com"
```
```
ku trains logs -f apiserver-75fc489669-x9k76 --since 20m | grep -vi 'returned 200'
[2020-01-07 15:41:49,579] [8] [INFO] [trains.non_responsive_tasks_watchdog] Starting cleanup cycle for running tasks last updated before 2020-01-07 13:41:49.579258
[2020-01-07 15:41:49,581] [8] [INFO] [trains.non_responsive_tasks_watchdog] Done
```
The API server failed and restarted 11 times:

```
kubectl -n trains get pods
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-75fc489669-x9k76   1/1     Running   11         14d
```
Using `--previous` shows:

```
[2020-01-07 15:24:15,477] [8] [INFO] [trains.updates] TRAINS-SERVER new version available: upgrade to v0.13.0 is recommended!
[2020-01-07 15:24:16,662] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 80ms
[2020-01-07 15:24:18,460] [8] [INFO] [trains.service_repo] Returned 200 for users.get_preferences in 3ms
[2020-01-07 15:24:18,753] [8] [INFO] [trains.service_repo] Returned 200 for tasks.ping in 3ms
[2020-01-07 15:24:18,783] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 10ms
[2020-01-07 15:24:18,840] [8] [INFO] [trains.service_repo] Returned 200 for users.get_current_user in 4ms
[2020-01-07 15:24:18,991] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 6ms
[2020-01-07 15:24:19,606] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 2ms
[2020-01-07 15:24:20,551] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 396ms
[2020-01-07 15:24:20,562] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 413ms
/opt/trains/wrapper.sh: line 28: 8 Killed python3 server.py
```
Maybe it's related to the timeouts as well? What am I missing?
Note: the main reason I haven't upgraded to v0.13.0 is because of my previous Azure FlexVolume PR allegroai/clearml-server-k8s#2
Thank you!
Hi @Shaked,
Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...
The 50x error codes are, I think, a byproduct of the pod restarts, which we believe stem from the k8s memory limit configuration. This is why in v0.13.0 we increased the memory limit, and to be honest I think we should be more generous with that.
I suggest you set it at 500M and check if the errors/restarts continue.
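For reference, a sketch of where that limit would go in the apiserver deployment spec (the values are illustrative, not the chart's actual defaults):

```yaml
# Illustrative container resources for the apiserver pod
resources:
  limits:
    memory: "500M"   # the container is OOM-killed ("Killed python3 server.py" above) if it exceeds this
  requests:
    memory: "500M"
```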
p.s.
> Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?
Yes please 😄
> Note: the main reason I haven't upgraded to v0.13.0 ...
With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)
Hey @bmartinn
> Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...
> The 50x error codes, I think, are a byproduct of the pod restarts, which we think are derived from k8s memory limit configuration. This is why on v0.13.0 we increased the memory limit, and to be honest I think we should be more generous with that.
> I suggest you set it at 500M and check if the errors/restarts continue.
I'm going to try this ASAP.
> Yes please 😄
The PR is available: allegroai/clearml-server-k8s#3
> With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)
Merged :)
Awesome!
I'll make sure we see to it :)