
bmartinn avatar bmartinn commented on June 1, 2024

Thanks @Shaked, yes, this does make sense, and it is exactly what we would recommend setting up manually.

Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another.

What do you think?

from clearml-server-helm.

Shaked avatar Shaked commented on June 1, 2024

Thanks @Shaked, yes, this does make sense, and it is exactly what we would recommend setting up manually.

Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

Great. Not sure why we faced it, but I added this yesterday:

    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"

I haven't experienced any timeouts since, but that may well be because I haven't played with it much.
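For reference, here is a minimal Python sketch of how the three annotations above could be generated from a single timeout value, so they cannot drift apart. The annotation keys are the ones quoted above; the helper name `nginx_timeout_annotations` is made up for illustration and is not part of any chart.

```python
from typing import Dict


def nginx_timeout_annotations(seconds: int) -> Dict[str, str]:
    """Build the three ingress-nginx proxy timeout annotations (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    prefix = "nginx.ingress.kubernetes.io"
    # Kubernetes annotation values must be strings, hence str(seconds).
    return {
        f"{prefix}/proxy-{kind}-timeout": str(seconds)
        for kind in ("connect", "read", "send")
    }


# 300 seconds, matching the values used in this thread.
annotations = nginx_timeout_annotations(300)
```

The resulting dictionary can be merged into the `metadata.annotations` of the Ingress object.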

One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another.
What do you think?

Yeah, that makes a lot of sense actually, so we can support two different cases:

Either developers use ingress.host=trains-prod.example.com, which will automatically be appended to all three prefixes (app, api and files), or, if for some reason they would rather have different hosts, they use ingress.app_host=trains-prod.example.com, ingress.api_host=else-prod.example.com and ingress.files_host=something-else-prod.example.com.

I'm not sure the second option is even needed, but I don't mind adding it.
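To make the two cases concrete, here is a rough Python sketch of the resolution logic. The key names (`host`, `app_host`, `api_host`, `files_host`) follow the parameters proposed above. The function name and the rule that a per-service override wins over the shared suffix are assumptions for illustration, not what the chart actually does.

```python
from typing import Dict, Mapping

# The three fixed prefixes mentioned in this thread.
SERVICES = ("app", "api", "files")


def resolve_ingress_hosts(ingress: Mapping[str, str]) -> Dict[str, str]:
    """Return the host name for each service (hypothetical helper)."""
    hosts = {}
    for service in SERVICES:
        override = ingress.get(f"{service}_host")
        if override:
            # Case 2: an explicit per-service host is used verbatim.
            hosts[service] = override
        elif ingress.get("host"):
            # Case 1: the shared suffix is appended to the fixed prefix.
            hosts[service] = f"{service}.{ingress['host']}"
        else:
            raise ValueError(f"set ingress.host or ingress.{service}_host")
    return hosts


shared = resolve_ingress_hosts({"host": "trains-prod.example.com"})
```

With only `host` set, `shared["api"]` is `api.trains-prod.example.com`; setting `api_host` as well would replace just that one entry.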

What do you think?


Shaked avatar Shaked commented on June 1, 2024

@bmartinn

I have an update regarding the timeouts. Now I'm seeing other 50x errors, such as 502, 503 (504 disappeared for now).

Looking into the nginx LB logs shows:

ku default logs -f steely-mule-nginx-ingress-controller-74d54f944f-9sxbz --since 20m | grep -v "HTTP/1.1\" 200" | grep -v '.well-known'

10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - f863f3c72ec17b139f2074a16f8bff04
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - 5872faf664f5f08d45d2c4a9402f637c
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - 9c1dfc69dca4cfecf6abb150f938c827
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - d09b50fd480c22fafcfa93ecff0f377d
[07/Jan/2020:15:26:39 +0000]TCP200000.000
2020/01/07 15:26:51 [warn] 18318#18318: *137904157 a client request body is buffered to a temporary file /tmp/client-body/0000003046, client: 10.240.0.5, server: api.trains-stage.example.com, request: "GET /v2.1/events.add_batch HTTP/1.1", host: "api.trains-stage.example.com"

ku trains logs -f apiserver-75fc489669-x9k76 --since 20m | grep -vi 'returned 200'
[2020-01-07 15:41:49,579] [8] [INFO] [trains.non_responsive_tasks_watchdog] Starting cleanup cycle for running tasks last updated before 2020-01-07 13:41:49.579258
[2020-01-07 15:41:49,581] [8] [INFO] [trains.non_responsive_tasks_watchdog] Done
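As an aside, the two grep filters on the ingress controller log can be approximated in Python to get a tally per endpoint and status code instead of raw lines. This is only a sketch: the regular expression assumes the access-log layout shown above, the function name is made up, and unlike `grep -v` it silently skips lines that are not access-log entries (such as the `[warn]` line).

```python
import re
from collections import Counter
from typing import Iterable

# Matches the quoted request and the status code that follows it, e.g.
# "GET /v2.1/events.add_batch HTTP/1.1" 503
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def count_non_200(lines: Iterable[str]) -> Counter:
    """Count (path, status) pairs for every access-log line that is not a 200."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not an access-log entry
        path, status = match["path"], int(match["status"])
        if status == 200 or ".well-known" in path:
            continue  # same exclusions as the grep -v filters
        counts[(path, status)] += 1
    return counts
```

Feeding it the lines of the controller log (for example, read from a saved log file) gives a quick view of which API calls are failing and with which code.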

The API server failed and restarted 11 times:

kubectl -n trains get pods
NAME                                    READY   STATUS      RESTARTS   AGE
apiserver-75fc489669-x9k76              1/1     Running     11         14d

The logs of the previous container instance (using --previous):

[2020-01-07 15:24:15,477] [8] [INFO] [trains.updates] TRAINS-SERVER new version available: upgrade to v0.13.0 is recommended!
[2020-01-07 15:24:16,662] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 80ms
[2020-01-07 15:24:18,460] [8] [INFO] [trains.service_repo] Returned 200 for users.get_preferences in 3ms
[2020-01-07 15:24:18,753] [8] [INFO] [trains.service_repo] Returned 200 for tasks.ping in 3ms
[2020-01-07 15:24:18,783] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 10ms
[2020-01-07 15:24:18,840] [8] [INFO] [trains.service_repo] Returned 200 for users.get_current_user in 4ms
[2020-01-07 15:24:18,991] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 6ms
[2020-01-07 15:24:19,606] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 2ms
[2020-01-07 15:24:20,551] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 396ms
[2020-01-07 15:24:20,562] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 413ms
/opt/trains/wrapper.sh: line 28:     8 Killed                  python3 server.py

Maybe it's related to the timeouts as well? What am I missing?

Note: the main reason I haven't upgraded to v0.13.0 is my previous Azure FlexVolume PR, allegroai/clearml-server-k8s#2

Thank you!


bmartinn avatar bmartinn commented on June 1, 2024

Hi @Shaked ,

Regarding timeouts, our defaults are also 5 minutes on all three connection types, and it seems stable on our setups...

The 50x error codes are, I think, a byproduct of the pod restarts, which we believe are caused by the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with it.
I suggest you set it to 500M and check whether the errors/restarts continue.
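One way to check the memory-limit theory before changing anything is to look at the last termination state Kubernetes recorded for the container. Below is a minimal Python sketch that inspects the parsed output of `kubectl -n trains get pod <pod-name> -o json`. The fields it reads (`status.containerStatuses[].lastState.terminated`) are standard pod status fields, while the function name is made up. Note that a kill of a child process (here `python3 server.py` runs under `wrapper.sh`) may not always be reported with the `OOMKilled` reason, so an empty result does not rule the theory out.

```python
from typing import Any, Dict, List, Mapping


def oom_killed_containers(pod: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """List containers whose previous instance was terminated as OOMKilled."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty for containers that have never restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            findings.append(
                {
                    "container": status.get("name"),
                    "restarts": status.get("restartCount", 0),
                    "exit_code": terminated.get("exitCode"),
                }
            )
    return findings
```

Load the kubectl JSON output with `json.load` and pass the resulting dictionary to the function; any entry it returns points at the memory limit as the cause of that container's restarts.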

p.s.

Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

Yes please 😄

Note: the main reason I haven't upgraded to v0.13.0 ...

With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)


Shaked avatar Shaked commented on June 1, 2024

Hey @bmartinn

Regarding timeouts, our defaults are also 5 minutes on all three connection types, and it seems stable on our setups...

The 50x error codes are, I think, a byproduct of the pod restarts, which we believe are caused by the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with it.
I suggest you set it to 500M and check whether the errors/restarts continue.

I'm going to try this ASAP.

Yes please 😄

The PR is available: allegroai/clearml-server-k8s#3

With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)

Merged :)


bmartinn avatar bmartinn commented on June 1, 2024

Awesome!
I'll make sure we see to it :)

