Hey, In order to connect between the services (app, files, api) and


Subdomains and ingress about clearml-server-helm HOT 6 OPEN

allegroai commented on June 1, 2024

Subdomains and ingress

from clearml-server-helm.

Comments (6)

bmartinn commented on June 1, 2024

Thanks @Shaked, yes, this makes sense, and it is exactly what we would recommend setting up manually.

Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to expose the only part that changes from one deployment to another.

What do you think?


Shaked commented on June 1, 2024

> Thanks @Shaked, yes, this makes sense, and it is exactly what we would recommend setting up manually.

Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

> Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

Great. Not sure why we faced it, but I added this yesterday:

    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"

I haven't experienced any timeouts since, but that may just be because I haven't played with it much.
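
For anyone who wants to apply the same three annotations programmatically, here is a minimal sketch. It only builds the annotation map and a JSON merge patch body. The function names and the 300-second default are assumptions for illustration, not something from the chart.

```python
import json

# The three ingress-nginx annotation keys quoted above.
TIMEOUT_ANNOTATIONS = (
    "nginx.ingress.kubernetes.io/proxy-connect-timeout",
    "nginx.ingress.kubernetes.io/proxy-read-timeout",
    "nginx.ingress.kubernetes.io/proxy-send-timeout",
)


def timeout_annotations(seconds: int = 300) -> dict:
    """Return the annotation map; annotation values must be strings."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return {key: str(seconds) for key in TIMEOUT_ANNOTATIONS}


def annotation_patch(seconds: int = 300) -> str:
    """Return a JSON merge patch body that sets the annotations on an Ingress."""
    patch = {"metadata": {"annotations": timeout_annotations(seconds)}}
    return json.dumps(patch, sort_keys=True)


if __name__ == "__main__":
    print(annotation_patch())
```

The printed patch could then be passed to something like `kubectl patch ingress <name> --type merge -p '<patch>'`, where `<name>` is whatever your Ingress is called.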

> One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to expose the only part that changes from one deployment to another.
> What do you think?

Yeah, that actually makes a lot of sense. We could support two cases:

Developers could either set `ingress.host=trains-prod.example.com`, which would automatically be appended to all three prefixes (app, api and files), or, if for some reason they would rather have different hosts, set `ingress.app_host=trains-prod.example.com`, `ingress.api_host=else-prod.example.com` and `ingress.files_host=something-else-prod.example.com`.

I'm not sure the second option is even needed, but I don't mind adding it.
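
To make the two cases concrete, here is a rough sketch of the resolution logic in Python. It is not the chart's template code. `resolve_hosts` and its arguments are hypothetical names, and it assumes that a per-service override is used verbatim as the full hostname, while the shared host is treated as a suffix after the fixed prefix.

```python
from typing import Dict, Optional

# The three fixed service prefixes discussed in this thread.
SERVICES = ("app", "api", "files")


def resolve_hosts(
    host: Optional[str] = None,
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each service to a hostname.

    host      -- shared suffix, e.g. "trains-prod.example.com" (case 1)
    overrides -- per-service full hostnames keyed by service name (case 2);
                 assumed here to win over the shared suffix
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(SERVICES)
    if unknown:
        raise ValueError(f"unknown services: {sorted(unknown)}")

    hosts = {}
    for service in SERVICES:
        if service in overrides:
            hosts[service] = overrides[service]
        elif host:
            hosts[service] = f"{service}.{host}"
        else:
            raise ValueError(f"no host configured for {service!r}")
    return hosts


if __name__ == "__main__":
    print(resolve_hosts("trains-prod.example.com"))
```

With only the shared host set, all three services get `<prefix>.<host>`. Mixing both options lets a single service live elsewhere without repeating the other two.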

What do you think?


Shaked commented on June 1, 2024

@bmartinn

I have an update regarding the timeouts. I'm now seeing other 50x errors, such as 502 and 503 (504 has disappeared for now).

Looking into `nginx LB` shows:

    ku default logs -f steely-mule-nginx-ingress-controller-74d54f944f-9sxbz --since 20m | grep -v "HTTP/1.1\" 200" | grep -v '.well-known'

    10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - f863f3c72ec17b139f2074a16f8bff04
    10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - 5872faf664f5f08d45d2c4a9402f637c
    10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - 9c1dfc69dca4cfecf6abb150f938c827
    10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - d09b50fd480c22fafcfa93ecff0f377d
    [07/Jan/2020:15:26:39 +0000]TCP200000.000
    2020/01/07 15:26:51 [warn] 18318#18318: *137904157 a client request body is buffered to a temporary file /tmp/client-body/0000003046, client: 10.240.0.5, server: api.trains-stage.example.com, request: "GET /v2.1/events.add_batch HTTP/1.1", host: "api.trains-stage.example.com"
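
As an alternative to chaining `grep -v`, a small script can tally the 5xx responses per endpoint and upstream. This is only a sketch: the regular expression is written against the access-log lines shown above and assumes the controller uses that same log format. `count_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the request, status, and upstream name in lines shaped like the
# access-log entries above; error-log lines ("[warn] ...") do not match.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) \d+ '
    r'"[^"]*" "[^"]*" \d+ [\d.]+ \[(?P<upstream>[^\]]*)\]'
)


def count_errors(lines):
    """Count 5xx responses grouped by (status, path, upstream)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            key = (match.group("status"), match.group("path"), match.group("upstream"))
            counts[key] += 1
    return counts
```

Calling `count_errors(sys.stdin)` on piped controller logs would give a quick view of which API calls fail most and against which backend service.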

    ku trains logs -f apiserver-75fc489669-x9k76 --since 20m | grep -vi 'returned 200'
    [2020-01-07 15:41:49,579] [8] [INFO] [trains.non_responsive_tasks_watchdog] Starting cleanup cycle for running tasks last updated before 2020-01-07 13:41:49.579258
    [2020-01-07 15:41:49,581] [8] [INFO] [trains.non_responsive_tasks_watchdog] Done

The API server failed and restarted 11 times:

    kubectl -n trains get pods
    NAME                                    READY   STATUS      RESTARTS   AGE
    apiserver-75fc489669-x9k76              1/1     Running     11         14d

Using `--previous`:

    [2020-01-07 15:24:15,477] [8] [INFO] [trains.updates] TRAINS-SERVER new version available: upgrade to v0.13.0 is recommended!
    [2020-01-07 15:24:16,662] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 80ms
    [2020-01-07 15:24:18,460] [8] [INFO] [trains.service_repo] Returned 200 for users.get_preferences in 3ms
    [2020-01-07 15:24:18,753] [8] [INFO] [trains.service_repo] Returned 200 for tasks.ping in 3ms
    [2020-01-07 15:24:18,783] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 10ms
    [2020-01-07 15:24:18,840] [8] [INFO] [trains.service_repo] Returned 200 for users.get_current_user in 4ms
    [2020-01-07 15:24:18,991] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 6ms
    [2020-01-07 15:24:19,606] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 2ms
    [2020-01-07 15:24:20,551] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 396ms
    [2020-01-07 15:24:20,562] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 413ms
    /opt/trains/wrapper.sh: line 28:     8 Killed                  python3 server.py

Maybe it's related to the timeouts as well? What am I missing?

Note: the main reason I haven't upgraded to v0.13.0 is my previous Azure FlexVolume PR, allegroai/clearml-server-k8s#2.

Thank you!


bmartinn commented on June 1, 2024

Hi @Shaked ,

Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...

The 50x error codes, I think, are a byproduct of the pod restarts, which we think are caused by the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with that.
I suggest you set it to 500M and check whether the errors/restarts continue.
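
One way to check the memory-limit theory is to look at the pod's last container termination state. The sketch below inspects the JSON from `kubectl -n trains get pod <pod> -o json`. `restart_report` is a hypothetical helper, and the `OOMKilled` reason is only reported when the container itself was terminated by the OOM killer, so a missing flag does not rule out memory pressure.

```python
def restart_report(pod: dict) -> list:
    """Summarize restarts and the last termination of each container.

    `pod` is the parsed JSON of a single Pod object. Exit code 137 means the
    process received SIGKILL (128 + 9), which is what the OOM killer sends.
    """
    report = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        terminated = status.get("lastState", {}).get("terminated") or {}
        report.append(
            {
                "container": status.get("name"),
                "restarts": status.get("restartCount", 0),
                "last_reason": terminated.get("reason"),
                "last_exit_code": terminated.get("exitCode"),
                "oom_killed": terminated.get("reason") == "OOMKilled",
                "sigkilled": terminated.get("exitCode") == 137,
            }
        )
    return report
```

If `oom_killed` or `sigkilled` is true for the API server container, raising the memory limit as suggested above is a reasonable next step.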

p.s.

> Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

Yes please 😄

> Note: the main reason I haven't upgraded to v0.13.0 ...

With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)


Shaked commented on June 1, 2024

Hey @bmartinn

> Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...
>
> The 50x error codes, I think, are a byproduct of the pod restarts, which we think are caused by the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with that.
> I suggest you set it to 500M and check whether the errors/restarts continue.

I'm going to try this ASAP.

> Yes please 😄

The PR is available: allegroai/clearml-server-k8s#3

> With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)

Merged :)


bmartinn commented on June 1, 2024

Awesome!
I'll make sure we see to it :)

