
Comments (17)

Montana commented on July 26, 2024

Hi Northskool,

If you switch from the custom-built binaries in the travis-blue-public S3 bucket to the ones produced upstream (such as https://storage.googleapis.com/shellcheck/shellcheck-v0.4.6.linux.x86_64.tar.xz), make sure to add --strip-components=1 to the tar invocation. Otherwise the shellcheck binary ends up in a subdirectory and is not on PATH, which is easy to miss, because the older preinstalled shellcheck binary in /usr/local/bin is then silently used instead. I recommend using Docker to prevent these issues from occurring.
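
For reference, a minimal sketch of what that install step could look like with the upstream tarball above; the destination paths are illustrative:

mkdir -p /tmp/shellcheck
curl -fsSL https://storage.googleapis.com/shellcheck/shellcheck-v0.4.6.linux.x86_64.tar.xz \
  | tar -xJf - --strip-components=1 -C /tmp/shellcheck
# Without --strip-components=1 the binary lands one directory deeper, in
# /tmp/shellcheck/shellcheck-v0.4.6/, and never makes it onto PATH.
sudo mv /tmp/shellcheck/shellcheck /usr/local/bin/shellcheck
command -v shellcheck && shellcheck --version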


Montana commented on July 26, 2024

Hi Northskool,

The chart Terraform generates doesn't tell me anything about Harvard's setup as it pertains to core infrastructure.

Looking at the flow chart: with only a single manager, you have no HA, and when things are down, nothing will be rescheduled. You need at least three managers to survive a single manager going down. When the single manager restarts, there's a race between the scheduler assigning workloads and the existing node reconnecting, and the worker node is unlikely to win that race.

Swarm will not proactively rebalance or reschedule workloads either. When you add new nodes to the cluster, or simply restart a downed node, existing tasks continue to run on their current nodes until something forces them to be rescheduled; that something can be a node going down or an update to a service. This actually improves HA, since a newly joined node may be unstable or flapping.

You may need to force Swarm to rebalance a service, so you can run:

docker service update --force $service_name

This will force an update without any other modifications to that service. (Replace $service_name with the name or ID of your service.)
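
For example, assuming a hypothetical service named web, you can check where tasks are placed before and after forcing the update:

docker node ls                     # confirm every node is Ready and Active
docker service ps web              # see which node each task currently runs on
docker service update --force web  # force a rolling reschedule of the tasks
docker service ps web              # tasks may now land on the newly joined or restarted node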


Montana commented on July 26, 2024

Hi Northskool,

I looked into this and discovered that it's caused by the generate-latest-docker-image-tags script, which drops the commit hash and only looks at the release numbers.

While this is indeed unpredictable behavior, I don't think the effort of fixing it is necessarily worth it since this only happens for parallel branches and only on staging (where we rely on the latest Docker image instead of a specific version).

Furthermore, even fixing it to always retrieve the latest tag would still cause issues in the parallel-branches case, because people might not realize that someone else has built a newer image, and they could end up deploying that one instead of their own.
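
As a rough sketch (not the actual script), here's why dropping the commit hash is ambiguous for the tags posted further down the thread: two of them share the same release number, so once the hash is stripped there's nothing left to break the tie.

# Rough illustration only, not bin/generate-latest-docker-image-tags itself.
printf '%s\n' \
  v3.6.0-14-gc9f6bda \
  v3.6.0-14-g592d662 \
  v3.6.0-13-g1c1f9ec \
  | sed 's/-g[0-9a-f]*$//' \
  | sort -rV | uniq -c
#   2 v3.6.0-14   <- two different images collapse onto the same "latest" release
#   1 v3.6.0-13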


Montana commented on July 26, 2024

Hi Northskool,

Flux here at Travis lets us define rules for automatically pushing updates to Docker tags that match a specific pattern. We could use this to automatically deploy new code releases once the Docker image is pushed. When this happens, Flux commits the change of the release to the Git repo.

I imagine a pattern where tagged releases automatically go to production. Staging, I'm less sure of. Maybe it could be every tag, but I'm not sure about that. We may want to define some rules for what gets deployed there, or maybe we keep it manual. I'm not sure how Flux/Terraform would be an issue; would you care to elaborate on that?


Northskool commented on July 26, 2024

hi @Montana, you were correct, it did end up in a subdirectory and not on PATH. here at harvard university we're recommended to use a newer version of shellcheck, what do you recommend? but thank you, the subdirectory tip was the fix.


Montana commented on July 26, 2024

Hi Northskool,

I have no idea how Harvard University has its systems set up, particularly what you're trying to do. Shellcheck itself should always be kept up to date.

I'm glad I could help you with the silent PATH issue.


Northskool commented on July 26, 2024

as it pertains to the setup at harvard, this is what terraform generated for me:

[attached image: docker-worker flow chart]


Northskool commented on July 26, 2024

one last thing @Montana,

there have been unpredictable results when running make plan to try and test a new worker version in staging.

bin/generate-latest-docker-image-tags doesn't know which Docker tag is the newest for the most recent worker. Input data, from the tags on Docker Hub:

v3.6.0-14-gc9f6bda
v3.6.0-14-g592d662
v3.6.0-13-g1c1f9ec
v3.6.0-12-g76c1dc7
v3.6.0-11-gca14cdd


Northskool commented on July 26, 2024

so as of now is there any way to make behavior more predictable in parallel branches, etc?


Montana commented on July 26, 2024

There are some ways to make the behavior a lot more predictable:

  • google_compute_region_instance_group_manager supports instances in multiple zones, so no need for a group per zone
  • create_before_destroy on the google_compute_instance_template allows templates to be destroyed, and the managed group to be updated
  • name_prefix on the google_compute_instance_template adds a random suffix automatically instead of using a hash of the cloud-init metadata, thus being more conflict resistant

This will still leave us in a situation where instances without an instance template can exist (like with the NATs), causing unpredictable behavior, but at least Terraform can now be applied successfully, which is an improvement.


Northskool commented on July 26, 2024

i noticed you folks were using Flux, is there a reason for this? I could see some collisions happening between Flux and Terraform, possibly. this will be my last question...


Northskool commented on July 26, 2024

will there be a kubernetes config you push out as well much like you did with this terraform repo?


Montana commented on July 26, 2024

Hi Northskool,

I think, on the surface, it would seem like Kubernetes already has pretty good support for deploying a mesh of distributed services: it will round-robin requests between a set of pods behind one DNS name, and it has ways to handle liveness and readiness. That said, I will be pushing up something similar early in 2023, yes.
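
To make that concrete, assuming a hypothetical Service named worker, the single DNS name maps onto a set of pod IPs that requests are spread across:

kubectl get service worker    # one stable name and ClusterIP for the whole set
kubectl get endpoints worker  # the individual pod IPs sitting behind that name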

To maintain feature parity with other first-class gRPC implementations, I don't suggest using kube-proxy; it routes at L4, which does not fit well with today's app-centric protocols. gRPC uses long-lived HTTP/2 connections, so balancing at the connection level pins a client to whichever backend it first connected to: if a pod has multiple replicas, the gRPC connection will be made to a single pod replica, and all calls will go to that one replica. gRPC with default Kubernetes load balancing, in my opinion, is broken.

There are two options to load balance gRPC effectively in 2022:

  • Client-side load balancing.
  • L7 (application) proxy load balancing. For example, YARP and Envoy will distribute gRPC calls across replicas.


Northskool commented on July 26, 2024

is there L7 proxy support? you mentioned round-robining...


Montana commented on July 26, 2024

Hi Northskool,

You can use an API Gateway for Kubernetes, such as Ambassador Edge Stack, which can bypass kube-proxy altogether, routing traffic directly to Kubernetes Pods. Ambassador is built on Envoy Proxy, an L7 proxy, so each gRPC request is load balanced between available Pods.

Something to note here: since Kubernetes needs to pull Docker images to run its containers, you'll need to push the Docker images you use to some Docker registry that Google Container Engine has permission to access; gcr.io, Docker Hub, etc. will all work. However, for production use, you might want to minimize traffic across network boundaries as a cost-reduction effort.
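
As a minimal sketch, assuming a hypothetical project ID my-project and image name worker, pushing to gcr.io looks roughly like this:

gcloud auth configure-docker                               # let docker push with your gcloud credentials
docker tag worker:v3.6.0 gcr.io/my-project/worker:v3.6.0   # retag the local image for the registry
docker push gcr.io/my-project/worker:v3.6.0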

Remember to run kubectl get services to show the current services:

NAME      CLUSTER-IP     EXTERNAL-IP  PORT(S)   AGE
postgres  10.107.246.55  <none>       5432/TCP  5s

In my example above you have a running Postgres server, reachable from anywhere in the cluster at postgres:5432.
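
If you want to sanity-check that DNS name from inside the cluster, a throwaway pod works; the image tag and user here are illustrative, and psql will still prompt for credentials:

kubectl run psql-test --rm -it --image=postgres:15 -- psql -h postgres -p 5432 -U postgres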


Northskool commented on July 26, 2024

could this be done through a ring hash?


Montana commented on July 26, 2024

The ring hash approach is used for both “sticky sessions” (where a cookie is set to ensure that all requests from a client arrive at the same Pod) and for “session affinity” (which relies on client IP or some other piece of client state). If you're using Envoy Proxy, you can define policies like this:

filter_metadata:
    envoy.lb:
      hash_key: "YOUR HASH KEY"

The trade-off with ring hash is that it can be more challenging to distribute load evenly between different backend servers, since client workloads may not be equal, so some servers end up running hotter than others. In addition, the computational cost of the hash adds some latency to requests, particularly at scale. Worth noting: like all hash-based load balancers, it is only effective when protocol routing is used that specifies a value to hash on.

