Comments (17)
Hi Northskool,
If switching from the custom-built binaries in the travis-blue-public S3 bucket to the ones produced upstream (such as https://storage.googleapis.com/shellcheck/shellcheck-v0.4.6.linux.x86_64.tar.xz), make sure to add `--strip-components=1` to the tar invocation. Otherwise the shellcheck binary ends up in a subdirectory, and so not on `PATH`, which is easy to miss: the older preinstalled shellcheck binary in /usr/local/bin is then silently used instead. I recommend using Docker to prevent these issues from occurring.
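For the tarball route, here's a small illustration of what `--strip-components=1` changes. This is a local mock, not the real download: it recreates the upstream layout (a `shellcheck-v0.4.6/` subdirectory wrapping the binary) and uses gzip instead of xz for portability; the flag behaves identically for `.tar.xz` archives.

```shell
#!/bin/sh
set -eu

# Mock the upstream tarball layout: binary nested in a versioned subdirectory.
tmp=$(mktemp -d)
mkdir -p "$tmp/shellcheck-v0.4.6"
printf '#!/bin/sh\necho mock-shellcheck\n' > "$tmp/shellcheck-v0.4.6/shellcheck"
chmod +x "$tmp/shellcheck-v0.4.6/shellcheck"
tar -czf "$tmp/release.tar.gz" -C "$tmp" shellcheck-v0.4.6

# Without the flag, the binary lands one directory deep -- not on PATH:
mkdir -p "$tmp/nostrip" "$tmp/strip"
tar -xzf "$tmp/release.tar.gz" -C "$tmp/nostrip"
ls "$tmp/nostrip"            # -> shellcheck-v0.4.6/

# With --strip-components=1 the subdirectory is dropped:
tar -xzf "$tmp/release.tar.gz" --strip-components=1 -C "$tmp/strip"
ls "$tmp/strip"              # -> shellcheck
```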
from terraform-travis-public.
Hi Northskool,
The chart Terraform generates doesn't tell me anything about Harvard's setup as it pertains to core infrastructure.
From taking a look at the flow chart: with only a single manager you have no HA, and when things go down nothing will be rescheduled. You need at least three managers to survive a single manager going down. When the single manager restarts, there is a race between the scheduler assigning a workload and the existing node reconnecting, and the worker node is unlikely to win that race.
Swarm will not proactively rebalance or reschedule workloads either. When you add new nodes to the cluster, or just restart a down node, existing tasks continue to run on their current node until something forces them to be rescheduled: a downed node, or an update to a service. This actually improves HA, since a newly joined node may be unstable or flapping.
If you need to force Swarm to rebalance a service, you can run:

```shell
docker service update --force $service_name
```

This forces an update without any other modifications to that service. (Replace `$service_name` with the name or ID of your service.)
Hi Northskool,
Looked into this; I've discovered that it's caused by the generate-latest-docker-image-tags script, which drops the commit hash and only looks at the release numbers.
While this is indeed unpredictable behavior, I don't think the effort of fixing it is necessarily worth it, since it only happens for parallel branches and only on staging (where we rely on the latest Docker image instead of a specific version).
Furthermore, even fixing it to always retrieve the latest tag would still cause issues in the parallel-branches case, because people might not be aware that someone else has built a newer image, and would end up deploying that one instead of their own.
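To make the failure mode concrete, here's a sketch of what "drop the hash, keep the release number" does to tags like the ones posted in this thread. This is illustrative only; the actual bin/generate-latest-docker-image-tags script isn't shown here, so the stripping logic below is an assumption:

```shell
#!/bin/sh
# Illustrative mock of the tag-stripping behavior described above.
set -eu

tags='v3.6.0-14-gc9f6bda
v3.6.0-14-g592d662
v3.6.0-13-g1c1f9ec
v3.6.0-12-g76c1dc7
v3.6.0-11-gca14cdd'

# Strip the -g<commit-hash> suffix, keeping only the release number:
stripped=$(printf '%s\n' "$tags" | sed 's/-g[0-9a-f]*$//')

# "Latest" by version sort is v3.6.0-14, but TWO different builds map to it,
# so which image that tag actually points at depends on push order:
printf '%s\n' "$stripped" | sort -rV | uniq -c
```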
Hi Northskool,
Flux here at Travis lets us define rules for automatically pushing updates to Docker tags that match a specific pattern. We could use this to automatically deploy new code releases once the Docker image is pushed. When this happens, Flux commits the change of the release to the Git repo.
I imagine a pattern where tagged releases automatically go to production. Staging, I'm less sure of. Maybe it could be every tag, but I'm not sure about that. We may want to define some rules for what gets deployed there, or maybe we keep it manual. I'm not sure how Flux/Terraform would be an issue; would you care to elaborate on that?
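The "tagged releases go to production" pattern could look something like this with Flux's image automation. A sketch assuming Flux v1 annotation syntax; the workload name, container name, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                      # placeholder workload name
  annotations:
    # Let Flux push image updates and commit them back to the Git repo.
    fluxcd.io/automated: "true"
    # Only follow release-style tags for this container (pattern is illustrative).
    fluxcd.io/tag.worker: semver:~3.6
spec:
  template:
    spec:
      containers:
        - name: worker
          image: travisci/worker:v3.6.0   # placeholder image
```

Staging could use a looser tag filter (or stay manual) depending on the rules we settle on.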
hi @Montana, you were correct, it did end up in a subdirectory and not on PATH. here at harvard university we're recommended to use a newer version of shellcheck, what do you recommend? but thank you, the subdirectory tip was the fix.
Hi Northskool,
I have no idea how Harvard University has its system set up, or particularly what you're trying to do. ShellCheck itself should always be kept up to date.
I'm glad I could help you with the silent `PATH` issue.
as it pertains to the setup at harvard, this is what terraform generated for me:
one last thing @Montana,
there have been unpredictable results when running `make plan` to try and test a new worker version in staging.
bin/generate-latest-docker-image-tags does not know which Docker tag is the newest for the most recent worker. Input data, the tags from Docker Hub:

```
v3.6.0-14-gc9f6bda
v3.6.0-14-g592d662
v3.6.0-13-g1c1f9ec
v3.6.0-12-g76c1dc7
v3.6.0-11-gca14cdd
```
so as of now, is there any way to make the behavior more predictable in parallel branches, etc.?
There are some ways to make the behavior a lot more predictable:
- `google_compute_region_instance_group_manager` supports instances in multiple zones, so there's no need for a group per zone
- `create_before_destroy` on the `google_compute_instance_template` allows templates to be destroyed and the managed group to be updated
- `name_prefix` on the `google_compute_instance_template` adds a random suffix automatically instead of using a hash of the cloud-init metadata, thus being more conflict-resistant

This will still leave us in a situation where instances without an instance template can exist (like with the NATs), causing unpredictable behavior, but at least Terraform can now be applied successfully, which is an improvement.
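Put together, the suggestions above might look like this. This is a sketch with placeholder names and values, trimmed to the relevant fields; exact block syntax depends on your google provider version:

```hcl
resource "google_compute_instance_template" "worker" {
  # name_prefix makes the provider append a random suffix instead of hashing
  # the cloud-init metadata, so template names don't collide.
  name_prefix  = "worker-"
  machine_type = "n1-standard-1"

  disk {
    source_image = "debian-cloud/debian-11"
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    # Create the replacement template before destroying the old one, so the
    # managed instance group can be updated in place.
    create_before_destroy = true
  }
}

resource "google_compute_region_instance_group_manager" "worker" {
  name               = "worker"
  region             = "us-central1"
  base_instance_name = "worker"

  version {
    instance_template = google_compute_instance_template.worker.id
  }
}
```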
i noticed you folks were using Flux, is there a reason for this? I could see some collisions happening with Flux/Terraform possibly, will be my last question...
will there be a kubernetes config you push out as well much like you did with this terraform repo?
Hi Northskool,
I think, on the surface, it would seem like Kubernetes already has pretty good support for deploying a mesh of distributed services: it will round-robin requests between a set of pods behind one DNS name, and it has ways to handle liveness and readiness. That said, I will be pushing up something similar early in 2023, yes.
To get feature parity with other first-class gRPC implementations, I don't suggest using `kube-proxy`; it routes at L4, which does not fit well with today's app-centric protocols. If a pod has multiple replicas, the gRPC connection is made to a single replica, and all calls go to that replica. gRPC with default Kubernetes load balancing is, in my opinion, broken.
There are two options to load balance gRPC effectively in 2022:
- Client-side load balancing.
- L7 (application) proxy load balancing. For example, YARP and Envoy will distribute gRPC calls across replicas.
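The client-side option usually starts from a headless Service, so clients see every Pod IP rather than one virtual IP. A sketch with placeholder names and port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grpc-backend        # placeholder name
spec:
  clusterIP: None           # headless: DNS returns one A record per Pod
  selector:
    app: grpc-backend       # assumed Pod label
  ports:
    - port: 50051
      targetPort: 50051
```

A gRPC client pointed at `dns:///grpc-backend:50051` with a `round_robin` policy can then spread calls across the replicas instead of pinning to one.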
is there L7 proxy support? you mentioned round robining...
Hi Northskool,
You can use an API Gateway for Kubernetes, such as Ambassador Edge Stack, which can bypass `kube-proxy` altogether, routing traffic directly to Kubernetes Pods. Ambassador is built on Envoy Proxy, an L7 proxy, so each gRPC request is load balanced between available Pods.
Something to note here: since Kubernetes needs to pull Docker images to run its containers, you'll need to push the Docker images you use to a registry that Google Container Engine has permission to access, whether that's gcr.io, Docker Hub, etc.; any will work. However, for production use, you might want to minimize traffic across boundaries as a cost-reduction effort.
Remember to run `kubectl get services` to show the current services:

```
NAME       CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
postgres   10.107.246.55   <none>        5432/TCP   5s
```

In my example above you have a running Postgres server, reachable from anywhere in the cluster at `postgres:5432`.
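For completeness, the Service in that listing could be declared with a manifest along these lines (a sketch; the Pod selector label is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres          # becomes the in-cluster DNS name "postgres"
spec:
  selector:
    app: postgres         # assumed label on the Postgres Pod(s)
  ports:
    - port: 5432          # the CLUSTER-IP port shown above
      targetPort: 5432
```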
could this be done through a ring hash?
The ring-hash approach is used both for "sticky sessions" (where a cookie is set to ensure that all requests from a client arrive at the same Pod) and for "session affinity" (which relies on the client IP or some other piece of client state). If you're using Envoy Proxy, you can define policies like this:

```yaml
filter_metadata:
  envoy.lb:
    hash_key: "YOUR HASH KEY"
```
The opportunity cost with ring hashing is that it can be harder to distribute load evenly across backend servers, since client workloads may not be equal, which drives up the burn rate on the hotter servers. In addition, computing the hash adds some latency to requests, particularly at scale. Worth noting: like all hash-based load balancers, it is only effective when protocol-level routing is used that specifies a value to hash on.
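For reference, a hash key like this pairs with a `RING_HASH` cluster and a route-level hash policy. A sketch against Envoy's v3 API, with placeholder names and values:

```yaml
# Cluster: consistent hashing across endpoints.
clusters:
  - name: backend                  # placeholder cluster name
    lb_policy: RING_HASH
    ring_hash_lb_config:
      minimum_ring_size: 1024

# Route: tell Envoy what to hash on -- here a client cookie,
# i.e. the "sticky sessions" case described above.
routes:
  - match: { prefix: "/" }
    route:
      cluster: backend
      hash_policy:
        - cookie:
            name: session_id       # placeholder cookie name
            ttl: 3600s
```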