gardener / ops-toolbelt Goto Github PK

Useful tools and operations guide for gardener landscapes

License: Apache License 2.0

Shell 63.43% Python 31.73% Vim Script 4.85%

ops-toolbelt's Introduction

Ops Toolbelt

What is the ops-toolbelt

The ops-toolbelt aims to be a standard operator container image with pre-installed useful tools for troubleshooting issues on gardener landscapes. The ops-toolbelt images can be used by the Gardener Dashboard web consoles functionality to log into the garden, seed, or shoot clusters.

The pods created with this image can be both general pods and node-bound pods (behaving as if being on the node directly) Starting a pod with the ops-toolbel timage requires a running Kubelet and healthy control plane, a working VPN connection, and sufficient capacity on the node.

Usage

Running a container locally

The simplest way of using the ops-toolbelt is to just run the following command:

$ docker run -it europe-docker.pkg.dev/sap-se-gcp-k8s-delivery/releases-public/eu_gcr_io/gardener-project/gardener/ops-toolbelt:latest

  __ _  __ _ _ __ __| | ___ _ __   ___ _ __   ___| |__   ___| | |
 / _` |/ _` | '__/ _` |/ _ \ '_ \ / _ \ '__| / __| '_ \ / _ \ | |
| (_| | (_| | | | (_| |  __/ | | |  __/ |    \__ \ | | |  __/ | |
 \__, |\__,_|_|  \__,_|\___|_| |_|\___|_|    |___/_| |_|\___|_|_|
 |___/

Run ghelp to get information about installed tools and packages

You can then add personal configurations to your ops-toolbelt container for tools like kubectl, gcloud and so on ...

Running the ops-toolblet as privileged pod on a node

Get the names of the nodes on your cluster and then run hacks/ops-pod with the node you want to start the pod on:

$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE    VERSION
node1   Ready    <none>   56d    v1.11.10-gke.5
node2   Ready    <none>   72d    v1.11.10-gke.5
node3   Ready    <none>   150d   v1.11.10-gke.5

$ ./hacks/ops-pod node1
node name provided ...
Deploying ops pod on node1

pod/ops-pod created
Waiting for pod to be running...
Waiting for pod to be running...
This container comes with the following preinstalled tools:
curl tree silversearcher-ag htop less vim tmux bash-completion dnsutils netcat-openbsd iproute2 dstat ngrep tcpdump python-minimal jq yaml2json kubectl pip cat mdv

The sourced dotfiles are located under /root/dotfiles.
Additionally you can add your own personal git settings in /root/dotfiles/.config/git/config_personal

The following variables have been exported:
DOTFILES_USER=root DOTFILES_HOME=/root/dotfiles

root at node1 in /
$

Use ./hacks/ops-pod --help to check what other options are available

Building ops-toolbelt images

Dockerfiles for the images are generated from files in the dockerfile-config directory.

To build all pre-configured images run:

$ .ci/build

Known issues

Currently there's a known issue when using /bin/sh. We implemented a color scheme and also added some helper function to display in /bin/bash terminal which doesn't work in /bin/sh. As workaround when you want to use some script which by default needs to utilize /bin/sh please use /bin/bash instead if possible: (take chroot for example)

$ chroot /some_dir /bin/bash

ops-toolbelt's People

Contributors

Stargazers

Watchers

ops-toolbelt's Issues

Hardcoded `etcd_host` is used as precalculated argument for `etcdctl --endpoints`

What happened:
Right now, etcd_host is set to etcd-main-local and used as precalculated argument for etcdctl --endpoints.

ops-toolbelt/install_on_demand/.etcdctl

Line 24 in d4e5d33

local etcd_host="etcd-main-local"

But, etcd's endpoints is not always necessary be etcd-main-local as we also have etcd-events-* pods which supports etcd-events-local as endpoints. Moreover, we also have virtual-garden-etcd-events/main-* which support virtual-garden-etcd-main/events-local. So, it can't be always https://etcd-main-local:2379.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

Scripts to ease work with terraform

What would you like to be added:
We should add scripts that can ease work with terraform states.

Why is this needed:
Sometimes there are errors caused by terraform job pods:

A resource has already been created by a previous execution but the terraform state was not properly saved
There are resource leaks due to multiple executions of terraform jobs

To fix these the operator has to either manually delete the created resources on the cloud provider or manually update the terraform state configmaps kept in the Shoot's namespace on the Seed.

To ease this work we can introduce a couple of scripts which take care of automatically setting up the terraform work directory and properly updating the state's configmap.

Wrong cert path is used for `etcdctl`

What happened:
I tried to use etcdctl but faced error.

I have no name!@etcd-main-0:/$ etcdctl member list -w table
CA certificate not found in the expected path (/proc/14/root/var/etcd/ssl/client/client/ca/bundle.crt).
You can use --cacert=/path/to/file
bash: /dev/bull: Permission denied
etcd host https://etcd-main-local:2379 not reachable within 5 seconds.

Then I used etcdctl with flags, it works but still I saw the error msg which doesn't look good.

I have no name!@etcd-main-0:/$ etcdctl --cert=/proc/${ETCD_PID}/root/var/etcd/ssl/client/client/tls.crt --key=/proc/${ETCD_PID}/root/var/etcd/ssl/client/client/tls.key --cacert=/proc/${ETCD_PID}/root/var/etcd/ssl/client/ca/bundle.crt --endpoints=https://etcd-main-local:2379 member list -w table
CA certificate not found in the expected path (/proc/14/root/var/etcd/ssl/client/client/ca/bundle.crt).
You can use --cacert=/path/to/file
bash: /dev/bull: Permission denied
etcd host https://etcd-main-local:2379 not reachable within 5 seconds.
You can use --endpoints=https://${ETCD_HOST}:${ETCD_PORT}
+------------------+---------+-------------+---------------------------------------------------------------------+----------------------------------------------------------------------+------------+
|        ID        | STATUS  |    NAME     |                             PEER ADDRS                              |                             CLIENT ADDRS                             | IS LEARNER |
+------------------+---------+-------------+---------------------------------------------------------------------+----------------------------------------------------------------------+------------+
| c9a510845e8844f0 | started | etcd-main-0 | http://etcd-main-0.etcd-main-peer.shoot--hc-ci--a0bcb8-hdl.svc:2380 | https://etcd-main-0.etcd-main-peer.shoot--hc-ci--a0bcb8-hdl.svc:2379 |      false |
+------------------+---------+-------------+---------------------------------------------------------------------+----------------------------------------------------------------------+------------+

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

Switching default namespace of ops-pod to 'kube-system'

What would you like to be added:
Switching default namespace of ops-pod to 'kube-system'

Why is this needed:

Currently it gets the namespace from the context which is defined via the kubectl config and this points to default if I use gardenctl and target a cluster.

The problem with putting the ops-pod in the default namespace is that due to the podSecurity configuration it is not allowed to start privileged containers there.

Any objections for changing the default to kube-system and not reading it from the config so we can actually start the ops-pod without having to manually provide the namespace?

alias k does not auto-complete

What happened:
alias k for kubectl does not auto-complete, see also https://kubernetes.io/docs/tasks/tools/included/optional-kubectl-configs-bash-linux/#bash

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

[Bug] Container Image ARM64 Incompatibility

What happened:
The currently built Ops-Toolbelt image (eu.gcr.io/gardener-project/gardener/ops-toolbelt-gardenctl:0.13.0) is not compatible with ARM64 architectures like Apple Silicon.

What you expected to happen:
The image to work the same way it works on AMD64 based architectures.

How to reproduce it (as minimally and precisely as possible):

Execute the following line on an ARM64 device:

> docker run -it --rm eu.gcr.io/gardener-project/gardener/ops-toolbelt-gardenctl:latest bash

Create a file:

$ cat <<EOF > foo
> bar
> EOF
runtime: failed to create new OS thread (have 2 already; errno=22)
fatal error: newosproc

Anything else we need to know:
Docker already warns users about an incompatibility when the container is started:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

Environment:
Apple M1 Max

Rebuild Image to get latest kubectl version

Please rebuild the image so that the latest kubectl version is included. Thanks

Provide Trimmed-Down Images per Infrastructure and Kubernetes Version

What would you like to be added:
Today the images:

are overly large (2x or 4x as large as the original first images; e.g. 200MB vs. 100MB for the plain toolbelt and 436MB vs. 116MB for the toolbelt with the aws CLI image)
contain all infrastructures CLIs even though an operator needs usually only one (for a particular cluster)
contains a 1.17 kubectl, which can be only used for clusters within the skew policy, so is mostly the wrong one for the currently supported Kubernetes versions in Gardener

Why is this needed:
The current images are too large, not specific enough (which adds to the first problem and results in longer startup times and 2-4x higher network usage and costs), and features an outdated kubectl that may fail/will be useless for clusters of a particular version.

What do you suggest:
Let's:

First understand why the images became so large and what we can do to cut them down and get them smaller again (fewer components, cleanup, different base OS if it's not too disruptive)
Build a matrix of images (infrastructure x kubectl version), so that a user and the dashboard can pick the right one (gardener/dashboard#842) and of course also the big one (for local usage)

Bash prompt errors when mounting /hostroot

What happened:

I am using the ops-toolbelt image with the host volume "/" mounted under "/hostroot".

When trying to change the root directory of my current shell process by executing chroot /hostroot - it works, but it seems the shell prompt is misconfigured.

$ chroot /hostroot
/bin/sh: 1: get_kube_ctx: not found
/bin/sh: 1: get_git_status: not found
\[\033]0;ip-10-250-8-164.eu-north-1.compute.internal\007\]\n\[\033[1m\]\[\]\u\[\033[0m\] at \[\033[1m\]\[\]ip-10-250-8-164.eu-north-1.compute.internal\[\033[0m\] in \[\033[1m\]\[\]\w\n\[\033[0m\]\[\033[1m\]$ \[\033[0m\]

subsequent commands:

$ docker ps
.....
80c-910a-0eabd6637adb_0
f82f0c731108        gcr.io/google_containers/pause-amd64:3.1                                           "/pause"                 2 hours ago         Up 2 hours                              k8s_POD_node-exporter-qmqw9_kube-system_d0b0e8f7-68e3-4227-aab0-72b798439bdd_0
/bin/sh: 1: get_kube_ctx: not found
/bin/sh: 1: get_git_status: not found

What you expected to happen:
No errors when executing command when mounting the host root directory.

How to reproduce it (as minimally and precisely as possible):

Create a pod that mounts the host root directory:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: gardener.privileged
  labels:
    app: root
#    networking.gardener.cloud/to-dns: allowed
  name: rootpod-test
#  namespace: kube-system
spec:
  containers:
    - command:
        - sleep
        - "10000000"
      image: eu.gcr.io/gardener-project/gardener/ops-toolbelt
      imagePullPolicy: Always
      name: root-container
      resources: {}
      securityContext:
        privileged: true
        runAsUser: 0
      stdin: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /hostroot
          name: root-volume
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  #nodeName: ip-10-250-4-151.eu-north-1.compute.internal
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - hostPath:
        path: /
        type: ""
      name: root-volume

exec into the pod
execute chroot /hostroot

Anything else we need to know:

There are no errors when using the busybox image. That is e.g. what gardenctl shell uses.

This is where the promp is configured

Environment:

After introduction of "matrix images", launch script not adapted

What would you like to be added:
Can we please update the launch script https://github.com/gardener/ops-toolbelt/blob/master/hacks/ops-pod or how are you picking the right image for the job, that respects the cloud provider and Kubernetes version?

E.g. (see #76 for why there is this ugly Kubernetes version defaulting as we don't have ops-toolbelt images for all Gardener-supported Kubernetes versions):

# determine cloud provider and kubernetes version to select the right (specific) image
provider=$(kubectl -n kube-system get cm shoot-info -o jsonpath={.data.provider} | sed -r "s/^azure$/az/"  | sed -r "s/^alicloud$/aliyun/" | sed -r -n "s/^(aliyun|aws|az|gcp|openstack)$/-\1/p")
version=$(kubectl -n kube-system get cm shoot-info -o jsonpath={.data.kubernetesVersion} | sed -r -n "s/^1\.(17|18|19|20|21|22)\.[0-9]+$/-k1.\1/p")

...and then:

image: eu.gcr.io/gardener-project/gardener/ops-toolbelt${provider}${version}

Why is this needed:

To pick the right image for the job.

cc @petersutter
cc @neo-liang-sap @plkokanov @jfortin-sap
cc @dguendisch @hendrikKahl @BeckerMax

wireguard and iptables

What would you like to be added:

Please add the wg (wireguard) command and a simple script command:

#!/bin/bash
table=
iptables-save | while IFS= read -r line; do
  if [ "${line#\**}" != "$line" ]; then
    table="$line"
  else
    echo "$table: $line"
  fi
done | grep "$@"

It can be used to simplify the analysis of iptables entries.

Thanks

Why is this needed:

Add ping utility

What would you like to be added:
ping is currently not available on the image and should be part of it

Why is this needed:

Registry messy, images built with arbitrary versions, no latest tag

What would you like to be added:
I have difficulties picking the right image, because:

The "generic" image without suffix has a latest tag, but the others don't - why not? How would I pick the latest version?
The new images with suffixes are available in versions that I cannot really guess (internal component version which is unrelated to anything, e.g. unrelated to the Kubernetes version, which could have been an alternative approach generating less clutter), e.g.:
- https://console.cloud.google.com/gcr/images/gardener-project/eu/gardener/ops-toolbelt-aws-k1.17?authuser=1&project=gardener-project -> 0.12.0
- https://console.cloud.google.com/gcr/images/gardener-project/eu/gardener/ops-toolbelt-aws-k1.22?authuser=1&project=gardener-project -> 0.16.0
Basically/it seems, only stuff from 1.19 seems to be updated and is available in 0.16.0, the rest is stale.
The registry is in a messy state, tags that are not updated anymore are all kept/were never cleaned up (e.g. stuff with cloud provider but without Kubernetes version, gardenctl images, ...).

What do you recommend:

Build images from 1.15 till 1.24, because these are the version we support (if 1.15 and 1.16 is actual effort, then leave them out, but we need 1.23 and 1.24)
Set the latest tag for all built images
Make sure, only the intended images are used by ops, referred to by ops-guide, launched by Dashboard (as web terminal image)
Cleanup registry and remove all images with tags that are no longer updated

Why is this needed:

To pick the right image for the job.
To avoid people using outdated images.

cc @petersutter
cc @neo-liang-sap @plkokanov @jfortin-sap
cc @dguendisch @hendrikKahl @BeckerMax

Set explicit CLI versions for reproducible builds

What would you like to be added:

Vedran:
I noticed some CLIs have versions, some not. What's our strategy? Explicit versions are better, because you then have reproducible builds, but then someone needs to update the CLIs regularly. Maybe you/SRE could do so? Maybe only for the core components and the Gardener extensions teams for the CLIs?

Why is this needed:
to provide reproducible builds

install_k9s is broken for k9s >=0.27.0

What happened:
The included k9s installer has stopped working, as k9s have renamed the relevant tarball from k9s_Linux_x86_64.tar.gz to k9s_Linux_amd64.tar.gz as of v0.27.0

$ install_k9s
installing latest version v0.27.3
curl: no URL specified!
curl: try 'curl --help' or 'curl --manual' for more information

What you expected to happen:
A working installation of k9s

How to reproduce it (as minimally and precisely as possible):
Start ops-toolbelt container and run install_k9s

Add crictl

What would you like to be added:

Consider adding the debugging tool crictl for CRI deamons.
See the available functionality here.

Why is this needed:

Soon Gardener will have support for different ContainerRuntimes (gvisor & KataContainers) see here.
The crictl tool could be a good addition to docker cli and ctr cli (when containerd is available).

interact with a CRI runtime directly
crictl offers a more kubernetes-friendly view of containers
- knows about pods and namespace (docker does not),
- see Kubernetes pulled images. crtctl images lists the images available on a kubernetes node vs docker images list the images on the host know to docker. See
talks over the CRI interface - so "understands" all CRI compatible runtimes.

Note:

crictl per default connects to the dockershim on unix:///var/run/dockershim.sock. When running on a node that has CRI enabled (labels on node) then crictl should connect to the containerd socket
unix:///run/containerd/containerd.sock.

Only support one kubectl version and remove all images containing infra-clis

What would you like to be added:
I'm wondering if we currently need this matrix of infra-clis x kubectl versions as introduced with #50

My proposal:

we only have one ops-toolbelt image with the "latest" kubectl version. Once have the means in the dashboard to support different images depending on the shoot infra type and shoot k8s version for the webterminals, we can build the matrix again.
do not provide images including infra CLIs. As we do not ship gardenctl-v2, it does not give much benefit to have the infra-cli included.

Why is this needed:
Having this matrix of infra-clis x kubectl versions is currently not needed and just causes unnecessary resources and effort to assess findings of the vulnerability scan

adding kubetail to aggregate logs from multiple pods

What would you like to be added:
Adding logs tools kubetail to aggregate logs from multiple pod to easily observation error in troubleshooting
more information kubetail

Why is this needed:
As multiple pods and the name standard base on k8s feature. we can tail/logs log easily in daily for SRE troubleshooting gardener cluster

check multiple kube-apiserver pod

kubetail kube-apiserver -c kube-apiserver
Will tail 4 logs...
kube-apiserver-58b7d8d686-5tfmb
kube-apiserver-58b7d8d686-df6jl
kube-apiserver-58b7d8d686-hc86m
kube-apiserver-58b7d8d686-jgtnw
[kube-apiserver-58b7d8d686-5tfmb] I0403 02:39:39.558310       1 trace.go:81] Trace[1361818141]: "List etcd3: key=/pods, resourceVersion=, limit: 0, continue: " (started: 2020-04-03 02:39:39.018387292 +0000 UTC m=+595932.863717368) (total time: 539.892009ms):
[kube-apiserver-58b7d8d686-5tfmb] Trace[1361818141]: [539.892009ms] [539.892009ms] END

without full name standard check aaaabbbb.infra.tf-apply-dxxxs
kubetail infra.tf
an so on.

another reason choose kubetail is more light as shell script only 12k compare with other log tools eg ktail

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Update dependency containerd/nerdctl to v1.7.6

Detected dependencies

github-actions

.github/workflows/reuse-tool-lint.yaml

actions/checkout v4.1.4@0ad4b8fadaa221de15dcec353f45205ec38ea70b

fsfe/reuse-action v3.0.0@a46482ca367aef4454a87620aa37c2be4b2f8106

regex

.ci/build

gardenlinux/gardenlinux 1443.3

dockerfile-configs/common-components.yaml

bronze1man/yaml2json v1.3

containerd/nerdctl 1.7.5

kubernetes/kubernetes v1.28.9

gardener / ops-toolbelt Goto Github PK

ops-toolbelt's Introduction

Ops Toolbelt

What is the ops-toolbelt

Usage

Running a container locally

Running the ops-toolblet as privileged pod on a node

Building ops-toolbelt images

Known issues

ops-toolbelt's People

Contributors

Stargazers

Watchers

Forkers

ops-toolbelt's Issues

Open

Detected dependencies

Recommend Projects

Recommend Topics

Recommend Org