kubernetes-the-right-way's Issues

Prometheus metrics for Kubernetes components

Note: Technically this is not related to this repository, other than the fact that I might need custom switches on components other than kube-apiserver. But I'll give it a go here anyway; maybe it's a good discussion topic. :)

I noticed that all Kubernetes components and etcd expose a /metrics path with Prometheus metrics. So I was thinking that I should start scraping these and see if I can find any pre-built dashboards for Grafana.

I just have something to ask/discuss here.

kube-apiserver should be easily accessible from my Prometheus pod, as long as I give the serviceaccount access to the /metrics path (not sure how I do that, though, will need to investigate).
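
I think something like this should grant it (the serviceaccount name and namespace are placeholders for whatever Prometheus actually uses):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus      # placeholder: the actual serviceaccount name
    namespace: monitoring # placeholder: the actual namespace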

Regarding kube-scheduler and kube-controller-manager, I can access them over HTTPS on ports 10259 and 10257 respectively. However, they serve certificates from a CA I don't recognize, and I'm not able to use my own access token. I suppose the --tls-cert-file and --tls-private-key-file switches will solve the certificate issue, but I'm not sure how to actually authenticate (to avoid 401 Unauthorized). Do you have any ideas?

When it comes to etcd, I can access that pretty easily. However, I need to use the client certificate and key stored on the masters (etcd.pem and etcd-key.pem), and I can't really access them from my Prometheus pod. I'm not sure I want to either. I guess this is the part that is actually interesting for this repository.

kube-proxy should be fairly simple. It only listens on 127.0.0.1:10249 by default, but that's changeable with a switch, so it should be fine.

Finally: I wouldn't want to hardcode all server IPs in my Prometheus configuration file. It would be great if I could use Kubernetes services for this. I see that I have some endpoints (kubectl get endpoints -n kube-system), like kube-controller-manager, but they're set to <none>. I guess I could create the services manually (once) and utilize them. But I wouldn't want Prometheus to round-robin requests to them; I would want it to perform a DNS lookup and scrape all targets returned by that lookup. Somehow... :) Ideas? For the worker nodes, it would be nice if I could utilize kubectl get nodes to find the IP addresses of the nodes and reach kube-proxy there.
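
To sketch what I mean for the discovery part (untested; assumes a manually created kube-controller-manager Service in kube-system and that kube-proxy's metrics address is changed away from 127.0.0.1; the auth details are guesses):

scrape_configs:
  # Scrape every endpoint behind the (manually created) Service individually,
  # instead of letting a round-robined request hit one member at a time.
  - job_name: kube-controller-manager
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [kube-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: kube-controller-manager
        action: keep

  # Discover worker nodes from the API server and point the scrape at the
  # kube-proxy metrics port instead of the kubelet port.
  - job_name: kube-proxy
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        replacement: '${1}:10249'
        target_label: __address__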

Just close this if you feel it's too off-topic, and I'll try elsewhere.

Cleanup: Clean kubeconfig

Would it be a good idea to clean up the ~/.kube/config when running cleanup.yml?

$ kubectl config unset clusters.{{ cluster_name }}
$ kubectl config unset contexts.{{ cluster_name }}
$ kubectl config unset users.{{ cluster_name }}

We don't have the cluster_name variable in cleanup.yml though. :)
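
Something like this in cleanup.yml is what I had in mind, assuming cluster_name gets passed to it (untested):

- name: Remove cluster from kubeconfig
  command: "kubectl config unset {{ item }}.{{ cluster_name }}"
  loop:
    - clusters
    - contexts
    - users
  delegate_to: localhost
  run_once: true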

Adding/removing etcd nodes

I just tried adding masters to and removing masters from my cluster (etcd is running on my master nodes). My plan was to replace my existing three (really bad) masters with three new ones.

I added the servers to the Ansible inventory and just ran the playbook. It didn't turn out the way I expected.

After googling around, I found out that adding new nodes to (and removing from) etcd requires you to use etcdctl. So to add nodes, you run (from one of the existing etcd nodes):

$ etcdctl --cert-file /etc/etcd/pki/etcd.pem --key-file /etc/etcd/pki/etcd-key.pem --ca-file /etc/etcd/pki/ca.pem --endpoints https://127.0.0.1:2379 member add k8s-master-10 https://<ip-address>:2380

Then you get some info back that you should use in your service file:

  --initial-cluster k8s-master-10=https://<ip-address>:2380,k8s-master-03=https://<ip-address>:2380,k8s-master-02=https://<ip-address>:2380,k8s-master-01=https://<ip-address>:2380
  --initial-cluster-state existing

Note the list of initial cluster nodes and also the initial cluster state.

After doing this manually for each node, I could run the Ansible playbook again to tweak the configuration files so they are equal on all servers. That worked.

The same goes for removing servers. I had to run:

$ etcdctl --cert-file /etc/etcd/pki/etcd.pem --key-file /etc/etcd/pki/etcd-key.pem --ca-file /etc/etcd/pki/ca.pem --endpoints https://127.0.0.1:2379 member remove <node-id>

... before actually shutting them down and excluding them from the cluster.

This requires a bit of manual work (which is just fine, if it is necessary). Do you have any suggestion on how this can be automated in a better way? Or if not, could/should it be documented within KTRW?
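
Roughly what I imagine in Ansible terms (completely untested; the group name is a guess and the guard logic is hand-waved):

- name: Register new member with the existing etcd cluster
  command: >
    etcdctl
    --cert-file /etc/etcd/pki/etcd.pem
    --key-file /etc/etcd/pki/etcd-key.pem
    --ca-file /etc/etcd/pki/ca.pem
    --endpoints https://127.0.0.1:2379
    member add {{ inventory_hostname }} https://{{ ansible_host }}:2380
  delegate_to: "{{ groups['etcd'][0] }}"  # run from a node that is already a member
  # This would also need a guard so it only runs for hosts that aren't members yet,
  # and the returned --initial-cluster/--initial-cluster-state values would have to
  # end up in the new node's service file before its etcd is started.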

Error starting the kube-proxy service

According to the kube-proxy.service.j2 we're running kube-proxy with this command:

/usr/local/bin/kube-proxy --config=/etc/kubernetes/config/kube-proxy.yml

When I do that I get the following error:

F1214 16:56:03.067953   14535 server.go:361] invalid configuration: no configuration has been provided

I googled around a bit, and found this, which shows a different command. If I run this, my kube-proxy runs:

/usr/local/bin/kube-proxy --master=https://{{ cluster_hostname }}:{{ cluster_port }} --kubeconfig=/etc/kubernetes/config/kube-proxy.kubeconfig --v=2

The above command makes kube-proxy.yml a bit useless. Am I missing something? It's worth noting that I haven't gotten my cluster fully set up yet, so I haven't verified that everything works with the above command. :)

EDIT: Probably same issue as kelseyhightower/kubernetes-the-hard-way/issues/391
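
For reference, based on the linked issue I would expect kube-proxy.yml to need at least a clientConnection section pointing at the kubeconfig, roughly like this (I haven't checked it against the actual template, and the other values are guesses):

kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: /etc/kubernetes/config/kube-proxy.kubeconfig
mode: iptables             # guess
clusterCIDR: 10.19.0.0/16  # the bridge CIDR used elsewhere in this repository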

Pod security policies

I started messing around with pod security policies to easily notice if my containers are running as root (and doing more complex rules of course). I couldn't get it working, and I ended up in this issue: kubernetes/kubernetes#53063

The argument value for --enable-admission-plugins is hardcoded in KTRW. What about having it as an Ansible setting, with a default value matching the currently hardcoded value?

Note: I haven't tested if this actually works for me yet, I just stumbled across it. What's your take on this?
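
Concretely, the Ansible setting could look something like this (the variable name is made up; the default should just be whatever is hardcoded today):

# defaults (or wherever KTRW keeps them):
kube_apiserver_enable_admission_plugins: "NodeRestriction"  # illustrative; should match the current hardcoded value

# inventory override for my use case:
kube_apiserver_enable_admission_plugins: "NodeRestriction,PodSecurityPolicy"

# ...and the kube-apiserver template would then render:
#   --enable-admission-plugins={{ kube_apiserver_enable_admission_plugins }}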

Unwanted changes when running playbook

The playbook does a good job of not restarting services that haven't changed. However, there are still some places that are considered changed, even if you run the playbook on a cluster where you expect no changes.

PLAY RECAP **********************************************************************************************************************************************************************************************************************************
k8s-master-1               : ok=50   changed=3    unreachable=0    failed=0   
k8s-master-2               : ok=50   changed=3    unreachable=0    failed=0   
k8s-node-1                 : ok=33   changed=2    unreachable=0    failed=0   
k8s-node-2                 : ok=33   changed=2    unreachable=0    failed=0   
localhost                  : ok=25   changed=0    unreachable=0    failed=0

I use the same machines for etcd and masters, so the above is a bit misleading. It's actually etcd, and not the masters, causing the unwanted changes. For etcd, the following tasks are considered changed:

TASK [etcd : Download etcd] *****************************************************************************************************************************************************************************************************************
changed: [k8s-master-1]

TASK [etcd : Unarchive etcd tarball] ********************************************************************************************************************************************************************************************************
changed: [k8s-master-1]

TASK [etcd : Remove tmp download files] *****************************************************************************************************************************************************************************************************
changed: [k8s-master-1] => (item=etcd-v3.3.12-linux-amd64)
changed: [k8s-master-1] => (item=etcd-v3.3.12-linux-amd64.tar.gz)

For the nodes, these are the ones that are considered changed:

TASK [cni : Ensure directories exist] *******************************************************************************************************************************************************************************************************
ok: [k8s-node-1] => (item=/etc/cni/net.d)
changed: [k8s-node-1] => (item=/opt/cni/bin)

TASK [cni : Download cni] *******************************************************************************************************************************************************************************************************************
changed: [k8s-node-1]

Suggested solution for etcd

Can we keep the downloaded etcd tarball? It would consume unnecessary disk space, but at least it wouldn't trigger a change. Or maybe there is a way of ignoring changes for certain steps in the playbook? In theory, only the step Move etcd binaries into place is important (which doesn't trigger any changes).
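
There is changed_when: false for forcing a task to always report "ok", but keying the tasks on files that persist between runs feels cleaner. Something like this (untested; the variable name is a guess at what the role uses):

- name: Download etcd
  get_url:
    url: "https://github.com/etcd-io/etcd/releases/download/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"
    dest: "/tmp/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"  # kept between runs, so no re-download and no change

- name: Unarchive etcd tarball
  unarchive:
    src: "/tmp/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"
    dest: /tmp
    remote_src: yes
    creates: /usr/local/bin/etcd  # skip entirely once the binary is already in place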

Suggested solution for cni

Not really sure what the problem is here. It looks like it triggers a change on the /opt/cni/bin directory... Then it triggers another change when we extract CNI into that directory.

The task attempts to set the directory permissions to 755. But after CNI is extracted, it looks like the permissions are 775. Not sure why though...

EDIT: It seems to behave the same when I do this locally using tar. Maybe we should just set 775 for /opt/cni/bin instead?

Service cluster IP range/CIDR

The service cluster IP range is hardcoded to 10.32.0.0/24. I'm guessing we should make this a parameter, with a default value instead?

Also, why /24? Doesn't this limit the number of services in the cluster to 256? I see that Kubernetes default is 10.0.0.0/24 though, so maybe it's common.

Downtime during upgrades

Even though we've enabled some kind of HA solution with serial_all=50%, I still get some downtime during upgrades. I assume this is because when containerd is up again on node 1, we almost immediately take containerd down on node 2. If deployments only have two replicas (which is a common case), it makes sense that this deployment will have downtime.

What can we do in order to properly wait between nodes? One idea would be to actually ask the API server. We could check node status with kubectl get node <node-name>. Could we also check something like number of non-ready pods on a specific node? Maybe you actually have non-ready pods intentionally?

Another idea could be a configurable delay between nodes? Maybe that's sufficient?

Does Ansible have a way of waiting for user input here? If so, that would be even better, so I can manually confirm that the node is back up at 100%.
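
Ansible does have a pause module for exactly that, so the manual confirmation part is doable. Roughly what I imagine (untested; assumes kubectl and a kubeconfig on the machine running the playbook):

# (a) Wait until the node reports Ready before moving on:
- name: Wait for node to become Ready
  command: kubectl get node {{ inventory_hostname }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 30
  delay: 10
  delegate_to: localhost
  changed_when: false

# (b) Or simply ask for manual confirmation between nodes:
- name: Confirm node is healthy before continuing
  pause:
    prompt: "Verify that {{ inventory_hostname }} is fully up, then press enter to continue"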

Permissions on config directories

The playbook sets a bunch of permissions to 755, see here.

We had an idea of using a controller host (a very simple VM), where we execute the playbook for different clusters. This way, we make sure we always have the correct ~/.ktrw directory, we can easily back it up, and we avoid the risk of re-creating certificates, etc. It also seems a bit quicker to run it like that compared to running from localhost over VPN (which we do a lot these days).

The fact that KTRW wants 755 makes it a bit difficult to work with these directories as different users. It would be nice if they could be 775 instead, so we could rely on group permissions. But maybe that's not optimal for when they actually reach the destination servers... There, we'd want 755, I guess?

Do you have any ideas or suggestions, @amimof?

kube-proxy on master nodes?

I'm trying to install the metrics server into my cluster. It requires you to add an APIService that registers itself as an API extension in the API server. However, my masters need to be able to access this using a Service cluster IP, which they currently cannot, so the APIService fails.

Reading around a bit:
kubernetes/kubernetes#66231

It looks like people install kube-proxy on the masters to achieve this, but it feels a bit weird.

Have you got any idea on how to do this best with Kubernetes The Right Way? There is also a discussion here where they recommend adding an additional API server as a pod inside the cluster, but I'm not quite sure...

EDIT: Running kube-proxy on masters feels really odd. It's not something that I'd want to do.

Additional node tools

It's probably not KTRW's responsibility, but would it be a good idea to install additional tools on the nodes, if specific parameters are set in the inventory of course?

For example, I feel it would be great to have crictl installed so it's easier to debug when issues occur. There may be other useful tools that could be optional in KTRW as well.

Again, it's not really KTRW's responsibility. Thoughts?
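
To make it concrete, something like this is what I imagine (untested; the variable name, version handling and archive layout are guesses):

- name: Install crictl
  unarchive:
    src: "https://github.com/kubernetes-sigs/cri-tools/releases/download/{{ crictl_version }}/crictl-{{ crictl_version }}-linux-amd64.tar.gz"
    dest: /usr/local/bin
    remote_src: yes
    creates: /usr/local/bin/crictl
  when: install_node_tools | default(false)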

Ansible variable for keys check

Background:

I have a working cluster with 3 masters and 5 nodes. I ask my colleague to add another node into the cluster. He clones KTRW and our repository that contains our inventory file. He adds the new node into the inventory file.

If he runs the Ansible playbook now, it will destroy the cluster, since he has no keys on his machine.

I was thinking that we could have a validation parameter. If the parameter is set to true, it could check if vital keys are missing (for example the service-account-key.pem) and if so, simply fail the playbook, explaining that the user needs the keys to continue.

What do you think? Just as a safety measure.
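
Rough sketch of what I mean (untested; the key path and variable names are just guesses at KTRW's layout):

- name: Check that vital keys exist locally
  stat:
    path: "~/.ktrw/{{ cluster_name }}/service-account-key.pem"  # guessed path
  register: sa_key
  delegate_to: localhost
  run_once: true
  when: validate_existing_keys | default(false)

- name: Abort if keys are missing
  fail:
    msg: "service-account-key.pem not found locally - running now would regenerate it and break the existing cluster."
  run_once: true
  when:
    - validate_existing_keys | default(false)
    - not sa_key.stat.exists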

/etc/resolv.conf

The kubelet is configured with the following resolv config:
resolvConf: "/run/systemd/resolve/resolv.conf"

This implies that one is using systemd-resolved, right? I just had systemd-resolved disabled due to a temporary local DNS issue and manually configured my /etc/resolv.conf. This means that the directory /run/systemd/resolve/ does not exist and the kubelet complains about that.

It feels like it would be better to use /etc/resolv.conf directly? Not sure what implications this might have, though. If I have systemd-resolved enabled, /etc/resolv.conf is a symlink to /run/systemd/resolve/stub-resolv.conf, which does not have the same contents.

We'll get back to using systemd-resolved as soon as the issue is fixed, so it's not a huge deal. It's just an idea/discussion.
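
One idea would be to make the path configurable, something like this (the variable name is just a suggestion):

# inventory, for hosts without systemd-resolved:
kubelet_resolv_conf: /etc/resolv.conf

# ...and in the kubelet config template, instead of the hardcoded path:
#   resolvConf: "{{ kubelet_resolv_conf | default('/run/systemd/resolve/resolv.conf') }}"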

Configurable options to API server

I'm planning on doing an LDAP integration for authentication to our cluster, so we can use a fine-grained permission system. To do this, I need to pass additional options to the API server:

--oidc-issuer-url=https://dex.example.com
--oidc-client-id=example-app
--oidc-ca-file=/etc/ssl/certs/openid-ca.pem
--oidc-username-claim=email
--oidc-groups-claim=groups

Any idea on how we can add a generic way of adding API-server options?

Maybe some prefixed Ansible variables? Like kube-apiserver.oidc-issuer-url=https://dex.example.com?
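
Or a free-form dict that gets expanded in the unit template. Rough sketch (untested; the variable name is made up):

kube_apiserver_extra_args:
  oidc-issuer-url: https://dex.example.com
  oidc-client-id: example-app
  oidc-ca-file: /etc/ssl/certs/openid-ca.pem
  oidc-username-claim: email
  oidc-groups-claim: groups

# ...which the kube-apiserver template could expand with something like:
#   {% for key, value in kube_apiserver_extra_args.items() %}
#   --{{ key }}={{ value }} \
#   {% endfor %}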

Pod networking

I know pod networking isn't something that should be handled by this repository, but I have a minor question. I've only just gotten into pod networking; I've previously worked with a single worker node, meaning that the bridge supplied by this repository has worked wonders.

  • This repository supplies a bridge using the subnet 10.19.0.0/16.
  • This repository installs kube-controller-manager with the argument --cluster-cidr=10.19.0.0/16, which I believe is in charge of handing out pod IP ranges to the nodes.

I've now installed Flannel that should be used as an overlay network for proper pod networking. The configuration for it uses the subnet 10.244.0.0/16 (from the default in the flannel repository).

My question: How am I supposed to tell my cluster to use the Flannel subnet instead? One idea would be to have an Ansible parameter for this, but I'm kind of clueless here. Is that the right way to go? Or am I missing something? :)
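
Concretely, the parameter idea could be something like this in the inventory (the variable name is made up), consumed wherever 10.19.0.0/16 is hardcoded today, i.e. the bridge CNI config and --cluster-cidr on kube-controller-manager:

cluster_cidr: 10.244.0.0/16  # matches the Flannel default net-conf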

kubectl port-forward

kelseyhightower/kubernetes-the-hard-way/issues/78

To be able to run kubectl port-forward, the package socat needs to be installed on the workers. Should it be installed with the playbook from this repository, or should those "additional" things be installed separately?

Port forwarding is mostly done during development I would suppose, so maybe you wouldn't want extra packages in a production environment?

All nodes but the first one get "unauthorized"

I have three nodes:

  • k8s-worker-01
  • k8s-worker-02
  • k8s-worker-03

When starting the kubelet on the first one, it all works fine. But on the other two nodes, I get the following errors:

Jan 07 11:41:52 k8s-worker-02 kubelet[7251]: E0107 11:41:52.642553    7251 kubelet_node_status.go:94] Unable to register node "k8s-worker-02" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope
Jan 07 11:41:52 k8s-worker-02 kubelet[7251]: E0107 11:41:52.537626    7251 kubelet.go:2266] node "k8s-worker-02" not found
Jan 07 11:41:52 k8s-worker-02 kubelet[7251]: E0107 11:41:52.637743    7251 kubelet.go:2266] node "k8s-worker-02" not found

... along with a bunch of others, all complaining about the unauthorized user "system:anonymous".

Is this something you've seen before? I've tried manually using the certificates on the nodes, and that works:

$ curl \
    --cacert /etc/kubernetes/pki/ca.pem \
    --cert /etc/kubernetes/pki/cert.pem \
    --key /etc/kubernetes/pki/key.pem \
    https://k8s-master-01.viskanint.local:6443/api/v1/nodes

Ansible deprecation warnings in tests

The test playbook tests/main.yml outputs deprecation warnings which need to be addressed. The warnings are related to the docker_container module. An example of such a warning is the following:

TASK [Bring up etcd] ***********************************************************
[DEPRECATION WARNING]: Please note that docker_container handles networks 
slightly different than docker CLI. If you specify networks, the default 
network will still be attached as the first network. (You can specify 
purge_networks to remove all networks not explicitly listed.) This behavior 
will change in Ansible 2.12. You can change the behavior now by setting the new
 `networks_cli_compatible` option to `yes`, and remove this warning by setting 
it to `no`. This feature will be removed in version 2.12. Deprecation warnings 
can be disabled by setting deprecation_warnings=False in ansible.cfg.

Also

[DEPRECATION WARNING]: Param 'ipam_options' is deprecated. See the module docs 
for more information. This feature will be removed in version 2.12. Deprecation
 warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.

See this build for more details
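
For the first warning, opting in to the new behaviour should be enough. Sketch only; the container details below are placeholders, not the actual task from tests/main.yml:

- name: Bring up etcd
  docker_container:
    name: etcd
    image: "quay.io/coreos/etcd:v3.3.12"  # placeholder image/tag
    networks_cli_compatible: yes
    networks:
      - name: ktrw-test  # placeholder network name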

Custom parameters to all components

We previously added support for configuring kube-apiserver parameters dynamically. This is nice, and we should probably have this for all components?

However, how do we do it for the kubelet, where the configuration comes from a file instead of command line arguments? Maybe in the exact same way?

For example, one parameter that I might want to change is maxPods (or --max-pods).
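
Rough idea for the kubelet (untested; the variable names are made up): keep the same kind of dict in the inventory, but merge it into the rendered config file instead of appending flags.

kubelet_extra_config:
  maxPods: 200

# ...and in the role, merge it into whatever dict the kubelet config template is rendered from:
#   {{ kubelet_base_config | combine(kubelet_extra_config | default({})) | to_nice_yaml }}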

Cleanup: Unmounting volumes for secrets and configmaps

I saw that you recently added an unmounting step in cleanup.yml, which is nice. I had that issue a few days ago.

However, today I tried to clean up a cluster that had a few deployments with config maps and secrets, and they appear to be mounted separately, so I had to do something like this manually before continuing:

anton@node01:~$ sudo umount /var/lib/kubelet/pods/1d8b27bf-02ba-11e9-8e3f-080027281615/volumes/kubernetes.io~secret/nginx-ingress-token-qpwsq
anton@node01:~$ sudo umount /var/lib/kubelet/pods/6d3fb723-03a6-11e9-b379-080027281615/volumes/kubernetes.io~secret/default-token-9wjxq
anton@node01:~$ sudo umount /var/lib/kubelet/pods/eb91be51-02bd-11e9-8e3f-080027281615/volumes/kubernetes.io~secret/default-token-tg277
anton@node01:~$ sudo umount /var/lib/kubelet/pods/0d18545c-03ae-11e9-b379-080027281615/volumes/kubernetes.io~secret/default-token-9wjxq
anton@node01:~$ sudo umount /var/lib/kubelet/pods/0d18545c-03ae-11e9-b379-080027281615/volume-subpaths/config/kibana/0
anton@node01:~$ sudo umount /var/lib/kubelet/pods/6d3fb723-03a6-11e9-b379-080027281615/volume-subpaths/config/elasticsearch/1

Do you think it would be possible to find these in cleanup.yml and unmount them, in a good way?

EDIT: Or is the preferred way to make sure that the cluster is empty before running the cleanup?
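
Maybe something like this (untested) could find and unmount them generically in cleanup.yml:

- name: Find leftover kubelet volume mounts
  shell: "awk '$2 ~ \"^/var/lib/kubelet/\" {print $2}' /proc/mounts"
  register: kubelet_mounts
  changed_when: false

- name: Unmount kubelet volumes
  command: "umount {{ item }}"
  loop: "{{ kubelet_mounts.stdout_lines }}"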

Configurable expire dates on certificates

Currently all certificates expire 5 years after creation.

Do we want to introduce a parameter for this value? Also, maybe a separate parameter specifically for the certificate authority certificates for kube-apiserver and etcd, perhaps with a somewhat longer default?

Maybe also add another parameter for forcing re-creation of the certificate authorities, regenerate_ca_certificates=True (in addition to regenerate_certificates).

When the time comes to renew certificates (the certificate authorities specifically), it would be nice to have a zero-downtime routine. I'll see if I can test this (as soon as I have time). If it only means downtime for state updates (such as Ingress controller config, node updates and similar), I think it's OK, as long as traffic is still routed properly to the containers.
