senthilrch / kube-fledged Goto Github PK

A kubernetes operator for creating and managing a cache of container images directly on the cluster worker nodes, so application pods start almost instantly

License: Apache License 2.0

Makefile 4.71% Dockerfile 0.09% Go 80.33% Shell 8.56% Python 4.42% Starlark 0.44% Mustache 1.45%

kubernetes k8s operator image cache pull delete registry repository add-on

kube-fledged's People

Contributors

Stargazers

Watchers

Forkers

dreamer-89-zz kannanvr kalpanathanneeru boxrick gogulakrishnans bhuvanessr way2go-nrcs dinesh2807 karthikeyareddy saikrishna9912 rajesh-gawande karlmutch jagadeeshvenkatesh vijaysharmasir duckty gsm-asad adheipsingh sundeep-paulraj ramkumarocp25 junglehouse amarajit ksatchit dvasilen avitaknik leofang94 clix-dev-llc marvel-works santashjena95 reddeppas zhipengzuo noteable-io paperspace suyambuganesh82 itsmurugappan kubernetescheatsheet leandromribeiro bobower brainfair reddymh count0ru diogouchoas jharris- shaneutt nawabshakib redaqafssaoui cybernetics aland-zhang futuretea fangtangjing cnchef woshizilong lokeshdevops7023 franklinharry liyongxian terrych0u rosskirkpat maopeng0714 anubhav06 aimanfatima kushal-gopal democracytoday niladrih sandeep-v1404 utkarshshah0 code4y kaushik200020 robwittman czomo proth1 rstudio nightbloom udaytadka anshuman35 impradeep22 shreyashkale9 sengarrohit pradipgorai09 ajayryeruva shobanamanickaraj js-9 cxz pczerkas amruthsoft9 shehanster firjest roee-landesman luke-lombardi yeahdongcn davidcollom fmichlik michael-bowen-sc marianod92 81887821 gaocegege n0gu devsisters wangmin362 konturn jinx-heniux iq-scm

kube-fledged's Issues

chmod: cannot access '../code-generator/generate-groups.sh': No such file or directory

bhuvaneswari_senthilraja@cloudshell:~/src/github.com/senthilrch/kube-fledged (crack-meridian-275714)$ hack/update-codegen.sh
#!/usr/bin/env bash
go: finding k8s.io/code-generator v0.17.2
go: downloading k8s.io/code-generator v0.17.2
go: extracting k8s.io/code-generator v0.17.2
go: downloading k8s.io/gengo v0.0.0-20190822140433-26a664648505
go: downloading gonum.org/v1/gonum v0.0.0-20190331200053-3d26580ed485
go: downloading github.com/spf13/pflag v1.0.5
go: extracting k8s.io/gengo v0.0.0-20190822140433-26a664648505
go: extracting github.com/spf13/pflag v1.0.5
go: downloading golang.org/x/tools v0.0.0-20190920225731-5eefd052ad72
go: downloading github.com/emicklei/go-restful v2.9.5+incompatible
go: extracting github.com/emicklei/go-restful v2.9.5+incompatible
go: downloading github.com/go-openapi/spec v0.19.3
go: extracting github.com/go-openapi/spec v0.19.3
go: downloading github.com/go-openapi/jsonreference v0.19.3
go: downloading github.com/go-openapi/jsonpointer v0.19.3
go: downloading github.com/go-openapi/swag v0.19.5
go: extracting github.com/go-openapi/jsonreference v0.19.3
go: extracting github.com/go-openapi/jsonpointer v0.19.3
go: downloading github.com/PuerkitoBio/purell v1.1.1
go: extracting github.com/go-openapi/swag v0.19.5
go: downloading github.com/mailru/easyjson v0.7.0
go: extracting github.com/PuerkitoBio/purell v1.1.1
go: downloading github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
go: extracting github.com/mailru/easyjson v0.7.0
go: extracting github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
go: extracting golang.org/x/tools v0.0.0-20190920225731-5eefd052ad72
go: extracting gonum.org/v1/gonum v0.0.0-20190331200053-3d26580ed485
chmod: cannot access '../code-generator/generate-groups.sh': No such file or directory
bhuvaneswari_senthilraja@cloudshell:~/src/github.com/senthilrch/kube-fledged (crack-meridian-275714)$ vi hack/update-codegen.sh                               
bhuvaneswari_senthilraja@cloudshell:~/src/github.com/senthilrch/kube-fledged (crack-meridian-275714)$ ecgo $GOPATH
-bash: ecgo: command not found
bhuvaneswari_senthilraja@cloudshell:~/src/github.com/senthilrch/kube-fledged (crack-meridian-275714)$ echo $GOPATH
/home/bhuvaneswari_senthilraja/gopath:/google/gopath

garbage collection worker to clean-up hanging jobs

if the imagepuller pod is unable to pull the image, it keeps retrying. The job that created the imagepuller pod has an activeDeadlineSeconds = 3600. so users can query the imagepuller pod's events to understand the reason for failure.

we need a garbage collection worker in image manager to cleanup jobs that have reached activeDeadlineSeconds.

Error updating ImageCache(imagecache1) status to 'Aborted'

I0501 12:23:22.302552 1 controller.go:125] Setting up event handlers
I0501 12:23:22.305612 1 fledged.go:74] Starting pre-flight checks
I0501 12:23:22.325033 1 controller.go:161] No dangling or stuck jobs found...
E0501 12:23:22.337601 1 controller.go:201] Error updating ImageCache(imagecache1) status to 'Aborted': ImageCache.fledged.k8s.io "imagecache1" is invalid: status.startTime: Invalid value: "null": status.startTime in body must be of type string: "null"
F0501 12:23:22.338115 1 fledged.go:76] Error running pre-flight checks: ImageCache.fledged.k8s.io "imagecache1" is invalid: status.startTime: Invalid value: "null": status.startTime in body must be of type string: "null"
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc000287500, 0xc00026a000, 0xe3, 0x135)
/home/bhuvaneswari_senthilraja/gopath/pkg/mod/github.com/golang/[email protected]/glog.go:769 +0xb8
github.com/golang/glog.(*loggingT).output(0x1f21280, 0xc000000003, 0xc00009c230, 0x1e8e496, 0xa, 0x4c, 0x0)
/home/bhuvaneswari_senthilraja/gopath/pkg/mod/github.com/golang/[email protected]/glog.go:720 +0x372
github.com/golang/glog.(*loggingT).printf(0x1f21280, 0x3, 0x140c225, 0x23, 0xc0000b1f00, 0x1, 0x1)
/home/bhuvaneswari_senthilraja/gopath/pkg/mod/github.com/golang/[email protected]/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(...)
/home/bhuvaneswari_senthilraja/gopath/pkg/mod/github.com/golang/[email protected]/glog.go:1148
main.main()
/home/bhuvaneswari_senthilraja/src/github.com/senthilrch/kube-fledged/cmd/fledged.go:76 +0x616

generate events for every successful or failed sync and refresh action

fledged controller needs to raise events for every successful or failed sync and refresh action, with appropriate human-readable message. these events could then be picked by other event processing tools for visibility and security.

Enhancement: modify helm operator crd group to "charts.helm.kubefledged.io" & update openapiv3schema

The CRD of the helm operator uses the group "charts.helm.k8s.io". Since *.k8s.io is a reserved group, propose to modify this to "charts.helm.kubefledged.io". Also update openapiv3schema of the CRD.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: kubefledgeds.charts.helm.kubefledged.io
spec:
  group: charts.helm.kubefledged.io
  names:
    kind: KubeFledged
    listKind: KubeFledgedList
    plural: kubefledgeds
    singular: kubefledged
  scope: Namespaced
  versions:
  - name: v1alpha2
    schema:
      openAPIV3Schema:
        description: KubeFledged is the Schema for the kubefledgeds API
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: Spec defines the desired state of KubeFledged
            type: object
            x-kubernetes-preserve-unknown-fields: true
          status:
            description: Status defines the observed state of KubeFledged
            type: object
            x-kubernetes-preserve-unknown-fields: true
        type: object
    served: true
    storage: true
    subresources:
      status: {}

Images with latest tag not published for arm & arm64

Images with latest tag not published for linux/arm and linux/arm64 platforms

This issue is applicable for all four images:

kubefledged-controller
kubefledged-webhook-server
kubefledged-cri-client
kubefledged-operator

Validate content of image cache spec

In existing implementation, fledged controller doesn't performs any validation on the contents of the image cache spec. When erroneous content is present in the image cache spec, this leads to unnecessary and undesirable actions from fledged and image manager, leading to handing jobs and hanging entries in imageworkstatus map of image manager. Following validations need to be implemented:-

validate if nodeSelector resolves to atleast one node
image list is not empty
image name is not repeated within the spec
if validation fails, such image cache resources should not be refreshed during refresh cycle

Deployment via Helm chart fail

I know that 2 issues already exists about installing kube-fledged via YAML and operator fail but this one is about installing it with Helm, and it's not exactly the same problem.

When I run:

$ curl -fsSL https://raw.githubusercontent.com/senthilrch/kube-fledged/master/deploy/webhook-create-signed-cert.sh | bash -s -- --namespace ${KUBEFLEDGED_NAMESPACE}

I got this error message: The CertificateSigningRequest "kubefledged-webhook-server.kube-fledged" is invalid: spec.signerName: Invalid value: "kubernetes.io/legacy-unknown": the legacy signerName is not allowed via this API version.

My k8s version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:32:58Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

PS: FYI, if it can be useful:

$ kubectl api-versions | grep certificates.k8s.io | head -1
certificates.k8s.io/v1

EDIT:

If I try to edit the script and remove the signerName key, I got this quite interesting error message:
error: error validating "STDIN": error validating data: ValidationError(CertificateSigningRequest.spec): missing required field "signerName" in io.k8s.api.certificates.v1.CertificateSigningRequestSpec; if you choose to ignore these errors, turn validation off with --validate=false. Not sure if the issue is coming from kube-fledged or k8s itself 🤔

Incompatibility with carvel-kapp: object has been modified

We are leveraging "vmware-tanzu/carvel-kapp" as our deployment tool.

When deploying the refresh annotation kubefledged.k8s.io/refresh-imagecache the action is failing in the kube-fledged-controller with following error:

kubefledged-controller-69dbc794d8-b2hg4 E0707 09:40:58.095098       1 controller.go:491] Error updating imagecache status to Processing: Operation cannot be fulfilled on imagecaches.kubefledged.io "my-image": the object has been modified; please apply your changes to the latest version and try again                                                                                                                  

kubefledged-controller-69dbc794d8-b2hg4 E0707 09:40:58.095124       1 controller.go:366] error syncing imagecache: Operation cannot be fulfilled on imagecaches.kubefledged.io "my-image": the object has been modified; please apply your changes to the latest version and try again

We highly believe the issue is related to a race condition, described more in detail here:

kubernetes/kubernetes#67761

Workaround used within kapp:
https://github.com/vmware-tanzu/carvel-kapp/blob/d5b8c43b5678039c961b8369b2231860af9b5f90/pkg/kapp/resources/resources.go#L436-L438

Add gofmt, golint, govet checks to CI

Add gofmt, golint, govet checks to CI.

Add necessary verify scripts to "hack" directory.
Add these new verify scripts to .travis.yml.

Enhancement: Use environment variable for specifying cri-client-image

Instead of specifying cri-client-image as command argument in controller container, specify it using environment variable.

Feature Request - Have a yaml only (no bash) way to deploy

At the moment, the installation requires running make targets which further calls some bash scripts to create certs.

While this works, the approach however is not GitOps friendly. It would be great if there is a way to install only using the yamls, be it standalone yaml(s) or a helm chart. Doing so, it will enable people to directly use yamls or helm charts in their GitOps repos (example Flux or ArgoCD)

Test kube-fledged in GKE and fix bugs

kube-fledged is not yet tested in any cloud-native kubernetes services. It needs to be tested in Google Kubernetes Engine (GKE) using Google Container Registry (GCR). If any bugs are identified, these need to be fixed.

Deployment via YAML and operator both fail

I have Kubernetes 1.19.8
Deployment of either operator or YAML fails (due to different reasons):

YAML:

> make deploy-using-yaml
kubectl apply -f deploy/kubefledged-namespace.yaml
namespace/kube-fledged created
bash deploy/webhook-create-signed-cert.sh
creating certs in tmpdir /var/folders/mh/365k5jv95l52mb37tlgjgj500000gq/T/tmp.jnYff36T
Generating RSA private key, 2048 bit long modulus
..................................+++
...................+++
e is 65537 (0x10001)
Warning: certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
certificatesigningrequest.certificates.k8s.io/kubefledged-webhook-server.kube-fledged created
NAME                                      AGE   SIGNERNAME                     REQUESTOR   CONDITION
kubefledged-webhook-server.kube-fledged   1s    kubernetes.io/legacy-unknown   u-55kdm     Pending
certificatesigningrequest.certificates.k8s.io/kubefledged-webhook-server.kube-fledged approved
ERROR: After approving csr kubefledged-webhook-server.kube-fledged, the signed certificate did not appear on the resource. Giving up after 10 attempts.
make: *** [deploy-using-yaml] Error 1

Operator:

make deploy-using-operator
# Create the namespaces for operator and kubefledged
kubectl create namespace kubefledged-operator
namespace/kubefledged-operator created
kubectl create namespace kube-fledged
namespace/kube-fledged created
# Deploy the operator to a separate namespace
sed -i 's|{{OPERATOR_NAMESPACE}}|kubefledged-operator|g' deploy/kubefledged-operator/deploy/service_account.yaml
sed: 1: "deploy/kubefledged-oper ...": extra characters at the end of d command
make: *** [deploy-using-operator] Error 1

Even when I manually fix the sed issue by using hardcoded namespace, certificate generation fails in operator install.

Default image pull policy for images with no tags

At present, the default image pull policy for images with ":latest" tag is PullAlways, whereas for images with no tag, it is IfNotPresent.

However since images with no tag actually implies the ":latest" tag, PullAlways should be the default policy for such images. So following code changes required:-

File: pkg/images/image_helpers.go
Function: newImagePullJob

		if latestimage := strings.Contains(image, ":latest") || !strings.Contains(image, ":"); latestimage {
			pullPolicy = corev1.PullAlways
		}

Imagecache status reason should have the imagecache action

Imagecache status reason should have the imagecache action.

Actual status field:-

    "status": {
        "completionTime": "2018-12-09T14:57:25Z",
        "message": "All requested images pulled succesfully to respective nodes",
        "reason": "ImagesPulledSuccessfully",
        "startTime": "2018-12-09T14:57:09Z",
        "status": "Succeeded"
    }

Expected status field:-

    "status": {
        "completionTime": "2018-12-09T14:57:25Z",
        "message": "All requested images pulled succesfully to respective nodes",
        "reason": "ImageCacheRefresh",
        "startTime": "2018-12-09T14:57:09Z",
        "status": "Succeeded"
    }

Controller fails to pick up ImageCache resource on fresh install if webhook server is down

When installing kube-fledged directly from the helm chart, both the imagecache resource, webhook and the controller servers are deployed at same time and if the controller comes up before the webhook server, then the controller runs into the following error:

I0414 00:43:56.288736       1 controller.go:122] Setting up event handlers
I0414 00:43:56.289021       1 main.go:75] Starting pre-flight checks
I0414 00:43:56.487536       1 controller.go:158] No dangling or stuck jobs found...
I0414 00:43:56.494524       1 controller.go:208] No dangling or stuck imagecaches found...
I0414 00:43:56.494553       1 main.go:79] Pre-flight checks completed
I0414 00:43:56.494582       1 controller.go:223] Starting fledged controller
I0414 00:43:56.494588       1 controller.go:226] Waiting for informer caches to sync
I0414 00:43:56.681989       1 controller.go:231] Starting image cache worker
I0414 00:43:56.682138       1 controller.go:238] Starting cache refresh worker
I0414 00:43:56.682155       1 controller.go:242] Started workers
I0414 00:43:56.682162       1 image_manager.go:340] Starting image manager
I0414 00:43:56.682170       1 image_manager.go:343] Waiting for informer caches to sync
I0414 00:43:56.682179       1 controller.go:429] Starting to sync image cache kernels-default(create)
E0414 00:43:56.700798       1 controller.go:491] Error updating imagecache status to Processing: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
E0414 00:43:56.701170       1 controller.go:366] error syncing imagecache: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
E0414 00:43:56.701206       1 controller.go:377] error syncing imagecache: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
I0414 00:43:57.082511       1 image_manager.go:348] Started image manager

Then the controller does not pick up the imagecache resource that was deployed. The current workaround is to manually delete the controller pod and then it's able to pick up the imagecache resource.

Getting x509: certificate signed by unknown authority Error

I deployed kube-fledged on my k8s cluster created using kubespray and tried to cache nginx image onto all of my worker nodes and I am getting the below error
kubectl create -f deploy/kubefledged-imagecache.yaml
Error from server (InternalError): error when creating "deploy/kubefledged-imagecache.yaml": Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post https://kubefledged-webhook-server.kube-fledged.svc:3443/validate-image-cache?timeout=1s: x509: certificate signed by unknown authority

This is what I see in the logs of webhook-server container
[centos@infra-vm kube-fledged]$ kubectl logs kubefledged-webhook-server-678d8f44d5-fk7kk -n kube-fledged
I1223 15:11:59.201779 1 main.go:282] Wehook server listening on :443
2020/12/23 16:56:49 http: TLS handshake error from 10.233.117.0:38036: remote error: tls: bad certificate
2020/12/23 17:04:49 http: TLS handshake error from 10.233.117.0:41302: remote error: tls: bad certificate
2020/12/23 17:04:59 http: TLS handshake error from 10.233.117.0:41368: remote error: tls: bad certificate
[centos@infra-vm kube-fledged]$

Can anyone please help me out or provide me some pointers on how to resolve this error

Allow kube-fledged to be deployed in any namespace

IMO hardcoding the namespace which the controller lives in is an antipattern:

https://github.com/senthilrch/kube-fledged/search?q=fledgedNameSpace&unscoped_q=fledgedNameSpace

Can this be changed so that kube-fledged can be deployed to any namespace?

E0224 13:17:38.339997       1 image_manager.go:442] Error creating job in node ip-10-0-30-82: namespaces "kube-fledged" not found
E0224 13:17:38.340905       1 image_manager.go:424] error pulling image 'golang:1.13' to node 'ip-10-0-30-82': namespaces "kube-fledged" not found

New flag "--image-pull-policy" documentation

New flag "--image-pull-policy" to be added in README.md in section "Configuration flags" section.

Configuration Flags

--image-pull-deadline-duration: Maximum duration allowed for pulling an image. After this duration, image pull is considered to have failed. default "5m"

--image-cache-refresh-frequency: The image cache is refreshed periodically to ensure the cache is up to date. Setting this flag to "0s" will disable refresh. default "15m"

--docker-client-image: The image name of the docker client. the docker client is used when deleting images during purging the cache".

--image-pull-policy: Image pull policy for pulling images into the cache. Possible values are 'IfNotPresent' and 'Always'. Default value is 'IfNotPresent'. Default value for Images with ':latest' tag is 'Always'

--stderrthreshold: Log level. set the value of this flag to INFO

Description of new flag "--image-pull-policy" in cmd/fledged.go needs to be updated as follows:-

flag.StringVar(&imagePullPolicy, "image-pull-policy", "", " Image pull policy for pulling images into the cache. Possible values are 'IfNotPresent' and 'Always'. Default value is 'IfNotPresent'. Default value for Images with ':latest' tag is 'Always'")

Enhancement request: avoid hardcode busybox init container

Avoid hardcode image

kube-fledged/pkg/images/image_helpers.go

Line 87 in 163cd80

Image: "busybox:1.29.2",

Proposal:
Set image for busybox via Env or args like --cri-client-image

Increase CI code coverage to 5%

No go tests exists today. Goal of this issue is to increase the code coverage from 0% to 5%. Packages to be considered for achieving the target %

kube-fledged/cmd
kube-fledged/cmd/app
kube-fledged/pkg/images

Increase unit test coverage of package "kube-fledged/cmd/app" to > 75%

Current UT code coverage is 15%.
Need to increase it to > 75%

Enhancement: Use "kubefledged.io" as domain for annotation keys

Kube-fledged uses following two annotation keys for cache refresh and purge:-

kubefledged.k8s.io/refresh-cache
kubefledged.k8s.io/purge-cache

Since *.k8s.io is a reserved/protected domain, the annotation keys should be changed to:-

kubefledged.io/refresh-cache
kubefledged.io/purge-cache

Feature: Allow ImageCache CR to be created in a different namespace; not restricted to the one to which kube-fledged was deployed

Hello,
I try to create ImageCache not in kube-fledged NS.
After this pulling jobs don`t create and I see many errors in kubefledged-controller like:

E0729 09:25:24.612184 1 image_manager.go:446] error pulling image 'myimage' to node 'mynode': the namespace of the provided object does not match the namespace sent on the request E0729 09:25:24.810254 1 image_manager.go:464] Error creating job in node &Node....

If I deploy ImageCache to kube-fledged NS all fine

Envorinment:
bare-metal k8s version: 1.20.7

I am seeing Image pull failed and the message is "PodInitializing" under failure for every node.

I am trying to cache images that are 10+ GB on few nodes on my cluster. Any idea what this PodInitializing failure means?
[centos@infra-vm ~]$ kubectl get imagecaches imagecache1 -n kube-fledged -o json
{
"apiVersion": "kubefledged.k8s.io/v1alpha1",
"kind": "ImageCache",
"metadata": {
"creationTimestamp": "2021-02-11T21:36:39Z",
"generation": 3,
"name": "imagecache1",
"namespace": "kube-fledged",
"resourceVersion": "1209980",
"selfLink": "/apis/kubefledged.k8s.io/v1alpha1/namespaces/kube-fledged/imagecaches/imagecache1",
"uid": "01386121-a9dd-4180-bd73-0163414a7515"
},
"spec": {
"cacheSpec": [
{
"images": [
"nginx",
"tomcat:10.0.0"
]
},
{
"images": [
"mec-docker.envrmnt.com/dreamscape/validator/mec-prometheus-server:latest"
],
"nodeSelector": {
"slice-size": "ABOVE-6GB"
}
}
],
"imagePullSecrets": [
{
"name": "jfrog"
}
]
},
"status": {
"completionTime": "2021-02-11T21:41:48Z",
"failures": {
"mec-docker.envrmnt.com/dreamscape/validator/mec-prometheus-server:latest": [
{
"message": "",
"node": "hp-gpu-node3",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node2",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node0",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node6",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node7",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node4",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node5",
"reason": "PodInitializing"
},
{
"message": "",
"node": "hp-gpu-node1",
"reason": "PodInitializing"
}
]
},
"message": "Image pull failed for some images. Please see "failures" section",
"reason": "ImageCacheCreate",
"startTime": "2021-02-11T21:36:39Z",
"status": "Failed"
}
}

Deployment using the Operator 0.8.0

make deploy-using-operator returns the following on EKS 1.20 flavor:

The CertificateSigningRequest "kubefledged-webhook-server.kube-fledged" is invalid: spec.signerName: Invalid value: "kubernetes.io/legacy-unknown": the legacy signerName is not allowed via this API version

Question: before setup kube-fledged use case

@senthilrch @sundeepk1 ,

kube-fledged looks very good for pulling large images from diffrent zone for EKS autoscaling during peak hours. i have below question before going to setup kube-fledged in our eks clusters on our use-case.

1)We have image around 4 GB size OF image and pulling from different region from artifcatory taking more then 10 Mins and it is added autoscaling delays on cluster,If we use kube-fledged it will solve our issue ?

2)If step 1 is yes then we are running only one container (4 GB )per node so how it will cache IMAGE on new launched server to avoid pulling image from artifcatory ? or it will pull from other running worker nodes ? could you please confirm ?

Thanks !

Coverage error during Travis build restart

When a Travic CI build is restarted, the step where the unit test coverage results are exported to Coveralls fails:-

E0606 10:58:12.643083   25517 image_manager.go:393] Unexpected type in workqueue: struct {}{}
812
--- PASS: TestProcessNextWorkItem (0.00s)
813
PASS
814
coverage: 83.3% of statements
815
ok  	github.com/[secure]/kube-fledged/pkg/images	0.098s	coverage: 83.3% of statements
816
?   	github.com/[secure]/kube-fledged/pkg/signals	[no test files]
817
?   	github.com/[secure]/kube-fledged/pkg/webhook	[no test files]
818
The command "make test" exited with 0.
819
2.14s
$ $(go env GOPATH | awk 'BEGIN{FS=":"} {print $1}')/bin/goveralls -coverprofile=coverage.out -service=travis-ci
820
Bad response status from coveralls: 422
821
{"message":"service_job_id (693837965) must be unique for Travis Jobs not supplying a Coveralls Repo Token","error":true}
822
The command "$(go env GOPATH | awk 'BEGIN{FS=":"} {print $1}')/bin/goveralls -coverprofile=coverage.out -service=travis-ci" exited with 1.
823
824
825
Done. Your build exited with 1.

Imagecache spec not restored back when update validation fails

Scenario:-

User has already created an imagecache. images are cached successfully. now user make an update to imagecache. by mistake user has performed an update that is not allowed. in this case, validation of the updated spec fails in the fledged controller and status/reason/message are updated appropriately. however the spec is not restored back to the working spec.

suggest to add an annotation to imagecache spec when updating the validation failure status. when controller sees imagecache spec change, together with the added annotation, do not perform sync action; instead remove the annotation. this will prevent infinite update sync and also restore the spec back to original.

Installing via Helm Charts

Hi,

Would it be possible to support installing via Helm Charts?

Thanks!

different image cache status and reason during refresh

when the fledged controller is pulling images during a refresh cycle, currently the status, reason and message fields of image cache are updated to same value as during image cache creation.

we need different image cache status, reason and message during refresh so users know the exact current action being performed.

Increase unit test coverage of package "kube-fledged/pkg/images" to > 75%

implement cross build

The current Makefile generates only amd64 binaries of fledged controller. We will need to implement cross build capability in the Makefile to produce binaries for other architectures.

Refer below url for e.g.:-
https://github.com/docker/cli/blob/master/scripts/build/cross

Image deletion during image cache update

When user performs image cache update action, if user removes images from cache spec, then fledged controller needs to detect this and subsequently delete the images from respective nodes.

Enhancement: Install validatingwebhookconfiguration as helm pre-install hook

When deploying kube-fledged via helm chart or helm operator, sometimes it is possible the validatingwebhookconfiguration is not persisted before the webhook server pod is started. This can cause improper initialization of the webhook server performed in the init container.

Install the kube-fledged validatingwebhookconfiguration as helm pre-install hook, so it gets persisted before the webhook server deployment is created.

Enhancement: Use self-signed certs for webhook server

Kube-fledged installation currently relies on Kubernetes CertificateSigningRequest to generate the server certificate for the webhook server. This creates problems:-

The stable v1 version of CertificateSigningRequest doesn't supports signer kubernetes/legacy-unknown
The supported in-built signers are not usable to generate a server certificate. There's support only for client certificates
On clusters that have only v1 enabled, it is impossible to generate server certificate.
Bash script is used to generate certificate, so installing via GitOps (e.g. ArgoCD) is not fully supported.
Ref issue #75 , issue #76

The solution is to generate self-signed certificate for the webhook server and add the CA bundle to the validatingwebhookconfiguration, using init container or as init method within the webhook server.

Enhancement: When deploying via helm operator, wait for operator ready status

When deploying kube-fledged via helm operator, wait on the operator deployment until it is ready, before applying the KubeFledged CR. This will ensure the operator doesn't misses the CR.

improve the speed of processing "imagepullqueue" in image manager

The image manager routine in fledged controller is a single go-routing that processes image pull requests pushed into "imagepullqueue". While this has advantages of lower memory footprint and requires no lock mechanisms in accessing the imagepullstatus map, there are situations where an imagepull might take a long time (either the size of the image is huge, or network is unreliable and slow, or proper imagepullsecrets were not specified etc.). This causes the image manager routine to get blocked inside the "updateImageCacheStatus()" method till the "imagePullDeadlineDuration" is reached. See the code below:-

func (m *ImageManager) updateImageCacheStatus() error {
	//time.Sleep(m.imagePullDeadlineDuration)
	wait.Poll(time.Second, m.imagePullDeadlineDuration,
		func() (done bool, err error) {
			done, err = true, nil
			for _, ipres := range m.imagepullstatus {
				if ipres.Status == ImagePullResultStatusJobCreated {
					done, err = false, nil
					return
				}
			}
			return
		})

In order to improve the speed of processing in image manager we need to introduce appropriate concurrent processing so that failing image pulls do not unnecessarily block the processing of other requests in the imagepullqueue.

We need to investigate further to arrive at the optimal concurrency design for improving the speed + one this is not complex. One option is to have "updateImageCacheStatus()" executed inside a go routine using lock/unlock for protecting the imagepullstatus map. Another option is to increase the number of image manager workers. Other options if any should be explored.

Update README: Include steps to create local image for fledge client

Default value for the docker client is:
FLEDGED_DOCKER_CLIENT_IMAGE_NAME=senthilrch/fledged-docker-client:latest

We need to create/update this variable as per developer

"startTime" and "completionTime" in ImageCache Status

Description:

We will need to have "startTime" and "completionTime" in ImageCache Status. This is the general practice in Kubernetes API objects that have a definite start and end time. When fledged controller syncs an image cache, it first updates the Status.Status field to "Processing". During this update, the "startTime" field should be updated to the current UTC time. Similarly during sync action "ImageCacheStatusUpdate" when status of processing is updated to Status field, the "completionTime" field should be updated.

Changes in types.go:

// ImageCacheStatus is the status for a ImageCache resource
type ImageCacheStatus struct {
	Status         ImageCacheActionStatus         `json:"status"`
	Reason         string                         `json:"reason"`
	Message        string                         `json:"message"`
	Failures       map[string][]NodeReasonMessage `json:"failures,omitempty"`
	StartTime      *metav1.Time                   `json:"startTime,omitempty"`
	CompletionTime *metav1.Time                   `json:"completionTime,omitempty"`
}

pulling images into tainted nodes by adding toleration

if there are tainted nodes in the cluster, then the image puller pod will never get scheduled into that node. consequently image will never get pulled into such nodes. since these are jobs for pulling image and not actual application pods, fledged should add matching toleration when firing image puller pods into such nodes.

when the sync handler in the controller writes the image pull request into imagepullqueue, it should also add toleration if applicable. image manager should than add this toleration in the pod manifest when firing the image puller job.

type images.ImagePullRequest need to be modified.
code in controller.go and image_manager.go need to be modified.

Deploying using both YAML and operator approaches fail

Hi,

When I run make deploy-using-operator against a KinD cluster, the Kubernetes resources seems to be applied successfully, but at runtime the kubefledged-operator outputs several messages of this type:

{"level":"error","ts":1623058742.2220135,"logger":"controller-runtime.manager.controller.kubefledged-controller","msg":"Reconciler error","name":"kubefledged","namespace":"ku │
│ befledged-operator","error":"failed to install release: clusterroles.rbac.authorization.k8s.io \"kubefledged\" is forbidden: user \"system:serviceaccount:kubefledged-operator:kubefledged-operator\" (groups │
│ =[\"system:serviceaccounts\" \"system:serviceaccounts:kubefledged-operator\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"kubefledged.io\"], Resourc │
│ es:[\"imagecaches\"], Verbs:[\"get\" \"list\" \"watch\" \"update\"]}\n{APIGroups:[\"kubefledged.io\"], Resources:[\"imagecaches/status\"], Verbs:[\"patch\"]}","stacktrace":"github.com/go-logr/zapr.(*zapLog │
│ ger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0. │
│ 8.3/pkg/internal/controller/controller.go:302\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/con │
│ troller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\ │
│ nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s │
│ .io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.Ji │
│ tterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 │
│ \nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99"}

If I try using make deploy-using-yaml, the following error is thrown in the kubefledge-controller container at runtime:

│ I0607 09:47:53.858698       1 controller.go:122] Setting up event handlers                                                                                                                                    │
│ I0607 09:47:53.865008       1 main.go:75] Starting pre-flight checks                                                                                                                                          │
│ I0607 09:47:53.884879       1 controller.go:158] No dangling or stuck jobs found...                                                                                                                           │
│ E0607 09:47:53.886271       1 controller.go:180] Error listing imagecaches: imagecaches.kubefledged.io is forbidden: User "system:serviceaccount:kube-fledged:kubefledged-controller" cannot list resource "i │
│ magecaches" in API group "kubefledged.io" in the namespace "kube-fledged"                                                                                                                                     │
│ F0607 09:47:53.887140       1 main.go:77] Error running pre-flight checks: imagecaches.kubefledged.io is forbidden: User "system:serviceaccount:kube-fledged:kubefledged-controller" cannot list resource "im │
│ agecaches" in API group "kubefledged.io" in the namespace "kube-fledged"                                                                                                                                      │
│ goroutine 1 [running]:                                                                                                                                                                                        │
│ github.com/golang/glog.stacks(0xc0004d0900, 0xc0004b8180, 0x116, 0x16b)                                                                                                                                       │
│     /go/pkg/mod/github.com/golang/[email protected]/glog.go:769 +0xb9                                                                                                                   │
│ github.com/golang/glog.(*loggingT).output(0x2340ea0, 0xc000000003, 0xc000498cb0, 0x1cd0433, 0x7, 0x4d, 0x0)                                                                                                   │
│     /go/pkg/mod/github.com/golang/[email protected]/glog.go:720 +0x3b3                                                                                                                  │
│ github.com/golang/glog.(*loggingT).printf(0x2340ea0, 0x3, 0x17e04ca, 0x23, 0xc000519f28, 0x1, 0x1)                                                                                                            │
│     /go/pkg/mod/github.com/golang/[email protected]/glog.go:655 +0x153                                                                                                                  │
│ github.com/golang/glog.Fatalf(...)                                                                                                                                                                            │
│     /go/pkg/mod/github.com/golang/[email protected]/glog.go:1148                                                                                                                        │
│ main.main()                                                                                                                                                                                                   │
│     /go/src/github.com/senthilrch/kube-fledged/cmd/controller/main.go:77 +0x61f

Add new flag --image-pull-policy

Current implementation uses imagePullPolicy = IfNotPresent when pulling images. This is hardcoded in pkg/images/helpers.go. We will need to allow the user to specify the image pull policy via a new flag --image-pull-policy. Possible values are Always and IfNotPresent. Default value: IfNotPresent. Default to Always if :latest tag is specified

CRD Resource API version deprecated

When installing the Helm chart, I get the following notice:

apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition

swagger api documentation

We will need to implement scripts to generate swagger spec and html api reference documents based on swagger spec. Refer following urls as e.g.:-

Implementation of gosec checks

Go security checks need to add.

hanging image cache status and jobs when fledged is oom-killed

When fledged container runs on a node with limited memory, it might get killed by Linux OOM-Killer. when this happens kubelet will recreate a new fledged container automatically. However the image cache status would get stuck in processing and also leaves image puller jobs and pods hanging.

When fledged starts-up, it needs to clean-up any active/completed imagepuller jobs/pods + also reset the stuck status of image caches to "Failed" with appropriate reason and message. This way the image caches marked as Failed will be picked up by the refreshworker and get processed.

Nov 01 22:40:10 worker1 kernel: Out of memory: Kill process 1202 (fledged) score 1002 or sacrifice child
Nov 02 02:52:52 worker1 kernel: docker-containe invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
Nov 02 02:52:57 worker1 kernel:  [<ffffffffbe79ac94>] oom_kill_process+0x254/0x3d0
Nov 02 02:52:57 worker1 kernel:  [<ffffffffbe79b4d6>] out_of_memory+0x4b6/0x4f0
Nov 02 02:52:57 worker1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov 02 02:52:57 worker1 kernel: Out of memory: Kill process 9014 (fledged) score 1001 or sacrifice child
Nov 02 02:52:57 worker1 kernel: exe invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
Nov 02 02:52:57 worker1 kernel:  [<ffffffffbe79ac94>] oom_kill_process+0x254/0x3d0
Nov 02 02:52:57 worker1 kernel:  [<ffffffffbe79b4d6>] out_of_memory+0x4b6/0x4f0
Nov 02 02:52:58 worker1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov 02 02:52:58 worker1 kernel: Out of memory: Kill process 9046 (fledged) score 1001 or sacrifice child
Nov 02 03:08:15 worker1 kernel: flanneld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-998
Nov 02 03:08:19 worker1 kernel:  [<ffffffffbe79ac94>] oom_kill_process+0x254/0x3d0
Nov 02 03:08:19 worker1 kernel:  [<ffffffffbe79b4d6>] out_of_memory+0x4b6/0x4f0
Nov 02 03:08:25 worker1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov 02 03:08:25 worker1 kernel: Out of memory: Kill process 15319 (fledged) score 1001 or sacrifice child
Nov 02 03:08:25 worker1 kernel: exe invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
Nov 02 03:08:25 worker1 kernel:  [<ffffffffbe79ac94>] oom_kill_process+0x254/0x3d0
Nov 02 03:08:25 worker1 kernel:  [<ffffffffbe79b4d6>] out_of_memory+0x4b6/0x4f0
Nov 02 03:08:25 worker1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

Multi-arch image for cri-client

The multi-arch image for kubefledged-cri-client needs to be updated for arm and arm64 architectures. Currently the linux/amd64 binaries of docker client and crictl are being packaged into the arm and arm64 images as well; this is incorrect.

Dockerfile needs to be updated using ${TARGET_PLATFORM} variable:-
https://github.com/senthilrch/kube-fledged/blob/master/build/Dockerfile.cri_client

implement deletion of images in image cache

when image cache resource is deleted (or) when image cache resource is modified by removing images, images already cached before in the worker nodes have to be deleted. fledged controller needs to take action to fire new jobs that would delete the image. this functionality needs to be implemented.