
arktos's Introduction

Arktos


What is Arktos

Arktos is an open source project designed for large-scale cloud compute infrastructure. It evolved from the Kubernetes codebase with core design changes.

Arktos aims to be an open source solution to key challenges of large-scale clouds, including system scalability, resource efficiency, multitenancy, edge computing, and native support for fast-growing modern workloads such as containers and serverless functions.

Architecture

(Arktos architecture diagram)

Key Features

Large Scalability

Arktos achieves a scalable architecture by partitioning and scaling out components, including the API server, storage, controllers and the data plane. The eventual goal of Arktos is to support 300K nodes with a single regional control plane.

Multitenancy

Arktos implements a hard multitenancy model to meet the strict isolation requirements highly desired in public cloud environments. It is based on the virtual cluster idea, and all isolation is transparent to tenants: each tenant appears to have a dedicated cluster.
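As illustrated in the kubectl examples further down this page, a tenant is created as a top-level API object. A minimal manifest (the tenant name is purely illustrative) looks like this and can be applied with the kubectl tool shipped with Arktos, e.g. ./cluster/kubectl.sh apply -f tenant.yaml:

# Minimal tenant manifest; "org-a" is an illustrative name.
apiVersion: v1
kind: Tenant
metadata:
  name: org-a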

Unified Container/VM Orchestration

In addition to container orchestration, Arktos implements built-in support for VMs. In Arktos a pod can contain either containers or a VM, and both are scheduled the same way in the same resource pool. This enables cloud providers to use a single converged stack to manage all cloud hosts.
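As a rough sketch of what a VM pod could look like (the field names under virtualMachine are assumptions for illustration, not the authoritative Arktos API; see the user guide folder for the exact spec):

# Illustrative sketch only; the "virtualMachine" field and its sub-fields are
# assumptions standing in for the real Arktos VM pod spec.
apiVersion: v1
kind: Pod
metadata:
  name: vm-example
spec:
  virtualMachine:        # assumed field name; a VM takes the place of the containers list
    name: vm-example
    image: cirros        # illustrative VM image reference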

More Features

There are more features under development, such as cloud-edge scheduling, in-place vertical scaling, etc. Check out our releases for more information.

Build Arktos

Arktos requires a few dependencies to build and run, and a bash script is provided to install them.

After the prerequisites are installed, you just need to clone the repo and run "make":

Note: you need a working Go 1.13 environment. Go 1.14 and above are not supported yet.
mkdir -p $GOPATH/src/github.com
cd $GOPATH/src/github.com
git clone https://github.com/CentaurusInfra/arktos
cd arktos
make

Run Arktos

The easiest way to run Arktos is to bring up a single-node cluster in your local development box:

cd $GOPATH/src/github.com/arktos
./hack/arktos-up.sh

The above commands bring up Arktos with the default network solution, bridge. With release 1.0, an advanced network solution, Mizar, was introduced into Arktos. The integration with Mizar allows one tenant's pods/services to be truly isolated from another tenant's pods/services. To start an Arktos cluster with Mizar, make sure you are using Ubuntu 18.04+ and run the following commands:

cd $GOPATH/src/github.com/arktos
CNIPLUGIN=mizar ./hack/arktos-up.sh

After the Arktos cluster is up, you can access the cluster with the kubectl tool shipped with Arktos, just as you would with a Kubernetes cluster. For example:

cd $GOPATH/src/github.com/arktos
./cluster/kubectl.sh get nodes

To set up a multi-node cluster, please refer to the Arktos Cluster Setup Guide, which also gives detailed instructions for enabling partitions in the cluster.

To set up an Arktos scale-out cluster in Google Cloud, please refer to Setting up Arktos scale out environment in Google Cloud.

To set up an Arktos scale-out cluster in a local dev environment, follow the instructions in Setting up local dev environment for scale out.

Community Meetings

Pacific Time: Tuesday, 6:00PM PT (Weekly). Please check our discussion page here for the latest meeting information.

Resources: Meeting Link | Meeting Agenda | Meeting Summary

Documents and Support

The design document folder contains the detailed designs of implemented features, as well as thoughts on planned features.

The user guide folder provides information about these features from a user's perspective.

To report a problem, please create an issue in the project repo.

To ask a question, please start a new discussion here.

arktos's People

Contributors

chenqianfzh, clu2xlu, dingyin, hong-chang, jeffwan, jshaofuturewei, pdgetrf, penggu, q131172019, quinton-hoole, sindica, sonyafenge, vinaykul, zmn223, zongbaoliu


arktos's Issues

[Node Agent] Virtlet VM is created with wrong QoS class

What happened: Create a Guaranteed QoS Virtlet VM by specifying the below YAML.

What you expected to happen: The QoS class of the VM should show as Guaranteed. It shows up as BestEffort instead.
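For context, Kubernetes assigns the Guaranteed QoS class only when CPU and memory requests equal the limits; a resource section along these lines (the values are examples only, not the original attachment) would be expected to yield Guaranteed rather than BestEffort:

# Illustrative resource section expected to produce the Guaranteed QoS class
# (requests equal to limits for both cpu and memory); values are examples only.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi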

How to reproduce it (as minimally and precisely as possible):
See setup instructions in issue #115

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Network plugin and version (if this is a network-related bug):
Others:

(This issue is ported from the old repo with original author vinaykul.)

extract tenant from kubectl config file

What would you like to be added:

kubectl config set-context goose-context-2 --cluster=local-up-cluster --namespace=goose --user=fatgoose --tenant=goosecompany

--tenant flag is not taking effect

Why is this needed:

To provide tenant info for RBAC validation.
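For illustration, the requested flag would presumably persist the tenant alongside the other context fields in the kubeconfig file; the "tenant" key below is hypothetical and only shows where that information could live:

# Hypothetical kubeconfig context entry; the "tenant" key illustrates where the
# value supplied via --tenant could be stored (not necessarily the final field name).
contexts:
- context:
    cluster: local-up-cluster
    namespace: goose
    user: fatgoose
    tenant: goosecompany   # hypothetical key written by --tenant
  name: goose-context-2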

Design alternative to improve alkaid/alcor integration

Design alternative to improve alkaid/alcor integration:
This tracks the comment on #143 regarding an alternative design consideration:

Can we let the Alcor control plane move the device into the desired CNI netns, instead of asking the CNI plug-in to handle it? The current approach could potentially cause a race condition when the customer and the CNI plug-in try to operate on the same device at the same time.

An alternative is to notify the Alcor control plane of the final netns so that Alcor can move the device into it on behalf of customers.

Why is this needed:
The comment suggests a design change that needs to be discussed and decided.

(This issue is ported from the old repo with original author h-w-chen.)

Review and fix E2E tests

  • Review the E2E test suite and decide which cases should be kept for Arktos, and what needs to be added.
  • Fix the test failures. (We will file breakdown items if this requires a lot of effort.)

List/Watch by range needs unit test/integration test

What would you like to be added:
Unit test coverage for list/watch by range
Integration tests for list/watch by range - to make sure all resources support hashkey range filtering and ownerReference range filtering

Why is this needed:
To facilitate later controller migration to the workload controller manager. (Deployment has a hashkey filtering issue during migration.)

copyright check won't run on Mac

What happened:

% make update
cat: /proc/1/sched: No such file or directory
WARN: Skipping copyright check for in-container build as git repo is not available
Running in silent mode, run with SILENT=false if you want to see script logs.
Running in short-circuit mode; run with FORCE_ALL=true to force all scripts to run.
Running update-generated-protobuf
Running update-codegen
cat: /proc/1/sched: No such file or directory
cat: /proc/1/sched: No such file or directory
Running update-generated-runtime
Running update-generated-device-plugin
Running update-generated-api-compatibility-data
Running update-generated-docs
cat: /proc/1/sched: No such file or directory
cat: /proc/1/sched: No such file or directory
Running update-generated-swagger-docs
Running update-openapi-spec
cat: /proc/1/sched: No such file or directory
cat: /proc/1/sched: No such file or directory
Running update-bazel
Running update-gofmt
cat: /proc/1/sched: No such file or directory
WARN: Skipping copyright check for in-container build as git repo is not available
Update scripts completed successfully

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Arktos version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Automatically set up test environments for data partitioned environment

What would you like to be added:
Currently, multiple API servers with data partition support are in place. We need to make it easy to set up a test environment for the following scenario:

  1. Single ETCD cluster
  2. Two or more api servers with data partition
  3. One kube controller manager
  4. One scheduler
  5. Two workload controller managers

Ideal output:
Option 1: a script with input arguments to set up one or two api servers
Option 2: incorporate into kubeadm
Option 3: Open for discussion

Why is this needed:
Facilitate the dev and test experience. Possibly will be used in performance tests (kube-up).

Remove the definition of system tenant

What would you like to be added:

Refactor the current code to remove the definition of the system tenant.

Why is this needed:

Background: We optimized the multi-tenancy resource model in the latest design review. Based on the optimization, there is no need for a dedicated built-in tenant "system". The resources previously under the system tenant will be put under the "system world", which doesn't have any tenant name.

[Node Agent: Runtime] Investigate the long-term solution of runtime service design

Design points and current (short-term) implementation:

  • Extend CRI vs. a new interface
    This mainly impacts the internal implementation and interface passing among kubelet components.
    Short term: extend the current CRI and add new methods to the existing interface definition in kubelet.

  • Construct our own runtime (with some borrowed code) vs. extending Virtlet (i.e. tied to CRI)
    Short term: extend Virtlet by adding NEW methods to its CRI implementation.

  • Image management in the runtime component vs. built into kubelet
    Short term: keep the current design with the image management interface in the runtime.

  • Network integration
    Short term: CNI model + CRI. Keep the current pod-networking model in k8s, and extend the CRI sandbox methods to support NIC updates.

  • libvirt per pod (which Virtlet and KubeVirt have) vs. libvirt per node
    Short term: libvirt per node.

  • Pod-container model vs. VM-direct model
    Short term: keep the current pod-container model.

  • Pod state control and management changes to coordinate with actions (before, during and after) to maintain the expected pod state behavior in kubelet and the scheduler (maybe controllers as well)
    Short term: new logic is already being worked on.

(This issue is ported from the old repo with original author yb01)

Creating a new tenant causes events related to the tenant to be populated to all api servers

What happened:
When multiple API servers work with a single ETCD cluster and have data partitioned, if tenant data is not partitioned (meaning the /registry/tenants entry is not in the API server data partition list), all events belonging to the newly created tenant will be populated to all API servers.

What you expected to happen:
Only events belonging to an API server's partition should be populated there.

How to reproduce it (as minimally and precisely as possible):
Set up two API servers with data partition file content as follows:
Api server 1:
/registry/pods/,,tenant2
/registry/replicasets/,,tenant2
/registry/deployments/,,tenant2

Api server 2:
/registry/pods/,tenant2,tenant3
/registry/deployments/,tenant2,tenant3
/registry/replicasets/,tenant2,tenant3

After creating tenant11, we created namespace2 for tenant11, then created deployment tenant11-deployment2 for tenant11. Both API servers have the events in their logs.

(Note: "/registry/pods/,tenant2,tenant3" was not in the data partition file during the demo. Therefore, it might be a false alarm. Check the scenario with the correct pod partition first.)

Anything else we need to know?:

Environment:

  • Arktos version (use kubectl version):
    Added kube config files to kubelet and kube proxy (#54) bc87e67
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Multi-tenancy network: design spec of service IP changes

What would you like to be added:
Figure out what's needed to support service IP allocation and management in a multi-tenancy cluster.

Why is this needed:
For the strong isolation case, multiple tenants cannot share a single IP space for services.

Update RBAC Authorizer to enforce tenant check

What would you like to be added:

Add the following logic in addition to the current RBAC authorizer logic:

  1. If an API request is issued by tenant A and the target resource is under tenant B, deny the request.
  2. If the target resource is also under tenant A, only evaluate ClusterRoles and Roles defined in tenant A (by simply skipping all other ClusterRoles and Roles).

Why is this needed:
The RBAC authorizer is the commonly used authorizer. To support multi-tenancy access control, we need to add extra logic so that: 1) no tenant can access another tenant's resources; 2) the system user can still access all resources.
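As an illustration of the intended scoping (the tenant metadata field follows the pattern shown in the Deployment example later on this page; the names here are hypothetical), a Role defined under one tenant would only be evaluated for requests targeting that tenant's resources:

# Hypothetical tenant-scoped Role; per the proposed logic it would be evaluated
# only for requests whose target resources are under the same tenant ("org-a").
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader       # hypothetical Role owned by tenant org-a
  namespace: ns-1
  tenant: org-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Per the proposed logic, a request from a user in another tenant would be denied outright, and this Role would be skipped when evaluating requests scoped to other tenants.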

Moving performance test from GCE to AWS

What would you like to be added:
Make the performance test work in AWS.

Why is this needed:
GCE capacity is very limited. It is very hard to get enough machines for performance tests in the required time.

copyright headers are wrongly updated when doing 'make update'

What happened:
After running 'make update', a lot of unchanged files are updated with a new copyright header.

What you expected to happen:
No change for unchanged files.

How to reproduce it (as minimally and precisely as possible):
Clean clone Arktos repo, then run 'make update'.

Anything else we need to know?:

  • Seen these error messages on screen:
    Inspecting copyright files, writing logs to _output/ArktosCopyrightTool.log
    sed: 1: "/Users/yinding/workspac ...": transform strings are not the same length

  • In log file, sample error messages are:
    Copied file /Users/yinding/workspace/src/k8s.io/kubernetes/pkg/cloudfabric-controller/replicaset/doc.go has K8s copyright but not Arktos copyright. Skipping.
    Copied file /Users/yinding/workspace/src/k8s.io/kubernetes/pkg/cloudfabric-controller/replicaset/replica_set_utils.go has K8s copyright but not Arktos copyright. Skipping.

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Deployment with tenant failed to create pods, error: serviceaccount "default" not found

What happened:
Using a Deployment with tenant information to deploy pods, the deployment is created successfully, but the pods fail to create with the error:

message: 'pods "pods-test1-56b9fc7dfc-" is forbidden: error looking up service
      account scale-tenent2/ns-test1/default: serviceaccount "default" not found'

What you expected to happen:
pods should be created successfully

How to reproduce it (as minimally and precisely as possible):

  1. run "kubectl apply -f pods-test1.yaml", get "deployment.apps/pods-test1 created"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "pods-test1"
  namespace: "ns-test1"
  tenant: scale-tenent2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: "pods-test1"
  template:
    metadata:
      labels:
        app: "pods-test1"
    spec:
      containers:
      - name: "pods-test1"
        image: kahootali/counter:1.0
  1. run "kubectl get pods -n ns-test1 --tenant scale-tenent2" to check pods, and get "No resources found."
  2. run "kubectl get deployment pods-test1 -n ns-test1 --tenant scale-tenent2 -o yaml" to check detail and get information:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"pods-test1","namespace":"ns-test1","tenant":"scale-tenent2"},"spec":{"replicas":5,"selector":{"matchLabels":{"app":"pods-test1"}},"template":{"metadata":{"labels":{"app":"pods-test1"}},"spec":{"containers":[{"image":"kahootali/counter:1.0","name":"pods-test1"}]}}}}
  creationTimestamp: "2020-02-05T20:38:26Z"
  generation: 1
  hashKey: 8055224089065378592
  name: pods-test1
  namespace: ns-test1
  resourceVersion: "129862"
  selfLink: /apis/extensions/v1beta1/tenants/scale-tenent2/namespaces/ns-test1/deployments/pods-test1
  tenant: scale-tenent2
  uid: 86d39239-6fe7-4a22-aa65-03d514a40dd7
spec:
  progressDeadlineSeconds: 600
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: pods-test1
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: pods-test1
    spec:
      containers:
      - image: kahootali/counter:1.0
        imagePullPolicy: IfNotPresent
        name: pods-test1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: "2020-02-05T20:38:26Z"
    lastUpdateTime: "2020-02-05T20:38:26Z"
    message: Created new replica set "pods-test1-56b9fc7dfc"
    reason: NewReplicaSetCreated
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-02-05T20:38:26Z"
    lastUpdateTime: "2020-02-05T20:38:26Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-02-05T20:38:28Z"
    lastUpdateTime: "2020-02-05T20:38:28Z"
    message: 'pods "pods-test1-56b9fc7dfc-" is forbidden: error looking up service
      account scale-tenent2/ns-test1/default: serviceaccount "default" not found'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 1
  unavailableReplicas: 5

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:48:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:45:25Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Google Cloud
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux alkaiddev 5.0.0-1029-gcp #30~18.04.1-Ubuntu SMP Mon Jan 13 05:40:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

[vnic-hotplug] nic state/reason info should be part of nic status exposed to user

What would you like to be added:
The runtime should provide full NIC status info, including the current state and reason, so that kubelet can report it back to the API server and users can see the actual NIC state as part of the pod status.

Why is this needed:
The current runtime API has ListNetworkInterfaces, which only returns name/portid etc., not the actual state and reason information. This API only exposes the identities of processed NICs, lacking the processing result. For a better user experience, it is desirable to have the state and reason info as well. We need to evaluate design options that provide such info.
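Purely as a sketch of the desired output (none of the field names below exist in the current API; they only illustrate the state/reason information this issue asks for), an enriched NIC entry surfaced in pod status could look like:

# Hypothetical shape of an enriched NIC status entry; field names are
# illustrative of the requested state/reason information, not an existing API.
nics:
- name: eth1             # identity already returned by ListNetworkInterfaces
  portId: <port-id>
  state: Attached        # requested: actual state of the hot-plugged NIC
  reason: ""             # requested: failure reason when the state is unhealthy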

(This issue is ported from the old repo with original author h-w-chen.)

Multi-tenancy network: design spec of DNS support

What would you like to be added:

Finalize the design of DNS server deployment & configuration in Arktos.

Why is this needed:
In a multi-tenancy cluster, different tenants cannot share a single DNS server. Also pods won't be able to access a single DNS pod if these pods are in different VPCs.

Admission controller to block certain cluster resources

Admission controller: support tenant

What would you like to be added:
Add checks that deny access to cluster resources such as Node and DaemonSet for tenant users.

Why is this needed:
Resources like Node and DaemonSet should not be accessible to regular tenant users.

Remove or update OWNERS files in all sub-folders

What would you like to be added:

Update OWNERS files in all sub-folders. Or remove them before we have detailed owners for each sub-folder.

Why is this needed:

There are OWNERS files under the root folder and sub-folders in the repo. We already updated the one under the root folder, but lots of OWNERS files under sub-folders are still the ones inherited from Kubernetes.

Arktos needs its own storage services

Arktos is positioned as a unified platform for natively orchestrating containers, VMs and bare metal machines. It is believed that, from a stack layout perspective, Arktos will sit directly on top of the hardware management layer in the data center.

This proposition puts Arktos in a strange situation: it has no storage service of its own that provides capabilities similar to Cinder in OpenStack or EBS in AWS. For example, a few possible approaches without a native storage service could be considered:

  1. Set up Arktos side by side with OpenStack, and use the old, dedicated stand-alone Cinder provider along with Arktos to provide remote storage service.
  2. Set up Arktos on top of the OpenStack bare metal service, and use OpenStack to provide hardware-level support.
  3. Run side by side with OpenStack and develop our own FlexVolume driver to utilize storage services from OpenStack.

As we can see, none of these satisfy our requirements for Arktos. However, creating a new storage service, e.g. building something similar to Cinder, is a very sizable effort. This issue serves as a starting point to further discuss storage solutions for Arktos.

Update to go 1.13

What would you like to be added:
Currently the project fails to build using Go 1.13. We need to update.

Why is this needed:

Endpoints consolidation changes for api server data partition

What would you like to be added:
Currently the Kubernetes service endpoint for the API server uses one entry for all API server instances, as they function the same. We need to be able to add multiple service endpoints for API servers when their data is partitioned.

Why is this needed:

  1. Give an identity to each API server partition
  2. Allow further support for multiple API server instances for each partition
  3. Fix the integration test teardown issue.

Task breakdown

  1. Create a data partition registry and sync data with an informer (DONE)
  2. Working on resetting the ETCD watch channel (Done - 3/18)
    . Leftover task: get the service group id from the command line when the API server starts
  3. API service endpoints consolidation (Pending CR and e2e test - 3/26)
  4. Notify API server clients that there are new servers (in backlog)

kubectl top node doesn't work with error: unable to handle the request (get nodes.metrics.k8s.io)

What happened:
Run "kubectl get nodes" get correct nodes information, but run "kubectl top node" get error :

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics
.k8s.io)

Run "kubectl get pods" get correct pods informaiton, but run "kubectl top pod" get error :

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.
k8s.io)

What you expected to happen:
run "kubectl top node", get correct node resource usage information
run "kubectl top pod", get correct pod resource usage information

How to reproduce it (as minimally and precisely as possible):

  1. run ./cluster/kube-up.sh to start Arktos on GCP
  2. after the cluster is running, run "kubectl top node" to check node resource usage status

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:48:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-10T18:52:47Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
commit d511022b09bcfd54bddcfe488960916b184d0f56 (origin/master, origin/HEAD, master)
Author: Jun Shao <[email protected]>
Date:   Thu Jan 23 16:51:30 2020 -0800
  • Cloud provider or hardware configuration:
    Google Cloud
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux alkaiddev 5.0.0-1029-gcp #30~18.04.1-Ubuntu SMP Mon Jan 13 05:40:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

High CPU usage and Memory usage from kube-apiserver

What happened:
The Arktos kube-apiserver has higher CPU and memory usage than Kubernetes.
Running 5000 nodes with 15000 pods, Arktos' average CPU usage is 4500% (96-core master) and memory usage is 13.7 GB; Kubernetes' average CPU usage is 500% (96-core master) and memory usage is 6.8 GB.

What you expected to happen:
CPU and memory usage should be about the same or lower.

How to reproduce it (as minimally and precisely as possible):

  1. deploy pods to a 5000-node cluster
  2. monitor CPU usage and memory usage status

Anything else we need to know?:
Test environment: kubemark simulated nodes

Environment:
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:48:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-10T18:52:47Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
commit d511022b09bcfd54bddcfe488960916b184d0f56 (origin/master, origin/HEAD, master)
Author: Jun Shao <[email protected]>
Date:   Thu Jan 23 16:51:30 2020 -0800
  • Cloud provider or hardware configuration:
    Google Cloud
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux alkaiddev 5.0.0-1029-gcp #30~18.04.1-Ubuntu SMP Mon Jan 13 05:40:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

need design for VPC (tenant networking)

tracking work items of multi-tenancy related design.

What would you like to be added:
Define a set of high-level tenant resources that depict network resources, to accommodate multi-tenancy in k8s.

Why is this needed:
Multi-tenancy may make isolated networking desirable, which would break the canonical k8s networking model.

(This issue is ported from the old repo with original author h-w-chen)

kube-controller-manager/workload-controller-manager on admin cluster will crash when running 5000 hollow-nodes pods

What happened:
Using kubemark to simulate 5000 nodes, kube-controller-manager/workload-controller-manager on the admin cluster crash when running 5000 hollow-node pods.

What you expected to happen:
no crash on admin cluster
How to reproduce it (as minimally and precisely as possible):

  1. run the commands below to start admin cluster and kubemark cluster
export MASTER_DISK_SIZE=750GB KUBE_GCE_ZONE=us-west2-b MASTER_SIZE=n1-highmem-96 KUBE_GCE_NETWORK=default
./cluster/kube-up.sh
export MASTER_DISK_SIZE=750GB KUBE_GCE_ZONE=us-west2-b MASTER_SIZE=n1-highmem-96 KUBE_GCE_NETWORK=kubemark
./test/kubemark/start-kubemark.sh
  2. start 5000 hollow nodes
  3. check admin cluster status; kube-controller-manager/workload-controller-manager crash at around 500 nodes

Anything else we need to know?:

Environment:

  • Arktos version (use kubectl version):
commit 126c90ad2230452054a1a2b8e5f3c4f2c5cf4540 (HEAD)
Merge: b474e793 f0b3df9d
Author: Yin Ding <[email protected]>
Date:   Fri Feb 7 14:55:51 2020 -0800
 
    Merge pull request #9 from Sindica/controller-s3-merge
    
    Support multiple api server configurations
  • Cloud provider or hardware configuration:
    Google Cloud
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux alkaiddev 5.0.0-1029-gcp #30~18.04.1-Ubuntu SMP Mon Jan 13 05:40:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Git Tag missing

When I try "make quick-release" to build the binaries, I hit an issue that seems to reproduce only on a clean Ubuntu EC2 box: quick-release generates the error "git tag missing".
Running a couple of git commands resolves it:
~/go/src/k8s.io/alkaid$ git version
git version 2.17.1
~/go/src/k8s.io/alkaid$ git tag -a v2.17.1

(This issue is migrated from the old repo and was originally filed by jeffzhu503. It's pending triage.)

Support tenant short path in endpoint handler

What would you like to be added:
The API Server accepts full paths for all API operations as well as short paths for compatibility reasons. A short path does not have tenant information in the URL. For example, for the full path

/api/v1/tenants/org-a/namespaces/ns-1/pods/pod1

the short path is

/api/v1/namespaces/ns-1/pods/pod1

The endpoint handler inside the API Server should extract the tenant information from the identity associated with the current API request, and then change the path to a full path. This happens transparently to clients.

Why is this needed:
This enables users to use their existing CLI tools (such as kubectl) without changes; only a one-time setup of tenant credential information is needed.

Support persistent block storage volumes for VMs

What would you like to be added:
Verify that at least one remote block storage (preferably Cinder, as we need to integrate with OpenStack) works with Arktos, so that VMs keep the same volume when restarted, either on the same node or a different node.

Why is this needed:
VMs need persistent block storage to act as disks.
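For reference, the standard Kubernetes pattern here is a PersistentVolumeClaim bound to a remote block volume; the StorageClass name below is a placeholder for whatever Cinder-backed provisioner ends up being used, not something Arktos ships today:

# Standard PVC pattern for a remote block volume; "cinder-standard" is a
# placeholder StorageClass name, and "vm-disk-1" is an illustrative claim name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-1
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: cinder-standard
  resources:
    requests:
      storage: 20Gi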

Controller instance workload distribution auto rebalance

What would you like to be added:
Currently in workload controller managers, a controller instance's hashkey is generated at startup and is only readjusted when nearby controller instances are removed.

We need to auto-adjust controller instance hashkeys even when there are no controller instance changes.

Why is this needed:
To maintain a healthy and automatically workload-balanced Kubernetes cluster.

arktos-up occasionally updates the wrong file and makes tests fail

What happened:
Running ./hack/arktos-up.sh sometimes updates the file pkg/scheduler/apis/config/v1alpha1/zz_generated.conversion.go.

A portion of the changes:

import
-	configv1alpha1 "k8s.io/component-base/config/v1alpha1"

@@ -147,16 +146,18 @@ func autoConvert_v1alpha1_KubeSchedulerConfiguration_To_config_KubeSchedulerConf
 	}
 	out.HardPodAffinitySymmetricWeight = in.HardPodAffinitySymmetricWeight
 	if err := Convert_v1alpha1_KubeSchedulerLeaderElectionConfiguration_To_config_KubeSchedulerLeaderElectionConfiguration(&in.LeaderElection, &out.LeaderElection, s); err != nil {
 		return err
 	}
-	if err := configv1alpha1.Convert_v1alpha1_ClientConnectionConfiguration_To_config_ClientConnectionConfiguration(&in.ClientConnection, &out.ClientConnection, s); err != nil {
+	// TODO: Inefficient conversion - can we improve it?
+	if err := s.Convert(&in.ClientConnection, &out.ClientConnection, 0); err != nil {
 		return err
 	}

This change stops the API server from working.

What you expected to happen:
arktos-up.sh should not make changes to pkg/scheduler/apis/config/v1alpha1/zz_generated.conversion.go .

How to reproduce it (as minimally and precisely as possible):
Run ./hack/arktos-up.sh several times and try to reproduce.

Anything else we need to know?:
It started happening after arktos_copyright.sh was introduced. Previously, Qian occasionally saw it during "make update". It looks like arktos_copyright.sh triggers some script that causes this.
This scenario does not reproduce consistently.

Environment:

  • Arktos version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release): ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

endpoint resource should support tenant

What happened:
Getting the endpoint resource returns no tenant information.

What you expected to happen:
endpoint resource should have tenant in its metadata

How to reproduce it (as minimally and precisely as possible):
kubectl get endpoint --all-namespaces

Anything else we need to know?:
This lack of multi-tenancy support for the endpoint type seems to lead to the API server performance degradation observed in the kubemark perf test (5000 nodes, ~20K pods).

ETCD upgrade from 3.3 to 3.4

What would you like to be added:
Use ETCD 3.4 in arktos

Why is this needed:
Performance enhancement in 3.4

Current Identified Tasks:

  • Fork ETCD 3.4 latest released code - DONE 3/5/2020 (currently 3.4.4 commit c65a9e2dd1fd500ca4191b1f22ddfe5e019b3ca1) into Futurewei repo
    https://github.com/futurewei-cloud/etcd

  • Make sure ETCD 3.4.4 works with arktos-up.sh (preliminary test works)

  • Make sure ETCD 3.4.4 works with "make update" (ETCD healthcheck failing)

  • Make sure ETCD 3.4.4 works with "make quick-release"

  • Make sure ETCD 3.4.4 works with "make"

  • Make sure ETCD 3.4.4 works with kubemark

Note:

  1. Currently ETCD 3.4.3 is used in Kubernetes release 1.18. ETCD got updated to 3.4.4 on 2020-02-24 (https://github.com/futurewei-cloud/etcd/blob/master/CHANGELOG-3.4.md).
  2. Futurewei fork of ETCD: https://github.com/futurewei-cloud/etcd/

kubectl create ns with --tenant doesn't work

What happened:
Creating a namespace using "kubectl create ns" with --tenant doesn't work; the namespace is still created under the default tenant.

What you expected to happen:
The namespace should be created under the assigned tenant.

How to reproduce it (as minimally and precisely as possible):

  1. create tenant using yaml file, run "kubectl apply -f tenant2.yaml"
apiVersion: v1
kind: Tenant
metadata:
  name: scale-tenent2
  1. run "kubectl create ns ns-test2 --tenant scale-tenent2"
  2. run "kubectl get ns", get result
NAME                                      STATUS        AGE    TENANT
default                                   Active        170m   default
kube-node-lease                           Active        170m   default
kube-public                               Active        170m   default
kube-system                               Active        170m   default
ns-test2                                  Active        7s     default

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:48:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"2", Minor:"17+", GitVersion:"v2.17.1-83+fed7f520caa45f-dirty", GitCommit:"fed7f520caa45fd1f424377605b3e5820055218b", GitTreeState:"dirty", BuildDate:"2020-02-05T17:45:25Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Google Cloud
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux alkaiddev 5.0.0-1029-gcp #30~18.04.1-Ubuntu SMP Mon Jan 13 05:40:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

need design for network resource recycle in case of down node

What would you like to be added:
We need a design for k8s to ensure mizar-mp is able to recycle network resources allocated to pods on a node that is lost (due to node crash, network partition, etc.).

Why is this needed:

k8s garbage collection

(This issue is ported from the old repo with original author h-w-chen)

tests for multi-tenancy aware api server

The tests verify that:

  • the API server responds correctly to the new multi-tenancy APIs.
  • the data is correctly saved/updated in etcd.

(This issue is ported from the old repo with original author chenqianfzh.)

ETCD partition research

What would you like to be added:
Research ETCD data partition requirements and options.
Output: a design doc and a workable feature list for ETCD data partitioning.

Why is this needed:
To scale out kubernetes clusters as much as possible.

Performance testing: SchedulingThroughput is one fourth of pre-Alkaid

What happened:
performance testing result for SchedulingThroughput_density
Alkaid (01/16 build and 01/20 build):

  "average": 4.980833333333334,
  "perc50": 5,
  "perc90": 5,
  "perc99": 5.2

Pre-Alkaid (06/18/2019 build)

  "average": 19.736842105263147,
  "perc50": 20,
  "perc90": 20,
  "perc99": 20.2

What you expected to happen:
equal or higher throughput data than base version: commit 75e9476
How to reproduce it (as minimally and precisely as possible):

  1. prepare kubemark test environments
  2. run kubernetes performance testing:
./run-e2e.sh --provider=kubemark --report-dir=[HOME]/logs/perf-test/gce-100/0116 --kubeconfig=[HOME]/go/src/k8s.io/alkaid/test/kubemark/resources/kubeconfig.kubemark --testconfig=testing/density/config.yaml --testconfig=testing/load/config.yaml

Anything else we need to know?:

Environment:

  • Arktos version (use kubectl version):
alkaid 01/16 build and 01/20 build
  • Cloud provider or hardware configuration:
Admin master: 1 | e2-standard-8 (8 vCPUs, 32 GB memory)      | cos-73-11647-163-0
Admin nodes: 3 | n1-standard-4 (4 vCPUs, 15 GB memory) |     cos-73-11647-163-0
             5 | n1-standard-8 (8 vCPUs, 30 GB memory) | cos-73-11647-163-0
kubemark master: 1 | e2-standard-8 (8 vCPUs, 32 GB memory)      | cos-beta-73-11647-64-0

Software Build: Pre-Alkaid 06/2019 75e94764, Alkaid 01/16/2020, Alkaid 01/20/2020
Cloud Provider: GCE
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
