
sokar's Introduction

Sokar


Overview

Purpose

Sokar is a generic alert based auto-scaler for cloud systems.

If you are running your microservices on a container orchestration system like Nomad or Kubernetes, then you probably also need to scale them based on load that varies over time. The same applies if your system runs directly on AWS EC2 instances, which have to be scaled out when the load increases and scaled in when it decreases. Usually the decision to scale is made based on metrics like current CPU/RAM utilization or requests per second. But often you might want to use custom metrics like the length of a job queue, the number of processed images per second, or even a combination of those.

This is where sokar comes into play. Sokar is a generic auto-scaler that makes scale up/down decisions based on scale alerts. It constantly evaluates the incoming scaling alerts, aggregates them, and then scales the desired ScaleObject (i.e. a microservice or an EC2 instance). Even if multiple metrics shall be taken into account for scaling the ScaleObject, those metrics just have to be expressed as scaling alerts and sokar will use them accordingly.

doc/overview_coarse.png

Benefit

  1. The possibility to combine multiple metrics for scaling decisions. The impact of those metrics, expressed as scale alerts, can easily be adjusted by configuring suitable weights.
  2. Use the connectors to scale the actual ScaleObject; there is no need to implement the communication with the container orchestration system yourself. The supported connectors can be found here.
  3. Configurable and ready-to-use capacity planning, providing separate cooldowns for up- and down-scaling. Furthermore it is possible to select the planning mode which fits your workload best.
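The weighting idea from point 1 can be sketched roughly as follows. This is an illustrative Go sketch, not sokar's actual API; the `ScaleAlert` type, `aggregate` function, and alert names are invented for the example:

```go
package main

import "fmt"

// ScaleAlert is a hypothetical representation of an incoming scaling alert.
type ScaleAlert struct {
	Name   string
	Firing bool
}

// aggregate sums the configured weights of all firing alerts into one
// scale counter; a positive result suggests scaling up, a negative one down.
func aggregate(alerts []ScaleAlert, weights map[string]float64) float64 {
	var counter float64
	for _, a := range alerts {
		if a.Firing {
			counter += weights[a.Name]
		}
	}
	return counter
}

func main() {
	weights := map[string]float64{"QueueTooLong": 2.0, "CPULow": -1.0}
	alerts := []ScaleAlert{{"QueueTooLong", true}, {"CPULow", true}}
	fmt.Println(aggregate(alerts, weights)) // 2.0 + (-1.0) = 1.0, i.e. scale up
}
```

Tuning the weights shifts how strongly each metric influences the final decision without changing any alerting rules.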

State

At the moment sokar is able to scale Nomad jobs, Nomad instances (running on AWS) and AWS EC2 instances. For details about the changes see the changelog.

Build and Run

Sokar can be built and run either in Docker or directly on the host.

On Host

# build
make build

# run (as scaler for nomad jobs)
make run.nomad-job

Docker

# build
make docker.build

# pull it from docker hub
docker pull thobe/sokar

# run (as scaler for nomad jobs)
docker run -p 11000:11000 thobe/sokar:latest

For more configuration options, and for how to specify whether sokar shall run as a scaler for Nomad jobs, Nomad instances, or AWS instances, see Config.md.

Features

Links

sokar's People

Contributors

codacy-badger, thomasobenaus


sokar's Issues

Move to sonar-cloud

Codacy seems to be buggy and slow; hence it would be better to move forward and try another tool.

Robustness against Deployments

Sokar modifies the job specification upon scaling the job; concretely, the count of the job is modified.
The CI/CD system also modifies the job specification, without knowing the current count that was set by sokar.
This conflicting information on both sides leads to the CI/CD system overwriting the count set by sokar and thus reverting the scaling.

CapacityPlanner Linear Mode

Implement a linear scaling mode for the CapacityPlanner.
It should take the speed at which the scaleCounter changes (represented by the scaleFactor) into account. This means the faster the scaleCounter grows or shrinks, the more instances shall be deployed at once.
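A linear mode along these lines could look like the following minimal sketch. The function name, the `maxStep` cap, and the rounding behavior are assumptions for illustration, not the CapacityPlanner's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// linearStep computes the instance adjustment proportionally to the
// scaleFactor: the faster the scaleCounter changes, the larger the step.
// maxStep caps the adjustment in both directions.
func linearStep(scaleFactor float64, maxStep uint) int {
	step := int(math.Round(scaleFactor * float64(maxStep)))
	if step > int(maxStep) {
		return int(maxStep)
	}
	if step < -int(maxStep) {
		return -int(maxStep)
	}
	return step
}

func main() {
	fmt.Println(linearStep(0.5, 4))  // moderate growth -> +2 instances
	fmt.Println(linearStep(-1.0, 4)) // fast shrink -> -4 instances
}
```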

LogLevel Configurable

Currently the log-level is hardcoded to Debug.
It should be configurable so that one can decide between verbose output and only the needed information.

Sokar skipped scale action metric does not work in case the ScalingTarget is missing

Root Cause

The problem is that sokar only wants to fire this metric in case a scale UP/DOWN is really needed.
Therefore sokar first evaluates the current state of the job to scale: it tries to obtain the current job count by requesting the scaling-target (i.e. Nomad, see: https://github.com/ThomasObenaus/sokar/blob/master/scaler/scale.go#L101).

As a consequence, if no scaling target is defined, it fails before the metric is increased.

Options

Short-Term:

Fire the metric in case of a dry run each time the capacity-planner wants to adjust the count, without reflecting the actual scale state of the job.

Short-Term 2:

Implement a mocked scaling-target for AWS instances.

Mid-Term:

Implement a scaling target for AWS instances that is able to obtain the actual instance count.
Problem: AWS access key handling has to be implemented/provided, OR the instances sokar runs on have to have IAM instance roles with the needed permissions.

Metrics for allocated resources

For scaling a Nomad data-center, it would be useful to know how many resources would be needed if all currently running jobs were scaled up by x (e.g. 1).

This number could then be used to decide on a scale up/down.

Ideally, x is configurable to give the user a bit more flexibility.
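The computation behind such a metric could be sketched as follows; the `job` struct and `headroomNeeded` function are illustrative names, not sokar's types:

```go
package main

import "fmt"

// job describes a running job's per-instance resource footprint (illustrative).
type job struct {
	CPUMHz   int
	MemoryMB int
}

// headroomNeeded returns the CPU and memory that would be required if every
// currently running job were scaled up by x additional instances.
func headroomNeeded(jobs []job, x int) (cpuMHz, memMB int) {
	for _, j := range jobs {
		cpuMHz += j.CPUMHz * x
		memMB += j.MemoryMB * x
	}
	return cpuMHz, memMB
}

func main() {
	jobs := []job{{CPUMHz: 500, MemoryMB: 256}, {CPUMHz: 1000, MemoryMB: 512}}
	cpu, mem := headroomNeeded(jobs, 1)
	fmt.Println(cpu, mem) // 1500 768
}
```

Comparing this headroom against the free capacity of the data-center then yields the scale up/down signal.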

Sokar is Able to Modify Autoscaling Group in Dry-Run Mode

Regardless of the dry-run parameter value, the watcher can issue draining commands and reduce the autoscaling group size:

if err := s.openScalingTicket(expected, false); err != nil {

Reproduction:

  • Run sokar with SK_SCA_MODE:"nomad-dc" and SK_DRY_RUN:"1"
  • Modify the autoscaling group to have a desired size greater than SK_SCALE_OBJECT_MAX
  • Observe the autoscaling group being modified by sokar after SK_SCA_WATCHER_INTERVAL

Is this desired behavior, or would it be better not to apply any changes to the infrastructure while sokar is in dry-run mode?
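The fix implied by the quoted call site is to propagate the configured dry-run flag instead of hardcoding `false`. A minimal sketch (the `scaler` type here is a stand-in, not sokar's real struct; only the flag-propagation idea matters):

```go
package main

import "fmt"

// scaler is a minimal stand-in for sokar's scaler type; the real one differs.
type scaler struct {
	dryRun bool
}

// openScalingTicket reports whether the scaling would actually be applied.
// With dryRun=true it only logs the intent and never touches the
// autoscaling group.
func (s *scaler) openScalingTicket(desired uint, dryRun bool) (applied bool) {
	if dryRun {
		fmt.Printf("dry-run: would scale to %d, no changes applied\n", desired)
		return false
	}
	fmt.Printf("scaling to %d\n", desired)
	return true
}

func main() {
	s := &scaler{dryRun: true}
	// Pass the configured flag through instead of a hardcoded false.
	s.openScalingTicket(5, s.dryRun)
}
```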

Node draining for Nomad

In order to really support down-scaling for Nomad nodes, the jobs have to be drained from a node beforehand.

The feature of draining a node and then down-scaling by removing only the drained nodes will be implemented with this ticket.

Instance Downscaling does not complete

With #106, one part of the problem that Nomad worker down-scaling did not complete successfully was solved.
But tests showed that, depending on your AWS setup (type and number of EC2 instances) and the load on the Nomad workers, the termination of an instance can take more than 3 minutes.

The problem is that in the current implementation sokar monitors (waits for) the termination of AWS EC2 instances for at most 3 minutes. If the termination takes more than 3 minutes, sokar assumes that the termination, and thus the whole down-scaling, failed.

Add License

Maybe it is also worth integrating https://fossa.com. FOSSA enables automated license and compliance checks, even for the used dependencies.

Improve Code Quality - Config

In the current code the anti-pattern of config structs is used too often.
That idiom hides too much information from the caller. Instead, explicit parameters should be used, or the pattern of functional options in case the parameter list would get too big.

Vendoring of generated nomad API does not work

In the current configuration, a dep ensure leads to build errors due to an incompatibility between the generated Nomad API code and the code generator used for it.

Because of that, the vendor folder had to be checked in.

Refactor Job to generic Name

Currently the term "job" is used for the objective to be scaled.
But this does not fit any more, and is even confusing, when sokar is in scaler.mode "data-center".

Thus it would be nice to rename the term "Job" to something more generic,

e.g. scaling-object, object-to-scale, ...

Refactor Job to ScaleObject - Part 2

With #64 the term job was changed to scale object in order to have a more generic term that also fits the scaling mode for data-centers, not only for scaling jobs.

Now in this ticket the clean up (i.e. removal of deprecated references) shall be done.

Fill Readme

The readme of the project is the main entry point and has to contain at least a brief description of what sokar is about.

ScaleBy End-Point does not work due to cooling down

It looks like a scaleBy triggered via the endpoint should ignore the cooldown. I did not manage to trigger up-scaling while other alerts were active at the same time.

	May 3rd 2019, 18:31:07.000	No scale needed.	debug
	May 3rd 2019, 18:31:07.000	Aggregation	info
	May 3rd 2019, 18:31:07.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Scale DOWN.	info
	May 3rd 2019, 18:31:06.000	ScaleCounter updated by -1.000000 to -11.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:06.000	Aggregation	info
	May 3rd 2019, 18:31:06.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:06.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Refresh gradient 0.000000. Scale needed.	debug
	May 3rd 2019, 18:31:06.000	Scale Event received: {-0.99995995}	info
	May 3rd 2019, 18:31:06.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:05.000	ScaleCounter updated by -1.000000 to -10.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:05.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:05.000	No scale needed.	debug
	May 3rd 2019, 18:31:05.000	Refresh gradient -1.000001. Evaluation period (10.000000s) exceeded.	debug
	May 3rd 2019, 18:31:05.000	Aggregation	info
	May 3rd 2019, 18:31:05.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Check job state (not implemented yet).	error
	May 3rd 2019, 18:31:04.000	No scale needed.	debug
	May 3rd 2019, 18:31:04.000	ScaleCounter updated by -1.000000 to -9.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:04.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:04.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	No scale needed.	debug
	May 3rd 2019, 18:31:03.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
>>	May 3rd 2019, 18:31:03.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:03.000	ScaleAlertPool:	debug
>>	May 3rd 2019, 18:31:03.000	ScaleBy Percentage Endpoint with '50 %' called.	info
	May 3rd 2019, 18:31:03.000	ScaleCounter updated by -1.000000 to -8.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	ScaleCounter updated by -1.000000 to -7.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	No scale needed.	debug
	May 3rd 2019, 18:31:02.000	Aggregation

Scheduled Scaling

For some use-cases it is known at which point in time the load (number of requests) increases, and thus it is known upfront when the resources to handle that load will be needed.
This means that, to be prepared, it would be sufficient to pre-scale the system or to schedule a specific scale of the system.

With this ticket the feature of scheduled scaling will be added to sokar.

Scheduled scaling means:
1. That sokar ensures that during a certain time span the scale of the system is not less than X and not more than Y.
2. That sokar still regards scaling alerts during this time span.
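These two requirements can be captured as a bounds check applied to the planned count: inside the scheduled window the count is clamped to [X, Y], outside the window it passes through untouched. A sketch with invented names (sokar's real schedule format may differ):

```go
package main

import "fmt"

// schedule bounds the scale within a daily time window (hours, illustrative).
type schedule struct {
	startHour, endHour int
	min, max           uint
}

// clamp applies the schedule's bounds if hour falls inside the window;
// outside the window the planned count passes through unchanged, so
// regular scale alerts are still honored at all times.
func (s schedule) clamp(planned uint, hour int) uint {
	if hour < s.startHour || hour >= s.endHour {
		return planned
	}
	if planned < s.min {
		return s.min
	}
	if planned > s.max {
		return s.max
	}
	return planned
}

func main() {
	s := schedule{startHour: 8, endHour: 18, min: 5, max: 20}
	fmt.Println(s.clamp(2, 9))  // inside window, below min -> 5
	fmt.Println(s.clamp(2, 22)) // outside window -> 2
}
```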

CapacityPlanner stepwise mode

Implement a step-wise mode for the CapacityPlanner.
This means it should be configurable, in a step-wise manner, how many instances the CapacityPlanner shall scale (respecting the scaleFactor).
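Such a step-wise configuration could map scaleFactor thresholds to fixed instance adjustments; the following sketch is an illustration with invented names, not the planned implementation:

```go
package main

import "fmt"

// step maps a scaleFactor threshold to a fixed instance adjustment.
type step struct {
	threshold float64 // applies when |scaleFactor| >= threshold
	count     int     // instances to add (or remove) at once
}

// stepwise picks the largest configured step whose threshold the
// scaleFactor's magnitude reaches; steps must be sorted ascending.
// The sign of the scaleFactor decides up- vs. down-scaling.
func stepwise(scaleFactor float64, steps []step) int {
	abs := scaleFactor
	if abs < 0 {
		abs = -abs
	}
	count := 0
	for _, s := range steps {
		if abs >= s.threshold {
			count = s.count
		}
	}
	if scaleFactor < 0 {
		return -count
	}
	return count
}

func main() {
	steps := []step{{0.25, 1}, {0.5, 2}, {1.0, 5}}
	fmt.Println(stepwise(0.6, steps))  // 2
	fmt.Println(stepwise(-1.2, steps)) // -5
}
```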

Sokar does not respect min/max on deployment

When sokar is deployed, it should directly check whether the current scale of the service is "out of bounds".
This means if the service is currently deployed with fewer instances than the min bound, or more than the max bound, sokar has to adjust the count accordingly.

Implement retry

Currently, if a scale (in/out) is interrupted by a deployment, it gets cancelled immediately.
This should be made more robust by adding at least a retry policy.

AWS EC2 Scaler

Currently only Nomad jobs can be scaled.
It would be nice if sokar were also able to scale a group of AWS EC2 instances (an AutoScalingGroup).

Sokar does not know about current state of AIA on deployment

The main source of information for sokar is the alerts received from the alertmanager.

It seems that the alertmanager updates the state of its alerting targets only when there is a change in the alerts.
This leads to the problem that a newly deployed sokar does not know about the currently firing alerts.

Separate scaling cooldown by type

Currently sokar's cooldown mechanism does not regard the direction of the scaling.
This means that even though it is possible to configure different cooldowns for down- and up-scaling, it is not possible to scale up directly after a down-scaling.

Why?

  • Sokar just memorizes the timestamp of the last scale action, no matter whether it was a down- or an up-scaling.
  • If configured differently, sokar just waits longer or shorter for the next down-/up-scaling event.
  • But in each case sokar will wait before the next down-scaling, even if there was only an up-scaling event beforehand.

With this ticket the down- and up-scaling cooldowns should be truly separated from one another.

Documentation ScalerModes is missing

With the v0.0.6 release, a mode for scaling data-centers was added beside the feature of scaling Nomad jobs.
For both, documentation is missing. It is especially important for the data-center mode, because it relies on some preconditions of the infrastructure to be scaled (i.e. the ASG has to be tagged appropriately, AWS credentials have to be provided, ...).

ScaleBy End-Point

For manual intervention, a scaleBy end-point would be nice:
ScaleBy means relative scaling based on a count or a percentage.

Downscaling of AWS instances fails (Throttling: Rate exceeded)

Since the monitoring of the termination of AWS instances contains a bug, the complete down-scaling can fail; at least sokar thinks it failed.

Ticket applied. Scaling was failed (Error adjusting scalingObject count to 11: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0d9e1fcb-f49d-11e9-973f-f1b0ba24974b.). New count is 12. Scaling in 46.780182 .
