
sokar's Introduction

Sokar


Overview

Purpose

Sokar is a generic alert based auto-scaler for cloud systems.

If you are running your microservices on a container orchestration system like Nomad or Kubernetes, then you probably also need to scale them based on load that varies over time. The same applies if your system runs directly on AWS EC2 instances, which have to be scaled out when the load increases and scaled in when it decreases. Usually the decision to scale is made based on metrics like current CPU/RAM utilization or requests per second. But often you might want to use custom metrics like the length of a job queue, the number of processed images per second, or even a combination of those.

This is where sokar comes into play. Sokar is a generic auto-scaler that makes scale up/down decisions based on scale alerts. It constantly evaluates the incoming scaling alerts, aggregates them, and then scales the desired ScaleObject (i.e. a microservice or an EC2 instance). Even if multiple metrics shall be taken into account for scaling the ScaleObject, those metrics just have to be expressed as scaling alerts and sokar will use them accordingly.

doc/overview_coarse.png

Benefit

  1. The possibility to combine multiple metrics for scaling decisions. The impact of those metrics, expressed as scale alerts, can easily be adjusted by configuring suitable weights.
  2. Use the connectors to scale the actual ScaleObject; there is no need to implement the communication with the container orchestration system yourself. The supported connectors can be found here.
  3. Configurable and ready-to-use capacity planning, providing separate cooldowns for up- and down-scaling. Furthermore it is possible to select the planning mode which fits your workload best.
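The weighting idea from point 1 can be sketched roughly as follows. This is an illustrative Go sketch, not sokar's actual API; the `ScaleAlert` type, `aggregate` function, and alert names are invented for the example:

```go
package main

import "fmt"

// ScaleAlert is a hypothetical representation of an incoming scaling alert.
type ScaleAlert struct {
	Name   string
	Firing bool
}

// aggregate sums the configured weights of all firing alerts into one
// scale counter; a positive result suggests scaling up, a negative one down.
func aggregate(alerts []ScaleAlert, weights map[string]float64) float64 {
	var counter float64
	for _, a := range alerts {
		if a.Firing {
			counter += weights[a.Name]
		}
	}
	return counter
}

func main() {
	weights := map[string]float64{"QueueTooLong": 2.0, "CPULow": -1.0}
	alerts := []ScaleAlert{{"QueueTooLong", true}, {"CPULow", true}}
	fmt.Println(aggregate(alerts, weights)) // 2.0 + (-1.0) = 1.0, i.e. scale up
}
```

Tuning the weights shifts how strongly each metric influences the final decision without changing any alerting rules.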

State

At the moment sokar is able to scale Nomad jobs, Nomad instances (running on AWS) and AWS EC2 instances. For details about the changes see the changelog.

Build and Run

Sokar can be built and run either in Docker or directly on the host.

On Host

# build
make build

# run (as scaler for nomad jobs)
make run.nomad-job

Docker

# build
make docker.build

# pull it from docker hub
docker pull thobe/sokar

# run (as scaler for nomad jobs)
docker run -p 11000:11000 thobe/sokar:latest

For more configuration options, and for how to specify whether sokar shall run as a scaler for Nomad jobs, Nomad instances, or AWS instances, see Config.md.

Features

Links

sokar's People

Contributors

codacy-badger, thomasobenaus


sokar's Issues

Move to sonar-cloud

Codacy seems to be buggy and slow; hence it would be better to move forward and try another tool.

Robustness against Deployments

Sokar modifies the job specification upon scaling the job; concretely, the count of the job is modified.
The CI/CD system also modifies the job specification, without knowing the current count that was set by sokar.
This conflicting information on both sides leads to the CI/CD system overwriting the count set by sokar and thus reverting the scaling.

CapacityPlanner Linear Mode

Implement a linear scaling mode for the CapacityPlanner.
It should take the speed at which the scaleCounter changes (represented by the scaleFactor) into account. This means the faster the scaleCounter grows or shrinks, the more instances shall be deployed at once.
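A linear mode along these lines could look like the following minimal sketch. The function name, the `maxStep` cap, and the rounding behavior are assumptions for illustration, not the CapacityPlanner's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// linearStep computes the instance adjustment proportionally to the
// scaleFactor: the faster the scaleCounter changes, the larger the step.
// maxStep caps the adjustment in both directions.
func linearStep(scaleFactor float64, maxStep uint) int {
	step := int(math.Round(scaleFactor * float64(maxStep)))
	if step > int(maxStep) {
		return int(maxStep)
	}
	if step < -int(maxStep) {
		return -int(maxStep)
	}
	return step
}

func main() {
	fmt.Println(linearStep(0.5, 4))  // moderate growth -> +2 instances
	fmt.Println(linearStep(-1.0, 4)) // fast shrink -> -4 instances
}
```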

LogLevel Configurable

Currently the log-level is hardcoded to Debug.
It should be configurable so that one can decide between verbose output and only the needed information.

Sokar skipped scale action metric does not work in case the ScalingTarget is missing

Root Cause

The problem is that sokar only wants to fire this metric in case a scale UP/DOWN is really needed.
Therefore sokar first evaluates the current state of the job to scale: it tries to obtain the current job count by requesting the scaling-target (i.e. Nomad, see: https://github.com/ThomasObenaus/sokar/blob/master/scaler/scale.go#L101).

As a consequence, if no scaling target is defined, it fails before the metric is increased.

Options

Short-Term:

Fire the metric in case of a dry run each time the capacity-planner wants to adjust the count, without reflecting the actual scale state of the job.

Short-Term 2:

Implement a mocked scaling-target for AWS instances.

Mid-Term:

Implement a scaling target for AWS instances that is able to obtain the actual instance count.
Problem: AWS access key handling has to be implemented/provided, OR the instances sokar runs on have to have IAM instance roles with the needed permissions.

Metrics for allocated resources

For scaling a Nomad data-center, it would be useful to know how many resources would be needed if all currently running jobs were scaled up by x (e.g. 1).

This number could then be used to decide on a scale up/down.

Ideally, x is configurable to give the user a bit more flexibility.
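The computation behind such a metric could be sketched as follows; the `job` struct and `headroomNeeded` function are illustrative names, not sokar's types:

```go
package main

import "fmt"

// job describes a running job's per-instance resource footprint (illustrative).
type job struct {
	CPUMHz   int
	MemoryMB int
}

// headroomNeeded returns the CPU and memory that would be required if every
// currently running job were scaled up by x additional instances.
func headroomNeeded(jobs []job, x int) (cpuMHz, memMB int) {
	for _, j := range jobs {
		cpuMHz += j.CPUMHz * x
		memMB += j.MemoryMB * x
	}
	return cpuMHz, memMB
}

func main() {
	jobs := []job{{CPUMHz: 500, MemoryMB: 256}, {CPUMHz: 1000, MemoryMB: 512}}
	cpu, mem := headroomNeeded(jobs, 1)
	fmt.Println(cpu, mem) // 1500 768
}
```

Comparing this headroom against the free capacity of the data-center then yields the scale up/down signal.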

Sokar is Able to Modify Autoscaling Group in Dry-Run Mode

Regardless of the dry-run parameter value, the watcher can issue draining commands and reduce the autoscaling group size:

if err := s.openScalingTicket(expected, false); err != nil {

Reproduction:

  • Run sokar with SK_SCA_MODE:"nomad-dc" and SK_DRY_RUN:"1"
  • Modify the autoscaling group to have a desired size greater than SK_SCALE_OBJECT_MAX
  • Observe the autoscaling group being modified by sokar after SK_SCA_WATCHER_INTERVAL

Is this desired behavior, or would it be better not to apply any changes to the infrastructure while sokar is in dry-run mode?
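The fix implied by the quoted call site is to propagate the configured dry-run flag instead of hardcoding `false`. A minimal sketch (the `scaler` type here is a stand-in, not sokar's real struct; only the flag-propagation idea matters):

```go
package main

import "fmt"

// scaler is a minimal stand-in for sokar's scaler type; the real one differs.
type scaler struct {
	dryRun bool
}

// openScalingTicket reports whether the scaling would actually be applied.
// With dryRun=true it only logs the intent and never touches the
// autoscaling group.
func (s *scaler) openScalingTicket(desired uint, dryRun bool) (applied bool) {
	if dryRun {
		fmt.Printf("dry-run: would scale to %d, no changes applied\n", desired)
		return false
	}
	fmt.Printf("scaling to %d\n", desired)
	return true
}

func main() {
	s := &scaler{dryRun: true}
	// Pass the configured flag through instead of a hardcoded false.
	s.openScalingTicket(5, s.dryRun)
}
```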

Node draining for Nomad

In order to really support down-scaling for Nomad nodes, the jobs have to be drained from a node beforehand.

The feature of draining a node and then down-scaling by removing only the drained nodes will be implemented with this ticket.

Instance Downscaling does not complete

With #106, one part of the problem that Nomad worker down-scaling did not complete successfully was solved.
But tests showed that, depending on your AWS setup (type and number of EC2 instances) and the load on the Nomad workers, the termination of an instance can take more than 3 minutes.

The problem is that in the current implementation sokar monitors (waits for) the termination of AWS EC2 instances for at most 3 minutes. If the termination takes more than 3 minutes, sokar assumes that the termination, and thus the whole down-scaling, failed.

Add License

Maybe it is also worth integrating https://fossa.com. FOSSA enables automated license and compliance checks, even for the used dependencies.

Improve Code Quality - Config

In the current code the anti-pattern of config structs is used too often.
That idiom hides too much information from the caller. Instead, explicit parameters should be used, or the pattern of functional options in case the parameter list would get too big.

Vendoring of generated nomad API does not work

In the current configuration, a dep ensure leads to build errors due to an incompatibility between the generated Nomad API code and the code generator used for it.

Because of that, the vendor folder had to be checked in.

Refactor Job to generic Name

Currently the term "job" is used for the objective to be scaled.
But this does not fit any more, and is even confusing, when sokar is in scaler.mode "data-center".

Thus it would be nice to rename the term "Job" to something more generic,

e.g. scaling-object, object-to-scale, ...

Refactor Job to ScaleObject - Part 2

With #64 the term job was changed to scale object in order to have a more generic term that also fits the scaling mode for data-centers, not only for scaling jobs.

Now in this ticket the clean up (i.e. removal of deprecated references) shall be done.

Fill Readme

The readme of the project is the main entry point and has to contain at least a brief description of what sokar is about.

ScaleBy End-Point does not work due to cooling down

It looks like a scaleBy triggered via the endpoint should ignore the cooldown. I did not manage to trigger up-scaling while other alerts were active at the same time.

	May 3rd 2019, 18:31:07.000	No scale needed.	debug
	May 3rd 2019, 18:31:07.000	Aggregation	info
	May 3rd 2019, 18:31:07.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Scale DOWN.	info
	May 3rd 2019, 18:31:06.000	ScaleCounter updated by -1.000000 to -11.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:06.000	Aggregation	info
	May 3rd 2019, 18:31:06.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:06.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Refresh gradient 0.000000. Scale needed.	debug
	May 3rd 2019, 18:31:06.000	Scale Event received: {-0.99995995}	info
	May 3rd 2019, 18:31:06.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:05.000	ScaleCounter updated by -1.000000 to -10.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:05.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:05.000	No scale needed.	debug
	May 3rd 2019, 18:31:05.000	Refresh gradient -1.000001. Evaluation period (10.000000s) exceeded.	debug
	May 3rd 2019, 18:31:05.000	Aggregation	info
	May 3rd 2019, 18:31:05.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Check job state (not implemented yet).	error
	May 3rd 2019, 18:31:04.000	No scale needed.	debug
	May 3rd 2019, 18:31:04.000	ScaleCounter updated by -1.000000 to -9.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:04.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:04.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	No scale needed.	debug
	May 3rd 2019, 18:31:03.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
>>	May 3rd 2019, 18:31:03.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:03.000	ScaleAlertPool:	debug
>>	May 3rd 2019, 18:31:03.000	ScaleBy Percentage Endpoint with '50 %' called.	info
	May 3rd 2019, 18:31:03.000	ScaleCounter updated by -1.000000 to -8.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	ScaleCounter updated by -1.000000 to -7.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	No scale needed.	debug
	May 3rd 2019, 18:31:02.000	Aggregation

Scheduled Scaling

For some use-cases it is known at which point in time the load (number of requests) increases, and thus it is known upfront when the resources to handle that load will be needed.
This means that, to be prepared, it would be sufficient to pre-scale the system or to schedule a specific scale of the system.

With this ticket the feature of scheduled scaling will be added to sokar.

Scheduled scaling means:
1. That sokar ensures that during a certain time span the scale of the system is not less than X and not more than Y.
2. That sokar still regards scaling alerts during this time span.
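These two requirements can be captured as a bounds check applied to the planned count: inside the scheduled window the count is clamped to [X, Y], outside the window it passes through untouched. A sketch with invented names (sokar's real schedule format may differ):

```go
package main

import "fmt"

// schedule bounds the scale within a daily time window (hours, illustrative).
type schedule struct {
	startHour, endHour int
	min, max           uint
}

// clamp applies the schedule's bounds if hour falls inside the window;
// outside the window the planned count passes through unchanged, so
// regular scale alerts are still honored at all times.
func (s schedule) clamp(planned uint, hour int) uint {
	if hour < s.startHour || hour >= s.endHour {
		return planned
	}
	if planned < s.min {
		return s.min
	}
	if planned > s.max {
		return s.max
	}
	return planned
}

func main() {
	s := schedule{startHour: 8, endHour: 18, min: 5, max: 20}
	fmt.Println(s.clamp(2, 9))  // inside window, below min -> 5
	fmt.Println(s.clamp(2, 22)) // outside window -> 2
}
```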

CapacityPlanner stepwise mode

Implement a step-wise mode for the CapacityPlanner.
This means it should be configurable, in a step-wise manner, how many instances the CapacityPlanner shall scale (respecting the scaleFactor).
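Such a step-wise configuration could map scaleFactor thresholds to fixed instance adjustments; the following sketch is an illustration with invented names, not the planned implementation:

```go
package main

import "fmt"

// step maps a scaleFactor threshold to a fixed instance adjustment.
type step struct {
	threshold float64 // applies when |scaleFactor| >= threshold
	count     int     // instances to add (or remove) at once
}

// stepwise picks the largest configured step whose threshold the
// scaleFactor's magnitude reaches; steps must be sorted ascending.
// The sign of the scaleFactor decides up- vs. down-scaling.
func stepwise(scaleFactor float64, steps []step) int {
	abs := scaleFactor
	if abs < 0 {
		abs = -abs
	}
	count := 0
	for _, s := range steps {
		if abs >= s.threshold {
			count = s.count
		}
	}
	if scaleFactor < 0 {
		return -count
	}
	return count
}

func main() {
	steps := []step{{0.25, 1}, {0.5, 2}, {1.0, 5}}
	fmt.Println(stepwise(0.6, steps))  // 2
	fmt.Println(stepwise(-1.2, steps)) // -5
}
```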

Sokar does not respect min/max on deployment

When sokar is deployed, it should directly check whether the current scale of the service is "out of bounds".
This means if the service is currently deployed with fewer instances than the min bound, or more than the max bound, sokar has to adjust the count accordingly.

Implement retry

Currently, if a scale (in/out) is interrupted by a deployment, it gets cancelled immediately.
This should be made more robust by adding at least a retry policy.

AWS EC2 Scaler

Currently only Nomad jobs can be scaled.
It would be nice if sokar were also able to scale a group of AWS EC2 instances (an AutoScalingGroup).

Sokar does not know about current state of AIA on deployment

The main source of information for sokar is the alerts received from the alertmanager.

It seems that the alertmanager updates the state of its alerting targets only when there is a change in the alerts.
This leads to the problem that a newly deployed sokar does not know about the currently firing alerts.

Separate scaling cooldown by type

Currently sokar's cooldown mechanism does not regard the direction of the scaling.
This means that even though it is possible to configure different cooldowns for down- and up-scaling, it is not possible to scale up directly after a down-scaling.

Why?

  • Sokar just memorizes the timestamp of the last scale action, no matter whether it was a down- or an up-scaling.
  • If configured differently, sokar just waits longer or shorter for the next down-/up-scaling event.
  • But in each case sokar will wait before the next down-scaling, even if there was only an up-scaling event beforehand.

With this ticket the down- and up-scaling cooldowns should be truly separated from one another.

Documentation ScalerModes is missing

With the v0.0.6 release, a mode for scaling data-centers was added beside the feature of scaling Nomad jobs.
For both, documentation is missing. It is especially important for the data-center mode, because it relies on some preconditions of the infrastructure to be scaled (i.e. the ASG has to be tagged appropriately, AWS credentials have to be provided, ...).

ScaleBy End-Point

For manual intervention, a scaleBy end-point would be nice:
ScaleBy means relative scaling based on a count or a percentage.

Downscaling of AWS instances fails (Throttling: Rate exceeded)

Since the monitoring of the termination of AWS instances contains a bug, the complete down-scaling can fail; at least sokar thinks it failed.

Ticket applied. Scaling was failed (Error adjusting scalingObject count to 11: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0d9e1fcb-f49d-11e9-973f-f1b0ba24974b.). New count is 12. Scaling in 46.780182 .
