sokar's Issues

CapacityPlanner stepwise mode

Implement a step-wise mode for the CapacityPlanner.
This means it should be configurable, in a step-wise manner, how many instances the CapacityPlanner scales by (respecting the scaleFactor).

AWS EC2 Scaler

Currently only nomad jobs can be scaled.
It would be nice if sokar could also scale a group of AWS EC2 instances (AutoScalingGroup).

ScaleBy End-Point

For manual intervention a scaleby end-point would be nice.
ScaleBy - relative scaling based on count and percentage.
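
A minimal sketch of what relative scaling by count and by percentage could look like. The function names scaleByValue and scaleByPercentage and the rounding rule (round away from zero) are assumptions for illustration, not sokar's actual API.

```go
package main

import (
	"fmt"
	"math"
)

// scaleByValue returns the new count after adding a (possibly negative) delta.
// The result never drops below zero.
func scaleByValue(current uint, delta int) uint {
	next := int(current) + delta
	if next < 0 {
		return 0
	}
	return uint(next)
}

// scaleByPercentage returns the new count after a relative change,
// e.g. 0.5 for +50 % or -0.25 for -25 %. The change is rounded away from
// zero (an assumption) so that small percentages still have an effect.
func scaleByPercentage(current uint, fraction float64) uint {
	change := float64(current) * fraction
	var delta int
	if change >= 0 {
		delta = int(math.Ceil(change))
	} else {
		delta = int(math.Floor(change))
	}
	return scaleByValue(current, delta)
}

func main() {
	fmt.Println(scaleByValue(10, -3))        // 7
	fmt.Println(scaleByPercentage(10, 0.5))  // 15
	fmt.Println(scaleByPercentage(3, -0.25)) // 2
}
```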

Improve Code Quality - Config

The current code uses config structs too often, which is an anti-pattern here.
That idiom hides too much information from the caller. Instead, explicit parameters should be used, or the functional options pattern where the parameter list would otherwise get too long.
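
For illustration, the functional options pattern in Go could look roughly like this. The Scaler type, its fields, the default values and the option names are hypothetical and not taken from sokar's code base.

```go
package main

import (
	"fmt"
	"time"
)

// Scaler is a hypothetical component used only to show the pattern.
type Scaler struct {
	jobName  string
	minCount uint
	maxCount uint
	cooldown time.Duration
}

// Option changes one optional setting of a Scaler.
type Option func(*Scaler)

// WithBounds sets the allowed minimum and maximum count.
func WithBounds(minCount, maxCount uint) Option {
	return func(s *Scaler) { s.minCount, s.maxCount = minCount, maxCount }
}

// WithCooldown sets the pause between two scale actions.
func WithCooldown(d time.Duration) Option {
	return func(s *Scaler) { s.cooldown = d }
}

// NewScaler takes the required value as an explicit parameter and everything
// optional as options. The defaults are visible in one place.
func NewScaler(jobName string, opts ...Option) *Scaler {
	s := &Scaler{jobName: jobName, minCount: 1, maxCount: 10, cooldown: 20 * time.Second}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := NewScaler("fail-service", WithBounds(2, 50), WithCooldown(time.Minute))
	fmt.Printf("%+v\n", *s)
}
```

The caller sees at the call site exactly which settings deviate from the defaults, which is the information a config struct tends to hide.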

Sokar does not know about current state of AIA on deployment

The main source of information for sokar is the alerts received from alertmanager.

It seems that the alertmanager updates the state of its alerting targets only when there is a change in the alerts.
This leads to the problem that a newly deployed sokar does not know about the currently firing alerts.

Sokar is Able to Modify Autoscaling Group in Dry-Run Mode

Regardless of the dry-run parameter value, the watcher can issue draining commands and autoscaling group size reduction:

if err := s.openScalingTicket(expected, false); err != nil {

Reproduction:

  • Run sokar with SK_SCA_MODE:"nomad-dc" and SK_DRY_RUN:"1"
  • Modify autoscaling group to have desired size more than SK_SCALE_OBJECT_MAX
  • Observe autoscaling group modified by sokar after SK_SCA_WATCHER_INTERVAL time

Is this the desired behavior, or would it be better not to apply any changes to the infrastructure while sokar is in dry-run mode?
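
If the answer is that dry-run should never touch the infrastructure, one possible approach is to choose the scaling target implementation once, based on the dry-run flag, so that no caller (including a watcher) can bypass it. This is only a sketch: ScalingTarget, asgTarget, dryRunTarget and newTarget are hypothetical names and do not describe sokar's actual structure.

```go
package main

import "fmt"

// ScalingTarget is a hypothetical abstraction of the infrastructure to scale.
type ScalingTarget interface {
	AdjustCount(desired uint) error
}

// asgTarget stands in for a real autoscaling group client.
type asgTarget struct{ desired uint }

func (a *asgTarget) AdjustCount(desired uint) error {
	a.desired = desired
	return nil
}

// dryRunTarget records the intended change instead of applying it.
type dryRunTarget struct{ planned []uint }

func (d *dryRunTarget) AdjustCount(desired uint) error {
	d.planned = append(d.planned, desired)
	return nil
}

// newTarget decides in one place whether changes reach the infrastructure.
func newTarget(real ScalingTarget, dryRun bool) ScalingTarget {
	if dryRun {
		return &dryRunTarget{}
	}
	return real
}

func main() {
	real := &asgTarget{desired: 12}
	target := newTarget(real, true)
	if err := target.AdjustCount(10); err != nil {
		fmt.Println("adjust failed:", err)
	}
	fmt.Println(real.desired) // 12, the real group was not modified
}
```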

Instance Downscaling does not complete

With #106, one part of the problem that the nomad worker downscaling did not complete successfully was solved.
But tests showed that, depending on your AWS setup (type and number of EC2 instances) and the load on the nomad workers, the termination of an instance can take more than 3 minutes.

The problem is that in the current implementation sokar monitors (waits for) the termination of AWS EC2 instances for at most 3 minutes. If the termination takes more than 3 minutes, sokar assumes that the termination, and thus the whole downscaling, failed.
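
One way to address this would be to make the wait time configurable instead of fixed. The sketch below polls a check function until it succeeds or a caller-supplied timeout elapses. The function name waitForTermination and the isTerminated callback are assumptions for illustration; a real implementation would query AWS inside the callback.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForTermination polls isTerminated every interval until it reports true,
// returns an error, or the timeout elapses. The timeout is a parameter so it
// can come from configuration. The interval must be greater than zero.
func waitForTermination(timeout, interval time.Duration, isTerminated func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := isTerminated()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("termination not confirmed within %s: %w", timeout, ctx.Err())
		case <-ticker.C:
			// Next poll.
		}
	}
}

func main() {
	polls := 0
	err := waitForTermination(2*time.Second, 10*time.Millisecond, func() (bool, error) {
		polls++
		return polls >= 3, nil // pretend the instance is gone on the third poll
	})
	fmt.Println(err, polls)
}
```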

Documentation ScalerModes is missing

With the v0.0.6 release a mode for scaling datacenters was added alongside the feature of scaling nomad jobs.
The documentation is missing for both. It is especially important for the datacenter mode, because it relies on some preconditions of the infrastructure to be scaled (e.g. the ASG has to be tagged appropriately, AWS credentials have to be provided, ...).

Vendoring of generated nomad API does not work

In the current configuration, running dep ensure leads to build errors because the generated nomad API code is incompatible with the code generator used to create it.

Because of that, the vendoring folder had to be checked in.

Robustness against Deployments

Sokar modifies the job specification upon scaling the job. To be concrete, the count of the job is modified.
The CI/CD system also modifies the job specification, without knowing the current count of the job that was set by sokar.
Because of this conflicting information on the two systems, the CI/CD system overwrites the count that was set by sokar and thus reverts the scaling.

Sokar does not respect min/max on deployment

When sokar is deployed, it should directly check whether the current scale of the service is "out of bounds".
This means that if the service is currently deployed with less than the min or more than the max bound, sokar has to adjust it accordingly.
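
The bounds check itself is small. A sketch, assuming a hypothetical helper named boundedCount that sokar would call once at startup with the current count:

```go
package main

import "fmt"

// boundedCount returns the count the scale object should have and whether
// an adjustment is needed because the current count is out of bounds.
func boundedCount(current, minCount, maxCount uint) (target uint, adjust bool) {
	switch {
	case current < minCount:
		return minCount, true
	case current > maxCount:
		return maxCount, true
	default:
		return current, false
	}
}

func main() {
	fmt.Println(boundedCount(1, 2, 10))  // 2 true
	fmt.Println(boundedCount(12, 2, 10)) // 10 true
	fmt.Println(boundedCount(5, 2, 10))  // 5 false
}
```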

Sokar skipped scale action metric does not work in case the ScalingTarget is missing

Root Cause

The problem is that sokar only wants to fire this metric when a scale UP/DOWN is really needed.
Therefore sokar first evaluates the current state of the job to scale. To do so, it tries to obtain the current job count by querying the scaling target (i.e. nomad, see: https://github.com/ThomasObenaus/sokar/blob/master/scaler/scale.go#L101).

As a consequence, if no scaling target is defined, it fails before increasing the metric.

Options

Short-Term:

Fire the metric in dry-run mode each time the capacity-planner wants to adjust the count, without reflecting the actual scale state of the job.

Short-Term 2:

Implement a mocked scaling-target for AWS instances.

Mid-Term:

Implement a scaling target for AWS instances that is able to obtain the actual instance count.
Problem: AWS access key handling has to be implemented/provided, OR the instances sokar runs on must have IAM instance roles with the needed permissions.

Scheduled Scaling

For some use cases it is known at which point in time the load (number of requests) increases. Thus it is known upfront when the resources to handle the load are needed.
This means that, to be prepared, it would be sufficient to prescale the system or to schedule a specific scale of the system.

With this ticket the feature of scheduled scaling will be added to sokar.

Scheduled scaling means:
1. That sokar ensures that during a certain time span the scale of the system is not less than X and not more than Y.
2. That sokar still regards scaling alerts during this time span.
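
The two points can be sketched as follows: the count requested by the alert-driven planner is still used, but it is limited to the bounds of the schedule entry that is active at the moment. The scheduleEntry type, the plannedCount function and the representation of time as minutes since midnight are assumptions for illustration only.

```go
package main

import "fmt"

// scheduleEntry says: from startMinute (inclusive) to endMinute (exclusive),
// both in minutes since midnight, keep the scale within [minCount, maxCount].
type scheduleEntry struct {
	startMinute, endMinute int
	minCount, maxCount     uint
}

// plannedCount takes the count requested because of scaling alerts and limits
// it to the bounds of the first schedule entry active at minuteOfDay.
// Outside of any scheduled time span the requested count is returned as is.
func plannedCount(desired uint, minuteOfDay int, schedule []scheduleEntry) uint {
	for _, e := range schedule {
		if minuteOfDay < e.startMinute || minuteOfDay >= e.endMinute {
			continue
		}
		if desired < e.minCount {
			return e.minCount
		}
		if desired > e.maxCount {
			return e.maxCount
		}
		return desired
	}
	return desired
}

func main() {
	// Between 08:00 and 10:00 keep at least 10 and at most 30 instances.
	schedule := []scheduleEntry{{startMinute: 8 * 60, endMinute: 10 * 60, minCount: 10, maxCount: 30}}
	fmt.Println(plannedCount(4, 9*60, schedule))  // 10
	fmt.Println(plannedCount(15, 9*60, schedule)) // 15
	fmt.Println(plannedCount(4, 12*60, schedule)) // 4
}
```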

Refactor Job to ScaleObject - Part 2

With #64 the term job was changed to scale object in order to have a more generic term that also fits for the scaling mode for data-centers not only for scaling jobs.

Now in this ticket the clean up (i.e. removal of deprecated references) shall be done.

LogLevel Configurable

Currently the log level is fixed to Debug.
It should be configurable, so that one can choose between verbose output and only the needed information.

Downscaling of AWS instances fails (Throttling: Rate exceeded)

Since the monitoring of the termination of AWS instances contains a bug, the complete downscaling can fail. At least sokar thinks it failed.

Ticket applied. Scaling was failed (Error adjusting scalingObject count to 11: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0d9e1fcb-f49d-11e9-973f-f1b0ba24974b.). New count is 12. Scaling in 46.780182 .

Fill Readme

The readme of the project is the main entry point and has to contain at least a brief description of what sokar is about.

Move to sonar-cloud

Codacy seems to be buggy and slow, hence it would be better to move forward and try another tool.

Refactor Job to generic Name

Currently the term job is used for the object to be scaled.
But this does not fit any more and is even confusing when sokar is in scaler.mode "data-center".

Thus it would be nice to rename the term "Job" to something more generic.

e.g. scaling-object, object-to-scale, ...

Separate scaling cooldown by type

Currently sokar's cooldown mechanism does not regard the direction of the scaling.
This means that even though it is possible to configure different cooldowns for down- and up-scaling, it is not possible to scale up directly after a down-scaling.

Why?:

  • Sokar just memorizes the timestamp for the last scale action, no matter if it is a down- or up-scaling.
  • If configured differently sokar just waits longer/ less for the next down- / up-scaling event.
  • But in each case sokar will wait for the next down-scaling even if there was only an up-scaling event beforehand.

With this ticket the down- and up-scaling cooldown should be really separated from one another.
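
A sketch of what separated cooldowns could look like: one period and one timestamp per direction, so that an up-scaling is only blocked by a previous up-scaling and a down-scaling only by a previous down-scaling. The type and method names are hypothetical and do not reflect sokar's code.

```go
package main

import (
	"fmt"
	"time"
)

type direction int

const (
	up direction = iota
	down
)

// cooldowns keeps one period and one timestamp per scaling direction.
type cooldowns struct {
	upPeriod, downPeriod time.Duration
	lastUp, lastDown     time.Time
}

// record stores the time of a completed scale action for its direction only.
func (c *cooldowns) record(d direction, at time.Time) {
	if d == up {
		c.lastUp = at
		return
	}
	c.lastDown = at
}

// allowed reports whether a scale action in direction d may run at the given
// time. Only the last action in the same direction is considered.
func (c *cooldowns) allowed(d direction, now time.Time) bool {
	last, period := c.lastDown, c.downPeriod
	if d == up {
		last, period = c.lastUp, c.upPeriod
	}
	return last.IsZero() || now.Sub(last) >= period
}

func main() {
	c := &cooldowns{upPeriod: 20 * time.Second, downPeriod: 5 * time.Minute}
	now := time.Now()
	c.record(down, now)
	fmt.Println(c.allowed(up, now.Add(time.Second)))   // true
	fmt.Println(c.allowed(down, now.Add(time.Second))) // false
}
```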

Add License

Maybe it's also worth integrating https://fossa.com. Fossa enables automated license and compliance checks, even of the used dependencies.

Node draining for Nomad

In order to really support downscaling of nomad nodes, the jobs have to be drained from the node beforehand.

This ticket implements draining a node and then downscaling by removing only the drained nodes.

ScaleBy End-Point does not work due to cooling down

It looks like scaleBy triggered by the endpoint should ignore cooling down. I didn't manage to trigger upscaling when other alerts are active at the same time.

	May 3rd 2019, 18:31:07.000	No scale needed.	debug
	May 3rd 2019, 18:31:07.000	Aggregation	info
	May 3rd 2019, 18:31:07.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Scale DOWN.	info
	May 3rd 2019, 18:31:06.000	ScaleCounter updated by -1.000000 to -11.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:06.000	Aggregation	info
	May 3rd 2019, 18:31:06.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:06.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:06.000	Refresh gradient 0.000000. Scale needed.	debug
	May 3rd 2019, 18:31:06.000	Scale Event received: {-0.99995995}	info
	May 3rd 2019, 18:31:06.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:05.000	ScaleCounter updated by -1.000000 to -10.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:05.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:05.000	No scale needed.	debug
	May 3rd 2019, 18:31:05.000	Refresh gradient -1.000001. Evaluation period (10.000000s) exceeded.	debug
	May 3rd 2019, 18:31:05.000	Aggregation	info
	May 3rd 2019, 18:31:05.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Check job state (not implemented yet).	error
	May 3rd 2019, 18:31:04.000	No scale needed.	debug
	May 3rd 2019, 18:31:04.000	ScaleCounter updated by -1.000000 to -9.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:04.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
	May 3rd 2019, 18:31:04.000	ScaleAlertPool:	debug
	May 3rd 2019, 18:31:04.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	No scale needed.	debug
	May 3rd 2019, 18:31:03.000	Aggregation	info
	May 3rd 2019, 18:31:03.000	[2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755	debug
>>	May 3rd 2019, 18:31:03.000	Skip scale event. Sokar is cooling down.	info
	May 3rd 2019, 18:31:03.000	ScaleAlertPool:	debug
>>	May 3rd 2019, 18:31:03.000	ScaleBy Percentage Endpoint with '50 %' called.	info
	May 3rd 2019, 18:31:03.000	ScaleCounter updated by -1.000000 to -8.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	ScaleCounter updated by -1.000000 to -7.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps).	debug
	May 3rd 2019, 18:31:02.000	No scale needed.	debug
	May 3rd 2019, 18:31:02.000	Aggregation

Implement retry

Currently, if the scale (in/out) is interrupted by a deployment, it gets cancelled immediately.
This should be made more robust by adding at least a retry policy.
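
As a starting point, a simple retry policy could look like the sketch below: a fixed number of attempts with a delay that doubles after each failure. The retry function, the errInterrupted error and the chosen backoff are assumptions for illustration, not a description of how sokar handles this.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errInterrupted is a hypothetical error for a scale action that was
// interrupted, e.g. by a deployment.
var errInterrupted = errors.New("scaling interrupted by deployment")

// retry calls op up to attempts times and doubles the delay after each
// failed attempt. It returns nil as soon as op succeeds.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errInterrupted // the first two attempts are interrupted
		}
		return nil
	})
	fmt.Println(err, calls)
}
```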

CapacityPlanner Linear Mode

Implement a linear scaling mode for the CapacityPlanner.
Therefore it should respect the speed at which the scaleCounter changes (represented by the scaleFactor). This means the faster the scaleCounter grows/changes, the more instances shall be deployed at once.
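
A rough sketch of such a linear relation: the number of instances added or removed at once is proportional to the magnitude of the scaleFactor, and the sign decides the direction. The function name linearPlan and the instancesPerUnit parameter (assumed to be positive) are hypothetical, and the exact mapping is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// linearPlan returns the new count for a given scaleFactor. The step size
// grows linearly with the magnitude of the scaleFactor and is rounded up,
// so any non-zero scaleFactor changes the count by at least one instance.
func linearPlan(current uint, scaleFactor, instancesPerUnit float64) uint {
	if scaleFactor == 0 {
		return current
	}
	step := uint(math.Ceil(math.Abs(scaleFactor) * instancesPerUnit))
	if scaleFactor > 0 {
		return current + step
	}
	if step >= current {
		return 0
	}
	return current - step
}

func main() {
	fmt.Println(linearPlan(10, 2.0, 1.5)) // 13
	fmt.Println(linearPlan(10, -0.5, 1))  // 9
	fmt.Println(linearPlan(1, -5, 1))     // 0
}
```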

Metrics for allocated resources

For scaling a nomad data-center, information is needed about how many resources would be required if all the currently running jobs were scaled by x (e.g. 1).

This number could be taken to decide for a scale up/ down.

At best the x is configurable to give the user a bit more flexibility.
