thomasobenaus / sokar
Alert-based Auto Scaler for Nomad
License: GNU Lesser General Public License v3.0
Make sokar deployable as a service in Nomad.
Implement a step-wise mode for the CapacityPlanner.
This means it should be configurable, in a step-wise manner, how many instances the CapacityPlanner shall scale by (respecting the scaleFactor).
Currently this is hardcoded.
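A step-wise configuration could map scaleFactor thresholds to step sizes. The following is a minimal sketch of that idea; `scalingStep` and `stepsFor` are hypothetical names, not sokar's actual API:

```go
package main

import "fmt"

// scalingStep maps a lower bound of the scaleFactor to the number of
// instances to scale by at once. Illustrative names, not sokar's API.
type scalingStep struct {
	lowerBound float32 // applies when |scaleFactor| >= lowerBound
	step       uint    // instances to scale by
}

// stepsFor returns the configured step size for the given scaleFactor.
// The steps slice must be sorted by ascending lowerBound.
func stepsFor(scaleFactor float32, steps []scalingStep) uint {
	if scaleFactor < 0 {
		scaleFactor = -scaleFactor
	}
	result := uint(1) // default: scale by one instance
	for _, s := range steps {
		if scaleFactor >= s.lowerBound {
			result = s.step
		}
	}
	return result
}

func main() {
	steps := []scalingStep{{0.5, 1}, {1.0, 2}, {2.0, 5}}
	fmt.Println(stepsFor(1.5, steps)) // 2
}
```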
See: https://github.com/ThomasObenaus/sokar/blob/master/scaleAlertAggregator/scaleAlertPool.go#L46
To understand the metrics and their use it is crucial to document them accordingly.
Currently only Nomad jobs can be scaled.
It would also be nice if sokar were able to scale a group of AWS EC2 instances (an AutoScalingGroup).
If the value set at https://github.com/ThomasObenaus/sokar/blob/master/scaler/watchScalingObject_test.go#L67 is changed afterwards, the test fails.
For manual intervention a ScaleBy endpoint would be nice.
ScaleBy - relative scaling based on count and percentage.
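The relative-scaling arithmetic could look like the sketch below; `scaleByPercentage` and `scaleByCount` are hypothetical helpers, not sokar's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// scaleByPercentage computes the new count for a relative scaling
// request, e.g. +50% or -20%. Results are clamped at zero.
func scaleByPercentage(current uint, percentage float32) uint {
	delta := float64(current) * float64(percentage) / 100.0
	result := float64(current) + delta
	if result < 0 {
		return 0
	}
	return uint(math.Round(result))
}

// scaleByCount computes the new count for an absolute delta, e.g. +2.
func scaleByCount(current uint, delta int) uint {
	result := int(current) + delta
	if result < 0 {
		return 0
	}
	return uint(result)
}

func main() {
	fmt.Println(scaleByPercentage(10, 50)) // 15
	fmt.Println(scaleByCount(10, -12))     // 0
}
```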
In the current code the anti-pattern of config structs is used too often.
That idiom hides too much information from the caller. Instead, explicit parameters should be used, or the pattern of functional options in case the parameter list would get too big.
The main source of information for sokar is the alerts received from the alertmanager.
It seems to be the case that the alertmanager updates the state of its alerting targets only when there is a change on the alerts.
This leads to the problem that a newly deployed sokar does not know about the currently firing alerts.
Regardless of the dry-run parameter value, the watcher can issue draining commands and reduce the autoscaling group size:
sokar/scaler/watchScalingObject.go
Line 46 in f1f98ad
Reproduction:
SK_SCA_MODE: "nomad-dc"
SK_DRY_RUN: "1"
SK_SCALE_OBJECT_MAX
SK_SCA_WATCHER_INTERVAL time
Is it the desired behavior, or would it be better not to apply any changes to the infrastructure if sokar is in dry-run?
With #106 one part of the problem that the Nomad worker downscaling did not complete successfully was solved.
But tests showed that, depending on the AWS setup (type and number of EC2 instances) and the load on the Nomad workers, the termination of an instance can take more than 3 minutes.
The problem is that in the current implementation sokar monitors (waits for) the termination of AWS EC2 instances for at most 3 minutes. If the termination takes more than 3 minutes, sokar assumes that the termination, and thus the whole downscaling, failed.
With the v0.0.6 release a mode for scaling datacenters was added besides the feature of scaling Nomad jobs.
For both the documentation is missing. Especially for the datacenter mode it is important, because it relies on some preconditions of the infrastructure to be scaled (i.e. the ASG has to be tagged appropriately, AWS credentials have to be provided, ...).
E.g. sca.nomad.mode is deprecated. This should be cleaned up (see: Config.md).
In the current configuration a dep ensure will lead to build errors due to an incompatibility between the generated Nomad API code and the code-generator used for it.
Due to that, the vendor folder had to be checked in.
Provide a version endpoint.
E.g. https://github.com/povilasv/prommod is worth a look for this.
Create an endpoint where one could easily request the currently used configuration of sokar.
Sokar modifies the job specification upon scaling the job; concretely, the count of the job is modified.
The CI/CD system also modifies the job specification, without knowing the current count of the job that was set by sokar.
This conflicting information on both sides leads to the CI/CD system overwriting the count that was set by sokar and thus reverting the scaling.
When sokar is deployed it should directly check whether the current scale of the service is "out of bounds".
This means if the service is currently deployed with less than the min or more than the max bound, sokar has to adjust it accordingly.
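The bounds check itself is a simple clamp; `clampCount` below is an illustrative helper, not sokar's actual code:

```go
package main

import "fmt"

// clampCount brings the current scale back into [min, max] if it is
// out of bounds and returns it unchanged otherwise.
func clampCount(current, min, max uint) uint {
	if current < min {
		return min
	}
	if current > max {
		return max
	}
	return current
}

func main() {
	fmt.Println(clampCount(1, 2, 10))  // below min -> 2
	fmt.Println(clampCount(15, 2, 10)) // above max -> 10
}
```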
The problem is that sokar only wants to fire this metric in case a scale UP/DOWN is really needed.
Therefore sokar first evaluates the current state of the job to scale. Thus it tries to obtain the current job count by requesting the scaling-target (i.e. Nomad, see: https://github.com/ThomasObenaus/sokar/blob/master/scaler/scale.go#L101).
As a consequence, if there is no scaling target defined, it fails before increasing the metric.
Fire the metric in case of a dry run each time the capacity-planner wants to adjust the count, without reflecting the actual scale state of the job.
Implement a mocked scaling-target for AWS instances.
Implement a scaling target for AWS instances that is able to obtain the actual instance count.
Problem: AWS access key handling has to be implemented/provided, OR the instances on which sokar runs have to have IAM instance roles with the needed permissions.
For some use-cases it is known at which point in time the load (number of requests) increases. Thus it is known upfront when the resources to handle the load are needed.
This means, to be prepared, it would be sufficient to prescale the system or to schedule a specific scale of the system.
With this ticket the feature of scheduled scaling will be added to sokar.
Scheduled scaling means:
1. That sokar ensures that during a certain time span the scale of the system is not less than X and not more than Y.
2. That sokar still regards scaling alerts during this time span.
With #64 the term job was changed to scale object in order to have a more generic term that also fits the scaling mode for data-centers, not only for scaling jobs.
Now in this ticket the clean up (i.e. removal of deprecated references) shall be done.
Currently the used log-level is Debug.
But it should be configurable, so one can decide between verbose output and only the needed information.
In sca.dc mode (datacenter mode) sokar is not able to change the desired count of the auto scaling group of the data-center it is responsible for.
The AWS region has to be provided in order to use a non-profile AWS session.
Currently this is hard-coded (see: https://github.com/ThomasObenaus/sokar/blob/master/nomadWorker/session.go#L20).
Since the monitoring of the termination of AWS instances contains a bug, the complete downscaling can fail. At least sokar thinks it failed.
Ticket applied. Scaling was failed (Error adjusting scalingObject count to 11: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0d9e1fcb-f49d-11e9-973f-f1b0ba24974b.). New count is 12. Scaling in 46.780182 .
The readme of the project is the main entry point and at least has to contain a brief description what sokar is about.
Codacy seems to be buggy and slow, hence it would be better to move forward and try another tool.
Currently the term job is used for the objective to be scaled.
But this does not fit any more and is even confusing when sokar is in scaler.mode "data-center".
Thus it would be nice to rename the term "job" to something more generic,
e.g. scaling-object, object-to-scale, ...
Documentation of the API is not available; thus it is cumbersome to understand/work with sokar.
Switch to:
For testing a no-scale or dry run mode is needed.
Currently sokar's cooldown mechanism does not regard the direction of the scaling.
This means even though it is possible to configure different cooldowns for down- and up-scaling, it is not possible to directly scale up after a down-scaling.
Why?:
With this ticket the down- and up-scaling cooldowns should be really separated from one another.
All components of sokar shall expose their state by metrics.
Maybe it's also worth integrating https://fossa.com. Fossa enables automated license and compliance checks, even of the used dependencies.
In order to really support downscaling for Nomad nodes, the jobs have to be drained from a node beforehand.
The feature of draining a node and then downscaling by removing only the drained nodes will be implemented with this ticket.
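The drain-then-terminate flow could be structured like the sketch below; `nodeDrainer`, `downscale`, and the fake implementation are illustrative stand-ins for the real Nomad API and AWS ASG calls:

```go
package main

import "fmt"

// nodeDrainer abstracts the operations needed for a safe downscale;
// in practice these would be backed by Nomad and AWS ASG calls.
type nodeDrainer interface {
	Drain(nodeID string) error
	Terminate(nodeID string) error
}

// downscale drains each node first, so its running jobs are
// rescheduled elsewhere, and only then terminates it.
func downscale(d nodeDrainer, nodeIDs []string) error {
	for _, id := range nodeIDs {
		if err := d.Drain(id); err != nil {
			return fmt.Errorf("draining node %s failed: %w", id, err)
		}
		if err := d.Terminate(id); err != nil {
			return fmt.Errorf("terminating node %s failed: %w", id, err)
		}
	}
	return nil
}

// fakeDrainer records the order of operations for demonstration.
type fakeDrainer struct{ ops []string }

func (f *fakeDrainer) Drain(id string) error     { f.ops = append(f.ops, "drain:"+id); return nil }
func (f *fakeDrainer) Terminate(id string) error { f.ops = append(f.ops, "term:"+id); return nil }

func main() {
	f := &fakeDrainer{}
	_ = downscale(f, []string{"n1", "n2"})
	fmt.Println(f.ops)
}
```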
It looks like a scaleBy triggered via the endpoint should ignore the cooldown. I didn't manage to trigger upscaling when other alerts are active at the same time.
May 3rd 2019, 18:31:07.000 No scale needed. debug
May 3rd 2019, 18:31:07.000 Aggregation info
May 3rd 2019, 18:31:07.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:06.000 Scale DOWN. info
May 3rd 2019, 18:31:06.000 ScaleCounter updated by -1.000000 to -11.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:06.000 Aggregation info
May 3rd 2019, 18:31:06.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:06.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:06.000 Refresh gradient 0.000000. Scale needed. debug
May 3rd 2019, 18:31:06.000 Scale Event received: {-0.99995995} info
May 3rd 2019, 18:31:06.000 Skip scale event. Sokar is cooling down. info
May 3rd 2019, 18:31:05.000 ScaleCounter updated by -1.000000 to -10.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:05.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:05.000 No scale needed. debug
May 3rd 2019, 18:31:05.000 Refresh gradient -1.000001. Evaluation period (10.000000s) exceeded. debug
May 3rd 2019, 18:31:05.000 Aggregation info
May 3rd 2019, 18:31:05.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:04.000 Check job state (not implemented yet). error
May 3rd 2019, 18:31:04.000 No scale needed. debug
May 3rd 2019, 18:31:04.000 ScaleCounter updated by -1.000000 to -9.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:04.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:04.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:04.000 Aggregation info
May 3rd 2019, 18:31:03.000 No scale needed. debug
May 3rd 2019, 18:31:03.000 Aggregation info
May 3rd 2019, 18:31:03.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
>> May 3rd 2019, 18:31:03.000 Skip scale event. Sokar is cooling down. info
May 3rd 2019, 18:31:03.000 ScaleAlertPool: debug
>> May 3rd 2019, 18:31:03.000 ScaleBy Percentage Endpoint with '50 %' called. info
May 3rd 2019, 18:31:03.000 ScaleCounter updated by -1.000000 to -8.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:02.000 ScaleCounter updated by -1.000000 to -7.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:02.000 No scale needed. debug
May 3rd 2019, 18:31:02.000 Aggregation
This is the preparation for separating the UI end-points from those that represent the API of sokar.
Currently, if the scale (in/out) is interrupted by a deployment it gets cancelled immediately.
This should be implemented a bit more robustly by adding at least a retry policy.
More information see:
Implement a linear scaling mode for the CapacityPlanner.
It should respect the speed at which the scaleCounter changes (represented by the scaleFactor). This means the faster the scaleCounter grows/changes, the more instances shall be deployed at once.
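Linear here could mean deriving the step size directly from the scaleFactor. A minimal sketch; `linearStep` and its gradient parameter are hypothetical names, not sokar's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// linearStep derives how many instances to scale at once from the
// scaleFactor: the faster the scaleCounter changes, the bigger the
// step. instancesPerFactorUnit is the configurable gradient.
func linearStep(scaleFactor, instancesPerFactorUnit float32) uint {
	step := math.Abs(float64(scaleFactor)) * float64(instancesPerFactorUnit)
	return uint(math.Max(1, math.Ceil(step))) // always scale at least one
}

func main() {
	fmt.Println(linearStep(0.2, 2)) // 1
	fmt.Println(linearStep(2.5, 2)) // 5
}
```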
The reason seems to be the global (singleton) metrics collector.
For scaling a Nomad data-center, the information about how many resources would be needed in case all the currently running jobs were scaled by x (e.g. 1) is needed.
This number could be used to decide for a scale up/down.
At best, x is configurable to give the user a bit more flexibility.
It looks like only the upscaling cooldown is applied and the downscaling cooldown is ignored.