thomasobenaus / sokar
Alert-based Auto Scaler for Nomad
License: GNU Lesser General Public License v3.0
Make sokar deployable as a service in Nomad.
Implement a step-wise mode for the CapacityPlanner.
This means it should be configurable, in a step-wise manner, how many instances the CapacityPlanner shall scale by (respecting the scaleFactor).
Currently this is hardcoded.
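A step-wise configuration could map scaleFactor thresholds to step sizes. The following is a minimal sketch of that idea; `scalingStep` and `stepsFor` are hypothetical names, not sokar's actual API:

```go
package main

import "fmt"

// scalingStep maps a lower bound of the scaleFactor to the number of
// instances to scale by at once. Illustrative names, not sokar's API.
type scalingStep struct {
	lowerBound float32 // applies when |scaleFactor| >= lowerBound
	step       uint    // instances to scale by
}

// stepsFor returns the configured step size for the given scaleFactor.
// The steps slice must be sorted by ascending lowerBound.
func stepsFor(scaleFactor float32, steps []scalingStep) uint {
	if scaleFactor < 0 {
		scaleFactor = -scaleFactor
	}
	result := uint(1) // default: scale by one instance
	for _, s := range steps {
		if scaleFactor >= s.lowerBound {
			result = s.step
		}
	}
	return result
}

func main() {
	steps := []scalingStep{{0.5, 1}, {1.0, 2}, {2.0, 5}}
	fmt.Println(stepsFor(1.5, steps)) // 2
}
```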
See: https://github.com/ThomasObenaus/sokar/blob/master/scaleAlertAggregator/scaleAlertPool.go#L46
To understand the metrics and their use it is crucial to document them accordingly.
Currently only Nomad jobs can be scaled.
It would also be nice if sokar were able to scale a group of AWS EC2 instances (an AutoScalingGroup).
If the value set at https://github.com/ThomasObenaus/sokar/blob/master/scaler/watchScalingObject_test.go#L67 is changed afterwards, the test fails.
For manual intervention a ScaleBy endpoint would be nice.
ScaleBy - relative scaling based on count and percentage.
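The relative-scaling arithmetic could look like the sketch below; `scaleByPercentage` and `scaleByCount` are hypothetical helpers, not sokar's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// scaleByPercentage computes the new count for a relative scaling
// request, e.g. +50% or -20%. Results are clamped at zero.
func scaleByPercentage(current uint, percentage float32) uint {
	delta := float64(current) * float64(percentage) / 100.0
	result := float64(current) + delta
	if result < 0 {
		return 0
	}
	return uint(math.Round(result))
}

// scaleByCount computes the new count for an absolute delta, e.g. +2.
func scaleByCount(current uint, delta int) uint {
	result := int(current) + delta
	if result < 0 {
		return 0
	}
	return uint(result)
}

func main() {
	fmt.Println(scaleByPercentage(10, 50)) // 15
	fmt.Println(scaleByCount(10, -12))     // 0
}
```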
In the current code the anti-pattern of config structs is used too often.
That idiom hides too much information from the caller. Instead, explicit parameters should be used, or the pattern of functional options in case the parameter list would get too big.
The main source of information for sokar is the alerts received from the alertmanager.
It seems to be the case that the alertmanager updates the state of its alerting targets only when there is a change on the alerts.
This leads to the problem that a newly deployed sokar does not know about the currently firing alerts.
Regardless of the dry-run parameter value, the watcher can issue draining commands and reduce the autoscaling group size:
sokar/scaler/watchScalingObject.go
Line 46 in f1f98ad
Reproduction:
SK_SCA_MODE: "nomad-dc"
SK_DRY_RUN: "1"
SK_SCALE_OBJECT_MAX
SK_SCA_WATCHER_INTERVAL time
Is it the desired behavior, or would it be better not to apply any changes to the infrastructure if sokar is in dry-run?
With #106 one part of the problem that the Nomad worker downscaling did not complete successfully was solved.
But tests showed that, depending on the AWS setup (type and number of EC2 instances) and the load on the Nomad workers, the termination of an instance can take more than 3 minutes.
The problem is that in the current implementation sokar monitors (waits for) the termination of AWS EC2 instances for at most 3 minutes. If the termination takes more than 3 minutes, sokar assumes that the termination, and thus the whole downscaling, failed.
With the v0.0.6 release a mode for scaling datacenters was added besides the feature of scaling Nomad jobs.
For both the documentation is missing. Especially for the datacenter mode it is important, because it relies on some preconditions of the infrastructure to be scaled (i.e. the ASG has to be tagged appropriately, AWS credentials have to be provided, ...).
E.g. sca.nomad.mode is deprecated. This should be cleaned up (see: Config.md).
In the current configuration a dep ensure will lead to build errors due to an incompatibility between the generated Nomad API code and the code-generator used for it.
Due to that, the vendor folder had to be checked in.
Provide a version endpoint.
E.g. https://github.com/povilasv/prommod is worth a look for this.
Create an endpoint where one could easily request the currently used configuration of sokar.
Sokar modifies the job specification upon scaling the job; concretely, the count of the job is modified.
The CI/CD system also modifies the job specification, without knowing the current count of the job that was set by sokar.
This conflicting information on both sides leads to the CI/CD system overwriting the count that was set by sokar and thus reverting the scaling.
When sokar is deployed it should directly check whether the current scale of the service is "out of bounds".
This means if the service is currently deployed with less than the min or more than the max bound, sokar has to adjust it accordingly.
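The bounds check itself is a simple clamp; `clampCount` below is an illustrative helper, not sokar's actual code:

```go
package main

import "fmt"

// clampCount brings the current scale back into [min, max] if it is
// out of bounds and returns it unchanged otherwise.
func clampCount(current, min, max uint) uint {
	if current < min {
		return min
	}
	if current > max {
		return max
	}
	return current
}

func main() {
	fmt.Println(clampCount(1, 2, 10))  // below min -> 2
	fmt.Println(clampCount(15, 2, 10)) // above max -> 10
}
```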
The problem is that sokar only wants to fire this metric in case a scale UP/DOWN is really needed.
Therefore sokar first evaluates the current state of the job to scale. Thus it tries to obtain the current job count by requesting the scaling-target (i.e. Nomad, see: https://github.com/ThomasObenaus/sokar/blob/master/scaler/scale.go#L101).
As a consequence, if there is no scaling target defined, it fails before increasing the metric.
Fire the metric in case of a dry run each time the capacity-planner wants to adjust the count, without reflecting the actual scale state of the job.
Implement a mocked scaling-target for AWS instances.
Implement a scaling target for AWS instances that is able to obtain the actual instance count.
Problem: AWS access key handling has to be implemented/provided, OR the instances on which sokar runs have to have IAM instance roles with the needed permissions.
For some use-cases it is known at which point in time the load (number of requests) increases. Thus it is known upfront when the resources to handle the load are needed.
This means, to be prepared, it would be sufficient to prescale the system or to schedule a specific scale of the system.
With this ticket the feature of scheduled scaling will be added to sokar.
Scheduled scaling means:
1. That sokar ensures that during a certain time span the scale of the system is not less than X and not more than Y.
2. That sokar still regards scaling alerts during this time span.
With #64 the term job was changed to scale object in order to have a more generic term that also fits the scaling mode for data-centers, not only for scaling jobs.
Now in this ticket the clean up (i.e. removal of deprecated references) shall be done.
Currently the used log-level is Debug.
But it should be configurable, so one can decide between verbose output and only the needed information.
In sca.dc mode (datacenter mode) sokar is not able to change the desired count of the auto scaling group of the data-center it is responsible for.
The AWS region has to be provided in order to use a non-profile AWS session.
Currently this is hard-coded (see: https://github.com/ThomasObenaus/sokar/blob/master/nomadWorker/session.go#L20).
Since the monitoring of the termination of AWS instances contains a bug, the complete downscaling can fail. At least sokar thinks it failed.
Ticket applied. Scaling was failed (Error adjusting scalingObject count to 11: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0d9e1fcb-f49d-11e9-973f-f1b0ba24974b.). New count is 12. Scaling in 46.780182 .
The readme of the project is the main entry point and at least has to contain a brief description what sokar is about.
Codacy seems to be buggy and slow, hence it would be better to move forward and try another tool.
Currently the term job is used for the objective to be scaled.
But this does not fit any more and is even confusing when sokar is in scaler.mode "data-center".
Thus it would be nice to rename the term "job" to something more generic,
e.g. scaling-object, object-to-scale, ...
Documentation of the API is not available; thus it is cumbersome to understand/work with sokar.
Switch to:
For testing a no-scale or dry run mode is needed.
Currently sokar's cooldown mechanism does not regard the direction of the scaling.
This means even though it is possible to configure different cooldowns for down- and up-scaling, it is not possible to directly scale up after a down-scaling.
Why?:
With this ticket the down- and up-scaling cooldowns should be really separated from one another.
All components of sokar shall expose their state by metrics.
Maybe it's also worth integrating https://fossa.com. Fossa enables automated license and compliance checks, even of the used dependencies.
In order to really support downscaling for Nomad nodes, the jobs have to be drained from a node beforehand.
The feature of draining a node and then downscaling by removing only the drained nodes will be implemented with this ticket.
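The drain-then-terminate flow could be structured like the sketch below; `nodeDrainer`, `downscale`, and the fake implementation are illustrative stand-ins for the real Nomad API and AWS ASG calls:

```go
package main

import "fmt"

// nodeDrainer abstracts the operations needed for a safe downscale;
// in practice these would be backed by Nomad and AWS ASG calls.
type nodeDrainer interface {
	Drain(nodeID string) error
	Terminate(nodeID string) error
}

// downscale drains each node first, so its running jobs are
// rescheduled elsewhere, and only then terminates it.
func downscale(d nodeDrainer, nodeIDs []string) error {
	for _, id := range nodeIDs {
		if err := d.Drain(id); err != nil {
			return fmt.Errorf("draining node %s failed: %w", id, err)
		}
		if err := d.Terminate(id); err != nil {
			return fmt.Errorf("terminating node %s failed: %w", id, err)
		}
	}
	return nil
}

// fakeDrainer records the order of operations for demonstration.
type fakeDrainer struct{ ops []string }

func (f *fakeDrainer) Drain(id string) error     { f.ops = append(f.ops, "drain:"+id); return nil }
func (f *fakeDrainer) Terminate(id string) error { f.ops = append(f.ops, "term:"+id); return nil }

func main() {
	f := &fakeDrainer{}
	_ = downscale(f, []string{"n1", "n2"})
	fmt.Println(f.ops)
}
```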
It looks like a scaleBy triggered via the endpoint should ignore the cooldown. I didn't manage to trigger upscaling when other alerts are active at the same time.
May 3rd 2019, 18:31:07.000 No scale needed. debug
May 3rd 2019, 18:31:07.000 Aggregation info
May 3rd 2019, 18:31:07.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:06.000 Scale DOWN. info
May 3rd 2019, 18:31:06.000 ScaleCounter updated by -1.000000 to -11.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:06.000 Aggregation info
May 3rd 2019, 18:31:06.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:06.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:06.000 Refresh gradient 0.000000. Scale needed. debug
May 3rd 2019, 18:31:06.000 Scale Event received: {-0.99995995} info
May 3rd 2019, 18:31:06.000 Skip scale event. Sokar is cooling down. info
May 3rd 2019, 18:31:05.000 ScaleCounter updated by -1.000000 to -10.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:05.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:05.000 No scale needed. debug
May 3rd 2019, 18:31:05.000 Refresh gradient -1.000001. Evaluation period (10.000000s) exceeded. debug
May 3rd 2019, 18:31:05.000 Aggregation info
May 3rd 2019, 18:31:05.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:04.000 Check job state (not implemented yet). error
May 3rd 2019, 18:31:04.000 No scale needed. debug
May 3rd 2019, 18:31:04.000 ScaleCounter updated by -1.000000 to -9.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:04.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
May 3rd 2019, 18:31:04.000 ScaleAlertPool: debug
May 3rd 2019, 18:31:04.000 Aggregation info
May 3rd 2019, 18:31:03.000 No scale needed. debug
May 3rd 2019, 18:31:03.000 Aggregation info
May 3rd 2019, 18:31:03.000 [2360150543] fire=true,start=2019-05-03 16:08:30.307334594 +0000 UTC,exp=2019-05-03 16:31:00.328831153 +0000 UTC m=+5441.140788755 debug
>> May 3rd 2019, 18:31:03.000 Skip scale event. Sokar is cooling down. info
May 3rd 2019, 18:31:03.000 ScaleAlertPool: debug
>> May 3rd 2019, 18:31:03.000 ScaleBy Percentage Endpoint with '50 %' called. info
May 3rd 2019, 18:31:03.000 ScaleCounter updated by -1.000000 to -8.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:02.000 ScaleCounter updated by -1.000000 to -7.000000. Scaling-Alert: 'DownScalingRCWCpuUsage' (-1.000000 wps). debug
May 3rd 2019, 18:31:02.000 No scale needed. debug
May 3rd 2019, 18:31:02.000 Aggregation
This is the preparation for separating the UI end-points from those that represent the API of sokar.
Currently, if the scale (in/out) is interrupted by a deployment it gets cancelled immediately.
This should be implemented a bit more robustly by adding at least a retry policy.
More information see:
Implement a linear scaling mode for the CapacityPlanner.
It should respect the speed at which the scaleCounter changes (represented by the scaleFactor). This means the faster the scaleCounter grows/changes, the more instances shall be deployed at once.
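Linear here could mean deriving the step size directly from the scaleFactor. A minimal sketch; `linearStep` and its gradient parameter are hypothetical names, not sokar's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// linearStep derives how many instances to scale at once from the
// scaleFactor: the faster the scaleCounter changes, the bigger the
// step. instancesPerFactorUnit is the configurable gradient.
func linearStep(scaleFactor, instancesPerFactorUnit float32) uint {
	step := math.Abs(float64(scaleFactor)) * float64(instancesPerFactorUnit)
	return uint(math.Max(1, math.Ceil(step))) // always scale at least one
}

func main() {
	fmt.Println(linearStep(0.2, 2)) // 1
	fmt.Println(linearStep(2.5, 2)) // 5
}
```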
The reason seems to be the global (singleton) metrics collector.
For scaling a Nomad data-center, the information about how many resources would be needed in case all the currently running jobs were scaled by x (e.g. 1) is needed.
This number could be used to decide for a scale up/down.
At best, x is configurable to give the user a bit more flexibility.
It looks like only the upscaling cooldown is applied and the downscaling cooldown is ignored.