chaos-lambda's Introduction

About

EC2 instances are volatile and can be recycled at any time without warning. Amazon recommends running them under Auto Scaling Groups to ensure overall service availability, but it's easy to forget that instances can suddenly fail until it happens in the early hours of the morning when everyone is on holiday.

Chaos Lambda increases the rate at which these failures occur during business hours, helping teams to build services that handle them gracefully.

Quick setup

Create the lambda function in the region you want it to target using the cloudformation/templates/lambda_standalone.json CloudFormation template. There are two parameters you may want to change:

  • Schedule: change if the default run times don't suit you (once per hour between 10am UTC and 4pm UTC, Monday to Friday); see http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html for documentation on the syntax.
  • DefaultProbability: by default all Auto Scaling Groups in the region are targets; set this to 0.0 and only ASGs with a chaos-lambda-termination tag (see below) will be affected.
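As a rough illustration of the two parameters above, the following sketch builds a parameter list in the shape that CloudFormation's CreateStack API accepts. The helper name build_stack_parameters is hypothetical, and the cron expression is only an assumed example of a schedule matching the described default; check the template for the literal default value.

```python
def build_stack_parameters(schedule=None, default_probability=None):
    """Build a CloudFormation Parameters list, omitting anything left at its default.

    Hypothetical helper for illustration; it is not part of chaos-lambda.
    """
    params = []
    if schedule is not None:
        params.append({"ParameterKey": "Schedule", "ParameterValue": schedule})
    if default_probability is not None:
        if not 0.0 <= default_probability <= 1.0:
            raise ValueError("DefaultProbability must be between 0.0 and 1.0")
        params.append({
            "ParameterKey": "DefaultProbability",
            "ParameterValue": str(default_probability),
        })
    return params


# Opt-in mode: only ASGs carrying a chaos-lambda-termination tag are affected.
# The schedule below is an assumed example (hourly, 10:00-16:00 UTC, Mon-Fri).
params = build_stack_parameters(
    schedule="cron(0 10-16 ? * MON-FRI *)",
    default_probability=0.0,
)

# The list could then be passed to boto3, for example:
#   boto3.client("cloudformation").create_stack(
#       StackName="chaos-lambda",          # assumed stack name
#       TemplateBody=template_text,        # contents of lambda_standalone.json
#       Parameters=params,
#       Capabilities=["CAPABILITY_IAM"],   # assumed to be required by the template
#   )
```

Leaving both arguments unset yields an empty list, so the stack falls back to the template defaults.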

Notifications

Termination Topic

Deploying the lambda_standalone.json CloudFormation template creates an SNS topic named ChaosLambdaTerminationTopic. For each instance that is terminated, a notification with the following structure is published to the topic:

{
  "event_name": "chaos_lambda.terminating",
  "asg_name": "my-autoscaling-group",
  "instance_id": "i-00001234"
}

By default, no subscriptions are created to this topic, so it is up to you to subscribe a queue or another lambda if you wish.
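If another lambda is subscribed to the topic, it might unpack the notification along these lines. This is a minimal sketch, not part of chaos-lambda: the handler name and return value are hypothetical, and it assumes the standard SNS-to-Lambda event shape in which each record carries the published JSON as a string under Sns.Message.

```python
import json


def handler(event, context):
    """Hypothetical subscriber: collect (asg_name, instance_id) pairs from SNS records."""
    terminations = []
    for record in event.get("Records", []):
        # SNS delivers the published payload as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("event_name") != "chaos_lambda.terminating":
            continue  # ignore anything that is not a termination notice
        terminations.append((message["asg_name"], message["instance_id"]))
        print("Chaos Lambda is terminating %s in %s"
              % (message["instance_id"], message["asg_name"]))
    return terminations
```

A real subscriber would replace the print call with whatever reaction is needed, such as posting to a chat channel or recording a metric.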

Failure topic

To receive notifications if the lambda function fails for any reason, create another stack using the cloudformation/templates/alarms.json template. This takes the lambda function name (something similar to chaos-lambda-ChaosLambdaFunction-EM2XNWWNZTPW) and the email address to send the alerts to.

Probability of termination

Every time the lambda triggers it examines all the Auto Scaling Groups in the region and potentially terminates one instance in each. The probability of termination can be changed at the ASG level with a tag, and at a global level with the DefaultProbability stack parameter.

At the ASG level the probability can be controlled by adding a chaos-lambda-termination tag with a value between 0.0 (never terminate) and 1.0 (always terminate). Typically this would be used to opt out a legacy system (0.0).

The DefaultProbability parameter sets the probability of termination for any ASG without a valid chaos-lambda-termination tag. If set to 0.0 the system becomes "opt-in", where any ASG without this tag is ignored. The default is 0.166 (or 1 in 6).
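The rules above can be sketched as follows. This is an illustration of the described behaviour rather than the project's actual code: the function names are hypothetical, and tags are assumed to have already been collapsed into a plain dict of key to value.

```python
import random

TAG_NAME = "chaos-lambda-termination"


def termination_probability(tags, default_probability):
    """Return the ASG's tag value if it is a valid probability, else the default."""
    value = tags.get(TAG_NAME)
    if value is None:
        return default_probability
    try:
        probability = float(value)
    except ValueError:
        return default_probability  # e.g. "not often"
    if not 0.0 <= probability <= 1.0:
        return default_probability  # out of range (also rejects NaN)
    return probability


def should_terminate(tags, default_probability, rng=random.random):
    """Roll once for an ASG; rng() is expected to return a float in [0.0, 1.0)."""
    return rng() < termination_probability(tags, default_probability)
```

Because random.random() never returns 1.0, a probability of 1.0 always terminates and 0.0 never does, which matches the meaning of the two extremes given above.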

Enabling/disabling

The lambda is triggered by a CloudWatch Events rule, the name of which can be found from the ChaosLambdaFunctionOutput output of the lambda stack. Locate this rule in the AWS console under the Rules section of the CloudWatch service, and you can disable or enable it via the Actions button.
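The same toggle can be scripted instead of using the console. The sketch below assumes a boto3 CloudWatch Events client (boto3.client("events")), whose enable_rule and disable_rule methods take the rule name; the helper name set_rule_enabled is hypothetical.

```python
def set_rule_enabled(events_client, rule_name, enabled):
    """Enable or disable the CloudWatch Events rule that triggers the lambda.

    events_client is expected to be a boto3 "events" client (or anything
    exposing enable_rule/disable_rule with a Name keyword argument).
    """
    if enabled:
        events_client.enable_rule(Name=rule_name)
    else:
        events_client.disable_rule(Name=rule_name)


# Example usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   set_rule_enabled(boto3.client("events"), "<rule name from the stack output>", False)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without touching a real account.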

Regions

By default the lambda will target ASGs running in the same region. It's generally a good idea to avoid cross-region actions, but if necessary an alternative list of one or more region names can be specified in the Regions stack parameter.

The value is a comma-separated list of region names with optional whitespace, so the following are all valid and equivalent:

  • ap-south-1,eu-west-1,us-east-1
  • ap-south-1, eu-west-1, us-east-1
  • ap-south-1 , eu-west-1 , us-east-1
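One way to parse such a value is shown below. It is a minimal sketch with a hypothetical function name, not the project's own parser, and it assumes that empty entries (for example from a trailing comma) should be skipped.

```python
def parse_regions(value):
    """Split a comma-separated Regions value, ignoring surrounding whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]


# All three spellings from the list above produce the same result.
variants = [
    "ap-south-1,eu-west-1,us-east-1",
    "ap-south-1, eu-west-1, us-east-1",
    "ap-south-1 , eu-west-1 , us-east-1",
]
parsed = [parse_regions(v) for v in variants]
```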

Log messages

Chaos Lambda log lines always start with a timestamp and a word specifying the event type. The timestamp is of the form YYYY-MM-DDThh:mm:ssZ, e.g. 2015-12-11T14:00:37Z, and the timezone is always UTC (Z). The different event types are described below.

bad-probability

<timestamp> bad-probability [<value>] in <asg name>

Example:

2015-12-11T14:07:21Z bad-probability [not often] in test-app-ASG-7LJI5SY4VX6T

If the value of the chaos-lambda-termination tag isn't a number between 0.0 and 1.0 inclusive then it will be logged in one of these lines. The square brackets around the value allow CloudWatch Logs to find the full value even if it contains spaces.

result

<timestamp> result <instance id> is <state>

Example:

2015-12-11T14:00:40Z result i-fe705d77 is shutting-down

After asking EC2 to terminate each of the targeted instances, the new state of each is logged with a result line. The <state> value is the state name (for example shutting-down) from the name property of the InstanceState AWS type described at http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceState.html

targeting

<timestamp> targeting <instance id> in <asg name>

Example:

2015-12-11T14:00:38Z targeting i-168f9eaf in test-app-ASG-1LOMEKEVBXXXS

The targeting lines list all of the instances that are about to be terminated, before the TerminateInstances call occurs.

triggered

<timestamp> triggered <region>

Example:

2015-12-11T14:00:37Z triggered eu-west-1

Generated when the lambda is triggered, indicating the region that will be affected.
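For anyone post-processing these logs, the four line formats above can be parsed with a small helper. This is a sketch under the assumption that lines follow the documented formats exactly; the function and pattern names are hypothetical.

```python
import re
from datetime import datetime, timezone

# One pattern per event type, applied to the text after "<timestamp> <event> ".
LINE_PATTERNS = {
    "bad-probability": re.compile(r"^\[(?P<value>.*)\] in (?P<asg_name>\S+)$"),
    "result": re.compile(r"^(?P<instance_id>\S+) is (?P<state>\S+)$"),
    "targeting": re.compile(r"^(?P<instance_id>\S+) in (?P<asg_name>\S+)$"),
    "triggered": re.compile(r"^(?P<region>\S+)$"),
}


def parse_log_line(line):
    """Return a dict of fields for a recognised log line, or None otherwise."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, event_type, rest = parts
    pattern = LINE_PATTERNS.get(event_type)
    match = pattern.match(rest) if pattern else None
    if match is None:
        return None
    try:
        when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    fields = match.groupdict()
    fields["event"] = event_type
    fields["timestamp"] = when.replace(tzinfo=timezone.utc)  # Z means UTC
    return fields
```

For example, parse_log_line("2015-12-11T14:00:40Z result i-fe705d77 is shutting-down") yields a dict whose instance_id is "i-fe705d77" and whose state is "shutting-down".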

chaos-lambda's People

Contributors

danielthepope, dimpavloff, pclifford, stevebirks, tmoco, yavor-atanasov

chaos-lambda's Issues

python3 upgrade

Hey, we’re currently upgrading our Lambdas’ Node.js runtime versions. We noticed we have a Lambda that is based on the chaos-lambda repo, which uses the Python 2.7 runtime. However, in June 2021 AWS will be dropping support for Python 2. Are there plans to upgrade the repo to Python 3?

Enable opt-in mode on a per-Auto-Scaling-Group basis

In my environment I would prefer (at least at first) to run this in a controlled manner, in order to detect configuration drift that would make new machines fail to set up and enter a spinning state.

  • only run on a given day (or days) of the week, at a given time when I am likely available for fixing any issues it may find, for example at 10 AM on working days. This can be set up using a custom cron rule in the event generator.
  • always terminate a single node from each of the AutoScaling groups where it is enabled in order to check if a new node would still be able to start and replace it.
  • the enabled groups (or their instances, whichever is easier) should be selected using tags
  • it should be an opt-in, so other groups would not be touched

Multi-region operation

The stack should only need to be deployed in a single region, and it should be able to trigger instance terminations in any other region, because I would like to set it up only once, not once per region.

Unify CloudFormation stacks

It may be desirable to have a single stack that accepts parameters for configuration options, such as the email to send alarms to.
