StackPulse Playbooks

StackPulse automates and orchestrates incident response and management, empowering SREs and developers to reduce toil, fix issues faster and deliver reliable software services. This repository contains a set of ready-to-use resources, created by StackPulse together with our partners and the community of our users, that will help you get started with managing and improving the reliability of your services.

To learn more about StackPulse, please refer to our platform description or to our product documentation. For your convenience, playbooks presented in this repository are arranged by use case.

Use-Cases

  1. Alert Enrichment and System Diagnostics

  2. Incident Management and Orchestration

Alert Enrichment and System Diagnostics

Playbooks in this section enrich, analyze and triage alerts in real time. They highlight the important data to be used for remediation by the on-call engineers. Using them in the incident response routine improves MTTR (Mean Time to Resolve) across all teams and personnel, and helps apply best diagnostic and troubleshooting practices for each system component regardless of the on-call engineer's expertise level.


Kubernetes Service Scale

This playbook enables an operator to easily and safely scale a service deployment by changing Kubernetes Horizontal Pod Autoscaler (HPA) settings. Executing the playbook will get the new HPA range from the user, issue the scale command, and create a pull request to solidify the change in a GitOps repo.

env slack github

Kubernetes Rollback

This playbook enables an operator to easily and properly roll back a Kubernetes deployment using GitOps with FluxCD. Executing the playbook interactively presents the list of services to the user and then the commit history for the selected service. Once the correct target tag is chosen, the playbook safely rolls back the service and locks the CD pipeline.

env slack github

Kubernetes Job Failed

This playbook extracts logs from a failed Kubernetes job, sends them as a snippet to Slack, and optionally asks via Slack whether to delete or rerun the job.

env slack

Java Memory Dump

This playbook takes a heap memory dump of a Java application running in a Kubernetes pod, and asks the user interactively whether to restart the pod after the dump.

env slack

Kubernetes Pod Restarting

This playbook addresses recurring Pod Restarting events in a Kubernetes cluster. It gets the most recently started pods in the namespace provided either by the alert or by the user, then gets the current and, if they exist, previous logs of the relevant container.

env slack


Postgres Long Running Sessions

This playbook collects all non-idle, long-running sessions from a PostgreSQL instance and sends them to Slack recipients.
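
As an illustration of the kind of check described above, here is a minimal Python sketch. It is not the playbook's actual implementation. The `pg_stat_activity` view and the columns used are standard PostgreSQL, but the five-minute threshold, the `%(threshold)s` placeholder style (as used by psycopg-like drivers), and the names `LONG_RUNNING_SQL`, `DEFAULT_THRESHOLD` and `format_sessions` are assumptions made for this example.

```python
from datetime import timedelta

# Assumed threshold for "long running"; the playbook's real value is not stated.
DEFAULT_THRESHOLD = timedelta(minutes=5)

# Hypothetical query against the standard pg_stat_activity view.
# The placeholder style assumes a psycopg-like driver.
LONG_RUNNING_SQL = """
SELECT pid, usename, state, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > %(threshold)s
ORDER BY duration DESC
"""


def format_sessions(rows, limit=10):
    """Render (pid, user, state, duration, query) rows as plain text lines."""
    lines = []
    for pid, user, state, duration, query in rows[:limit]:
        # Collapse whitespace and truncate so each session fits on one line.
        short_query = " ".join(query.split())[:80]
        lines.append(
            f"pid={pid} user={user} state={state} "
            f"duration={duration} query={short_query}"
        )
    return "\n".join(lines) if lines else "No long-running sessions found."
```

The resulting text could then be posted to Slack as a snippet; how the message is delivered is left out of this sketch.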

env service slack

RabbitMQ Queues Overview

This playbook collects an overview of a RabbitMQ instance, ranks its most resource-consuming queues by messages, unacknowledged messages, message bytes and memory, and sends the result to Slack recipients.
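
The ranking step can be sketched roughly as follows. This is an illustration only, not the playbook's code. It assumes the queue data has already been fetched as a list of dictionaries whose field names (`messages`, `messages_unacknowledged`, `message_bytes`, `memory`) match the queue objects returned by the RabbitMQ management API. The helper names `top_queues` and `overview` are hypothetical.

```python
# Metrics to rank by; names assume RabbitMQ management API queue objects.
RANKING_KEYS = ("messages", "messages_unacknowledged", "message_bytes", "memory")


def top_queues(queues, key, n=3):
    """Return the n queues with the highest value for one metric."""
    if key not in RANKING_KEYS:
        raise ValueError(f"unsupported ranking key: {key}")
    # Missing metrics are treated as 0 so incomplete records do not fail.
    ranked = sorted(queues, key=lambda q: q.get(key, 0), reverse=True)
    return [(q["name"], q.get(key, 0)) for q in ranked[:n]]


def overview(queues, n=3):
    """Build one ranking per metric, keyed by metric name."""
    return {key: top_queues(queues, key, n) for key in RANKING_KEYS}
```

For example, `overview(queues, n=5)` would give the five largest queues for each of the four metrics, ready to be formatted into a message.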

env service slack

Elasticsearch Get Stats

This playbook collects diagnostic information, stats and metrics from an Elasticsearch cluster and sends them to specified recipients in Slack.

service slack

Redis Get Big Keys

This playbook queries a Redis host and retrieves the current big keys. It then sends that output to Slack recipients of your choice as a snippet.

service slack

Redis Diagnostics

This playbook collects Redis cluster diagnostics that focus on common contributors to high memory consumption and performance issues. It then sends that output via Slack.

service slack

Linux Diagnostics

This playbook queries utilization of CPU, memory and storage for a given host and sends the output to Slack recipients of choice.
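
To give a sense of the data involved, here is a minimal Python sketch that gathers similar CPU, memory and storage figures. It is an assumption-laden illustration, not the playbook itself: it runs locally on a Linux host rather than over SSH, reads memory figures from `/proc/meminfo`, and uses hypothetical helper names (`parse_meminfo`, `memory_used_percent`, `disk_used_percent`, `collect`).

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Share of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100 * (total - available) / total, 1)


def disk_used_percent(path="/"):
    """Share of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(100 * usage.used / usage.total, 1)


def collect(meminfo_path="/proc/meminfo", disk_path="/"):
    """Gather load average, memory and disk utilization (Linux only)."""
    with open(meminfo_path) as handle:
        meminfo = parse_meminfo(handle.read())
    load1, load5, load15 = os.getloadavg()
    return {
        "load_average": (load1, load5, load15),
        "memory_used_percent": memory_used_percent(meminfo),
        "disk_used_percent": disk_used_percent(disk_path),
    }
```

Calling `collect()` on a Linux host returns a small dictionary that could be formatted and sent to a chat channel.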

ssh slack

Auto Scaling Group Execute Command

This playbook executes a command on each instance of an AWS Auto Scaling Group and then verifies its health.

ssh slack

Incident Management and Orchestration

Playbooks in this section help automate incident management and communication flows across the organization or for specific teams and services.


Create Incident War-room (Slack)

This playbook creates a Slack Incident War Room when an incident is created and invites the relevant participants (based on incident details or other logic).

slack

Create Incident War-room (Slack, Zoom, PagerDuty)

This playbook creates an Incident War Room in Slack (and/or video conferencing software) when an incident is created and invites the relevant participants based on incident details (and/or on-call rotation schedules).

slack zoom pagerduty

Create Incident War-room (Slack, Zoom, OpsGenie)

This playbook creates an Incident War Room in Slack (and/or video conferencing software) when an incident is created and invites the relevant participants based on incident details (and/or on-call rotation schedules).

slack zoom opsgenie

Archive Incident War Room (Slack)

This playbook runs upon incident resolution and asks the incident commander whether to archive the Slack War Room that belonged to the incident.

slack

Contributors

hagaishapira, nshem, talbe, moshebe, vic3lord, eldadru, leonidbelkind, oriser, tomer-alony, eldadlivni, rzsp
