
opsviz's Introduction

OpsVis Stack

The first step in DevOps transformation is knowing where change is needed

Overview

This repository includes the CloudFormation JSON template and OpsWorks cookbooks to stand up a complete ELK stack in AWS.

Out of the box, it is highly available within one availability zone and automatically scales on load and usage.

It also builds everything with private-only IP addresses and restricts all external access to two endpoints:

  1. Logs and metrics flow in through HA RabbitMQ with SSL terminated at the ELB
  2. All dashboards and elasticsearch requests are protected by doorman and hosted together on a “dashboard” host
  • CloudFormation Script
  • VPC
    • ELBs
    • 1 Public and 4 Private subnets
  • OpsWorks
    • Bastion
    • Sensu server
    • Dashboards (Grafana, Kibana, Graphite, Sensu)
    • CarbonRelay (both replication and fanout)
    • CarbonCache (two carbon caches per instance along with the graphite webapp)
    • Elasticsearch
    • Logstash
    • Rabbitmq
    • Statsd

Setup

  1. Upload an SSL certificate to AWS for the RabbitMQ ELB and note the generated ARN (see the AWS instructions)
  2. Create a new CloudFormation stack on the CloudFormation dashboard
  3. Choose "Upload a template to Amazon S3" and upload cloudformation.json (the template is larger than 51,000 bytes, so it needs to be uploaded to S3)
  4. See the CloudFormation parameters section for specifics on each parameter
  5. During the options step, we recommend disabling rollback on failure so you can see logs on the OpsWorks boxes when recipes fail

CloudFormation parameters

All of these will need to be filled in. For secure passwords and a secure Erlang cookie you can use gen_secrets.py.
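If you would rather generate values by hand, here is a minimal sketch of the kind of generation gen_secrets.py performs (the printed parameter names come from the list below; the real script's behavior and output format may differ):

```python
import secrets
import string

def gen_password(length=32):
    """Generate a random alphanumeric password."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

def gen_erlang_cookie(length=20):
    """Erlang cookies are conventionally uppercase alphanumeric strings."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

if __name__ == "__main__":
    for name in ("RabbitMQPassword", "RabbitMQSensuPassword", "RabbitMQStatsdPassword"):
        print("%s=%s" % (name, gen_password()))
    print("ErlangCookie=%s" % gen_erlang_cookie())
```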

  • CookbooksRef - The git reference to checkout for custom cookbooks

  • CookbooksRepo - The git url for your custom cookbooks

  • CookbooksSshKey - The ssh key needed to clone the cookbooks repo

  • DoormanPassword - Password to use for alternate authentication through doorman. Leave empty for none

  • ElasticSearchVolumeSize - Size of disk in GB to use for elasticsearch ebs volumes

  • GithubOauthAppId - Github Oauth App Id to use for doorman authentication. Leave empty for none.

  • GithubOauthOrganization - Github Organization to allow through doorman. Leave empty for none.

  • GithubOauthSecret - Github Oauth App Secret to use for doorman authentication. Leave empty for none.

  • GraphiteVolumeSize - Size of disk in GB to use for graphite ebs volumes

  • OpsWorksStackColor - RGB Color to use for OpsWorks Stack

  • PagerDutyAPIKey - The pagerduty api key if you want sensu alerts forwarded to pagerduty

  • RabbitMQCertificateARN - ARN of the certificate to use for rabbitmq

  • RabbitMQLogstashExternalPassword - RabbitMQ Password

  • RabbitMQLogstashExternalUser - RabbitMQ User

  • RabbitMQLogstashInternalPassword - RabbitMQ Password

  • RabbitMQLogstashInternalUser - RabbitMQ User

  • RabbitMQPassword - RabbitMQ Password

  • RabbitMQSensuPassword - RabbitMQ Sensu Password

  • RabbitMQStatsdPassword - RabbitMQ Statsd Password

  • RabbitMQUser - RabbitMQ User

  • RabbitMQVolumeSize - Size of disk in GB to use for rabbitmq ebs volumes

  • Route53DomainName - The domain name to append to dns records

  • Route53ZoneId - The zone id to add dns records to on instance setup. If empty updates won't happen

  • Version - Just a placeholder for the version

    Additional Info

    • CookbooksRepo

      Should point to this repository. This points OpsWorks at the custom chef cookbooks from this repo when provisioning the instances.

    • DoormanPassword, GithubOauthAppId, GithubOauthOrganization, GithubOauthSecret

      Doorman sits in front of the nginx reverse proxy for elasticsearch, kibana, the sensu dashboard, grafana, and graphite. This allows you to protect all endpoints, including a public-facing elasticsearch endpoint for kibana and grafana, with a single GitHub account.

    • RabbitMQCertificateARN

      This is the ARN identifier for the SSL certificate that needs to be uploaded before the creation of this stack. It is attached to the RabbitMQ ELB for SSL termination.

    • Route53DomainName, Route53ZoneId

      If specified, DNS names will be created for your instances and ELBs during the setup event of the recipes. For example, if the domain name were example.com and the stack name opsvis, the following DNS records would be created:

      • rabbitmq.opsvis.example.com => RabbitMQ ELB
      • dashboard.opsvis.example.com => Dashboard ELB
      • elasticsearch.opsvis.example.com => Elasticsearch ELB
      • graphite.opsvis.example.com => Graphite ELB

External Access

All instances other than the NAT and Bastion hosts are within the private subnets and cannot be accessed directly.

RabbitMQ has a public-facing ELB in front of it with SSL termination. The dashboard instance has an ELB in front of it so the dashboards for grafana, kibana, graphite, and sensu are publicly accessible (authentication is still required). The bastion host has an Elastic IP attached and sits on the public subnet, so you can SSH into it and, from there, SSH into the boxes on the private subnets.

SSH Users

SSH users are managed by OpsWorks. After creating the stack, log in to the OpsWorks dashboard to see the list of IAM users. From there you can assign SSH and sudo access to individual users as well as upload public keys. After you make changes, OpsWorks runs a chef recipe on all boxes to update the user accounts on each instance accordingly.
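For example, a ~/.ssh/config entry along these lines lets you hop through the bastion transparently (hostnames, the username, and the subnet pattern are assumptions; ProxyJump requires OpenSSH 7.3+):

```text
Host opsvis-bastion
    HostName <bastion Elastic IP>
    User <your-iam-username>

# Hosts on the private subnets are reached via the bastion
Host 10.*.*.*
    ProxyJump opsvis-bastion
    User <your-iam-username>
```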

External Clients

A separate cookbook has been created that contains recipes for installing external clients. They are abstracted out from OpsWorks and AWS so they can be run on any machine to start sending logs to this OpsViz stack.

See bb_external for more documentation

External Logstash Clients

To set up an external logstash client:

  1. Install logstash according to the documentation
  2. Update the config to push logs to the RabbitMQ ELB
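A minimal output block for step 2 might look like the following sketch (the user, password, and exchange settings here are assumptions; match them to the RabbitMQLogstashExternal* parameters you set on the stack, and note port 5671 is the SSL-terminated ELB listener used elsewhere in this doc):

```text
output {
  rabbitmq {
    host          => "<RabbitMQ ELB>"
    port          => 5671
    user          => "<RabbitMQLogstashExternalUser>"
    password      => "<RabbitMQLogstashExternalPassword>"
    exchange      => "logstash"
    exchange_type => "direct"
    ssl           => true
  }
}
```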

External Statsd Clients

Set up statsd to push metrics to RabbitMQ, from which graphite pulls them.

  1. Install the statsd client according to the documentation

  2. Install the rabbitmq backend:

     npm install git+https://github.com/mrtazz/statsd-amqp-backend.git

  3. Set up the config as follows:

     {
       backends: [ "statsd-amqp-backend" ]
       , dumpMessages: true
       , debug: true
       , amqp: {
         host: '<RabbitMQ ELB>'
         , port: 5671
         , login: 'statsd'
         , password: '<statsd RabbitMQ password>'
         , vhost: '/statsd'
         , defaultExchange: 'statsd'
         , messageFormat: 'graphite'
         , ssl: {
           enabled : true
           , keyFile : '/dev/null'
           , certFile : '/dev/null'
           , caFile : '/dev/null'
           , rejectUnauthorized : false
         }
       }
     }
    

External Sensu Clients

We use the public facing RabbitMQ as the transport layer for external sensu clients.

  1. Install the sensu client according to the documentation

  2. Update the client config /etc/sensu/conf.d/client.json

  3. Update the rabbitmq config /etc/sensu/conf.d/rabbitmq.json as follows:

     {
       "rabbitmq": {
         "host": "<RabbitMQ ELB>",
         "port": 5671,
         "user": "sensu",
         "password": "<sensu RabbitMQ password>",
         "vhost": "/sensu",
         "ssl" : true
       }
     }
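For step 2, a minimal /etc/sensu/conf.d/client.json could look like this (the name, address, and subscriptions are placeholders):

```json
{
  "client": {
    "name": "external-client-1",
    "address": "203.0.113.10",
    "subscriptions": [ "external" ]
  }
}
```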
    

Updating Sensu Checks and Metrics

Todo: At this time we don't have a way to drive sensu checks or metrics directly from CloudFormation parameters or any other external definitions. That would make it easier to update sensu without editing the sensu config outside of configuration management, or creating standalone checks on each client.

  • Option 1: SSH into the sensu box and make changes according to the sensu documentation

  • Option 2: Set up standalone checks on each external client according to the documentation

  • Handlers

    When adding a check of type metric, set its handlers to "graphite". This forwards the metrics on to graphite automatically.
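For example, a metric-type check with the graphite handler might be defined like this (the check name, command, interval, and subscribers are placeholders):

```json
{
  "checks": {
    "cpu_metrics": {
      "type": "metric",
      "command": "/etc/sensu/plugins/cpu-metrics.rb",
      "handlers": [ "graphite" ],
      "subscribers": [ "all" ],
      "interval": 60
    }
  }
}
```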


Custom JSON

This is the JSON block that is set as the OpsWorks custom JSON. It drives much of the custom configuration that chef uses to customize the boxes. It is currently embedded in the CloudFormation script so that we can inject parameters into the custom JSON.

If you need to change the custom JSON, you can do so from the OpsWorks stack's settings page. If you make changes, be sure not to update the CloudFormation stack afterwards, as that will overwrite the custom OpsWorks settings you made.

Todo: At some point it would be nice to let a user inject their own custom JSON into the CloudFormation process without manually editing the monolithic cloudformation.json file.

Using create_stack

There is a provided script called create_stack that can be used for launching an opsviz stack. It creates random passwords and generates a self-signed certificate. Basic usage:

./create_stack -c 'https://github.com/pythian/opsviz.git' --region us-west-2 teststack

You can also provide custom parameters. For example, if you want to use your own password for RabbitMQ:

./create_stack -c 'https://github.com/pythian/opsviz.git' --region us-west-2 --param RabbitMQPassword=hunter3 teststack

Multiple --param options can be specified.

In order to use the script, you must setup access keys. See the boto configuration doc for more information.
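A minimal ~/.boto file, for reference (key values are placeholders; see the boto configuration doc for other locations and options):

```ini
[Credentials]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>
```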

Building with Vagrant

The included Vagrantfile builds the OpsVis stack on four virtual machines: rabbitmq-1, logstash-1, elastic-1, and dashboard-1.

Prerequisites

Install vagrant

Install vagrant plugins

vagrant plugin install vagrant-berkshelf
vagrant plugin install vagrant-hostmanager
vagrant plugin install vagrant-omnibus

Install Chef Development Kit

Add chef to your path:

export PATH=/opt/chefdk/bin:$PATH

Bundle and bring up the virtual machines:

bundle
vagrant up --provider virtualbox --provision

vagrant status should then show:

Current machine states:

rabbitmq-1                running (virtualbox)
logstash-1                running (virtualbox)
elastic-1                 running (virtualbox)
dashboard-1               running (virtualbox)

Once complete, the dashboard will be available locally at dashboard.opsvis.dev (the :lb hostname from nodes.json).

Default doorman password is opsvis.

Vagrant Configuration

nodes.json

Configures each node including roles, ip, hostnames, CPU, memory, etc.

... snip ...
    "dashboard-1": {
      ":node": "Dashboard-1",
      ":ip": "10.10.3.10",
      ":host": "dashboard-1.opsvis.dev",
      ":lb": "dashboard.opsvis.dev",
      ":tags": [
        "dashboard",
        "graphite"
      ],
      ":roles": [
        "dashboard"
      ],
      "ports": [
        {
          ":host": 2201,
          ":guest": 22,
          ":id": "ssh"
        },
        {
          ":host": 8090,
          ":guest": 80,
          ":id": "httpd"
        }
      ],
      ":memory": 2048,
      ":cpus": 1
    }
... snip ...
node.json

Custom JSON to overwrite default configs.

Roles

roles/ contains the node-specific roles and run_lists referenced in the Vagrantfile.

opsviz's People

Contributors

alexlovelltroy, buglione, invalidusrname, jmetzmeier, jonathandietz, jondowdle, pkoistin, slb350, taylorludwig


opsviz's Issues

Resolution of ELBs can be cached by nginx

Today I found kibana broken on an opsviz stack. This appeared to be caused by the IP of the ELB behind the proxy_pass changing while nginx kept using the old IP. It looks like we can change the config to force DNS resolution every time rather than caching the resolution:

http://serverfault.com/a/593003/105633

Another option may be to give the ELB an EIP.
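The config change from the linked answer forces nginx to re-resolve at the DNS TTL by combining a resolver with a variable in proxy_pass (the resolver address and ELB hostname below are placeholders):

```nginx
server {
    # VPC DNS resolver (placeholder address); re-resolve every 30s
    resolver 10.0.0.2 valid=30s;

    location / {
        # Using a variable defeats nginx's resolve-once-at-startup behavior
        set $backend "internal-elasticsearch-elb.example.com";
        proxy_pass http://$backend;
    }
}
```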

Fix NGINX proxy for the sensu healthcheck

Uchiwa makes an API call to this address:

(dashboard_url)/sensu/health/sensu

This is supposed to return an object similar to this:

{"Sensu":{"output":"ok"}}

However, nginx redirects it to the events page. This causes the datacenter alerts to show as undefined because of the way the JavaScript parses the response.


Restart of stack resulted in dashboard1 failing setup execution

This might be tied to issue #7, but the dashboard1 instance failed restart as well due to an inability to start the sensu-api service. Re-running the setup step resulted in the dashboard1 setup completing successfully, but logging the issue to see if there are any recipe changes that can prevent this from happening, or if it's just unfortunate luck!

[2014-12-12T14:39:47+00:00] INFO: Processing sensu_service[sensu-api] action start (sensu::api_service line 20)
[2014-12-12T14:39:47+00:00] INFO: Processing service[sensu-api] action start (/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb line 46)
[2014-12-12T14:39:49+00:00] INFO: Retrying execution of service[sensu-api], 2 attempt(s) left
[2014-12-12T14:39:56+00:00] INFO: Retrying execution of service[sensu-api], 1 attempt(s) left
[2014-12-12T14:40:02+00:00] INFO: Retrying execution of service[sensu-api], 0 attempt(s) left

================================================================================
Error executing action `start` on resource 'service[sensu-api]'
================================================================================


Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of /etc/init.d/sensu-api start ----
STDOUT: * Starting sensu-api
...fail!
STDERR: 
---- End output of /etc/init.d/sensu-api start ----
Ran /etc/init.d/sensu-api start returned 1


Cookbook Trace:
---------------
/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb:127:in `block in class_from_file'


Resource Declaration:
---------------------
# In /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb

46:     service new_resource.service do
47:       provider service_provider
48:       supports :status => true, :restart => true
49:       retries 3
50:       retry_delay 5
51:       action :nothing
52:       subscribes :restart, resources("ruby_block[sensu_service_trigger]"), :delayed
53:     end
54:   when "runit"



Compiled Resource:
------------------
# Declared in /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb:46:in `load_current_resource'

service("sensu-api") do
provider Chef::Provider::Service::Debian
action [:nothing]
updated true
supports {:status=>true, :restart=>true}
retries 0
retry_delay 5
service_name "sensu-api"
enabled true
pattern "sensu-api"
cookbook_name "sensu"
end




================================================================================
Error executing action `start` on resource 'sensu_service[sensu-api]'
================================================================================


Mixlib::ShellOut::ShellCommandFailed
------------------------------------
service[sensu-api] (/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb line 46) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of /etc/init.d/sensu-api start ----
STDOUT: * Starting sensu-api
...fail!
STDERR: 
---- End output of /etc/init.d/sensu-api start ----
Ran /etc/init.d/sensu-api start returned 1


Cookbook Trace:
---------------
/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb:127:in `block in class_from_file'


Resource Declaration:
---------------------
# In /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/recipes/api_service.rb

20: sensu_service "sensu-api" do
21:   init_style node.sensu.init_style
22:   action [:enable, :start]
23: end



Compiled Resource:
------------------
# Declared in /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/recipes/api_service.rb:20:in `from_file'

sensu_service("sensu-api") do
action [:enable, :start]
updated true
retries 0
retry_delay 2
cookbook_name "sensu"
recipe_name "api_service"
init_style "sysv"
service "sensu-api"
end



[2014-12-12T14:40:09+00:00] INFO: Running queued delayed notifications before re-raising exception
[2014-12-12T14:40:09+00:00] INFO: package[sensu] sending create action to ruby_block[sensu_service_trigger] (delayed)
[2014-12-12T14:40:09+00:00] INFO: Processing ruby_block[sensu_service_trigger] action create (sensu::default line 20)
[2014-12-12T14:40:09+00:00] INFO: ruby_block[sensu_service_trigger] called
[2014-12-12T14:40:09+00:00] INFO: cookbook_file[/etc/sensu/extensions/graphite.rb] sending restart action to sensu_service[sensu-server] (delayed)
[2014-12-12T14:40:09+00:00] INFO: Processing sensu_service[sensu-server] action restart (sensu::server_service line 20)
[2014-12-12T14:40:09+00:00] INFO: Processing service[sensu-server] action restart (/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb line 46)
[2014-12-12T14:40:16+00:00] INFO: service[sensu-server] restarted
[2014-12-12T14:40:16+00:00] INFO: ruby_block[sensu_service_trigger] sending restart action to service[sensu-server] (delayed)
[2014-12-12T14:40:16+00:00] INFO: Processing service[sensu-server] action restart (/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb line 46)
[2014-12-12T14:40:22+00:00] INFO: service[sensu-server] restarted
[2014-12-12T14:40:22+00:00] INFO: ruby_block[sensu_service_trigger] sending restart action to service[sensu-api] (delayed)
[2014-12-12T14:40:22+00:00] INFO: Processing service[sensu-api] action restart (/var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb line 46)

================================================================================
Error executing action `restart` on resource 'service[sensu-api]'
================================================================================


Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of /etc/init.d/sensu-api restart ----
STDOUT: * Stopping sensu-api
...done.
* Starting sensu-api
...fail!
STDERR: /sbin/start-stop-daemon: warning: failed to kill 4038: No such process
---- End output of /etc/init.d/sensu-api restart ----
Ran /etc/init.d/sensu-api restart returned 1


Resource Declaration:
---------------------
# In /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb

46:     service new_resource.service do
47:       provider service_provider
48:       supports :status => true, :restart => true
49:       retries 3
50:       retry_delay 5
51:       action :nothing
52:       subscribes :restart, resources("ruby_block[sensu_service_trigger]"), :delayed
53:     end
54:   when "runit"



Compiled Resource:
------------------
# Declared in /var/lib/aws/opsworks/cache.stage2/cookbooks/sensu/providers/service.rb:46:in `load_current_resource'

service("sensu-api") do
provider Chef::Provider::Service::Debian
action [:nothing]
updated true
supports {:status=>true, :restart=>true}
retries 0
retry_delay 5
service_name "sensu-api"
enabled true
pattern "sensu-api"
cookbook_name "sensu"
end



[2014-12-12T14:40:24+00:00] ERROR: Running exception handlers
[2014-12-12T14:40:24+00:00] ERROR: Exception handlers complete
[2014-12-12T14:40:24+00:00] FATAL: Stacktrace dumped to /var/lib/aws/opsworks/cache.stage2/chef-stacktrace.out
[2014-12-12T14:40:24+00:00] ERROR: Chef::Exceptions::MultipleFailures
[2014-12-12T14:40:24+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

Extract bb_external into opviz-client repo

We want to move the client installation into a separate repository and have migrated the code into pythian/opviz-client

However, so as not to break installs using master/HEAD we have not deleted it from pythian/opsviz

Until that happens, changes to the bb_external cookbook will need to be maintained in two places.

Creation of cert in 'create_stack' fails if stack of same name was previously created

If you use create_stack and it fails at any point after uploading the cert, a re-run of the script will result in the following error:

writing RSA key
Traceback (most recent call last):
  File "create_stack", line 166, in <module>
    main()
  File "create_stack", line 163, in main
    stack_creator.create_stack()
  File "create_stack", line 128, in create_stack
    self.prepare_cert()
  File "create_stack", line 72, in prepare_cert
    self.cert_arn = self.upload_cert()
  File "create_stack", line 63, in upload_cert
    private_key=self.ssl_key)
  File "/home/vagrant/.virtualenvs/opvis/lib/python2.7/site-packages/boto/iam/connection.py", line 799, in upload_server_cert
    verb='POST')
  File "/home/vagrant/.virtualenvs/opvis/lib/python2.7/site-packages/boto/iam/connection.py", line 102, in get_response
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 409 Conflict
<ErrorResponse xmlns="https://iam.amazonaws.com/doc/2010-05-08/">
  <Error>
    <Type>Sender</Type>
    <Code>EntityAlreadyExists</Code>
    <Message>The Server Certificate with name opsvistest1_cert already exists.</Message>
  </Error>
  <RequestId>e9c7caab-b2bf-11e4-9420-5bb7c3142238</RequestId>
</ErrorResponse>

I would expect this to be handled by either removing the previous certificate and re-attempting, or ignoring the conflict and using the existing one.
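One way to handle it is to catch the 409 and fall back to the existing certificate's ARN. A minimal sketch of that logic with stubbed AWS calls (CertExistsError, fake_upload, and the example ARN are hypothetical stand-ins for boto's upload_server_cert and its 409 EntityAlreadyExists error):

```python
class CertExistsError(Exception):
    """Stand-in for boto's 409 EntityAlreadyExists response."""

def upload_or_reuse_cert(name, upload, lookup):
    """Upload a server certificate; on a name conflict, reuse the existing one.

    upload(name) -> arn, raising CertExistsError on a 409 conflict
    lookup(name) -> arn of an already-uploaded certificate
    """
    try:
        return upload(name)
    except CertExistsError:
        # The certificate from a previous (possibly failed) run already
        # exists under this name; fall back to its ARN instead of dying.
        return lookup(name)

# Example with stubbed AWS calls:
existing = {"opsvistest1_cert": "arn:aws:iam::123456789012:server-certificate/opsvistest1_cert"}

def fake_upload(name):
    if name in existing:
        raise CertExistsError(name)
    existing[name] = "arn:aws:iam::123456789012:server-certificate/" + name
    return existing[name]

arn = upload_or_reuse_cert("opsvistest1_cert", fake_upload, existing.__getitem__)
```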

sensu cluster

I think it would be interesting to have at least two sensu-server nodes, for several reasons: this server is the core of notifications, and it would also allow easier maintenance of this part of the stack. Sensu works well in a cluster; it can spread health-check events across multiple nodes automatically, distributing load in large installations.
Graphite and rabbitmq are already clustered, and I also saw some efforts to switch from redis to elasticache (I really like to use "redishappy" to manage redis clusters in non-AWS environments).
I think the sensu-server deserves its own cluster too!

Add statsd recipes back to the logstash layer

this will likely involve fixing the upstream to not signal a restart when installing for the first time as well as checking the path for node before setting it in the attributes file

Logstash exchange is not created on stack setup.

The logstash rabbitmq input plugin is not currently able to create an exchange if it doesn't exist. Producers (logstash output plugins) will create exchanges if they don't exist.

This means that when an opsviz stack is spun up, the rabbitmq1 and logstash1 instances generate many error log messages, and connections aren't established properly until the first rabbitmq logstash producer is created.

As a quick fix, the opsviz recipes should support creating the logstash exchange on the default vhost '/' during the logstash install recipe.
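That quick fix could use the RabbitMQ management plugin's REST API, roughly as follows (the host, credentials, and exchange type are assumptions; the command is echoed rather than executed here because the host is hypothetical):

```shell
RABBIT_HOST="rabbitmq1"   # hypothetical management host
VHOST="%2F"               # default vhost '/', URL-encoded
EXCHANGE="logstash"

# PUT against the management API is idempotent: it creates the exchange
# if missing and succeeds if it already exists with the same properties.
echo curl -u admin:changeme -X PUT \
  -H "content-type: application/json" \
  -d '{"type":"direct","durable":true}' \
  "http://${RABBIT_HOST}:15672/api/exchanges/${VHOST}/${EXCHANGE}"
```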

Here's an excerpt from the chat discussion:

[2:31 PM] Derek Downey: hmm, it doesn't do what I expected then :( They don't have an 'exchange_type' option on the rabbitmq input http://logstash.net/docs/1.4.2/inputs/rabbitmq
my thought was that 'type' was the same thing, and that it only would create the exchange if a type was also specified
[2:34 PM] Taylor Ludwig: oh gotcha, since the output has an "exchange_type", you were expecting there to be an "exchange_type" option on the input too
[2:35 PM] Derek Downey: yes
[2:35 PM] Taylor Ludwig: that's only relevant on creating the exchange, right? So maybe since the output is the only one that is actually creating the exchange if it doesn't exist, thats why its absent front the input
[2:36 PM] Alex Lovell-Troy: depends on how the code is structured, but that would make sense
[2:36 PM] Taylor Ludwig: But yeah, the "types" option is logstash's message type, so its a setting outside of rabbitmq stuff
[2:36 PM] Derek Downey: I'm still not used to the pattern that only the output (producer) is creating exchanges/queues. To me this sounds like an easy exploit.
plus the stack starts up with a bunch of errors in logstash/rabbitmq until there's at least one producer :)
[2:39 PM] Taylor Ludwig: yeah that alone seems like a good reason to create it.
snip
[2:46 PM] Alex Lovell-Troy: actually...
this should be in logstash
[2:46 PM] Derek Downey: logstash input?
[2:46 PM] Alex Lovell-Troy: yeah
[2:46 PM] Taylor Ludwig: yeah that seems more logical,
[2:46 PM] Derek Downey: that's where I would do it
snip
[2:51 PM] Taylor Ludwig: oh i thought you were talking about the logstash install recipe. I thought we determined only the output creates the exchange not the input, sorry im getting lost going through 3 chats right now
[2:54 PM] Derek Downey: we determined that is how it currently works. I think something needs to create the exchange without requiring producers to avoid the errors, but the logstash input doesn't support it (this still is crazy to me!) We had discussed using rabbitmq management plugin to create the exchange as an alternative
that's my understanding of the discussion, anyway
[2:56 PM] Taylor Ludwig: yeah so my thinking is just use the rest api to create the exchange and do it either on the logstash server install recipe or the rabbitmq install recipe
[3:04 PM] Derek Downey: if doing that, I'd have a preference towards the logstash server install recipe

create_stack script should accept config file

I was going to raise an issue to accept a default instance size for all the layers, which is useful for testing the stack.

The driving factor is that passing all the --param options to customize can lead to mistakes.

However, I think a cleaner solution is to read from a config file.

Doorman authentication with optional parameters

Currently, there are two conditionals in the doorman config template that generate config blocks: app_id and password.

If these are left empty in the parameters, the config blocks still get generated, which at best prevents doorman from starting up and at worst opens security holes with empty passwords.

I'll submit a patch later if it hasn't been done.

t2 instance sizes not supported

This is a known issue, but the Bastion instance type parameter indicates it should support t2.small and t2.medium.

These instance types do not work with the default AMI. We should support the smaller instances for testing purposes.

foxycoder chef-logstash dependency doesn't work on centos because of upstart

The current logstash cookbook that we rely on in bb_external is https://github.com/foxycoder/chef-logstash

The readme claims that it should work in other platforms but it only has been tested in ubuntu/debian systems.

The problem is that the startup script relies on 'upstart', which as far as I know is Ubuntu-only (unless we depend on the upstart cookbook: http://upstart.ubuntu.com/cookbook/#id416).

I'm not sure we should be requiring upstart jobs, but if we don't find a different provider, we should at least fork the foxycoder repo with fixes to allow an init.d script. Or systemd, since RHEL 7 and Ubuntu seem to be going that route: http://www.markshuttleworth.com/archives/1316

Create script to perform current manual work

Currently, there is some manual work that needs to be done, such as generating secrets and the RabbitMQ SSL cert. A wrapper script that does this generation and then calls createStack would be nice to have.

Update recipes to be more FHS compliant

Currently we have a big mix of file locations in the opsviz stack. Things are much easier to troubleshoot when things like logfiles and config files are easy to find. Logs should go to /var/log and config files should be somewhere under /etc/. One example is elasticsearch, which logs to /usr/local/var/log.

CloudFormation setup fails at EC2 instance creation

The CloudFormation setup fails at the EC2 instance creation. When looking at OpsWorks, the instances are still running the setup phase of the chef run and eventually get set up fine, with no errors.

There should probably be a wait condition added to the CloudFormation JSON that waits for the EC2 instances to finish their setup before the CloudFormation run is failed.

WaitCondition: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-waitcondition.html

CreationPolicy Attribute: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html
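For reference, a wait-condition fragment might look roughly like this (resource names, count, and timeout are placeholders; each instance's setup recipe would need to signal the handle, e.g. via cfn-signal, for the condition to succeed):

```json
"SetupWaitHandle": {
  "Type": "AWS::CloudFormation::WaitConditionHandle"
},
"SetupWaitCondition": {
  "Type": "AWS::CloudFormation::WaitCondition",
  "Properties": {
    "Handle": { "Ref": "SetupWaitHandle" },
    "Count": "4",
    "Timeout": "1800"
  }
}
```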

Restarting rabbitmq1 node results in failed setup on 'set_policy ha-all'

I had stopped all instances in my OpsWorks stack, and upon restarting the rabbitmq1 instance I received the following error:

[2014-12-12T14:36:27+00:00] INFO: Enabling RabbitMQ plugin 'rabbitmq_management'.
[2014-12-12T14:36:27+00:00] INFO: rabbitmq_plugin[rabbitmq_management] not queuing delayed action restart on service[rabbitmq-server] (delayed), as it's already been queued
[2014-12-12T14:36:27+00:00] INFO: Processing execute[rabbitmq-plugins enable rabbitmq_management] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/plugin.rb line 39)
[2014-12-12T14:36:28+00:00] INFO: execute[rabbitmq-plugins enable rabbitmq_management] ran successfully
[2014-12-12T14:36:28+00:00] INFO: Processing rabbitmq_plugin[rabbitmq_management_visualiser] action enable (rabbitmq::mgmt_console line 27)
[2014-12-12T14:36:28+00:00] INFO: Enabling RabbitMQ plugin 'rabbitmq_management_visualiser'.
[2014-12-12T14:36:28+00:00] INFO: rabbitmq_plugin[rabbitmq_management_visualiser] not queuing delayed action restart on service[rabbitmq-server] (delayed), as it's already been queued
[2014-12-12T14:36:28+00:00] INFO: Processing execute[rabbitmq-plugins enable rabbitmq_management_visualiser] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/plugin.rb line 39)
[2014-12-12T14:36:29+00:00] INFO: execute[rabbitmq-plugins enable rabbitmq_management_visualiser] ran successfully
[2014-12-12T14:36:29+00:00] INFO: Processing execute[chown -R rabbitmq:rabbitmq /var/lib/rabbitmq] action run (rabbitmq_cluster::default line 10)
[2014-12-12T14:36:29+00:00] INFO: execute[chown -R rabbitmq:rabbitmq /var/lib/rabbitmq] ran successfully
[2014-12-12T14:36:29+00:00] INFO: Processing rabbitmq_user[guest] action delete (rabbitmq_cluster::default line 12)
[2014-12-12T14:36:29+00:00] INFO: Processing rabbitmq_policy[ha-all] action set (rabbitmq_cluster::default line 16)
[2014-12-12T14:36:29+00:00] INFO: Done setting RabbitMQ policy 'ha-all'.
[2014-12-12T14:36:29+00:00] INFO: Processing execute[set_policy ha-all] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb line 66)

================================================================================
Error executing action `run` on resource 'execute[set_policy ha-all]'
================================================================================


Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '2'
---- Begin output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
STDOUT: Setting policy "ha-all" for pattern "^(?!amq\\.).*" to "{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}" with priority "1" ...
STDERR: Error: unable to connect to node rabbit@rabbitmq1: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@rabbitmq1]

rabbit@rabbitmq1:
* connected to epmd (port 4369) on rabbitmq1
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq1
* suggestion: start the node

current node details:
- node name: rabbitmqctl11602@rabbitmq1
- home dir: /var/lib/rabbitmq
- cookie hash: FUWzw5ayMo2aD4GJFavYFA==
---- End output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
Ran rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 returned 2


Resource Declaration:
---------------------
# In /var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb

66:     execute "set_policy #{new_resource.policy}" do
67:       command cmd
68:     end
69: 



Compiled Resource:
------------------
# Declared in /var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb:66:in `block in class_from_file'

execute("set_policy ha-all") do
action "run"
retries 0
retry_delay 2
command "rabbitmqctl set_policy ha-all \"^(?!amq\\.).*\" '{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}' --priority 1"
backup 5
returns 0
cookbook_name "rabbitmq_cluster"
end



[2014-12-12T14:36:29+00:00] INFO: Running queued delayed notifications before re-raising exception
[2014-12-12T14:36:29+00:00] INFO: template[/etc/rabbitmq/rabbitmq-env.conf] sending restart action to service[rabbitmq-server] (delayed)
[2014-12-12T14:36:29+00:00] INFO: Processing service[rabbitmq-server] action restart (rabbitmq::default line 79)
[2014-12-12T14:36:33+00:00] INFO: service[rabbitmq-server] restarted
[2014-12-12T14:36:33+00:00] ERROR: Running exception handlers
[2014-12-12T14:36:33+00:00] ERROR: Exception handlers complete
[2014-12-12T14:36:33+00:00] FATAL: Stacktrace dumped to /var/lib/aws/opsworks/cache.stage2/chef-stacktrace.out
[2014-12-12T14:36:33+00:00] ERROR: execute[set_policy ha-all] (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '2'
---- Begin output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
STDOUT: Setting policy "ha-all" for pattern "^(?!amq\\.).*" to "{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}" with priority "1" ...
STDERR: Error: unable to connect to node rabbit@rabbitmq1: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@rabbitmq1]

rabbit@rabbitmq1:
* connected to epmd (port 4369) on rabbitmq1
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq1
* suggestion: start the node

current node details:
- node name: rabbitmqctl11602@rabbitmq1
- home dir: /var/lib/rabbitmq
- cookie hash: FUWzw5ayMo2aD4GJFavYFA==
---- End output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
Ran rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 returned 2
[2014-12-12T14:36:33+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

I am assuming it's a timing issue — the policy is set while rabbit@rabbitmq1 is still down and the delayed rabbitmq-server restart hasn't run yet — but I'm not 100% sure.
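If it is a timing issue, one possible workaround (an assumption on my part, not a verified fix) would be to give the execute resource in policy.rb some retries, instead of the `retries 0` visible in the compiled resource above:

```ruby
# Sketch only: retry set_policy while the node comes back up, instead of
# failing on the first nodedown. Values are guesses, untested here.
execute "set_policy #{new_resource.policy}" do
  command cmd
  retries 5        # compiled resource currently shows retries 0
  retry_delay 10   # seconds between attempts
end
```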

Parameterize instance sizes and number of nodes

For testing purposes we don't really need c3.large instances, and we might want to customize the number of nodes, for elasticsearch in particular.

This is a feature request to make instance sizes configurable (per role) and, for now, the number of elasticsearch nodes.
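As a sketch, the parameterization could look like this in cloudformation.json (the parameter names below are made up for illustration, not existing template keys):

```json
"ElasticsearchInstanceType": {
  "Type": "String",
  "Default": "c3.large",
  "AllowedValues": ["t2.medium", "m3.medium", "c3.large"],
  "Description": "Instance type for the Elasticsearch layer"
},
"ElasticsearchNodeCount": {
  "Type": "Number",
  "Default": "3",
  "Description": "Number of Elasticsearch instances"
}
```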

Multiple AZ

Regarding "Highly Available within one availability zone": so far I have only gone through some of the documentation, but it looks like support for multiple AZs would make sense, specifically where services are clustered.

Nginx is appending a slash to many paths

Nginx is appending a slash to the end of some of the URLs, and it shouldn't be. I think this is causing problems with some of the proxied requests.

A request to /elasticsearch/_nodes actually arrives at the ES server as [elasticsearch]:9200//_nodes (note the double slash).

Marvel needs to be accessed like this: [doorman_elb]/elasticsearch_plugin/marvel, and it may redirect you on the first run.

I can see there is a trailing slash here: https://github.com/pythian/opsviz/blob/master/site-cookbooks/bb_monitor/templates/default/nginx/dashboard.erb#L22. Removing that trailing slash should be the proper way to set this up. I also notice that many of the other routes do the same thing. If there isn't a specific reason they are set up like that, I'll go ahead and fix this.
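For reference, this is the nginx behavior I believe is at play (upstream name and paths below are assumptions for illustration, not the actual template contents): when proxy_pass includes a URI part, nginx replaces the matched location prefix with it, so a location without a trailing slash combined with a proxy_pass URI of "/" produces the double slash.

```nginx
# /elasticsearch/_nodes -> http://elasticsearch:9200//_nodes (double slash):
# the matched prefix "/elasticsearch" is replaced by "/", leaving "//_nodes".
location /elasticsearch {
    proxy_pass http://elasticsearch:9200/;
}

# /elasticsearch/_nodes -> http://elasticsearch:9200/_nodes:
# prefix "/elasticsearch/" is replaced by "/", leaving "/_nodes".
location /elasticsearch/ {
    proxy_pass http://elasticsearch:9200/;
}
```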

Add RabbitMQ management plugin to dashboard

RabbitMQ has a dashboard https://www.rabbitmq.com/management.html

It runs on the RabbitMQ server on port 15672. The RabbitMQ ELB has a listener for port 15672, but the external security group for that ELB only allows port 5671. https://github.com/pythian/opsviz/blob/master/cloudformation.json#L1182

We should make sure this dashboard plugin is installed and available on the dashboard. It probably makes sense to use the Dashboard ELB and Nginx to forward /rabbitmq to the RabbitMQ ELB on port 15672. This puts the RMQ dashboard behind Doorman's authentication.
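A rough sketch of what the nginx side of that could look like (the location path and the ELB hostname below are placeholders I made up, not existing config):

```nginx
# Forward /rabbitmq/ on the dashboard host to the RabbitMQ management UI,
# so it stays behind doorman's authentication.
location /rabbitmq/ {
    proxy_pass http://internal-rabbitmq-elb.example.com:15672/;
    proxy_set_header Host $host;
}
```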

Agile health checks

I have a different approach when it comes to creating sensu checks. It's not really an issue, just a different way of doing things, so I thought I'd mention it here.
I usually install a graphite client on all the servers I am monitoring (when possible; I prefer Diamond).
I then create the sensu checks to verify data against graphite metrics. You'll say that I am hammering graphite with a bunch of queries, and I am aware of this, but so far it hasn't been an issue in environments of around 150 boxes.
This is why I say it is more agile: I can leverage graphite's math functions to flatten anomalies and find only the relevant signal in the noise.
I then use the check-data.rb script from sensu-community-plugins this way:

/etc/sensu/plugins/check-data.rb -a 120 -s ${graphite_host} -t 'minSeries($graphite_prefix.hostname -s.diskspace.*.byte_percentfree)' -w :::params.graphite.diskspace.bytes.free.warning|20::: -c :::params.graphite.diskspace.bytes.free.critical|10:::

minSeries() might not be the best option here, but this is just an example.
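For context, a command like that would be wired into a sensu check definition along these lines (the check name, graphite host, and target are illustrative; the :::params...|default::: tokens are per-client substitutions as in the command above):

```json
{
  "checks": {
    "graphite_diskspace_free": {
      "command": "/etc/sensu/plugins/check-data.rb -a 120 -s graphite.internal -t 'minSeries(servers.web01.diskspace.*.byte_percentfree)' -w :::params.graphite.diskspace.bytes.free.warning|20::: -c :::params.graphite.diskspace.bytes.free.critical|10:::",
      "subscribers": ["all"],
      "interval": 60,
      "handlers": ["default"]
    }
  }
}
```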
