waggle-sensor / beehive-server Goto Github PK

View Code? Open in Web Editor NEW

13.0 28.0 17.0 28.23 MB

Waggle cloud software for aggregation, storage and analysis of sensor data from Waggle nodes.

Shell 17.71% Python 76.38% Makefile 2.59% HTML 1.77% Dockerfile 1.55%

beehive-server's Introduction

This repo is deprecated! Please see Beehive v2 for the latest version of Beehive!

Beehive Server

Beehive Server is a set of services that collect sensor data from Waggle IoT devices.

For an overview of Waggle visit https://wa8.gl/

System Requirements

OS: Linux, OSX
Software: Make, Docker

Docker

We'll assume Docker CE (Community Edition) version 17.01 or later (check minimal version) is installed on the system.

Installation guides for Docker https://docs.docker.com/install/)

Installation

git clone https://github.com/waggle-sensor/beehive-server
cd beehive-server

Optional: The default location for persistent data is the data subfolder. If you want to change this, define the BEEHIVE_ROOT. See next section for further documentation.

export BEEHIVE_ROOT=`pwd`/data     # this exmaple is the default

Start beehive:

./do.sh deploy

BEEHIVE_ROOT: Persistent data

By default all your beehive data be stored in a data/ subfolder in your checked out git repository. This data directory will contain:

Databases
Nodes Keys
Beehive Keys
RMQ data

If you remove this directory you loose all persistent stuff. The incoming data from the nodes also gets stored under this directory.

To change location of your data folder, set the BEEHIVE_ROOT variable, e.g.

export BEEHIVE_ROOT=${HOME}/beehive-server-data

(Pro tip: store the beehive variable in you ~/.bashrc or similar)

beehive-server's People

Contributors

Stargazers

Watchers

Forkers

tgoodyear kauzclay shannons-ds willengler mgailhac smuniz-vates dawnkd hpan8 jweezy24 wgerlach vforgione chris36314 hkim-dev dinoforcered adugyan jswantek

beehive-server's Issues

Survey "collaborator" beehive deployments

It seems like there's some interest in having other folks deploy beehive's? (If not, we can ignore this.) A few considerations which make that tricky are:

We use a number of large, infrastructure components such as RabbitMQ and Cassandra (and maybe Elasticsearch soon-ish). Proper operations and support during a failure require an ops or sysadmin experienced with these tools to address major issues. How sure are we that other teams will have these people? (We're certainly not in the business of providing that kind of support, right?)
What kind of systems are we anticipating deploying to? It may be ok to deploy to a single VM, but if the intent is to scale up over time, we probably want to consider moving towards an approach where core pieces of infrastructure such as RabbitMQ, Cassandra and Elasticsearch "preexist" and are clustered across multiple machines to provide high availability and performance. (In other words, beehive becomes a service built above that layer of infrastructure. Similar to how you'd build on top of AWS services.) Will most collaborators have a person capable of doing thing kind of provisioning and deployment? If our goal is to scale up, then designing for the lowest common denominator is potentially conflicting with this. (Also note that this is the recommended and intended method of deployment for Cassandra in production.)

Prototype static version of beehive dataset interface

One possible improvement we can do is build a static version of beehive which is regenerated on a schedule. This would dramatically improve page serving performance across the board and give us some room to add sanitization to the datasets until we've cleaned inconsistencies up.

This also has the side effect of completely eliminating direct database access for datasets to the outside work and so could eliminate any security mistakes which show up. (Even though, this really shouldn't be a problem...)

I think this is still worth prototyping, even though we now have nginx performing caching and have moved off the development server. As an example, the build-index tool in the data-exporter generates a "friendly" summary of all the datasets to make sure things look reasonable.

Define roles in Cassandra

For safety reasons, I think it makes sense to define a few different roles for Cassandra use. A few examples are:

Select from sensor data only role. This could be used by services which only need to present the data.
Select from all tables. This could be used by services which need to export all tables for backups.
Select / insert into sensor data tables. This could be used for ETL services.
Admin. The name says it all.

These could help limit the damage that could be done if a service was compromised or improperly designed. This could also help prepare us for a design where the database is easily provisioned into different production, development and testing keyspaces.

Investigate options for bulk data export

I think the simplest way to do this without having to build and significantly change any other layers on beehive is to exposed Cassandra locally within beehive and add an "exporter" role who can only do a select on specified data tables. (I think the last part is important even to just prevent us from making a mistake. You don't want an exporter to accidentally destroy a table!)

This would allow us to write a couple special purpose tools with good performance to do things like bulk backups and exports.

This could even be scheduled to periodically batch, compress and store the data on a mass data store like S3 daily.

Standardize beehive's ssh config for node controllers and edge processors

I'm currently using an ssh config rule which allows ssh-ing devices via a config file. We should make something like this standard on beehive so there is a clear method to accessing nodes for debugging, applying updates, ad-hoc image pulls, etc.

To start out, things can be as simple as ssh nc12345 where 12345 is the port number. Later, we can do a search based method where you can use other criteria to connect.

Transition off of flask development server

See: http://flask.pocoo.org/docs/0.12/deploying/

Consider using docker stack for development / testing deployments

I've had an extremely good experience with using the docker stack features for doing a local development / testing deployment on my own laptop. I've been tracking that experiment here: https://github.com/seanshahkarami/beehive-stack

We should strongly consider transitioning the way beehive is deployed to use something like this. A few benefits are:

There's a single file describing which containers are deployed, with what settings, how they're related, what volumes they use, etc.
A single command can create and destroy the entire deployment.
Services are managed by docker, so we won't need both docker and systemd configurations.
It's easy to run locally. I already use this on my laptop. Imagine how useful that would have been for summer students...
Future / reach benefits: Plays nice with docker swarms...I've already been able to run services across a desktop machine clustered to beehive2. If we're not moving to something like AWS, we should measure the performance of this. This would help us add hardware fault tolerance, redundancy and help us scale horizontally.

Investigate current install scripts

Before reworking too much, we should go through the install scripts clearly, see what's going on there and know what can be cleaned up. (This may have been done already during Bill's last week.)

Better error message and status codes from beehive-flask

Currently, beehive-flask does not produce useful feedback on what's happening internally when things go wrong. In order to improve development and debugging, it's important to get reasonable error messages and status codes in exceptional cases. Here are a couple examples:

The database is not up yet. Flask should not crash - it should give the user a friendly warning that the database is not available right now and to retry later.
An invalid or nonexistent node is queried. Flask should make this clear and not just return a page with no datasets.

Having proper feedback without having to dig into the logs would go a long way for debugging and, more importantly, providing a good user experience.

Ensure update-node-users is enabled during install

It seems that this isn't automatically run during install.

405 Method Not Allowed on beehive-registration server

This first post request is giving me the error when running the develop beehive-server.

import requests

nodeid = '0000000000000001'

r = requests.post(f'http://localhost:80/api/registration?nodeid={nodeid}')
print('POST')
print(r.text)

Error following instructions

I wanted to try out the Array of Things cloud system, so I tried following the Docker instructions in the readme (docker pull waggle/beehive-server:latest, docker network create beehive...). I set my $DATA directory to export DATA=${HOME}/waggle, ran the command to create the client certificate, ran the command to start the server in Docker, ran ./configure, and then ran ./Server.py; I got this error:

2017-07-06 20:38:09,104 - config - INFO - line=72 - RABBITMQ_HOST: beehive-rabbitmq
2017-07-06 20:38:09,104 - config - INFO - line=75 - CASSANDRA_HOST: beehive-cassandra
Traceback (most recent call last):
  File "./Server.py", line 3, in <module>
    from config import *
  File "/usr/lib/waggle/beehive-server/config.py", line 184, in <module>
    with open('/mnt/beehive/beehive-config.json', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/beehive/beehive-config.json'

Are there any troubleshooting steps I should try? I am not sure what this beehive-config.json file it's missing is.

Review ETL processes for sanitization, robustness and correctness

We should do a simple review of the main processes involved in loading data into the databases, processing it, etc. Some examples of what we're looking for are things like:

Do they apply sanitization? For example, ensure consistent node_ids, encoding, naming, etc.
Do they handle invalid data correctly? At least one process just drops bad blobs on failure. We probably would like to flag that data and have it put into an error queue or something for later inspection.
Are they tolerant to database and broker delays, timeouts, etc? This means things like not crashing immediately if the database is busy, ensuring proper message acknowledgements are being done, etc.
Are they relatively efficient in their implementation?

This is worth looking at and getting correct now, as these will be part of our architecture regardless of how we redesign beehive.

Ensure all important RabbitMQ exchanges and queues are defined as durable

This is important as it ensures these are available immediately once RabbitMQ starts. Without this, they need to be recreated each restart.

Simplify container runtime management

Docker supports managing and restarting containers on its own. I'd like to propose simplifying the deployment process by just letting docker manage all that instead of the mix of docker and systemd we have now.

This should be a relatively easy fix and could pay off a lot in terms of organization and easy of deployment.

Design and prototype provisioning of core piece of infrastructure

We should discuss a design where we have a few of the core pieces of beehive's infrastructure "preexisting" and then build on top of that. That could reduce "what beehive is" to primarily being glue between a few, well-operated components.

For example, RabbitMQ, Cassandra and Elasticsearch are probably best operated as core pieces of infrastructure, operated, managed, backed up by someone familiar with best practices and procedures. Different Beehive deployments would then live in a virtual namespace within one of these pieces of infrastructure and would be a configuration option.

For example, we keep a few machines around solely to operate the core RabbitMQ and Cassandra cluster, then deploy a production and development beehive onto the same clusters, just under different namespaces.

Ensure SSL/TLS processes have correct configuration and are clear to use

There is currently a tool on beehive for creating SSL/TLS credentials for servers and clients, however, it seems that it's pretty mysterious to most of us.

It would be great if we confirmed that it's creating credentials with the correct configuration (for example, reasonable names, domains, etc and thing like ensuring certificates are signed using SHA256 instead of SHA1). Further, it's important that everyone has a basic understanding of how this process works. We should get things to the point that it's easy to create server credentials for development or testing if we need.

Ensure metrics are also stored in Cassandra

It would be nice to have this data warehoused in Cassandra instead of just Elasticsearch in case Elasticsearch fails or we ever anticipate moving to another tool.

An update needed on web-watchdog service

When beehive-web service needs to be stopped or restarted, beehive-nginx service also has to be restarted to re-configure communication line between them.

In order to fix it, beehive-web-watchdog service needs to additionally check if external requests work. If not working, the service should restart nginx service as well. Below is the code snippet (See elif statement).

[Service]
Environment='CONTAINER=beehive-web'
Environment='NGINX_CONTAINER=beehive-nginx'
Environment='DATA=/mnt'

Restart=always
RestartSec=5

ExecStart=/bin/bash -c ' \
  while [ 1 ] ; do \
    sleep 2m ; \
    if [ $(curl --silent --max-time 10 $(docker inspect --format "{{ .NetworkSettings.Networks.beehive.IPAddress }}" beehive-web) | grep "This is the Waggle Beehive web server." | wc -l) -ne 1 ] ; then \
      docker rm -f ${CONTAINER} ; \
    elif [ $(curl --silent --max-time 10 beehive1.mcs.anl.gov | grep "This is the Waggle Beehive web server." | wc -l) -ne 1 ] ; then \
      docker rm -f ${NGINX_CONTAINER} ; \
    fi \
  done'

Schedule automatic Cassandra backup

Since we're about to start a lot of work on beehive, we should make sure we have a Cassandra backup process in place.

I went ahead and built a tool to pull datasets, we just need to schedule it and have a place to keep the backups: https://github.com/waggle-sensor/beehive-server/tree/master/data-exporter

The current missing half is having a complementary script to do a restore, but at least we have the raw data available now.

Drop v1 data pipeline (beehive-server container)

At this point, I think it makes sense to drop the v1 data pipeline. The only nodes using this were the original 4 we deployed and I'm not sure if any are sending data anymore.

Clean up Dockerfile and their locations

Yongho and I agree that patching what's on beehive now is a little painful because of how things are organized. We're going to take an afternoon to clean up the Dockerfile to make sure they just have the minimal dependencies and move them all to their own directories in the root of the repo. This will make working with that beehive-server repo much more pleasant and we'll be able to work much faster after that.

Standardize waggle node IDs

Currently we have two disjoint families of node ID conventions:

mac address
mac address with four zeros prefix added

This grew out of many scripts querying the mac address directly and using it without padding it with zeros and, perhaps, a misunderstand of why the zeros were added in the first place.

Now, some nodes have data labelled with both kinds of IDs which is not handled by the current data serving process. Needless to say, this is confusing.

The good thing is, we just need to be more consistent with our own naming, add some sanitization in the data loaders and clean up the datasets in use. The database itself shouldn't care at all as any string could be used as a node ID.

Ensure messages exchange is created during install

Seems to not automatically be created.

Unify data model

I've been thinking some more about how we can combine a bit of what we have now into a simpler pipeline. A very reasonable approach would be to transition to a cassandra table with columns: ((nodeid , date), topic, timestamp), body

(Yes, topic is kind of just a semantic change from plugin. It'd basically be used to store any routing key, for example, coresense:3 or metric.).

Now, we could put a single "data" exchange in beehive accepting all messages like this. If it's a direct exchange, we can then do a simple "opt-in" for each topic we want to store in that database.

The other nice thing about this layout is it supports splitting messages by topic from the database. Generally, you always end up having to handle each topic case-wise, so having the database support this would be great. At the moment, we can't do that without manually filtering. This should also allow better time slicing within a single day.

This is also general enough that we don't need any special code handling things at the front - we just grab data, maybe add a received_at timestamp, and shove it in the database from later processing. This eliminates the need to do any data format changing since all that has to be handled on a case-by-case basis anyway but ensures the storage (and backup) problem is handled uniformly.

Another way to think of this is as simply as a permanent message log which can be replayed for later processing. The nice thing is, this can be designed as a configurable service in the sense that a binding for each topic can be adding to any exchange and you'll automatically start getting backups.

Start laying out critical Ansible playbooks for provisioning and managing servers

I started a new directory for this here: https://github.com/waggle-sensor/ansible/tree/master/beehive

These are meant to automate and safely perform common tasks such as updating the server, backing up critical configuration and credential data, etc.

Configure nginx to use caching

For better page performance, it seems to make sense to have nginx cache pages for a short amount of time (5 minutes or so) so they don't have to be regenerated every time. This would also allow nginx to serve the last cached version of a page if beehive-flask goes down briefly and is restarting.

Ensure credentials are automatically backed up

We need to make sure critical pieces of credential information are automatically backed up and that we're able to restore the state of beehive from it.

For reference, I have a list of the credentials in use here: https://github.com/waggle-sensor/waggle/wiki/Credentials-Overview

I can't emphasis enough how crucial this is...

Review reverse ssh tunnel process

This is such a critical component that we should take some time to make sure this is designed the way we want now.

To get some discussion started, I'd recommend we think about the following:

Do we get enough logging information from this container to say anything useful? If not, we may want to see what we can do to improve that. For example, it would be nice to know that a particular public key is being rejected. We could even combine this with item 2 and have an sshrc post a log message to the logging / analytics pipeline when someone connects and use standard tools like who to see who's connected.
We may want to have a per-node home directory for additional security and deployment reasons. We already restrict what commands a particular user is able to run, but it wouldn't hurt to sandbox access even more. Further, this could allow us to do something like put per-node config / parameters in their own directories to allow them to pull new configurations or credentials on their own. It could essentially be a set of synced home directories.

I'd argue that this approach isn't much harder than we have now and could be built out in parallel until it's signed off on.

It's not much harder to have a script which rebuilds multiple directories / authorized_keys files as opposed to a single large one. The only additional step would be ensuring that each user is added to the ssh container. I'd imagine this step is a pretty fast and could be run at container startup or on a schedule.

Just something to think about.

Getting node datasets scales poorly

Currently, getting the dataset dates for an individual node requires pulling dataset dates for all nodes / dates. We probably want a way for this to scale like O(#dates) instead of O(#nodes * #dates).

A reasonable solution to this would be to have the data insertion process also insert an entry into a datasets table with schema (node_id), date.

switch cassandra to DateTieredCompactionStrategy

http://www.datastax.com/dev/blog/datetieredcompactionstrategy

Understand and prototype running core service in clustered mode

In order to get better redundancy and performance, we may want to consider how we'd run a few core pieces of infrastructure (RabbitMQ, Cassandra, Elasticsearch) in clustered mode. (This is the recommended way of deploying some of these.)

Layout good places to start "virtual deployment seams"

It's difficult to work on new features without a good development and testing deployment.

One option to think about is the following: Since RabbitMQ supports virtual hosts and Cassandra supports keyspaces, we should look for places to add configuration options to allow multiple virtual deployments within a single beehive.

Dynamic SSH Authorized Keys List

One interesting idea for managing many different nodes enabled ssh public keys is to investigate using SSH's AuthorizedKeysCommand parameter. This lets you dynamically generate a list of allowed public keys for a given user from a script.

One simple example is, you can keep a database table with a row like: username, public key, active. Then, a script could print out all the public keys that are active for a given user. Maybe you even have an expiration time as a column?

A cute example I tried last night was, in addition to some "hard-coded" public keys in one of my authorized_keys files, I have sshd dynamically fetch my Github public keys and allow those, too.

This isn't really important or needed anytime soon. Just an interesting idea, I think. Maybe it's easier than managing a huge authorized_keys file. It could also dynamically print out things like commands / port restrictions, too, so all those details could be traced back to a single source.

Layout expected queries and reports we'd like to generate

Having a clear idea of what kind of queries and reports we'd like to extract from our databases is crucial to knowing how to organizing them. This impacts a number of things I'll add in the comments.

Schedule static dataset index regeneration

I'd like to ensure this is scheduled to rebuild the index every 15-30 minutes, so we can start testing this as the primary, ANL public source for datasets.

Standardize node info management

Before adding lots of new nodes, we need to nail down the way we manage nodes. Perhaps between the processes which exist on beehive, this is good enough, but we should make sure and see what could be better. We just need to make sure we can manage things before we get too many nodes and things are out of hand.

Drop RabbitMQ management route from nginx

Currently, we've got RabbitMQ's management interface exposed over HTTPS via nginx. Only a subset of the users are tagged as admin users and all of them were created with strong, random passwords. Because of that, this isn't automatically insecure.

However, configuration-wise it simplifies things to just split RabbitMQ off and treat it as its own piece of infrastructure capable of serving its own management interface without nginx being involved.

New install from scrach authentication issues

We have deployed beehive-server from git using the install/beehiveInstallNew.bash script.
It had many path and silly issues we fixed, such as paths.
After it finished, we found that the RabbitMQ install has no users or maybe we have no passwords.
In short, none of the services are able to connect to MQ.
¿Is this expected?
¿Should we generate queues/users/credentials on behalf of the services?
It looks to me some part of the build failed and the users were not created.
Thank you in advance.

Document and review list of metrics + monitoring data in use

As a first step towards simplifying some different datasets we're sending, it's worth understanding the coverage we have between the different metrics and monitoring services. It may make sense to merge some of these, particularly if they already have significant overlap with each other.

This would complement issue #35 nicely for ensuring we're tracking of all of this in a tightly structured, backed up manner.

Backup plan for RabbitMQ definitions

RabbitMQ is able to save / restore it's current configuration state. This includes all internal configurations regarding exchanges, queues, bindings, users, permissions, virtual hosts, etc... We should probably have this backed up automatically and saved in case we ever need to restore the state of the broker. In particular, once we get to the point of having distinct users for each node, this will ensure we already have a backup plan in place for those credentials.