waggle-sensor / beehive-server
Waggle cloud software for aggregation, storage and analysis of sensor data from Waggle nodes.
Currently, getting the dataset dates for an individual node requires pulling dataset dates for all nodes and dates. We probably want a way for this to scale like O(#dates) instead of O(#nodes * #dates).
A reasonable solution would be to have the data insertion process also insert an entry into a datasets table with schema ((node_id), date).
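A minimal CQL sketch of such a table (the keyspace and table names here are made up for illustration):

```sql
-- Hypothetical datasets table: one row per (node, date), so listing the
-- dates for a single node is a single-partition query, O(#dates).
CREATE TABLE waggle.datasets (
    node_id text,
    date    text,
    PRIMARY KEY ((node_id), date)
);

-- The data insertion process would upsert alongside each write, e.g.:
-- INSERT INTO waggle.datasets (node_id, date) VALUES ('0000001e0610b9e5', '2017-07-06');
```

Listing dates for one node then becomes `SELECT date FROM waggle.datasets WHERE node_id = ?`, which touches only that node's partition.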
It seems like there's some interest in having other folks deploy beehives? (If not, we can ignore this.) A few considerations which make that tricky are:
We use a number of large infrastructure components such as RabbitMQ and Cassandra (and maybe Elasticsearch soon-ish). Proper operations and support during a failure require an ops person or sysadmin experienced with these tools to address major issues. How sure are we that other teams will have these people? (We're certainly not in the business of providing that kind of support, right?)
What kind of systems are we anticipating deploying to? It may be ok to deploy to a single VM, but if the intent is to scale up over time, we probably want to consider moving towards an approach where core pieces of infrastructure such as RabbitMQ, Cassandra and Elasticsearch "preexist" and are clustered across multiple machines to provide high availability and performance. (In other words, beehive becomes a service built above that layer of infrastructure, similar to how you'd build on top of AWS services.) Will most collaborators have a person capable of doing this kind of provisioning and deployment? If our goal is to scale up, then designing for the lowest common denominator potentially conflicts with that. (Also note that this is the recommended and intended method of deployment for Cassandra in production.)
In order to get better redundancy and performance, we may want to consider how we'd run a few core pieces of infrastructure (RabbitMQ, Cassandra, Elasticsearch) in clustered mode. (This is the recommended way of deploying some of these.)
I'm currently using an ssh config rule which allows ssh-ing into devices via a config file. We should make something like this standard on beehive so there is a clear method of accessing nodes for debugging, applying updates, ad-hoc image pulls, etc.
To start out, things can be as simple as ssh nc12345, where 12345 is the port number. Later, we can add a search-based method where you can use other criteria to connect.
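For reference, this kind of ssh config rule looks roughly like the following (the hostname, user and port are illustrative values, not our actual setup):

```
# ~/.ssh/config on beehive (illustrative values only)
# One entry per node; the port is the node's reverse ssh tunnel port.
Host nc12345
    HostName localhost
    User root
    Port 12345
```

A small script could regenerate one such entry per node from the list of tunnel ports.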
We have deployed beehive-server from git using the install/beehiveInstallNew.bash script.
It had a number of small issues we fixed along the way, mostly incorrect paths.
After it finished, we found that the RabbitMQ install has no users, or at least no passwords we know of.
In short, none of the services are able to connect to RabbitMQ.
Is this expected?
Should we generate queues/users/credentials on behalf of the services?
It looks to me like some part of the build failed and the users were not created.
Thank you in advance.
When the beehive-web service needs to be stopped or restarted, the beehive-nginx service also has to be restarted to re-establish the connection between them.
To fix this, the beehive-web-watchdog service needs to additionally check whether external requests work. If they don't, the service should restart the nginx service as well. Below is the code snippet (see the elif branch).
[Service]
Environment='CONTAINER=beehive-web'
Environment='NGINX_CONTAINER=beehive-nginx'
Environment='DATA=/mnt'
Restart=always
RestartSec=5
ExecStart=/bin/bash -c ' \
    while true ; do \
        sleep 2m ; \
        if [ $(curl --silent --max-time 10 $(docker inspect --format "{{ .NetworkSettings.Networks.beehive.IPAddress }}" beehive-web) | grep "This is the Waggle Beehive web server." | wc -l) -ne 1 ] ; then \
            docker rm -f ${CONTAINER} ; \
        elif [ $(curl --silent --max-time 10 beehive1.mcs.anl.gov | grep "This is the Waggle Beehive web server." | wc -l) -ne 1 ] ; then \
            docker rm -f ${NGINX_CONTAINER} ; \
        fi \
    done'
Currently, we've got RabbitMQ's management interface exposed over HTTPS via nginx. Only a subset of the users are tagged as admins, and all of them were created with strong, random passwords, so this isn't automatically insecure.
However, configuration-wise it simplifies things to just split RabbitMQ off and treat it as its own piece of infrastructure capable of serving its own management interface without nginx being involved.
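If we split it off, the management plugin can serve HTTPS itself. A sketch of the relevant options in RabbitMQ's new-style (3.7+) config format, where the certificate paths are placeholders:

```
# rabbitmq.conf (illustrative paths)
management.ssl.port       = 15671
management.ssl.cacertfile = /etc/rabbitmq/ssl/cacert.pem
management.ssl.certfile   = /etc/rabbitmq/ssl/cert.pem
management.ssl.keyfile    = /etc/rabbitmq/ssl/key.pem
```

With this in place, nginx no longer needs to know about the management interface at all.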
I started a new directory for this here: https://github.com/waggle-sensor/ansible/tree/master/beehive
These are meant to automate and safely perform common tasks such as updating the server, backing up critical configuration and credential data, etc.
Having a clear idea of what kind of queries and reports we'd like to extract from our databases is crucial to knowing how to organize them. This impacts a number of things I'll add in the comments.
We should do a simple review of the main processes involved in loading data into the databases, processing it, etc. Some examples of what we're looking for are things like:
This is worth looking at and getting correct now, as these will be part of our architecture regardless of how we redesign beehive.
It seems this is not created automatically.
This first POST request gives me an error when running the development beehive-server:
import requests
nodeid = '0000000000000001'
r = requests.post(f'http://localhost:80/api/registration?nodeid={nodeid}')
print('POST')
print(r.text)
As a first step towards simplifying some different datasets we're sending, it's worth understanding the coverage we have between the different metrics and monitoring services. It may make sense to merge some of these, particularly if they already have significant overlap with each other.
This would complement issue #35 nicely, ensuring we're tracking all of this in a tightly structured, backed-up manner.
One possible improvement is to build a static version of beehive which is regenerated on a schedule. This would dramatically improve page serving performance across the board and give us some room to add sanitization to the datasets until we've cleaned up the inconsistencies.
This also has the side effect of completely eliminating direct database access for datasets from the outside world, and so could eliminate any security mistakes which show up. (Even though this really shouldn't be a problem...)
I think this is still worth prototyping, even though we now have nginx performing caching and have moved off the development server. As an example, the build-index tool in the data-exporter generates a "friendly" summary of all the datasets to make sure things look reasonable.
I'd like to ensure this is scheduled to rebuild the index every 15-30 minutes, so we can start testing this as the primary, ANL public source for datasets.
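The simplest way to schedule that is probably a cron entry; a sketch, where the wrapper script path is hypothetical:

```
# /etc/cron.d/beehive-build-index (illustrative path, user and schedule)
*/15 * * * * waggle /usr/local/bin/rebuild-beehive-index.sh >> /var/log/beehive/build-index.log 2>&1
```

A systemd timer would work just as well if we'd rather keep everything under systemd.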
It would be nice to have this data warehoused in Cassandra instead of just Elasticsearch in case Elasticsearch fails or we ever anticipate moving to another tool.
It's difficult to work on new features without a good development and testing deployment.
One option to think about is the following: Since RabbitMQ supports virtual hosts and Cassandra supports keyspaces, we should look for places to add configuration options to allow multiple virtual deployments within a single beehive.
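A sketch of what provisioning a second virtual deployment could look like (the vhost, user and keyspace names are examples):

```
# RabbitMQ: one vhost per deployment
rabbitmqctl add_vhost beehive-dev
rabbitmqctl set_permissions -p beehive-dev beehive-user ".*" ".*" ".*"

# Cassandra: one keyspace per deployment
cqlsh -e "CREATE KEYSPACE beehive_dev
          WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
```

Each service would then take the vhost/keyspace name as a configuration option, so production and development deployments could share the same brokers and clusters.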
This is important as it ensures these are available immediately once RabbitMQ starts. Without this, they need to be recreated each restart.
Before reworking too much, we should go through the install scripts clearly, see what's going on there and know what can be cleaned up. (This may have been done already during Bill's last week.)
For better page performance, it seems to make sense to have nginx cache pages for a short amount of time (5 minutes or so) so they don't have to be regenerated every time. This would also allow nginx to serve the last cached version of a page if beehive-flask goes down briefly and is restarting.
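A hedged sketch of the nginx side (the upstream address, cache path and zone name are placeholders):

```
# Cache successful responses for 5 minutes and serve stale copies
# while beehive-flask is down or restarting.
proxy_cache_path /var/cache/nginx/beehive keys_zone=beehive_cache:10m max_size=1g;

location / {
    proxy_pass http://beehive-flask:5000;
    proxy_cache beehive_cache;
    proxy_cache_valid 200 5m;
    proxy_cache_use_stale error timeout http_500 http_502 http_503;
}
```

The proxy_cache_use_stale line is what covers the "serve the last cached version while flask restarts" case.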
This is such a critical component that we should take some time to make sure this is designed the way we want now.
To get some discussion started, I'd recommend we think about the following:
Do we get enough logging information from this container to say anything useful? If not, we may want to see what we can do to improve that. For example, it would be nice to know that a particular public key is being rejected. We could even combine this with item 2 and have an sshrc post a log message to the logging / analytics pipeline when someone connects and use standard tools like who to see who's connected.
We may want to have a per-node home directory for additional security and deployment reasons. We already restrict what commands a particular user is able to run, but it wouldn't hurt to sandbox access even more. Further, this could allow us to do something like put per-node config / parameters in their own directories to allow them to pull new configurations or credentials on their own. It could essentially be a set of synced home directories.
I'd argue that this approach isn't much harder than we have now and could be built out in parallel until it's signed off on.
It's not much harder to have a script which rebuilds multiple directories / authorized_keys files as opposed to a single large one. The only additional step would be ensuring that each user is added to the ssh container. I'd imagine this step is pretty fast and could be run at container startup or on a schedule.
Just something to think about.
I've been thinking some more about how we can combine a bit of what we have now into a simpler pipeline. A very reasonable approach would be to transition to a Cassandra table with columns: ((nodeid, date), topic, timestamp), body. (Yes, topic is kind of just a semantic change from plugin. It'd basically be used to store any routing key, for example, coresense:3 or metric.)
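In CQL, that layout would look something like this (the keyspace and table names are illustrative):

```sql
-- Partition by (nodeid, date) to bound partition size; cluster by
-- (topic, timestamp) so a single topic can be sliced by time within a day.
CREATE TABLE waggle.messages (
    nodeid    text,
    date      text,
    topic     text,
    timestamp timestamp,
    body      blob,
    PRIMARY KEY ((nodeid, date), topic, timestamp)
);
```

A query like `SELECT body FROM waggle.messages WHERE nodeid = ? AND date = ? AND topic = 'coresense:3'` then pulls a single topic without any manual filtering.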
Now, we could put a single "data" exchange in beehive accepting all messages like this. If it's a direct exchange, we can then do a simple "opt-in" for each topic we want to store in that database.
The other nice thing about this layout is it supports splitting messages by topic from the database. Generally, you always end up having to handle each topic case-wise, so having the database support this would be great. At the moment, we can't do that without manually filtering. This should also allow better time slicing within a single day.
This is also general enough that we don't need any special code handling things at the front - we just grab data, maybe add a received_at timestamp, and shove it in the database for later processing. This eliminates the need to do any data format conversion up front, since all that has to be handled on a case-by-case basis anyway, but ensures the storage (and backup) problem is handled uniformly.
Another way to think of this is simply as a permanent message log which can be replayed for later processing. The nice thing is, this can be designed as a configurable service in the sense that a binding for each topic can be added to any exchange and you'll automatically start getting backups.
Currently we have two disjoint families of node ID conventions:
This grew out of many scripts querying the MAC address directly and using it without padding it with zeros and, perhaps, a misunderstanding of why the zeros were added in the first place.
Now, some nodes have data labelled with both kinds of IDs, which the current data serving process doesn't handle. Needless to say, this is confusing.
The good thing is, we just need to be more consistent with our own naming, add some sanitization in the data loaders and clean up the datasets in use. The database itself shouldn't care at all as any string could be used as a node ID.
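As a sketch, the sanitization step could be a single shared helper like this (normalize_node_id is a hypothetical name, and it assumes the canonical form is the lowercased MAC address left-padded with zeros to 16 hex characters):

```python
def normalize_node_id(raw: str) -> str:
    """Return a canonical node ID: lowercase hex, zero-padded to 16 chars.

    Hypothetical helper for the data loaders; assumes the zero-padded
    form is the canonical one.
    """
    cleaned = raw.strip().lower().replace(":", "")
    if not cleaned or len(cleaned) > 16 or not all(c in "0123456789abcdef" for c in cleaned):
        raise ValueError(f"not a valid node ID: {raw!r}")
    return cleaned.zfill(16)
```

Running every incoming ID through one function like this in the loaders would make the two families converge without the database ever needing to care.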
Since we're about to start a lot of work on beehive, we should make sure we have a Cassandra backup process in place.
I went ahead and built a tool to pull datasets, we just need to schedule it and have a place to keep the backups: https://github.com/waggle-sensor/beehive-server/tree/master/data-exporter
The current missing half is having a complementary script to do a restore, but at least we have the raw data available now.
RabbitMQ is able to save / restore its current configuration state. This includes all internal configuration regarding exchanges, queues, bindings, users, permissions, virtual hosts, etc. We should probably have this backed up automatically and saved in case we ever need to restore the state of the broker. In particular, once we get to the point of having distinct users for each node, this will ensure we already have a backup plan in place for those credentials.
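On newer RabbitMQ releases this is a one-liner (the backup path is a placeholder; older releases expose the same data via the management API's /api/definitions endpoint):

```
# Dump all exchanges, queues, bindings, users, permissions and vhosts:
rabbitmqctl export_definitions /backups/rabbitmq-definitions.json

# Restore later with:
# rabbitmqctl import_definitions /backups/rabbitmq-definitions.json
```

Scheduling the export and shipping the JSON file off-box would cover the broker-state half of the backup story.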
One interesting idea for managing many different nodes enabled ssh public keys is to investigate using SSH's AuthorizedKeysCommand parameter. This lets you dynamically generate a list of allowed public keys for a given user from a script.
One simple example is, you can keep a database table with a row like: username, public key, active. Then, a script could print out all the public keys that are active for a given user. Maybe you even have an expiration time as a column?
A cute example I tried last night was, in addition to some "hard-coded" public keys in one of my authorized_keys files, I have sshd dynamically fetch my Github public keys and allow those, too.
This isn't really important or needed anytime soon. Just an interesting idea, I think. Maybe it's easier than managing a huge authorized_keys file. It could also dynamically print out things like commands / port restrictions, too, so all those details could be traced back to a single source.
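As a concrete sketch (the table name, schema and paths are all made up): sshd would point at a helper via something like AuthorizedKeysCommand /usr/local/bin/print-keys %u (plus AuthorizedKeysCommandUser), and the helper just prints the active keys for that user:

```python
#!/usr/bin/env python3
# Hypothetical AuthorizedKeysCommand helper: print the active public keys
# for a user from a small SQLite table (username, public_key, active).
import sqlite3
import sys

def print_active_keys(db_path: str, username: str) -> list:
    """Print and return the active public keys for a user."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT public_key FROM ssh_keys WHERE username = ? AND active = 1",
        (username,),
    ).fetchall()
    conn.close()
    keys = [row[0] for row in rows]
    for key in keys:
        print(key)
    return keys

if __name__ == "__main__" and len(sys.argv) > 1:
    print_active_keys("/etc/ssh/keys.db", sys.argv[1])
```

Revoking a key then becomes flipping the active column rather than regenerating authorized_keys files, and the same row could carry expiration times or command/port restrictions.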
At this point, I think it makes sense to drop the v1 data pipeline. The only nodes using this were the original 4 we deployed and I'm not sure if any are sending data anymore.
I think the simplest way to do this without having to build and significantly change any other layers on beehive is to expose Cassandra locally within beehive and add an "exporter" role who can only do a select on specified data tables. (I think the last part is important even just to prevent us from making a mistake. You don't want an exporter to accidentally destroy a table!)
This would allow us to write a couple special purpose tools with good performance to do things like bulk backups and exports.
This could even be scheduled to periodically batch, compress and store the data on a mass data store like S3 daily.
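Cassandra's role-based access control covers this directly; a sketch, where the role and keyspace names are illustrative:

```sql
-- Hypothetical read-only role for bulk export tooling.
CREATE ROLE exporter WITH PASSWORD = 'change-me' AND LOGIN = true;
GRANT SELECT ON KEYSPACE waggle TO exporter;
-- No MODIFY or DROP grants, so an exporter cannot destroy a table.
```

The export and backup tools would then connect as this role instead of a superuser.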
Docker supports managing and restarting containers on its own. I'd like to propose simplifying the deployment process by just letting docker manage all that instead of the mix of docker and systemd we have now.
This should be a relatively easy fix and could pay off a lot in terms of organization and ease of deployment.
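With docker alone, this mostly comes down to a restart policy per container, e.g. in a compose file (the service and image names are illustrative):

```yaml
version: "3"
services:
  beehive-rabbitmq:
    image: rabbitmq:3-management
    restart: always   # docker restarts it on failure and on daemon restart
  beehive-web:
    image: waggle/beehive-server
    restart: always
    depends_on:
      - beehive-rabbitmq
```

The equivalent for plain docker is `docker run --restart=always ...`, which would let us retire the per-container systemd units.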
Yongho and I agree that patching what's on beehive now is a little painful because of how things are organized. We're going to take an afternoon to clean up the Dockerfiles to make sure they have just the minimal dependencies and move them all to their own directories in the root of the repo. This will make working with the beehive-server repo much more pleasant and we'll be able to work much faster after that.
We should discuss a design where we have a few of the core pieces of beehive's infrastructure "preexisting" and then build on top of that. That could reduce "what beehive is" to primarily being glue between a few, well-operated components.
For example, RabbitMQ, Cassandra and Elasticsearch are probably best operated as core pieces of infrastructure, operated, managed, backed up by someone familiar with best practices and procedures. Different Beehive deployments would then live in a virtual namespace within one of these pieces of infrastructure and would be a configuration option.
For example, we keep a few machines around solely to operate the core RabbitMQ and Cassandra cluster, then deploy a production and development beehive onto the same clusters, just under different namespaces.
I wanted to try out the Array of Things cloud system, so I tried following the Docker instructions in the readme (docker pull waggle/beehive-server:latest, docker network create beehive, ...). I set my $DATA directory with export DATA=${HOME}/waggle, ran the command to create the client certificate, ran the command to start the server in Docker, ran ./configure, and then ran ./Server.py; I got this error:
2017-07-06 20:38:09,104 - config - INFO - line=72 - RABBITMQ_HOST: beehive-rabbitmq
2017-07-06 20:38:09,104 - config - INFO - line=75 - CASSANDRA_HOST: beehive-cassandra
Traceback (most recent call last):
File "./Server.py", line 3, in <module>
from config import *
File "/usr/lib/waggle/beehive-server/config.py", line 184, in <module>
with open('/mnt/beehive/beehive-config.json', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/beehive/beehive-config.json'
Are there any troubleshooting steps I should try? I am not sure what this missing beehive-config.json file is.
For safety reasons, I think it makes sense to define a few different roles for Cassandra use. A few examples are:
These could help limit the damage that could be done if a service was compromised or improperly designed. This could also help prepare us for a design where the database is easily provisioned into different production, development and testing keyspaces.
I've had an extremely good experience with using the docker stack features for doing a local development / testing deployment on my own laptop. I've been tracking that experiment here: https://github.com/seanshahkarami/beehive-stack
We should strongly consider transitioning the way beehive is deployed to use something like this. A few benefits are:
We need to make sure critical pieces of credential information are automatically backed up and that we're able to restore the state of beehive from it.
For reference, I have a list of the credentials in use here: https://github.com/waggle-sensor/waggle/wiki/Credentials-Overview
I can't emphasize enough how crucial this is...
There is currently a tool on beehive for creating SSL/TLS credentials for servers and clients, however, it seems that it's pretty mysterious to most of us.
It would be great if we confirmed that it's creating credentials with the correct configuration (for example, reasonable names, domains, etc., and things like ensuring certificates are signed using SHA256 instead of SHA1). Further, it's important that everyone has a basic understanding of how this process works. We should get things to the point that it's easy to create server credentials for development or testing if we need to.
Before adding lots of new nodes, we need to nail down the way we manage nodes. Perhaps between the processes which exist on beehive, this is good enough, but we should make sure and see what could be better. We just need to make sure we can manage things before we get too many nodes and things are out of hand.
Currently, beehive-flask does not produce useful feedback on what's happening internally when things go wrong. In order to improve development and debugging, it's important to get reasonable error messages and status codes in exceptional cases. Here are a couple examples:
The database is not up yet. Flask should not crash - it should give the user a friendly warning that the database is not available right now and to retry later.
An invalid or nonexistent node is queried. Flask should make this clear and not just return a page with no datasets.
Having proper feedback without having to dig into the logs would go a long way for debugging and, more importantly, providing a good user experience.
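A hedged sketch of what this could look like (this is not beehive-flask's actual code; the route, helper and exception names are hypothetical):

```python
# Minimal sketch of surfacing friendly errors from a Flask service.
from flask import Flask, jsonify

app = Flask(__name__)

class DatabaseUnavailable(Exception):
    """Raised when the backing database can't be reached."""

DB_AVAILABLE = True  # the real version would detect Cassandra connectivity

def get_datasets(node_id):
    # Placeholder lookup; the real version would query the database.
    if not DB_AVAILABLE:
        raise DatabaseUnavailable()
    known = {"0000001e0610b9e5": ["2017-07-01", "2017-07-02"]}
    return known.get(node_id)

@app.errorhandler(DatabaseUnavailable)
def handle_db_down(err):
    # 503 tells clients the outage is temporary and worth retrying.
    return jsonify(error="database unavailable, please retry later"), 503

@app.route("/api/datasets/<node_id>")
def datasets(node_id):
    result = get_datasets(node_id)
    if result is None:
        # Make the "unknown node" case explicit instead of an empty page.
        return jsonify(error=f"unknown or invalid node: {node_id}"), 404
    return jsonify(datasets=result)
```

The key point is that each failure mode maps to an explicit status code and message rather than a crash or an empty page.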
It seems that this isn't automatically run during install.