
ALICE Overwatch


Welcome to ALICE Overwatch, a project to provide real-time online data monitoring and quality assurance using timestamped data from the ALICE High Level Trigger (HLT) and Data Quality Monitoring (DQM) systems. See the Web App to access Overwatch displaying ALICE data.

NOTE: With the end of Run 2, Overwatch is now complete.

Quick Start

Setup Overwatch

Along with a variety of dependencies which can be handled by pip, ROOT is required. ROOT 6 is recommended.

Local development

Setting up for local development is fairly straightforward.

$ git clone https://github.com/raymondEhlers/OVERWATCH.git overwatch
$ cd overwatch
# Install webApp static data (Google Polymer and jsRoot)
$ cd overwatch/webApp/static && bower install && git clone https://github.com/root-project/jsroot.git jsRoot && cd -
# Probably best to do this in a virtualenv. The overwatch setup.py can't install this automatically.
$ pip install git+https://github.com/SpotlightKid/flask-zodb.git
# Install for local development
$ pip install -e .

Docker Image

As an alternative for advanced users or deployments, a docker image is available on Docker Hub under the name rehlers/overwatch. Be certain to mount a directory containing data into the image so it can be used! Note that you will likely want to use this image interactively (-it) and may want to remove the container when you are done (--rm). If the data is in a folder called data, it should look something like:

$ docker run -it --rm -v data:/overwatch/data rehlers/overwatch:latest-py3.6.7 /bin/bash

Installation only for running Overwatch

For just running Overwatch (i.e. performing no development at all), it is also available on PyPI to install via pip.

# Required as a prerequisite since it is not available on PyPI.
$ pip install git+https://github.com/SpotlightKid/flask-zodb.git
# Install the final package
$ pip install aliceOverwatch

Note that the Overwatch package on PyPI includes all of JSRoot and Polymer components so that it can run out of the box! While this is useful, it is important to remember that these dependencies must also be kept up to date.

Using Overwatch

Retrieving test data

To use most parts of the Overwatch project, you need some data provided by the HLT. The latest five runs of data received by Overwatch can be accessed here. The login credentials are available on the ALICE TWiki. It includes at least the combined file and the file from which it is built. If the run is sufficiently long, it will include an additional file for testing of the time slice functionality.

Process the data with overwatchProcessing

Create a basic configuration file named config.yaml containing something like:

# Main options
# Enable debug settings, messages at the debug level
debug: true
loggingLevel: "DEBUG"
# Reprocess the data each time, even if it is not detected as needed. It can be useful
# to test modifications to the processing
forceReprocessing: true

# The directory defaults to "data", which is the recommended name
dataFolder: &dataFolder "path/to/data"

Then, start processing the data with:

$ overwatchProcessing

Visualizing the data with overwatchWebApp

For the webApp, add something similar to the following to your config.yaml:

# Define users for local usage
_users: !bcrypt
    # The key (below, "username") is the name of your user, while the value (below, "password") is your password
    username: "password"
# Continue to keep debug: true - it often helps with ZODB difficulties.

Then, to start the webApp for data visualization, run:

$ overwatchWebApp

By default, the webApp will be available at http://127.0.0.1:8850 using the flask development server (not for production). Login with the credentials that were specified in your configuration file.

Table of Contents

  1. Overwatch Overview and Architecture
  2. Overwatch Configuration
  3. Overwatch Executables
  4. Overwatch Deployment
  5. Using Overwatch Data
  6. Citation
  7. Additional Resources

Overwatch Architecture

The Overwatch architecture is outlined below. Incoming data is handled by the receivers, which make that data available to be processed by the processing module. The output of the processing is then visualized via the WebApp. In terms of code, the dependencies are as follows:

python modules
---
base <- processing <- webApp
     <- dqmReceiver

c++
---
zmqReceiver

Further information on each component is available in the sections below. More detailed technical information is available in the READMEs for each package, as well as in the code documentation.

Overwatch Processing

The main processing component of Overwatch is responsible for transforming the received data into a viewable form, while also extracting derived quantities and performing checks for alarms. The main processing module is written in python and depends heavily on PyROOT, with some functionality implemented through numpy. The module is located in overwatch/processing, with the file processRuns.py driving the processing.

At a high level, the processing pipeline looks like:

  • Extract run metadata (run number, HLT mode, detector subsystem being processed, available histograms in the particular run, etc).
  • Determine which runs need processing.
    • For example, if a new file has arrived for a particular run, then that run should be processed.
  • If the run is new, determine which objects (histograms) are included and to which groups they belong, which processing functions need to be run, etc.
    • The processing functions are implemented by each detector and called when requested by the particular detector.
  • Apply those processing functions for each object (histogram), and store the outputs.
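The pipeline above can be sketched in plain Python. All names here (the run dict layout, `needs_processing`, the subsystem function registry) are hypothetical illustrations, not the actual Overwatch API:

```python
# Hypothetical sketch of the processing loop described above. The data
# layout and helper names are illustrative, not Overwatch's real code.

def needs_processing(run):
    """A run needs processing if a file arrived after the last processing time."""
    return any(f["received"] > run["last_processed"] for f in run["files"])

def process_runs(runs, subsystem_functions):
    """Apply each subsystem's processing functions to runs with new data."""
    processed = []
    for run in runs:
        if not needs_processing(run):
            continue
        for hist_name in run["histograms"]:
            for func in subsystem_functions.get(run["subsystem"], []):
                func(run, hist_name)
        run["last_processed"] = max(f["received"] for f in run["files"])
        processed.append(run["runNumber"])
    return processed

runs = [
    {"runNumber": 123456, "subsystem": "EMC", "last_processed": 10,
     "files": [{"received": 5}, {"received": 15}],
     "histograms": ["hEMCClusterEnergy"]},
    {"runNumber": 123457, "subsystem": "TPC", "last_processed": 20,
     "files": [{"received": 12}],
     "histograms": ["hTPCdEdx"]},
]
calls = []
funcs = {"EMC": [lambda run, h: calls.append((run["runNumber"], h))]}
processed = process_runs(runs, funcs)
print(processed)  # [123456] - only run 123456 has a file newer than last_processed
```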

Each detector (also known as a subsystem) is given the opportunity to plug into the processing pipeline at nearly every stage. Each subsystem is identified by its three letter detector name. The detector specific code is located in overwatch/processing/detectors/ and can be enabled through the processing configuration.

Overwatch WebApp

An Overwatch run page

The web app visualizes the information provided by the processing. The WebApp is based on flask and serves the various forms of visualization, as well as providing an interface to request on-demand processing of the data with customized parameters. Note that this causes a direct dependence on the processing module. The main mode of visualization is via json files displayed using JSRoot, which provides interactivity with the data.

Overwatch Data Receivers

The receivers are responsible for receiving data from the various input sources and writing it out. Receivers write out ROOT files with the same filename format, allowing the files to be processed identically regardless of their source.

Note that these receivers need to be deployed in the production environment, but would rarely, if ever, need to be used by standard Overwatch users!

HLT Receivers

Data from the HLT consists of ROOT TObject-derived objects sent via ZeroMQ (ZMQ). The receiver is built in C++, with dependencies on HLT files automatically downloaded, compiled, and linked with the receiver code when the receiver is compiled.

Installation follows the typical CMake pattern for an out of source build. When configuring, remember to specify the locations of ZMQ and ROOT as necessary. Once built, the receiver executable is named zmqReceive. A variety of options are available - for the precise options, see the help (-h or --help).

Note that if there is a ROOT version mismatch (for example, ROOT 5 on the HLT but ROOT 6 for Overwatch), it is imperative to request the relevant ROOT streamers with the --requestStreamers option. This option can potentially trigger an internal ROOT bug and therefore should not be used too frequently. Thus, the request is only sent once when the receiver is started, and the receiver should not be restarted frequently.

DQM Receiver

Data from DQM consists of ROOT files sent via a REST API. The DQM receiver code is written as a flask app. The web app is installed as part of the Overwatch package and can be run using the flask development server via overwatchDQMReceiver. It is configured using the same system as the rest of the Overwatch package, as described here.

For the APIs that are made available, please see the main server code in overwatch/receiver/dqmReceiver.py.

Overwatch Configuration

Overwatch is configured via options defined in YAML configuration files. There is one configuration file for each Overwatch module (DQM receiver, processing, and webApp). Given the dependencies of the various modules on each other, the configuration files are also interconnected. For example, if the webApp is loaded, it will also load the processing configuration, along with the other configurations on which the processing depends. The configuration files are considered in the following order of precedence, from highest to lowest:

./config.yaml
~/overwatch{Module}.yaml
overwatch/webApp/config.yaml
overwatch/processing/config.yaml
overwatch/receiver/config.yaml
overwatch/base/config.yaml

The ordering of the configuration files means that values can be overridden by configurations defined with a higher precedence. For example, to enable debugging, simply set debug: true in your ./config.yaml (stored in your current working directory) - it will override the definition of debug as false in the base configuration.
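The precedence behavior can be illustrated with a small sketch; the merge helper below is illustrative, not Overwatch's actual implementation:

```python
# Sketch of the layered configuration lookup described above: files earlier
# in the precedence list win. The merge helper is a hypothetical
# illustration, not Overwatch's real config loader.

def merge_configs(layers):
    """Merge config dicts; the earliest layer (highest precedence) wins."""
    merged = {}
    for layer in reversed(layers):  # apply lowest precedence first
        merged.update(layer)
    return merged

base_config = {"debug": False, "dataFolder": "data"}  # overwatch/base/config.yaml
local_config = {"debug": True}                        # ./config.yaml
config = merge_configs([local_config, base_config])
print(config)  # {'debug': True, 'dataFolder': 'data'}
```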

For a list of the available configuration options, see the config.yaml file in the desired module.

Overwatch Executables

In addition to the processing and the web application, there are a number of other executables available within the Overwatch project. They predominantly play supporting roles for those two main packages.

A large number of executables are based on modules defined in overwatch.base. For further information, see the documentation and the README in overwatch.base. The following executables are defined there:

  • overwatchDeploy - Handle execution of Overwatch executables in deployments. Although not recommended, it can also be used locally. See also below
  • overwatchUpdateUsers - Simple helper to update the database with the user information defined in the configuration.
  • overwatchReceiverDataTransfer - Transfer data received by the ZMQ and DQM receivers to other Overwatch sites and EOS.
  • overwatchReplay - Replay processed Overwatch data as if it was newly received. Allows for full trending and other testing of the data receiving process.
  • overwatchReplayDataTransfer - Replay processed Overwatch data to a specified destination at a high rate. It is a more general tool than overwatchReplay and is used for moving processed data via overwatchReceiverDataTransfer.

The DQM receiver is defined in overwatch.receiver. For further information, see the documentation and the README in overwatch.receiver. The following executables are defined there:

  • overwatchDQMReceiver - Receive data from the AMORE DQM system. Usage requires coordination with the DQM project.
  • overwatchReceiverMonitor - Monitor the ZMQ receivers via timestamps written by the C++ executables to ensure that they haven't died.

The ZMQ receiver is defined in receiver.src. It is a small C++ code base which receives files from the HLT and writes them to disk. It automatically downloads and compiles a few minor AliRoot dependency classes as needed, such that the only dependencies that must be installed are ZMQ and ROOT. For further information, see the documentation and the README in receiver. The following executables are defined there:

  • zmqReceive - The main executable which handles receiving QA information from the HLT.

Overwatch Deployment

All of the components of Overwatch can be configured and launched by the overwatchDeploy executable. Overwatch is intended to be deployed with a docker image. Within this image, configurations are managed by supervisord. All web apps are deployed behind nginx.

The Dockerfiles and additional information are available in the docker directory.

Configuring Deployment

For a configuration file containing all available options, see overwatch/base/deployConfig.yaml. Note that overwatchDeploy does not read this particular file when configuring a deployment - it only considers the configuration file that is passed to it.

Deployment with the docker image

The role of the image is determined by the configuration passed into the environment variable config. Available configuration options are described in the section on configuring Overwatch for deployment.

The image can then be run with something like the following (using an external configuration file called config.yaml):

$ docker run -d -v data:/overwatch/data -e config="$(cat config.yaml)" rehlers/overwatch

Update Users in the Database

There is a simple utility to update the users in the ZODB database. It can be called via overwatchUpdateUsers (it takes no arguments). It will use the username/password values stored in the config.yaml.

Using Overwatch Data

Overwatch has time-stamped, persistently stored EMCal and HLT subsystem data dating back to November 2015. The TPC joined around April 2016 (note that the HLT data contains some information from various subsystems, such as the V0). This data is available through the end of Run 2 in December 2018, with the exception of the period between approximately mid-August and mid-October 2018, when some data was lost due to infrastructure issues.

For further detailed information on usage of this data, please see the additional resources.

Accessing the data

This data can be accessed in a few different ways:

  • For small data volumes, the underlying data files can be accessed directly via the Web App. Simply select the subsystem ROOT files from the main run list, and select the files to download.
  • For larger volumes, there are a few options:
    • The unprocessed data is also archived on EOS. It is stored in /eos/experiment/alice/overwatch. To access this data, send a request to Raymond and ALICE Offline.
    • REST API file access is also possible under certain circumstances - contact Raymond if this is needed.

Utilizing the data

To successfully use the Overwatch data, a few things must be kept in mind:

  • Each timestamp is in the CERN time zone. For properly handling these times, I recommend the pendulum python package. For a concrete example, see overwatch.utilities.base.extractTimeStampFromFilename.
  • Each data file is cumulative. To get the data received between time n and n+1, one must subtract the histogram, graph, or other object at time n from the object at time n+1. For examples of and further information on how to do this, see overwatch.processing.mergeFiles.
  • The data was requested every minute, but the data is not from precisely only that minute. The HLT runs the QA components in a round-robin configuration across the HLT cluster. The new data that is received corresponds to the data the components sent to the mergers within that minute. The rate at which the QA components send their data depends on the particular subsystem, but is often on the order of every 5 minutes. So the time precision of the data is only on the order of a few minutes.
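Since the files are cumulative, the data received between two timestamps is the bin-by-bin difference of the two snapshots. A minimal plain-Python sketch follows; real Overwatch data consists of ROOT objects, and overwatch.processing.mergeFiles handles this in practice:

```python
# Illustrative sketch: extract the data received between time n and n+1 by
# subtracting the cumulative snapshot at n from the snapshot at n+1.
# Plain lists stand in for ROOT histogram bin contents.

def time_slice(hist_n, hist_n_plus_1):
    """Bin-by-bin difference between two cumulative snapshots."""
    return [later - earlier for earlier, later in zip(hist_n, hist_n_plus_1)]

snapshot_10min = [100, 250, 80]   # cumulative counts at time n
snapshot_11min = [140, 300, 95]   # cumulative counts at time n+1
print(time_slice(snapshot_10min, snapshot_11min))  # [40, 50, 15]
```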

In general, Overwatch provides functionality to simplify working with this data, even if you don't want to use all of the Overwatch processing features. Much more detailed information on how all of this is handled can be found in the documentation and code in overwatch.processing.moveFiles.

Citation

Please cite Overwatch as:

@misc{raymond_ehlers_2018_1309376,
  author       = {Raymond Ehlers and
                  James Mulligan},
  title        = {ALICE Overwatch v1.0},
  month        = jul,
  year         = 2018,
  doi          = {10.5281/zenodo.1309376},
  url          = {https://doi.org/10.5281/zenodo.1309376}
}

Additional Resources

Name Meaning

OVERWATCH: Online Visualization of Emerging tRends and Web Accessible deTector Conditions using the HLT.


Overwatch Issues

Do not write file with receiver if run number is 0

Now that the HLT resets receivers, it appears that it then resets the run number to 0, but still responds to requests with an empty payload. The receiver should note the run number and not write the file, since it is empty.

Update for reset on request

To accommodate the P2 GUI, we need to handle reset on request. To handle this properly, we need

  • Update receiver
    • Test reset on request with receiver
  • Update the process runs merger. The cleanest way probably involves:
    • Name the reset histogram files something different to differentiate them.
    • Extensive testing, since we are modifying the merger.
    • Save each combined file, naming it according to the scheme when we don't reset histograms

Handles slashes and spaces in hist + image name

The TPC QA component has a slash and a space in a histogram name, which then propagates through the code and causes issues when writing the images. The space is fine, but the slash is not. This should be fixed by checking for and replacing dangerous characters when writing filenames.
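A minimal sketch of such a sanitization step; the exact replacement scheme (underscores) is an assumption, not the fix actually adopted:

```python
# Replace characters that are unsafe in filenames (the slash is the real
# problem; spaces are merely inconvenient) before writing images.
import re

def sanitize_filename(hist_name):
    """Replace path separators and whitespace runs with underscores."""
    return re.sub(r"[/\\\s]+", "_", hist_name.strip())

print(sanitize_filename("TPC/Clusters row 2"))  # TPC_Clusters_row_2
```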

jsRoot branch interface completion

From #17

TODO:

  • Site side of OVERWATCH status. See #12
  • Display and store HLT mode
  • Make some sort of bookmarks on the run list to make navigation easier. At worst, every 100 runs. Better would be by month. Ideal would be by period, but that is basically impossible without querying the logbook...
  • Implement new processing options. See: #3

Update authentication method

Update the authentication method, likely by adding CERN Single Sign On (SSO) support. This would require a reworking of the authentication system. Our authentication system is very simple, so this should not be too terrible. For more information, see here. The integration process appears to be very involved.

It may be simpler to implement using SAML2. See the CERN SSO page for more information.

Determine run period using AliEn

Based on Salvatore's suggestion, we could access the run period information from AliEn. This would look something like:

  • Retrieve the run period information via alien_find.
  • Connect to the OVERWATCH database and update a new run period object with the first and last run. This should be straightforward, since the connection is available via the utilities module.
  • Will need to manage how to keep the AliEn connection open.

Receiver continues to send data after EOR

This causes us to save a bunch of useless data, as the HLT keeps sending the same data over and over after the end of run. This was not the case during the PbPb run in 2015.

For a resolution, perhaps compare the current file with the previous one. If they match, tell the receiver to reset the merger. The best way to achieve this is currently unclear.
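One assumed way to implement the comparison suggested above is to compare content digests of consecutive files; the function and its role here are illustrative, not the receiver's actual logic:

```python
# Sketch: a newly received payload that is byte-identical to the previous
# one suggests the run has ended and the data is stale. What to do on a
# match (e.g. tell the receiver to reset the merger) is left open, as in
# the issue above.
import hashlib

def is_stale(previous_bytes, current_bytes):
    """True if the new payload is identical to the previous one."""
    return (hashlib.sha256(previous_bytes).digest()
            == hashlib.sha256(current_bytes).digest())

print(is_stale(b"run data v1", b"run data v1"))  # True
print(is_stale(b"run data v1", b"run data v2"))  # False
```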

jsroot Integration

Make the displayed histograms interactive by using jsroot. Allows zooming, log axis switches, precise value information, etc.

The implementation that we are most likely to be interested in is to create JSON files along with the images. It is further described under their documentation and implementation information page. Briefly, we would create the JSON files using TBufferJSON, and then these would be served using jsroot. However, such an approach may require a reworking of the page, as loading all the images at once would likely make things very slow. The required async requests for the images may also further complicate the implementation. Further information is available in their documentation and examples.

jsRoot branch interface

Needed for deployment of the jsRoot branch. Ordered by priority.

Critical:

  • Show spinning wheel when loading an ajax request

Final deployment issues

deploy2017 branch documentation

Desired for the deploy2017 branch deployment. Ordered by priority. This would be nice, but less critical than other issues.

TODO:

  • Update detector docs
  • Update module docs
  • Update README.md

jsRoot branch processing

Needed for jsRoot branch deployment. Listed by priority

Critical:

  • Utilize database. See #2
  • Finish updates to time slices.
  • Review subsystem class directory structure. In particular, we need to remove the dependence on dirPrefix, because that can change from system to system (although perhaps this is fine, because we require the "data/" symlink).

Allow additional user options for reprocessing

Allow additional user options for reprocessing via the time dependent merge. Additional options should include:

  • Option to disable scaling by nEvents.
  • Option to change the hot channel warning threshold. This should be done generally enough such that other values could be set in the future.

Move (and eventually delete) redundant files with the same number of events

We have a very large number of redundant files due to the HLT continuing to send data until after the run ends (until recently). While not necessarily problematic, it would be good to clean them up from a disk usage perspective. Plus, it is fewer files to manage. We will probably need a reprocessing afterwards to ensure that everything works correctly (plus we'll need to recreate the combined hists files to ensure they are based on the right timestamp).

To identify the redundant files, we should look at the number of events in a given run. Once we have the same number in two different files (we should check that this doesn't happen to one of the first files because this may be a false positive), then the following files are redundant and can be removed. Perhaps it's best to move them first, verify that everything works, and then delete them afterwards. If nEvents is not in the file, perhaps the comparison script created for the EMCal corrections can be adapted to make comparisons here?
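The detection logic described above can be sketched as follows; the (timestamp, nEvents) input format is assumed purely for illustration:

```python
# Sketch: once two consecutive files in a run report the same event count,
# that file and every later file are redundant. As noted in the issue, a
# repeat among the very first files could be a false positive; a real
# implementation should guard against that.

def find_redundant(files):
    """Return timestamps of files at and after the first repeated event count."""
    redundant = []
    previous_n = None
    stale = False
    for timestamp, n_events in sorted(files):
        if stale:
            redundant.append(timestamp)
        elif previous_n is not None and n_events == previous_n:
            stale = True
            redundant.append(timestamp)
        previous_n = n_events
    return redundant

files = [(1, 100), (2, 250), (3, 400), (4, 400), (5, 400)]
print(find_redundant(files))  # [4, 5]
```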

This should probably be done externally to OVERWATCH. I think a similar script was developed in December of last year.

What do you think?

jsRoot branch deployment

Needed for deployment of the jsRoot branch more broadly. Ordered by criticality.

Critical:

  • Docker
    • Create a single docker image containing both nginx and the web app
      • Will mount data via volumes
    • Can we log to somewhere external? Say the sshfs connected drive? See: #4
  • Utilize vulcanize on layout.html. It currently chokes on jinja functions, so some care is needed. Should be used when deployed, but should work without it when developing -> Committed to repo, but should almost always be fine - updates to it should be rare.
    • Minimize polymer components needed
  • Combine, simplify, and improve initOVERWATCH.sh. Move all configuration to one stub. Consolidate logging information (maybe use log4bash?). Should there only be one script for running processing and creating HLT receivers?
  • Test and/or fix upstart script (Written but untested). Needed for OVERWATCH at CERN on Ubuntu 14.04 -> Irrelevant. SLC 7, Ubuntu 16.04, and Debian 8 all support systemd. It will need to be edited by install location, but this is not particularly difficult.
  • Use CDN when possible for jsRoot, Polymer

TODO:

  • Limit log file size for receivers. logrotate will probably work.

jsRoot branch web site

Needed for jsRoot branch deployment. Listed by priority:

Critical:

  • Utilize database. Either ZODB or sqlite. ZODB seems very promising. See #2
  • Time slices interface
    • Display
    • Make directly linkable
  • Disable login when behind SSO. See: #6
  • Update validation using database values and more sophisticated checks. This is how we keep everything safe.

Time slices with different processing options are not displayed correctly

Caused by two time slices with different processing options but the same time extent not being differentiated in the filename. For example:

0-3 minutes, scale on:
timeSlice.0.3.root
-> Processing produces the scaled result, as expected
0-3 minutes, scale off:
timeSlice.0.3.root
-> Processing produces the unscaled result, as expected
0-3 minutes, scale on:
-> Processing believes this request has already been handled -> Returns the existing time slice, but uses the data from timeSlice.0.3...
-> This is the unscaled result! Wrong!
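One possible fix, sketched below, is to fold the processing options into the filename so that slices with the same time range but different options cannot collide. The naming scheme is an assumption for illustration, not the actual Overwatch resolution:

```python
# Sketch: make the time slice filename unique per (time range, options)
# pair by appending a short digest of the options.
import hashlib
import json

def time_slice_filename(min_time, max_time, options):
    """Build a filename that is unique per (time range, options) pair."""
    tag = hashlib.sha1(
        json.dumps(options, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"timeSlice.{min_time}.{max_time}.{tag}.root"

scaled = time_slice_filename(0, 3, {"scaleHists": True})
unscaled = time_slice_filename(0, 3, {"scaleHists": False})
print(scaled != unscaled)  # True - the two requests no longer collide
```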

Handle server failures with ajax properly

Currently, the user just waits. Eventually they will probably give up, but it would be better to handle such failures proactively.

At least handle 500 Internal Server Error responses.

Update to database to handle metadata

Processing relies heavily on metadata. These operations can be very slow, particularly on slow disks. To resolve this, a database should be created (likely built using MongoDB) which caches and manages such metadata, thereby reducing the load on the disk.

A longer term approach could store the data directly there, creating ROOT histograms (or some other visualization tool) on the fly. May nicely integrate with #1.
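The caching idea can be sketched with a small in-memory store keyed by file path and modification time; a real implementation would persist this (the issue suggests something like MongoDB), and all names below are hypothetical:

```python
# Sketch: cache per-file metadata keyed by (path, mtime) so that repeated
# processing passes avoid re-reading the disk.

class MetadataCache:
    def __init__(self):
        self._store = {}
        self.misses = 0

    def get(self, path, mtime, compute):
        """Return cached metadata for (path, mtime), computing it on a miss."""
        key = (path, mtime)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(path)
        return self._store[key]

def slow_read(path):
    """Stand-in for an expensive metadata extraction from a ROOT file."""
    return {"runNumber": 123456, "hltMode": "C"}

cache = MetadataCache()
first = cache.get("Run123456/EMC/file.root", 1000, slow_read)
second = cache.get("Run123456/EMC/file.root", 1000, slow_read)
print(cache.misses)  # 1 - the second lookup is served from the cache
```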

OVERWATCH Status page

Implement a status page showing the status of the tunnels, receivers, ongoing run, etc.

However, we may want to restrict access.
