telemetry-server's Introduction

Telemetry Server

This repository is deprecated. Details on the current server for Firefox Telemetry can be found here and here.


Server components to receive, validate, convert, store, and process Telemetry data from the Mozilla Firefox browser.

Talk to us on irc.mozilla.org in the #telemetry channel, or visit the Project Wiki for more information.

See the TODO list for some outstanding tasks.

Storage Format

See StorageFormat for details.

On-disk Storage Structure

See StorageLayout for details.

Data Converter

  1. Use RevisionCache to load the correct Histograms.json for a given payload
    1. Use the payload's revision if possible
    2. Fall back to appUpdateChannel and appBuildID or appVersion as needed (see the sketch after this list)
    3. Use the Mercurial history to export each version of Histograms.json with the date range it was in effect for each repo (mozilla-central, -aurora, -beta, -release)
    4. Keep a local cache of Histograms.json versions to avoid re-fetching
  2. Filter out bad submission data
    1. Invalid histogram names
    2. Histogram configs that don't match the expected parameters (histogram type, number of buckets, etc.)
    3. Keep metrics for bad data
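
A minimal sketch of that fallback, assuming hypothetical method names on the cache object (the real API lives in telemetry/revision_cache.py):

def histograms_for(payload, cache):
    info = payload.get("info", {})
    # Prefer the exact Mercurial revision recorded in the payload.
    revision = info.get("revision")
    if revision:
        return cache.get_histograms_for_revision(revision)   # hypothetical method
    # Otherwise fall back to channel + build id (or version).
    channel = info.get("appUpdateChannel")
    build = info.get("appBuildID") or info.get("appVersion")
    return cache.get_histograms_for_build(channel, build)    # hypothetical method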

MapReduce

We have implemented a lightweight MapReduce framework that uses the operating system's support for parallelism. It relies on simple Python functions for the Map, Combine, and Reduce phases.

For data stored on multiple machines, each machine will run a combine phase, with the final reduce combining output for the entire cluster.
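
For illustration only (the exact function signatures here are an assumption; see mapreduce/job.py and the scripts in examples/ for the real interface), a job script is roughly:

import json

def map(key, dims, value, context):
    # 'value' is one stored submission; count submissions per OS.
    payload = json.loads(value)
    context.write(payload.get("info", {}).get("OS", "unknown"), 1)

def combine(key, values, context):
    # Runs on each machine to shrink what is shipped to the final reduce.
    context.write(key, sum(values))

def reduce(key, values, context):
    context.write(key, sum(values))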

MongoDB Importer

Telemetry data can optionally be imported into MongoDB. The benefit of doing this is the reduced time to run multiple map-reduce jobs on the same dataset, since MongoDB keeps as much data as possible in memory.

  1. Start MongoDB, e.g. mongod --nojournal
  2. Fetch a dataset from S3, e.g. aws s3 cp s3://... /mnt/yourdataset --recursive
  3. Import the dataset, e.g. python3 -m mongodb.importer /mnt/yourdataset
  4. Run a map-reduce job, e.g. mongo localhost/telemetry mongodb/examples/osdistribution.js (a pymongo alternative is sketched below)
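
If you would rather poke at the imported data from Python than from the mongo shell, something like this pymongo sketch should work (the collection name is a guess; check what the importer actually creates):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["telemetry"]
# "submissions" is a placeholder collection name, not necessarily what the importer uses.
for doc in db["submissions"].find().limit(5):
    print(doc.get("info", {}).get("OS"))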

Plumbing

Once we have the converter and MapReduce framework available, we can easily consume from the existing Telemetry data source. This will mark the first point at which the new dashboards can be fed with live data.

Integration with the existing pipeline is discussed in more detail on the Bagheera Integration page.

Data Acquisition

When everything is ready and productionized, we will route the client (Firefox) submissions directly into the new pipeline.

Code Overview

These are the important parts of the Telemetry Server architecture.

http/server.js

Contains the Node.js HTTP server for receiving payloads. The server's job is simply to write incoming submissions to disk as quickly as possible.

It accepts single submissions using the same type of URLs supported by Bagheera, and expects (but doesn't require) the partition information to be submitted as part of the URL.

To set up a test server locally:

  1. Install node.js (left as an exercise to the reader)
  2. Edit http/server_config.json, replacing log_path and stats_log_file with directories suitable to your machine
  3. Run the server using cd http; node ./server.js ./server_config.json
  4. Send some test data to the server. Using curl: curl -X POST http://127.0.0.1:8080/submit/telemetry/foo/bar/baz -d '{"test": 1}'

Stop the server, and check that there is a telemetry.log.<something>.finished file in the directory you specified in step 2 above.

You can examine the resulting file in python (from the root of the repo):

# Iterate over the records in a .finished log file (Python 2 print syntax).
import telemetry.util.files as fu
for r in fu.unpack('/path/to/telemetry.log.<something>.finished'):
    print "URL Path:", r.path
    print "JSON Payload:", r.data
    print "Submission Timestamp:", r.timestamp
    print "Submission IP:", r.ip
    print "Error (if any):", r.error

telemetry/convert.py

Contains the Converter class, which is used to convert a JSON payload from the raw form submitted by Firefox to the more compact storage format for on-disk storage and processing.

You can run the main method in this file to process a given data file (the expected format is one record per line, each line containing an id followed by a tab character, followed by a json string).
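
For reference, a line in that format could be written like this (the id, payload, and file name below are made up):

import json

record_id = "0000-example-id"                                  # made-up id
payload = {"ver": 1, "info": {"appUpdateChannel": "release"}}  # made-up payload
with open("converter_input.txt", "a") as f:                    # hypothetical file name
    f.write("%s\t%s\n" % (record_id, json.dumps(payload)))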

You can also use the Converter class to convert data in a more flexible way.

telemetry/export.py

Contains code to export data to Amazon S3.

telemetry/persist.py

Contains the StorageLayout class, which is used to save payloads to disk using the directory structure as documented in the storage layout section above.

telemetry/revision_cache.py

Contains the RevisionCache class, which provides a mechanism for fetching the Histograms.json spec file for a given revision URL. Histogram data is cached locally on disk and in-memory as revisions are requested.

telemetry/telemetry_schema.py

Contains the TelemetrySchema class, which encapsulates logic used by the StorageLayout and MapReduce code.

process_incoming/process_incoming_mp.py

Contains the multi-process version of the data-transformation code. This is used to download incoming data (as received by the HTTP server), validate and convert it, then publish the results back to S3.

process_incoming/worker

Contains the C++ data validation and conversion routines.

Prerequisites

Optional (used for documentation)

convert - Build instructions (from the telemetry-server root)

mkdir release
cd release
cmake -DCMAKE_BUILD_TYPE=release ..
make

Configuring the converter

  • heka_server (string) - Hostname:port of the Heka log/stats service.
  • histogram_server (string) - Hostname:port of the histogram.json web service.
  • telemetry_schema (string) - JSON file containing the dimension mapping.
  • storage_path (string) - Converter output directory.
  • upload_path (string) - Staging directory for S3 uploads.
  • max_uncompressed (int) - Maximum uncompressed size of a telemetry record.
  • memory_constraint (int) -
  • compression_preset (int) -

Example configuration:
    {
        "heka_server": "localhost:5565",
        "telemetry_schema": "telemetry_schema.json",
        "histogram_server": "localhost:9898",
        "storage_path": "storage",
        "upload_path": "upload",
        "max_uncompressed": 1048576,
        "memory_constraint": 1000,
        "compression_preset": 0
    }

Setting up/running the histogram server

pushd http
../bin/get_histogram_tools.sh
popd
python -m http.histogram_server

Running the converter

In the release directory:

mkdir input
./convert convert.json input.txt

# input.txt should contain a list of files to process (newline delimited)
# i.e. /<path to telemetry-server>/release/input/telemetry1.log

From another shell, in the release directory:

cp ../process_incoming/worker/common/test/data/telemetry1.log input

Without the histogram server running, it will produce something like this:

processing file:"telemetry1.log"
LoadHistogram - connect: Connection refused
ConvertHistogramData - histogram not found: https://hg.mozilla.org/releases/mozilla-release/rev/a55c55edf302
done processing file:"telemetry1.log" processed:1 failures:1 time:0.001871 throughput (MiB/s):9.3563 data in (B):18356 data out (B):0

With the histogram server running:

processing file:"telemetry1.log"
done processing file:"telemetry1.log" processed:1 failures:0 time:0.013622 throughput (MiB/s):1.2851 data in (B):18356 data out (B):45909

Ubuntu Notes

apt-get install cmake libprotoc-dev zlib1g-dev libboost-system1.54-dev \
   libboost-filesystem1.54-dev libboost-thread1.54-dev libboost-test1.54-dev \
   libboost-log1.54-dev libboost-regex1.54-dev protobuf-compiler libssl-dev \
   liblzma-dev xz-utils

mapreduce/job.py

Contains the MapReduce code. This is the interface for running jobs on Telemetry data. There are example job scripts and input filters in the examples/ directory.

provisioning/aws/*

Contains scripts to provision and launch various kinds of cloud services. This includes launching a telemetry server node, a MapReduce job, or a node to process incoming data.

monitoring/heka/*

Contains the configuration used by Heka to process server logs.


telemetry-server's Issues

Clarifications on histograms format

I'm currently working on a clean-slate implementation of Telemetry as a self-contained library, for Servo, and I need a few clarifications regarding the format accepted for histograms. See this issue for this specific piece of work.

Fields that make no sense for some histograms

If I read the source code correctly, convert.py will set sum, sum_squares_lo, sum_squares_hi, log_sum, log_sum_squares to -1 if these fields cannot be found. Does this mean that histograms that have no meaningful values for either of these fields (e.g. enumerated histograms, count histograms, boolean histograms, flag histograms) can omit these fields? Will e.g. the dashboard still work?

Min/max

For count histograms, if I read the C++ source correctly, the min is hardcoded to 1, the max is hardcoded to 2, and the number of buckets is hardcoded to 3. Can I deduce that all three are ignored?

(I may have other questions later)

Add swap file

We should add a swap file to the self-service telemetry instances to avoid an out of memory error in some cases.

Pass file/offset information into mapper

I found myself needing to perform data deduplication for the update hotfix data. (Yes, I know incremental upload works better with map reduce.)

My ideal deduplication mechanism is:

  1. Collect the upload date of all records with the same ID
  2. Throw away all but the most recent record

i.e. last write wins

It is difficult to do this in telemetry-server today if N>1 mappers are involved because the mapper receives no metadata about upload time. Call this a deficiency in how the hotfix data stream is constructed if you want. You can do this with N=1 mappers, but it is absurdly slow (I have multiple cores and I want to use them, dammit). You can write out the original records in map() and have the reducer produce essentially new streams from the filtered output. But that involves tons of extra CPU and I/O. Why should I rewrite a multi-gigabyte data set where the duplication rate is low?

I propose adding the source file/offset information into map(). In my dedupe job, I can write the record ID and file/offset information. A reducer can then find IDs with multiple records and produce an output listing either the "good" or "bad" sets of file/offsets. I can then load this file into another MR job and filter incoming records against it. You pay a penalty for hash lookup on each record, but that should be fast if set is used.

I've already hacked up telemetry-server to pass this extra information to map(). But it breaks the API of map(). Next step is likely to hook up inspect.getargspec() to see if the callee supports the new arguments and pass them if it does. But I wanted to get feedback from people before I fully implement this, as the changes are probably a bit controversial.
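
A rough sketch of that getargspec idea (the map() signature shown is an assumption, and this is not the actual telemetry-server driver code):

import inspect

def call_map(map_func, key, dims, value, context, source_file, offset):
    # Pass the new file/offset arguments only if the job's map() declares them,
    # so existing jobs keep working with the old signature.
    accepted = inspect.getargspec(map_func).args
    if "source_file" in accepted and "offset" in accepted:
        map_func(key, dims, value, context, source_file, offset)
    else:
        map_func(key, dims, value, context)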

Cluster Monitoring Connection Instructions could be awesomer.

Currently, the rsync / scp of the files is a drag. There are lots of solutions here, including autosync with s3, or gist or github.

I have been thinking something along the lines of this:

## GREGG TOTALLY STINKS AT BASH, this needs to be... sourceable?

export -f TELEMETRYCLUSTER=ec2-54-201-192-120.us-west-2.compute.amazonaws.com;
export TELDIR=~/telemetry-analysis-files
mkdir -p "${TELDIR}"

echo -n "${TELEMETRYCLUSTER}" > ~/.telemtry-cluster
echo -n "${TELDIR}" > ~/.telemetry-dir


# connect
tel-connect () { ssh -i my-private-key -L 4040:localhost:4040 -L 8888:localhost:8888 hadoop@`cat ~/.cluster`; }

# scp down
tel-download () { scp -i my-private-key hadoop@$(cat ~/.cluster):~/analyses/* "${TELDIR}"; }

# scp up
tel-upload () { scp -i my-private-key "${TELDIR}"/* hadoop@$(cat ~/.cluster):~/analyses/ ; }

export tel-connect
export tel-upload
export tel-download


Maybe suggest putting this in some snippet you can cram into a .bash_profile or whatnot :)

Use ujson for faster job execution

Various MR jobs are using the built-in json or simplejson packages for deserializing json payloads. Switching to ujson gives a significant speed-up.

I have a single day of the Firefox update hotfix payloads cached locally. There are 627,404 records that lz4 decompress to 11,655,091,683 bytes. I have a dead simple MR script that performs a JSON deserialize and extracts a single value from the payload and combines it.

Here is the performance of that job with 8 concurrent processes on 4+4HT cores with various JSON implementations.

built-in json

real 0m54.250s
user 6m26.780s
sys 0m9.834s

real 0m53.691s
user 6m22.161s
sys 0m9.710s

real 0m52.698s
user 6m14.038s
sys 0m9.596s

simplejson

real 0m34.825s
user 4m7.692s
sys 0m7.125s

real 0m34.218s
user 4m3.766s
sys 0m7.055s

real 0m34.830s
user 4m4.105s
sys 0m7.043s

ujson

real 0m26.212s
user 3m6.775s
sys 0m5.789s

real 0m27.636s
user 3m16.358s
sys 0m6.077s

real 0m28.094s
user 3m18.188s
sys 0m6.227s

Averages

The average CPU times are:

json: 391s
simplejson: 252s
ujson: 200s

lzma --decompress --stdout on this data set takes about 83s of CPU time.

As the data demonstrates, ujson is significantly faster than simplejson and will thus make Telemetry jobs faster and more efficient.

My data should not need validation: any Google search on "Python json benchmark" will tell you others have reached the same conclusion that ujson is the bomb.
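
One low-risk way to adopt this is a guarded import that falls back when ujson is not installed, e.g.:

try:
    import ujson as json        # fastest, if available
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json             # stdlib fallback

payload = json.loads('{"test": 1}')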

Upgrade to Ubuntu 14

It would be great if we could upgrade to Ubuntu 14 as it comes with a newer R version which has better compatibility with some useful packages like dplyr.

Doc-request: How to make a public (and private) dash

Per irc today :) Maybe put a snippet / git link near the page with the ongoing Spark job, or wiki this up a bit?

GL

07:13 < gregglind> peeps, what's the recommended path for using a spark job as the basis for a dashboard?
07:14 < gregglind> 1.  run it as a cron... then get the data from S3?  Is there permissions and stuff to deal with?
07:14 < gregglind> (i.e., wanting to setup a json or csv that other inside-moz dashboards can use)
07:15 <@mreid> gregglind: public dashboards are easy.  moz-internal dashboards take a little more effort, but it's doable.
07:20 < gregglind> tell me how to make the easy :)
07:27 <@mreid> gregglind: run a scheduled spark job, set it to public, and write out the data you want for your dashboard to
               $PWD (or just make the notebook itself be the dashboard)
07:27 <@mreid> gregglind: then your output will be web-accessible at a URL like this:
07:27 <@mreid> https://analysis-output.telemetry.mozilla.org/mreid-test-new-dev/data/test.txt
07:30  * gregglind confused, doesn't the machine die?
07:30 <@mreid> gregglind: if you want to just have a notebook that updates (with embedded viz and stuff), it would appear at
               https://analysis-output.telemetry.mozilla.org/<job_name>/data/MyAwesomeNotebook.ipynb
07:30 <@mreid> gregglind: it's magic, don't worry :P
07:31 < gregglind> I am trying to avoid 'the notebook is the dashboard'.
07:31 < gregglind> Okay :)
07:31 <@mreid> [...] is $PWD) get auto-uploaded to S3 for posterity
07:31 < gregglind> Can you maybe just copy that to a little gist :) I will file a bug against telemetry dash :)
07:32 <@mreid> so you could also output "my_fancy_dataset.csv[.gz]" and it would become available via a similar URL

Running telemetry locally requires the 'boto' module

There is at least one Python package that needs to be installed before running telemetry:

boto.s3.connection (pip install boto) -- The map/reduce driver 'job.py' could possibly be modified to skip importing this when it's used with --local-only
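
A minimal sketch of how job.py could defer that dependency (a hypothetical helper, not the current code):

def get_s3_connection():
    # Import boto lazily so --local-only runs don't require it to be installed.
    from boto.s3.connection import S3Connection
    return S3Connection()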

Add license information

Nice code! Unfortunately I don't see any licensing info anywhere :(

I can haz license boilerplate in the source files?

Suggestion: better code snippets on the cluster monitor page

Right now, the cluster monitor page is sufficient, but not awesome.

Suggestion... have some better code snippets to handle upload, download and connect.

I know the eventual route is to go auto gist / s3 / whatever. Maybe this is a temp solution?

# remember that gregg is the worst at bash!
export TELEMETRYCLUSTER=ec2-54-201-192-120.us-west-2.compute.amazonaws.com; 
export TELDIR=~/telemetry-analysis-files
mkdir -p "${TELDIR}"

echo -n "${TELEMETRYCLUSTER}" > ~/.telemetry-cluster
echo -n "${TELDIR}" > ~/.telemetry-dir

# connect
tel-connect () { ssh -i my-private-key -L 4040:localhost:4040 -L 8888:localhost:8888 hadoop@`cat ~/.telemetry-cluster`; }

# scp down
tel-download () { scp -i my-private-key hadoop@$(cat ~/.telemetry-cluster):~/analyses/* "$(cat ~/.telemetry-dir)"; }

# scp up
tel-upload () { scp -i my-private-key "$(cat ~/.telemetry-dir)"/* hadoop@$(cat ~/.telemetry-cluster):~/analyses/ ; }

export -f tel-connect
export -f tel-upload
export -f tel-download

