
fx_usage_report's Introduction

Firefox Public Data

The Firefox Public Data (FxPD) project is a public-facing website that tracks various metrics over time and helps the general public understand what kind of data Mozilla collects and how it is used. It is modeled after, and evolved out of, the Firefox Hardware Report, which is now included as part of FxPD.

This repository contains the code used to pull and process the data for the User Activity and Usage Behavior subsections of the Desktop section of the report.

The website itself is generated by the Ensemble and Ensemble Transposer repos.

Data

The data is pulled from Firefox desktop telemetry, specifically the main summary view of the data.

The data has a weekly resolution (one data point per week) and includes the metrics below. The metrics are estimated from a 10% sample of the Release, Beta, ESR, and Other channels, and are broken down by the top 10 countries plus an overall worldwide aggregate. The historical data is kept in an S3 bucket as a JSON file.

This job (the repo) is designed to be run once a week and will produce the data for a single week. It will then update the historical data in the S3 bucket.

For backfills, this job needs to be run for each week of the backfill.

Metrics

For the list of metrics, see METRICS.md.

Data Structure

For a description of the structure of the data output, see DATAFORMAT.md.

Developing

Run the Job

To initiate a test run of this job, you can clone this repo onto an ATMO cluster. First run

$ pip install py4j --upgrade

from your cluster console to get the latest version of py4j.

Next, clone the repo, and from the repo's top-level directory, run:

$ python usage_report/usage_report.py --date [some date, i.e. 20180201] --no-output

which will aggregate usage statistics from the last 7 days by default. When testing, it is recommended to set the --lag-days flag to 1 for quicker iterations, e.g.

$ python usage_report/usage_report.py --date 20180201 --lag-days 1 --no-output

Note: there is currently no output to S3, so testing like this is not a problem. However, when testing runs in this way, always make sure to include the --no-output flag.

Testing

Each metric has its own set of unit tests. The code to extract a particular metric is found in the .py files in usage_report/utils/, which are integrated in usage_report/usage_report.py.
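A minimal sketch of what one of these tests might look like. The metric helper named in the comment is hypothetical; only the pytest/PySpark scaffolding is meant to carry over.

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Small local SparkSession; the metric helpers operate on Spark DataFrames.
    return (SparkSession.builder
            .master("local[1]")
            .appName("fx-usage-report-tests")
            .getOrCreate())


def test_metric_counts_one_client(spark):
    df = spark.createDataFrame(
        [("client-a", "20180201", "US")],
        ["client_id", "submission_date_s3", "country"],
    )
    # In a real test, call the metric helper from usage_report/utils/ here,
    # e.g. result = get_daily_usage(df, date="20180201", country_list=["US"])  # hypothetical name
    result = df.filter(df.country == "US").count()
    assert result == 1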

To run these tests, first ensure you have Docker installed. Then build the container using

$ make build

then run the tests with

$ make test

finally,

$ make lint

runs the linter.

fx_usage_report's People

Contributors

benmiroglio, fbertsch, haroldwoo, jingsong29, suyounghong, tylerwx51


fx_usage_report's Issues

Usage report should support (re-)running of old versions

For example, say we want to re-run the run of two-weeks ago. What would currently happen is:

  1. Usage report would pull the data from the current master version
  2. Report would create data points for the date it's running on (2 weeks ago)
  3. Report would not add the new data

We need to change this so that when we run on an old date, it will:

  • Update the master JSON with the new date
  • Replace that date's JSON in S3
  • Replace the master version in S3 to include that new run's data

We will still have "incorrect" historical data in that the date-versioned run will contain data from dates after it, but that is not a big concern and won't break anything.
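A rough sketch of the merge step this implies. The master JSON layout used here (a metric name mapping to a list of per-date entries) is only an assumption; the real structure is described in DATAFORMAT.md.

def merge_run_into_master(master, run_data, date):
    """Insert or replace `date`'s data points in the master dict."""
    for metric, entries in run_data.items():
        # Drop any existing entries for this date, then add the new run's data.
        kept = [e for e in master.get(metric, []) if e.get("date") != date]
        master[metric] = sorted(kept + entries, key=lambda e: e["date"])
    return master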

Include logical cores (i.e. hyperthreads) in hardware data

This was requested by @lukewagner in a firefox-hardware-report issue.

The Hardware Report currently reports the number of physical processors (which is a great start), but it would be really helpful when tuning for parallelism to know the distribution of logical cores (i.e., hyperthreads). In particular, one big question I have is what % of our 2-physical-core users have hyperthreads (so 4 logical cores).

See the discussion in the firefox-hardware-report issue for more context.

Automatically Update FF release annotations

Right now we have to manually update the JSON file.

The release schedule is available over HTTP. Add a function that automatically checks it and updates the annotations JSON when the job is run.
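A rough sketch of how this could work, using Mozilla's public product-details service and the requests library. The annotations file layout shown here is illustrative and may not match the JSON this repo actually ships.

import json
import requests

RELEASES_URL = ("https://product-details.mozilla.org/1.0/"
                "firefox_history_major_releases.json")


def fetch_release_dates():
    # Returns a dict mapping version strings to release dates, e.g. {"57.0": "2017-11-14"}.
    response = requests.get(RELEASES_URL)
    response.raise_for_status()
    return response.json()


def update_annotations(path):
    # Rewrite the annotations JSON with one entry per major Firefox release.
    releases = fetch_release_dates()
    annotations = [
        {"annotation": "Firefox " + version, "date": date}
        for version, date in sorted(releases.items(), key=lambda kv: kv[1])
    ]
    with open(path, "w") as f:
        json.dump(annotations, f, indent=2)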

Tests are slow

Slow tests make development more difficult. The task here is to retain the same tests but just make them faster.

Desktop Metrics Part 2

Prepare batch 2 of the metrics.

Pre-work done in notebook (note to self, see fxpd-metrics-phase2.html in your workflow repo)

  • Add to pipeline
  • think about how to incorporate phase 2 of graphics / presentation
  • also, add the following potential metrics:
    • reader mode

Port code to Python 3

We're running into compatibility issues with dependencies since we're using Python 2.

Port the existing code to Python 3.

EDA on potential future metrics (Non-Urgent)

Priority: Non-urgent

Start groundwork for non-P1 metrics:

  • Age of clients
  • Addon Install Rate
  • Number of days used in last 7 days
  • Recency
  • Tabs Opened
  • Maximum Tabs
  • Crash 1
  • Crash Rate

  • any other metrics you had in mind

in notebooks:

  • write out technical definitions to stay consistent with our current data (week's worth of data, etc.)
  • test what they look like over a year date period
  • EDA any probes required (any version requirements, null value behavior, etc.). See attached file for examples of what to check for or ping me if you have any questions :)

Improve test coverage

We need to improve our test coverage to be more robust.

Currently, we have coverage for:

  • metric functions [w/country, w/out country]
  • integrated function [w/country, w/out country]

We can leave the metric functions as-is. But for the integrated function, let's add tests covering the following cases:

  • no data
  • multiple countries
    • include countries that are not in country list
    • include countries into country_list that are not in data
  • clients with only pings from outside date range
  • clients with some pings from outside date range
  • all/some of a given field are null, '', or zero
  • field is missing (it doesn't exist). Test should make sure this returns an error.
  • for metrics that give topN, data has ties in ranking

To Do: Add tests for integrated function to check different combinations of the above edge cases.
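A sketch of how these edge cases could be parametrized with pytest. The schema and case data are simplified stand-ins; the real tests would call the integrated function from usage_report/usage_report.py and assert on its output.

import pytest
from pyspark.sql import SparkSession

SCHEMA = "client_id string, submission_date_s3 string, country string"

EDGE_CASES = [
    ("no data", []),
    ("country not in country_list", [("a", "20180201", "XX")]),
    ("ping outside date range", [("a", "20170101", "US")]),
    ("null country field", [("a", "20180201", None)]),
]


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").getOrCreate()


@pytest.mark.parametrize("label,rows", EDGE_CASES)
def test_integrated_function_edge_cases(spark, label, rows):
    df = spark.createDataFrame(rows, SCHEMA)
    # Call the integrated function here and assert on the expected behaviour for
    # each case (e.g. empty output for "no data", an error for a missing field).
    assert df.count() == len(rows)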

Rerun data pull before launch

We made some changes to the code (such as top10addons).

Re-run the data pull prior to release (with some time cushion in case anything happens)

Consider adding forecasting trendlines to hardware graphs

It may be interesting to forecast out say 1 year (or whatever we feel is comfortable given the data) into the future to see how things may change that we might want to be prepared for, or just for informational purposes.

This would mostly be useful for the hardware report I think, as it's more about where the market is headed.

For example, if we look at display resolution, it looks like 1080p will take over as the primary display size in ~6 months, that users with < 4 GB of RAM will be rare in ~2 years, or that Windows 10 will be the primary OS by the end of this year.

Write utils for S3

The following functions:

function0: given a date, return an S3 bucket with the date in the path

function1: given a S3 bucket name, read from an S3 bucket and return the json sitting there (as a dict)

function2: given an S3 bucket name and a dict, write to an S3 bucket as a json

Why? Our script will have a day dict, which we need to update the previous data with, and write to S3.

Please put this in a util file inside usage_report.

FYI: you might want to look into boto3.
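A sketch of the three helpers using boto3. The key layout and function names are placeholders; only the boto3 calls themselves are meant as guidance.

import json
import boto3

s3 = boto3.client("s3")


def key_for_date(prefix, date):
    # function0: build the S3 key (path) for a given run date, e.g. 20180201.
    return "{}/{}/usage_report.json".format(prefix, date)


def read_json_s3(bucket, key):
    # function1: read a JSON object from S3 and return it as a dict.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


def write_json_s3(bucket, key, data):
    # function2: serialize a dict to JSON and write it to S3.
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(data).encode("utf-8"))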

CODE_OF_CONDUCT.md file missing

As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Publish Hardware Report data

We are currently transforming the current Hardware Report JSON into the format that the Public Data Report website expects. If FX_Usage_Report could output this data directly, in the same format as the Health and Usage datasets, the transformation would be unnecessary and the performance of the site would be better.

It would be nice if this could be done before release, but I don't think it's a blocker. The transformer we wrote does work, even if it is a little slow.

In addition to changing the format of the JSON, our transformer also changes the organization of the data. It removes some populations, renames some populations, and combines some populations into bigger ones. These changes are documented in the file population-modifications.json. Ideally these changes would be made on your end before spitting the data out.

tl;dr: Please publish the hardware data directly so that the extra transformation step becomes unnecessary. The resulting JSON should be identical to what our transformer spits out.

[P1] Get Updated Data Using Full Script

Why 1: We have some data that we pulled into https://metrics.mozilla.com/protected/usage-report-demo/dashboard/health right now, but that data has a bug where the metrics are mixed up between countries ('All' is reporting 'DE' values, for example). We need updated data that doesn't have this bug.
Why 2: The repo's master script is just about ready (with the above bug fixed) and has all the functionality we need. This will be a good opportunity to put it through its paces and test it on live data.

To Do:

  • write a shell script that executes the repo's main script with the below flags:
python usage_report/usage_report.py --date [some date, i.e. 20180201] --output-bucket telemetry-parquet --output-prefix fxpd/inaugural-run --sample 10 --lag-days 7

for a range of dates, starting with 20170101 and ending sometime last week, in 7-day intervals (so 20170101, 20170108, 20170115, 20170121, etc.)

  • execute this script and check the output in the S3 buckets. Make sure it's running correctly and that the data is correct.
    NOTE: you will need to run this from an ATMO cluster, and you'll need the latest code, which is currently in a pull request (hasn't been merged to master yet). Make sure to run this from a branch in the ATMO cluster's local repo that pulled directly from the pull request.

Deliverables: the shell script for running this (doesn't have to be pushed to repo), data in the s3 bucket.
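A sketch of such a driver, written in Python for consistency with the rest of the repo (a plain bash loop over the same dates would work just as well). The end date is a placeholder; use the most recent completed week.

import subprocess
from datetime import datetime, timedelta


def weekly_dates(start, end):
    # Yield YYYYMMDD date strings from `start` to `end` in 7-day steps.
    current = datetime.strptime(start, "%Y%m%d")
    stop = datetime.strptime(end, "%Y%m%d")
    while current <= stop:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=7)


for date in weekly_dates("20170101", "20180201"):  # end date is a placeholder
    subprocess.check_call([
        "python", "usage_report/usage_report.py",
        "--date", date,
        "--output-bucket", "telemetry-parquet",
        "--output-prefix", "fxpd/inaugural-run",
        "--sample", "10",
        "--lag-days", "7",
    ])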

New User algorithm undercounts new users

Currently the algorithm is:

  1. Calculate beginning and end of the previous week
  2. Filter out submission_date_s3s that fall outside of this range
  3. Filter out profile_creation_dates that fall outside of this range

The problem with this approach is that it leaves out any users who start in week 1 but submit their first ping in week 2. This could be especially prevalent for new_users who submit data on the last day of the week.

For example, consider a user living in the US who starts using on Saturday. We timestamp the end of the week as Sunday 00:00 UTC. So if this user submits their first ping at the end of their first day, e.g. after 17:00 CST, then that ping will be considered submitted on the "next" week, and will then not be included in any analysis:

  • It will not be included in this week's analysis because of condition (2.)
  • It will not be included in next week's analysis because of condition (3.)

If we are going to be filtering on submission_date_s3, we have to determine the earliest submission_date_s3 of the client, and use that as an indication that they are a new user instead. Otherwise we will be undercounting new users.
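A sketch of that change in PySpark, assuming main_summary column names. Note that the input here must include pings from before the current week so that the per-client minimum is meaningful.

from pyspark.sql import functions as F


def first_seen(df):
    # One row per client with the earliest submission_date_s3 observed.
    return (df.groupBy("client_id")
              .agg(F.min("submission_date_s3").alias("first_submission_date")))


def new_users_for_week(df, week_start, week_end):
    # Clients whose *first* ping falls inside [week_start, week_end] (YYYYMMDD strings).
    return (first_seen(df)
            .filter((F.col("first_submission_date") >= week_start) &
                    (F.col("first_submission_date") <= week_end)))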

Clean up window_version buckets and top_10_addons_on_date

Let's clean up the bucketing logic for window_version.
Specifically:

  • we're missing a version of Windows XP (os = 'Windows_NT' and os_version = '5.2')
  • older versions of Windows like Win95, Win98, etc. don't seem to be os='Windows_NT' and we're getting them as their own individual buckets (they should be in 'Other Windows')
  • Linux and Darwin are fine as is.

To Do: fix bucketing logic so we ONLY get Linux, Darwin, and the output of window_version from os_on_date.
To Do: run the function against a range of older dates (doing this ad hoc in a notebook is fine; in fact, it's preferred) to confirm that there are no other OS cases that our function is missing.
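A sketch of the bucketing described above. The Windows NT version-to-name mapping is standard; the exact bucket labels should match what os_on_date already emits.

WINDOWS_NT_BUCKETS = {
    "5.0": "Windows 2000",
    "5.1": "Windows XP",
    "5.2": "Windows XP",   # the missing case: 64-bit XP / Server 2003
    "6.0": "Windows Vista",
    "6.1": "Windows 7",
    "6.2": "Windows 8",
    "6.3": "Windows 8.1",
    "10.0": "Windows 10",
}


def os_bucket(os_name, os_version):
    if os_name == "Darwin":
        return "Darwin"
    if os_name == "Linux":
        return "Linux"
    if os_name == "Windows_NT":
        return WINDOWS_NT_BUCKETS.get(os_version, "Other Windows")
    # Win95/Win98-style names and anything else unexpected also land here
    # instead of becoming their own individual buckets.
    return "Other Windows"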


Also, very minor fix: right now top_10_addons_on_date is returning the addon names as 'name'. Rename this column to 'addon_name' and update all appropriate code that this function touches (tests, integration function, etc.)

Mobile data

Publish mobile data for all reports. At the time of this writing, I'm not too concerned about how you decide to format this: this data can be in the same files that you already export, new files, or something else. Let me know when this is ready or nearly ready so that I can add support for mobile data in workshop, ensemble-transposer, and ensemble.

Consider outputting all Darwin versions, even below 1%

This job appears to exclude any metric which has a value below 1%. Because of this, only one Darwin 18 metric is currently present in the dataset: osName_Darwin-18.7.0. Earlier versions like osName_Darwin-18.6.0 and osName_Darwin-18.5.0 do not appear.

Nonetheless, I group all metrics beginning with osName_Darwin-18 into a bucket called "macOS Mojave" on the front-end, which is really inaccurate. The true Mojave total should be the sum of osName_Darwin-18.7.0, osName_Darwin-18.6.0, osName_Darwin-18.5.0, etc. But only the first of those metrics currently appears in the dataset.

Would it be possible to provide metrics for all versions of Darwin, even those below 1%? That would allow me to provide accurate numbers for major Mac releases like Mojave, High Sierra, etc. Alternatively, maybe this job could do the grouping right here. Rather than outputting osName_Darwin-18.[1,2,3,4,5,6,7,etc.].0, it could output osName_Darwin-Mojave.
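A sketch of what grouping on the job side could look like. The mapping of Darwin major versions to macOS release names is standard (Darwin 17 = High Sierra, 18 = Mojave, 19 = Catalina).

DARWIN_MAJOR_TO_MACOS = {
    "15": "El Capitan",
    "16": "Sierra",
    "17": "High Sierra",
    "18": "Mojave",
    "19": "Catalina",
}


def group_darwin_version(os_version):
    # Map a Darwin version string like "18.7.0" to a macOS release name.
    major = os_version.split(".")[0]
    return DARWIN_MAJOR_TO_MACOS.get(major, "Other")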

FHR: pre-process bucketing

rewrite FHR to bucket CPU and GPU speeds

  • currently being done in the presentation layer
  • preprocess it here instead

Questions about GPU models

I have a couple of questions about GPU models that we see in the data.

  1. What is the difference between gpuModel_Other and gpuModel_Unknown? Would it be inaccurate to call these both "Other" in the front-end?
  2. Do you have any insight into what gpuModel_CIK-MULLINS, gpuModel_CIK-KABINI, or gpuModel_R600-RS880 represent? It looks like the first two are AMD GPUs and the third is an ATI GPU, but I'm not exactly sure how to turn them into human-readable names. I'm continuing to look into this but will take any advice you have.

Why is osName_Darwin-Other so high?

In the 2019-10-27 data of the hardware dataset, it appears that unknown versions of Darwin are being used much more widely than known versions of Darwin. Here are the three Darwin-related metrics from that date:

  • osName_Darwin-17.7.0 at 1.20%
  • osName_Darwin-18.7.0 at 1.67%
  • osName_Darwin-Other at 4.43%

What's going on that might explain this? It's notable that older point-releases like osName_Darwin-17.6.0 are not showing up in the latest data, but maybe that's because they're below 1%.

Add ability to toggle graph lines

Some lines within graphs are so close together that it is hard to drill down to the different results. Take Operating Systems, for example: comparing the numbers for different macOS versions is almost impossible because the lines overlap so much.

It would be great if lines could be toggled via their corresponding labels under the graphs.
