
velocity

Track development velocity

The '*.sql' files in the BigQuery folder are Google BigQuery queries that produce the CSV data files to be put in the data/ directory for processing

analysis.rb is a tool that processes input files (csv files from BigQuery results) and generates final data for Bubble/Motion Google Sheet Chart. This tool also uses:

  • a "hints" file with additional mappings: repo name -> project (N repos --> 1 project), so a given project name may be listed in many lines
  • a "urls" file which defines URLs for the listed projects (a separate file is used because otherwise we would have to duplicate the URL in every hints line for a project) (1 project --> 1 URL)
  • a "default" map file which defines non-standard names for projects generated automatically via grouping by org (like aspnet --> ASP.net), or groups multiple orgs and/or repos into a single project; it is the last step of project-name mapping. The tool outputs a data file into the projects/ directory
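
For illustration, here is a minimal Ruby sketch of that mapping precedence (this is not analysis.rb itself; the file layouts follow the formats described below, and the repo/org names are hypothetical):

require 'csv'

# Load the three map files (repo -> project, project -> URL, org -> project).
hints   = CSV.read('map/hints.csv',   headers: true).map { |r| [r['repo'], r['project']] }.to_h
urls    = CSV.read('map/urls.csv',    headers: true).map { |r| [r['project'], r['url']] }.to_h
defmaps = CSV.read('map/defmaps.csv', headers: true).map { |r| [r['name'], r['project']] }.to_h

# Resolve one repo row: an explicit hint wins, then default grouping by org,
# renamed via defmaps as the last step; '=SKIP' disables grouping for the org.
def resolve(repo, org, hints, defmaps)
  return hints[repo] if hints[repo]
  project = defmaps[org] || org
  project == '=SKIP' ? nil : project
end

project = resolve('Microsoft/TypeScript', 'Microsoft', hints, defmaps)
puts "#{project} -> #{urls[project]}"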

Example use:

ruby analysis.rb data/data_yyyymm.csv projects/projects_yyyymm.csv map/hints.csv map/urls.csv map/defmaps.csv skip.csv ranges.csv

The Top 30 open source projects process is described in the "Most up-to-date process" section.

The CNCF projects process is described in the "CNCF Projects" section.

File formats

input.csv data/data_yyyymm.csv from BigQuery, like the following:

org,repo,activity,comments,prs,commits,issues,authors
kubernetes,kubernetes/kubernetes,11243,9878,720,70,575,40
ethereum,ethereum/go-ethereum,10701,570,109,43,9979,14
...

output.csv is to be imported via Google Sheets (File -> Import), with the chart then created from this data. It looks like this:

org,repo,activity,comments,prs,commits,issues,authors,project,url
dotnet,corefx+coreclr+roslyn+cli+docs+core-setup+corefxlab+roslyn-project-system+sdk+corert+eShopOnContainers+core+buildtools,20586,14964,1956,1906,1760,418,dotnet,microsoft.com/net
kubernetes+kubernetes-incubator,kubernetes+kubernetes.github.io+test-infra+ingress+charts+service-catalog+helm+minikube+dashboard+bootkube+kargo+kube-aws+community+heapster,20249,15735,2013,1323,1178,423,Kubernetes,kubernetes.io
...
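
A minimal sketch of the grouping that produces such rows - repo short names joined with '+' and the numeric metrics summed per project. It assumes rows already carry a resolved project column (which the real tool derives via the mapping files; authors are additionally handled specially, as described in the motion-charts section):

require 'csv'

METRICS = %w[activity comments prs commits issues authors]

rows   = CSV.read('data/data_201704.csv', headers: true) # hypothetical input file
groups = rows.group_by { |r| r['project'] }

groups.each do |project, rs|
  repos = rs.map { |r| r['repo'].split('/').last }.uniq.join('+')
  sums  = METRICS.map { |m| rs.sum { |r| r[m].to_i } }
  puts ([project, repos] + sums).join(',')
end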

hints.csv a csv file with hints for repo --> project mapping, it has this format:

repo,project
Microsoft/TypeScript,Microsoft TypeScript
...

urls.csv a csv file with project --> url mapping with the following format:

project,url
Angular,angular.io
...

defmaps.csv a csv file with proper names for projects generated by default grouping within an org:

name,project
aspnet,ASP.net
nixpkgs,NixOS
Azure,=SKIP
...

The special flag '=SKIP' for a project means that this org should NOT be grouped

skip.csv a csv file that contains lists of repos and/or orgs and/or projects to be skipped in the analysis:

org,repo,project
"enkidevs,csu2017sp314,thoughtbot,illacceptanything,RubySteps,RainbowEngineer",Microsoft/techcasestudies,"Apache (other),OpenStack (other)"
"2015firstcmsc100,swcarpentry,exercism,neveragaindottech,ituring","mozilla/learning.mozilla.org,Microsoft/HolographicAcademy,w3c/aria-practices,w3c/csswg-test",
"orgX,orgY","org1/repo1,org2/repo2","project1,project2"
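
Note that each cell can itself hold a quoted, comma-separated list, so the file has to be parsed as CSV first and the cells split afterwards. A minimal reading sketch (assuming the layout above):

require 'csv'
require 'set'

skip_orgs, skip_repos, skip_projects = Set.new, Set.new, Set.new

CSV.foreach('map/skip.csv', headers: true) do |row|
  skip_orgs.merge(row['org'].to_s.split(','))         # nil cells become ''
  skip_repos.merge(row['repo'].to_s.split(','))
  skip_projects.merge(row['project'].to_s.split(','))
end

puts skip_repos.include?('Microsoft/techcasestudies') # => true for the example above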

ranges.csv a csv file that contains ranges of repo properties which determine whether a repo is included in the calculations. It can constrain any of "commits, prs, comments, issues, authors" to be within a range n1 .. n2 (if n1 or n2 < 0 then that bound is skipped, so -1..-1 means unlimited). There can also be exception repos/orgs that do not use those ranges:

key,min,max,exceptions
activity,50,-1,"kubernetes,docker/containerd,coreos/rkt"
comments,20,100000,"kubernetes,docker/containerd,coreos/rkt"
prs,10,-1,"kubernetes,docker/containerd,coreos/rkt"
commits,10,-1,"kubernetes,kubernetes-incubator"
issues,10,-1,"kubernetes,docker/containerd,coreos/rkt"
authors,3,-1,"kubernetes,docker/containerd,google/go-github"
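
A minimal sketch of applying such ranges (assuming the semantics above: a negative bound disables that side of the check, and repos or orgs on the exceptions list bypass the range entirely; this is not the exact analysis.rb logic):

require 'csv'

ranges = CSV.read('map/ranges.csv', headers: true)

# True when a data row satisfies every configured range (or is an exception).
def in_ranges?(row, ranges)
  ranges.all? do |rng|
    exceptions = rng['exceptions'].to_s.split(',')
    next true if exceptions.include?(row['repo']) || exceptions.include?(row['org'])
    value = row[rng['key']].to_i
    min, max = rng['min'].to_i, rng['max'].to_i
    (min < 0 || value >= min) && (max < 0 || value <= max)
  end
end

CSV.foreach('data/data_201704.csv', headers: true) do |row| # hypothetical data file
  puts row['repo'] if in_ranges?(row, ranges)
end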

The generated output file contains all the processed input data (so it can be, for example, 600 rows for 1000 input rows). You should manually review the generated output and choose how many records you need.

hintgen.rb is a tool that takes data already processed for the various charts and creates a distinct-projects hints file from it:

ruby hintgen.rb data.csv map/hints.csv

Use it multiple times, passing a different data file (the 1st parameter) each run, to generate the final hints.csv.

Results

Data files existing in the repository:

  • data/data_YYYYMM.csv --> data for given YYYYMM from BigQuery.
  • projects/projects_YYYYMM.csv --> data generated by analysis.rb based on data_YYYYMM.csv using: map/: hints.csv, urls.csv, defmaps.csv

Motion charts

generate_motion.rb a tool that merges data from multiple files into one, to be used for the motion chart. Usage:

ruby generate_motion.rb projects/files.csv motion/motion.csv motion/motion_sums.csv [projects/summaries.csv]

The file files.csv contains a list of data files to be merged; it has the following format:

name,label
projects/projects_201601.csv,01/2016
projects/projects_201602.csv,02/2016
...

This tool generates 2 output files:

  • the 1st contains motion data from each file with its given label
  • the 2nd contains the cumulative sum of the data, so the 1st label contains data from the 1st file, the 2nd contains 1st+2nd, the 3rd 1st+2nd+3rd, and so on; the last label is the sum of all data. Labels are summed up in alphabetical order, so when input data is divided by months, the "YYYYMM" or "YYYY-MM" format must be used to get correct results; "MM/YYYY" would, for example, sort "2/2016" after "1/2017"

Output formats of 1st and 2nd files are identical.

In files.csv, the first column is a data file generated by analysis.rb, and the second column is a label that will be used as "time" for the Google Sheets motion chart. The output is in this format:

project,url,label,activity,comments,prs,commits,issues,authors,sum_activity,sum_comments,sum_prs,sum_commits,sum_issues,sum_authors
Kubernetes,kubernetes.io,2016-01,6289,5211,548,199,331,73,174254,136104,18264,8388,11498,373
Kubernetes,kubernetes.io,2016-02,13021,10620,1180,360,861,73,174254,136104,18264,8388,11498,373
...
Kubernetes,kubernetes.io,2017-04,174254,136104,18264,8388,11498,373,174254,136104,18264,8388,11498,373
dotnet,microsoft.com/net,2016-01,8190,5933,779,760,718,158,158624,111553,17019,17221,12831,382
dotnet,microsoft.com/net,2016-02,17975,12876,1652,1908,1539,172,158624,111553,17019,17221,12831,382
...
dotnet,microsoft.com/net,2017-04,158624,111553,17019,17221,12831,382,158624,111553,17019,17221,12831,382
VS Code,code.visualstudio.com,2016-01,7526,5278,381,804,1063,112,155621,104386,9501,17650,24084,198
VS Code,code.visualstudio.com,2016-02,17139,11638,986,1899,2616,133,155621,104386,9501,17650,24084,198
...
VS Code,code.visualstudio.com,2017-04,155621,104386,9501,17650,24084,198,155621,104386,9501,17650,24084,198
...

Each row contains its label's data (separate or cumulative), whereas the columns starting with sum_ contain cumulative data for all labels. This makes the data ready for a Google Sheets motion chart without complex cell indexing.
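
A minimal sketch of that cumulative pass (assuming per-label rows as in the format above). Sorting the labels alphabetically is what makes "YYYY-MM" labels work and "MM/YYYY" labels fail:

require 'csv'

METRICS = %w[activity comments prs commits issues authors]

rows   = CSV.read('motion/motion.csv', headers: true)
totals = Hash.new { |h, k| h[k] = Hash.new(0) }

# Alphabetical label order == chronological order for YYYY-MM labels.
rows.sort_by { |r| r['label'] }.each do |r|
  METRICS.each { |m| totals[r['project']][m] += r[m].to_i }
  cumulative = METRICS.map { |m| totals[r['project']][m] }
  puts ([r['project'], r['url'], r['label']] + cumulative).join(',')
end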

The final (optional) file summaries.csv is used to read the number of authors, because the number of authors is computed differently. Without the summaries file (or if a given project is not in it), we have the number of distinct authors in each period, and the summary value is the max over all periods - which is obviously not the real count of distinct authors across all periods. The true number of authors is computed only if the extra file is supplied: one containing summary data for a single longer period equal to the sum of all periods.
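
To see why per-period author counts cannot simply be added, compare a set union with a plain sum - the summaries file exists precisely because only data covering the whole period yields the true distinct count. The author names below are hypothetical:

require 'set'

jan = Set['alice', 'bob', 'carol'] # distinct authors in period 1
feb = Set['bob', 'carol', 'dave']  # distinct authors in period 2

puts jan.size + feb.size # 6 - a naive sum double-counts bob and carol
puts (jan | feb).size    # 4 - the true distinct author count across both periods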

Adding non-GitHub projects

To manually add other projects (like Linux), use add_linux.sh or create similar tools for other projects. Data for this tool was generated manually using a custom gitdm tool (GitHub: cncf/gitdm) on the torvalds/linux repo, and by manually counting email addresses in different periods on LKML. Example usage (assuming the additional Linux data is in data/data_linux.csv): ruby add_linux.rb data/data_201603.csv data/data_linux.csv 2016-03-01 2016-04-01

A larger-scope (e.g. GitHub data) file can be injected with such custom script results (from GitLab, Linux or other external sources) by the merger script: ruby merger.rb file_to_merge.csv file_to_get_data_from.csv. See for example ./shells/top30_201605_201704.sh. Every merge will compound data into the merged file.

Processing unlimited BigQuery data

This means moving some filtering out of BigQuery and letting the Ruby tools perform it instead.

To process "unlimited" data from the BigQuery output (file data/unlimited.csv), use shells/unlimited.sh or shells/unlimited_both.sh. Unlimited means that BigQuery does not constrain repositories by having commits, comments, issues, PRs or authors > N (where N is 5-50 depending on the metric: authors is 5, for example, while comments is 50); it only requires that authors, comments, commits, prs and issues are all > 0. Then only the CSV map/ranges_unlimited.csv is used to further constrain the data. This basically moves filtering out of BigQuery (so it can be called once) into the Ruby tool. shells/unlimited_both.sh uses a map/ranges_unlimited.csv that does not set ANY limit:

key,min,max,exceptions
activity,-1,-1,
comments,-1,-1,
prs,-1,-1,
commits,-1,-1,
issues,-1,-1,
authors,-1,-1,

This means the mapping must have an extremely long list of projects from repos/orgs to get valid, non-obfuscated data.

You can skip a ton of an organization's small repos (when they are distinct small projects that do not sum up to just a few projects) with: rauth[res[res.map { |i| i[0] }.index('Google')][0]].select { |i| i.split(',')[1].to_i < 14 }.map { |i| i.split(',')[0] }.join(','). The following is an example based on Google. Say the Top 100 projects list has a 100th project with 290 authors; all the tiny Google repos (distinct small projects) would otherwise sum up and make Google, for example, 15th overall. The command above generates a list of Google repos with 13 authors or less. Put the results in map/skip.csv and you'll avoid a false-positive top-15 entry for Google overall (which would not be true).
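
Outside the pry session, the same idea can be written as a standalone sketch: given per-repo distinct-author counts for one org (the hash below is hypothetical), emit the repos under the threshold as a ready-to-paste skip list:

# Hypothetical per-repo distinct-author counts for one org.
authors_by_repo = {
  'google/big-project'  => 290,
  'google/tiny-tool'    => 13,
  'google/one-off-demo' => 4
}

THRESHOLD = 14 # matches the "13 authors or less" cut-off in the example above
skip_list = authors_by_repo.select { |_, n| n < THRESHOLD }.keys.join(',')
puts skip_list # => "google/tiny-tool,google/one-off-demo" - paste into map/skip.csv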

Adding external projects' data

There is also a tool to add data for external projects (not hosted on GitHub): add_external.rb. It is used by shells/unlimited.sh and shells/unlimited_both.sh. Example call: ruby add_external.rb data/unlimited.csv data/data_gitlab.csv 2016-05-01 2017-05-01 gitlab gitlab/GitLab. It requires a CSV file with the external repo data, defined per date range, in this format (see data/data_gitlab.csv for an example):

org,repo,from,to,activity,comments,prs,commits,issues,authors
gitlab,gitlab/GitLab,2016-05-01,2017-05-01,40000,40000,11595,9479,22821,1500

There is also a tool to update the generated projects file, which in turn is used to import data for charts: update_projects.rb (listed in shells/unlimited_both.sh). It is used to update certain values in given projects. It processes an input file with the following format:

project,key,value
Apache Mesos,issues,7581
Apache Spark,issues,5465
Apache Kafka,issues,1496
Apache Camel,issues,1284
Apache Flink,issues,2566
Apache (other),issues,52578

This allows updating specific keys in specific projects with data taken from sources other than GitHub. It is currently used to update GitHub data with issues statistics from Jira (for Apache projects).
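
A minimal sketch of that update step (assuming the projects file has one column per key, as in the output format above; the real update_projects.rb also takes an extra mode argument):

require 'csv'

projects = CSV.read('projects/unlimited_both.csv', headers: true)
updates  = CSV.read('data/data_apache_jira.csv', headers: true)

updates.each do |u|
  row = projects.find { |p| p['project'] == u['project'] }
  row[u['key']] = u['value'] if row # e.g. set 'issues' for 'Apache Mesos'
end

File.write('projects/unlimited_both.csv', projects.to_csv)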

Project ranks

A tool to create ranks per project (for all of a project's numeric properties): report_projects_ranks.rb and shells/report_cncf_project_ranks.sh. The shell script processes projects from projects/unlimited_both.csv and uses the reports/cncf_projects_config.csv file to get the list of projects to be included in the rank statistics. The file format is:

project
project1
project2
...
projectN

It outputs a rank statistics file: reports/cncf_projects_ranks.txt.

Examples of external (non-GitHub) data processing

For special cases, see ./shells/unlimited_both.sh, which calls all scripts in the correct order. Some details about adding external data from non-GitHub projects:

  • How to find Apache issues in Jira: res/data_apache_jira.query

  • Case with Chromium (details here: res/data_chromium_bugtracker.txt): issues come from their bug tracker; the number of authors and commits in a date range comes from git log one-liners. These must be run in a Git repo cloned from GoogleSource (not from GitHub): git clone https://chromium.googlesource.com/chromium/src. Commits: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%H" | sort | uniq | wc -l gives 77437. Authors: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l gives 1663. To analyze those commits (such as to exclude merge and robot commits) via data/data_chromium_commits.csv, run while in the chromium/src repository: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE~~~~%aN~~~~%H~~~~%s" | sort | uniq > chromium_commits.csv. Then remove special CSV characters with the VI commands :%s/"//g and :%s/,//g, add a CSV header row manually ("email,name,hash,subject") and move the file to data/data_chromium_commits.csv. Finally, replace '~~~~' with ',' to create correct CSV: :%s/\~\~\~\~/,/g. Then run ruby commits_analysis.rb data/data_chromium_commits.csv map/skip_commits.csv or ./shells/chromium_commits_analysis.sh.

  • Case with OpenStack: res/data_openstack_lanuchpad.query - data from their Launchpad

  • Case with WebKit: res/data_webkit_links.txt - issues from their bug tracker: https://webkit.org/reporting-bugs/. For authors and commits, 3 different tools were tried: our cncf/gitdm on their webkit/WebKit GitHub repo, and git one-liners on the same repo (git clone git://git.webkit.org/WebKit.git WebKit). Authors: 121: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l. Authors: 121: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%cE" | sort | uniq | wc -l. Commits: 13051: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%H" | sort | uniq | wc -l. Our cncf/gitdm output files are also stored in res/webkit/: WebKit_2016-05-01_2017-05-01.csv, WebKit_2016-05-01_2017-05-01.txt.

We also tried an SVN one-liner on their original SVN repo (its GitHub repo is only a mirror). To fetch the SVN repo: svn checkout https://svn.webkit.org/repository/webkit/trunk WebKit, or: tar jxvf WebKit-SVN-source.tar.bz2; cd webkit; svn switch --relocate http://svn.webkit.org/repository/webkit/trunk https://svn.webkit.org/repository/webkit/trunk. Finally, run their script: update-webkit.

Number of commits: svn log -q -r {2016-05-01}:{2017-05-01} | sed '/^-/ d' | cut -f 1 -d "|" | sort | uniq | wc -l. Number of authors: svn log -q -r {2016-05-01}:{2017-05-01} | sed '/^-/ d' | cut -f 2 -d "|" | sort | uniq | wc -l. To get the raw data from SVN - Revisions: svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 1 -d "|". Authors: svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 2 -d "|". Dates: svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 3 -d "|".

  • GitLab estimation and details here: res/gitlab_estims.txt
  • LibreOffice case: see res/libreoffice_git_repo.txt

Special GitHub projects (like mirrors, backups etc.)

To add a new non-standard project (from GitHub mirrors, which can have 0s for comments, commits, issues, prs, activity or authors) follow this route:

  • Copy BigQuery/org_finder.sql to the clipboard and run it on BigQuery, replacing the condition for the org (for example lower(org.login) like '%your%org%')
  • Examine the output org/repo combinations (manually on GitHub) and decide on the conditions for the final BigQuery run
  • Copy BigQuery/query_apache_projects.sql into some BigQuery/query_your_project.sql then update conditions to those found in the previous step
  • Run the query
  • Save results to a table. Export this table to GStorage. Download this table as CSV from GStorage into data/data_your_project_datefrom_date_to.csv
  • Add this to shells/unlimited_both.sh:
echo "Adding/Updating YourProject case"
ruby merger.rb data/unlimited.csv data/data_your_project_datefrom_date_to.csv
  • Update map/range*.csv - add an exception for YourProject (because it can now have 0s - this is output from BigQuery without numeric conditions)
  • Run shells/unlimited_both.sh and examine your project (a few iterations may be needed to add correct mappings in ./map/: hints, defmaps, urls etc.)
  • You can run manually: ruby analysis.rb data/unlimited.csv projects/unlimited_both.csv map/hints.csv map/urls.csv map/defmaps.csv map/skip.csv map/ranges_sane.csv
  • For example, see YourProject's rank: res.map { |i| i[0] }.index('LibreOffice') or res[res.map { |i| i[0] }.index('LibreOffice')][2][:sum]
  • Some of the values will be missing (like for example PRs for mirror repos)
  • Now it is time for the non-standard path; please see shells/unlimited_both.sh for the non-standard data update that comes after the final ruby analysis.rb call - this is usually different for each non-standard project

Most up-to-date process

To generate all data for the Top 30 chart: https://docs.google.com/spreadsheets/d/1hD-hXlVT60AGhGVifNn7nNo9oVMKnIoQ2kBNmx-YY8M/edit?usp=sharing

  • Fetch all necessary data using BigQuery or use data already fetched present in this repo.
  • If you fetched new BigQuery data, re-run the special projects BigQuery analysis scripts: ./shells/: run_apache.sh, run_chrome_chromium.sh, run_cncf.sh, run_openstack.sh
  • To just regenerate all other data: run ./shells/unlimited_both.sh
  • See the per-project rank statistics: reports/cncf_projects_ranks.txt
  • Get the final output file projects/unlimited.csv and import it at cell A50 in the https://docs.google.com/spreadsheets/d/1hD-hXlVT60AGhGVifNn7nNo9oVMKnIoQ2kBNmx-YY8M/edit?usp=sharing chart

Example - generate chart for a new data range

We already have shells/unlimited_both.sh that generates our chart for 2016-05-01 to 2017-05-01. We want to generate the chart for a new date range: 2016-06-01 to 2017-06-01. This is a step by step tutorial on how to do it.

  • Copy shells/unlimited_both.sh to shells/unlimited_20160601-20170601.sh
  • Keep shells/unlimited_20160601-20170601.sh open in another terminal window (vi shells/unlimited_20160601-20170601.sh); we need to update all of its steps
  • First we need unlimited BigQuery output for a new date range:
echo "Restoring BigQuery output"
cp data/unlimited_output_201605_201704.csv data/unlimited.csv
  • We need the data/unlimited_output_201606_201705.csv file. To generate this one, we need to run BigQuery for the new date range.
  • Open the sql file that generated the current range's data: vi BigQuery/query_201605_201704_unlimited.sql
  • Save it as: BigQuery/query_201606_201705_unlimited.sql after changing the date ranges in SQL.
  • Copy it to clipboard pbcopy < BigQuery/query_201606_201705_unlimited.sql and run in Google BigQuery: https://bigquery.cloud.google.com/queries/<<your_google_project_name>>
  • Save the result to a table <<your_google_user_name>>:unlimited_201606_201705 via "Save as table"; the query processes about 1 TB and costs about $5
  • Open this table <<your_google_user_name>>:unlimited_201606_201705 and click "Export Table" to export it to google storage as: gs://<<your_google_user_name>>/unlimited_201606_201705.csv (You may click "View files" to see files in your gstorage)
  • Go to google storage and download <<your_google_user_name>>/unlimited_201606_201705.csv and put it where shells/unlimited_20160601-20170601.sh expects it (update the file name to data/unlimited_output_201606_201705.csv):
echo "Restoring BigQuery output"
cp data/unlimited_output_201606_201705.csv data/unlimited.csv
  • So we have the main data (step 1) ready for the new chart. Now we need data for all the non-standard projects. You can try the analysis tool without any special projects by running: ruby analysis.rb data/unlimited.csv projects/unlimited_both.csv map/hints.csv map/urls.csv map/defmaps.csv map/skip.csv map/ranges_sane.csv
  • There can be some new projects that are unknown, and ranks can change during this step, so manual changes to the mappings in the map/ directory may be needed: hints.csv, defmaps.csv and urls.csv - and possibly also skip.csv (if there are new projects that should be skipped)
  • This is what came out on the 1st run:
Project #23 (org, 457) skillcrush (skillcrush) (skillcrush-104) have no URL defined
Project #45 (org, 366) pivotal-cf (pivotal-cf) (...) have no URL defined
Project #50 (org, 353) Automattic (Automattic) (...) have no URL defined
  • Let's see the top-author repos for one of those not-found projects: rauth[res[res.map { |i| i[0] }.index('Automattic')][0]]
  • Then we must add entries for the few top ones in map/hints.csv, say those with >= 20 authors:
Automattic/amp-wp,31
Automattic/wp-super-cache,29
Automattic/simplenote-electron,22
Automattic/happychat-service,21
Automattic/kue,20

We need to examine each one on github.com, like the 1st repo: github.com/Automattic/amp-wp. We see that this is a WordPress plugin, so it belongs to the WordPress/WP Calypso project: grep -HIn "wordpress" map/*.csv; grep -HIn "WP Calypso" map/*.csv. We see that we have WP Calypso defined in the hints file:

map/hints.csv:23:Automattic/WP-Job-Manager,WP Calypso
map/hints.csv:24:Automattic/facebook-instant-articles-wp,WP Calypso
map/hints.csv:26:Automattic/sensei,WP Calypso
map/hints.csv:29:Automattic/wp-calypso,WP Calypso
map/hints.csv:30:Automattic/wp-e2e-tests,WP Calypso
map/urls.csv:438:WP Calypso,developer.wordpress.com/calypso

Just add a new repo mapping row for this project in map/hints.csv: Automattic/amp-wp,WP Calypso. Do the same for the other projects/repos. Re-run the analysis tool until all is fine.

  • For example, after defining some new projects we see "EPFL-SV-cpp-projects" in the top 50. This is an educational org that should be skipped: add a skip row to map/skip.csv: EPFL-SV-cpp-projects,,

  • Once you have all URLs defined and the new mappings added, you can see a preview of the top projects while stopped in binding.pry by typing all. Now we need to go back to shells/unlimited_20160601-20170601.sh and regenerate all the non-standard data (for projects not on GitHub or requiring special GitHub queries - for example because they have 0 activity, comments, commits, issues, prs or authors)

  • Now the Linux case: we need to change the line ruby add_linux.rb data/unlimited.csv data/data_linux.csv 2016-05-01 2017-05-01 into ruby add_linux.rb data/unlimited.csv data/data_linux.csv 2016-06-01 2017-06-01 and run it

  • You will see: Data range not found in data/data_linux.csv: 2016-06-01 - 2017-06-01, which means you need to add the new date range for Linux to data/data_linux.csv

  • Data for Linux is here: https://docs.google.com/spreadsheets/d/1CsdreHox8ev89WoP6LjcryroKDOH2gQipMC9oS95Zhc/edit?usp=sharing - but it doesn't have May 2017 (the month had just finished), so we need the last month's data

  • Go to https://lkml.org/lkml/2017 and copy the May 2017 count into the linked Google spreadsheet: (22110)

  • Add a row for May 2017 to data/data_linux.csv: torvalds,torvalds/linux,2017-05-01,2017-06-01,0,0,0,0,22110 - note that for now we only have the "emails" column; the other columns must be fetched from the Linux kernel repo using the cncf/gitdm analysis:

  • You can also sum up the emails from the sheet to get 2016-06-01 - 2017-06-01 (254893): torvalds,torvalds/linux,2016-06-01,2017-06-01,0,0,0,0,254893

  • Now run cncf/gitdm on the Linux kernel repo: cd ~/dev/linux && git checkout master && git reset --hard && git pull. An alternative (if you don't have the Linux repo cloned) is: cd ~/dev/; git clone https://github.com/torvalds/linux.git

  • Go to cncf/gitdm: cd ~/dev/cncf/gitdm, run: ./linux_range.sh 2017-05-01 2017-06-01

  • While on cncf/gitdm, see: vim linux_stats/range_2017-05-01_2017-06-01.txt:

Processed 1219 csets from 424 developers
34 employers found
A total of 24970 lines added, 14469 removed (delta 10501)
  • You have values for changesets,additions,removals,authors here; update cncf/velocity/data/data_linux.csv accordingly.
  • Do the same for ./linux_range.sh 2016-06-01 2017-06-01 and linux_stats/range_2016-06-01_2017-06-01.txt. Results:
Processed 64482 csets from 3803 developers
91 employers found
A total of 3790914 lines added, 1522111 removed (delta 2268803)
  • Final linux rows (one for May 2017, another for last year including May 2017) are:
torvalds,torvalds/linux,2017-05-01,2017-06-01,1219,24970,14469,424,22110
torvalds,torvalds/linux,2016-06-01,2017-06-01,64482,3790914,1522111,3803,254893
  • GitLab case: their repo is https://gitlab.com/gitlab-org/gitlab-ce/; clone it via git clone https://gitlab.com/gitlab-org/gitlab-ce.git in the ~/dev/ directory.
  • Their repo hosted on GitHub is https://github.com/gitlabhq/gitlabhq; clone it via git clone https://github.com/gitlabhq/gitlabhq.git in the ~/dev/ directory.
  • Go to cncf/gitdm and run GitLab repo analysis: ./repo_in_range.sh ~/dev/gitlab-ce/ gitlab 2016-06-01 2017-06-01
  • Results are output to other_repos/gitlab_2016-06-01_2017-06-01.txt:
Processed 16574 csets from 513 developers
15 employers found
A total of 926818 lines added, 548205 removed (delta 378613)
  • Their bug tracker is https://gitlab.com/gitlab-org/gitlab-ce/issues; just count issues in the given date range. Sort by "Last created" and count: there are 732 pages of issues (20 per page) = 14640 issues (https://gitlab.com/gitlab-org/gitlab-ce/issues?page=732&scope=all&sort=created_desc&state=all)

  • To count Merge Requests (PRs): https://gitlab.com/gitlab-org/gitlab-ce/merge_requests?scope=all&state=all - 371.5 pages * 20 = 7430 Merge Requests

  • To count authors run in gitlab-ce directory: git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE" | sort | uniq | wc -l --> 575

  • To count authors run in gitlab-ce directory: git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l --> 589

  • Cloud Foundry case:

  • Copy: BigQuery/query_cloudfoundry_201605_201704.sql to BigQuery/query_cloudfoundry_201606_201705.sql and update conditions. Then run query in the BigQuery console (see details at the beginning of example)

  • Finally, you will have data/data_cloudfoundry_201606_201705.csv (run query, save results to table, export table to gstorage, download csv from gstorage).

  • Update (and eventually manually run) the CF case (in shells/unlimited_20160601-20170601.sh): ruby merger.rb data/unlimited.csv data/data_cloudfoundry_201606_201705.csv force

  • CNCF Projects case

  • We have the line ruby merger.rb data/unlimited.csv data/data_cncf_projects.csv, which needs to be changed to ruby merger.rb data/unlimited.csv data/data_cncf_projects_201606_201705.csv

  • Copy: cp BigQuery/query_cncf_projects.sql BigQuery/query_cncf_projects_201606_201705.sql, update conditions: BigQuery/query_cncf_projects_201606_201705.sql

  • Run on BigQuery and do the same as in the CF case. The final output file will be: data/data_cncf_projects_201606_201705.csv

  • Final line should be (try it): ruby merger.rb data/unlimited.csv data/data_cncf_projects_201606_201705.csv

  • WebKit case

  • Change the merger line to ruby merger.rb data/unlimited.csv data/webkit_201606_201705.csv

  • WebKit has no usable data on GitHub, so running BigQuery is not needed; we no longer need those lines for WebKit (we will just update the data/webkit_201606_201705.csv file), so remove them from the current shell shells/unlimited_20160601-20170601.sh:

echo "Updating WebKit project using gitdm and other"
ruby update_projects.rb projects/unlimited_both.csv data/data_webkit_gitdm_and_others.csv -1
Processed 13337 csets from 125 developers
6 employers found
A total of 11838610 lines added, 3105609 removed (delta 8733001)

  • OpenStack case: issues come from their Launchpad bug tracker (see res/data_openstack_lanuchpad.query); count the bugs per project in the date range. Example URL for Murano: https://bugs.launchpad.net/murano/+bugs?field.searchtext=&search=Search&field.status%3Alist=NEW&field.status%3Alist=OPINION&field.status%3Alist=INVALID&field.status%3Alist=WONTFIX&field.status%3Alist=EXPIRED&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=FIXRELEASED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&orderby=-datecreated&memo=425&start=350&direction=backwards

  • The final line should be ruby update_projects.rb projects/unlimited_both.csv data/data_openstack_bugs_201606_201705.csv -1

  • Apache case:

  • Exactly the same BigQuery steps as in the OpenStack example. The final line should be ruby merger.rb data/unlimited.csv data/data_apache_201606_201705.csv

  • cp BigQuery/query_apache_projects.sql BigQuery/query_apache_projects_201606_201705.sql, update conditions, run BigQ, download results to data/data_apache_201606_201705.csv

  • Run ruby merger.rb data/unlimited.csv data/data_apache_201606_201705.csv

  • Now we need more data for Apache from their Jira; first copy the file from the previous date range: cp data/data_apache_jira.csv data/data_apache_jira_201606_201705.csv

  • Now go to their Jira: issues.apache.org/jira/browse; you may set conditions to find issues, like this:

project not in (FLINK, MESOS, SPARK, KAFKA, CAMEL, FLINK, CLOUDSTACK, BEAM, ZEPPELIN, CASSANDRA, HIVE, HBASE, HADOOP, IGNITE, NIFI, AMBARI, STORM, "Traffic Server", "Lucene - Core", Solr, CarbonData, GEODE, "Apache Trafodion", Thrift, Kylin) AND created >= 2016-05-01 AND created <= 2017-05-01

Example URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2769?jql=project%20not%20in%20(FLINK%2C%20MESOS%2C%20SPARK%2C%20KAFKA%2C%20CAMEL%2C%20FLINK%2C%20CLOUDSTACK%2C%20BEAM%2C%20ZEPPELIN%2C%20CASSANDRA%2C%20HIVE%2C%20HBASE%2C%20HADOOP%2C%20IGNITE%2C%20NIFI%2C%20AMBARI%2C%20STORM%2C%20%22Traffic%20Server%22%2C%20%22Lucene%20-%20Core%22%2C%20Solr%2C%20CarbonData%2C%20GEODE%2C%20%22Apache%20Trafodion%22%2C%20Thrift%2C%20Kylin)%20AND%20created%20%3E%3D%202016-05-01%20AND%20created%20%3C%3D%202017-05-01. We need Mesos, Spark, Kafka, Camel and Flink (the query above is for the other projects; these five are not included in it). Query for Mesos in our data range: project in (Mesos) AND created >= 2016-06-01 AND created <= 2017-06-01 --> 2055. Do this for all five projects.

  • Final line for Apache should be: ruby update_projects.rb projects/unlimited_both.csv data/data_apache_jira_201606_201705.csv -1

  • Chromium case

  • Beginning (BigQuery part) exactly the same as Apache or OpenStack (just replace with word chromium): ruby merger.rb data/unlimited.csv data/data_chromium_201606_201705.csv

  • Now the manual part - copy data/data_chromium_bugtracker.csv to data/data_chromium_bugtracker_201606_201705.csv (we need to generate this file)

  • Get issues from their bug tracker: https://bugs.chromium.org/p/chromium/issues/list?can=1&q=opened%3E2016%2F7%2F25&colspec=ID+Pri+M+Stars+ReleaseBlock+Component+Status+Owner+Summary+OS+Modified&x=m&y=releaseblock&cells=ids. All issues + opened>2016/7/19 gives 63565 (for 2016/7/18 it gives "63822+", i.e. a non-exact number), so we extrapolate from there. All issues + opened>2017/6/1 gives 325, so we have 63565 - 325 = 63240 issues in 2016-07-19 - 2017-06-01. In irb: require 'date'; Date.parse('2017-06-01') - Date.parse('2016-07-19') --> 317; Date.parse('2017-06-01') - Date.parse('2016-06-01') --> 365; 63240.0 * (365.0 / 317.0) --> 72815. Now add chromedriver too: all issues opened>2017/6/1 --> 1; all issues opened>2016/6/1 --> 430. So there are 429 chromedriver issues, and the total is 429 + 72815 = 73244 (see the snippet below).
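
The extrapolation from that step as a checkable Ruby snippet (numbers taken from the walkthrough above; the truncation matches the irb result):

require 'date'

issues_partial = 63565 - 325 # issues opened 2016-07-19 .. 2017-06-01
days_covered   = (Date.parse('2017-06-01') - Date.parse('2016-07-19')).to_f # 317.0
days_wanted    = (Date.parse('2017-06-01') - Date.parse('2016-06-01')).to_f # 365.0

estimate     = (issues_partial * days_wanted / days_covered).to_i # 72815
chromedriver = 430 - 1                                            # 429
puts estimate + chromedriver                                      # 73244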

  • Now the Chromium commits analysis, which is quite complex

  • Their sources (all projects) are here: https://chromium.googlesource.com

  • Clone chromium/src in ~/dev/src/: git clone https://chromium.googlesource.com/chromium/src

  • Commits: git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%H" | sort | uniq | wc -l gives 79144 (FYI only - this is way too many; there are bot commits included)

  • Authors: git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE" | sort | uniq | wc -l gives 1697. To analyze those commits (and exclude merge and robot commits), run while in the chromium/src repository: git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE~~~~%aN~~~~%H~~~~%s" | sort | uniq > chromium_commits_201606_201705.csv. Then remove special CSV characters with the VI commands :%s/"//g and :%s/,//g, add a CSV header row manually ("email,name,hash,subject") and move the file to cncf/velocity: mv chromium_commits_201606_201705.csv ~/dev/cncf/velocity/data/data_chromium_commits_201606_201705.csv. Finally, replace '~~~~' with ',' to create correct CSV: :%s/\~\~\~\~/,/g. Then run ruby commits_analysis.rb data/data_chromium_commits_201606_201705.csv map/skip_commits.csv, and eventually/optionally add new commit-skipping rules to map/skip_commits.csv. The tool will say something like "After filtering: authors: 1637, commits: 67180"; update data/data_chromium_bugtracker_201606_201705.csv accordingly.

  • Final line should be ruby update_projects.rb projects/unlimited_both.csv data/data_chromium_bugtracker_201606_201705.csv -1

  • openSUSE case

  • BigQuery part exactly the same as Apache or OpenStack (just replace with word opensuse): ruby merger.rb data/unlimited.csv data/data_opensuse_201606_201705.csv

  • AGL (Automotive Grade Linux) case:

  • Go to: https://wiki.automotivelinux.org/agl-distro/source-code and get source code somewhere:

  • mkdir agl; cd agl

  • curl https://storage.googleapis.com/git-repo-downloads/repo > repo; chmod +x ./repo

  • ./repo init -u https://gerrit.automotivelinux.org/gerrit/AGL/AGL-repo; ./repo sync

  • Now you need to use the script agl/run_multirepo.sh, which uses cncf/gitdm to generate the statistics.

  • An agl.txt file will be generated, something like this:

Processed 67124 csets from 1155 developers
52 employers found
A total of 13431516 lines added, 12197416 removed, 24809064 changed (delta 1234100)
  • This gives the number of authors (1155) and commits (67124) - but for all time
  • To get data for a specific date range: cd agl; DTFROM="2016-10-01" DTTO="2017-10-01" ./run_multirepo_range.sh ==> agl.txt:
Processed 7152 csets from 365 developers
  • LibreOffice case (see also res/libreoffice_git_repo.txt): count bugs with the count_issues.rb tool:
ruby count_issues.rb data/data_libreoffice_bugs.csv Opened 2016-06-01 2017-06-01
Counting issues in 'data/data_libreoffice_bugs.csv', issue date column is 'Opened', range: 2016-06-01T00:00:00+00:00 - 2017-06-01T00:00:00+00:00
Found 7223 matching issues.

Update data/data_libreoffice_git_201606_201705.csv accordingly.

  • Final line should be: ruby update_projects.rb projects/unlimited_both.csv data/data_libreoffice_git_201606_201705.csv -1

  • Now let's examine a new case: FreeBSD:

  • Use BigQuery/org_finder.sql (with the condition '%freebsd%') to find FreeBSD orgs. Check all of them on GitHub and create the final BigQuery query:

  • cp BigQuery/query_apache_projects.sql BigQuery/query_freebsd_projects.sql and update conditions, run query, download results, put them in data/data_freebsd_201606_201705.csv (save as table, export to gstorage, download csv)

  • Now define FreeBSD project the same way as in BigQuery: put orgs in map/defmaps.csv, put URL in map/urls.csv, put orgs as exceptions in map/ranges.csv and map/ranges_sane.csv (because some values can be 0s due to custom BigQuery)

  • Add FreeBSD processing to shells/unlimited:

echo "Adding/Updating FreeBSD Projects"
ruby merger.rb data/unlimited.csv data/data_freebsd_201606_201705.csv
  • Go to ~/dev/freebsd and clone 3 SVN repos:
svn checkout https://svn.freebsd.org/base/head base
svn checkout https://svn.freebsd.org/doc/head doc
svn checkout https://svn.freebsd.org/ports/head ports
  • Use the cncf/gitdm freebsd_svn.sh script to analyse the FreeBSD SVN repos:
Revisions:    35927
Authors:      335
  • Now rerun shells/unlimited_20160601-20170601.sh and see FreeBSD's rank

  • Run final updated script: shells/unlimited_20160601-20170601.sh to get final results.

  • Finally, ./projects/unlimited.csv is generated. You need to import it into the final Google chart:

  • Select the cell A50. Use File --> Import, then the "Upload" tab, "Select a file from your computer", and choose ./projects/unlimited.csv

  • Then "Import action" --> "Replace data starting at selected cell", and click Import.

  • Voila! Final version will live here: https://docs.google.com/spreadsheets/d/1a2VdKfAI1g9ZyWL09TnJ-snOpi4BC9kaEVmB7IufY7g/edit?usp=sharing

Results:

NOTE: to view those motion charts you'll need Adobe Flash enabled when clicking the links. This works (tested) on Chrome and Safari with Adobe Flash installed and enabled.

For data from files.csv (data/data_YYYYMM.csv), 201601 --> 201703 (15 months): the chart with cumulative data (each month is the sum of this month and the previous months) is here: https://docs.google.com/spreadsheets/d/11qfS97WRwFqNnArRmpQzCZG_omvZRj_y-MNo5oWeULs/edit?usp=sharing. The chart with monthly data (which looks wrong IMHO, due to Google motion chart data interpolation between months) is here: https://docs.google.com/spreadsheets/d/1ZgdIuMxxcyt8fo7xI1rMeFNNx9wx0AxS-2a58NlHtGc/edit?usp=sharing

I suggest playing around with the 1st chart (cumulative sum). It is not able to remember settings, so once you click on the "Chart1" sheet I suggest:

  • Change axis-x and axis-y from Lin (linear) to Log (logarithmic)
  • Choose which column should be used for color: I suggest activity (the default; it shows which project was most active) or a unique color. You can also select from commits, prs+issues and size (size is the square root of the number of authors)
  • Change the playback speed (the control next to play) to slowest
  • Select projects of interest in the Legend (like Kubernetes, or Kubernetes vs dotnet etc.) and check "trails"
  • You can also change what the x and y axes use as data (defaults are x=commits, y=prs+issues) and change the scale type lin/log
  • You can also change which column is used for bubble size (the default is "size", the square root of the number of authors); note that the number of authors = max from all months (distinct authors that contributed activity), which is obviously different from the set of distinct authors over the entire 15-month range

On the top/right, just above the Color drop-down, you will see two additional chart types:

  • Bar chart - this can be very useful
  • Choose a lin or log y-axis scale, then select Kubernetes from the Legend, choose any of the possible y-axis values (activity, commits, PRs+issues, Size) and click play to see how Kubernetes overtakes multiple projects during our period. Finally, there is also a linear chart; take a look at it as well.

CNCF Projects

To generate data for CNCF projects:

  • Run BigQuery/query_cncf_projects.sql in the Google BigQuery Console. It processes about 800 GiB, which costs about $4.
  • Save the output to Google Sheets, then download it as a CSV file and save it as data/data_cncf_projects.csv (File -> Download As -> Comma separated values ...)
  • Process BigQuery output with velocity's analysis tool: ruby analysis.rb data/data_cncf_projects.csv projects/projects_cncf.csv map/hints.csv map/urls.csv map/defmaps.csv or use shells/run_cncf.sh which does the same
  • Import output file projects/projects_cncf.csv as Google chart's data.

There is also a gist here (but the description above is more up to date): https://gist.github.com/lukaszgryglicki/093ced06455a3f14f0e4d25459525207

Links to various charts and videos generated using this project are here: res/links.txt https://www.cncf.io/blog/2017/06/05/30-highest-velocity-open-source-projects/

CNCF Projects split by Kubernetes VS rest

For this case, a new set of map files was created:

  • map/k8s_vs_rest_defmaps.csv - list of orgs found in query
  • map/k8s_vs_rest_urls.csv - definition of k8s vs rest
  • map/k8s_vs_rest_hints.csv - list of repos found in query

Lists of orgs/repos in the map files should contain all values used in any period query.

It should be noted that historically, as CNCF grows, new projects are added. To get data for 2016, a query similar to the one in BigQuery/query_cncf_4p_201511_201610.sql should be run, and the next year would be spanned by BigQuery/query_cncf_projects_201611_201710.sql. To prepare an analysis, a command similar to this should be run:

ruby analysis.rb data/data_cncf_projects_201611_201710.csv projects/projects_cncf_k8s_vs_rest_201611_201710.csv map/k8s_vs_rest_hints.csv map/k8s_vs_rest_urls.csv map/k8s_vs_rest_defmaps.csv map/skip.csv map/ranges_unlimited.csv

How to find bots to be excluded in queries for project data

Two queries were created to be run in Google BigQuery: one for CloudFoundry, one for Chromium. Take a look at query_cloudfoundry_authors_from_to.sql; the result is in data_cloudfoundry_authors_201611_201710.csv. A bot can be spotted visually in the row where the author (GitHub login) is 'coveralls':

activity,comments,prs,commits,issues,author
1246,330,104,700,112,frodenas
1210,1210,0,0,0,coveralls
1164,88,58,979,39,genevievelesperance

The other authors can be validated as human by visiting an address such as https://github.com/frodenas

Another way to identify bots is via a query such as query_chromium_authors_v2_from_to.sql, which lists author names and their commit counts. A results file such as data_chromium_authors_v2_201611_2017_10.csv brings data as follows:

activity,comments,prs,commits,issues,author_name
30583,17349,5997,25,7212,(null)
1549,0,0,1549,0,Matt Gaunt
857,0,0,857,0,Paul Irish
... ... ...
125,0,0,125,0,DevTools Bot

Bots should be excluded from the data queries and from future bot-hunting queries, so as not to duplicate effort.
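
A minimal heuristic sketch for flagging candidate bots in such an authors listing (not the project's actual tooling; the 500-comment threshold is an assumption, and every flagged login should still be checked manually on GitHub as described above):

require 'csv'

CSV.foreach('data_cloudfoundry_authors_201611_201710.csv', headers: true) do |row|
  name = row['author'] || row['author_name']
  next unless name
  comments, commits, prs = %w[comments commits prs].map { |k| row[k].to_i }
  # Flag bot-like names, or accounts whose activity is 100% comments with no code.
  suspicious = name.match?(/bot|coveralls/i) || (comments > 500 && commits.zero? && prs.zero?)
  puts "candidate bot: #{name}" if suspicious
end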
