Track development velocity
'*.sql' files in BigQuery
folder are Google BigQuery queries that produce CSV data files to be put in the data/
directory for processing
analysis.rb
is a tool that processes input files (csv files from BigQuery results) and generates final data for Bubble/Motion Google Sheet Chart.
This tool also uses:
- a "hints" file with an additional mapping: repo name -> project (N repos --> 1 project), so a given project name may appear on many lines
- a "urls" file which defines URLs for the listed projects (1 project --> 1 URL); a separate file is used because otherwise the hints file would have to duplicate the URL for every repo of a project
- a "default" map file which defines non-standard names for projects generated automatically by grouping by org (like aspnet --> ASP.net), or groups multiple orgs and/or repos into a single project. It is the last step of project name mapping.
This tool outputs a data file into the 'projects/' directory.
ruby analysis.rb data/data_yyyymm.csv projects/projects_yyyymm.csv map/hints.csv map/urls.csv map/defmaps.csv skip.csv ranges.csv
The Top 30 open source projects process is described in the "Most Up to date process" section.
The CNCF projects process is described in the "CNCF Projects" section.
input.csv
data/data_yyyymm.csv from BigQuery, like the following:
org,repo,activity,comments,prs,commits,issues,authors
kubernetes,kubernetes/kubernetes,11243,9878,720,70,575,40
ethereum,ethereum/go-ethereum,10701,570,109,43,9979,14
...
output.csv
to be imported into Google Sheets (File -> Import); the chart is then created from this data. It looks like this:
org,repo,activity,comments,prs,commits,issues,authors,project,url
dotnet,corefx+coreclr+roslyn+cli+docs+core-setup+corefxlab+roslyn-project-system+sdk+corert+eShopOnContainers+core+buildtools,20586,14964,1956,1906,1760,418,dotnet,microsoft.com/net
kubernetes+kubernetes-incubator,kubernetes+kubernetes.github.io+test-infra+ingress+charts+service-catalog+helm+minikube+dashboard+bootkube+kargo+kube-aws+community+heapster,20249,15735,2013,1323,1178,423,Kubernetes,kubernetes.io
...
hints.csv
a csv file with hints for repo --> project mapping, it has this format:
repo,project
Microsoft/TypeScript,Microsoft TypeScript
...
urls.csv
a csv file with project --> url mapping with the following format:
project,url
Angular,angular.io
...
defmaps.csv
a csv file with proper names for projects generated by the default grouping within an org:
name,project
aspnet,ASP.net
nixpkgs,NixOS
Azure,=SKIP
...
The special flag '=SKIP' for a project means that this org should NOT be grouped
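The three mapping files can be read as a lookup chain. The following is a minimal sketch, not the actual analysis.rb code: it assumes hints are checked first, then the default grouping by org, with defmaps applied last. The helper name project_for and the repo names aspnet/Mvc and Azure/azure-sdk are hypothetical, and the sketch only looks up defmaps by org.

```ruby
# Hypothetical helper: resolve the project name for one repo.
# hints:   { "org/repo" => "Project" }           (rows of hints.csv)
# defmaps: { "name" => "Project" or "=SKIP" }    (rows of defmaps.csv)
def project_for(org, repo, hints, defmaps)
  # An explicit repo hint wins.
  return hints[repo] if hints.key?(repo)

  mapped = defmaps[org]
  # '=SKIP' means: do not group this org, keep the repo on its own.
  return repo if mapped == '=SKIP'

  # Otherwise use the proper name if one is defined, else group by org.
  mapped || org
end

hints = { 'Microsoft/TypeScript' => 'Microsoft TypeScript' }
defmaps = { 'aspnet' => 'ASP.net', 'Azure' => '=SKIP' }

puts project_for('Microsoft', 'Microsoft/TypeScript', hints, defmaps)
puts project_for('aspnet', 'aspnet/Mvc', hints, defmaps)
puts project_for('Azure', 'Azure/azure-sdk', hints, defmaps)
puts project_for('kubernetes', 'kubernetes/kubernetes', hints, defmaps)
```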
skip.csv
a csv file that contains lists of repos and/or orgs and/or projects to be skipped in the analysis:
org,repo,project
"enkidevs,csu2017sp314,thoughtbot,illacceptanything,RubySteps,RainbowEngineer",Microsoft/techcasestudies,"Apache (other),OpenStack (other)"
"2015firstcmsc100,swcarpentry,exercism,neveragaindottech,ituring","mozilla/learning.mozilla.org,Microsoft/HolographicAcademy,w3c/aria-practices,w3c/csswg-test",
"orgX,orgY","org1/repo1,org2/repo2","project1,project2"
ranges.csv
a csv file that contains the ranges of repo properties required for a repo to be included in the calculations.
It can constrain any of "commits, prs, comments, issues, authors" to be within the range n1 .. n2 (if n1 or n2 < 0 then that bound is skipped, so -1..-1 means unlimited).
There can also be exception repos/orgs that do not use those ranges:
key,min,max,exceptions
activity,50,-1,"kubernetes,docker/containerd,coreos/rkt"
comments,20,100000,"kubernetes,docker/containerd,coreos/rkt"
prs,10,-1,"kubernetes,docker/containerd,coreos/rkt"
commits,10,-1,"kubernetes,kubernetes-incubator"
issues,10,-1,"kubernetes,docker/containerd,coreos/rkt"
authors,3,-1,"kubernetes,docker/containerd,google/go-github"
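A minimal sketch of how such a range check could work. This is an illustration, not the analysis.rb implementation: the Limit struct, the helper names, the inclusive bounds, and the sample repos someorg/tiny and kubernetes/tiny are all assumptions.

```ruby
# Hypothetical representation of one ranges.csv row.
Limit = Struct.new(:min, :max, :exceptions)

# A negative bound is skipped, so -1..-1 means unlimited.
def within?(value, limit)
  return false if limit.min >= 0 && value < limit.min
  return false if limit.max >= 0 && value > limit.max
  true
end

# A row is kept when every key is in range, or the row's org/repo
# is listed as an exception for that key.
def included?(row, limits)
  limits.all? do |key, limit|
    exempt = limit.exceptions.include?(row[:org]) ||
             limit.exceptions.include?(row[:repo])
    exempt || within?(row[key], limit)
  end
end

limits = {
  commits: Limit.new(10, -1, %w[kubernetes kubernetes-incubator]),
  authors: Limit.new(3, -1, %w[kubernetes docker/containerd google/go-github])
}

small  = { org: 'someorg', repo: 'someorg/tiny', commits: 4, authors: 1 }
exempt = { org: 'kubernetes', repo: 'kubernetes/tiny', commits: 4, authors: 1 }

puts included?(small, limits)
puts included?(exempt, limits)
```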
The generated output file contains all the input data (so it can be 600 rows for 1000 input rows, for example). You should manually review the generated output and choose how many records you need.
hintgen.rb
is a tool that takes data already processed for the various charts and creates a distinct-projects hint file from it:
hintgen.rb data.csv map/hints.csv
Run it multiple times with a different input file (1st parameter: data.csv) to generate the final hints.csv.
Data files existing in the repository:
- data/data_YYYYMM.csv --> data for given YYYYMM from BigQuery.
- projects/projects_YYYYMM.csv --> data generated by analysis.rb based on data_YYYYMM.csv using map/: hints.csv, urls.csv, defmaps.csv
generate_motion.rb
a tool that merges data from multiple files into one to be used for motion chart. Usage:
ruby generate_motion.rb projects/files.csv motion/motion.csv motion/motion_sums.csv [projects/summaries.csv]
File files.csv
contains a list of data files to be merged, it has the following format:
name,label
projects/projects_201601.csv,01/2016
projects/projects_201602.csv,02/2016
...
This tool generates 2 output files:
- 1st is a motion data from each file with a given label
- 2nd is cumulative sum of data, so 1st label contains data from 1st label, 2nd contains 1st+2nd, 3rd=1st+2nd+3rd ... last = sum of all data. Labels are summed-up in alphabetical order. When input data is divided by months, "YYYYMM" or "YYYY-MM" format must be used to receive correct results. "MM/YYYY" will, for example, swap "2/2016" and "1/2017"
Output formats of 1st and 2nd files are identical.
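The effect of label ordering on the cumulative file can be illustrated with a small sketch. It is not the generate_motion.rb code; the helper name cumulative and the sample values are assumptions, and only the alphabetical ordering rule comes from the description above.

```ruby
# Hypothetical helper: running totals over labels in alphabetical order.
def cumulative(by_label)
  running = 0
  by_label.keys.sort.map do |label|
    running += by_label[label]
    [label, running]
  end
end

iso = { '2017-01' => 5, '2016-02' => 10, '2016-01' => 1 }
us  = { '01/2017' => 5, '02/2016' => 10, '01/2016' => 1 }

# "YYYY-MM" labels sort chronologically.
p cumulative(iso)
# "MM/YYYY" labels do not: "01/2017" sorts before "02/2016".
p cumulative(us)
```

With the "YYYY-MM" labels the running totals follow the calendar, while the "MM/YYYY" labels place January 2017 before February 2016.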
In files.csv, the first column is a data file generated by analysis.rb and the second column is a label that will be used as "time" in the Google Sheets motion chart.
Output is in this format:
project,url,label,activity,comments,prs,commits,issues,authors,sum_activity,sum_comments,sum_prs,sum_commits,sum_issues,sum_authors
Kubernetes,kubernetes.io,2016-01,6289,5211,548,199,331,73,174254,136104,18264,8388,11498,373
Kubernetes,kubernetes.io,2016-02,13021,10620,1180,360,861,73,174254,136104,18264,8388,11498,373
...
Kubernetes,kubernetes.io,2017-04,174254,136104,18264,8388,11498,373,174254,136104,18264,8388,11498,373
dotnet,microsoft.com/net,2016-01,8190,5933,779,760,718,158,158624,111553,17019,17221,12831,382
dotnet,microsoft.com/net,2016-02,17975,12876,1652,1908,1539,172,158624,111553,17019,17221,12831,382
...
dotnet,microsoft.com/net,2017-04,158624,111553,17019,17221,12831,382,158624,111553,17019,17221,12831,382
VS Code,code.visualstudio.com,2016-01,7526,5278,381,804,1063,112,155621,104386,9501,17650,24084,198
VS Code,code.visualstudio.com,2016-02,17139,11638,986,1899,2616,133,155621,104386,9501,17650,24084,198
...
VS Code,code.visualstudio.com,2017-04,155621,104386,9501,17650,24084,198,155621,104386,9501,17650,24084,198
...
Each row contains its label data (separate or cumulative), whereas columns starting with max_ contain cumulative data for all labels.
This is to make the data ready for google sheet motion chart without complex cell indexing.
The final (optional) file summaries.csv is used to read the number of authors, because the number of authors is computed differently from the other metrics.
Without the summaries file (or if a given project is not in the summaries file), we only have the number of distinct authors in each period. Summary value is a sum of all periods max.
This is obviously not a real count of all distinct authors in all periods. The real number of authors is used only if another file is supplied, one which contains summary data for a longer period that is equal to the sum of all periods.
To manually add other projects (like Linux) use add_linux.sh or create similar tools for other projects. Data for this tool was generated manually using a custom gitdm tool (github cncf/gitdm) on the torvalds/linux repo and by manually counting email addresses in different periods on LKML.
Example usage (assuming Linux additional data in data/data_linux.csv), could be:
ruby add_linux.rb data/data_201603.csv data/data_linux.csv 2016-03-01 2016-04-01
A larger scope (e.g. GitHub data) file can be injected with such custom script results data (from Gitlab or Linux or External) by the merger script:
ruby merger.rb file_to_merge.csv file_to_get_data_from.csv
See for example ./shells/top30_201605_201704.sh
Every merge will compound data into the merger file.
This means removing some filtering out of BigQuery and letting Ruby tools perform the task instead.
To process "unlimited" data from BigQuery output (file data/unlimited.csv), use shells/unlimited.sh or shells/unlimited_both.sh.
Unlimited means that BigQuery is not constraining repositories by having commits, comments, issues, PRs, authors > N (this N is 5-50 depending on which metric: authors for example is 5 while comments is 50).
Unlimited only requires that authors, comments, commits, prs, issues are all > 0.
After that, only map/ranges_unlimited.csv is used to further constrain the data. This basically moves filtering out of BigQuery (so it can be called once) and into the Ruby tool.
shells/unlimited_both.sh uses a map/ranges_unlimited.csv that does not set ANY limit:
key,min,max,exceptions
activity,-1,-1,
comments,-1,-1,
prs,-1,-1,
commits,-1,-1,
issues,-1,-1,
authors,-1,-1,
It means that the mapping must have an extremely long list of projects from repos/orgs to get valid, non-obfuscated data.
You can skip a large number of an organization's small repos (when they are distinct projects rather than parts of just a few projects) with:
rauth[res[res.map { |i| i[0] }.index('Google')][0]].select { |i| i.split(',')[1].to_i < 14 }.map { |i| i.split(',')[0] }.join(',')
The following is an example based on Google.
Say the 100th project in the Top 100 has 290 authors.
All tiny Google repos (distinct small projects) would sum up and make Google 15th overall (for example).
The above command outputs the list of Google repos with 13 authors or fewer. You can put the results in map/skip.csv to avoid a false-positive top 15 entry for Google overall.
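The selection step of that one-liner can be tried in isolation. This is a sketch on hypothetical sample data: the repo names and author counts are made up, and only the "repo,authors" string shape is taken from the one-liner above.

```ruby
# Hypothetical sample entries in "repo,authors" form.
repo_authors = [
  'google/big-project,350',
  'google/tiny-a,13',
  'google/tiny-b,2',
  'google/mid,14'
]

# Hypothetical helper: names of repos with at most max_authors authors.
def small_repos(entries, max_authors)
  entries
    .select { |e| e.split(',')[1].to_i <= max_authors }
    .map { |e| e.split(',')[0] }
end

# Comma-joined list, ready to paste into the repo column of map/skip.csv.
puts small_repos(repo_authors, 13).join(',')
```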
There is also a tool to add data for external projects (not hosted on GitHub): add_external.rb.
It is used by shells/unlimited.sh and shells/unlimited_both.sh
Example call:
ruby add_external.rb data/unlimited.csv data/data_gitlab.csv 2016-05-01 2017-05-01 gitlab gitlab/GitLab
It requires a csv file with external repo data.
It must be defined per date range.
It has this format (see data/data_gitlab.csv
for example):
org,repo,from,to,activity,comments,prs,commits,issues,authors
gitlab,gitlab/GitLab,2016-05-01,2017-05-01,40000,40000,11595,9479,22821,1500
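Because the external data is defined per date range, a lookup has to match both dates exactly. The sketch below shows one way such a lookup could be written; it is not the add_external.rb code, and the helper name find_range and the HEADER constant are assumptions.

```ruby
require 'date'

HEADER = %w[org repo from to activity comments prs commits issues authors].freeze

# Hypothetical helper: find the row for a repo and an exact date range.
def find_range(lines, repo, from, to)
  lines
    .map { |line| HEADER.zip(line.strip.split(',')).to_h }
    .find do |row|
      row['repo'] == repo &&
        Date.parse(row['from']) == from &&
        Date.parse(row['to']) == to
    end
end

lines = ['gitlab,gitlab/GitLab,2016-05-01,2017-05-01,40000,40000,11595,9479,22821,1500']

row = find_range(lines, 'gitlab/GitLab', Date.new(2016, 5, 1), Date.new(2017, 5, 1))
puts row ? row['authors'] : 'Data range not found'

missing = find_range(lines, 'gitlab/GitLab', Date.new(2016, 6, 1), Date.new(2017, 6, 1))
puts missing ? missing['authors'] : 'Data range not found'
```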
There is also a tool to update generated projects file which in turn is used to import data for charts.
update_projects.rb
Listed in shells/unlimited_both.sh
It is used to update certain values in given projects
It processes an input file with the following format:
project,key,value
Apache Mesos,issues,7581
Apache Spark,issues,5465
Apache Kafka,issues,1496
Apache Camel,issues,1284
Apache Flink,issues,2566
Apache (other),issues,52578
This allows updating specific keys in specific projects with data taken from sources other than GitHub. It is currently being used to update github data with issues statistics from jira (for apache projects).
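A minimal sketch of applying such project,key,value rows to already generated project rows. It is not the update_projects.rb code: the helper name apply_updates, the in-memory row shape, and the starting values for Apache Mesos are assumptions.

```ruby
# Hypothetical helper: overwrite single keys in matching project rows.
def apply_updates(projects, updates)
  updates.each do |project, key, value|
    row = projects.find { |r| r['project'] == project }
    next unless row # projects missing from the file are left alone

    row[key] = value
  end
  projects
end

projects = [
  { 'project' => 'Apache Mesos', 'issues' => 12, 'prs' => 300 },
  { 'project' => 'Kubernetes', 'issues' => 1178, 'prs' => 2013 }
]
updates = [
  ['Apache Mesos', 'issues', 7581],
  ['Apache Spark', 'issues', 5465]
]

apply_updates(projects, updates)
projects.each { |r| puts r.values.join(',') }
```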
Tool to create ranks per project (for all of a project's numeric properties): report_projects_ranks.rb & shells/report_cncf_project_ranks.sh
The shell script reads projects from projects/unlimited_both.csv and uses the reports/cncf_projects_config.csv file to get the list of projects that need to be included in the rank statistics.
File format is:
project
project1
project2
...
projectN
It outputs a rank statistics file: reports/cncf_projects_ranks.txt
For special cases, see ./shells/unlimited_both.sh, which calls all scripts in the correct order.
Some details about adding external data from non-GitHub projects:
-
How to find Apache issues in Jira:
res/data_apache_jira.query
-
Case with Chromium (details here: res/data_chromium_bugtracker.txt): issues come from their bug tracker, the number of authors and commits in the date range from git log one-liners.
These must be called in a Git repo cloned from GoogleSource (not from GitHub):
git clone https://chromium.googlesource.com/chromium/src
Commits (gives 77437):
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%H" | sort | uniq | wc -l
Authors (gives 1663):
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l
To analyze those commits (for example to exclude merge and robot commits), generate data/data_chromium_commits.csv. Run while in the chromium/src repository:
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE~~~~%aN~~~~%H~~~~%s" | sort | uniq > chromium_commits.csv
Then remove special csv characters with vi commands:
:%s/"//g
:%s/,//g
Then add a csv header row manually, "email,name,hash,subject", and move the file to data/data_chromium_commits.csv
Finally replace '~~~~' with ',' to create a correct csv:
:%s/\~\~\~\~/,/g
Then run:
ruby commits_analysis.rb data/data_chromium_commits.csv map/skip_commits.csv
or:
./shells/chromium_commits_analysis.sh
-
Case with OpenStack:
res/data_openstack_lanuchpad.query
- data from their launchpad -
Case with WebKit:
res/data_webkit_links.txt
issues from their bug tracker: https://webkit.org/reporting-bugs/
For authors and commits, 3 different tools were tried: our cncf/gitdm on their webkit/WebKit GitHub repo, git one-liners on the same repo, and SVN one-liners on their original SVN repo.
Git one-liners (after git clone git://git.webkit.org/WebKit.git WebKit):
Authors (by author email): 121
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l
Authors (by committer email): 121
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%cE" | sort | uniq | wc -l
Commits: 13051
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%H" | sort | uniq | wc -l
Our cncf/gitdm output files are also stored here: res/webkit/: WebKit_2016-05-01_2017-05-01.csv, WebKit_2016-05-01_2017-05-01.txt
Also tried SVN one-liners on their original SVN repo (because the GitHub repo is only a mirror):
To fetch SVN repo:
svn checkout https://svn.webkit.org/repository/webkit/trunk WebKit
or:
tar jxvf WebKit-SVN-source.tar.bz2
cd webkit
svn switch --relocate http://svn.webkit.org/repository/webkit/trunk https://svn.webkit.org/repository/webkit/trunk
Finally run their script: update-webkit
Number of commits:
svn log -q -r {2016-05-01}:{2017-05-01} | sed '/^-/ d' | cut -f 1 -d "|" | sort | uniq | wc -l
Number of authors:
svn log -q -r {2016-05-01}:{2017-05-01} | sed '/^-/ d' | cut -f 2 -d "|" | sort | uniq | wc -l
To get the data from SVN:
Revisions:
svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 1 -d "|"
Authors:
svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 2 -d "|"
Dates:
svn log -q -r {2017-05-25}:{2017-05-26} | sed '/^-/ d' | cut -f 3 -d "|"
- GitLab estimation and details here:
res/gitlab_estims.txt
- LibreOffice case: see
res/libreoffice_git_repo.txt
To add a new non-standard project (but from GitHub mirrors, which can have 0s in comments, commits, issues, prs, activity, authors) follow this route:
- Copy BigQuery/org_finder.sql to the clipboard and run it on BigQuery, replacing the condition for org (for example lower(org.login) like '%your%org%')
- Examine the output org/repos combinations (manually on GitHub) and decide on the final condition for the final BigQuery run
- Copy BigQuery/query_apache_projects.sql into some BigQuery/query_your_project.sql, then update the conditions to those found in the previous step
- Run the query
- Save the results to a table. Export this table to GStorage. Download this table as CSV from GStorage into data/data_your_project_datefrom_date_to.csv
- Add this to shells/unlimited_both.sh:
echo "Adding/Updating YourProject case"
ruby merger.rb data/unlimited.csv data/data_your_project_datefrom_date_to.csv
- Update map/range*.csv: add an exception for YourProject (because it can have 0s now; this is output from BigQuery without numeric conditions)
- Run shells/unlimited_both.sh and examine YourProject (a few iterations may be needed to add the correct mapping in ./map/: hints, defmaps, urls etc.)
- You can run manually:
ruby analysis.rb data/unlimited.csv projects/unlimited_both.csv map/hints.csv map/urls.csv map/defmaps.csv map/skip.csv map/ranges_sane.csv
- For example, see the YourProject rank:
res.map { |i| i[0] }.index('LibreOffice')
or
res[res.map { |i| i[0] }.index('LibreOffice')][2][:sum]
- Some of the values will be missing (for example PRs for mirror repos)
- Now it is time for the non-standard path: see shells/unlimited_both.sh for the non-standard data update that comes after the final ruby analysis.rb call. This is usually different for each non-standard project
To generate all data for the Top 30 chart: https://docs.google.com/spreadsheets/d/1hD-hXlVT60AGhGVifNn7nNo9oVMKnIoQ2kBNmx-YY8M/edit?usp=sharing
- Fetch all necessary data using BigQuery or use data already fetched present in this repo.
- If fetched new BigQuery data then re-run the special projects BigQuery analysis scripts: ./shells/: run_apache.sh, run_chrome_chromium.sh, run_cncf.sh, run_openstack.sh
- To just regenerate all other data, run ./shells/unlimited_both.sh
- See per-project rank statistics: reports/cncf_projects_ranks.txt
- Get the final output file projects/unlimited.csv and import it at the A50 cell in the https://docs.google.com/spreadsheets/d/1hD-hXlVT60AGhGVifNn7nNo9oVMKnIoQ2kBNmx-YY8M/edit?usp=sharing chart
We already have shells/unlimited_both.sh
that generates our chart for 2016-05-01 to 2017-05-01. We want to generate the chart for a new date range: 2016-06-01 to 2017-06-01.
This is a step by step tutorial on how to do it.
- Copy shells/unlimited_both.sh to shells/unlimited_20160601-20170601.sh
- Keep shells/unlimited_20160601-20170601.sh open in another terminal window (vi shells/unlimited_20160601-20170601.sh); we need to update all of its steps
- First we need unlimited BigQuery output for the new date range:
echo "Restoring BigQuery output"
cp data/unlimited_output_201605_201704.csv data/unlimited.csv
- We need the data/unlimited_output_201606_201705.csv file. To generate it, we need to run BigQuery for the new date range.
- Open the sql file that generated the current range's data: vi BigQuery/query_201605_201704_unlimited.sql
- Save it as BigQuery/query_201606_201705_unlimited.sql after changing the date ranges in the SQL.
- Copy it to the clipboard (pbcopy < BigQuery/query_201606_201705_unlimited.sql) and run it in Google BigQuery: https://bigquery.cloud.google.com/queries/<<your_google_project_name>>
- Save the result to a table <<your_google_user_name>>:unlimited_201606_201705 using "Save as table"; it takes about 1TB and costs about $5
- Open the table <<your_google_user_name>>:unlimited_201606_201705 and click "Export Table" to export it to Google storage as gs://<<your_google_user_name>>/unlimited_201606_201705.csv (you may click "View files" to see the files in your gstorage)
- Go to Google storage, download <<your_google_user_name>>/unlimited_201606_201705.csv and put it where shells/unlimited_20160601-20170601.sh expects it (update the file name to data/unlimited_output_201606_201705.csv):
echo "Restoring BigQuery output"
cp data/unlimited_output_201606_201705.csv data/unlimited.csv
- So we have the main data (step 1) ready for the new chart. Now we need to get data for all non-standard projects. You can try our analysis tool without any special projects by running:
ruby analysis.rb data/unlimited.csv projects/unlimited_both.csv map/hints.csv map/urls.csv map/defmaps.csv map/skip.csv map/ranges_sane.csv
- There can be some new projects that are unknown, and ranks can change during this step, so manual changes may be needed to the mappings in the map/ directory: hints.csv, defmaps.csv and urls.csv. Possibly also in skip.csv (if there are new projects that should be skipped)
- This is what came out on the 1st run:
Project #23 (org, 457) skillcrush (skillcrush) (skillcrush-104) have no URL defined
Project #45 (org, 366) pivotal-cf (pivotal-cf) (...) have no URL defined
Project #50 (org, 353) Automattic (Automattic) (...) have no URL defined
- Let's see the top repos by number of authors for those projects that were not found:
rauth[res[res.map { |i| i[0] }.index('Automattic')][0]]
- Then we must add entries to map/hints.csv for a few of the top ones, say those with >= 20 authors:
Automattic/amp-wp,31
Automattic/wp-super-cache,29
Automattic/simplenote-electron,22
Automattic/happychat-service,21
Automattic/kue,20
We need to examine each one on github.com, e.g. for the 1st project: github.com/Automattic/amp-wp. We see that this is a WordPress plugin, so it belongs to the WordPress/WP Calypso project:
grep -HIn "wordpress" map/*.csv
grep -HIn "WP Calypso" map/*.csv
We see that we have WP Calypso defined in the hints file:
map/hints.csv:23:Automattic/WP-Job-Manager,WP Calypso
map/hints.csv:24:Automattic/facebook-instant-articles-wp,WP Calypso
map/hints.csv:26:Automattic/sensei,WP Calypso
map/hints.csv:29:Automattic/wp-calypso,WP Calypso
map/hints.csv:30:Automattic/wp-e2e-tests,WP Calypso
map/urls.csv:438:WP Calypso,developer.wordpress.com/calypso
Just add a new repo mapping row for this project (in map/hints.csv): Automattic/amp-wp,WP Calypso
Do the same for the other projects/repos. Re-run the analysis tool until all is fine.
-
For example, after defining some new projects we see "EPFL-SV-cpp-projects" in the top 50. This is an educational org that should be skipped. Add a row for it to map/skip.csv: EPFL-SV-cpp-projects,,
-
Once you have all URLs defined and the new mappings added, you can see a preview of the top projects while stopped in binding.pry by typing all. Now we need to go back to shells/unlimited_20160601-20170601.sh and regenerate all non-standard data (for projects not on GitHub or requiring special queries on GitHub, for example because of having 0 activity, comments, commits, issues, prs or authors)
-
Now the Linux case: we need to change this line
ruby add_linux.rb data/unlimited.csv data/data_linux.csv 2016-05-01 2017-05-01
into
ruby add_linux.rb data/unlimited.csv data/data_linux.csv 2016-06-01 2017-06-01
and run it
-
You will see:
Data range not found in data/data_linux.csv: 2016-06-01 - 2017-06-01
That means you need to add a new data range for Linux to the file data/data_linux.csv
-
Data for Linux is here:
https://docs.google.com/spreadsheets/d/1CsdreHox8ev89WoP6LjcryroKDOH2gQipMC9oS95Zhc/edit?usp=sharing
but it doesn't have May 2017 (which finished yesterday), so we need the last month's data.
-
Go to:
https://lkml.org/lkml/2017
and copy the May 2017 value (22110) into the linked Google spreadsheet.
-
Add a row for May 2017 to data/data_linux.csv:
torvalds,torvalds/linux,2017-05-01,2017-06-01,0,0,0,0,22110
- You will see that for now we only have the "emails" column. The other columns must be fetched from the Linux kernel repo using the cncf/gitdm analysis.
-
You can also sum up the issues from the sheet to get the value for 2016-06-01 - 2017-06-01 (254893):
torvalds,torvalds/linux,2016-06-01,2017-06-01,0,0,0,0,254893
-
Now run cncf/gitdm on the Linux kernel repo. First update the repo: cd ~/dev/linux && git checkout master && git reset --hard && git pull. An alternative (if you don't have the Linux repo cloned) is: cd ~/dev/, git clone https://github.com/torvalds/linux.git.
-
Go to cncf/gitdm (cd ~/dev/cncf/gitdm) and run: ./linux_range.sh 2017-05-01 2017-06-01
-
While in cncf/gitdm, see vim linux_stats/range_2017-05-01_2017-06-01.txt:
Processed 1219 csets from 424 developers
34 employers found
A total of 24970 lines added, 14469 removed (delta 10501)
- You have the values for changesets,additions,removals,authors here; update cncf/velocity/data/data_linux.csv accordingly.
- Do the same for ./linux_range.sh 2016-06-01 2017-06-01 and linux_stats/range_2016-06-01_2017-06-01.txt. Results:
Processed 64482 csets from 3803 developers
91 employers found
A total of 3790914 lines added, 1522111 removed (delta 2268803)
- Final linux rows (one for May 2017, another for last year including May 2017) are:
torvalds,torvalds/linux,2017-05-01,2017-06-01,1219,24970,14469,424,22110
torvalds,torvalds/linux,2016-06-01,2017-06-01,64482,3790914,1522111,3803,254893
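Copying the numbers by hand is error-prone, so the row can also be assembled from the gitdm text. The following is only a sketch: the helper name linux_row is hypothetical, and the column order (changesets, additions, removals, authors, emails) is inferred from the rows shown above.

```ruby
# Hypothetical helper: build a data_linux.csv row from gitdm range output.
def linux_row(gitdm_text, from, to, emails)
  csets, authors =
    gitdm_text.match(/Processed (\d+) csets from (\d+) developers/).captures
  added, removed =
    gitdm_text.match(/A total of (\d+) lines added, (\d+) removed/).captures

  ['torvalds', 'torvalds/linux', from, to,
   csets, added, removed, authors, emails].join(',')
end

text = <<~TXT
  Processed 1219 csets from 424 developers
  34 employers found
  A total of 24970 lines added, 14469 removed (delta 10501)
TXT

puts linux_row(text, '2017-05-01', '2017-06-01', 22110)
```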
- GitLab case: Their repo is https://gitlab.com/gitlab-org/gitlab-ce/, clone it via git clone https://gitlab.com/gitlab-org/gitlab-ce.git in the ~/dev/ directory.
- Their repo hosted by GitHub is https://github.com/gitlabhq/gitlabhq, clone it via git clone https://github.com/gitlabhq/gitlabhq.git in the ~/dev/ directory.
- Go to cncf/gitdm and run the GitLab repo analysis: ./repo_in_range.sh ~/dev/gitlab-ce/ gitlab 2016-06-01 2017-06-01
- Results are output to other_repos/gitlab_2016-06-01_2017-06-01.txt:
Processed 16574 csets from 513 developers
15 employers found
A total of 926818 lines added, 548205 removed (delta 378613)
-
Their bug tracker is https://gitlab.com/gitlab-org/gitlab-ce/issues; sort by "Last created" and count the issues in the given date range. There are 732 pages of issues (20 per page) = 14640 issues (https://gitlab.com/gitlab-org/gitlab-ce/issues?page=732&scope=all&sort=created_desc&state=all)
-
To count Merge Requests (PRs):
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests?scope=all&state=all
Merge Requests: 371.5 pages * 20 = 7430
-
To count authors for 2016-06-01 to 2017-06-01, run in the gitlab-ce directory:
git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE" | sort | uniq | wc -l
--> 575
-
To count authors for 2016-05-01 to 2017-05-01, run in the gitlab-ce directory:
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l
--> 589
-
Cloud Foundry case:
-
Copy BigQuery/query_cloudfoundry_201605_201704.sql to BigQuery/query_cloudfoundry_201606_201705.sql and update the conditions. Then run the query in the BigQuery console (see details at the beginning of this example)
-
Finally, you will have data/data_cloudfoundry_201606_201705.csv (run query, save results to table, export table to gstorage, download csv from gstorage).
-
Update (and, if needed, manually run) the CF case (in shells/unlimited_20160601-20170601.sh): ruby merger.rb data/unlimited.csv data/data_cloudfoundry_201606_201705.csv force
-
CNCF Projects case
-
We have a line
ruby merger.rb data/unlimited.csv data/data_cncf_projects.csv
which needs to be changed to
ruby merger.rb data/unlimited.csv data/data_cncf_projects_201606_201705.csv
-
Copy:
cp BigQuery/query_cncf_projects.sql BigQuery/query_cncf_projects_201606_201705.sql
and update the conditions in BigQuery/query_cncf_projects_201606_201705.sql
-
Run on BigQuery and do the same as in the CF case. The final output file will be:
data/data_cncf_projects_201606_201705.csv
-
Final line should be (try it):
ruby merger.rb data/unlimited.csv data/data_cncf_projects_201606_201705.csv
-
WebKit case
-
Change merger line to
ruby merger.rb data/unlimited.csv data/webkit_201606_201705.csv
-
WebKit has no usable data on GitHub, so running BigQuery is not needed. We no longer need the following lines for WebKit (we will just update the data/webkit_201606_201705.csv file), so remove them from the current shell script shells/unlimited_20160601-20170601.sh:
echo "Updating WebKit project using gitdm and other"
ruby update_projects.rb projects/unlimited_both.csv data/data_webkit_gitdm_and_others.csv -1
- Now we need to generate the values for the data/webkit_201606_201705.csv file:
- Issues: Go to https://webkit.org/reporting-bugs/ and search all bugs in WebKit, ordered by modified desc. The result will be truncated to 10,000:
https://bugs.webkit.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&limit=0&order=changeddate%20DESC%2Cbug_status%2Cpriority%2Cassigned_to%2Cbug_id&product=WebKit&query_format=advanced&resolution=---&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=LATER&resolution=REMIND&resolution=DUPLICATE&resolution=WORKSFORME&resolution=MOVED&resolution=CONFIGURATION%20CHANGED
2016-12-13 --> 2017-06-01 = 9988 issues.
ruby> Date.parse('2017-06-01') - Date.parse('2016-12-13') => (170/1), (9988.0 * 365.0/170.0) --> 21444 issues
In other words, see how many days it takes to reach 10k and extrapolate to 365 days (1 year): gives 22k bugs/issues
- Commits, Authors: cd ~/dev/ && git clone git://git.webkit.org/WebKit.git WebKit
- Some git one-liner stats:
All authors & commits:
git log --pretty=format:"%aE" | sort | uniq | wc -l
--> 648
git log --pretty=format:"%H" | sort | uniq | wc -l
--> 189693
And for our date period:
git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE" | sort | uniq | wc -l
--> 125
git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%H" | sort | uniq | wc -l
--> 13348
- Now use cncf/gitdm to analyse commits and authors. From the cncf/gitdm directory run: ./repo_in_range.sh ~/dev/WebKit/ WebKit 2016-06-01 2017-06-01
- See the output: vim other_repos/WebKit_2016-06-01_2017-06-01.txt:
Processed 13337 csets from 125 developers
6 employers found
A total of 11838610 lines added, 3105609 removed (delta 8733001)
-
So we have authors=125, commits=13348
-
Now we need to estimate the remaining: activity, comments, prs:
-
A good idea is to get it from the summary of ALL projects (we have values for ALL keys summed up over all projects from analysis.rb); this is automatically saved by analysis.rb to the reports/sumall.csv file.
-
The record from the last analysis.rb run is: {"activity"=>30714776, "comments"=>12766215, "prs"=>3311370, "commits"=>11687914, "issues"=>3104377}
-
Now the average PRs/issues ratio: sumall['prs'].to_f / sumall['issues'].to_f = 1.07, which (rounded to 1.1) gives PRs = 1.1 * 21444 = 23600 (rounded to hundreds)
-
Comments would be 2 * commits = 26000
-
Activity = sum of all others (comments, commits, issues, prs)
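These estimates can be recomputed from the sumall record. The sketch below follows the steps as described; the rounding choices (ratio to one decimal, PRs to hundreds) are assumptions that reproduce the 23600 figure. Note that 2 * 13348 is 26696 exactly, which the text above rounds to roughly 26000.

```ruby
sumall = {
  'activity' => 30_714_776, 'comments' => 12_766_215, 'prs' => 3_311_370,
  'commits' => 11_687_914, 'issues' => 3_104_377
}

issues  = 21_444 # estimated from the bug tracker
commits = 13_348 # from the git one-liner

# Average PRs per issue over all projects.
ratio = sumall['prs'].to_f / sumall['issues'].to_f

# Assumed rounding: ratio to 1.1, result to the nearest hundred.
prs = (ratio.round(1) * issues).round(-2)
comments = 2 * commits
activity = comments + commits + issues + prs

puts ratio.round(2)
puts prs
puts comments
puts activity
```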
-
OpenStack case:
-
Change line
ruby merger.rb data/unlimited.csv data/data_openstack_201605_201704.csv
to
ruby merger.rb data/unlimited.csv data/data_openstack_201606_201705.csv
-
To get the data/data_openstack_201606_201705.csv file from BigQuery:
-
Copy
cp BigQuery/query_openstack_projects.sql BigQuery/query_openstack_projects_201606_201705.sql
and update the date range condition in BigQuery/query_openstack_projects_201606_201705.sql
-
Copy to clipboard
pbcopy < BigQuery/query_openstack_projects_201606_201705.sql
and run it in BigQuery, Save as Table, export to gstorage, and save the results as data/data_openstack_201606_201705.csv
-
Run
ruby merger.rb data/unlimited.csv data/data_openstack_201606_201705.csv
as a test
-
Now we need to update the data to get the file data/data_openstack_bugs_201606_201705.csv (copy the file from data/data_openstack_bugs.csv)
-
Use their launchpad to get issues info: https://wiki.openstack.org/wiki/Bugs Specifically go to:
When you find a bug, you should file it against the proper OpenStack project using the corresponding link
Click for example "Report a bug in Nova" (https://bugs.launchpad.net/nova/), go to Advanced, select all possible issue statuses, click "Age" to sort descending, and then manually count the issues in the given date range.
Once you have one correct URL, like:
https://bugs.launchpad.net/keystone/+bugs?field.searchtext=&search=Search&field.status%3Alist=NEW&field.status%3Alist=OPINION&field.status%3Alist=INVALID&field.status%3Alist=WONTFIX&field.status%3Alist=EXPIRED&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=FIXRELEASED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&orderby=-datecreated&memo=350&start=75
replace "keystone" with other project names: nova, glance, swift, horizon etc. After each replacement, click "Age" to sort by creation date descending. Note how many issues to discard from the first page (or following pages) as too new. Then adjust the "memo" parameter (at the end of the URL) until you find the page on which the start date falls. Count the issues as memo + the position on that page of the first issue that is out of range - the number of issues from the 1st (or more) pages which come after the end date. Estimate this for all 12 OpenStack projects.
-
The final line should be
ruby update_projects.rb projects/unlimited_both.csv data/data_openstack_bugs_201606_201705.csv -1
-
Apache case:
-
The BigQuery steps are exactly the same as in the OpenStack example. The final line should be
ruby merger.rb data/unlimited.csv data/data_apache_201606_201705.csv
-
cp BigQuery/query_apache_projects.sql BigQuery/query_apache_projects_201606_201705.sql
, update the conditions, run the query in BigQuery, and download the results to data/data_apache_201606_201705.csv
-
Run
ruby merger.rb data/unlimited.csv data/data_apache_201606_201705.csv
-
Now we need more data for Apache from their Jira. First, copy the file from the previous date range:
cp data/data_apache_jira.csv data/data_apache_jira_201606_201705.csv
-
Now go to their Jira at issues.apache.org/jira/browse. You can set conditions to find issues, like this:
project not in (FLINK, MESOS, SPARK, KAFKA, CAMEL, FLINK, CLOUDSTACK, BEAM, ZEPPELIN, CASSANDRA, HIVE, HBASE, HADOOP, IGNITE, NIFI, AMBARI, STORM, "Traffic Server", "Lucene - Core", Solr, CarbonData, GEODE, "Apache Trafodion", Thrift, Kylin) AND created >= 2016-05-01 AND created <= 2017-05-01
Example URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2769?jql=project%20not%20in%20(FLINK%2C%20MESOS%2C%20SPARK%2C%20KAFKA%2C%20CAMEL%2C%20FLINK%2C%20CLOUDSTACK%2C%20BEAM%2C%20ZEPPELIN%2C%20CASSANDRA%2C%20HIVE%2C%20HBASE%2C%20HADOOP%2C%20IGNITE%2C%20NIFI%2C%20AMBARI%2C%20STORM%2C%20%22Traffic%20Server%22%2C%20%22Lucene%20-%20Core%22%2C%20Solr%2C%20CarbonData%2C%20GEODE%2C%20%22Apache%20Trafodion%22%2C%20Thrift%2C%20Kylin)%20AND%20created%20%3E%3D%202016-05-01%20AND%20created%20%3C%3D%202017-05-01
We need: Mesos, Spark, Kafka, Camel, Flink (the query above is for the other projects, so these are not included in it)
Query for Mesos in our data range: project in (Mesos) AND created >= 2016-06-01 AND created <= 2017-06-01
--> 2055
Do this for all projects.
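To avoid retyping the condition for every project, the JQL strings can be generated. The following Ruby sketch only builds the query text in the same shape as the Mesos example above; `jql_for` is a hypothetical helper name, and the counts still have to be read from Jira manually.

```ruby
# Build the JQL condition for one project and a creation-date range,
# mirroring the Mesos query shown above.
def jql_for(project, from, to)
  "project in (#{project}) AND created >= #{from} AND created <= #{to}"
end

# The five projects named above, for the 2016-06-01 .. 2017-06-01 range.
%w[Mesos Spark Kafka Camel Flink].each do |project|
  puts jql_for(project, '2016-06-01', '2017-06-01')
end
```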
-
Final line for Apache should be:
ruby update_projects.rb projects/unlimited_both.csv data/data_apache_jira_201606_201705.csv -1
-
Chromium case
-
The beginning (BigQuery part) is exactly the same as for Apache or OpenStack (just replace the project name with chromium):
ruby merger.rb data/unlimited.csv data/data_chromium_201606_201705.csv
-
Now the manual part - copy
data/data_chromium_bugtracker.csv
to data/data_chromium_bugtracker_201606_201705.csv
(we need to generate this file) -
Get issues from their bug tracker: https://bugs.chromium.org/p/chromium/issues/list?can=1&q=opened%3E2016%2F7%2F25&colspec=ID+Pri+M+Stars+ReleaseBlock+Component+Status+Owner+Summary+OS+Modified&x=m&y=releaseblock&cells=ids
All issues + opened>2016/7/19 gives 63565 (2016/7/18 gives 63822+, which is not an exact number), so we will extrapolate from here.
All issues + opened>2017/6/1 gives 325, so we have 63565 - 325 = 63240 issues in the range 2016-07-19 - 2017-06-01.
irb> require 'date'; Date.parse('2017-06-01') - Date.parse('2016-07-19') --> 317
irb> Date.parse('2017-06-01') - Date.parse('2016-06-01') --> 365
irb> 63240.0 * (365.0 / 317.0) --> 72815
Now add chromedriver too:
All issues, opened>2017/6/1 --> 1
All issues, opened>2016/6/1 --> 430
So there are 429 chromedriver issues and the total is 429 + 72815 = 73244.
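The same extrapolation can be wrapped in a small reusable Ruby helper. This is a sketch with a hypothetical method name; it assumes issues are opened at a roughly constant daily rate and truncates the result, as the irb session above does.

```ruby
require 'date'

# Scale a count observed over a partial window up to the full target window,
# assuming a roughly constant number of new issues per day.
def extrapolate(count, observed_from, observed_to, full_from, full_to)
  observed_days = (Date.parse(observed_to) - Date.parse(observed_from)).to_i
  full_days = (Date.parse(full_to) - Date.parse(full_from)).to_i
  (count * full_days.to_f / observed_days).floor
end

# Numbers taken from the text above.
chromium = extrapolate(63_565 - 325, '2016-07-19', '2017-06-01', '2016-06-01', '2017-06-01')
chromedriver = 430 - 1
puts chromium + chromedriver
```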
-
Now the Chromium commits analysis, which is quite complex
-
Their sources (all projects) are here: https://chromium.googlesource.com
-
Clone
chromium/src
in ~/dev/src/:
git clone https://chromium.googlesource.com/chromium/src
-
Commits:
git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%H" | sort | uniq | wc -l
gives 79144 (this is only FYI; it is far too many because it includes bot commits) -
Authors:
git log --since "2016-06-01" --until "2017-06-01" --pretty=format:"%aE" | sort | uniq | wc -l
gives 1697.
To analyze those commits (and also exclude merge and robot commits), run this while in the chromium/src repository:
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE~~~~%aN~~~~%H~~~~%s" | sort | uniq > chromium_commits_201606_201705.csv
Then remove special CSV characters with these vi commands:
:%s/"//g
:%s/,//g
Then manually add the CSV header "email,name,hash,subject" and move the file to data/data_chromium_commits_201606_201705.csv in cncf/velocity:
mv chromium_commits_201606_201705.csv ~/dev/cncf/velocity/data/data_chromium_commits_201606_201705.csv
Finally, replace '~~~~' with ',' to create a correct CSV:
:%s/\~\~\~\~/,/g
Then run
ruby commits_analysis.rb data/data_chromium_commits_201606_201705.csv map/skip_commits.csv
Optionally, add new rules for skipping commits to map/skip_commits.csv
The tool will print something like "After filtering: authors: 1637, commits: 67180"; update data/data_chromium_bugtracker_201606_201705.csv
accordingly. -
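The vi-based clean-up described above (strip quotes and commas, add the header, turn '~~~~' into ',') could also be scripted. The following Ruby sketch is not part of the repository; `git_log_to_csv` is a hypothetical name and the sample line is invented.

```ruby
# Turn raw `git log` lines (fields joined by '~~~~') into CSV text with the
# "email,name,hash,subject" header described above.
def git_log_to_csv(lines)
  rows = lines.map do |line|
    line.chomp
        .delete('",')        # drop characters that are special in CSV
        .split('~~~~', 4)    # email, name, hash, subject
        .join(',')
  end
  (['email,name,hash,subject'] + rows).join("\n") + "\n"
end

# Invented sample line in the format produced by the git log command above.
sample = ['dev@example.com~~~~Jane "JD" Doe~~~~abc123~~~~Fix crash, add test']
puts git_log_to_csv(sample)
```

For a real file, the lines could be read with File.readlines and the result written with File.write to the data/ path used above.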
Final line should be
ruby update_projects.rb projects/unlimited_both.csv data/data_chromium_bugtracker_201606_201705.csv -1
-
openSUSE case
-
The BigQuery part is exactly the same as for Apache or OpenStack (just replace the project name with opensuse):
ruby merger.rb data/unlimited.csv data/data_opensuse_201606_201705.csv
-
AGL (Automotive Grade Linux) case:
-
Go to https://wiki.automotivelinux.org/agl-distro/source-code and download the source code into a directory of your choice:
-
mkdir agl; cd agl
-
curl https://storage.googleapis.com/git-repo-downloads/repo > repo; chmod +x ./repo
-
./repo init -u https://gerrit.automotivelinux.org/gerrit/AGL/AGL-repo; ./repo sync
-
Now you need to use the script
agl/run_multirepo.sh
, which uses cncf/gitdm
to generate GitHub statistics. -
There will be
agl.txt
file generated, something like this:
Processed 67124 csets from 1155 developers
52 employers found
A total of 13431516 lines added, 12197416 removed, 24809064 changed (delta 1234100)
- This gives the number of authors (1155) and commits (67124); these are all-time totals
- To get data for a specific date range:
cd agl; DTFROM="2016-10-01" DTTO="2017-10-01" ./run_multirepo_range.sh
==> agl.txt:
Processed 7152 csets from 365 developers
-
7152 commits and 365 authors.
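Reading the two numbers off the summary line can be automated. A minimal Ruby sketch follows; it assumes the "Processed N csets from M developers" line has exactly the shape shown above, and `parse_gitdm_summary` is a hypothetical name.

```ruby
# Extract commit (cset) and author (developer) counts from the summary line
# that appears in agl.txt; returns nil when the line is not present.
def parse_gitdm_summary(text)
  match = text.match(/Processed (\d+) csets from (\d+) developers/)
  return nil unless match

  { commits: match[1].to_i, authors: match[2].to_i }
end

summary = parse_gitdm_summary("Processed 7152 csets from 365 developers\n")
puts "commits=#{summary[:commits]} authors=#{summary[:authors]}"
```

For the real file, the text could come from File.read('agl.txt').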
-
To get the number of issues, search Jira:
https://jira.automotivelinux.org/browse/SPEC-923?jql=created%20%3E%3D%202016-10-01%20AND%20created%20%3C%3D%202017-10-01
-
It reports 665 issues in the given date range
-
LibreOffice case
-
The beginning (BigQuery part) is exactly the same as for Apache or OpenStack (just replace the project name with libreoffice):
ruby merger.rb data/unlimited.csv data/data_libreoffice_201606_201705.csv
-
Now the git repo analysis. First, copy the file:
cp data/data_libreoffice_git.csv data/data_libreoffice_git_201606_201705.csv
and we will update the data/data_libreoffice_git_201606_201705.csv
file -
Get source code: https://www.libreoffice.org/about-us/source-code/, for example:
git clone git://anongit.freedesktop.org/libreoffice/core
in ~/dev/
-
Analyse this repo as described in:
res/libreoffice_git_repo.txt
, and note that it yields lower numbers than the BigQuery output (so we can skip this step) -
Commits:
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%H" | sort | uniq | wc -l
-
Authors:
git log --since "2016-05-01" --until "2017-05-01" --pretty=format:"%aE" | sort | uniq | wc -l
-
Put results in:
data/data_libreoffice_git_201606_201705.csv
(authors, commits); the values will probably be skipped by the updater tool (they are lower than the values gathered so far) -
Issues: the issue listing is here: https://bugs.freedesktop.org/buglist.cgi?product=LibreOffice&query_format=specific&order=bug_id&limit=0
Create an account and change the columns to "Opened" and "ID"; generally no others are needed (ID is a link). Sort by Opened descending and try to display all results (you may hit an nginx gateway timeout). This URL succeeded for me:
https://bugs.documentfoundation.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&bug_status=NEEDINFO&columnlist=opendate&component=Android%20Viewer&component=Base&component=BASIC&component=Calc&component=Chart&component=ci-infra&component=contrib&component=deletionrequest&component=Documentation&component=Draw&component=Extensions&component=filters%20and%20storage&component=Formula%20Editor&component=framework&component=graphics%20stack&component=Impress&component=Installation&component=LibreOffice&component=Linguistic&component=Localization&component=Printing%20and%20PDF%20export&component=sdk&component=UI&component=ux-advise&component=Writer&component=Writer%20Web&component=WWW&limit=0&list_id=703831&order=opendate%20DESC%2Cchangeddate%2Cbug_id%20DESC&product=LibreOffice&query_format=advanced&resolution=---&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=DUPLICATE&resolution=WORKSFORME&resolution=MOVED&resolution=NOTABUG&resolution=NOTOURBUG&resolution=INSUFFICIENTDATA
Download the result as CSV to
data/data_libreoffice_bugs.csv
, and then count the issues in the given date range "2016-06-01" --> "2017-06-01" with:
ruby count_issues.rb data/data_libreoffice_bugs.csv Opened '2016-06-01 00:00:00' '2017-06-01 00:00:00'
ruby count_issues.rb data/data_libreoffice_bugs.csv Opened 2016-06-01 2017-06-01
Counting issues in 'data/data_libreoffice_bugs.csv', issue date column is 'Opened', range: 2016-06-01T00:00:00+00:00 - 2017-06-01T00:00:00+00:00
Found 7223 matching issues.
Update data/data_libreoffice_git_201606_201705.csv
accordingly.
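For orientation, the kind of counting done in this step can be sketched as follows. This is not the repository's count_issues.rb, just a minimal Ruby illustration under stated assumptions: the date column parses with Time.parse, the range is treated as half-open [from, to), and the sample rows are invented.

```ruby
require 'csv'
require 'time'

# Count rows whose date column falls in [from, to); rows with an empty or
# unparsable date are skipped. The half-open range is an assumption.
def count_issues_in_range(csv_text, column, from, to)
  from_t = Time.parse(from)
  to_t = Time.parse(to)
  CSV.parse(csv_text, headers: true).count do |row|
    opened = begin
      Time.parse(row[column].to_s)
    rescue ArgumentError
      nil
    end
    opened && opened >= from_t && opened < to_t
  end
end

# Invented sample data with the "Opened" column used above.
sample = <<~DATA
  Bug ID,Opened
  1,2016-05-31 23:59:59
  2,2016-06-01 00:00:00
  3,2017-05-31 12:00:00
  4,2017-06-01 00:00:00
DATA
puts count_issues_in_range(sample, 'Opened', '2016-06-01', '2017-06-01')
```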
-
Final line should be:
ruby update_projects.rb projects/unlimited_both.csv data/data_libreoffice_git_201606_201705.csv -1
-
Now let's examine a new case: FreeBSD:
-
Use BigQuery/org_finder.sql (with the condition '%freebsd%') to find FreeBSD orgs. Check all of them on GitHub and create the final BigQuery query:
-
cp BigQuery/query_apache_projects.sql BigQuery/query_freebsd_projects.sql
and update the conditions, run the query, download the results, and put them in data/data_freebsd_201606_201705.csv
(save as table, export to Google Cloud Storage, download the CSV) -
Now define the FreeBSD project the same way as in BigQuery: put the orgs in
map/defmaps.csv
, put the URL in map/urls.csv
, put the orgs as exceptions in map/ranges.csv
and map/ranges_sane.csv
(because some values can be 0 due to the custom BigQuery query) -
Add FreeBSD processing to shells/unlimited:
echo "Adding/Updating FreeBSD Projects"
ruby merger.rb data/unlimited.csv data/data_freebsd_201606_201705.csv
- Go to
~/dev/freebsd
and clone 3 SVN repos:
svn checkout https://svn.freebsd.org/base/head base
svn checkout https://svn.freebsd.org/doc/head doc
svn checkout https://svn.freebsd.org/ports/head ports
- Use the cncf/gitdm freebsd_svn.sh script to analyse the FreeBSD SVN repos:
Revisions: 35927
Authors: 335
-
Now rerun
shells/unlimited_201606_201705.sh
and see FreeBSD's rank -
Run the final updated script:
shells/unlimited_20160601-20170601.sh
to get final results. -
Finally
./projects/unlimited.csv
is generated. Import it into the final Google chart as follows: -
Select the cell A50. Use File --> Import, then "Upload" tab, "Select a file from your computer", choose
./projects/unlimited.csv
-
Then "Import action" --> "replace data starting at selected cell", click Import.
-
Voila! Final version will live here: https://docs.google.com/spreadsheets/d/1a2VdKfAI1g9ZyWL09TnJ-snOpi4BC9kaEVmB7IufY7g/edit?usp=sharing
NOTE: to view these motion charts you will need Adobe Flash enabled when clicking the links. This was tested on Chrome and Safari with Adobe Flash installed and enabled.
For data from files.csv (data/data_YYYYMM.csv), 201601 --> 201703 (15 months):
The chart with cumulative data (each month is the sum of that month and all previous months) is here: https://docs.google.com/spreadsheets/d/11qfS97WRwFqNnArRmpQzCZG_omvZRj_y-MNo5oWeULs/edit?usp=sharing
The chart with monthly data (which looks wrong, IMHO, due to Google motion chart data interpolation between months) is here: https://docs.google.com/spreadsheets/d/1ZgdIuMxxcyt8fo7xI1rMeFNNx9wx0AxS-2a58NlHtGc/edit?usp=sharing
I suggest playing around with the first chart (cumulative sum). It cannot remember settings, so once you click on the "Chart1" sheet I suggest:
- Change axis-x and axis-y from Lin (linear) to Log (logarithmic)
- You can choose which column is used for color: I suggest activity (this is the default and shows which project was most active) or a unique color (you can also select from commits, prs+issues, size; size is the square root of the number of authors)
- Change the playback speed (the control next to play) to the slowest
- Select the projects you are interested in from the Legend (for example Kubernetes, or Kubernetes vs dotnet) and check "trails"
- You can also change which data the x and y axes use (the defaults are x=commits, y=prs+issues) and switch the scale type between lin and log
- You can also change which column is used for bubble size (the default is "size", the square root of the number of authors). Note that the number of authors is the maximum over all months (distinct authors who contributed to activity in a month); this is obviously different from the number of distinct authors active over the entire 15-month range
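The cumulative data and the bubble size described above can be sketched in Ruby. This is an illustration only, not code from the repository: `cumulative_rows` and the hash keys are hypothetical, and using a running maximum of the monthly author counts is an assumption about how "max from all months" is applied to each cumulative row.

```ruby
# monthly: array of hashes in chronological order, one per month, for a
# single project. Returns one cumulative row per month.
def cumulative_rows(monthly)
  totals = Hash.new(0)
  max_authors = 0
  monthly.map do |month|
    # Each month's counters are the sum of this month and all previous months.
    %i[activity commits prs issues].each { |key| totals[key] += month[key] }
    # Authors are not summed (assumption: running maximum of monthly counts).
    max_authors = [max_authors, month[:authors]].max
    totals.merge(authors: max_authors, size: Math.sqrt(max_authors))
  end
end

# Invented monthly numbers for one project.
months = [
  { activity: 10, commits: 4, prs: 3, issues: 3, authors: 4 },
  { activity: 20, commits: 6, prs: 7, issues: 7, authors: 9 }
]
cumulative_rows(months).each { |row| puts row.inspect }
```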
At the top right, just above the Color drop-down, you will see two additional chart types:
- Bar chart - this can be very useful
- Choose lin or log y-axis scale, then select Kubernetes from the Legend, choose any of the possible y-axis values (activity, commits, PRs+issues, Size), and click play to see how Kubernetes overtakes multiple projects during our period.
Finally, there is also a line chart; take a look at it as well.
To generate data for CNCF projects:
- Run
BigQuery/query_cncf_projects.sql
in the Google BigQuery Console. It processes about 800 GiB, which costs about $4. - Save the output to Google Sheets, download it as a CSV file, and save it in
data/data_cncf_projects.csv
(File -> Download As -> Comma separated values ...) - Process the BigQuery output with velocity's analysis tool:
ruby analysis.rb data/data_cncf_projects.csv projects/projects_cncf.csv map/hints.csv map/urls.csv map/defmaps.csv
or use shells/run_cncf.sh
, which does the same - Import the output file
projects/projects_cncf.csv
as the Google chart's data.
There is also a gist here (but above description is more up to date): https://gist.github.com/lukaszgryglicki/093ced06455a3f14f0e4d25459525207
Links to various charts and videos generated using this project are here: res/links.txt
https://www.cncf.io/blog/2017/06/05/30-highest-velocity-open-source-projects/
For this case, a new set of map files was created:
map/k8s_vs_rest_defmaps.csv - list of orgs found in the query
map/k8s_vs_rest_urls.csv - definition of k8s vs rest
map/k8s_vs_rest_hints.csv - list of repos found in the query
Lists of orgs/repos in the map files should contain all values used in any period query.
It should be noted that, historically, new projects have been added as CNCF has grown. To get data for 2016, a query similar to that in BigQuery/query_cncf_4p_201511_201610.sql
should be run, and the following year would be covered by BigQuery/query_cncf_projects_201611_201710.sql.
To prepare an analysis, a command similar to this should be run:
ruby analysis.rb data/data_cncf_projects_201611_201710.csv projects/projects_cncf_k8s_vs_rest_201611_201710.csv map/k8s_vs_rest_hints.csv map/k8s_vs_rest_urls.csv map/k8s_vs_rest_defmaps.csv map/skip.csv map/ranges_unlimited.csv
Two queries were created to be run in Google BigQuery: one for CloudFoundry and one for Chromium. Take a look at
query_cloudfoundry_authors_from_to.sql
The result is in
data_cloudfoundry_authors_201611_201710.csv
A bot can be spotted visually in the row where the author (GitHub login) is 'coveralls':
activity,comments,prs,commits,issues,author
1246,330,104,700,112,frodenas
1210,1210,0,0,0,coveralls
1164,88,58,979,39,genevievelesperance
The other authors can be validated as human by visiting an address such as https://github.com/frodenas
Another way to identify bots is by means of a query such as query_chromium_authors_v2_from_to.sql
, which lists author names and their commit counts. A results file such as data_chromium_authors_v2_201611_2017_10.csv
contains data like the following:
activity,comments,prs,commits,issues,author_name
30583,17349,5997,25,7212,(null)
1549,0,0,1549,0,Matt Gaunt
857,0,0,857,0,Paul Irish
... ... ...
125,0,0,125,0,DevTools Bot
Bots should be excluded from the data queries and from future bot-hunting queries, so as not to duplicate effort.
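Once bots have been identified, dropping them from a results file is a simple filter. The Ruby sketch below is hypothetical: `without_bots` is not part of the repository, and the default skip list contains only the two bots named above. In the actual workflow the exclusion belongs in the BigQuery queries themselves.

```ruby
require 'csv'

# Return the rows whose author is not on the skip list. The default list is
# a hypothetical starting point with the two bots named in the text.
def without_bots(csv_text, author_column, bots = ['coveralls', 'DevTools Bot'])
  CSV.parse(csv_text, headers: true).reject do |row|
    bots.include?(row[author_column])
  end
end

# Sample rows copied from the CloudFoundry results shown above.
sample = <<~DATA
  activity,comments,prs,commits,issues,author
  1246,330,104,700,112,frodenas
  1210,1210,0,0,0,coveralls
  1164,88,58,979,39,genevievelesperance
DATA
without_bots(sample, 'author').each { |row| puts row['author'] }
```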