
velocity's Introduction

Open Source Project Velocity by CNCF

Definitions

  • Authors is defined as the number of distinct commit/changeset authors across all of a project's repositories.
  • The Issues value is defined as the number of distinct issues/bugs/emails (depending on the data source type: GitHub, Gerrit, Linux Kernel mail archives, etc.).
  • The PRs value is defined as the number of distinct Pull Requests/Merge Requests (depending on the data source type).
  • Charts use the number of commits (logarithmic scale) for the X-axis, the sum of issues and PRs (logarithmic scale) for the Y-axis, and the square root of the number of authors for the bubble size (see the sketch below).
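
As an illustration of those scales, here is a minimal Ruby sketch (not part of the toolset) that maps one output row to a bubble; the file name and the chart_point helper are hypothetical:

require 'csv'

# Compute chart coordinates for one project row, per the definitions
# above: log-scale commits (X), log-scale issues+PRs (Y), square root
# of authors (bubble size).
def chart_point(row)
  {
    x:    Math.log10(row['commits'].to_f),                  # log-scale X-axis
    y:    Math.log10(row['issues'].to_f + row['prs'].to_f), # log-scale Y-axis
    size: Math.sqrt(row['authors'].to_f)                    # bubble size
  }
end

CSV.foreach('projects/projects_yyyymm.csv', headers: true) do |row|
  p chart_point(row)
end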

Current reports

1/1/2023 - 1/1/2024:

Past reports

10/12/2022 - 10/12/2023:

7/1/2022 - 7/1/2023:

1/1/2022 - 1/1/2023:

8/1/2021 - 8/1/2022:

7/1/2021 - 1/1/2022:

1/1/2021 - 1/1/2022:

1/1/2021 - 7/1/2021:

1/1/2020 - 1/1/2021:

1/1/2019 - 1/1/2020:

Track development velocity

This tool set generates data for a Bubble/Motion Google Sheet Chart.
The main script is analysis.rb. The input is a CSV file created from BigQuery results.

This tool is used for periodic chart updates, as described in the following documents:
Guide to the CNCF projects chart creation
Guide to the LinuxFoundation projects chart creation
Guide to the Top-30 projects chart creation

https://www.cncf.io/blog/2017/06/05/30-highest-velocity-open-source-projects/
Links to various charts and videos generated using this project

Example use:

ruby analysis.rb data/data_yyyymm.csv projects/projects_yyyymm.csv map/hints.csv map/urls.csv map/defmaps.csv skip.csv ranges.csv

Depending on the data, the script may stop execution and present a command line:

[1] pry(main)>

To continue, type 'quit' and hit enter/return.

Arguments list:

  • data file: points to the results of running an SQL statement designed for Google BigQuery. The query generates a standardized (in terms of velocity) header. The .sql files are stored in the BigQuery/ folder.
  • output file: typically a new file in the projects/ folder.
  • a "hints" file with additional mappings: repo name -> project (N repos --> 1 Project), so a given project name may be listed in many lines.
  • a "urls" file which defines URLs for the listed projects (1 Project --> 1 URL); a separate file is used because otherwise the hints file would have to duplicate the URL for every line of a project.
  • a "default" map file which defines non-standard names for projects generated automatically by grouping within an org (like aspnet --> ASP.net), or groups multiple orgs and/or repos into a single project. It is the last step of project name mapping. The tool outputs a data file into the projects/ directory.
  • a "skip" file that lists repos and/or orgs and/or projects to be skipped.
  • a "ranges" file that defines ranges of repo properties that a repo must fall within to be included in the calculations.

File formats

input.csv (data/data_yyyymm.csv) comes from BigQuery and looks like the following:

org,repo,activity,comments,prs,commits,issues,authors
kubernetes,kubernetes/kubernetes,11243,9878,720,70,575,40
ethereum,ethereum/go-ethereum,10701,570,109,43,9979,14
...

output.csv is meant to be imported into a Google Sheet (File -> Import), with the chart then created from this data. It looks like this:

org,repo,activity,comments,prs,commits,issues,authors,project,url
dotnet,corefx+coreclr+roslyn+cli+docs+core-setup+corefxlab+roslyn-project-system+sdk+corert+eShopOnContainers+core+buildtools,20586,14964,1956,1906,1760,418,dotnet,microsoft.com/net
kubernetes+kubernetes-incubator,kubernetes+kubernetes.github.io+test-infra+ingress+charts+service-catalog+helm+minikube+dashboard+bootkube+kargo+kube-aws+community+heapster,20249,15735,2013,1323,1178,423,Kubernetes,kubernetes.io
...

hints.csv is a CSV file with hints for the repo --> project mapping; it has this format:

repo,project
Microsoft/TypeScript,Microsoft TypeScript
...

urls.csv is a CSV file with the project --> URL mapping, in the following format:

project,url
Angular,angular.io
...

defmaps.csv is a CSV file with proper names for projects generated by default grouping within an org:

name,project
aspnet,ASP.net
nixpkgs,NixOS
Azure,=SKIP
...

The special flag '=SKIP' for a project means that this org should NOT be grouped.
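
By way of illustration, a minimal sketch of how such a lookup could behave, including the '=SKIP' case (the resolve helper is hypothetical, not the actual analysis.rb logic):

require 'csv'

# Load defmaps.csv: default (org-derived) name -> proper project name.
defmaps = CSV.read('map/defmaps.csv', headers: true)
             .to_h { |r| [r['name'], r['project']] }

# Resolve the final project name for an org-level grouping. The special
# '=SKIP' value means the org must NOT be grouped into one project.
def resolve(defmaps, org_name)
  mapped = defmaps[org_name]
  return :no_grouping if mapped == '=SKIP'
  mapped || org_name # fall back to the org name itself
end

resolve(defmaps, 'aspnet') # => "ASP.net"
resolve(defmaps, 'Azure')  # => :no_grouping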

skip.csv is a CSV file that contains lists of repos and/or orgs and/or projects to be skipped in the analysis:

org,repo,project
"enkidevs,csu2017sp314,thoughtbot,illacceptanything,RubySteps,RainbowEngineer",Microsoft/techcasestudies,"Apache (other),OpenStack (other)"
"2015firstcmsc100,swcarpentry,exercism,neveragaindottech,ituring","mozilla/learning.mozilla.org,Microsoft/HolographicAcademy,w3c/aria-practices,w3c/csswg-test",
"orgX,orgY","org1/repo1,org2/repo2","project1,project2"

ranges.csv is a CSV file that defines ranges of repo properties that a repo must fall within to be included in the calculations. It can constrain any of "commits, prs, comments, issues, authors" to be within a range n1..n2 (if n1 or n2 is < 0, that bound is skipped, so -1..-1 means unlimited). There can also be exception repos/orgs that bypass those ranges:

key,min,max,exceptions
activity,50,-1,"kubernetes,docker/containerd,coreos/rkt"
comments,20,100000,"kubernetes,docker/containerd,coreos/rkt"
prs,10,-1,"kubernetes,docker/containerd,coreos/rkt"
commits,10,-1,"kubernetes,kubernetes-incubator"
issues,10,-1,"kubernetes,docker/containerd,coreos/rkt"
authors,3,-1,"kubernetes,docker/containerd,google/go-github"
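
A sketch of the filtering rule as described, where a negative bound means that side is unbounded and exception repos/orgs bypass the rule (the within_ranges? helper is hypothetical):

require 'csv'

rules = CSV.read('ranges.csv', headers: true)

# A repo is kept only when every rule passes. A negative min or max
# means that bound is unlimited; repos/orgs named in the exceptions
# column bypass the rule entirely.
def within_ranges?(rules, row)
  rules.all? do |rule|
    exceptions = rule['exceptions'].to_s.split(',')
    next true if exceptions.include?(row['repo']) || exceptions.include?(row['org'])

    value = row[rule['key']].to_f
    min, max = rule['min'].to_f, rule['max'].to_f
    (min < 0 || value >= min) && (max < 0 || value <= max)
  end
end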

The generated output file contains all of the input data that passes filtering (so it can be, for example, 600 rows for 1000 input rows). You should manually review the generated output and choose how many records you need.

hintgen.rb is a tool that takes data already processed for previously created charts and builds a distinct-projects hints file from it. Example usage:

hintgen.rb data.csv map/hints.csv

Run it multiple times, each time with a different data file (1st parameter), to build up the final hints.csv.
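
A hedged sketch of what such a hint-generating pass could look like (illustrative only; the real hintgen.rb may work differently, and this assumes the hints file already has its repo,project header):

require 'csv'

data_file, hints_file = ARGV

# Repos already present in the hints file.
known = File.exist?(hints_file) ? CSV.read(hints_file, headers: true).map { |r| r['repo'] } : []

# Append any new distinct repo -> project pairs from the processed data.
CSV.open(hints_file, 'a') do |out|
  CSV.foreach(data_file, headers: true) do |row|
    next if known.include?(row['repo']) || row['project'].to_s.empty?
    out << [row['repo'], row['project']]
    known << row['repo']
  end
end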

Input and Output

Data files existing in the repository:

  • data/data_YYYYMM.csv --> data for a given YYYYMM from BigQuery.
  • projects/projects_YYYYMM.csv --> data generated by analysis.rb based on data_YYYYMM.csv, with hints.csv, urls.csv, defmaps.csv (from map/), skip.csv, and ranges.csv as parameters.

Motion charts

generate_motion.rb is a tool that merges data from multiple files into a single file to be used for a motion chart. Usage:

ruby generate_motion.rb projects/files.csv motion/motion.csv motion/motion_sums.csv [projects/summaries.csv]

The file files.csv contains a list of the data files to be merged. It has the following format:

name,label
projects/projects_201601.csv,01/2016
projects/projects_201602.csv,02/2016
...

This tool generates 2 output files:

  • The 1st contains the motion data from each file, tagged with that file's label.
  • The 2nd contains the cumulative sum of the data: the 1st label contains the 1st label's data, the 2nd contains 1st+2nd, the 3rd contains 1st+2nd+3rd, and so on; the last label is the sum of all data. Labels are summed in alphabetical order, so when input data is divided by months, the "YYYYMM" or "YYYY-MM" format must be used to get correct results; "MM/YYYY" would, for example, sort "1/2017" before "2/2016" (see the sketch after this list).
    The output formats of the 1st and 2nd files are identical.
    In files.csv, the first column is a data file generated by analysis.rb; the second column is the label that will be used as the "time" dimension for the Google Sheets motion chart.
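
A minimal sketch of the cumulative pass, assuming the merged file uses the format shown under "Output format" below (generate_motion.rb itself may be implemented differently):

require 'csv'

NUMERIC = %w[activity comments prs commits issues authors]
running = Hash.new { |h, k| h[k] = Hash.new(0) }

rows = CSV.read('motion/motion.csv', headers: true)

# Walk each project's rows in plain string order of their labels; this
# is why labels must sort chronologically as text (YYYYMM / YYYY-MM).
rows.sort_by { |r| [r['project'], r['label']] }.each do |row|
  NUMERIC.each do |col|
    running[row['project']][col] += row[col].to_i
    row[col] = running[row['project']][col] # replace with cumulative value
  end
end
# rows now holds the cumulative variant of the data.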

Output format:

project,url,label,activity,comments,prs,commits,issues,authors,sum_activity,sum_comments,sum_prs,sum_commits,sum_issues,sum_authors
Kubernetes,kubernetes.io,2016-01,6289,5211,548,199,331,73,174254,136104,18264,8388,11498,373
Kubernetes,kubernetes.io,2016-02,13021,10620,1180,360,861,73,174254,136104,18264,8388,11498,373
...
Kubernetes,kubernetes.io,2017-04,174254,136104,18264,8388,11498,373,174254,136104,18264,8388,11498,373
dotnet,microsoft.com/net,2016-01,8190,5933,779,760,718,158,158624,111553,17019,17221,12831,382
dotnet,microsoft.com/net,2016-02,17975,12876,1652,1908,1539,172,158624,111553,17019,17221,12831,382
...
dotnet,microsoft.com/net,2017-04,158624,111553,17019,17221,12831,382,158624,111553,17019,17221,12831,382
VS Code,code.visualstudio.com,2016-01,7526,5278,381,804,1063,112,155621,104386,9501,17650,24084,198
VS Code,code.visualstudio.com,2016-02,17139,11638,986,1899,2616,133,155621,104386,9501,17650,24084,198
...
VS Code,code.visualstudio.com,2017-04,155621,104386,9501,17650,24084,198,155621,104386,9501,17650,24084,198
...

Each row contains the data for its label (separate or cumulative), whereas the columns starting with sum_ contain cumulative data across all labels. This makes the data ready for a Google Sheets motion chart without complex cell indexing.

The final (optional) file summaries.csv is used to read the number of authors, because the number of authors is computed differently. Without the summaries file (or if a given project is not in it), we have the number of distinct authors in each period, and the summary value is the sum of the per-period maxima. That is obviously not a real count of all distinct authors across all periods, since an author active in several periods is counted once per period (see the example below). The true number of authors is computed only if another file is supplied, one that contains summary data for a single longer period equal to the sum of all periods.
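
A tiny illustration of the difference between the two counts:

# Authors active in two consecutive periods, with overlap.
jan = %w[alice bob carol]
feb = %w[bob carol dave]

jan.size + feb.size # => 6 (per-period counts summed)
(jan | feb).size    # => 4 (true number of distinct authors)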

Project ranks

report_projects_ranks.rb & shells/report_cncf_project_ranks.sh are tools that create ranks per project (for all of a project's numeric properties). The shell script reads projects from projects/unlimited_both.csv and uses the reports/cncf_projects_config.csv file to get the list of projects that need to be included in the rank statistics. The file format is:

project
project1
project2
...
projectN

It outputs a rank statistics file, reports/cncf_projects_ranks.txt.

More info

Guide to non-GitHub project processing

Other useful notes

velocity's People

Contributors

brian-brazil, caniszczyk, dankohn, lukaszgryglicki, mr-destructive, radoslaw, rogerogers


velocity's Issues

See project velocity by year

The velocity reports are great, but they are biased towards older projects because they only look at absolute numbers. Would it be possible to add a time dimension to the velocity reports so we can see which projects are speeding up or slowing down?

Logo Design

Hi, how are you? I'd like to collaborate on your open source project and propose a free logo/icon design for it. If it's something you're interested in, please let me know!
Best Regards
Zuur

velocity_x and animator causing error while trying to run flutter project

I started a project today. I want to build a form with Flutter + Dart, and velocity_x looked interesting for the front end, so I added its latest version to my pubspec.yaml in VS Code and ran 'flutter pub get'.

Then I imported it into my Dart code and started running it.

I ended up with this error (shown in the attached screenshot), and as I'm just starting with Dart, I don't really know how to fix it.


Data for 1/1/2024 contains stale data from 10/2023, for example Argo

I was trying to find the updated numbers and charts for ArgoCon24EU, and the numbers and charts for 1/1/2024 are the same as the 10/2023 numbers from ArgoConNA23.

From the README page, the spreadsheet with data for the range "10/12/2022 - 10/12/2023" (Linux Foundation Projects Velocity) has the same data as the spreadsheet for "1/1/2023 - 1/1/2024" (Linux Foundation Projects Velocity).
All the projects have the same numbers except the top 3; take a look at the screenshot: from row 5 (the "Argo" project) downward, all projects have the same data.


Severely flawed methodology, wrong results

So, rather than this being a bug in the scripts themselves, this is a bug in the methodology that they implement. According to the article...

Rather than debate whether to measure high-velocity projects via commits, authors, or comments and pull requests, we use a bubble chart to show all 3 axes of data, and plot on a log-log chart to show the data across large scales.

The problem is that this is a completely biased metric. To understand why, let's look at a hypothetical example. Let's pretend for a moment that there are only two ways to make a HTTP request:

  1. Using EveryRequest, which is a library that implements HTTP, FTP, and SCP. Each of those three protocols is maintained by a different author, but within the same codebase; and there is a fourth author that wires it all together in a common interface.
  2. Using HTTPRequest, which is a library that just implements HTTP, and that is maintained by a single author. Similarly, there are FTPRequest and SCPRequest (also each with a single maintainer of its own), but we don't use those because we only need HTTP.

Now, let's say that the userbase looks like this, for each:

  • EveryRequest: 130 people using it in total; 80 people using it for HTTP, 40 people using it for FTP, 10 people using it for SCP.
  • HTTPRequest: 100 people using it for HTTP.
  • FTPRequest: 60 people using it for FTP.
  • SCPRequest: 20 people using it for SCP.

Now, we get the following data:

  • Most users: EveryRequest. This is wrong, because for each individual protocol, HTTPRequest/FTPRequest/SCPRequest are more popular, but their userbase is split between multiple projects.
  • Most commits: EveryRequest. It contains all the commits for each protocol, thereby likely having more commits in total than any of the other three projects, even if it implements the same protocols, with the same degree of maintenance, and so on. Therefore, also wrong.
  • Most authors: Again, EveryRequest has 4 authors, as opposed to 1 author for each of the other libraries; which, for those other libraries, adds up to 3 since there's nobody who needs to tie the interfaces together. Yet there's the same degree of maintenance (one author per protocol), and thus this result is, again, wrong.
  • Most issues: Again, EveryRequest comes out 'on top' because it has more total users than any of the other libraries individually, even if per protocol it has both fewer users and fewer issues. Which, again, is wrong. An additional factor here is that EveryRequest may be less reliable, and therefore generate more issues than any of the other libraries, simply because its quality is lower.
  • Most PRs: Once again, EveryRequest comes out on top. It gets PRs for all three of the protocols, while each of the other projects only get PRs for the single protocol they implement. Even if the net amount of contributions to those is higher.

In all of the above data points, EveryRequest comes out on top; in all of them, incorrectly so. The reason this happens is that the wrong unit of measure ("project") is used; a more accurate measurement would have been by feature. How many authors maintain a given feature? How many people use it? How many contributions does it receive?

As it stands, the metrics greatly favour monolithic projects, which are necessarily going to be corporate projects; it's already well-understood that project structure often mirrors the organizational structure of the organization or environment in which it was developed. This means that corporations are used to developing monolithic internal projects, and have simply extended this practice to their open-source projects (which can indeed be seen from the architecture of many of the listed projects).

On the other hand, individual developers are more likely to build smaller, single-purpose projects that can be integrated with other software, and that are often deployed far, far more widely than these "high-velocity" monolithic projects. How many people really use OpenStack, for example? It's primarily used internally at companies for large infrastructure deployments, and that's also where its contributions come from, because it tries to handle everything in that infrastructure in a single project (even if composed of multiple parts).

(There are quite a few other unaddressed questions here, too. Was the project always corporate-backed, or did that only happen after the bulk of the contributions? How many of the contributions are made by third parties, and how many are made by employees of the corporation running the project? How much does the corporation really contribute?)

In other words: the way you're measuring favours corporate open-source projects, and has therefore already decided the outcome of the research before even starting on it. This is already a problem from a research perspective, but it's made worse by the chilling effect this can have on individual open-source contributions (especially when published on the Linux Foundation site!), by making individual contributors feel like the open-source community is no longer 'theirs'. There are real and serious consequences to this.

I would strongly recommend retracting the article and informing press (eg. TechRepublic) of that, or at the very least adding a clear notice at the top that the research is not reliable. As it stands, it's extremely misleading.

EDIT: From a quick glance at the article, it also seems like this data was based entirely on GitHub projects alone, which introduces further bias. There are many other platforms (including self-hosted!) that are often used for maintaining non-corporate projects.

Zephyr duplicated in 1/1/2023-1/1/2024 Linux Foundation data

Hello,

Zephyr is represented twice in the 1/1/2023-1/1/2024 Linux Foundation Projects data/chart. Also, OpenTelemetry seems to be missing from that chart.

It appears that the OpenTelemetry data may have been subsumed by one of those Zephyr entries.

Data for 2020?

Hey CNCF velocity! I was wondering if there was a plan to generate new data for 2019-2020? It would be really great to see!

"Number of pull requests" data appears inaccurate/misleading

In the velocity reports, we report "The y-axis is the total number of pull requests and issues".

From the query, this is determined by the total number of PullRequestEvents: https://github.com/cncf/velocity/blob/8e1d1c189b65e2544fae7aec43c6381f9e4b4d82/BigQuery/velocity_cncf.sql#L19C18-L19C34.

A PullRequestEvent does not correlate 1:1 with "a PR" in a way a person would interpret a count of PRs, in my opinion. There are two reasonable approaches (merged PRs or opened PRs, strongly preferring merged PRs), neither of which this counts.

Per the docs: 'The action that was performed. Can be one of opened, edited, closed, reopened, assigned, unassigned, review_requested, review_request_removed, labeled, unlabeled, and synchronize.' However, in practice I found this doesn't seem to be the case. Looking at a single day across GitHub:

   1916 reopened
 168129 closed
 193220 opened

Even without the other possible events, we at least appear to be double-counting PRs?

Authors definition

In the velocity report, it's unclear what constitutes an author, and what the difference is between that and contributors as measured by devstats.

Date in velocity charts title is incorrect

If you open any of the charts for CNCF Project Velocity, the date in the chart title seems to be hard-coded to "1/1/2022 - 1/1/2023", which doesn't match the title of the spreadsheet. I think the spreadsheet title is correct; for example, the current report seems to be correctly showing the year starting from 12 October 2022.

