census's Issues

Polling just debian popcon?

I have been thinking about creating a cross-distribution popcon project.
Debian is a vast resource for popcon data, but Ubuntu has its own collection [1], and
other distributions don't gather such data.
There have been some failed attempts to create popcon for Gentoo and Fedora.
I think there is a good case for creating a unified project covering all of those
distributions.

I have started working on such a project, but I have limited resources, and I am not
sure I will be able to host the web service for the long term.

Question about oss_package_analysis

I'm an undergraduate student, currently using your oss_package_analysis file to develop a metric for measuring risk management of open source repositories.
class Oss_Package(object):
    '''
    Class that represents an OSS package and corresponding attributes
    '''
    def __init__(self, package_name, openhub_lookup_name, direct_network_exposure,
                 process_network_data, potential_privilege_escalation,
                 comment_on_priority):

        self.package_name = package_name
        self.direct_network_exposure = str(direct_network_exposure)
        self.process_network_data = str(process_network_data)
        self.potential_privilege_escalation = str(potential_privilege_escalation)
        self.comment_on_priority = comment_on_priority

I am stuck on finding information about package_name, direct_network_exposure, process_network_data,
and potential_privilege_escalation. If you can point me to where these values come from, I would greatly appreciate your help.
Thank you.

Consider New York City (NYC) 2016 brainstorming ideas

The following are notes from a brainstorming session in NYC in 2016. This needs some cleanup & further analysis; the goal here is to capture the ideas.

morning session -- how to measure and prioritize project investment

Looking at what can be measured.

How to measure what exists. How would you understand what needs to exist and doesn’t yet?

Census -- CII
Grabbing data from OpenHub, Black Duck, Debian et al
Applying an algorithm to weigh that based on “risk”?

Q: What type of data is in the Census?
Debian measure of “popularity”? I.e., users who allow tracking of use report back which packages they install…
OpenHub data -- commits, # of developers, etc. Scrapes GitHub, SourceForge, etc.
Whether it has a website
CVEs…
# of contributors (if 0 in the last year, that's a problem)
Popularity (as above)
Is it in C or C++? More vulnerable…
Network exposure

Questions about how to interpret the data...one developer may mean an atrophied project. 100 may mean disorganized project. Non-linear…

Problem -- much won’t be in there. Reproducible builds would not be in there ever. It’s not an installed package. Or, deciding which browser needs to be re-written, wouldn’t be there.

Questions on how to expose link between application software and system packages…

Another dimension of measurement deserves to be called out: class of ux bugs. Can’t just scrape a bug database, etc. to understand where these exist. Making assumptions, automating, w/o “thick data” can lead to really bad decisions.

How do we measure importance in different ways, how to measure dependencies, use, etc.

Libraries.io -- measures software interdependencies, chains of dependencies…
Problems being solved: discoverability, maintainability (problem, there’s a suite of software that someone can work on in spare time, need to ensure it’s up to date), sustainability.
Might be useful to have as a map of what to work on next
Could be useful to help developers choose what to use (or stay away from!)
Discovering what to use, and what to fix, and what to replace, etc..

Important to do both “big public works” projects, and funding incremental maintenance and small projects.

What can data illuminate?
As making decisions on what to recommend -- if you can look at a map that shows some cluster of higher level applications are centered around some dependency that’s poorly supported. So, looking at it as a forecast, as well.

Also want to be able to track improvement and security. The Census wants to run regularly, plot a timeline -- how is this doing?

Need a time-indexed record where analysis of changes over time can be done.

One thing to track -- we don't currently track the domains where software is used -- whether a game, or critical infrastructure. Not just “is a dependency”
How to automate? Can exclude packages installed automatically from “recommends”

Metric: “is this fundamentally a security component of systems?” Hard to do, because libraries that aren’t technically “security” libraries, can also be issues.
Can look at how often they’re a dependency for security-critical functions. And, how exposed are they.

Not always obvious whether something is connected to the network. Idea that we need a lot of automation to look at this -- source analysis, simple table analysis…

How to make this type of data generally augmentable? How to allow many eyes, etc.?

Can you have the developers help augment?

Can you have small grants to incent developers to analyze and create their own weights, views, analyses of the data?

There’s a huge amount of knowledge and intuition that those who’ve worked with these packages will have, and could add. Engaging them in helping to augment, crowdsource this info.

There are also many projects on github that aren’t packaged anywhere. Don’t show up on dependency trees, but have millions of users.

Starting to see a shift in the way people consume open source -- from packaged to direct pull… People used to consume open source by buying Red Hat. Now they do it because some guy on the dev team was like, “need a library to do X, so pull it down.” Many ways to bypass package managers. GitHub and Docker are very invested in this data.

GitHub could start specifying dependencies of GitHub projects on other GitHub projects… But, this has been solved in part. MBM can solve this, configure scripts can pull dependencies in…
What’s the incentive to put dependency info outside Docker image (etc.)? No incentive to make it available outside. Skepticism re. Docker among developers -- “surprise! You’re running all these fucking dependencies that you never knew!”

What are the classes of funding priorities and directions that people might use these for?

How do we use this to track ecosystem change over time? Very difficult to sustain funding without showing movement. Need a stable baseline to show movement.

Need historical view -- that’s what we’re creating here. Can show rise and fall of different popularity. Can’t measure security, but can integrate metrics that give us a sense of key metrics…

Another metric that could be interesting -- a variety of AI based tools that can assess “code quality.” Could run this over key projects…

Could do density measures -- # of warnings per lines of code…
Could be done as an external data source -- something that builds stuff and assesses across different metrics.

Meta question: what infrastructure is this built on?

Could use NORAD-like system (test infrastructure) to analyze sources, etc.?

Inability to build and test a datapoint on its own…

How to get this data? Incent to give us data? Insidious incentives? Infantilize people?
Get to the people who run the distros, have them demand it of people submitting…
Can’t just demand, social response will be shitty. Maybe before the next Debian conference...pay $50/$100 per data-entry, and donate this to conference for travel, etc.. “For every one of these that gets done, we’ll give $50K to the Debian travel fund.”

Could certify (“good software”) via best practice badges…

Lesson learned by EFF re. the Secure Messaging Tools Scorecard -- good effort, but a lot of criticism. People were reading it as a nutrition label -- Consumer Reports...

Ways to prioritize projects (census work/metrics) - raw brainstormed results, NYC 2016

Blue
(A1) Code quality: (Measure with) compiler warnings, etc.
How else can we measure importance/popularity? E.G., if a popular program (like Skype) uses library X, it’s popular.
Follow dependencies to determine popularity (use system & language package managers)
CVEs - but what does it mean?
Examine code: Look for vulnerabilities (Coverity scan, etc.), look at “quality” metrics (lint, rubocop, etc.)
Classification/typing of programs. E.G., CII for networks, gaming, etc. Some package managers include this information
Does it build?
Demonstrating results / ecosystem change
Interdependency without package management (indirect)
Properties of CII/crypto primitives
Separate dependencies - if it’s depended on by 3 packages, it’s more important than if it’s pulled in once (expat)
Ubuntu popcon
Dependency compilation
Crowd sourcing by email? ($ donated to Debian travel fund?)

Green
Debian popcon
Open Hub (GitHub, SourceForge, …)
# of committers (& patterns), # of commits (& patterns), license, readme, contributors
Bug reports: Types, reactivity, context
Cron dis

Red:
Application level - how does it relate?
Initiatives
UX quality - what is the experience? Usability is human.
Exposure (to attack?)
Indie projects - direct pull
What are we running this all on?

Light green:
Community? (Is there one?)
Dev’s what
Tracking
What to recommend

libsqlite3-0 risk index should be 8 not 6

Manually adding up the components yields a score of 8 (a sketch of the arithmetic follows the list):

  • website: 0 points
  • CVE: 3 points (4 CVEs since 2010; don't know why it's marked as 0 in your csv)
  • Contributor: 0 points (according to scm history, 4 contributors in 12 months)
  • popularity: 1 point (popcon vote: 126928 83348, popcon inst: 130862)
  • Network exposure: at least 1 point (firefox uses it for IndexedDB IIRC)
  • Dependencies: 2 points (440 unique reverse depends for libsqlite3-0 sqlite3)
  • Patches: 1 point (8 patches, not marked as forwarded in debian)
  • ABRT crash statistics: don't know where to get this from
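
For reference, a minimal sketch of the arithmetic claimed above (the component names are illustrative, not the exact ones used by oss_package_analysis.py):

# Hypothetical breakdown for libsqlite3-0, summing the points listed above.
components = {
    'website': 0,
    'cves_since_2010': 3,
    'recent_contributors': 0,
    'popularity': 1,
    'network_exposure': 1,   # "at least 1" per the notes above
    'dependencies': 2,
    'unforwarded_patches': 1,
}
print(sum(components.values()))  # 8, not the 6 reported in results.csv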

Deleted project causes error

The zlib project was apparently deleted on OpenHub. The cache file reports it as a deleted project. The OpenHub deleted-project page doesn't parse cleanly, resulting in a traceback as follows:
Traceback (most recent call last):
  File "./oss_package_analysis.py", line 444, in <module>
    main()
  File "./oss_package_analysis.py", line 406, in main
    project_data['comment_on_priority'])
  File "./oss_package_analysis.py", line 202, in __init__
    self.get_openhub_data(openhub_lookup_name)
  File "./oss_package_analysis.py", line 243, in get_openhub_data
    tree = ET.parse(filename)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 20, column 2

It would be nice if the program were to catch this exception. For now, I am removing the OpenHub label for zlib from the projects_to_examine file along with some other clean-ups and will submit a pull request with that updated version.

zlib.txt
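
One way the program could guard against this is to catch the parse error and skip the OpenHub data for that package; a minimal sketch (the function name and fallback behaviour here are illustrative, not the repository's current code):

import xml.etree.ElementTree as ET

def parse_openhub_cache(filename):
    # Return the parsed XML tree, or None if the cached page is not valid XML
    # (e.g., OpenHub served a "deleted project" HTML page instead).
    try:
        return ET.parse(filename)
    except ET.ParseError as err:
        print('Warning: could not parse %s (%s); skipping OpenHub data' % (filename, err))
        return None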

Future: Consider adding bug report processing information

Consider the following (from the paper section 5.B):
Gather and analyze bug report processing (e.g., how long (on average) does it take to respond to a bug report, and how many bug reports lie unresolved after some time (such as 90 days)). This turns out to be hard data to gather across a large number of projects, because many projects do not separate bug reports from enhancement requests. The “isitmaintained.com” site can analyze GitHub projects to separate bug reports from enhancement requests, but it cannot analyze projects on sites other than
GitHub, and it requires that a project use one of the tags it knows about.

Consider reporting each part of the risk index's value in the result

Currently the results file reports the final risk index, but not the breakdown of how the score was derived. You can figure it out from the other data, but it might be better to report the specific values as well.

This could be reported as a bunch of new columns, one for each value. That would be easy to import into SQL and search on, for example. If that's a pain, it could be reported in the form "0+2+0+1+1..."; each of the values could be in a specific order.

This is inspired by Nathan Willis's article on LWN.net; see https://lwn.net/Articles/651268/ which says, "Regrettably, the raw numbers that make up each package's score do not appear to be available. It would have been interesting to see the exact point values assigned for number of contributors, for example."
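
A minimal sketch of the extra-columns variant (column names and values here are hypothetical, just to show the shape of the output):

import csv

# Hypothetical per-package breakdown; in practice these values would come
# from the scoring code rather than being hard-wired.
breakdown = {'package': 'libsqlite3-0', 'website': 0, 'cve': 3, 'contributors': 0,
             'popularity': 1, 'exposure': 1, 'dependencies': 2, 'patches': 1}
fields = ['package', 'website', 'cve', 'contributors', 'popularity',
          'exposure', 'dependencies', 'patches', 'risk_index']
breakdown['risk_index'] = sum(v for k, v in breakdown.items() if k != 'package')
with open('results_with_breakdown.csv', 'w') as out:
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerow(breakdown)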

Examine other potential ways to get data about OSS projects

There's an interesting list of ways to get OSS project metadata in the discussion about what OMB should ask for.

Other approaches:

  • Use a collection of source code weakness analyzers (these are static analysis tools) to look for vulnerabilities (HP Fortify, Coverity, SWAMP’s set, etc.). You can use vulnerability density (#hits/KSLOC) to hint at the quality of the code overall. This isn’t a new idea, of course, but it still seems to be one of the bigger ones being discussed in places such as the NIST 2016 forum on security metrics. This is challenging for the census, because there are so many languages involved, but it’s possible.
  • Use tools to identify “where did the source code weakness analyzers give up or are likely to miss things?” Sadly, the proprietary tool-makers have some incentives to not reveal where they give up, and in any case it’s often hard to report (they have to approximate). I don’t know of any production-quality tool that really does this, suggestions welcome.
  • Use tools to examine quality-related issues; these can hint at potential problems, and also might hint at areas where the source code weakness analyzers are likely to give up (since they can identify especially-complex code). There are, of course, tools that do this.
  • Use dynamic analysis tools (e.g., fuzzers). The problem here, of course, is that not only is this compute-intensive, but it’s labor-intensive to set up execution environments for each one. I don’t think this makes sense for the census at this time.

Add points if listed in debian-security-support

The debian-security-support package lists packages for which security support is no longer available within Debian, or for which support is explicitly disclaimed:

https://anonscm.debian.org/cgit/collab-maint/debian-security-support.git

I don't think it is necessary to analyse the reasons for lack of support too closely. Even if it does not apply to the current version, the reasons for lack of support rarely go away completely, so there is usually a risk of recurrence.
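
A minimal sketch of how such a check might work, assuming a local checkout of that repository and treating its lists as plain text with one source package per non-comment line (the exact file names and column layout are not verified here):

def unsupported_packages(list_file):
    # Collect the source package names listed in a debian-security-support
    # list file; the first whitespace-separated token on each non-comment
    # line is assumed to be the package name.
    packages = set()
    with open(list_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                packages.add(line.split()[0])
    return packages

# Hypothetical use in the scoring code:
# if package_name in unsupported_packages('security-support-limited'):
#     risk_index += 1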

Future: Consider adding risk if many "downstream-only" patches

Consider the number of downstream-only patches. E.G., if a deb or rpm includes more than 5 patches which have not been accepted upstream, the package receives a point. Distros carry patches for unique packaging requirements and when the upstream project is non-responsive. One or two patches may adjust for unique requirements, but beyond that (especially if they last a long time) they may suggest a non-responsive project. The patches are often less reviewed than the original project and so may add risk to the project all by themselves. This parameter may have some overlap with the Contributor Count parameter already included (if there are few contributors, downstream patches may be the only effective way to fix something).
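
A minimal sketch of one way to count such patches for a Debian source package, assuming an unpacked source tree and DEP-3 style patch headers (this is a heuristic sketch, not what the census currently does):

import os

def downstream_only_patches(source_dir):
    # Count patches under debian/patches whose DEP-3 headers do not mark
    # them as forwarded upstream (headers are optional, so this is rough).
    patches_dir = os.path.join(source_dir, 'debian', 'patches')
    series_file = os.path.join(patches_dir, 'series')
    if not os.path.exists(series_file):
        return 0
    with open(series_file) as f:
        names = [l.strip() for l in f if l.strip() and not l.startswith('#')]
    count = 0
    for name in names:
        with open(os.path.join(patches_dir, name)) as patch:
            header = patch.read(2048).lower()
        if 'forwarded:' not in header or 'forwarded: no' in header:
            count += 1
    return count

# e.g., add a point to the risk index if downstream_only_patches(tree) > 5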

Look at Jesus M. Gonzalez-Barahona (Bitergia) information, e.g., Polarsys Maturity Model, GrimoireLab

Jesus M. Gonzalez-Barahona (Bitergia) has done a lot of work on measuring OSS projects; we should look further at his (their) work. We had an interesting conversation at the 2016 Linux Foundation Collab Summit.

They've participated in some developments by the Eclipse Polarsys WG, which has a focus on maturity and future availability. The most interesting is probably the Polarsys Maturity Model, which is based in part on data they collect with MetricsGrimoire (it uses other sources too, such as Sonar): http://dashboard.polarsys.org/

We can see the definitions of the metrics in: http://dashboard.polarsys.org/documentation/metrics.html
and the GQM model in: http://dashboard.polarsys.org/documentation/quality_model.html
(Jesus notes that it takes a while to load).

Example:
http://projects.bitergia.com/opnfv

Information on MetricsGrimoire is here: https://metricsgrimoire.github.io/

They are writing a whole new metrics collection system called GrimoireLab (a complete redesign based on their experience with MetricsGrimoire). As I understand it, the software itself is OSS (in Python 3); they then sell services on top of it. GrimoireLab's architecture includes:

  • Perceval – grabs data from VCSs and other software repositories. One backend per kind of repository, in Python 3; produces “data items”. Supports GitHub, SourceForge, etc. (many!). It's not hard to write new backends (see the sketch after this list).
  • Arthur – orchestrates retrieval, in Python 3.
  • Kibiter – a fork of the Elasticsearch dashboard (they are trying to upstream their changes).
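
For a concrete feel, a minimal sketch of pulling commit “data items” with Perceval's Git backend (based on Perceval's documented Python API; the repository URL and clone path are just examples):

from perceval.backends.core.git import Git

# Clone (or update) the repository and iterate over its commits as
# Perceval "data items".
repo = Git(uri='https://github.com/coreinfrastructure/census.git',
           gitpath='/tmp/census.git')
for item in repo.fetch():
    print(item['data']['commit'])   # commit hash of each data item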

More about GrimoireLab is at: http://grimoirelab.github.io/

A related book, "Evaluating Free / Open Source Software Projects", is here: https://github.com/jgbarah/evaluating-foss-projects

On a related note, lots of GitHub-related information is available via: http://ghtorrent.org/

Allow interactive adjustment of weights by users using web browser

Enable varying the weights inside a browser. Since we have no "truth values" to compare against, we've had to estimate weights using expertise, and people can always question that. A solution that Gartner uses is to say "if you don't like our weights, here's a tool that lets you select your own weights."

The D3.js library (https://d3js.org/) might be perfect for this - it'd make it easy to visualize what changes with weight changes. With modern browsers we should be able to dump and process the entire dataset in a browser without problems.
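
Whatever the front end, the underlying recomputation is simple; a minimal sketch of the reweighting a browser UI (or a preprocessing script) would perform, with illustrative component names:

# Recompute a risk index from a per-package component breakdown using
# user-chosen weights (all names here are illustrative).
default_weights = {'cve': 1.0, 'contributors': 1.0, 'popularity': 1.0,
                   'exposure': 1.0, 'dependencies': 1.0, 'patches': 1.0}

def reweighted_index(components, weights=default_weights):
    return sum(weights.get(name, 1.0) * value
               for name, value in components.items())

print(reweighted_index({'cve': 3, 'popularity': 1, 'exposure': 1},
                       {'cve': 2.0}))  # 8.0: CVEs counted double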

comment about the "huge backlog" of issues re: BIND?

I am curious about the source of the comment about the BIND issue backlog. It sounds like a vague rumor, which stands out in this otherwise mostly factual report. I can't guess what you would consider a "huge backlog" for a very long-lived, large-sized project. I am mostly concerned that the comment implies we either don't promptly triage incoming reports (we do triage them within a day or so) or that we don't fix critical or important issues (I think we have quite a good record on that).

Can you clarify?

I am the Product Manager for both BIND and ISC DHCP and have full access to the bug database.

Reflect mutual exclusion of boolean variables in the code

By design, the boolean variables 'direct_network_exposure', 'process_network_data', and 'potential_privilege_escalation' are mutually exclusive; that is, exactly one of them must be selected. Currently the code doesn't enforce that.
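
A minimal sketch of the kind of check Oss_Package could perform when it is constructed (a sketch, not the repository's current code; it treats the three values as truthy/falsy flags):

def check_exposure_flags(direct_network_exposure, process_network_data,
                         potential_privilege_escalation):
    # Exactly one of the three exposure categories must be selected.
    flags = [direct_network_exposure, process_network_data,
             potential_privilege_escalation]
    if sum(1 for f in flags if f) != 1:
        raise ValueError('exactly one exposure category must be selected')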

Adjust the Popularity measure in the risk index

Currently a package receives a point if it is in the top 90% of the packages analyzed, making this a relative measure. Consider making it absolute, adjusting this measure to the top 5% of ALL Debian packages based on [1]. With more than 140K packages being tracked by the popularity contest, it is more sensible to reduce this measure to a much smaller percentage. Even 1% (~1400 packages) can be a reasonable threshold. Thanks.

[1] http://popcon.debian.org/
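
A minimal sketch of the absolute variant being proposed here (the package count and threshold are the figures suggested above, not values taken from popcon itself):

TOTAL_DEBIAN_PACKAGES = 140000   # roughly, per popcon.debian.org
THRESHOLD_FRACTION = 0.01        # top 1% of all packages, ~1400 packages

def popularity_point(popcon_rank):
    # Award the popularity point only to packages within the top fraction
    # of all Debian packages by popcon rank (rank 1 = most popular).
    return 1 if popcon_rank <= TOTAL_DEBIAN_PACKAGES * THRESHOLD_FRACTION else 0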

Future: Consider adding ABRT crash data

Florian Weimer suggested adding ABRT crash report counts. An advantage is that these correlate to potential issues and whether/how quickly those issues are being fixed. CVE counts (by contrast) only show up when someone is looking and bothers to request a CVE number for the issue.

Here is some such data from Fedora and CentOS:
https://retrace.fedoraproject.org/faf/summary/

Source code for the server appears to be here:
https://github.com/abrt/faf/tree/master/src/webfaf2

Based on that code, you can get raw JSON by sending a suitable Accept header:
$ curl -H "Accept: application/json" https://retrace.fedoraproject.org/faf/stats/yesterday/
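
The same request from Python, for anyone wiring this into the census scripts (a sketch; the structure of the returned JSON is not assumed here):

import json
import urllib2  # Python 2, to match the environment in the tracebacks above

req = urllib2.Request('https://retrace.fedoraproject.org/faf/stats/yesterday/',
                      headers={'Accept': 'application/json'})
stats = json.load(urllib2.urlopen(req))
print(stats.keys())  # inspect what the endpoint actually returns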

Future: Consider adding dependency count

Consider counting how many packages depend on something (and possibly their popularity), to emphasize popular libraries. It may be that this is essentially captured in popularity counts, but perhaps not.
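
A minimal sketch of one way to approximate this on a Debian system with apt-cache (a heuristic sketch; it is not how the census currently gathers dependency data):

import subprocess

def reverse_dependency_count(package):
    # Count the unique reverse dependencies reported by apt-cache rdepends.
    output = subprocess.check_output(['apt-cache', 'rdepends', package],
                                     universal_newlines=True)
    deps = set()
    for line in output.splitlines():
        line = line.strip()
        if line and line != package and not line.startswith('Reverse Depends:'):
            deps.add(line.lstrip('|'))
    return len(deps)

print(reverse_dependency_count('libsqlite3-0'))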

Record/report trends

Report on trends for FLOSS overall, in addition to identifying "projects that most need help". This would be of interest to a lot of people who aren't FLOSS developers.

This is a completely new purpose, but a lot of the work we're already doing gets us close to it. After all, we already use a program to gather quantitative data about projects, and then calculate a score for each project. We'd need to re-run the program periodically to see trends (say weekly, monthly, or quarterly), but that's very doable. We'd also need to use a much larger set of programs (currently we focus on "concerning" ones), but we intended to broaden the set of projects anyway. I imagine we'd like to review that every 6 months, but I think we should re-run much more often (weekly?) to show that the trends really are trends.

It may be easier to see trends if we break software into categories. Are there categories that we could identify a priori that might be useful? We could also try to automatically determine categories from the data, but that only works if we collect the data that would help divide the software into reasonable categories :-).

some packages mentioned in the paper are missing from results.csv

I see nginx mentioned in the paper, but it's not in results.csv or projects_to_examine.csv; did it get lost in the process?
Risk index should be ~8 points if I counted right:

  • website: 0 points
  • CVE : 3 points (17 CVEs since 2010)
  • Contributor: 0 points (according to scm history, 22 contributors in 12 months)
  • popularity: 1 point (popcon vote: 126928)
  • Network exposure: 2 points
  • Dependencies: 2 points (~20 unique reverse depends for nginx, nginx-light, nginx-extras, nginx-full)
  • Patches: 0 points (1 patch)
  • ABRT crash statistics: don't know where to get this from

Future: Consider adding static analysis for vulnerabilities (e.g., hit density)

Per section 5.B of the paper:

Perform static analysis on source code to determine the likely number of latent vulnerabilities (e.g., using Coverity scan, RATS, or flawfinder); measures such as hit density could indicate more problematic software. A variant would be to report on densities of warnings when warning flags are enabled.
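
A minimal sketch of the hit-density arithmetic (the numbers are placeholders; "hits" would come from a tool such as flawfinder or Coverity, and SLOC from a tool such as sloccount):

def hit_density(hits, sloc):
    # Static-analysis hits per thousand source lines of code (KSLOC).
    return hits / (sloc / 1000.0)

# e.g., 42 hits reported in a 120,000-line package:
print(hit_density(42, 120000))  # 0.35 hits/KSLOC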
