foss-heartbeat's People

Contributors

dimpase, jessfraz, matiasgarciaisaia, metajack, mjgiarlo, mureinik, ruebot, sagesharp, willingc


foss-heartbeat's Issues

mean requires at least one data point

On running

python ghstats.py zerda-android-l10n mozilla-l10n docs/

I found this happens for many repos. I get this error or #48 with every attempt, so I wonder if there's a connection:

Traceback (most recent call last):
  File "ghstats.py", line 396, in <module>
    main()
  File "ghstats.py", line 393, in main
    createGraphs(args.owner, args.repository, args.htmldir)
  File "ghstats.py", line 366, in createGraphs
    os.path.join(repoPath, i[0] + 's-rampup.html'))
  File "ghstats.py", line 232, in graphRampTime
    '{:.2f}'.format(len(deltas)/(len(deltas)+len(nocontribs))*100) + '%')
  File "/usr/local/lib/python2.7/site-packages/statistics/__init__.py", line 294, in mean
    raise StatisticsError(u'mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point
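
A minimal guard would avoid the crash when a contribution type has no data points; this is a sketch, not the project's actual fix, assuming deltas is the list graphRampTime passes to statistics.mean():

import statistics

def safeMean(deltas):
    # statistics.mean() raises StatisticsError on an empty list, which
    # is exactly what crashes graphRampTime for repos where nobody made
    # this kind of contribution. Return None and let the caller skip
    # the graph instead.
    if not deltas:
        return None
    return statistics.mean(deltas)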

Add directory info to newcomers statistics

Ideas

It would be good to add a section to the newcomer statistics html report along the lines of:

Newcomers who had a merged pull request often touched these directories:
 - foo/ - 85% of newcomer pull requests merged
 - foo/bar/ - 65% of newcomer pull requests merged

The following directories may not be newcomer-friendly:
 - baz/ - 15% of newcomer pull requests merged
 - quux/ - 5% of newcomer pull requests merged
 - corge/ - no newcomer pull requests submitted in the last year

Why is this useful?

Some code is more complex than others. Some open source projects have sub-communities that maintain code in a particular subdirectory, and each sub-community may be more or less newcomer friendly.

Implementation

We would need to pull down the pull request diff stats and store them. The data we need is similar to issue #9.
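
A rough sketch of the aggregation, assuming we had already stored, per newcomer pull request, the set of directories it touches and whether it was merged (the record shape here is hypothetical):

from collections import defaultdict

def directoryMergeRates(newcomerPRs):
    # newcomerPRs: iterable of (directories, merged) pairs, where
    # directories is a set of paths the PR touched and merged is a bool.
    counts = defaultdict(lambda: [0, 0])  # directory -> [merged, total]
    for directories, merged in newcomerPRs:
        for d in directories:
            counts[d][1] += 1
            if merged:
                counts[d][0] += 1
    # Percentage of newcomer pull requests merged, per directory.
    return {d: 100.0 * m / t for d, (m, t) in counts.items()}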

Add bot usernames for elm-lang

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.
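
For reference, adding a bot is just a matter of extending the username list that getBots returns; a sketch (the real getBots in ghstats.py may be structured differently, and the mapping below is only an example):

def getBots(repoName):
    # Map 'owner/repository' to the usernames treated as bots there.
    bots = {
        'rust-lang/rust': ['bors', 'rust-highfive'],
        # 'elm-lang/elm-compiler': ['your-bot-here'],
    }
    return bots.get(repoName, [])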

Negative delta for user error

I am trying to figure out why I get merge errors (in different repos) running

python ghstats.py <repo name> <username> docs/

Specifically, I've been running into this error, which originates in getRampTime():

('Negative delta for user', 'username', 'for', 'merger', 'on', datetime.datetime(2015, 7, 27, 9, 7, 1))
('first contribution was on', datetime.datetime(2015, 8, 26, 20, 57, 25), 'file', 'mozilla-l10n/appstores/issue-103339758/comment-135168390.json')
Traceback (most recent call last):
  File "ghstats.py", line 396, in <module>
    main()
  File "ghstats.py", line 393, in main
    createGraphs(args.owner, args.repository, args.htmldir)
  File "ghstats.py", line 371, in createGraphs
    os.path.join(repoPath, i[0] + 's-frequency.html'))
  File "ghstats.py", line 248, in graphFrequency
    recent = nobots[0][4]

  • It appears to be because nextDate - startDate is negative.
  • If I remove the offending username records from merge.txt, this error goes away, but I end up with another one - which means I'm not solving anything :)
  • This user, for what it's worth, is a staff member (and thus frequently listed). I've only seen this with staff usernames so far, which makes me wonder if it's related to a slightly different way of working (and the first-arrival logic).

Add bot usernames for dotnet/coreclr

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Auto-generate docs/index.html

It would be nice to automatically generate docs/index.html from the project report directories in docs.

  1. Add code to ghscraper.py to fetch the project description. Save to a description.txt file.
  2. Add an optional argument to ghscraper.py to give a project category. Save to a category.txt file.
  3. Add a function to ghreport.py to find all directories that contain foss-heartbeat.html (aside from the template directory). Use the contents of description.txt inside the <p> tag for the project description. Break projects into groups by category. Write each category name as an <h3> header, followed by the list of projects in alphabetical order. Make sure to divide the project list into two columns (with one less entry in the last column). Overwrite docs/index.html.
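
A sketch of the directory scan in step 3, assuming the file layout from steps 1 and 2 (the function name and the 'Other' fallback category are illustrative):

import os
from collections import defaultdict

def findProjects(docsDir):
    # Group (project name, description) pairs by category, skipping the
    # template directory.
    groups = defaultdict(list)
    for root, dirs, files in os.walk(docsDir):
        if 'foss-heartbeat.html' not in files or 'template' in root:
            continue
        with open(os.path.join(root, 'description.txt')) as f:
            description = f.read().strip()
        catFile = os.path.join(root, 'category.txt')
        category = open(catFile).read().strip() if os.path.exists(catFile) else 'Other'
        groups[category].append((os.path.basename(root), description))
    return groups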

Add bot usernames for rails/rails

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Add bot usernames for fsharp/fsharp

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Double check bots for rust-lang/rust

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Add bot usernames for twbs/bootstrap

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Add bot usernames for idris-lang/idris-dev

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Track communities with multiple repos

Suggested by @mjgiarlo.

Many communities have separate repositories for code and documentation. It would be good to combine statistics across those repositories.

Other communities are really separate sub-communities that share the same repository (e.g. the Linux kernel). Some, like OpenStack, have sub-communities split across multiple repositories.

Basically, this is a hard problem, and I want to allow people flexibility in what repos they want to study. I'll have to give this one more thought.

Add bot usernames for nodejs/node

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Fix merger over count in rust-lang/rust

rust-lang/rust has a bot called bors, which will test the pull request against the CI system, and merge it if it passes. We attempt to attribute the merge of the pull request to both bors and the person who issued the command to bors.

ghcategorize.py appendReviewers() makes some bad assumptions:

  • bors will always merge the pull request (which isn't true)
  • any person who issued a command to bors should be counted as a merger

We need to be able to figure out whether bors actually merged the pull request, and backtrack to figure out who was the last person to issue a command to bors.

Further, bors can merge starting from a specific commit. I'm not sure whether people ever merge part of a pull request but not all of it; I'd need to talk to the Rust folks to understand their use cases.

I'll mark the place in the code where this needs to get fixed.
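
A sketch of that backtracking, assuming pull request comments are available in chronological order as (author, body) pairs (the command prefix and the test for "bors actually merged" are placeholders):

def lastCommandIssuer(comments, bot='bors', command='@bors r+'):
    # Walk comments in order; credit the last person who issued a
    # command before the bot reported a merge. Returns None when the
    # bot never merged.
    issuer = None
    for author, body in comments:
        if author != bot and body.startswith(command):
            issuer = author
        elif author == bot and 'merg' in body.lower():
            return issuer  # placeholder test for an actual bors merge
    return None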

Track connectors

Who is a connector?

Oftentimes, newcomers don't know who to add as a reviewer, and their pull requests languish. One type of contributor is the "connector": someone who tags the right person to look at the issue or pull request. An example of this is @rust-highfive, a bot that suggests reviewers for pull requests. Non-bot contributors do this by knowing the skill set of the community and tagging the right person.

How to tell if someone is a connector?

  1. They are the first person to tag another person who hasn't previously commented on the issue or pull request (and isn't the person who opened the issue or pull request).
  2. That person comments on the issue or pull request.

What can we do with the data?

Since it takes non-bots a while to learn which people in the community have domain-specific knowledge, it would be interesting to look at ramp-up time. It would be good to acknowledge the connectors on the active contributors list. Further, we can use the data to do hypothesis testing. E.g. we've identified @rust-highfive as a connector; do pull requests they comment on get closed faster than pull requests they didn't comment on?

How do we get the data from github?

We should currently have all the data we need, since we have issue and pull request comment text. Look for any tags, extract the username from the tag, and look to see if they commented after they were tagged. Discard if they commented before they were tagged. Create a new contribution type and record those contributions in connectors.txt.
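
A sketch of that scan, assuming an issue's comments are available in chronological order as (author, body) pairs and that a simple @-mention regex is good enough (both are assumptions):

import re

MENTION = re.compile(r'@([A-Za-z0-9-]+)')

def findConnectors(opener, comments):
    # connector -> set of users they successfully pulled in
    connectors = {}
    commented = {opener}        # everyone who has spoken so far
    pending = {}                # tagged user -> who tagged them first
    for author, body in comments:
        if author in pending:   # a tagged user commented after the tag
            tagger = pending.pop(author)
            connectors.setdefault(tagger, set()).add(author)
        for user in MENTION.findall(body):
            if user != author and user not in commented and user not in pending:
                pending[user] = author
        commented.add(author)
    return connectors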

Add tag info to newcomer statistics

Idea

When a newcomer approaches a new project, they often don't know what kinds of tasks to look for. Many projects have an "easy" tag, but others have more nuanced tags based on what part of the code or skillset is needed. Some people choose to take on hard bugs or features on their first contribution because they look interesting. They may or may not succeed. After newcomers complete their first simple pull requests, they may want a medium-sized project, but again, they may not know where to start.

Hypothesis testing

Question: for newcomers (active for < 1 year), which tags on referenced issues correlate with pull requests being more likely to get merged?
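
The same aggregation as the directory sketch in the newcomer-directory issue would answer this, keyed by tag; a sketch, assuming we can map each newcomer pull request to the labels on the issue it references (the record shape is hypothetical):

from collections import defaultdict

def mergeRateByTag(newcomerPRs):
    # newcomerPRs: iterable of (tags, merged) pairs; tags is the set of
    # labels on the referenced issue, empty when none is referenced.
    byTag = defaultdict(lambda: [0, 0])  # tag -> [merged, total]
    for tags, merged in newcomerPRs:
        for tag in (tags or {'(no tag)'}):
            byTag[tag][1] += 1
            byTag[tag][0] += int(merged)
    return {tag: 100.0 * m / t for tag, (m, t) in byTag.items()}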

Need more positive examples of open source communication

This is an open issue for anyone to leave links to conversations in open source communities that they think are particularly positive. I'm designing the sentiment analysis to recognize empathy, so I'm looking for uplifting threads with thanks, gratitude, helpfulness, collaboration, or praise. Other positive conversations are also welcome.

`statistics` module required in Python 2.7

The README.md doesn't state anything about the Python version required for the project to run.

When running using Python 2.7, there's an extra dependency to add:

$ python --version
Python 2.7.12
$ python ghstats.py crystal crystal-lang docs/
Traceback (most recent call last):
  File "ghstats.py", line 41, in <module>
    import statistics
ImportError: No module named statistics

Doing `pip install statistics` does the trick. This should also work on Python 3, since the statistics package is a Python 2.* port of the 3.4 statistics module.

This may be a <3.4 issue, though - I'm not a Pythoneer at all.

It would be good to add a notice in the README, or maybe there's a way to list it in requirements.txt that won't conflict with Python 3.4+ installations - I don't know how you'd do that.
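
There is in fact a standard way to do the latter: a PEP 508 environment marker in requirements.txt restricts the dependency to older Pythons, so it won't be installed (and can't conflict) on 3.4+:

# requirements.txt
statistics; python_version < "3.4"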

Bug in generating statistics for elm-lang/elm-compiler

$ python ../src/ghstats.py elm-compiler elm-lang ../src/docs/
Negative delta for user evancz for merger on 2012-04-27 17:01:13
first contribution was on 2012-04-28 06:48:26 file elm-lang/elm-compiler/issue-4333101/comment-5394896.json
Traceback (most recent call last):
  File "../src/ghstats.py", line 280, in <module>
    main()
  File "../src/ghstats.py", line 277, in main
    createGraphs(args.owner, args.repository, args.htmldir)
  File "../src/ghstats.py", line 260, in createGraphs
    os.path.join(repoPath, i[0] + 's-rampup.html'))
  File "../src/ghstats.py", line 126, in graphRampTime
    '{:.2f}'.format(len(deltas)/(len(deltas)+len(nocontribs))*100) + '%')
  File "/usr/lib/python3.5/statistics.py", line 330, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

I suspect no one has contributed in a particular way to elm-compiler, but it's possible there's a bug in the categorizing code, or my scraping for elm-compiler was interrupted.

Add bot usernames for jquery/jquery

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Repo description suggested change: IMPROVE SEO WITH THIS ONE WEIRD TRICK

Hey Sarah!

Thanks for making this incredible project. @kytrinyx mentioned it to me recently, and it's a really interesting way to solve a very common OSS maintainer problem. 🤘 🎉 ❤️

I have a non-code suggestion for this repo that may help more people benefit from it. I'd like to suggest adding an additional sentence to the repo description[1] to possibly help improve its Google search results. Something perhaps like:

Need a good way to do an open source health check? FOSS Heartbeat analyses the health of a community of contributors by ...

Just wanted to suggest it because, since @kytrinyx told me about it several weeks back, I've tried to remember the name of the project so I could give someone else a link, but never could, and consequently had to resort to much more complicated methods of simply finding a link to your project. Before I eventually found it, I always tried unsuccessfully to google for it with terms like "open source health check".

So I'm just hoping, for the sake of more folks discovering your project via search, that adding a few of those types of words to your repo description will help improve the discoverability of this fantastic project.

Cheers! 🍻 :octocat:

Disclaimer: SEO meme in title not indicative of actual SEO expertise. 😜

[1] the repo description I'm referring to:

(screenshot of the repo description)

Add new first contribution type: Merger

About 1-3 users per repository seem to start merging code as their first contribution. Sometimes that's because they're a bot that merges code based on commands in pull request comments. Sometimes they're the original owner of the repo.

Right now, we don't record a merger contribution in first-interactions.txt. This causes ghstats.py to ignore those contributors when calculating ramp-up time, since they aren't in first-interactions.txt:

Negative delta for user sergiy-k for merger on 2015-02-04 03:05:58
first contribution was on 2015-02-04 08:06:23 file dotnet/coreclr/issue-56489388/comment-72806456.json
Negative delta for user Chrisboh for merger on 2015-11-05 22:20:25
first contribution was on 2015-11-05 22:21:18 file dotnet/coreclr/issue-115197419/comment-154213617.json
Negative delta for user JohnChen0 for merger on 2015-04-20 16:21:19
first contribution was on 2015-04-28 21:43:06 file dotnet/coreclr/issue-71715376/pr-34314626.json

I don't think merges get recorded as a pull request comment (or maybe older merges didn't?).

To fix this:

  1. Change ghcategorize.py to record a new contribution type in first-contributions.txt. Probably the best approach would be to:
    a. Pass users from main into createStats
    b. Have appendContributor add that merger info to users if they're not already in the list, or if the date of the merge is before the one in the list.
    c. In main, write first-interactions.txt after createStats is called.

Note: currently the format in first-contributions.txt assumes the interaction has a file associated with it. We could use the filename of the pull-request, but I think then the contribution would be mis-categorized as a pull request (since the statistics code looks at the filename prefix). We could simply say 'merger' instead of the filename, which might work fine because we still have to record the issue directory.

Either way, ghstats.py is going to need to be changed. graphNewcomers will need to be modified to display 'Merged a pull request' and the associated statistic.
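
A sketch of step 1b, assuming users maps a username to a (date, type, path) tuple for their earliest known interaction (the exact structure in ghcategorize.py may differ):

def appendContributor(users, username, mergeDate, issueDir):
    # Record the merge as a first interaction if the user is unknown,
    # or if this merge predates what we already have. Use the literal
    # string 'merger' in place of a filename, per the note above, so
    # the statistics code doesn't mistake it for a pull request.
    if username not in users or mergeDate < users[username][0]:
        users[username] = (mergeDate, 'merger', issueDir)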

Track documentation contributors and submitters

It would be good to acknowledge those contributors who write documentation as part of a pull request.

What could we use the data for?

We could use that information to see what the ramp-up time is for newcomers to start writing documentation. It would also be useful to see what percentage of core contributors write documentation. The data could be used to test hypotheses about whether newcomers ramp up faster on projects with good documentation. Another interesting hypothesis to test would be whether code contributions without documentation are more or less likely to contain bugs.

How to tell if someone is writing documentation

We could look at the file extension and see if it's .md, .txt, .html, etc. Some people may be documenting things with Jupyter Notebooks or literate Haskell, and I'm not sure how to handle that. If we could figure out what language a file is written in, we could parse the number of comment lines added or deleted.
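
A sketch of the extension check (which extensions count as documentation is a judgment call):

import os

DOC_EXTENSIONS = {'.md', '.txt', '.rst', '.html', '.adoc'}

def isDocumentation(path):
    # Heuristic only: Jupyter notebooks, literate Haskell, and comment
    # lines inside source files would all need special handling.
    return os.path.splitext(path)[1].lower() in DOC_EXTENSIONS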

Getting data out of github

Unfortunately, the GitHub pull request JSON file doesn't list which files are touched by the pull request. ghscraper.py would need to be modified to pull down more information.

Sentiment tab is empty

Hi, I ran foss-heartbeat on coala/coala, and it seems like the sentiment tab is empty. Are there any steps that I might have missed?

Add bot usernames for facebook/react

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Add bot usernames for angular/angular.js

Many github communities use bots to suggest reviewers, do automated testing, and more.

foss-heartbeat breaks bots into a separate category in the html contribution statistics, by checking for the username in getBots in ghstats.py.

Additionally, if a bot merges a commit after being issued a command in a pull request comment, appendReviewers in ghcategorize.py will attribute the merge to that person, as well as to the bot. Currently, we check for a specific command string at the beginning of the comment. If your test-and-merge bot accepts commands in the middle of a comment, please update the code.

Need more negative examples of open source communication

This is an open issue for anyone to leave links to conversations in open source communities that they think are particularly negative. I'm designing the sentiment analysis to also recognize personal attacks, so I'm looking for threads that are: insulting, dismissive, rude, arrogant, unprofessional, condescending, inappropriate, hostile, accusatory, racist, sexist, homophobic, transphobic, classist, or ableist. Threads where a code of conduct was invoked or a person's behavior was corrected by core developers are a plus.

Add more newcomer stats

Idea

It would be helpful for the newcomers html report to have a sentence like "On average, first pull requests are open for A days, and it takes B review comments to get them merged. It usually takes the community C hours to respond to a pull request. D% of pull requests never get a response."

This could be extrapolated to issues as well (e.g. how long does it take for an issue to be closed).

Why is this useful?

Waiting for someone to respond to a pull request is the worst feeling for newcomers. This would encourage newcomers to not give up on their pull request, and to understand that it often takes many comments or revisions to get something merged into an established project. It would also encourage core contributors to compare their response times to other competing projects' response times, and improve.

Implementation Details

The averages should be taken across the last year's pull requests, to allow the community room to improve. Don't include review bots in the number of hours it takes for the community to respond to a pull request.
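
A sketch of the response-time piece, assuming each pull request record carries its opening time and a chronological list of (commenter, timestamp) pairs, with bot usernames supplied by getBots and the caller pre-filtering to the last year's pull requests (the data shapes are assumptions):

import statistics

def responseStats(pullRequests, bots):
    # Returns (mean hours to first non-bot response, percent of pull
    # requests that never got a response).
    hours = []
    unanswered = 0
    for openedAt, comments in pullRequests:
        responses = [when for who, when in comments if who not in bots]
        if responses:
            hours.append((min(responses) - openedAt).total_seconds() / 3600)
        else:
            unanswered += 1
    total = len(hours) + unanswered
    percent = 100.0 * unanswered / total if total else 0.0
    return (statistics.mean(hours) if hours else None, percent)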

ghcategorize.py: Remove comments from submitters

At least in the facebook/react data I have, I noticed that one comment from the author of a pull request got put into reviewers.txt, which is incorrect. reviewers.txt should only contain comments from other users on the submitter's pull request.

The file I noticed that was included was facebook/react/issue-114559118/comment-153236130.json, which is a comment by the user ali, who submitted the pull request. This could explain the odd statistics I've been seeing with ghwordhypothesis.py. Maybe submitters are thanking maintainers for merging their pull requests and that's adding false positives to the 'thanked' submitter count?
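
A sketch of the filter, assuming each comment record carries its author's username and we know the pull request submitter (the field name is an assumption):

def reviewComments(comments, submitter):
    # Only comments from people other than the submitter belong in
    # reviewers.txt; self-comments are the bug described above.
    return [c for c in comments if c['user'] != submitter]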

Add leader board for core contributors and new contributors

Idea

For each category under contributors, it would be nice to see a list of the top ten active contributors. The scatter chart is interesting, but it doesn't have the zing of pride that seeing your username and icon in a list does. Make the lists have the username and icon for the user, with a link to their github user page.

Opinions

I want to avoid the static sort of "contributors" graphs that github has, where if a person contributed a large chunk of code but is inactive, they're always at the top of the list. No core contributor will check this, because nothing ever changes. They're not useful to newcomers to figure out who they should listen to, because they show inactive core contributors.

Implementation Details

There should be two types of contributor lists:

  1. Active core contributors. These are contributors who have been active in the project for at least a year, and they are in the top 25% of active contributors. Sort by contribution frequency over the last two months and take the top ten.

  2. Active new contributors. These contributors haven't been active in the project for more than a year, and they are in the top 25% of active contributors. Sort by contribution frequency over the last two months and take the top ten.

Add the results of those lists (and user icons and links to user pages) to the contributions tab of the html reports.
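
A sketch of the selection, assuming per-user records of first-activity date and contribution timestamps (the shapes are assumptions, and the top-25% activity cutoff is left to the caller):

from datetime import datetime, timedelta

def leaderboards(users, now=None, top=10):
    # users: {username: (firstActive, [contribution timestamps])}
    now = now or datetime.utcnow()
    window = now - timedelta(days=60)    # "active" = last two months
    yearAgo = now - timedelta(days=365)
    def recentCount(stamps):
        return sum(1 for t in stamps if t >= window)
    scored = [(name, first, recentCount(ts)) for name, (first, ts) in users.items()]
    core = sorted((u for u in scored if u[1] <= yearAgo), key=lambda u: u[2], reverse=True)[:top]
    new = sorted((u for u in scored if u[1] > yearAgo), key=lambda u: u[2], reverse=True)[:top]
    return core, new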
