Git-Heat-Map

Map showing the files in cpython that Guido van Rossum changed the most; full SVG image available in repo

Now with file extension based highlighting

Now with submodule support

Website now available

A version of this program is now available for use at heatmap.jonathanforsythe.co.uk

Basic use guide

Generate database with python generate_db.py {path_to_repo_dir}
Create virtual environment with python -m venv . and install required modules with pip install -r requirements.txt
Run web server with python app.py or flask run (flask run --host=<ip> to run on that ip address, with 0.0.0.0 being used for all addresses on that machine)
Connect on 127.0.0.1:5000
Available repos will be displayed, select the one you want to view
Add emails, commits, filenames, and date ranges you want to highlight
- The "browse" buttons allow the user to see a list of valid values
- Alternatively valid sqlite patterns can be passed in
Clicking on any of these entries will cause the query to exclude results matching that entry
By default highlight hue is determined by file extensions but this can be manually overridden
Options affecting performance are levels of text to render, and minimum size of boxes rendered
Press submit query to update which files are highlighted
Press refresh to update highlighting hue and redraw based on window size
Click on directories to zoom in, and the back button in the sidebar to zoom out
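
As a rough illustration of the pattern matching mentioned above, SQLite offers two pattern operators: `LIKE` (with `%` and `_` wildcards) and `GLOB` (with `*` and `?`). Which of the two the filter boxes use is not stated here, so this sketch shows both. The table and column names are made up for the example.

```python
import sqlite3

# Hypothetical table and column names, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("Lib/os.py",), ("Lib/test/test_os.py",), ("Modules/posixmodule.c",)],
)


def match_like(pattern):
    """Return paths matching a LIKE pattern (% = any run, _ = one character)."""
    rows = conn.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path", (pattern,)
    )
    return [r[0] for r in rows]


def match_glob(pattern):
    """Return paths matching a GLOB pattern (* = any run, ? = one character)."""
    rows = conn.execute(
        "SELECT path FROM files WHERE path GLOB ? ORDER BY path", (pattern,)
    )
    return [r[0] for r in rows]


py_files = match_like("%.py")    # every path ending in .py
lib_files = match_glob("Lib/*")  # everything under Lib/
```

Note that `LIKE` is case-insensitive for ASCII characters by default, while `GLOB` is case-sensitive.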

Project Structure

This project consists of two parts:

Git log -> database
Database -> treemap

Git log -> database

Scans through an entire git history using git log, and creates a database using five tables:

Files, which just keeps track of filenames
Commits, which stores commit hash, author, committer
CommitFile, which stores an instance of a certain file being changed by a certain commit, and tracks how many lines were added/removed by that commit
Author, which stores an author name and email
CommitAuthor, which links commits and Author in order to support coauthors on commits

Using these we can keep track of which files/commits changed the repository the most, which in itself can provide useful insight
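
A minimal sketch of how tables like these could be laid out and queried with Python's built-in `sqlite3` module. Only the table roles come from the description above; every column name here is an assumption made for illustration and may not match the real schema.

```python
import sqlite3

# Column names are assumptions for illustration; only the table roles
# come from the description above.
SCHEMA = """
CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE commits (hash TEXT PRIMARY KEY, author_date TEXT);
CREATE TABLE commit_file (
    hash TEXT REFERENCES commits(hash),
    file_id INTEGER REFERENCES files(file_id),
    added INTEGER,
    removed INTEGER
);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE commit_author (
    hash TEXT REFERENCES commits(hash),
    author_id INTEGER REFERENCES author(author_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A few made-up rows so the query below has something to aggregate.
conn.executemany("INSERT INTO files VALUES (?, ?)", [(1, "a.py"), (2, "b.py")])
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("c1", "2022-01-01"), ("c2", "2022-02-01")],
)
conn.executemany(
    "INSERT INTO commit_file VALUES (?, ?, ?, ?)",
    [("c1", 1, 10, 2), ("c2", 1, 5, 5), ("c2", 2, 3, 0)],
)

# Files ranked by total lines added plus removed across all commits.
most_changed = conn.execute(
    """
    SELECT f.path, SUM(cf.added + cf.removed) AS changes
    FROM commit_file AS cf JOIN files AS f USING (file_id)
    GROUP BY f.path
    ORDER BY changes DESC
    """
).fetchall()
```

Keeping authors in their own table, linked through a commit/author table, is what lets a single commit count towards several coauthors.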

Database -> treemap

Taking the database above, uses an SQL query to generate a JSON object with the following structure:

directory:
  "name": <Directory name>
  "val": <Sum of sizes of children>
  "children": [<directory or file>, ...]

file:
  "name": <File name>
  "val": <Total number of line changes for this file over all commits>

then uses this to generate an inline svg image representing a treemap of the file system, with the size of each rectangle being the val described above.
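
The nesting described above can be sketched as follows. This is an illustrative reconstruction, not the project's own code: it assumes the query yields `(path, line_changes)` pairs with `/`-separated paths, and `build_tree` is a hypothetical helper name.

```python
def build_tree(file_changes, root_name="/"):
    """Nest (path, line_changes) pairs into the directory/file structure."""
    root = {"name": root_name, "val": 0, "children": []}
    for path, val in file_changes:
        node = root
        node["val"] += val
        *dirs, filename = path.split("/")
        for d in dirs:
            # Reuse the directory node if an earlier path already created it.
            child = next(
                (c for c in node["children"] if c["name"] == d and "children" in c),
                None,
            )
            if child is None:
                child = {"name": d, "val": 0, "children": []}
                node["children"].append(child)
            # A directory's val is the sum of everything beneath it.
            child["val"] += val
            node = child
        node["children"].append({"name": filename, "val": val})
    return root


tree = build_tree(
    [("Lib/os.py", 120), ("Lib/test/test_os.py", 80), ("README.rst", 10)]
)
```

The resulting dictionary can be serialised with `json.dumps(tree)` to get a JSON object of the shape shown above.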

Then generates a second JSON object in the same way, but filtered by the query (only certain emails, date ranges, etc.), and uses it to highlight the rectangles with varying intensity based on the vals returned, e.g. highlighting the files changed most by a certain author.
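
One possible way to turn filtered vals into highlight intensity is sketched below. The exact scaling the project uses is not described here, so this assumes a simple linear scale relative to the largest filtered val; both function names are hypothetical.

```python
def highlight_intensities(filtered_vals):
    """Scale each file's filtered val to 0..1 relative to the largest val."""
    peak = max(filtered_vals.values(), default=0)
    if peak == 0:
        # Nothing matched the filter, so nothing is highlighted.
        return {path: 0.0 for path in filtered_vals}
    return {path: val / peak for path, val in filtered_vals.items()}


def to_fill(intensity, hue=0):
    """Build a CSS colour string; stronger matches are more opaque."""
    return f"hsla({hue}, 100%, 50%, {intensity:.2f})"


intensities = highlight_intensities(
    {"Lib/os.py": 50, "README.rst": 5, "setup.py": 0}
)
fills = {path: to_fill(i) for path, i in intensities.items()}
```

Here the file with the most matching changes gets a fully opaque fill and files untouched by the filter stay transparent, so the base treemap remains visible underneath.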

Performance

These speeds were attained on my personal computer.

Database generation

Repo	Number of commits	Git log time	Git log size	Database time	Database size	Total time
linux	1,154,884	60 minutes	444MB	462.618 seconds	733MB	68 minutes
cpython	115,874	4.6 minutes	44.6MB	36.607 seconds	74.3MB	5.2 minutes

Time taken seems to scale linearly, at approximately 300 commits/second, or 0.0033 seconds/commit. Size also scales linearly: the git log holds approximately 2600 commits/MB (384 B/commit), and the database approximately 1570 commits/MB (about 640 B/commit).
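
The per-commit rates can be re-derived from the table above. A short sketch, assuming MB means 10^6 bytes and using the total time column for throughput:

```python
# Figures copied from the database generation table.
rows = {
    "linux": {"commits": 1_154_884, "total_min": 68, "log_mb": 444, "db_mb": 733},
    "cpython": {"commits": 115_874, "total_min": 5.2, "log_mb": 44.6, "db_mb": 74.3},
}


def rates(row):
    """Compute throughput and per-commit sizes for one table row."""
    return {
        "commits_per_second": row["commits"] / (row["total_min"] * 60),
        "log_bytes_per_commit": row["log_mb"] * 1e6 / row["commits"],
        "db_bytes_per_commit": row["db_mb"] * 1e6 / row["commits"],
    }


for repo, row in rows.items():
    print(repo, {name: round(value) for name, value in rates(row).items()})
```

Both repos give similar per-commit sizes despite a tenfold difference in commit count, which is what supports the linear scaling claim.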

Querying database and displaying treemap

For this test I filtered each repo by its most prominent authors:

Repo	Author filter	Drawing treemap time	Highlighting treemap time
linux	[email protected]	19.7 s	54.3 s
cpython	[email protected]	842 ms	1238 ms

These times are with minimum size drawn = 0, on very large repositories, so the performance is not completely unreasonable. This does not include the time for the browser to actually render the svg, which can take longer.

Wanted features

Faster database generation

Currently done using git log which can take a very long time for large repos. Will look into any other ways of getting needed information on files.
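
For context, the kind of parsing a git log based approach involves can be sketched as below. The exact `git log` options used by `generate_db.py` are not given here, so this assumes output from `git log --numstat --format=commit:%H`; `parse_numstat` is a hypothetical helper, not part of the project.

```python
def parse_numstat(log_text):
    """Parse `git log --numstat --format=commit:%H` output into change rows."""
    rows = []
    commit = None
    for line in log_text.splitlines():
        if line.startswith("commit:"):
            # Header line written by the assumed --format string.
            commit = line[len("commit:"):]
        elif line.strip():
            # numstat lines are "<added>\t<removed>\t<path>".
            added, removed, path = line.split("\t", 2)
            # Binary files are reported as "-" for both counts.
            rows.append((
                commit,
                path,
                0 if added == "-" else int(added),
                0 if removed == "-" else int(removed),
            ))
    return rows


# Hand-written sample in the assumed format, not real repository output.
sample = (
    "commit:abc123\n"
    "\n"
    "10\t2\tLib/os.py\n"
    "-\t-\tDoc/logo.png\n"
    "commit:def456\n"
    "\n"
    "3\t0\tREADME.rst\n"
)
rows = parse_numstat(sample)
```

Because git has to diff every commit against its parent to produce these counts, the log step dominates the total time on large histories.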

Multiple filters per query

Currently the user can submit only a single query for the highlighting. Ideally they could have a separate filter dictating which boxes to draw in the first place, and possibly multiple filters that could result in multiple colour highlighting on the same image.

To speed up database generation, I recommend the `clickhouse-git-import` tool.

Installation:

curl https://clickhouse.com/ | sh

Usage:

./clickhouse git-import --help

will show the documentation and the usage of the tool.

Then the tool can be run directly inside the git repository.
It will collect data like commits, file changes, and changes of every
line in every file for further analysis.
It works well even on the largest repositories like Linux or Chromium.

Example of a trivial query:

SELECT author AS k, count() AS c FROM line_changes WHERE
file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20

Example of a non-trivial query: a matrix of authors, showing how much
of one author's code is removed by another:

SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
