github-explorer's Introduction

We prepared a dataset from the GH Archive that contains all the events in all GitHub repositories since 2011 in structured format. The dataset was uploaded into ClickHouse, where it contains 3.1 billion records. We redistribute it for research purposes and it can be downloaded at this direct link. This dataset can help answer almost any question about GitHub that you can imagine.

Read the article

github-explorer's People

Contributors

alex-zaitsev, alexey-milovidov, blinkov, krishnevsky, millecodex, vitaly-zdanevich

github-explorer's Issues

Data after 2022-11-30 not found

SELECT *
FROM github_events
ORDER BY created_at DESC
LIMIT 1

gives:

┌───────────file_time─┬─event_type──┬─actor_login─┬─repo_name─────────────────┬──────────created_at─┬──────────updated_at─┬─action─┬─comment_id─┬─body─┬─path─┬─position─┬─line─┬─ref──┬─ref_type─┬─creator_user_login─┬─number─┬─title─┬─labels─┬─state─┬─locked─┬─assignee─┬─assignees─┬─comments─┬─author_association─┬───────────closed_at─┬───────────merged_at─┬─merge_commit_sha─┬─requested_reviewers─┬─requested_teams─┬─head_ref─┬─head_sha─┬─base_ref─┬─base_sha─┬─merged─┬─mergeable─┬─rebaseable─┬─mergeable_state─┬─merged_by─┬─review_comments─┬─maintainer_can_modify─┬─commits─┬─additions─┬─deletions─┬─changed_files─┬─diff_hunk─┬─original_position─┬─commit_id─┬─original_commit_id─┬─push_size─┬─push_distinct_size─┬─member_login─┬─release_tag_name─┬─release_name─┬─review_state─┐
│ 2022-11-29 23:00:00 │ CreateEvent │ jclasley    │ jclasley/personal-website │ 2022-11-29 23:59:59 │ 1970-01-01 00:00:00 │ none   │          0 │      │      │        0 │    0 │ main │ branch   │                    │      0 │       │ []     │ none  │      0 │          │ []        │        0 │ NONE               │ 1970-01-01 00:00:00 │ 1970-01-01 00:00:00 │                  │ []                  │ []              │          │          │          │          │      0 │         0 │          0 │ unknown         │           │               0 │                     0 │       0 │         0 │         0 │             0 │           │                 0 │           │                    │         0 │                  0 │              │                  │              │ none         │
└─────────────────────┴─────────────┴─────────────┴───────────────────────────┴─────────────────────┴─────────────────────┴────────┴────────────┴──────┴──────┴──────────┴──────┴──────┴──────────┴────────────────────┴────────┴───────┴────────┴───────┴────────┴──────────┴───────────┴──────────┴────────────────────┴─────────────────────┴─────────────────────┴──────────────────┴─────────────────────┴─────────────────┴──────────┴──────────┴──────────┴──────────┴────────┴───────────┴────────────┴─────────────────┴───────────┴─────────────────┴───────────────────────┴─────────┴───────────┴───────────┴───────────────┴───────────┴───────────────────┴───────────┴────────────────────┴───────────┴────────────────────┴──────────────┴──────────────────┴──────────────┴──────────────┘

today.

@alexey-milovidov is this dataset unmaintained now?

How to get all existing commit SHA hashes from a repo?

Hi @alexey-milovidov,

I have a repo, for example: https://github.com/tachiyomiorg/tachiyomi-extensions/tree/repo
I am looking for a query against the GH Archive dataset that extracts commit SHA hashes from PushEvent records.
I need the output to be in this format:

{created_at} https://github.com/tachiyomiorg/tachiyomi-extensions/tree/{sha}
2021-03-01 https://github.com/tachiyomiorg/tachiyomi-extensions/tree/37307e29cc91ae10afd322dfc31f65cc7c175f6a

I tried the query below, but it does not show the SHA, only the date and URL. Can you help me modify it?

SELECT created_at, format('https://github.com/{}/tree/{}', repo_name, number::String, 'sha') AS url
FROM github_events
WHERE event_type = 'PushEvent' AND repo_name LIKE 'tachiyomiorg/%'
ORDER BY created_at DESC

Download: HTTP Range Request Not Supported

Hi,

I've been trying to download the 83GB TSV dataset. The connection keeps getting interrupted, and each time I have to start over because the server always responds with Content-Range: bytes 0-89432430895/89432430896, regardless of the range I request.

Is it possible to fix this or is there any alternative way to fetch this dataset?
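
The symptom described above, a Content-Range that starts at byte 0 and covers the whole file, is what a client sees when a server does not honor the requested offset. As a minimal sketch, assuming only standard HTTP range semantics, a download script could check the response before appending to a partial file. The function names `parse_content_range` and `can_resume` are hypothetical helpers, not part of any tool mentioned in this issue.

```python
import re
from typing import Optional, Tuple

# Matches the "bytes first-last/total" form of a Content-Range header value.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Return (first, last, total) from a Content-Range value, or None if malformed."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    # The total length may be "*" when the server does not know it.
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total


def can_resume(status: int, content_range: Optional[str], offset: int) -> bool:
    """True only if the response really starts at the requested byte offset."""
    if status != 206 or content_range is None:
        return False
    parsed = parse_content_range(content_range)
    return parsed is not None and parsed[0] == offset
```

A client would send `Range: bytes=<offset>-`, then append to the partial file only when `can_resume` returns true, and otherwise restart from byte 0. With the header quoted in this issue, the check fails for any non-zero offset.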

Some questions about GitHub dataset in Clickhouse

Hello, I am also very interested in GitHub event data and in using it with ClickHouse for research purposes.

Here are some questions about the dataset and the database setup.

  • In my experience, putting all event logs into one table becomes quite slow once the number of rows grows to 1 billion. Am I missing something here? Does the ORDER BY in the CREATE TABLE clause help, and would it be faster if I added a partition on toYYYYMM(created_at) to split the data by month?
  • I also noticed that the raw data files contain some duplicate data: several rows around the top of the hour may appear in both adjacent files. Since ClickHouse does not enforce a unique primary key, how can we avoid this?
  • You use LowCardinality on repo_name and actor_login, but the official suggestion is to use it when the number of distinct values is under 100,000, whereas GitHub has more than 10 million distinct values for repo_name and actor_login. Will this still perform better than the ordinary String type?

Thanks for the project.
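
For the second question above, one option is to drop duplicates before loading. The following is a minimal sketch, not the method used to build this dataset. It assumes each raw file holds one JSON event per line with an `id` field, and that duplicates only occur within a file or between adjacent files, as the question describes. The function name `dedup_adjacent` is hypothetical.

```python
import json
from typing import Iterable, Iterator, Set


def dedup_adjacent(files: Iterable[Iterable[str]]) -> Iterator[dict]:
    """Yield events from consecutive hourly files, skipping any event whose id
    was already seen in the same file or in the file immediately before it."""
    previous_ids: Set[str] = set()
    for lines in files:
        current_ids: Set[str] = set()
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            event_id = event["id"]  # assumption: every event carries an "id"
            duplicate = event_id in previous_ids or event_id in current_ids
            current_ids.add(event_id)
            if not duplicate:
                yield event
        # Only the latest file's ids are retained, so memory stays bounded.
        previous_ids = current_ids
```

Each element of `files` can be any iterable of lines, for example a file object opened with `gzip.open(path, "rt")`. Keeping only one file's worth of ids is a deliberate trade-off: it would miss duplicates that are more than one file apart.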

How often dataset updated?

Hi!

The last publication about GitHub Explorer was in December 2020, and I am curious: has the dataset been updated since then?
I regularly research government open-source code. Right now I use the GitHub API directly, but it would be great to use gh-api instead.

Best Regards,
Ivan

GitHub issues burndown chart & survival analysis

Hello, thanks for providing this dataset!

Not sure if this is the right place to post, but I used it in my Observable notebook to analyze how long issues "live" and to create some "burndown" charts.

After playing a bit with Observable + ClickHouse, I found it to be a great combo for performing and sharing such explorations.

Here are some charts for the ClickHouse repo:

image

image
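
The notebook's own code is not shown here, so the following is only a rough sketch of the two quantities the post mentions: open-issue counts over time (a burndown-style series) and issue lifetimes. It assumes the open and close dates of each issue have already been extracted from the dataset. The `Issue` tuple and both function names are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# (opened_on, closed_on), where closed_on is None if the issue is still open.
Issue = Tuple[date, Optional[date]]


def open_issue_counts(issues: List[Issue]) -> Dict[date, int]:
    """Number of open issues at the end of each day on which something changed."""
    deltas: Dict[date, int] = {}
    for opened_on, closed_on in issues:
        deltas[opened_on] = deltas.get(opened_on, 0) + 1
        if closed_on is not None:
            deltas[closed_on] = deltas.get(closed_on, 0) - 1
    counts: Dict[date, int] = {}
    running = 0
    for day in sorted(deltas):
        running += deltas[day]
        counts[day] = running
    return counts


def lifetimes_in_days(issues: List[Issue]) -> List[int]:
    """Lifetimes of closed issues only; still-open issues are left out."""
    return [
        (closed_on - opened_on).days
        for opened_on, closed_on in issues
        if closed_on is not None
    ]
```

Leaving out still-open issues biases lifetimes downward. A proper survival analysis would treat them as censored observations rather than dropping them.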
