
ghs's Introduction

GitHub Search · Status MIT license Latest Dump DOI

This project is made of two components:

  1. A Spring Boot powered back-end, responsible for:
    1. Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;
    2. Acting as an API for providing access to the stored data.
  2. A Bootstrap-styled and jQuery-powered web user interface, serving as an accessible front for the API, available here.

Running Locally

Prerequisites

| Dependency | Version Requirement |
|------------|---------------------|
| Java       | 17                  |
| Maven      | 3.9                 |
| MySQL      | 8.4                 |
| Flyway     | 10.12               |
| Git        | 2.43                |
| curl       | 8.5                 |
| cloc       | 1.98                |

Database

Before choosing whether to start with a clean slate or pre-populated database, make sure the following requirements are met:

  1. The database timezone is set to +00:00. You can verify this via:

    SELECT @@global.time_zone, @@session.time_zone;
  2. The event scheduler is turned ON. You can verify this via:

    SELECT @@global.event_scheduler;
  3. Stored function creators are trusted with binary logging enabled (i.e. log_bin_trust_function_creators is set to 1). You can verify this via:

    SELECT @@global.log_bin_trust_function_creators;
  4. The gse database exists. To create it:

    CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;
  5. The gseadmin user exists. To create one, run:

    CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';
    GRANT ALL ON gse.* TO 'gseadmin'@'%';
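If any of the first three checks come back with different values, they can be adjusted at runtime by an administrative user. A minimal sketch (note that SET GLOBAL changes do not survive a server restart unless they are also written to the MySQL configuration file):

```sql
-- Align the server with the requirements above (administrative privileges needed)
SET GLOBAL time_zone = '+00:00';
SET GLOBAL event_scheduler = ON;
SET GLOBAL log_bin_trust_function_creators = 1;
```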

If you prefer to begin with an empty database, there's nothing more for you to do. The required tables will be generated through Flyway migrations during the initial startup of the server. However, if you'd like your local database to be pre-populated with the data we've collected, you can utilize the compressed SQL dump we offer. We host this dump, along with the four previous iterations, on Dropbox. After choosing and downloading a database dump, you can import the data by executing:

gzcat < gse.sql.gz | mysql -u gseadmin -pLugano2020 gse

Server

Before attempting to run the server, you must generate your own GitHub personal access token (PAT). GHS relies on the GraphQL API, which is inaccessible without authentication. To access the information provided by the GitHub API, the token must include the repo scope.
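One quick way to check the scopes granted to a classic PAT is to inspect the response headers of any authenticated API call, since GitHub reports them in the X-OAuth-Scopes header (the token below is a placeholder):

```shell
curl -sS -o /dev/null -D - \
     -H "Authorization: Bearer <your_access_token>" \
     https://api.github.com/user | grep -i 'x-oauth-scopes'
```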

Once that is done, you can run the server locally using Maven:

mvn spring-boot:run

If you want to make use of the token when crawling, specify it in the run arguments:

mvn spring-boot:run -Dspring-boot.run.arguments=--ghs.github.tokens=<your_access_token>

Alternatively, you can compile and run the JAR directly:

mvn clean package
ln target/ghs-application-*.jar target/ghs-application.jar
java -Dghs.github.tokens=<your_access_token> -jar target/ghs-application.jar

Below is a list of the project-specific properties supported by the application, all of which can be found in application.properties:

| Variable Name | Type | Default Value | Description |
|---------------|------|---------------|-------------|
| ghs.github.tokens | List<String> | | List of GitHub personal access tokens (PATs) that will be used for mining the GitHub API. Must not contain blank strings. |
| ghs.github.api-version | String | 2022-11-28 | GitHub API version used across various operations. |
| ghs.git.folder-prefix | String | ghs-clone- | Prefix used for the temporary directories into which analyzed repositories are cloned. Must not be blank. |
| ghs.git.clone-timeout-duration | Duration | 5m | Maximum time allowed for cloning Git repositories. |
| ghs.cloc.analysis-timeout-duration | Duration | 5m | Maximum time allowed for analyzing cloned Git repositories with cloc. |
| ghs.curl.connect-timeout-duration | Duration | 1m | Maximum time allowed for establishing HTTP connections with curl. |
| ghs.crawler.enabled | Boolean | true | Specifies if the repository crawling job is enabled. |
| ghs.crawler.minimum-stars | int | 10 | Inclusive lower bound for the number of stars a project needs in order to be picked up by the crawler. Must not be negative. |
| ghs.crawler.languages | List<String> | See application.properties | List of language names that will be targeted during crawling. Must not contain blank strings. To ensure proper operation, the names must match those specified in linguist. |
| ghs.crawler.start-date | Date | 2008-01-01T00:00:00Z | Default crawler start date: the earliest date for repository crawling in the absence of prior crawl jobs. Value format: yyyy-MM-dd'T'HH:mm:ss'Z'. |
| ghs.analysis.enabled | Boolean | true | Specifies if the analysis job is enabled. |
| ghs.analysis.delay-between-runs | Duration | PT6H | Delay between successive analysis runs, expressed as a duration string. |
| ghs.analysis.max-pool-threads | int | 3 | Maximum number of live threads dedicated to concurrently analyzing repositories. Must be positive. |
| ghs.clean-up.enabled | Boolean | true | Specifies if the job responsible for removing unavailable repositories (clean-up) is enabled. |
| ghs.clean-up.cron | CronTrigger | 0 0 0 * * 1 | Schedule for repository clean-up runs, expressed as a Spring CRON expression. |
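As an illustration, a few of these could be overridden in a local application.properties (the values below are purely hypothetical):

```properties
# Illustrative overrides only; substitute your own values
ghs.github.tokens=<token_one>,<token_two>
ghs.crawler.minimum-stars=50
ghs.analysis.delay-between-runs=PT12H
ghs.clean-up.enabled=false
```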

Web UI

The easiest way to launch the front-end is through the provided NPM script:

npm run dev

You can also use the built-in web server of your IDE, or any other web server of your choice. Regardless of which method you choose for hosting, the back-end CORS configuration restricts you to ports 3030 and 7030.

Dockerisation 🐳

The deployment stack consists of the following containers:

| Service/Container name | Image | Description | Enabled by Default |
|------------------------|-------|-------------|--------------------|
| gse-database | mysql | Platform database | ✅ |
| gse-migration | flyway | Database schema migration executions | ✅ |
| gse-backup | tiredofit/db-backup | Automated database backups | ❎ |
| gse-server | seart/ghs-server | Spring Boot server application | ✅ |
| gse-website | seart/ghs-website | NGINX web server acting as HTML supplier | ✅ |
| gse-watchtower | containrrr/watchtower | Automatic Docker image updates | ❎ |

The service dependency chain can be represented as follows:

graph RL
    gse-migration --> |service_healthy| gse-database
    gse-backup --> |service_completed_successfully| gse-migration
    gse-server --> |service_completed_successfully| gse-migration
    gse-website --> |service_healthy| gse-server
    gse-watchtower --> |service_healthy| gse-website

Deploying is as simple as running the following from the docker-compose directory:

docker-compose -f docker-compose.yml up -d
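To confirm that the containers came up in the expected order, you can then inspect the stack:

```shell
# List service states, then tail the server logs
docker-compose -f docker-compose.yml ps
docker-compose -f docker-compose.yml logs --tail 50 gse-server
```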

It's important to note that the database setup steps explained in the preceding section are unnecessary when running with Docker. This is because the environment properties passed to the service will automatically establish the MySQL user and database during the initial startup. However, this convenience does not extend to the database data, as the default deployment generates an empty database. If you wish to utilize existing data from the dumps, you'll need to override the compose deployment to employ a custom database image that includes the dump. To do this, create your docker-compose.override.yml file with the following contents:

version: '3.9'
name: 'gse'

services:

  gse-database:
    image: seart/ghs-database:latest

The above image includes the most recent database dump, at most 15 days behind the actual platform data. For a more specific database version, refer to the Docker Hub page. Just remember to specify the override file during deployment:

docker-compose -f docker-compose.yml -f docker-compose.override.yml up -d

The database data itself is kept in the gse-data volume, while detailed back-end logs are kept in a local mount called logs. You can also use this override file to change the configurations of other services, for instance specifying your own PAT for the crawler:

version: '3.9'
name: 'gse'

services:

  # other services omitted...

  gse-server:
    environment:
      GHS_GITHUB_TOKENS: 'A single or comma-separated list of token(s)'
      GHS_CRAWLER_ENABLED: 'true'

Any of the Spring Boot properties or aforementioned application-specific properties can be overridden. Just keep in mind that a ghs.x.y property corresponds to the GHS_X_Y service environment variable.
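As an illustration of that mapping, disabling the analysis job (ghs.analysis.enabled) through the override file would look like this:

```yaml
version: '3.9'
name: 'gse'

services:

  gse-server:
    environment:
      GHS_ANALYSIS_ENABLED: 'false'
```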

Another example is the automated database backup service, which is disabled by default. Should you choose to re-enable it, you would have to add the following to the override file:

version: '3.9'
name: 'gse'

services:

  # other services omitted...

  gse-backup:
    restart: always
    entrypoint: "/init"

Finally, configurations for some programs are stored within files that are added to services through bind mounts. For instance, the Git configuration file is stored in the git directory. If you want to further customize it in deployment (e.g. to specify an alternative user agent), you can create your own .override.gitconfig, and add the following to the override file:

version: '3.9'
name: 'gse'

services:

  # other services omitted...

  gse-server:
    volumes:
      - ./git/.override.gitconfig:/root/.gitconfig

FAQ

How can I request a feature or ask a question?

If you have ideas for a feature you would like to see implemented or if you have any questions, we encourage you to create a new discussion. By initiating a discussion, you can engage with the community and our team, and we'll respond promptly to address your queries or consider your feature requests.

How can I report a bug?

To report any issues or bugs you encounter, please create a new issue. Providing detailed information about the problem you're facing will help us understand and address it more effectively. Rest assured, we are committed to promptly reviewing and responding to the issues you raise, working collaboratively to resolve any bugs and improve the overall user experience.

How do I contribute to the project?

Please refer to CONTRIBUTING.md for more information.

How do I extend/modify the existing database schema?

To do that you should be familiar with database migration tools and practices. This project in particular uses Flyway by Redgate. However, the general rule for schema manipulation is: create new migrations, and do not edit existing ones.

ghs's People

Contributors

cerfedino, csnagy, dabico, dependabot[bot], emadpres, gbavota, github-actions[bot], seart-bot


ghs's Issues

Add linters for JavaScript and CSS

The checkstyle pull request action is super useful, so I want to also add counterparts for JS and CSS. Namely:

Although triggered on PR to master, they should be restricted to specific files only. I will also retroactively apply this to the existing checkstyle config.

Back button keeps previous search parameters

Hi,
I was wondering if it's possible to keep the search parameters after returning from a search ('Back' button).

Example use case:

  • Set some filtering criteria
  • Search
  • Get results
  • Want to refine/fine-tune/adjust search criteria
  • hit the back button
  • modify criteria I want to adjust (others are "saved/restored" automatically)

Maybe this could also be done by supporting the browser's back button (rather than ignoring in-site navigation) and removing the custom one.

Topic search has no autocompletion

The search suggestions for "Has Topic" do not show when the field is focused.
The request completes successfully, as evidenced by the waterfall:

(Screenshot of the request waterfall omitted.)

`CleanUpProjectsJob` false positives

The aforementioned job seems to be flagging a lot of projects as non-existent, but upon closer inspection, it seems that the projects do actually exist. What's more puzzling is that I only observed this behavior when the app is deployed on our server, but not when running locally. I suspect that this is some sort of issue with git-ls-remote but further investigation is required.

Materialized Views and Statistics Query Optimizations

Thanks to the changes introduced by #84, we can now consider optimizing the on-load fetch queries that run on the initial page viewing (e.g. top 500 labels, repository licenses...)

Current problem: The first user accessing the platform post-cache-cleanup may have to wait for the expensive queries to finish (and the caches to repopulate) before they can receive label/topic suggestions in search. This may lead to confusion on the user end, as suggestions are unavailable immediately.

Solution: Introduce some form of result cache, such as a materialized view that would compute and store all these expensive query results. Given that the ranking of labels is solidified (and there is no ranking for say, licenses), we do not need to refresh this frequently, meaning that the best time to do so is when the server starts up.

New problem: MySQL does not support any notion of a "materialized view" ☹️. PostgreSQL does, but I doubt that we want to change the DB vendor at this time.

Solutions:

Note: Since these stored table views need to be kept in sync with the ones that the data is drawn from, we must ensure that they are read-only (no INSERT, UPDATE or DELETE allowed). This can be achieved by restricting the permissions of the database user or by creating triggers that would prevent modifications.

Tag criteria missing

I accidentally lost the code I originally wrote when merging from other branches, silly me

Migration to Spring Boot 3.1 and Java 17

  • Java LTS version change: 11 -> 17
  • Spring Boot version change: 2.7.10 -> 3.1.0
  • javax relocation to jakarta
  • springdoc-openapi relocation

Checklist:

  • Remove usages of deprecated APIs
    • createNativeQuery
    • withinMillis
    • AsyncResult
    • HttpSecurity methods
    • value in Retryable annotation
  • Make use of new language features
    • String Blocks
    • Enhanced Switches
    • Record Classes
    • Stream#toList
  • Update Docker images
  • Update Run configurations

Refactor contents of the `github` package

Although functional, our interface with the GitHub API is incredibly confusing and hard to read. As a result, any additions or changes that need to be introduced to that part of the platform will no doubt cause a great deal of pain to the current and future developers/maintainers.

By far the worst offenders are:

Search does not work on Safari

Search works on all browsers except for Safari.
The following error is reported on form submission:

Unhandled Promise Rejection: TypeError: undefined is not an object (evaluating 'e.preventDefault')

Seems like checking the form submit handler is a good starting point.

Update all the `README` files

Given that the project has seen significant changes in terms of how it is set up and run, we should update all the documentation in preparation for the next release.

Incorrect search results when searching by Language

The new querying functionalities you introduced in #45 have caused unintended issues: When searching only by language without specifying metrics, only repositories with calculated metrics are matched. You have to ensure that joins to git_repo_metric are only performed if both a language and a metric filter are specified. I also recommend refactoring the GitRepoSpecification to accept a new GitRepoSearch object instead of a Map<String, ?>. This intermediate representation will be derived from its DTO counterpart. It will contain all the information that is otherwise present in the map, along with boolean methods that will be used during the construction of the specification. In fact, why not go a step further and implement a conversion chain that goes: GitRepoSearchDTO -> GitRepoSearch -> GitRepoSpecification.

Does api-url always correspond to https://api.github.com/repos/owner_name/project_name?

Hi,
the api-url link can be reconstructed from the project 'owner_name/project_name'.

The typical link format is something like:
https://api.github.com/repos/owner_name/project_name

What happens when owner changes?
Can this disambiguate repos stored in your dataset with older names by seeing that the api-url points to a different link?

I'm not sure how (or how many times) GitHub can redirect some repos searched by the older name to the new owner/repo name. This is more a possible investigation to ensure consistency in the output.

Improve initialisation for `SupportedLanguage`

On a clean installation, languages targeted during mining are specified through manual inserts or flyway migrations. Since doing manual inserts before starting the server is inconvenient, and relying on migrations leads to schema history pollution, we should provide a more accessible alternative. One that comes to mind is the usage of an application property. This would require a dedicated startup task that checks which language entries need to be initialized in the DB, and which ones can be loaded later when the crawling begins.

Automate reminders for uploading the database dumps

My original idea was to create a GitHub action that would automatically update the database dump. Unfortunately, the current setup requires an upload to Dropbox, followed by manual edits to the README and quickstart database image. I think that for the time being, we can keep the process manual while using an action to create automatic reminders with a checklist in the form of an issue.

Error occurred in cleanup job caused by dangling transitive references from `GitRepoMetrics`

We've had an incident during cleanup:

gse-app  | 2023-05-12 14:21:27.334  WARN 1 --- [     GHSThread1] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 1451, SQLState: 23000
gse-app  | 2023-05-12 14:21:27.334 ERROR 1 --- [     GHSThread1] o.h.engine.jdbc.spi.SqlExceptionHelper   : Cannot delete or update a parent row: a foreign key constraint fails (`gse`.`repo_metrics`, CONSTRAINT `repo_metrics_ibfk_1` FOREIGN KEY (`repo_id`) REFERENCES `repo` (`id`))
gse-app  | 2023-05-12 14:21:27.344 ERROR 1 --- [     GHSThread1] usi.si.seart.job.CleanUpProjectsJob      : Exception occurred while deleting GitRepo [id=58131437, name=asahiocean/qr]!
gse-app  | 
gse-app  | javax.persistence.PersistenceException: org.hibernate.exception.ConstraintViolationException: could not execute statement
gse-app  | 	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:154)
gse-app  | 	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:181)
gse-app  | 	at org.hibernate.query.internal.AbstractProducedQuery.executeUpdate(AbstractProducedQuery.java:1705)
gse-app  | 	at usi.si.seart.job.CleanUpProjectsJob.run(CleanUpProjectsJob.java:84)
gse-app  | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
gse-app  | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
gse-app  | 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
gse-app  | 	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
gse-app  | 	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
gse-app  | 	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
gse-app  | 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
gse-app  | 	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
gse-app  | 	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
gse-app  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
gse-app  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
gse-app  | 	at java.base/java.lang.Thread.run(Unknown Source)
gse-app  | Caused by: org.hibernate.exception.ConstraintViolationException: could not execute statement
gse-app  | 	at org.hibernate.exception.internal.SQLExceptionTypeDelegate.convert(SQLExceptionTypeDelegate.java:59)
gse-app  | 	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:37)
gse-app  | 	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
gse-app  | 	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:99)
gse-app  | 	at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:200)
gse-app  | 	at org.hibernate.hql.internal.ast.exec.BasicExecutor.doExecute(BasicExecutor.java:80)
gse-app  | 	at org.hibernate.hql.internal.ast.exec.BasicExecutor.execute(BasicExecutor.java:50)
gse-app  | 	at org.hibernate.hql.internal.ast.exec.DeleteExecutor.execute(DeleteExecutor.java:177)
gse-app  | 	at org.hibernate.hql.internal.ast.QueryTranslatorImpl.executeUpdate(QueryTranslatorImpl.java:458)
gse-app  | 	at org.hibernate.engine.query.spi.HQLQueryPlan.performExecuteUpdate(HQLQueryPlan.java:377)
gse-app  | 	at org.hibernate.internal.SessionImpl.executeUpdate(SessionImpl.java:1478)
gse-app  | 	at org.hibernate.query.internal.AbstractProducedQuery.doExecuteUpdate(AbstractProducedQuery.java:1714)
gse-app  | 	at org.hibernate.query.internal.AbstractProducedQuery.executeUpdate(AbstractProducedQuery.java:1696)
gse-app  | 	... 13 common frames omitted
gse-app  | Caused by: java.sql.SQLIntegrityConstraintViolationException: Cannot delete or update a parent row: a foreign key constraint fails (`gse`.`repo_metrics`, CONSTRAINT `repo_metrics_ibfk_1` FOREIGN KEY (`repo_id`) REFERENCES `repo` (`id`))
gse-app  | 	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:117)
gse-app  | 	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
gse-app  | 	at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:916)
gse-app  | 	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1061)
gse-app  | 	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1009)
gse-app  | 	at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1320)
gse-app  | 	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:994)
gse-app  | 	at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
gse-app  | 	at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
gse-app  | 	at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:197)
gse-app  | 	... 21 common frames omitted
gse-app  | 
gse-app  | 2023-05-12 14:21:27.345  INFO 1 --- [     GHSThread1] usi.si.seart.job.CleanUpProjectsJob      : Rolling back transaction...
gse-app  | 2023-05-12 14:21:27.365 ERROR 1 --- [     GHSThread1] .c.SchedulerConfig$SchedulerErrorHandler : Unhandled exception occurred while performing a scheduled job.

Take a look at this particular section of CleanUpProjectsJob:

if (!exists) {
    log.info("Deleting repository: {} [{}]", name, id);
    Transaction transaction = null;
    try (Session nested = factory.openSession()) {
        transaction = nested.beginTransaction();
        nested.createQuery("DELETE FROM GitRepo r WHERE r.id = :id")
                .setParameter("id", id)
                .executeUpdate();
        nested.createQuery("DELETE FROM GitRepoLabel l WHERE l.repo.id = :id")
                .setParameter("id", id)
                .executeUpdate();
        nested.createQuery("DELETE FROM GitRepoLanguage l WHERE l.repo.id = :id")
                .setParameter("id", id)
                .executeUpdate();
        nested.flush();
        transaction.commit();
    } catch (PersistenceException ex) {
        log.error("Exception occurred while deleting GitRepo [id=" + id + ", name=" + name + "]!", ex);
        if (transaction != null) {
            log.info("Rolling back transaction...");
            transaction.rollback();
        }
    }
}

As you can see, all the foreign key references to a repository are deleted prior to it being removed from the database. What you forgot to add here was the SQL statement that would also clear the rows from the repo_metrics junction table that match the ID of the repo being deleted. Please add the necessary update query that would delete entries from GitRepoMetric whose repo_id field matches the :id.
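A minimal sketch of the missing delete, mirroring the existing HQL statements above (assuming the metrics entity is mapped as GitRepoMetric with a repo association):

```java
// Hypothetical fix: clear metric rows referencing the repository being deleted
nested.createQuery("DELETE FROM GitRepoMetric m WHERE m.repo.id = :id")
        .setParameter("id", id)
        .executeUpdate();
```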

Number of contributors overreported

I recently pulled some data and noticed that there were a few projects that had a reported number of contributors that is quite odd.

I noticed this for three projects in the data set I am using. The projects in question are:
radeonopencompute/rock-kernel-driver
youling257/android-mainline
xanmod/linux

In the csv output generated the number of contributors for these projects is 9.22337203685477E+018. Or at least it is after I import it to a spreadsheet.

On the website search results the number of contributors for these projects is 9223372036854776000.

Based on the number it looks like this is a JavaScript error causing the maximum integer value to be displayed. Obviously the number of contributors to the projects in question cannot be close to these reported values.

Add Checkstyle configuration and PR action

It will primarily include stylistic rules, such as indentation, curly brace usage, wildcard imports, the maximum allowed line width, etc. This may also require the files with existing violations to be corrected. The end result is also the introduction of a PR-triggered action that will highlight violations to the submitter.

Make table naming consistent with other projects

We use the name/prefix repo for the GitHub repository and adjacent tables. Since other projects use git_repo, I think it might benefit us greatly to make the naming consistent. Not only will the schema be easier to work with when switching from one project to another, but we may also explore the possibility of using foreign data wrappers to circumvent the need for copying and synchronizing data between databases.

Last commit date

Thank you for a great project! I noticed that there are many repositories that have update date around 2 years ago (for example, PyTorch repo). For them the last known commit also dates to approximately the same time, which makes it impossible to use filtering by last commit date (it automatically misses a lot of active repositories that have not been indexed for some time).

Is it a known issue, and do you plan to run updates for such repositories?

Move DB dumps from Git LFS to an alternative service

Since we have reached the limit on our LFS quota, we need to consider other means of hosting and providing our database data. Currently, I'm thinking of just hosting it on Dropbox, while providing an initialization script that downloads the data, so that the database can subsequently use it in initialization.

Improve error propagation and handling for static code analysis

With this, I aim to address rare error cases that crop up, mostly during cloning:

  1. No such device or address:

    Most of these are deleted repositories, repositories from suspended accounts, or just private repositories requiring a login.

    Example Stacktrace
    usi.si.seart.exception.CloneException: 'git clone' process did not start/exit successfully
    	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:76)
    	at usi.si.seart.service.StaticCodeAnalysisService$StaticCodeAnalysisServiceImpl.getCodeMetrics(StaticCodeAnalysisService.java:76)
    	at jdk.internal.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
    	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
    	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
    	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
    	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
    	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
    	at org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.submit(ThreadPoolTaskExecutor.java:388)
    	at org.springframework.aop.interceptor.AsyncExecutionAspectSupport.doSubmit(AsyncExecutionAspectSupport.java:289)
    	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.invoke(AsyncExecutionInterceptor.java:129)
    	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
    	at com.sun.proxy.$Proxy195.getCodeMetrics(Unknown Source)
    	at usi.si.seart.job.CodeAnalysisJob.analyze(CodeAnalysisJob.java:44)
    	at java.base/java.util.Iterator.forEachRemaining(Unknown Source)
    	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Unknown Source)
    	at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
    	at org.hibernate.query.spi.StreamDecorator.forEach(StreamDecorator.java:153)
    	at usi.si.seart.job.CodeAnalysisJob.run(CodeAnalysisJob.java:38)
    	at usi.si.seart.job.CodeAnalysisJob$$FastClassBySpringCGLIB$$888b5c11.invoke(<generated>)
    	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
    	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
    	at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
    	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
    	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
    	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
    	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
    	at usi.si.seart.job.CodeAnalysisJob$$EnhancerBySpringCGLIB$$75300724.run(<generated>)
    	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
    	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
    	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
    	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    	at java.base/java.lang.Thread.run(Unknown Source)
    Caused by: usi.si.seart.exception.TerminalExecutionException: Error occurred while waiting on the successful exit of the terminal process
    	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:103)
    	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:64)
    	... 49 common frames omitted
    Caused by: java.lang.Exception: Terminal process returned error code 128
    == stderr:
    Cloning into '/tmp/ghs-cloned-525960775574378396'...
    fatal: could not read Username for 'https://github.com': No such device or address
    ===
    	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:98)
    	... 50 common frames omitted
    
  2. Could not resolve host: github.com

    From what I can tell, this is due to faulty proxy settings. I think we need to add some .gitconfig settings to remedy this.
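If the root cause is indeed a proxy, a `.gitconfig` fragment along these lines might help (the proxy host below is a placeholder that would need to match the actual deployment environment):

```ini
# Hypothetical ~/.gitconfig addition for routing Git HTTP(S) traffic
# through a corporate proxy; the host and port are assumptions
[http]
    proxy = http://proxy.example.com:8080
```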

    Example Stacktrace
     usi.si.seart.exception.CloneException: 'git clone' process did not start/exit successfully
     	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:76)
     	at usi.si.seart.service.StaticCodeAnalysisService$StaticCodeAnalysisServiceImpl.getCodeMetrics(StaticCodeAnalysisService.java:76)
     	at jdk.internal.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
     	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
     	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
     	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
     	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
     	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
     	at java.base/java.lang.Thread.run(Unknown Source)
     Caused by: usi.si.seart.exception.TerminalExecutionException: Error occurred while waiting on the successful exit of the terminal process
     	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:103)
     	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:64)
     	... 12 common frames omitted
     Caused by: java.lang.Exception: Terminal process returned error code 128
     == stderr:
     Cloning into '/tmp/ghs-cloned-6965035175156364833'...
     fatal: unable to access 'https://github.com/guilhermeblanco/zendframework1-doctrine2/': Could not resolve host: github.com
     ===
     	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:98)
     	... 13 common frames omitted
    
    
  3. Unable to create symlink: Filename too long

    As before, this may be resolved with a .gitconfig setting.
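One candidate setting (an assumption, not verified against this specific issue) is to disable symlink creation entirely, so that Git checks symbolic links out as plain text files containing the link target instead of attempting to create them:

```ini
# Hypothetical ~/.gitconfig addition; with symlinks disabled, Git writes
# each symbolic link as a small regular file holding the link text
[core]
    symlinks = false
```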

    Example Stacktrace
     usi.si.seart.exception.CloneException: 'git clone' process did not start/exit successfully
     	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:76)
     	at usi.si.seart.service.StaticCodeAnalysisService$StaticCodeAnalysisServiceImpl.getCodeMetrics(StaticCodeAnalysisService.java:76)
     	at jdk.internal.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
     	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
     	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
     	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
     	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
     	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
     	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
     	at org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.submit(ThreadPoolTaskExecutor.java:388)
     	at org.springframework.aop.interceptor.AsyncExecutionAspectSupport.doSubmit(AsyncExecutionAspectSupport.java:289)
     	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.invoke(AsyncExecutionInterceptor.java:129)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
     	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
     	at com.sun.proxy.$Proxy195.getCodeMetrics(Unknown Source)
     	at usi.si.seart.job.CodeAnalysisJob.analyze(CodeAnalysisJob.java:44)
     	at java.base/java.util.Iterator.forEachRemaining(Unknown Source)
     	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Unknown Source)
     	at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
     	at org.hibernate.query.spi.StreamDecorator.forEach(StreamDecorator.java:153)
     	at usi.si.seart.job.CodeAnalysisJob.run(CodeAnalysisJob.java:38)
     	at usi.si.seart.job.CodeAnalysisJob$$FastClassBySpringCGLIB$$888b5c11.invoke(<generated>)
     	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
     	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
     	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
     	at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
     	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
     	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
     	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
     	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
     	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
     	at usi.si.seart.job.CodeAnalysisJob$$EnhancerBySpringCGLIB$$57c7064f.run(<generated>)
     	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
     	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
     	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
     	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
     	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
     	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
     	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
     	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
     	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
     	at java.base/java.lang.Thread.run(Unknown Source)
     Caused by: usi.si.seart.exception.TerminalExecutionException: Error occurred while waiting on the successful exit of the terminal process
     	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:103)
     	at usi.si.seart.service.GitRepoClonerService$GitRepoClonerImpl.cloneRepo(GitRepoClonerService.java:64)
     	... 49 common frames omitted
     Caused by: java.lang.Exception: Terminal process returned error code 128
     == stderr:
     Cloning into '/tmp/ghs-cloned-15974021917862905878'...
     error: unable to create symlink logfile/0.05_repeat1_60epoches_log.txt: Filename too long
     fatal: unable to checkout working tree
     warning: Clone succeeded, but checkout failed.
     You can inspect what was checked out with 'git status'
     and retry with 'git restore --source=HEAD :/'
     
     ===
     	at usi.si.seart.analysis.TerminalExecution.waitSuccessfulExit(TerminalExecution.java:98)
     	... 50 common frames omitted
    

Improve GitHub API Rate Limit usage through conditional requests

For each crawled repository, we make approximately 16 HTTP requests to various endpoints. Although performing these requests is central to keeping our database in line with the information on GitHub, we have to keep in mind that not all repository information changes over time. For instance, while the number of commits may grow steadily for active projects, popularity metrics such as the star count can stay flat between crawls. For this reason, we should focus on fetching only the information that has actually changed since our last inquiry to the API. GitHub supports this through conditional requests: by sending previously obtained ETag or Last-Modified values in the If-None-Match or If-Modified-Since request headers, we make requests that count towards our rate limit only if the contents supplied by the endpoint have changed; otherwise the API answers with 304 Not Modified at no rate-limit cost.
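The conditional-request flow can be sketched with the JDK's built-in HTTP client (the repository URL and ETag value below are illustrative, and GHS itself may use a different client):

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a conditional GitHub API request. Replaying a previously
// stored ETag lets GitHub answer "304 Not Modified" when nothing changed,
// and 304 responses do not count against the API rate limit.
public class ConditionalRequests {

    static HttpRequest conditional(String url, String cachedEtag) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("If-None-Match", cachedEtag) // ETag from a prior response
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = conditional(
                "https://api.github.com/repos/seart-group/ghs",
                "\"644b5b0155e6404a9cc4bd9d8b1ae730\""); // illustrative ETag
        // On send: statusCode() == 304 means our cached data is still current,
        // so the update for this repository can be skipped entirely.
        System.out.println(request.headers().firstValue("If-None-Match").orElseThrow());
    }
}
```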

Mined repositories languages

Is there a way to see whether some important languages are excluded from the mining?

I have seen the language stats report linked under 'Mined Projects'.
Are these the 13 most widespread languages, with everything else 'below Kotlin', or are there gaps where widespread languages are missing in between?

Export 'forked from' and 'fork source' fields

Hi,
I was wondering whether it could be useful to have fork-related fields like 'forked from' and 'fork source' (specifically for repos that are forks).

I know that every possible detail might be useful in one or more cases. I am just wondering if these two fields are worth having in the output.

Example: if you need to analyze forks and fork chains starting from a subset of projects, GHS could be used directly for that kind of research, provided fork chains could be reconstructed without scraping GitHub.

Improve logging setup

In light of recent incidents, I have concluded that our logging setup needs some improvement. While we typically log at the INFO level and above, we should consider writing DEBUG information to dedicated files. The setup should be similar to DL4SE's, using size- and date-based rolling appender strategies.
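A sketch of what such an appender could look like in Logback (the file paths, size limits, and retention values below are assumptions, not an agreed-upon configuration):

```xml
<!-- Hypothetical DEBUG file appender with size- and date-based rolling -->
<appender name="DEBUG_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/ghs-debug.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <!-- roll daily, or sooner if the file exceeds the size cap -->
        <fileNamePattern>logs/ghs-debug.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
        <maxFileSize>100MB</maxFileSize>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>
```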

Schema Normalisation

I think that the current repo_label and repo_language implementations through a many-to-one relationship are excessively redundant, and will only lead to performance degradation in the long term. I want to take this opportunity to investigate the impact of schema normalization on the overall query performance. The objective is to migrate the two tables to a many-to-many relationship. Although this is a pretty drastic change, I believe that it will be essential to the platform's longevity. This will also require changes to the ER model, DTOs, and maybe even some dataset export and UI rendering logic.
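As a sketch, the many-to-many migration would deduplicate the repeated values into a lookup table joined through an association table (table and column names below are assumptions, not the actual GHS schema):

```sql
-- Hypothetical normalised layout for the language relationship
CREATE TABLE language (
    id   BIGINT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE repo_language (
    repo_id     BIGINT NOT NULL,
    language_id BIGINT NOT NULL,
    PRIMARY KEY (repo_id, language_id),
    FOREIGN KEY (repo_id)     REFERENCES repo (id),
    FOREIGN KEY (language_id) REFERENCES language (id)
);
```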

Improve application property settings

  • Add a configuration for minimum number of stars for mining
  • Change millisecond-style duration numbers to string representations of java.time.Duration
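As a sketch, the second point would turn raw millisecond values into ISO-8601 duration strings, which Spring Boot converts into java.time.Duration automatically (the property names below are hypothetical):

```yaml
# Hypothetical application.yml fragment, for illustration only
ghs:
  crawler:
    minimum-stars: 10      # new configuration from the first point
    next-run-delay: PT6H   # instead of 21600000 (milliseconds)
```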

Change the `Mined Projects` modal to `Statistics`

The current visualisation was better suited to a time when the number of analyzed repositories was more disproportionate across languages. I think we can change it back to how it was before (a single bar per language), while augmenting the information within to provide some details about the analysis. We should definitely explain what we mean by "Analyzed", so we should add a paragraph below the chart.

Improve Docker service dependency chain

Discussed in #99

Originally posted by dabico June 13, 2023
Although it's quite convenient to trigger Flyway from the Server, I think it's better for it to be an isolated service, similar to how DL4SE uses Liquibase. However, it's worth noting that in that case, the DB versioning system had to be isolated because two components (the crawler and the server back-end) depended on it. Whether or not this is beneficial in GHS is up for debate. It could be beneficial if we ever decide to migrate to a more microservice-oriented architecture.

I'm starting to believe that our database migration setup needs an overhaul. Yesterday, we had a case of a long-running migration: for about half an hour, the Web UI was up and displaying error messages about not being able to connect to the server, when in reality the server could not accept connections until Flyway had finished. I propose the following: we should linearize the container dependency chain to database -> flyway -> server -> ui. The back-end would wait for the migrations to exit successfully, while the UI webserver would be deployed only once the server API is available. This also solves the problem of the UI being available while the back-end is unhealthy. At the same time, I still want to migrate with Flyway when running the app locally. This can be achieved through different Spring and Maven profiles.
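The linearised chain could be sketched in Compose as follows (service names, images, and health-check commands, including the actuator endpoint, are assumptions rather than the actual GHS configuration):

```yaml
# Hypothetical docker-compose.yml fragment: database -> flyway -> server -> ui
services:
  database:
    image: mysql:8.4
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1"]
      interval: 5s
      retries: 10
  flyway:
    image: flyway/flyway
    depends_on:
      database:
        condition: service_healthy
  server:
    build: .
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 10s
      retries: 10
    depends_on:
      flyway:
        condition: service_completed_successfully
  ui:
    depends_on:
      server:
        condition: service_healthy
```

The `service_completed_successfully` condition makes the server wait for the one-shot Flyway container to exit with status 0, while the two `service_healthy` conditions gate each dependent on an actual health probe rather than mere container start-up.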

Gateway Time-Out

What's happening?
Hi, when I use the web interface without any filter and try to save the result, it gives me a "504 Gateway Time-Out" error.

How to see the problem again?
Open the web interface, click on the Search button without choosing anything, and then click on Download CSV.

Output Result:
Generated url:

https://seart-ghs.si.usi.ch/api/r/download/csv?hasWiki=false&onlyForks=false&hasLicense=false&nameEquals=false&hasPulls=false&excludeForks=false&hasIssues=false


By the way, can you share the whole up-to-date dataset? I found a dataset on Zenodo, but it is not up to date.

Statistics breakdown about current search

Hi,
maybe it would be useful to have a statistics breakdown after a successful search returns (e.g., how many different languages, the average number of contributors, and the min/max/average commits, just to name a few examples).

An overview of the selected set of repositories could help in refining the search criteria, or give a first glimpse of what's inside a potential dataset even before exporting the output.

In general, the min, max, average, median, and number of unique values (and/or other usual descriptive statistics) for each applicable output field could be useful for a recap or early analysis.

Include search parameters in the exported JSON

Hi,
I would find it very useful to have the search parameters stored in the exported JSON. I am not sure about the other export formats, but I suppose a similar approach would still apply.

At the top level of the JSON there is 'items'. Adding something like 'search parameters' at the same level could help keep the specifics close to the data (instead of relying on file renaming or manual annotation). It would help in keeping track of the search criteria months later, when a dataset pops out of a folder and I can't remember its exact specifics.
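A minimal sketch of what the proposed layout could look like (the key names and parameter fields below are hypothetical, not the actual GHS export format):

```json
{
  "searchParameters": {
    "language": "Java",
    "commitsMin": 100,
    "starsMin": 10
  },
  "items": [
  ]
}
```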

Unique resilient identifier

Hi,
I had trouble finding a unique identifier resilient to repo name changes, ownership changes, etc.

Can you think of any such identifier that could be exposed in the export, to provide a more reliable (time-invariant) way of retrieving a repository?

I see from the GitHub REST API that there is indeed an id returned, but I'm not sure if this is usable and/or if this is already what you export as id (possibly related to #16)

Mismatched results for C++ project

Hi, when I search for projects in C++, the results on the webpage are correct, but the results in the downloaded file are not: the file seems to only contain projects written in C.

Older repos that are unavailable anymore are returned as results and need further client filtering

I found that some repos are returned as results even though they're no longer on GitHub.

See "id" : 35679, "name" : "eclipse/jgit" for an example.

This repo is no longer available on GitHub (404 error), so I suspect it didn't get renamed; it was simply removed or migrated elsewhere. GHSearch still consistently returns it in the results, which calls for a client-side filtering step even with up-to-date data.

This is not super critical though, since most of the time you will start from the GHS output and check, ensure validity, and filter further anyway.
