Classes and Metrics (CaM)

This is a dataset of open source Java classes and some metrics on them. Every now and then I make a new version of it using the scripts in this repository. You are welcome to use it in your research. Each release has a fixed version. By referring to it in your research you avoid ambiguity and guarantee repeatability of your experiments.

This is a more formal explanation of this project: in PDF.

The latest ZIP archive with the dataset is here: cam-2024-03-02.zip (2.22Gb). There are 48 metrics calculated for 532,394 Java classes from 1000 GitHub repositories, including: lines of code (reported by cloc); NCSS; cyclomatic and cognitive complexity (by PMD); Halstead volume, effort, and difficulty; maintainability index; number of attributes, constructors, methods; number of Git authors; and others (see PDF).

Previous archives (took me a few days to build each of them, using a pretty big machine):

If you want to create a new dataset, run the following command (you need to have Docker installed). The entire dataset will be built in the current directory. Here, 1000 is the number of repositories to fetch from GitHub and XXX is your personal access token:

docker run --detach --name=cam --rm --volume "$(pwd):/dataset" \
  -e "TOKEN=XXX" -e "TOTAL=1000" -e "TARGET=/dataset" \
  --oom-kill-disable --memory=16g --memory-swap=16g \
  yegor256/cam:0.9.2 "make -e >/dataset/make.log 2>&1"

This command will create a new Docker container running in the background (run docker ps -a to see it). If you want to run Docker interactively and see all the logs, disable detached mode by removing the --detach option from the command.

The dataset will be created in the current directory (this may take some time, maybe a few days!), and a .zip archive will also be there. The Docker container runs in the background: you can safely close the console and come back when the dataset is ready and the container is deleted.
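Since the command above redirects all output to make.log in the current directory, progress can be checked from any console. The following is only a sketch, not part of the repository: the function names cam_running and cam_status are hypothetical, and it assumes the container keeps the name cam given by --name=cam.

```shell
# Succeed while a container named "cam" is listed among the running ones.
cam_running() {
  docker ps --format '{{.Names}}' | grep -qx 'cam'
}

# Print the last lines of the build log, then the state of the container.
# $1 is the path to the log (default: make.log), $2 is the number of lines.
cam_status() {
  local log="${1:-make.log}" lines="${2:-5}"
  if [ -f "${log}" ]; then
    tail -n "${lines}" "${log}"
  fi
  if cam_running; then
    echo "still running"
  else
    echo "not running"
  fi
}
```

Calling cam_status make.log 20 prints the last 20 lines of the log. Note that "not running" only means no such container is active: the build may have finished, failed, or never started, so the log is still worth reading.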

Make sure your server has enough swap memory (at least 32Gb) and free disk space (at least 512Gb); without this, the dataset will have many errors. It's better to have multiple CPUs, since the entire build process is highly parallel and will utilize all of them.
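Before starting a long build, it may help to compare the machine against these minimums. The following is a minimal sketch for Linux, not part of the repository: the function names swap_gib, free_disk_gib and at_least are hypothetical, and the thresholds are passed as arguments rather than hard-coded.

```shell
# Print total swap in Gb (rounded down), read from a meminfo-style file.
swap_gib() {
  awk '/^SwapTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Print free disk space in Gb for the file system that holds the given path.
free_disk_gib() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Return 0 if "actual" is at least "required"; otherwise warn and return 1.
at_least() {
  local label="$1" actual="$2" required="$3"
  if [ "${actual}" -lt "${required}" ]; then
    echo "WARNING: ${label} is ${actual}Gb, expected at least ${required}Gb" >&2
    return 1
  fi
  echo "OK: ${label} is ${actual}Gb"
}
```

For example, at_least swap "$(swap_gib)" 32 followed by at_least disk "$(free_disk_gib .)" 512 checks both limits for the current directory.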

If the script fails at some point, you can restart it without deleting previously created files. The process is incremental: it will pick up where it stopped. In order to rerun an entire "step," delete the corresponding directory:

  • github/ to rerun clone
  • temp/jpeek-logs/ to rerun jpeek
  • measurements/ to rerun measure
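The mapping above can be wrapped into a small helper, so that a step is reset by name. This is only a sketch: step_dir and reset_step are hypothetical names, not part of the repository, and the second argument is assumed to be the directory where the dataset is being built.

```shell
# Map a step name to the directory that must be deleted to rerun it.
step_dir() {
  case "$1" in
    clone)   echo "github" ;;
    jpeek)   echo "temp/jpeek-logs" ;;
    measure) echo "measurements" ;;
    *)       echo "unknown step: $1" >&2; return 1 ;;
  esac
}

# Delete the directory of the given step inside the dataset directory.
# $1 is the step name, $2 is the dataset directory (default: current one).
reset_step() {
  local step="$1" target="${2:-.}" dir
  dir=$(step_dir "${step}") || return 1
  rm -rf "${target:?}/${dir}"
  echo "removed ${target}/${dir}"
}
```

For example, reset_step measure . deletes measurements/ in the current directory, after which the build can be restarted as usual.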

You can also run it without Docker:

make clean
make TOTAL=100

This should work if you have all the dependencies installed, as suggested in the Dockerfile.

In order to analyze just a single repository, do this (yegor256/tojos as an example):

make clean
make REPO=yegor256/tojos

How to Contribute (e.g. by adding a new metric)

If you want to add a new metric to the script, fork the repository and create a new file in the metrics/ directory, using one of the existing files as an example. Then, create a test for your metric in the tests/metrics/ directory.

Then, run the entire test suite (this should take a few minutes to complete, without errors):

sudo make install
make test lint

Then, send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards.

You can also test it with Docker:

docker build . -t cam
docker run --rm cam make test

There is an even faster way to run all tests with the help of Docker, if you don't change any installation scripts:

docker run -v $(pwd):/c --rm yegor256/cam:0.9.2 make -C /c test

How to Calculate Additional Metrics

You may want to use this dataset as a basis, with the intent of adding your own metrics on top of it. It should be easy:

  • Clone this repo into cam/ directory
  • Download ZIP archive
  • Unpack it to the cam/dataset/ directory
  • Add a new script to the cam/metrics/ directory (use ast.py as an example)
  • Delete all other files except yours from the cam/metrics/ directory
  • Run make in the cam/ directory: sudo make install; make all

make should detect that a new metric was added. It will apply this new metric to all .java files, generate new .csv reports, aggregate them with existing reports (in the cam/dataset/data/ directory), and then update the final .pdf report.

How to Build a New Archive

When it's time to build a new archive, create a new m7i.2xlarge server (8 CPU, 32Gb RAM, 512Gb disk) with Ubuntu 22.04 in AWS.

Then, install Docker into it:

sudo apt update -y
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update -y
sudo apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo usermod -aG docker ${USER}

Then, add 16Gb of swap memory:

sudo dd if=/dev/zero of=/swapfile bs=1048576 count=16384
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Then, create a personal access token in GitHub, and run Docker as explained above.
