
Summer-of-code

Refurbishing Atarashi @ FOSSology

Project Details | Contributions | Deliverables | Future Goals | Key Takeaways

Project Details

Atarashi scans for license statements in open-source software, focusing on text statistics and information-retrieval algorithms. It is designed to work both stand-alone and with the FOSSology software. Atarashi currently uses a text-similarity-based approach to produce its results. It is a well-assembled piece of software made up of many small components (agents). Each agent has its own form of implementation, which is also why the agents vary in accuracy.

My proposed ideas and objectives revolved entirely around Atarashi: from introducing a machine-learning-based approach for classifying license statements to building a completely independent library that works in the background to help Atarashi extract license statements from a given file or directory. The overall goal was to introduce new functionality into Atarashi and to refurbish what it already has.


Contributions

1. Nirjas ~ নির্যাস

A Python library for Comments and Source Code Extraction

One special thing about source code is that every file carries a lot of vital information, including the license, which states the re-usability and implementation terms for the code. Extracting the license part from a file was the crucial, proposed job for Nirjas. Atarashi and all of its agents depend heavily on the quality of input they get from this "Code Comment Extractor": better inputs result in higher accuracy for the models.

From there, Ayush and I decided to go for a fully functional library that can be used for other purposes as well. We started working on it from the ground up: we discussed and prepared a working structure and followed it until the library was able to accomplish the defined task.

Nirjas is live on PyPI and can be installed with `pip install nirjas`.

Extraction of comments is a crucial task, as each language can have a different commenting style and several types of comments. The major work was to classify the different comment types and write separate logic for each of them (a sketch follows the list):

  1. Single line comments
  2. Multi-line comments
  3. Continuous single-line comments (a run of consecutive lines, each commented out with single-line syntax)
  4. Inline comments (comments written after code on the same line)
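
To make the four types concrete, here is a minimal, hypothetical sketch of how they can be told apart for a C-style language. This is not Nirjas's actual implementation (the real per-language logic also has to handle edge cases such as comment markers inside string literals); it only illustrates the classification described above.

```python
import re

# Hypothetical sketch for a C-style language; not the actual Nirjas logic.
SINGLE = re.compile(r"^\s*//(.*)$")        # the whole line is a comment
INLINE = re.compile(r"\S.*?//(.*)$")       # code first, then a trailing comment
MULTI = re.compile(r"/\*(.*?)\*/", re.S)   # /* ... */ blocks, possibly multi-line

def classify_comments(source: str):
    single, continuous, inline = [], [], []
    run = []  # consecutive whole-line comments form a "continuous" block

    def flush():
        nonlocal run
        if len(run) > 1:
            continuous.append(run)
        elif run:
            single.append(run[0])
        run = []

    for lineno, line in enumerate(source.splitlines(), 1):
        m = SINGLE.match(line)
        if m:
            run.append((lineno, m.group(1).strip()))
            continue
        flush()
        m = INLINE.search(line)
        if m:
            inline.append((lineno, m.group(1).strip()))
    flush()
    multi = [m.strip() for m in MULTI.findall(source)]
    return {"single": single, "continuous": continuous,
            "inline": inline, "multi": multi}
```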

We made Nirjas user-friendly and multi-purpose: it not only extracts the comments from the input file, but it can also give you a file containing just the source code, along with some metadata that really helps in understanding the dynamics of the code. It supports a large variety of programming languages, and that is not the end: Nirjas is designed in such a way that adding new languages is super simple.

(Nirjas demo screenshot)

Pull Request & Commits Authored:

2. Introducing machine learning classifiers for License Detection

Currently, Atarashi works with algorithms like TF-IDF, n-grams, DLD, etc. These algorithms are supported by different similarity-calculation techniques such as cosine similarity and score similarity. We planned to introduce an RNN model for this specific task, built around a Siamese Manhattan approach. However, licenses are a bit tricky and uncertain in nature, which made us question that plan; so we decided to first introduce a few machine-learning classification approaches, check how they behave on licenses, and then, after going through those results, plan for the RNN approach.
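
For context, here is a minimal sketch of the TF-IDF + cosine-similarity style of matching that the existing agents rely on. This is not Atarashi's actual code; the license names and texts are placeholders.

```python
# Score an unknown text against known license texts by cosine similarity
# over TF-IDF vectors (illustrative only; texts below are placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

licenses = {"MIT": "...full MIT text...", "Apache-2.0": "...full Apache text..."}
unknown = "Permission is hereby granted, free of charge, ..."

vec = TfidfVectorizer().fit(list(licenses.values()) + [unknown])
known = vec.transform(list(licenses.values()))
query = vec.transform([unknown])

scores = cosine_similarity(query, known)[0]
best = max(zip(licenses, scores), key=lambda p: p[1])
print(best)  # the license with the highest cosine similarity
```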

I started working on four different approaches/models to solve this problem. I picked some good classification models and trained them on the dataset provided; pre-processing the license texts and maintaining a vocabulary dictionary for each license helped a lot with classification. We finally came up with three models, with the main target of achieving a good accuracy score while minimizing the time required. The three models finally introduced are listed in the results table below.
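
The sketch below shows, under assumptions, what such a setup can look like with scikit-learn: a TF-IDF vectorizer feeding a linear classifier. The column names `text` and `shortname` are hypothetical, and this is not the actual training script from the pull request.

```python
# Minimal sketch of the classification setup, assuming scikit-learn and a
# licenseList.csv with hypothetical columns `text` (license text) and
# `shortname` (label).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("licenseList.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["shortname"], test_size=0.2, random_state=42)

# TF-IDF features feed a linear SVM; swap in LogisticRegression or
# MultinomialNB to reproduce the other two models from the table below.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True, stop_words="english"),
                      LinearSVC())
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```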

Results

The accuracy of the models was calculated with the evaluator script on a pre-compiled testing dataset.

| Model Name | Accuracy Score (%) | Time taken on 100 files (sec) |
| --- | --- | --- |
| Logistic Regression | 31 | 88.6 |
| Linear SVC | 36 | 79.4 |
| Multinomial Naive Bayes | 30 | 83.72 |
  • Screenshots of results

    1. Linear SVC (SVC_results screenshot)
    2. Logistic Regression (Logistic_results screenshot)
    3. Multinomial Naive Bayes (NB_results screenshot)

Pull Request

3. SPDX Dataset Generation

A dataset is a key component for any machine-learning or deep-learning model to work: the dataset defines the accuracy of the model, i.e. a lack of properly arranged and described data (mostly) leads to lower accuracy. We are currently using licenseList.csv as our main dataset for classification.

The only drawback of our current data is that it falls in the 1-class/1-license-text category, whereas the most suitable data for training the model properly would be 1-class/N-license-texts. The reason is that license texts in source code are unpredictable, so the model should be flexible and trained on a variety of probable and frequently occurring forms of each specific class. After searching extensively for a similar dataset, we decided to create one ourselves.

  • What approach did we take?

    This task was also a combined implementation by Ayush and me. The natural way to divide a single file's paragraphs into several files was by n-gramming it. Then, after applying a bit of logic and taking all the permutations and combinations of the n-grammed paragraphs, we achieved something quite close to what we wanted. E.g.:

    Suppose a license text has 5 paragraphs, [1, 2, 3, 4, 5], in order. To create the dataset we include contiguous subsets like [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and likewise for every starting paragraph 2, 3, 4 and 5, each subset carrying the same label (a sketch of this slicing follows below).

    We generated about 1 million files out of 447 license files. Voilà!
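
    A minimal sketch of that contiguous-subset expansion; the file layout and naming scheme here are illustrative, not the actual generation script.

    ```python
    # Hypothetical sketch of the contiguous-subset generation described above.
    from pathlib import Path

    def contiguous_slices(paragraphs):
        """Yield every contiguous run: [1], [1,2], ..., [2], [2,3], ..., [5]."""
        n = len(paragraphs)
        for start in range(n):
            for end in range(start + 1, n + 1):
                yield paragraphs[start:end]

    def expand_license(license_file: Path, out_dir: Path) -> int:
        paragraphs = [p for p in license_file.read_text().split("\n\n") if p.strip()]
        count = 0
        for i, subset in enumerate(contiguous_slices(paragraphs)):
            # Every generated file keeps the same label: the license's
            # short name, encoded here in the file name.
            (out_dir / f"{license_file.stem}_{i}.txt").write_text("\n\n".join(subset))
            count += 1
        return count
    ```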

    A few more things still needed to be done:

    • Shifting from txt files to SPDX JSON endpoint
    • Differentiating License Header from Full Text
    • Adding FOSSology Nomos agent STRINGS.in regex in dataset creation
  • Codebase


Deliverables

| Tasks | Planned | Completed | Remarks |
| --- | --- | --- | --- |
| Creating Nirjas | Yes | ✔️ | Beta version is live, and the project will be developed and maintained continuously |
| Implementing classification algorithms | Yes | ✔️ | The current accuracy and speed are acceptable, but the new dataset might help improve them |
| SPDX dataset generation | Yes | ✔️ | Generation is complete, but it requires cleaning |
| Shifting from argparse to plac | Yes | | It is in the queue for now |

Future Goals

  1. Nirjas still needs a fully rounded-out comment-extraction approach.
  2. Continue cleaning the new SPDX dataset and figuring out the remaining corner cases.
  3. Continue developing Nirjas and Atarashi.
  4. Implement the classification models on the new dataset and work on achieving higher accuracy.
  5. Develop a high-accuracy algorithm and then integrate Atarashi with FOSSology.
  6. Maintain Nirjas and Atarashi.

Key Takeaways

  • Learnt the art of collaboration and working on real-world software development.
  • Improved programming skills, including OOP concepts and modular programming.
  • Learnt a lot about text-similarity algorithms, NLP techniques and classification approaches.
  • Learnt about the importance of open-source licenses and how to analyse them in detail.
  • Learnt about Python packaging.
  • Improved Git skills.
  • Learnt the importance of CI/CD and unit tests.
  • Got better at analysing code and debugging it more easily.
  • Learnt the importance of a well-curated dataset and how to create one from scratch.
  • Learnt punctuality and adaptability according to time and situation.
  • Learnt to communicate properly, present the code and keep asking questions.

Reach out to me

