
Summer-of-code

Refurbishing Atarashi @ FOSSology

Project Details | Contributions | Deliverables | Future Goals | Key Takeaways

Project Details

Atarashi scans for license statements in open-source software, focusing on text statistics and information-retrieval algorithms. It is designed to work both stand-alone and with the FOSSology software. Atarashi currently uses a text-similarity-based approach to produce its results. It is a well-assembled piece of software made up of many small components (agents). Each agent has its own form of implementation, which is also why the agents vary in accuracy.

My proposed ideas and objectives revolved entirely around Atarashi: from introducing a machine-learning-based approach for classifying license statements to building a completely independent library that works in the background to help Atarashi extract license statements from a given file or directory. The overall goal was to introduce new functionality into Atarashi and to refurbish what it already has.


Contributions

1. Nirjas ~ নির্যাস

A Python library for Comments and Source Code Extraction

One special thing about source code is that every file carries a lot of vital information, including the license, which states the re-usability and implementation terms for the code. Extracting the license part from a file was the crucial, proposed job for Nirjas. Atarashi and all of its agents depend heavily on the quality of input they get from this "Code Comment Extractor": better inputs result in higher accuracy for the models.

From there, Ayush and I decided to go for a fully functional library that can be used for other purposes as well. We started working on it from the ground up: we discussed and prepared a working structure and followed it until the library was able to accomplish the defined task.

Nirjas is live on PyPI and can be installed with `pip install nirjas`.

Extraction of comments is a crucial task, as each language can have a different commenting style and several types of comments. The major work was to classify the different comment types and write separate logic for each of them (a sketch follows the list):

  1. Single line comments
  2. Multi-line comments
  3. Continuous single-line comments (a run of consecutive lines, each commented out with single-line syntax)
  4. Inline comments (comments written after code on the same line)
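
To make the four types concrete, here is a minimal, hypothetical sketch of how they can be told apart for a C-style language. This is not Nirjas's actual implementation (the real per-language logic also has to handle edge cases such as comment markers inside string literals); it only illustrates the classification described above.

```python
import re

# Hypothetical sketch for a C-style language; not the actual Nirjas logic.
SINGLE = re.compile(r"^\s*//(.*)$")        # the whole line is a comment
INLINE = re.compile(r"\S.*?//(.*)$")       # code first, then a trailing comment
MULTI = re.compile(r"/\*(.*?)\*/", re.S)   # /* ... */ blocks, possibly multi-line

def classify_comments(source: str):
    single, continuous, inline = [], [], []
    run = []  # consecutive whole-line comments form a "continuous" block

    def flush():
        nonlocal run
        if len(run) > 1:
            continuous.append(run)
        elif run:
            single.append(run[0])
        run = []

    for lineno, line in enumerate(source.splitlines(), 1):
        m = SINGLE.match(line)
        if m:
            run.append((lineno, m.group(1).strip()))
            continue
        flush()
        m = INLINE.search(line)
        if m:
            inline.append((lineno, m.group(1).strip()))
    flush()
    multi = [m.strip() for m in MULTI.findall(source)]
    return {"single": single, "continuous": continuous,
            "inline": inline, "multi": multi}
```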

We made Nirjas user-friendly and multi-purpose: it not only extracts the comments from the input file, but it can also give you a file containing just the source code, along with some metadata that really helps in understanding the dynamics of the code. It supports a large variety of programming languages, and that is not the end: Nirjas is designed in such a way that adding new languages is super simple.

(Nirjas demo screenshot)

Pull Request & Commits Authored:

2. Introducing machine learning classifiers for License Detection

Currently, Atarashi works with algorithms like TF-IDF, n-grams, DLD, etc. These algorithms are supported by different similarity-calculation techniques such as cosine similarity and score similarity. We planned to introduce an RNN model for this specific task, built around a Siamese Manhattan approach. However, licenses are a bit tricky and uncertain in nature, which made us question that plan; so we decided to first introduce a few machine-learning classification approaches, check how they behave on licenses, and then, after going through those results, plan for the RNN approach.
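
For context, here is a minimal sketch of the TF-IDF + cosine-similarity style of matching that the existing agents rely on. This is not Atarashi's actual code; the license names and texts are placeholders.

```python
# Score an unknown text against known license texts by cosine similarity
# over TF-IDF vectors (illustrative only; texts below are placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

licenses = {"MIT": "...full MIT text...", "Apache-2.0": "...full Apache text..."}
unknown = "Permission is hereby granted, free of charge, ..."

vec = TfidfVectorizer().fit(list(licenses.values()) + [unknown])
known = vec.transform(list(licenses.values()))
query = vec.transform([unknown])

scores = cosine_similarity(query, known)[0]
best = max(zip(licenses, scores), key=lambda p: p[1])
print(best)  # the license with the highest cosine similarity
```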

I started working on four different approaches/models to solve this problem. I picked some good classification models and trained them on the dataset provided; pre-processing the license texts and maintaining a vocabulary dictionary for each license helped a lot with classification. We finally came up with three models, with the main target of achieving a good accuracy score while minimizing the time required. The three models finally introduced are listed in the results table below.
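
The sketch below shows, under assumptions, what such a setup can look like with scikit-learn: a TF-IDF vectorizer feeding a linear classifier. The column names `text` and `shortname` are hypothetical, and this is not the actual training script from the pull request.

```python
# Minimal sketch of the classification setup, assuming scikit-learn and a
# licenseList.csv with hypothetical columns `text` (license text) and
# `shortname` (label).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("licenseList.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["shortname"], test_size=0.2, random_state=42)

# TF-IDF features feed a linear SVM; swap in LogisticRegression or
# MultinomialNB to reproduce the other two models from the table below.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True, stop_words="english"),
                      LinearSVC())
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```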

Results

The accuracy of the models was calculated with the evaluator script on a pre-compiled testing dataset.

| Model Name | Accuracy Score (%) | Time taken on 100 files (sec) |
| --- | --- | --- |
| Logistic Regression | 31 | 88.6 |
| Linear SVC | 36 | 79.4 |
| Multinomial Naive Bayes | 30 | 83.72 |
  • Screenshots of results

    1. Linear SVC (SVC_results screenshot)
    2. Logistic Regression (Logistic_results screenshot)
    3. Multinomial Naive Bayes (NB_results screenshot)

Pull Request

3. SPDX Dataset Generation

A dataset is a key component for any machine-learning or deep-learning model to work: the dataset defines the accuracy of the model, i.e. a lack of properly arranged and described data (mostly) leads to lower accuracy. We are currently using licenseList.csv as our main dataset for classification.

The only drawback of our current data is that it falls in the 1-class/1-license-text category, whereas the most suitable data for training the model properly would be 1-class/N-license-texts. The reason is that license texts in source code are unpredictable, so the model should be flexible and trained on a variety of probable and frequently occurring forms of each specific class. After searching extensively for a similar dataset, we decided to create one ourselves.

  • What approach did we take?

    This task was also a combined implementation by Ayush and me. The natural way to divide a single file's paragraphs into several files was by n-gramming it. Then, after applying a bit of logic and taking all the permutations and combinations of the n-grammed paragraphs, we achieved something quite close to what we wanted. E.g.:

    Suppose a license text has 5 paragraphs, [1, 2, 3, 4, 5], in order. To create the dataset we include contiguous subsets like [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and likewise for every starting paragraph 2, 3, 4 and 5, each subset carrying the same label (a sketch of this slicing follows below).

    We generated about 1 million files out of 447 license files. Voilà!
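
    A minimal sketch of that contiguous-subset expansion; the file layout and naming scheme here are illustrative, not the actual generation script.

    ```python
    # Hypothetical sketch of the contiguous-subset generation described above.
    from pathlib import Path

    def contiguous_slices(paragraphs):
        """Yield every contiguous run: [1], [1,2], ..., [2], [2,3], ..., [5]."""
        n = len(paragraphs)
        for start in range(n):
            for end in range(start + 1, n + 1):
                yield paragraphs[start:end]

    def expand_license(license_file: Path, out_dir: Path) -> int:
        paragraphs = [p for p in license_file.read_text().split("\n\n") if p.strip()]
        count = 0
        for i, subset in enumerate(contiguous_slices(paragraphs)):
            # Every generated file keeps the same label: the license's
            # short name, encoded here in the file name.
            (out_dir / f"{license_file.stem}_{i}.txt").write_text("\n\n".join(subset))
            count += 1
        return count
    ```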

    A few more things still needed to be done:

    • Shifting from txt files to SPDX JSON endpoint
    • Differentiating License Header from Full Text
    • Adding FOSSology Nomos agent STRINGS.in regex in dataset creation
  • Codebase


Deliverables

| Tasks | Planned | Completed | Remarks |
| --- | --- | --- | --- |
| Creating Nirjas | Yes | ✔️ | Beta version is live, and the project will be developed and maintained continuously |
| Implementing classification algorithms | Yes | ✔️ | The current accuracy and speed are acceptable, but the new dataset might help improve them |
| SPDX dataset generation | Yes | ✔️ | Generation is complete, but it requires cleaning |
| Shifting from argparse to plac | Yes | | It is in the queue for now |

Future Goals

  1. Nirjas still needs a fully rounded-out comment-extraction approach.
  2. Continue cleaning the new SPDX dataset and figuring out the remaining corner cases.
  3. Continue developing Nirjas and Atarashi.
  4. Implement the classification models on the new dataset and work on achieving higher accuracy.
  5. Develop a high-accuracy algorithm and then integrate Atarashi with FOSSology.
  6. Maintain Nirjas and Atarashi.

Key Takeaways

  • Learnt the art of collaboration and working on real-world software development.
  • Improved programming skills, including OOP concepts and modular programming.
  • Learnt a lot about text-similarity algorithms, NLP techniques and classification approaches.
  • Learnt about the importance of open-source licenses and how to analyse them in detail.
  • Learnt about Python packaging.
  • Improved Git skills.
  • Learnt the importance of CI/CD and unit tests.
  • Got better at analysing code and debugging it more easily.
  • Learnt the importance of a well-curated dataset and how to create one from scratch.
  • Learnt punctuality and adaptability according to time and situation.
  • Learnt to communicate properly, present the code and keep asking questions.

Reach out to me

