
atarashi's Introduction

FOSSology


About

FOSSology is an open source license compliance software system and toolkit. As a toolkit, you can run license, copyright, and export control scans from the command line. As a system, a database and web UI are provided to give you a compliance workflow. In one click you can generate an SPDX file or a ReadMe with all the copyright notices from your software. FOSSology deduplication means that you can scan an entire distro, rescan a new version, and only the changed files will get rescanned. This is a big time saver for large projects.

Check out Who Uses FOSSology!

FOSSology does not give legal advice. https://fossology.org/

Requirements

PHP 7.3 or later is required for FOSSology. FOSSology requires PostgreSQL as the database server and Apache httpd 2.6 as the web server. These and more dependencies are installed by utils/fo-installdeps.

To install Python dependencies, run install/fo-install-pythondeps.

Installation

FOSSology should work with many Linux distributions.

See https://github.com/fossology/fossology/releases for source code download of the releases.

For installation instructions, see the Install from Source page in the GitHub Wiki.

Docker

FOSSology comes with a Dockerfile that allows containerized execution, either as a single instance or in combination with an external PostgreSQL database. Note: an external database is strongly recommended for production use, since the standalone image does not take care of data persistence.

A pre-built Docker image is available from Docker Hub and can be run using the following command:

docker run -p 8081:80 fossology/fossology

The web interface is then reachable at http://IP_OF_DOCKER_HOST:8081/repo with user fossy and password fossy.

If you want to run FOSSology with an external database container, you can use Docker Compose via the following command:

docker-compose up

Docker Compose is a tool that allows you to define and run multi-container applications using a YAML file. FOSSology provides a docker-compose.yml file that defines three services: scheduler, web, and db.

The scheduler service runs the FOSSology scheduler daemon, which handles the analysis tasks. The web service runs the FOSSology web server, which provides the web interface. The db service runs a PostgreSQL database server, which stores the FOSSology data.

The docker-compose up command starts all three services at once.

The FOSSology web service allows you to configure its database connection using some environment variables. These variables are defined in the docker-compose.yml file under the environment key.

  • FOSSOLOGY_DB_HOST: Hostname of the PostgreSQL database server. An integrated PostgreSQL instance is used if not defined or set to localhost.
  • FOSSOLOGY_DB_NAME: Name of the PostgreSQL database. Defaults to fossology.
  • FOSSOLOGY_DB_USER: User to be used for PostgreSQL connection. Defaults to fossy.
  • FOSSOLOGY_DB_PASSWORD: Password to be used for PostgreSQL connection. Defaults to fossy.

You can change them if you want to use a different database server or credentials.

Vagrant

FOSSology comes with a Vagrantfile that can be used to create an isolated environment for FOSSology and its dependencies.

Prerequisites: Vagrant >= 2.x and VirtualBox >= 5.2.x

Steps:

git clone https://github.com/fossology/fossology
cd fossology/
vagrant up

The server should then be available at http://localhost:8081/repo/. The login credentials are:

user: fossy
pass: fossy

Test Instance

For trying out FOSSology quickly, a test instance is available at https://fossology.osuosl.org/. This instance can be deleted or reinstalled at any time, so it is not suitable for production use. The login credentials are as follows:

Username: fossy
Password: fossy

Note: The test instance is not up to date with the latest release. The instance resets every night at 2 am UTC, and all user-uploaded data will be lost.

Quick dev prototype with GitPod.io

FOSSology is ready to be coded on GitPod.io. To use it, you need to set up an account. You can launch the project directly with the following button: Link to Gitpod

Once in, you should see two terminals, one running the FOSSology scheduler and one running the installation.

Handy scripts/aliases

For ease of use, the following aliases/scripts are defined and can be used:

  • conffoss: Reconfigure CMake with all variables
  • buildfoss: Build FOSSology using CMake
  • installfoss: Install FOSSology
  • fossrun: Run the FOSSology scheduler
  • pg_stop: Stop PostgreSQL server
  • pg_start: Start PostgreSQL server

Documentation

We are currently migrating our documentation to GitHub. At this stage, you can find general documentation at https://www.fossology.org/get-started/basic-workflow/ and developer docs on the GitHub Wiki and https://fossology.github.io/

Support

Mailing lists, FAQs, Release Notes, and other useful info are available by clicking the documentation tab on the project website. We encourage all users to join the mailing list and participate in discussions. There is also a #fossology IRC channel on the freenode IRC network if you'd like to talk to other FOSSology users and developers. See Contact Us

Contributing

Contributions in several forms are welcome; see CONTRIBUTING.md

Licensing

The original FOSSology source code and associated documentation including these web pages are Copyright (C) 2007-2012 HP Development Company, L.P. In the past years, other contributors added source code and documentation to the project, see the NOTICES file or the referring files for more information.

Any modifications or additions to source code or documentation contributed to the FOSSology project are Copyright (C) the contributor, and should be noted as such in the comments section of the modified file(s).

FOSSology is licensed under GPL-2.0

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Exception:

All of the FOSSology source code is licensed under the terms of the GNU General Public License version 2, with the following exceptions:

libfossdb and libfossrepo libraries are licensed under the terms of the GNU Lesser General Public License version 2.1, LGPL-2.1.

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301  USA

Please see the LICENSE file included with this software for the full texts of these licenses.

atarashi's People

Contributors

ag4ums, aman-codes, amanjain97, gmishx, hastagab, its-sushant, kaushl2208, mcjaeger, singhshreya05, tanweerulhaque, vasudevmaduri, xavierfigueroav


atarashi's Issues

Error in CommentPreprocessor

I get the following error when running python atarashii.py -a wordFrequencySimilarity <file>:

Traceback (most recent call last):
  File "atarashii.py", line 213, in <module>
    main()
  File "atarashii.py", line 167, in main
    result = run_scan(scanner_obj, inputPath)
  File "atarashii.py", line 116, in run_scan
    return scanner.scan(inputFile)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/agents/wordFrequencySimilarity.py", line 41, in scan
    processedData = super().loadFile(filePath)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/agents/atarashiAgent.py", line 44, in loadFile
    self.commentFile = CommentPreprocessor.extract(filePath)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/libs/commentPreprocessor.py", line 131, in extract
    data = json.loads(data_file)
  File "/usr/lib/python3.8/json/__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict

I got this error when running the wordFrequencySimilarity agent, but since CommentPreprocessor is used by all the other agents, this may affect the whole package.

The error occurs because the result of Nirjas' extract function is passed to json.loads, but extract returns a dictionary while json.loads expects a string. See lines 129 and 130 in commentPreprocessor.py.

with open(outputFile, 'w') as outFile:
  # if the file extension is supported
  if fileType in supportedFileExtensions:
    data_file = commentExtract(inputFile)
    data = json.loads(data_file)
    data1 = licenseComment(data)
    outFile.write(data1)

So the fix consists of removing line 130 and passing data_file to licenseComment instead of data.
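
Under that reading of the code, the corrected flow would look roughly like this (extract_comments is a hypothetical helper; commentExtract and licenseComment stand in for the real Nirjas/atarashi calls):

```python
# Sketch of the fix described above: Nirjas' extract() already returns a
# dict, so we skip json.loads and pass its result straight to licenseComment.
def extract_comments(inputFile, fileType, supportedFileExtensions,
                     commentExtract, licenseComment):
    if fileType in supportedFileExtensions:
        data_file = commentExtract(inputFile)  # already a dict, not a JSON string
        return licenseComment(data_file)       # pass the dict directly
    return ''
```

The only behavioral change from the buggy version is dropping the json.loads call between the two steps.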

Shift from argparse module to plac command line parser.

Description

argparse is a parser for command-line options, arguments, and sub-commands.
Read the docs: https://docs.python.org/3/library/argparse.html
Currently, argparse is used as the command-line parser in atarashi, and we plan to switch to plac.
plac does the same thing with far less code.
Repo: https://github.com/micheles/plac

How to Solve

Read the plac documentation: http://micheles.github.io/plac/

Files to be changed

Parallelize the evaluator algorithm

Description

There is a script to evaluate the algorithms for Atarashi: evaluator.py

Currently, it scans the test files sequentially (one by one).
We want to parallelize the script using multiprocessing, multithreading, or similar techniques to reduce the effective scanning time.

How to solve

Use multiprocessing or multithreading in the main loop of evaluator.py
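
A possible sketch using multiprocessing.Pool, where scan_file is a placeholder for the evaluator's real per-file scan:

```python
# Hedged sketch: parallelizing the evaluator's per-file loop with a process
# pool. scan_file stands in for running the selected agent on one test file.
from multiprocessing import Pool

def scan_file(path):
    # placeholder: the real evaluator would run the agent and compare the
    # detected license against the expected one for this file
    return (path, path.endswith('.c'))

def evaluate(paths, workers=4):
    with Pool(processes=workers) as pool:
        results = pool.map(scan_file, paths)  # files are scanned in parallel
    correct = sum(1 for _, ok in results if ok)
    return correct / len(results) if results else 0.0
```

Because each file is scanned independently, pool.map preserves the sequential semantics while spreading the work over several processes.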

ModuleNotFoundError occurs when running after installing with pip.

I tried running atarashi after installing it with pip on Python 3.6.9, and the following error occurred.
Are there additional modules I need to install?

$pip install atarashi 
$atarashi -h
Traceback (most recent call last):
  File "/home/soimkim/test/venv/bin/atarashi", line 5, in <module>
    from atarashi.atarashii import main
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/atarashii.py", line 26, in <module>
    from atarashi.agents.cosineSimNgram import NgramAgent
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/agents/cosineSimNgram.py", line 30, in <module>
    from atarashi.agents.atarashiAgent import AtarashiAgent
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/agents/atarashiAgent.py", line 27, in <module>
    from atarashi.libs.commentPreprocessor import CommentPreprocessor
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/libs/commentPreprocessor.py", line 23, in <module>
    import code_comment  # https://github.com/amanjain97/code_comment/
ModuleNotFoundError: No module named 'code_comment'

Create a unified entry point

Create a unified entry point for every file using command-line arguments instead of adding a __main__ block to every module. This will help keep the code concise, maintainable, and readable.
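
One way such an entry point could look, sketched with argparse and an illustrative agent registry (the names and registry entries are assumptions, not the project's actual API):

```python
# Hedged sketch: a single CLI entry point that dispatches to agents by name,
# instead of a __main__ block in every module.
import argparse

# Illustrative registry; real entries would map names to agent classes.
AGENTS = {
    'tfidf': lambda path: {'file': path, 'agent': 'tfidf'},
    'wordFrequencySimilarity': lambda path: {'file': path,
                                             'agent': 'wordFrequencySimilarity'},
}

def main(argv=None):
    parser = argparse.ArgumentParser(prog='atarashi')
    parser.add_argument('-a', '--agent', choices=sorted(AGENTS), required=True)
    parser.add_argument('inputFile')
    args = parser.parse_args(argv)
    return AGENTS[args.agent](args.inputFile)
```

New agents then only need a registry entry, not their own argument parsing.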

Make evaluation.py more informative

The evaluation script should

  • Print a comparison table with all the algorithms supported by atarashi. You can find examples of comparison tables in #95 and #65.
  • Print a confusion matrix so that we can easily do error analysis and decide how to improve current agents or implement new ones.

Ability to scan directories

Currently atarashi can scan only files. If a directory is provided as input, it should find all files under it and run the selected agent on them.
The results of each scan can be stored in a list and printed as a JSON array.

It would be preferable, however, to print results as they arrive while keeping the JSON array valid, so that someone running a scan in an interactive terminal does not get the impression that nothing is happening.
This can be emulated by printing an opening [, then each scan result object {...} followed by a ,; the last result omits the trailing , and a ] is printed at the end of the scan. This approach eliminates the need for an additional list to hold temporary results.
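
The incremental printing described above can be sketched as follows (the function name is illustrative; placing the comma before each element after the first is equivalent to omitting the trailing comma):

```python
# Hedged sketch: emit a valid JSON array incrementally, one scan result at a
# time, so interactive users see progress without buffering all results.
import json

def stream_results(results, write=print):
    write('[')
    for i, result in enumerate(results):
        # comma before every element except the first avoids a trailing comma
        write((',' if i else '') + json.dumps(result))
    write(']')
```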

Problem with identifying the short license text

Generally, the license contained in a source code file is either a short license notice or a large license block, which makes it difficult for information retrieval and similarity algorithms to classify efficiently.

Please suggest how this should be resolved before implementing other IR (Information retrieval) algorithms.

FEAT: Increasing the overall performance of Atarashi

We can improve the performance of atarashi, nirjas, and others by using Numba and RAPIDS by Nvidia. Regular NumPy, pandas, and other libraries are slow: most of the time is wasted in serialization, deserialization, pre-processing, and memory transfers between the CPU and other devices. We can speed this up using Numba's parallel processing, JIT compilation, and built-in features, which work even on a CPU. Most of the programs can be made even faster using RAPIDS' cuML, cuDF, Dask, etc., by executing everything on a GPU: pre-processing, vectorization, database queries, serialization, deserialization, parallel processing, and so on. The entire codebase can be translated without much hassle, resulting in computational efficiency, higher accuracy, and lower memory usage.
This can ensure Atarashi's integration with FOSSology.
I have somewhat started with the work. Can I proceed with the same?
@hastagAB @GMishx

Build fails

Hi, I'm getting an error when trying to build (master branch) on Python 3.7:

Installing collected packages: code-comment
  Running setup.py develop for code-comment
    Complete output from command /home/rob/projects/atarashi/.venv/bin/python -c "import setuptools, tokenize;__file__='/home/rob/projects/atarashi/.venv/src/code-comment/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps --user --prefix=:
    usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
       or: -c --help [cmd1 cmd2 ...]
       or: -c --help-commands
       or: -c cmd --help
    
    error: option --user not recognized

I'm able to build when I run the command with sudo, but this shouldn't be necessary per my understanding.

Pipfile: the replacement for requirements.txt

We can migrate our current requirements.txt file to a Pipfile for several reasons:

  • TOML syntax for declaring all types of Python dependencies.
  • One Pipfile (as opposed to multiple requirements.txt files).
  • A Pipfile is inherently ordered.

or, Refer: Why?

[Proposal] Improve the speed of matching

WHAT

Atarashi can use a lot of agents and a lot of similarity types.
But according to previous tests, we observed that the speed of license scanning needs improvement.

Proposal

to be decided

Build fails due to import error from nirjas

I get the following error:

Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from atarashi.build_deps import download_dependencies
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/build_deps.py", line 32, in <module>
    from atarashi.license.licensePreprocessor import LicensePreprocessor
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/license/licensePreprocessor.py", line 31, in <module>
    from atarashi.libs.commentPreprocessor import CommentPreprocessor
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/libs/commentPreprocessor.py", line 30, in <module>
    from nirjas import extract as commentExtract, LanguageMapper
ImportError: cannot import name 'LanguageMapper' from 'nirjas' (/home/xavierfigueroav/Documents/atarashi-project/atarashi/.env/lib/python3.8/site-packages/nirjas/__init__.py)

This happens when running python setup.py build, after cloning the repository and installing the dependencies.

Comment extraction not working on curly quotes

Curly quotes (‘ ’ “ ”) are not filtered by the comment extractor, which leads to some wrong results.

A more extensive list of problematic characters:

Character  Code point  ASCII  Name
–          \u2013      -      En dash
—          \u2014      -      Em dash
―          \u2015      -      Horizontal bar
‘          \u2018      '      Left single quotation mark
’          \u2019      '      Right single quotation mark
‚          \u201a      ,      Single low-9 quotation mark
‛          \u201b      '      Single high-reversed-9 quotation mark
“          \u201c      "      Left double quotation mark
”          \u201d      "      Right double quotation mark
„          \u201e      "      Double low-9 quotation mark
…          \u2026      ...    Horizontal ellipsis
′          \u2032      '      Prime
″          \u2033      "      Double prime
©          \u00a9      (c)    Copyright sign
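
A possible normalization pass, sketched under the assumption that mapping these characters to their ASCII equivalents before matching is acceptable (names are illustrative):

```python
# Hedged sketch: map the problematic code points from the table above to
# their ASCII stand-ins before comment extraction / matching.
ASCII_MAP = {
    '\u2013': '-', '\u2014': '-', '\u2015': '-',
    '\u2018': "'", '\u2019': "'", '\u201a': ',', '\u201b': "'",
    '\u201c': '"', '\u201d': '"', '\u201e': '"',
    '\u2026': '...', '\u2032': "'", '\u2033': '"',
    '\u00a9': '(c)',
}
_TABLE = str.maketrans(ASCII_MAP)

def normalize_punctuation(text):
    # str.translate replaces each problematic code point in one pass
    return text.translate(_TABLE)
```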

Invalid File Path in Atarashi

Whenever an invalid file path is provided to atarashi, it generates the following error:
(env) akshay@akshay-VirtualBox:~/atarashi/atarashi/evaluator$ atarashi -a tfidf Testfiles/APSL-style.html
Traceback (most recent call last):
  File "/home/akshay/atarashi/env/bin/atarashi", line 8, in <module>
    sys.exit(main())
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/atarashii.py", line 123, in main
    result = atarashii_runner(inputFile, processedLicense, agent_name, similarity, ngram_json, verbose)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/atarashii.py", line 83, in atarashii_runner
    result = scanner.scan(inputFile)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/tfidf.py", line 140, in scan
    return self.__tfidfcosinesim(filePath)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/tfidf.py", line 112, in __tfidfcosinesim
    processedData1 = super().loadFile(inputFile)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/atarashiAgent.py", line 44, in loadFile
    self.commentFile = CommentPreprocessor.extract(filePath)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/libs/commentPreprocessor.py", line 129, in extract
    data1 = licenseComment(data)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/libs/commentPreprocessor.py", line 42, in licenseComment
    for id, item in enumerate(data[0]["multi_line_comment"]):
IndexError: list index out of range
(env) akshay@akshay-VirtualBox:~/atarashi/atarashi/evaluator$

Instead, it should generate a simple error message saying that the provided file path is invalid.
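
A minimal sketch of such a guard, assuming a check at the CLI entry point (the function name and message format are illustrative):

```python
# Hedged sketch: validate the input path up front and exit with a clear
# message instead of surfacing an IndexError from deep inside an agent.
import os
import sys

def require_file(path):
    if not os.path.isfile(path):
        sys.exit("atarashi: error: invalid file path: " + path)
    return path
```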

Comment extraction not working properly

Multi-line comment extraction for the following file types is still not working: JS, PHP, and Python.

For example in Python

print """Some long
print message
"""
print 'Some different message'
"""
Actual comment
"""

In this case, the script returns print 'Some different message' as a comment and misses the actual comment.

List Index Out of Range

When running atarashi -a {any agent} -s {similarity}, the following error is produced:


See #80 for reference.

Steps To Reproduce

  • Previously, @Aman-Codes made changes in atarashii.py (#L83), which worked fine, but after this PR the same error started appearing again.
  • Either the scan function needs to be implemented as shown [here], or the atarashii.py file needs to be updated.

Run the evaluator command without any 'similarity' parameter

Description

The evaluator commands are set for two parameters, agent_name and similarity, but some agents also run without a similarity type.

Example: for the tfidf agent there are three commands:

  1. With cosine similarity: atarashi -a tfidf -s CosineSim /path/to/file.c
  2. With score similarity: atarashi -a tfidf -s ScoreSim /path/to/file.c
  3. Without any similarity: atarashi -a tfidf /path/to/file.c

The evaluator covers the first two cases and not the third one. The same goes for other agents.

How to fix

  1. Go to the getCommand function of the evaluator.
  2. Write separate conditions as needed, or adjust the existing ones.
  3. Test and verify that it works.

Removing third party module in dameruLevenDist agent

Right now atarashi uses damerau_levenshtein_distance imported from pyxdameraulevenshtein in the dameruLevenDist agent. The function is not that long, and we do not know whether it will be removed upstream. So we can remove the dependency and write our own damerau_levenshtein_distance function to increase the overall speed of the dameruLevenDist agent and make atarashi less dependent on other repositories.
I have already started working on it. Can I proceed further?
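
For reference, a dependency-free sketch of the optimal-string-alignment variant of Damerau-Levenshtein distance (the restricted variant, which to my understanding is what pyxdameraulevenshtein computes):

```python
# Hedged sketch: optimal string alignment distance, i.e. Levenshtein plus
# adjacent transpositions counted as one edit.
def damerau_levenshtein(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i          # delete i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j          # insert j characters into an empty string
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Note that whether this beats the C-backed pyxdameraulevenshtein on speed would need benchmarking; the main win is dropping the dependency.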

Improve TF-IDF agent by tuning matches threshold

Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133):

for counter, value in enumerate(all_documents_matrix, start=0):
  sim_score = self.__cosine_similarity(value, search_martix)
  if sim_score >= 0.3:
    matches.append({
      'shortname': self.licenseList.iloc[counter]['shortname'],
      'sim_type': "TF-IDF Cosine Sim",
      'sim_score': sim_score,
      'desc': ''
    })
matches.sort(key=lambda x: x['sim_score'], reverse=True)
if self.verbose > 0:
  print("time taken is " + str(time.time() - startTime) + " sec")
return matches

Using the evaluation.py script, I've carried out some experiments:

 #  Algorithm                                   Time elapsed   Accuracy
 1  tfidf (CosineSim) (thr=0.30)                30.19          59.0%
 2  tfidf (CosineSim) (thr=0.17)                35.29          61.0%
 3  tfidf (CosineSim) (thr=0.16, max_df=0.10)   27.34          62.0%
 4  tfidf (CosineSim) (thr=0.16)                36.42          62.0%
 5  tfidf (CosineSim) (thr=0.15)                38.45          62.0%
 6  tfidf (CosineSim) (thr=0.10)                39.91          62.0%
 7  tfidf (CosineSim) (thr=0.00)                61.49          62.0%
 8  Ngram (CosineSim)                           -              57.0%
 9  Ngram (BigramCosineSim)                     -              56.0%
10  Ngram (DiceSim)                             -              55.0%
11  wordFrequencySimilarity                     -              23.0%
12  DLD                                         -              17.0%
13  tfidf (ScoreSim)                            -              13.0%
  • Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent using CosineSim as the similarity measure.
  • Row 7 shows how we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, simply removing the threshold makes the agent 2x slower, so I continued tuning the threshold, holding on to the largest value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
  • To continue decreasing the execution time while keeping the accuracy up, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, shown in row 3.
    • Why does decreasing the max_df value increase the speed? The vectorizer ignores all terms that appear in more than max_df percent of the documents (see docs), i.e., it ignores the most frequent terms, so each document vector is shorter, making the cosine similarity cheaper to compute.
    • Why does decreasing the max_df value keep the accuracy high? My explanation is that terms appearing in most licenses do not help the algorithm distinguish licenses; rare terms are the ones that make licenses differ from each other, so they are enough for the algorithm to do a good job.
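
The max_df effect can be illustrated without scikit-learn by computing document frequencies directly (terms_kept is a hypothetical helper mirroring max_df's drop-if-more-frequent semantics):

```python
# Hedged sketch: terms whose document frequency exceeds max_df (as a
# proportion of documents) are dropped, so each document vector shrinks.
def terms_kept(docs, max_df):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    # keep only terms at or below the max_df proportion of documents
    return {t for t, c in df.items() if c / n <= max_df}
```

With license texts, ubiquitous words like "the" or "license" would be the first to go, which is exactly why accuracy survives the filtering.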

I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

  • I've left out the speed times for all the other algorithms, because I ran those experiments in another context, so the comparison of time wouldn't be fair.
  • All the results differ from the last report I could find out there. I do not fully understand why some of them are so different; probably changes in the test files or changes in the algorithms. Anyway, 62.0% is the new best result in both reports.
  • My findings may help improve other agents that use thresholds, such as Ngram.
  • This new state-of-atarashi performance 😅 may also push the goals of future agents implementations, since it would be the new baseline.
