ninka's Introduction

Contact information

Any feedback is appreciated. You can email Daniel M. German ([email protected]) and Yuki Manabe ([email protected]).

Introduction

Ninka is a license identification tool that identifies the license(s) under which a given source file is made available.

It takes a source file as input and outputs the licenses identified in that file.

For details on how Ninka works, please see the following paper:

Daniel M. German, Yuki Manabe and Katsuro Inoue. A sentence-matching method for automatic license identification of source code files. In 25th IEEE/ACM International Conference on Automated Software Engineering (ASE 2010). You can email me ([email protected]) for a copy or download it from http://turingmachine.org/~dmg/papers/dmg2010ninka.pdf.

If you use Ninka for research purposes, we would appreciate it if you cited the above paper.

Contributors

  • Paul Clough for his code to split sentences
  • Anthony Kohan for writing the Excel and SQLite backends
  • Armijn Hemel from Tjaldur Software Governance Solutions for multiple bug reports and suggestions
  • René Scheibe for modularizing the code

License

Ninka is licensed under the GPLv2+:

Copyright (C) 2009-2014  Yuki Manabe and Daniel M. German

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Ninka::SentenceExtractor is a derivative work of the rule-based sentence splitter script by Paul Clough.

The comments program is based on a program to remove comments by Jon Newman.

Requirements

  • How to install

    1. Unpack the distribution in a directory.
    2. Optional: Build and install comments (see the comments directory) and make sure it is somewhere in your path.

Usage

ninka [options] filename

Available options:

-i create intermediary files
-v verbose

Example:

ninka -i foo.c

It will create six files:

  1. foo.c.comments: the first comment blocks extracted from the file, where the license is usually included
  2. foo.c.sentences: the list of sentences in the license statement
  3. foo.c.goodsent: the sentences that are likely to be part of a license statement
  4. foo.c.badsent: the sentences that are not part of foo.c.goodsent
  5. foo.c.senttok: each sentence in *.goodsent converted into a sentence token (or marked unmatched when none matches)
  6. foo.c.license: the list of licenses found in the file. It contains a single semicolon-delimited line whose fields include:
    • Licenses
    • Sentences in *.senttok that were not matched

These files are not required for Ninka's functionality, but they can help to debug license detection issues.

Ninka model

Ninka uses a pipeline model. Each stage of the pipeline does one specific job:

  1. Comment extractor

    • Module: Ninka::CommentExtractor

    • Purpose: Extracts the top comments of the source code. If no comment extractor is known for the language, it extracts the top lines of the source instead (currently 700).

    • Output: .comments

  2. Split sentences in comments

    • Module: Ninka::SentenceExtractor

    • Purpose: Ninka works by matching sentences of licenses, hence it needs to properly break text into sentences.

    • Output: .sentences

  3. Filter "good" sentences

    • Module: Ninka::SentenceFilter

    • Purpose: Some sentences are related to a license, some are not. It is valuable to know if a file contains lines that look like a license or not (e.g. to know that a file has no license).

    • Output: .goodsent and .badsent

  4. Tokenize sentences

    • Module: Ninka::SentenceTokenizer

    • Purpose: Creates a file of the recognized sentence tokens. For each sentence, it outputs the matching sentence token, or unknown if none matches.

    • Output: .senttok

  5. Match sentences to licenses

    • Module: Ninka::LicenseMatcher

    • Purpose: It looks at the sentence tokens and outputs the licenses found.

    • Output: .license

The ninka script takes care of all these steps, optionally creates the intermediary files, and writes the licenses found to stdout.


How to read the output:

Assume, for example, this output:

eq.c;MITX11noNotice;1;2;2;6;0;Copyright,-1,-1,DualLicenseIntention,GPLorOpenBSDTypeVer2,BSDpre,BSDcondSource,BSDcondBinary

Here Ninka recognizes all the sentences, including an MIT variant, and it finds a GPL/BSD dual-license intention. But the license is not really BSD: the disclaimers are not what you would expect. In all fairness, this may be another license altogether.

The output reads as follows:

File: eq.c; license(s) found: MITX11noNotice

The remaining fields, ;1;2;2;6;0;, mean:

  • 1: one license was found
  • 2: it is composed of 2 lines (tokens)
  • 2: 2 tokens were ignored
  • 6: 6 tokens were not matched: Copyright,-1,-1,DualLicenseIntention,GPLorOpenBSDTypeVer2,BSDpre,BSDcondSource,BSDcondBinary (-1 indicates where a match happened)
  • 0: 0 tokens were unknown

Another example:

nsAccessibilityUtils.cpp;MPLv1_1;1;1;3;7;2;UNKNOWN,MPL1_1_GPL2_LGPL2_1intentionVer0,1,-1,-1,MPLsee,Copyright,-1,Altern,UNKNOWN,MPLoptionNOTGPLVer0,MPLoptionIfNotDelete3licsVer0,licenseBlockEnd

  • License matched: MPLv1_1
  • 1: one license was found
  • 1: it is composed of one token
  • 3: 3 tokens were ignored
  • 7: 7 tokens were matched but not recognized as a license: UNKNOWN,MPL1_1_GPL2_LGPL2_1intentionVer0,1,-1,-1,MPLsee,Copyright,-1,Altern,UNKNOWN,MPLoptionNOTGPLVer0,MPLoptionIfNotDelete3licsVer0,licenseBlockEnd
  • 2: 2 of those tokens were unknown
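
The field layout walked through in the two examples above can also be read mechanically. The following is a minimal Python sketch, not part of Ninka: the `NinkaResult` class, the `parse_ninka_line` function, and the field names are hypothetical labels chosen for illustration, and the sketch assumes the eight-field stdout form shown above (file name first). The license field is kept as a raw string because its internal separator is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NinkaResult:
    """One line of ninka stdout; field names are illustrative, not Ninka's own."""
    filename: str
    licenses: str          # raw license field, left unsplit
    license_count: int     # number of licenses found
    license_tokens: int    # tokens that make up the license(s)
    ignored_tokens: int
    unmatched_tokens: int
    unknown_tokens: int
    tokens: List[str]      # the trailing comma-separated token list


def parse_ninka_line(line: str) -> NinkaResult:
    # Split from the right so a ';' inside the file name does not shift fields.
    parts = line.rstrip("\n").rsplit(";", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 semicolon-delimited fields, got {len(parts)}")
    filename, licenses, count, composed, ignored, unmatched, unknown, tokens = parts
    return NinkaResult(
        filename=filename,
        licenses=licenses,
        license_count=int(count),
        license_tokens=int(composed),
        ignored_tokens=int(ignored),
        unmatched_tokens=int(unmatched),
        unknown_tokens=int(unknown),
        tokens=tokens.split(",") if tokens else [],
    )


if __name__ == "__main__":
    sample = ("eq.c;MITX11noNotice;1;2;2;6;0;Copyright,-1,-1,DualLicenseIntention,"
              "GPLorOpenBSDTypeVer2,BSDpre,BSDcondSource,BSDcondBinary")
    print(parse_ninka_line(sample))
```

Lines that do not follow this layout raise a ValueError rather than being guessed at.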

ninka's People

Contributors

bfirsh, darxriggs, dktrkranz, dmgerman, jeremiah, joshovi, rillig, sethwoodworth, wyhfrank, zacchiro

ninka's Issues

does not recognize simple/standard AGPL license headers

I've been playing with ninka on a couple of AGPL'd applications. For testing purposes, here are the apps I've used with the corresponding tarballs:

  1. GNU mediagoblin (Git snapshot): http://upsilon.cc/~zack/stuff/mediagoblin-snapshot.tar.gz
  2. Debsources (ditto): http://upsilon.cc/~zack/stuff/debsources-snapshot.tar.gz

As a baseline test, I've also used the following archive (which contains code licensed under a mixture of licenses):

  3. python-debian (0.1.25): http://upsilon.cc/~zack/stuff/python-debian.tar.gz

I've used the new excel & sqlite wrappers in my tests.

On archive (3), ninka seems to work as expected, recognizing various licenses.
On archive (1) and (2), ninka does not recognize any single AGPL'd file as such, even though the headers in them are fairly explicit and standard, e.g.:

# Debsources is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.

It seems that ninka does some AGPL "stuff", as reported in the token dump, but fails to conclude that the file is licensed under AGPL.

The problem seems to be specific to AGPL. To verify that, I ran the following experiment: I removed (with "sed -i") all occurrences of the string "Affero " in a local copy of the Debsources archive and reran ninka on the resulting archive. ninka was immediately able to conclude that most files are licensed under GPL3.

So maybe there is a simple AGPL regexp to be tweaked somewhere?

Many thanks for ninka!
Cheers.

Add Install instructions to the README.md

I haven't used Perl in over 10 years, so I didn't realise that to build and install I would need

$ perl Makefile.PL
$ make
$ sudo make install

(culled from the Dockerfile)

It would be great if this was in the README.md somewhere.

Keep releases up to date

Major changes have been made since the latest release on Jul 7, 2013. Other repos shouldn't rely on the master branch to get ninka, as it could introduce breaking changes.

Ninka barfs on test files from Qt with non-ASCII data and is killed

The test file in the attached ZIP (from Qt 3.3.0) makes Ninka 2.0-pre1 (release) barf and it is killed after taking quite a bit of time:

$ time ./bin/ninka /tmp/aticatac.cpp
execution of program [comments -c1 '/tmp/aticatac.cpp' 2> /dev/null] failed: status [137], error [sh: line 1: 23288 Killed comments -c1 '/tmp/aticatac.cpp' 2> /dev/null] at /gpl/ninka/ninka-2.0-pre1/lib/Ninka/CommentExtractor.pm line 76, line 1.

real 0m10.047s
user 0m10.042s
sys 0m0.010s

qt-test-file.zip

Remote code execution in bin/ninka-excel and bin/ninka-sqlite

These programs execute commands with backticks, which pass the command string to a shell for interpretation. This means that a file name containing a single quote can be used to execute arbitrary code on the system analyzing a package.

Publicly reported since neither of these files is installed by default.
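
For illustration only, here is a minimal Python sketch of the general remedy for this class of bug: pass the command as an argument list so that no shell ever parses the file name. It is not a patch for the Perl scripts named above. The `run_without_shell` helper, the `hostile_name` value, and the child command are all hypothetical and exist only for this demonstration.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run a command given as an argument list; no shell parses the arguments."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


# A hypothetical file name with quotes and shell metacharacters in it.
hostile_name = "foo'; echo INJECTED; '.c"

# The child process simply prints its first argument back.
echo_argv = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile_name]

if __name__ == "__main__":
    # The name arrives in the child unchanged, as a single argument.
    print(run_without_shell(echo_argv))
```

Because the file name is passed as one list element, quotes and spaces inside it are ordinary characters and need no escaping.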

Add instructions for reporting un/misrecognized licenses in README

I have dozens of Python packages for which I want to identify the licenses. For a significant part of them, Ninka just returns "UNKNOWN" for the license text file.

The README file should give instructions for how to report these identification failures and how to include samples of problematic license texts.

Link to ports in other languages

I think it would be nice if the README (after being converted to markdown format) would link to known ports of Ninka to other programming languages. The only port I'm aware of right now is JNinka.

Help with using ninka

I am trying to run the ninka script in bin following the readme notes but haven't succeeded.

I cloned the project on my Mac (macOS), where I have Perl 5. Simply trying to run the file in bin gives:

$ perl bin/ninka
Can't locate Ninka.pm in @INC (you may need to install the Ninka module) (@INC contains: /usr/local/opt/perl/lib/perl5/site_perl/5.36/darwin-thread-multi-2level /usr/local/opt/perl/lib/perl5/site_perl/5.36 /usr/local/opt/perl/lib/perl5/5.36/darwin-thread-multi-2level /usr/local/opt/perl/lib/perl5/5.36 /usr/local/lib/perl5/site_perl/5.36) at bin/ninka line 6.
BEGIN failed--compilation aborted at bin/ninka line 6.

Reading that (you may need to install the Ninka module), I tried to compile/install. I've seen a Makefile.PL, so I also tried following these instructions with no joy... see the following.

Step one works:

$ perl Makefile.PL 
Warning: prerequisite DBD::SQLite 0 not found.
Warning: prerequisite DBI 0 not found.
Warning: prerequisite IO::CaptureOutput 0 not found.
Warning: prerequisite Spreadsheet::WriteExcel 0 not found.
Warning: prerequisite Test::Pod 1.00 not found.
Warning: prerequisite Test::Strict 0 not found.
Generating a Unix-style Makefile
Writing Makefile for Ninka
Writing MYMETA.yml and MYMETA.json

Make seems to work as well:

$ make
...
cp bin/ninka blib/script/ninka
"/usr/local/Cellar/perl/5.36.1/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/ninka
Manifying 1 pod document
Manifying 8 pod documents

Make test fails miserably:

$ make test
PERL_DL_NONLAZY=1 "/usr/local/Cellar/perl/5.36.1/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/pod_ok.t .................... Can't locate Test/Pod.pm in @INC (you may need to install the Test::Pod module) (@INC contains: /Users/ed4565/Development/ninka/blib/lib /Users/ed4565/Development/ninka/blib/arch /usr/local/opt/perl/lib/perl5/site_perl/5.36/darwin-thread-multi-2level /usr/local/opt/perl/lib/perl5/site_perl/5.36 /usr/local/opt/perl/lib/perl5/5.36/darwin-thread-multi-2level /usr/local/opt/perl/lib/perl5/5.36 /usr/local/lib/perl5/site_perl/5.36 .) at t/pod_ok.t line 3.
BEGIN failed--compilation aborted at t/pod_ok.t line 3.
t/pod_ok.t .................... Dubious, test returned 2 (wstat 512, 0x200)
No subtests run 
t/reference_licenses.t ........ Can't locate IO/CaptureOutput.pm in @INC (you may need to install the IO::CaptureOutput module) (@INC contains: /Users/ed4565/Development/ninka/blib/lib /Users/ed4565/Development/ninka/blib/arch /Library/Perl/5.30/darwin-thread-multi-2level /Library/Perl/5.30 /Network/Library/Perl/5.30/darwin-thread-multi-2level /Network/Library/Perl/5.30 /Library/Perl/Updates/5.30.3 /System/Library/Perl/5.30/darwin-thread-multi-2level /System/Library/Perl/5.30 /System/Library/Perl/Extras/5.30/darwin-thread-multi-2level /System/Library/Perl/Extras/5.30 .) at /Users/ed4565/Development/ninka/blib/lib/Ninka/CommentExtractor.pm line 7.
BEGIN failed--compilation aborted at /Users/ed4565/Development/ninka/blib/lib/Ninka/CommentExtractor.pm line 7.
Compilation failed in require at /Users/ed4565/Development/ninka/blib/lib/Ninka.pm line 6.
BEGIN failed--compilation aborted at /Users/ed4565/Development/ninka/blib/lib/Ninka.pm line 6.
Compilation failed in require at /Users/ed4565/Development/ninka/bin/ninka line 6.
BEGIN failed--compilation aborted at /Users/ed4565/Development/ninka/bin/ninka line 6.

    #   Failed test 'stdout is as expected'
    #   at t/reference_licenses.t line 25.
    #          got: ''
    #     expected: '/var/folders/t0/6dry46x961xds6f9cb87z1c580sq3r/T/YmnZ7cSBpL/AAL;UNKNOWN;0;0;0;2;7;Copyright,AllRights-TOOLONG,UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN'
    # Looks like you failed 1 test of 2.
t/reference_licenses.t ........ 1/96 
#   Failed test 'AAL'
#   at t/reference_licenses.t line 26.

(... many more like that)

Failed 3/3 test programs. 96/96 subtests failed.
make: *** [test_dynamic] Error 2

What's the fastest way to test the tool with some source code?

I don't understand the splitter/README file.

Hi,

I realize that the authors of this software are non-native English speakers. :-)

I'm hoping to get a little help understanding the splitter/README file in ninka. It says: "Before you can start using Ninka, you need to create the sentence breaking program 'splitter.pl'". But splitter.pl is an executable Perl script and doesn't need to be "created" in order to be used. What does the term "create" mean in this context? Does it mean I need to put splitter.pl someplace where ninka.pl can see it? (This seems to be the case because ninka.pl looks in $path/splitter/splitter.pl).

If this is the case, perhaps I don't need to actually create anything, and ninka.pl will find splitter.pl because it's already there? At least if you pull from GitHub.

Cheers,

Jeremiah

MPLv2.0 is reported as UNKNOWN.

It seems that ninka couldn't recognize Mozilla Public License Version 2.0.

Here's a license header from a C source file that I tried with:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

The .license file from ninka is:

UNKNOWN;0;0;0;1;1;UNKNOWN,MPLv2part2

And the .senttok file contains:

UNKNOWN;0;UNKNOWN;1;This Source Code Form is subject to the terms of the Mozilla Public License, v<dot> 2<dot>0:This Source Code Form is subject to the terms of the Mozilla Public License, v<dot> 2<dot>0.
MPLv2part2;10;;:If a copy of the MPL was not distributed with this file, You can obtain one at http<colon>//mozilla.org/MPL/2.0/.

Furthermore, if I use the standard version from the official website, which uses https instead of http in the URL, like this:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at https://mozilla.org/MPL/2.0/. */

Then the .license file turns out to be:

UNKNOWN;0;0;0;0;2;UNKNOWN,UNKNOWN

And the .senttok file becomes:

UNKNOWN;0;UNKNOWN;1;This Source Code Form is subject to the terms of the Mozilla Public License, v<dot> 2<dot>0:This Source Code Form is subject to the terms of the Mozilla Public License, v<dot> 2<dot>0.
UNKNOWN;0;UNKNOWN;1;If a copy of the MPL was not distributed with this file, You can obtain one at https<colon>//mozilla.org/MPL/2.0/:If a copy of the MPL was not distributed with this file, You can obtain one at https<colon>//mozilla.org/MPL/2.0/.

Deprecated use of regex

Ninka is giving me the following warning message:

Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/^\s*[0-9]{ <-- HERE 1-2}+\s*[-)]/ at /opt/kss/lib/perl5/site_perl/Ninka/SentenceExtractor.pm line 117.
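
The warning says the brace is "passed through", meaning it is matched literally rather than read as a quantifier. The Python sketch below contrasts that literal reading with a guess at the intended pattern. The intended form, `[0-9]{1,2}` (one or two digits), is an assumption and is not confirmed by the Ninka source. The variable names are hypothetical, and Python's `re` module stands in for Perl here.

```python
import re

# Literal reading of the pattern from the warning: a digit, the text "{1-2",
# then one or more "}" characters. Braces are escaped to make this explicit.
as_written = re.compile(r"^\s*[0-9]\{1-2\}+\s*[-)]")

# ASSUMED intent: one or two digits followed by "-" or ")", as in "12) ...".
assumed_intent = re.compile(r"^\s*[0-9]{1,2}\s*[-)]")

if __name__ == "__main__":
    for text in ["12) first item", "1{1-2}) odd input"]:
        print(repr(text), bool(as_written.match(text)), bool(assumed_intent.match(text)))
```

Under the literal reading, an ordinary numbered item such as "12) first item" does not match, which suggests the quantifier form was what the author meant.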

Ninka (comments) barfs on files containing spaces

Another issue when the 'comments' program is used: I scanned a file in a directory whose name contains a space. I encounter this quite frequently. A simple test case:

$ /gpl/ninka/ninka-2.0-pre1/bin/ninka /tmp/test/space\ test/bzip2-1.0.6/bzip2.c
execution of program [comments -c1 '/tmp/test/space\ test/bzip2-1.0.6/bzip2.c' 2> /dev/null] failed: status [1], error [] at /gpl/ninka/ninka-2.0-pre1/lib/Ninka/CommentExtractor.pm line 76.

If the comments program is not used it works.
