SPDX 2.0 document creation and storage

License: GNU General Public License v2.0

Python 97.48% Shell 2.16% TeX 0.36%

dosocsv2's Introduction

dosocs2

dosocs2 is a command-line tool for managing SPDX 2.0 documents and data. It can scan source code distributions to produce SPDX information, store that information in a relational database, and extract it in a plain-text format on request.

The discovery and presentation of software package license information is a complex problem facing organizations that rely on open source software within their innovation streams. dosocs2 enables creation of an SPDX document for any software package to represent associated license information. In addition, dosocs2 can be used in the creation and continuous maintenance of an inventory of all open-source software used in an organization. The primary audience for dosocs2 is open source software teams seeking to advance the representation and maintenance of open source software package license information.

SPDX is a standard format for communicating information about the contents of a software package, including license and copyright information. dosocs2 supports the SPDX 2.0 standard, released in May 2015.

dosocs2 is under heavy development; expect frequent backwards-incompatible changes until a 1.x.x release!

Current deviations from SPDX 2.0 specification

Exactly one package per document is required. (SPDX 2.0 allows zero or more packages per document.)
Files in a document can only exist within a package. (SPDX 2.0 allows files to exist outside of a package.)
Checksums are always assumed to be SHA-1. (SPDX 2.0 permits SHA-1, SHA-256, and MD5.)
A file may be an artifact of only one project.
License expression syntax is not parsed; license expressions are interpreted as license names that are not on the SPDX license list.
Deprecated fields from SPDX 1.2 (reviewer info and file dependencies) are not supported.

License and Copyright

dosocs2 is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version. See the file LICENSE for more details.

All associated documentation is licensed under the terms of the Creative Commons Attribution Share-Alike 3.0 license. See the file CC-BY-SA-3.0 for more details.

Dependencies

Python 2.7.x

Optional:

PostgreSQL 8.x or later (can be on a separate machine)

Python libraries:

All Python dependencies are handled automatically by pip.

Installation

Step 1 - Download and install

Grab the source tarball for the latest release and use pip to install it as a package. Replace 0.x.x with the latest release version number.

I recommend doing this inside a Python virtualenv, but it is not a requirement. If you are not inside a virtualenv you may have to run pip as root (not recommended!).

$ tar xf 0.x.x.tar.gz
$ pip install ./DoSOCSv2-0.x.x

Then run the install script for the default license scanner:

$ ./DoSOCSv2-0.x.x/scripts/install-nomos.sh

Step 2 (Optional) - Change the default configuration

Generating an initial config file is not required, but strongly recommended:

$ dosocs2 newconfig
dosocs2: wrote config file to /home/tom/.config/dosocs2/dosocs2.conf

The default config points to a SQLite database stored in your home directory. For example, for user tom, this database would be created at /home/tom/.config/dosocs2/dosocs2.sqlite3. If you like, you can open the config file and change the connection_uri variable to use a different location for the database.

Step 3 (Optional) - Add PostgreSQL configuration

Follow this step if you want to use PostgreSQL instead of SQLite for the SPDX database.

You will have to create the spdx role and database yourself (any other name works too). I recommend setting a different password from the one shown here:

$ sudo -u postgres psql
psql (9.3.9)
Type "help" for help.

postgres=# create role spdx with login password 'spdx';
CREATE ROLE
postgres=# create database spdx with owner spdx;
CREATE DATABASE

Then change the connection_uri variable in your dosocs2.conf:

# connection_uri = postgresql://user:pass@host:port/database
connection_uri = postgresql://spdx:spdx@localhost:5432/spdx
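
To sanity-check a connection_uri before handing it to dosocs2, the URI can be split into its parts with the Python 3 standard library. This is only a minimal sketch and not part of dosocs2; the helper name describe_connection_uri is made up for illustration.

```python
from urllib.parse import urlsplit


def describe_connection_uri(uri):
    """Split a URI of the form dialect://user:pass@host:port/database."""
    parts = urlsplit(uri)
    return {
        "dialect": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Hypothetical check of the example URI from the config file above.
info = describe_connection_uri("postgresql://spdx:spdx@localhost:5432/spdx")
print(info["host"], info["port"], info["database"])
```

A port of None or an empty database name is a quick hint that the URI was mistyped.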

Step 4 - Database setup

Finally, to create all necessary tables and views in the database:

$ dosocs2 dbinit

You only need to do this once. This command will drop all existing tables from your SPDX database, so be careful!

Usage

The simplest use case is scanning a package, generating an SPDX document, and printing it in one shot:

$ dosocs2 oneshot package.tar.gz
dosocs2: package.tar.gz: package_id: 1
dosocs2: running nomos on package 1
dosocs2: package.tar.gz: document_id: 1
[... document output here ...]

Also works on directories:

$ dosocs2 oneshot ./path/to/directory

The scan results and other collected metadata are saved in the database so that subsequent document generations will be much faster.

To just scan a package and store its information in the database:

$ dosocs2 scan package.tar.gz
dosocs2: package_tar_gz: package_id: 456
dosocs2: running nomos on package 456

In the default configuration, only nomos runs when no scanner is specified. It gathers license information, but is a bit slow. Use the -s option to specify explicitly which scanners to run:

$ dosocs2 scan -s nomos_deep,dummy package.tar.gz
dosocs2: package_tar_gz: package_id: 456
dosocs2: running nomos_deep on package 456
dosocs2: running dummy on package 456

After dosocs2 scan, no SPDX document has yet been created. To create one in the database (specifying the package ID):

$ dosocs2 generate 456
dosocs2: (package_id 456): document_id: 123

Then, to compile and output the document in tag-value format:

$ dosocs2 print 123
[... document output here ...]
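
The scan, generate, and print steps can also be chained from a script. The sketch below assumes that dosocs2 is on the PATH and that its status lines look like the transcripts above; whether those lines go to stdout or stderr is not stated here, so both streams are captured together. The helper names are hypothetical.

```python
import re
import subprocess


def extract_id(output, key):
    """Return the integer after 'key:' in a dosocs2 status line, or None."""
    match = re.search(r"\b%s: (\d+)" % re.escape(key), output)
    return int(match.group(1)) if match else None


def run_dosocs2(*args):
    """Run dosocs2 and return stdout and stderr combined (assumes it is on PATH)."""
    proc = subprocess.run(
        ["dosocs2"] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return proc.stdout


def scan_generate_print(path):
    """Scan a package, generate a document for it, and return the printed text."""
    package_id = extract_id(run_dosocs2("scan", path), "package_id")
    if package_id is None:
        raise RuntimeError("no package_id found in scan output")
    document_id = extract_id(run_dosocs2("generate", str(package_id)), "document_id")
    if document_id is None:
        raise RuntimeError("no document_id found in generate output")
    return run_dosocs2("print", str(document_id))
```

Because both streams are merged, the text returned by the print step may also contain status lines; a real script would want to separate them.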

Use dosocs2 --help to get the full help text. The doc directory here also provides more detailed information about how dosocs2 works and how to use it.

Potential Organizational Use of dosocs2

History

dosocs2 owes its name and concept to the DoSOCS tool created by Zac McFarland, which in turn was spun off from the do_spdx plugin for Yocto Project, created by Jake Cloyd and Liang Cao.

dosocs2 aims to fill the same role as DoSOCS, but with support for SPDX 2.x, a larger feature set, and a more modular implementation, among other changes.

Maintainers

DoSOCSv2 organization

(This work has been funded through the National Science Foundation VOSS-IOS Grant: 1122642.)

dosocsv2's Issues

Update maintainer list

With my departure from the team, the maintainer info in README.md needs to be updated to point to Uday and Josiah instead of me.

Add logic for generating tag-value document output

We have a template for tag-value documents already. Just need to write the function (and any needed database views) to query the database for the relevant information.

Write SQLAlchemy queries for DESCRIBES, etc

The move to database-agnostic queries means we need to redo the auto-create code for:
DESCRIBES
DESCRIBED_BY
CONTAINS
CONTAINED_BY

Add a true document cache

Since it is not trivial to render a document from its relational form in the database, we can improve performance even more by caching compiled document texts. Assuming we cache both tag and RDF documents, I imagine this requiring two tables (seen here in pseudo-SQL):

create table document_cache (
    document_cache_id    serial,
    document_text        text not null,
    document_type        integer not null,
    document_id          integer not null,
    sha1                 char(40),
    primary key (document_cache_id),
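    -- one cached text per document and output type, so tag and RDF can coexist
    constraint uc_document_cache_document_id_document_type unique (document_id, document_type),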
    foreign key (document_id) references documents (document_id),
    foreign key (document_type) references document_types (document_type_id)
);

create table document_types (
    document_type_id    serial,
    name                text not null,
    primary key (document_type_id),
    constraint uc_document_type_name unique (name)
);

Switch to a command-style interface

dosocs2 does different things that accept different options. Something like this might be more intuitive to use:

Usage:
dosocs2 scan [options] (PATH)
dosocs2 print [options] (DOC-ID)
dosocs2 generate [options] (PACKAGE-ID)
dosocs2 init [--no-confirm]
dosocs2 --help

[... options stuff ...]

It would probably be nice, also, to have a command that will scan a package, generate a document, and dump the document to stdout, all in one shot, using cached data whenever it is available.

Set up as a python package

Add the needed files (setup.py, etc) to make this into a proper Python package.

Support directory scans

We can scan a package, i.e., a single archive file. But it would be nice to be able to scan directories without creating a tarball out of them first.

Missing relationship comment

Per SPDX 2.0 section 6.2, a relationship may have an associated comment.

Python 3 compatibility

Eventually I want to ensure compatibility with both Python 2 and Python 3. It's a big effort that will require a lot of testing and certainly some code changes, though I have no idea how many.

Implement relationship COPY_OF

This one should be easy since we already store all file and package SHA1s in the database.

Create and drop tables/views in dbinit are not in a transaction

Since PostgreSQL allows manipulating the schema without auto-committing, we should take advantage of this by wrapping the drop/create stuff in a transaction when dosocs2 --init is called.

Detect more file types

Currently the code (in util.py) only detects those file types that were in SPDX 1.2 -- 'ARCHIVE', 'SOURCE', 'BINARY' and 'OTHER'. There are many more (per SPDX 2.0 section 4.3) that we can try to detect. Not sure which ones are feasible. Some research is needed.

Get rid of the nomos parallelism option

Confirmed broken here:
fossology/fossology#396

Will have to use an alternate method of improving performance.

Include friendly short name in generated license ID strings

Instead of, say, LicenseRef-20ce2a9f-9399-4bd9-8bc1-3a70b56bc7a0, generate as LicenseRef-MIT-style-20ce2a9f, to make it somewhat human-readable.
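
One possible way to build such an ID is sketched here, assuming the short name comes from the scanner and that the first eight hex digits of the existing UUID are kept as a suffix. The function name is hypothetical and not taken from the dosocs2 code.

```python
import re
import uuid


def friendly_license_ref(short_name, license_uuid):
    """Build 'LicenseRef-<short name>-<first 8 hex digits of the UUID>'."""
    # SPDX license identifiers may contain only letters, digits, '.' and '-'.
    safe_name = re.sub(r"[^A-Za-z0-9.-]+", "-", short_name).strip("-")
    return "LicenseRef-%s-%s" % (safe_name, license_uuid.hex[:8])


ref = friendly_license_ref("MIT-style", uuid.UUID("20ce2a9f-9399-4bd9-8bc1-3a70b56bc7a0"))
print(ref)  # LicenseRef-MIT-style-20ce2a9f
```

Truncating the UUID makes collisions more likely than with the full value, so uniqueness would still have to be enforced in the database.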

Scrape license list version and include in database

The license list version is included on https://spdx.org/licenses/ from which the license list itself is scraped. We need to pull this number into the database along with the licenses themselves so that the license list version field in generated documents can be filled dynamically.

Verify `gen_ver_code` against the reference implementation

gen_ver_code is an included function, but I don't know if it is implemented correctly. It can be verified against the reference implementation in the SPDX 2.0 tools. (https://spdx.org/tools/spdx/consolidated-spdx-tools-and-library)
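
For comparison, here is a small sketch of the package verification code algorithm as I understand it from the SPDX specification: take the SHA-1 of every file that is not excluded, sort the hex digests, concatenate them without separators, and take the SHA-1 of the result. This is not the dosocs2 gen_ver_code function, and the names below are illustrative only.

```python
import hashlib


def package_verification_code(file_sha1s):
    """Compute a verification code from per-file SHA-1 hex digests."""
    # Sort the lowercase hex digests, join them with no separator,
    # then hash the joined string with SHA-1.
    ordered = sorted(digest.lower() for digest in file_sha1s)
    return hashlib.sha1("".join(ordered).encode("ascii")).hexdigest()


def sha1_of_bytes(data):
    """SHA-1 hex digest of a file's contents, given as bytes."""
    return hashlib.sha1(data).hexdigest()


# Two made-up file contents; excluded files would simply be left out of the list.
digests = [sha1_of_bytes(b"file one"), sha1_of_bytes(b"file two")]
code = package_verification_code(digests)
print(code)
```

Because the digests are sorted first, the result does not depend on the order in which files are visited.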

Document produced from directory scan (not package scan) has package SHA-1 of "None"

...that is, the string "None", not the Python object None. The correct behavior would be to leave out the PackageChecksum field from the document when there is no checksum.
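
The string most likely appears because formatting the Python object None with %s yields the text "None". A minimal sketch of the intended behavior, with a hypothetical helper that is not part of the dosocs2 templates, might look like this:

```python
def package_checksum_line(sha1):
    """Return the PackageChecksum tag line, or None when there is no checksum."""
    if sha1 is None:
        return None
    return "PackageChecksum: SHA1: %s" % sha1


# The bug in miniature: formatting None produces the text 'None'.
buggy = "PackageChecksum: SHA1: %s" % None

# The fix: drop the line entirely when there is no checksum.
lines = ["PackageName: example", package_checksum_line(None)]
document = "\n".join(line for line in lines if line is not None)
print(document)
```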

Add unique constraint on `sha1` column to `files` and `packages`

We don't want two records with the same sha1 to exist in either of these tables -- the relevant row should be updated rather than a new one inserted.

Implement relationships CONTAINS and CONTAINED_BY

We have this information for packages already. Just need to have the program create these relationship rows when a document is created.

Factor out scanner base classes into new module

To avoid circular importing when adding new scanners, need to make a new module for the scanner base classes (Scanner, FileLicenseScanner, etc), and put each "real" scanner (e.g. nomos) in its own module.

Support ignore regex

The feature allowing one to specify a regex matching files for the scanner to ignore was removed previously due to difficulties in proper implementation; these difficulties have been resolved and this feature can go back in.

Support recursive unpack

A directory or archive may have other archives (tar, zip, jar) in it; we would like to unpack these during scanning, rather than treat them as monolithic files.

Constraint uc_package_id_file_id_file_name is overkill

In table packages_files:

create table if not exists
packages_files (
    package_file_id         serial,
    package_id              integer not null,
    file_id                 integer not null,
    concluded_license_id    integer,
    license_comment         text not null,
    file_name               text not null,
    primary key (package_file_id),
    constraint uc_package_id_file_id_file_name unique (package_id, file_id, file_name),
    foreign key (package_id) references packages (package_id),
    foreign key (file_id) references files (file_id),
    foreign key (concluded_license_id) references licenses (license_id)
);

We have a unique index over (package_id, file_id, file_name). To pull in all three columns is unnecessary. It would be sufficient to just have a unique index over (package_id, file_name).

Add basic install and usage instructions

Currently nothing in the README about how to install and use.

Add name of file operated on to output

So, instead of this:

$ dosocs2 test.tar.bz2
package_id: 123

We would have this:

$ dosocs2 test.tar.bz2
test.tar.bz2: package_id: 123

$ dosocs2 -c 123
(package_id 123): document_id: 456

Or something similar.

Implement relationship DESCRIBES

We can create one DESCRIBES relationship record for each file and package that are contained in a document.

Create table documents_packages

SPDX 2.0 allows a many-to-many relationship between documents and packages. Right now the relationship is a one-to-many (a document can contain only one package). We can move the foreign key package_id out of documents into a new junction table documents_packages to support this.

Write a parser for nomossa highlight info output

When run with the -S option, Nomossa produces some output that shows where each license was found in the input file:

File DoSOCSv2/dosocs2/fixtures/licenses.json contains license(s) ANTLR-PD,Adobe-SCLA,Apache-2.0,BSL-1.0,CECILL-B,CUA-OPL-1.0,ClArtistic,D-FSL-1.0,GFDL-1.3,IJG,Intel,LGPL-2.1+,MirOS,NCSA,NPOSL-3.0,NTP,ODbL-1.0,Python,SGI-B-1.0,SNIA,SugarCRM-1.1.3,W3C,YPL-1.0,ZPL-2.1,Zimbra-1.3,Zlib Highlighting Info at Keyword at 66954, length 15, index = 0, Keyword at 67053, length 15, index = 0, Keyword at 67099, length 15, index = 0, Keyword at 2580, length 9, index = 0, Keyword at 11522, length 9, index = 0, Keyword at 11761, length 9, index = 0, Keyword at 12000, length 9, index = 0, Keyword at 12239, length 9, index = 0, Keyword at 12478, length 9, index = 0, Keyword at 12710, length 9, index = 0, Keyword at 13833, length 9, index = 0, Keyword at 34939, length 9, index = 0, Keyword at 35582, length 9, index = 0, Keyword at 43138, length 9, index = 0, Keyword at 56949, length 9, index = 0, Keyword at 3663, length 2, index = 0, Keyword at 12619, length [...]

We need to parse this, use it to get the "ExtractedText" from the input file, and store the extracted text in the database for that file/license. That will take care of #30 also.
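
A first pass at such a parser could be a single regular expression over the "Keyword at ..." entries. The sketch below assumes the entry format shown above and, more speculatively, that each offset and length index directly into the file's contents; both function names are hypothetical.

```python
import re

KEYWORD_RE = re.compile(r"Keyword at (\d+), length (\d+), index = (\d+)")


def parse_highlights(nomos_output):
    """Return one (offset, length, index) tuple per 'Keyword at' entry."""
    return [
        tuple(int(group) for group in match.groups())
        for match in KEYWORD_RE.finditer(nomos_output)
    ]


def extract_spans(contents, highlights):
    """Slice the highlighted spans out of the file contents (assumed offsets)."""
    return [contents[offset:offset + length] for offset, length, _ in highlights]


sample = "Keyword at 66954, length 15, index = 0, Keyword at 2580, length 9, index = 0,"
highlights = parse_highlights(sample)
print(highlights)  # [(66954, 15, 0), (2580, 9, 0)]
```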

Support sqlite3 database

One may not want to set up a PostgreSQL database. The sqlite3 connector is part of the Python standard library; it might be useful to support this for doing "one shot" SPDX document generation without having to set up a database server.

Since we're using an ORM, this just means creating new SQL scripts for drop/create. The views should not need modification but the table create script will.

Document name for package 'yocto-spdx.tar.bz2' is generated as 'yocto-spdx.tar'

Due to the use of Python's os.path.splitext(), we only get the '.bz2' chopped off. The ideal document name is 'yocto-spdx'.
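
One way around this is to strip known compound archive suffixes before falling back to os.path.splitext(). The suffix list and the function name below are assumptions made for illustration, not the actual dosocs2 code.

```python
import os

# Hypothetical list of archive suffixes to strip as a whole.
ARCHIVE_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".tar", ".zip", ".jar")


def friendly_document_name(file_name):
    """Strip a known archive suffix, else fall back to os.path.splitext()."""
    base = os.path.basename(file_name)
    for suffix in ARCHIVE_SUFFIXES:
        if base.lower().endswith(suffix):
            return base[:-len(suffix)]
    return os.path.splitext(base)[0]


print(os.path.splitext("yocto-spdx.tar.bz2")[0])     # yocto-spdx.tar
print(friendly_document_name("yocto-spdx.tar.bz2"))  # yocto-spdx
```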

Tag documents are generated with empty ExtractedText fields

Example:

LicenseID: LicenseRef-Python
LicenseName: Python
ExtractedText:
LicenseCrossReference:
LicenseComment: <text>found by nomos</text>

This is technically invalid. The ExtractedText field should at least have the <text> tags.

UPDATE 6/17/2015: Even adding the tags is insufficient... the document still fails validation because this field needs to actually have something in it; it needs to be populated with license text. This is a harder problem.

Friendlier SPDXRef strings

Our identifier strings look like "SPDXRef-b8564831-ebc4-44a8-8c66-c8d2511b6f19". This is bad for humans. We could at least include some friendly information in this string instead of a UUID. Like file name, package name, SHA1, whatever...

Add document listing command

I'm thinking something like dosocs2 alldocs printing a pretty table with document ID, name, and creation date, for each document in the database.

Need support for scanning packages through FOSSology job scheduler

Currently using nomos directly to scan files. cp2foss or a similar tool to load files into FOSSology offers numerous advantages and should probably be the default back-end.

UPDATE 6/9/2015: This problem is turning out to be quite difficult because of the limitations of the two main ways one can interact with FOSSology:

Web UI
cp2foss on the FOSSology host machine

Only the second one can be automated; only the first one can be used over a network. We want both capabilities.

Add a unique constraint on `(document_namespace, id_string)`

In the identifiers table, it is possible to have a document namespace with the same SPDX identifier string ('SPDXRef-whatever') duplicated. This is bad. Solution is to create a unique index on these two columns together.

Add debug output option

Something like a -v or --verbose flag to get all the juicy details during operations like package scanning, document creation, etc.

Get rid of two-column primary keys

There are a couple of junction tables that have a two-column primary key. It would be better to add a single, dedicated primary key column to these tables.

SQL drop/create scripts duplicated between here and spdx2.0-schema repo

The exact same scripts used here for dropping and creating tables for the spdx20 database are "officially" housed in another repo. Every time I update them there I'm copying them over to here. There should be some sensible way to avoid this duplication.

Create table documents_files

SPDX 2.0 allows a document to contain any number of files that are not within a package. We can create a documents_files junction table to support this feature.

This and the lack of a documents_packages table are huge pieces of technical debt that will continue to haunt us until fixed!

Directory scans do not use package-level cache

For a package file (jar, tar, etc.) SHA-1 is good enough to determine that we've already scanned it. For directory scans there is no equivalent, so scanning the same directory tree multiple times will create multiple package records. We need some way to uniquely identify directories. Maybe package verification code plus hash of directory listing?

Create table license_lists

Support updating the license list by adding a table license_lists with column version, then refer to that table from the licenses and documents tables.

Move to a standard versioning scheme

The v0.xxx format is silly and poorly thought out on my part; not to mention, it doesn't follow the standard for Python packages, which is v0.x.x (two dots). So the first version under the new numbering scheme would be v0.1.0.

Support adding new scan to existing packages/files

That is, we should be able to add file-license relationships AFTER file info has been stored in the database. For performance reasons this will likely require two new tables:

create table scanners (
    -- ...
);

create table scans (
    -- ...
    file_license_id    integer not null,
    scanner_id         integer not null,
    created_ts         timestamp with time zone not null
    -- ...
);

More specifically, if we scan a file with, say, nomos, and then we request a scan again with nomos, we should be able to use cached results, unless a new scan is explicitly requested.

Client-server capability

Startup overhead for dosocs2 is pretty large for a command line app. A daemon mode would be lovely--I'm thinking something similar to emacs --daemon and emacsclient. Maybe with a Unix socket as IPC? (That or TCP/IP socket allowing communication over a network.)

This is a big one and will probably take a lot of time and effort to get right... if it is even a good idea. Perhaps it would be better to somehow attack the root problem of slow startup time.

Created timestamp in incorrect format

Refer to SPDX 2.0 spec:

2.9.4 Data Format: YYYY-MM-DDThh:mm:ssZ
where:
 YYYY is year
 MM is month with leading zero
 DD is day with leading zero
 T is delimiter for time
 hh is hours with leading zero in 24 hour time
 mm is minutes with leading zero
 ss is seconds with leading zero
 Z is universal time indicator

Currently timestamps look like this:
2015-06-16 12:09:53.533268
or
YYYY-MM-DD hh:mm:ss.uuuuuu
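
The required format can be produced with strftime. This sketch assumes the timestamp is a naive datetime that is already in UTC; the function name is illustrative.

```python
from datetime import datetime


def spdx_created(ts):
    """Format a UTC datetime as YYYY-MM-DDThh:mm:ssZ, dropping microseconds."""
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")


created = spdx_created(datetime(2015, 6, 16, 12, 9, 53, 533268))
print(created)  # 2015-06-16T12:09:53Z
```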

Create RDF template

We have a tag-value template already. Only thing needed to support RDF output is to create a template for it.

Nomossa scanner

Maybe you already know this, but apparently you can build nomos as a standalone (separate from fossology). It appears to have the same output as nomos, without all of the requirements of postgres, and the UI components.

Source of this knowledge:
https://github.com/fossology/fossology/blob/master/src/nomos/agent/Notes

I built the scanner and played with it a bit. Seemed to work fine, and now I can use the 500k executable instead of the entire FOSSology install.

This is more general information than an issue, but it might make the dependencies a bit more manageable.

Side note: I haven't tried doing the same with monk yet.

License scanner fails when no ignore pattern is provided in config file

Was fixed with 03fd3f2, awaiting merge into master.

Document creation from command line allows nonexistent package id

$ ./dosocs2 -c 1234
Traceback (most recent call last):
  File "./dosocs2", line 194, in <module>
    main()
  File "./dosocs2", line 178, in main
    document = spdx.create_document(package_id)
  File "/home/tgurney/gitstuff/dosocs2/src/spdxdb.py", line 192, in create_document
    doc_name = kwargs.get('name') or util.package_friendly_name(package.file_name)
AttributeError: 'NoneType' object has no attribute 'file_name'

"Key (document_namespace_id, file_id)=(...) already exists", when same file appears twice in a package

Scan a package containing nothing except two empty files.

$ ./dosocs2 two-empties.tar
package_id: 1

Attempt to create a document using the same package.

$ ./dosocs2 -c 1
[...]
[SQL: 'INSERT INTO identifiers (document_namespace_id, id_string, document_id, package_id, file_id) VALUES (%(document_namespace_id)s, %(id_string)s, %(document_id)s, %(package_id)s, %(file_id)s) RETURNING identifiers.identifier_id'] [parameters: {'document_namespace_id': 3, 'package_id': None, 'id_string': 'SPDXRef-7a2cf822-e79e-4695-b82f-b9eefb60421b', 'file_id': 906, 'document_id': None}]

Caused by the unique index on (document_namespace_id, file_id) in the identifiers table.