
project-kb's Introduction

Project KB


Description

The goal of Project KB is to enable the creation, management and aggregation of a distributed, collaborative knowledge base of vulnerabilities affecting open-source software.

Project KB consists of a vulnerability knowledge base as well as a set of tools to support the mining, curation, and management of such data.

Motivations

In order to feed Eclipse Steady with fresh data, we have spent a considerable amount of time over the past few years mining and curating a knowledge base of vulnerabilities that affect open-source components. We know that other parties have been doing the same, in academia as well as in industry. From this experience, we have learnt that, with the growing size of open-source ecosystems and the pace at which new vulnerabilities are discovered, this manual approach cannot scale. We are also increasingly convinced that vulnerability knowledge bases about open source should be open source themselves and adopt the same community-oriented model that governs the rest of the open-source ecosystem.

These considerations pushed us to release our vulnerability knowledge base in early 2019. In June 2020, we took a further step by releasing the kaybee tool to make the creation, aggregation, and consumption of vulnerability data much easier. In late 2020, we also released, as a proof of concept, the prototype prospector, whose goal is to automate the mapping of vulnerability advisories onto their fix commits.

We hope this will encourage more contributors to join our efforts to build a collaborative, comprehensive knowledge base where each party remains in control of the data they produce and of how they aggregate and consume data from the other sources.

Kaybee

Kaybee is a vulnerability data management tool. It makes it possible to fetch vulnerability statements from this repository (or from any other repository) and export them to a number of formats, including a script to import them into a Steady backend.

For details and usage instructions check out the kaybee README.

Prospector

Prospector is a vulnerability data mining tool that aims at reducing the effort needed to find security fixes for known vulnerabilities in open source software repositories. The tool takes a vulnerability description (in natural language) as input and produces a ranked list of commits, in decreasing order of relevance.

For details and usage instructions check out the prospector README.

Vulnerability data

The vulnerability data of Project KB are stored in textual form as a set of YAML files, in the vulnerability-data branch.
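For illustration, here is a minimal sketch (Python) of how such files could be loaded for further processing, assuming a local checkout of the vulnerability-data branch; the directory layout and the vulnerability_id field used below are assumptions, not the authoritative statement schema.

```python
# Minimal sketch: iterate over statement files in a local checkout of the
# vulnerability-data branch. The directory layout and the field names below
# are assumptions for illustration, not the authoritative schema.
from pathlib import Path

import yaml  # pip install pyyaml


def load_statements(root: str):
    """Yield (vulnerability_id, parsed_statement) pairs."""
    for path in Path(root).rglob("*.yaml"):
        with path.open() as f:
            statement = yaml.safe_load(f)
        if isinstance(statement, dict):
            # Fall back to the parent folder name if the field is absent.
            yield statement.get("vulnerability_id", path.parent.name), statement


if __name__ == "__main__":
    for vuln_id, stmt in load_statements("./vulnerability-data"):
        print(vuln_id, sorted(stmt.keys()))
```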

Publications

In early 2019, a snapshot of the knowledge base from Project KB was described in:

If you use the dataset for your research work, please cite it as:

@inproceedings{ponta2019msr,
    author={Serena E. Ponta and Henrik Plate and Antonino Sabetta and Michele Bezzi and Cédric Dangremont},
    title={A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software},
    booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
    year={2019},
    month={May},
}

MSR 2019 DATA SHOWCASE SUBMISSION: please find here the data and the scripts described in that paper

If you wrote a paper that uses the data or the tools from this repository, please let us know (through an issue) and we'll add it to this list.


Credits

EU-funded research projects

The development of Project KB is partially supported by the following projects:

Vulnerability data sources

Vulnerability information from NVD and MITRE might have been used as input for building parts of this knowledge base. See MITRE's CVE Usage license for more information.

Limitations and Known Issues

This project is a work in progress; you can find the list of known issues here.

Currently the vulnerability knowledge base only contains information about vulnerabilities in Java and Python open source components.

Support

For the time being, please use GitHub issues to report bugs, request new features and ask for support.

Contributing

See How to contribute.

project-kb's People

Contributors

amilankovich-slab, bvwells, chicxurug, copernico, daanhommersom, geryxyz, gitter-badger, henrikplate, ichbinfrog, idarav, jonathanbaker7, matteogreek, naramsim, riruk, sacca97, serenaponta, sumeetpatil


project-kb's Issues

Document usage scenarios and corresponding workflows

Creation of statements

As a user who has found information about a particular vulnerability, I want to create a corresponding statement; after checking whether other statements about that vulnerability already exist in one of my sources, I want the statement to be validated and then published to my own vulnerability repository.
After that, I also want to produce a script that imports this new information, reconciled with other statements about the same vulnerability (if they exist), into my Steady backend.

Examples (tentative):

kaybee lookup <VULN_ID>
kaybee create <VULN_ID>
kaybee push --validate
kaybee export  --target steady   # this produces steady.sh
tail -n 200 .kaybee/mergelog
bash ./steady.sh

Automated aggregation of vulnerability data

In an automated script, run periodically, all sources are to be pulled and merged using a suitable policy; for the statements that can be automatically reconciled, a script to import into Steady must be generated and run automatically. For the statements that could not be reconciled, a notification must be produced to trigger manual intervention.

# in crontab
kaybee pull && kaybee merge
kaybee prune -r 30   # purge stale local clones
kaybee export --target steady && bash ./steady.sh

Manual reconciliation

After receiving a notification that a certain group of statements could not be reconciled, I want to inspect the MergeLog to determine the reason. I can then craft a new statement that overrides the necessary elements of the existing (third-party) statements and push it to my own statement repository.

tail -n 200 .kaybee/mergelog
kaybee merge -p manual
kaybee push --validate

GH build action not working

For some reason, some (transitive?) dependencies are failing the build. Note: I cannot reproduce locally.

Commit pre-processor module

Extraction of the existing commit pre-processing (feature extraction) code and its restructuring into a separate module.

Integrate NVD feed endpoint

Re-use this implementation that is part of eclipse/steady to have a local replica of the NVD feeds.

Heads-up: most likely, steady will drop that module in the future; we can just clone it into project-kb and continue from there.

Implement command to create a new statement

A command like create-statement should provide a UI to guide the user in entering the necessary data. This could be implemented as a multi-step form, on the CLI, in a browser-based app, or both.

kaybee import not working

Command with error -

kaybee import -c kaybeeconf.yaml -v
Using config file: /Users/I334616/fork/project-kb/kaybee/testdata/conf/kaybeeconf.yaml
Importing vulnerability data from http://localhost:8033/backend/ (using 0 workers)
2020/06/26 09:37:13 json: cannot unmarshal object into Go value of type []*model.Bug

build documentation automatically

At each PR and push, make sure that the documentation (gh-pages) is built automatically and correctly (optional: check that it contains no dead links).

kaybee config for steady generates java properties file rather than json

kaybee config for steady generates a Java properties metadata file for commits rather than a JSON file.
Example -
kaybee export -c kaybeeconf_steady.yaml -t steady -f .kaybee/imported/HTTPCLIENT-1803/statement.yaml -o test2.sh
./test2.sh
cd HTTPCLIENT-1803
tree -L 2

.
├── 0554271750599756d4946c0d7ba43d04b1a7b2
│   ├── after
│   ├── before
│   └── meta.properties   // this should be a JSON file
└── metadata.json

ensure that conflicting statements are reconciled only once

Scenario

A statement s_1 and a statement s_2, from sources S_1 and S_2 respectively, are conflicting. With some policy (or via manual intervention) they are reconciled and the result is statement s_3.

The next day, kaybee merge is run again: how should the same conflict be handled? Shall we keep a pointer to the last commit that was considered from each source, so that reconciliation is not done again on the exact same set of conflicting statements? (A sketch of this idea follows.)
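One way to implement the "pointer to the last commit considered from each source" idea is a small watermark file updated after every successful merge. The sketch below (Python) is only an illustration; the file name, location, and structure are hypothetical.

```python
# Sketch: remember, per source, the last commit whose statements have already
# been reconciled, so a later merge can skip sources that did not change.
# The file name and structure are hypothetical.
import json
from pathlib import Path

WATERMARK_FILE = Path(".kaybee/merge_watermarks.json")


def load_watermarks() -> dict:
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())
    return {}


def needs_reconciliation(source_url: str, head_commit: str) -> bool:
    """True if the source has advanced past the commit seen at the last merge."""
    return load_watermarks().get(source_url) != head_commit


def record_merge(source_url: str, head_commit: str) -> None:
    marks = load_watermarks()
    marks[source_url] = head_commit
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps(marks, indent=2))
```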

kaybee setup fails

Kaybee setup fails with the error below.
Tested on macOS.

kaybee setup
Running setup...
Non-interactive mode
[+] Running Setup task
2020/06/26 14:15:52 stat /home/*******/devel/project-kb/kaybee/internal/tasks/data/default_config.yaml: no such file or directory

Design ML subsystem (model training + prediction)

Please refer to the diagram below:

[Diagram: prospector-assuremoss]

Note: the label on the database at the bottom of the diagram is wrong: its contents are pre-processed commits, not vectorial representations of commits (which, for most features, depend on the query).

Validate kaybee statements

Kaybee statements should be validated automatically when a pull request is created.
I can think of these basic validations (a rough sketch follows the list):

  1. Validate Git Commit Id
  2. Validate Git branch
  3. Validate PURL for affected artifacts
  4. Validate Git/Svn Repository
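A rough sketch of what such checks could look like (Python); the statement field names and the commit-id/PURL patterns used here are assumptions for illustration, and a real implementation would also verify the data against the actual repositories.

```python
# Sketch of basic, purely syntactic checks on a statement.
# Field names are assumptions; a real validator should also query the
# repository (does the commit/branch actually exist?) before accepting a PR.
import re

SHA1_RE = re.compile(r"^[0-9a-f]{7,40}$")
# Very loose PURL shape check: pkg:type/namespace/name@version
PURL_RE = re.compile(r"^pkg:[a-z0-9.+-]+/.+")


def validate_statement(statement: dict) -> list:
    errors = []
    for fix in statement.get("fixes", []):
        for commit in fix.get("commits", []):
            if not SHA1_RE.match(commit.get("id", "")):
                errors.append(f"invalid commit id: {commit.get('id')!r}")
            if not commit.get("repository", "").startswith(("http", "git@")):
                errors.append(f"suspicious repository URL: {commit.get('repository')!r}")
    for artifact in statement.get("artifacts", []):
        if not PURL_RE.match(artifact.get("id", "")):
            errors.append(f"invalid PURL: {artifact.get('id')!r}")
    return errors
```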

kaybee setup requires config file

The kaybee setup command is supposed to generate a config file; however, it won't run, complaining that no config file exists (unless one already does).

[prospector] Improve getting-started docs

There are several steps that are not documented (including the installation of dependencies that are not in the requirements.txt file):

  • sklearn
  • spacy (+language model to be installed separately)
  • matplotlib
  • streamlit
  • (....)

Specify expected behaviour of a soft policy

The strict policy refuses to reconcile statements about the same vulnerability that come from different sources.

We need another policy that is able to automatically reconcile statements about the same vulnerability.

What is the expected behaviour of such a policy? (One possible interpretation is sketched below.)
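One possible interpretation, purely to make the question concrete: a soft policy could take each conflicting scalar field from the highest-ranked source and merge list-valued fields (e.g. references or notes) as a union. A sketch in Python, with hypothetical field names:

```python
# Sketch of one possible "soft" reconciliation policy: scalar fields are taken
# from the highest-ranked source, list-valued fields are merged as a union.
# The 'source_rank' key is hypothetical.
def soft_merge(statements: list) -> dict:
    """statements: dicts about the same vulnerability, each carrying a 'source_rank'."""
    ordered = sorted(statements, key=lambda s: s.get("source_rank", 0), reverse=True)
    merged = dict(ordered[0])  # start from the most trusted source
    for other in ordered[1:]:
        for key, value in other.items():
            if key not in merged:
                merged[key] = value
            elif isinstance(merged[key], list) and isinstance(value, list):
                merged[key] = merged[key] + [v for v in value if v not in merged[key]]
            # conflicting scalar values: keep the higher-ranked one already in 'merged'
    return merged
```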

Make git parsing more efficient

The current bottleneck in Prospector's runtime is parsing the .git folder into an SQL database. We might take a look at other solutions that parse .git folders, try them out, and learn from them to improve the processing speed of this initial step (a baseline sketch follows the list). Example projects:

  • gitbase seems abandoned
  • gitql looks good, but probably not optimized
  • askgit seems promising, adds github API support (in beta)
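As a baseline to compare such tools against, here is a minimal sketch (Python, no third-party git library) that streams the output of a single git log invocation and bulk-inserts the commit metadata into SQLite; it illustrates the approach, not Prospector's actual code:

```python
# Sketch: parse commit metadata with a single `git log` invocation and
# bulk-insert it into SQLite, avoiding per-commit subprocess calls.
import sqlite3
import subprocess


def index_repository(repo_path: str, db_path: str = "commits.db") -> None:
    fmt = "%H%x1f%an%x1f%at%x1f%s"  # hash, author, timestamp, subject, 0x1f-separated
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{fmt}"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split("\x1f") for line in out.splitlines() if line]

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits "
        "(id TEXT PRIMARY KEY, author TEXT, timestamp INTEGER, message TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO commits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```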

Use "reuse-tool" for licensing info

Generate license and copyright information with the 'Reuse' tool of the Free Software Foundation Europe (FSFE). To adopt external standards and best practices, this tool must be used to generate the copyright and license information.

TODOs:

  • Generate the required information with the 'Reuse' tool as described here.
  • Remove the notices file which was used in the past to store the SAP copyright (the file might be named NOTICES or NOTICES.txt).
  • Remove the License section from the README.md file.

check if a new release exists

kaybee could check whether a newer version exists on GitHub.
Should this be a separate command? Should it also download the new release?
Of course, the user should be able to opt out of / disable the check. (A minimal sketch of such a check follows.)
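A minimal sketch of such a check against the GitHub releases API (Python); the version comparison is deliberately simplified (it only detects a different tag) and the current-version string is a placeholder:

```python
# Sketch: ask the GitHub API for the latest release tag and compare it with
# the running version. Network errors are swallowed on purpose: an update
# check must never break the tool itself.
import json
import urllib.request

RELEASES_URL = "https://api.github.com/repos/sap/project-kb/releases/latest"


def newer_release_available(current_version: str) -> bool:
    try:
        with urllib.request.urlopen(RELEASES_URL, timeout=5) as resp:
            latest = json.load(resp).get("tag_name", "").lstrip("v")
    except Exception:
        return False  # offline or rate-limited: silently skip the check
    return latest != "" and latest != current_version.lstrip("v")


# e.g. newer_release_available("0.6.15") -> True if a different tag was published
```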

extend source specification to include a path

Currently a source is a 3-tuple:

  • repository url
  • branch
  • rank

It should become a 4-tuple and include also:

  • path

Example:

sources:
  - repo: https://github.com/sap/project-kb
    path: vulnerability-data
    branch: master
    rank: 10

This will make it possible to point to specific folders in the git repository and to have separate "sources" in the same branch of the same repo.

database creation fails

If the binary files of the sqlite DB are not present (they should not be in Git), Prospector fails to create them.

Can't run the same analysis twice

When processing the same CVE twice in a row, the second run aborts with the following:

Traceback (most recent call last):
  File "main.py", line 220, in <module>
    plac.call(main)
  File "/home/i064196/.local/share/virtualenvs/prospector-_0gS2ZiV/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/i064196/.local/share/virtualenvs/prospector-_0gS2ZiV/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "main.py", line 153, in main
    advisory_record.gather_candidate_commits()
  File "/home/i064196/devel/project-kb/prospector/rank.py", line 781, in gather_candidate_commits
    self.validate_database_coverage()
  File "/home/i064196/devel/project-kb/prospector/rank.py", line 751, in validate_database_coverage
    database.add_repository_to_database(self.connection, self.repo_url, self.project_name, verbose=self.verbose)
  File "/home/i064196/devel/project-kb/prospector/database.py", line 439, in add_repository_to_database
    {'repo_url':repo_url, 'project_name':project_name})
sqlite3.IntegrityError: UNIQUE constraint failed: repositories.repo_url
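A hedged sketch of the kind of change that would make the second run idempotent (Python/sqlite3); the table and column names are taken from the error message above, everything else is illustrative:

```python
# Sketch: make the repository insert idempotent so that re-running the
# analysis on the same CVE does not trip the UNIQUE constraint.
import sqlite3


def add_repository_if_missing(conn: sqlite3.Connection, repo_url: str, project_name: str) -> None:
    # INSERT OR IGNORE leaves the existing row untouched if repo_url is already present.
    conn.execute(
        "INSERT OR IGNORE INTO repositories (repo_url, project_name) VALUES (?, ?)",
        (repo_url, project_name),
    )
    conn.commit()
```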

New export target to generate changed source code

Add a new target to kaybee export, e.g. create-changed-source-code, to generate a script that:

  • clones the repositories and creates the before and after folders for each vulnerability
  • compresses the created folders into an archive 'changed-source-code.tar.gz'

Note: the commit folders containing before and after should be removed after being compressed.

Compared to the existing steady target, neither metadata.json nor an invocation of kb-importer is required. (A rough sketch of what the generated script could do follows.)
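A rough sketch of what the generated script could do (Python, shelling out to git); the output layout and function name are assumptions:

```python
# Sketch: materialise the "before" and "after" versions of the files touched by
# each fix commit, pack everything into changed-source-code.tar.gz, and then
# remove the intermediate folders (as noted above). Paths are illustrative.
import shutil
import subprocess
import tarfile
from pathlib import Path


def export_changed_sources(repo: str, commits: list, out_dir: str = "export") -> None:
    out = Path(out_dir)
    for commit in commits:
        changed = subprocess.run(
            ["git", "-C", repo, "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        for rev, label in [(f"{commit}^", "before"), (commit, "after")]:
            for path in changed:
                target = out / commit / label / path
                target.parent.mkdir(parents=True, exist_ok=True)
                blob = subprocess.run(
                    ["git", "-C", repo, "show", f"{rev}:{path}"],
                    capture_output=True, check=False,
                )
                if blob.returncode == 0:  # the file may not exist in one of the two revisions
                    target.write_bytes(blob.stdout)

    with tarfile.open(out / "changed-source-code.tar.gz", "w:gz") as tar:
        for commit in commits:
            tar.add(out / commit, arcname=commit)
    for commit in commits:
        shutil.rmtree(out / commit)  # drop the folders once archived
```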

Provide support for interactive statement reconciliation

When merge cannot reconcile conflicting statements, the user will be asked to reconcile them manually.
This could be a separate command, such as kaybee reconcile <vuln-id>, or just a special policy that can be invoked as usual:
kaybee merge -p interactive.

Move go.mod in `kaybee` folder

Clean up the top-level folder and make it independent of kaybee and prospector (no go or python stuff).
Currently the go.mod file of kaybee is in the root folder.

Refactoring: move the export scripts out of the main config file

We could keep in the config file just a setting that says in which folder the tool can find "exporter definitions"; these would be separated into one file per target, so that the main configuration is much leaner. Also, the "standard" exporters could be packaged with the binary itself or created upon kaybee setup.

kaybee merge takes long even with one single repository configured

OS: macOS
kaybee merge takes long even with one single repository
kaybee version 0.6.15

I used a local git clone (file:// URL) to do the tests. kaybeeconf.yaml contains a single repo -
repo: file:///*********

Git cloned file system with tar containing 577 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 11:20:33 CET 2020
......
Fri Nov  6 11:45:36 CET 2020

It took ~25mins

Git cloned file system without tar containing 720 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 12:04:08 CET 2020
......
Fri Nov  6 12:17:47 CET 2020

It took ~13mins

Extract query-dependent commit features

Certain commit features can only be computed based on a particular query (i.e., they depend on the vulnerability advisory record) and as such cannot be pre-computed and persisted in the DB.

We need a module that contains functions to extract such features and to augment datamodel.Commit instances,
so that filter_rank can consider them all.

This functionality could live in the commit-preprocessor package or in filter_rank (I would prefer the former, although in that case a more general name for the package would be advisable).

A few good first example features:

  • does the commit contain a reference to the vuln id in the commit message?
  • does the commit change a path that matches the "filepath" entities in the AdvisoryRecord?
  • distance in time between the commit and the advisory publication date
  • commit falls in the commit interval determined based on the tags that (seem to) correspond to the versions mentioned in the advisory (yes, this is the most complex :-P )

See prospector legacy for more ideas. A rough sketch of a couple of such feature functions follows.
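The sketch below is in Python; the AdvisoryRecord and Commit attributes used here are assumptions for illustration and do not necessarily match the actual datamodel.

```python
# Sketch of query-dependent feature extraction: these values depend on the
# advisory (the "query") and therefore cannot be pre-computed per commit.
# Attribute names are hypothetical.
from dataclasses import dataclass


@dataclass
class AdvisoryRecord:
    vulnerability_id: str     # e.g. "CVE-2020-1234"
    paths: list               # file paths mentioned in the advisory text
    published_timestamp: int  # unix epoch


@dataclass
class Commit:
    message: str
    changed_files: list
    timestamp: int            # unix epoch


def references_vuln_id(commit: Commit, advisory: AdvisoryRecord) -> bool:
    return advisory.vulnerability_id.lower() in commit.message.lower()


def touches_advisory_paths(commit: Commit, advisory: AdvisoryRecord) -> bool:
    return any(p in f for p in advisory.paths for f in commit.changed_files)


def days_from_publication(commit: Commit, advisory: AdvisoryRecord) -> float:
    return abs(commit.timestamp - advisory.published_timestamp) / 86400.0
```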

[prospector] clarify use cases

In view of an architectural overhaul of prospector, we need to clarify which use cases it is meant to serve.
This will guide the design of the APIs of a prospector backend component.

Import specific vulnerabilities

There can be a scenario where a user wants to import specific vulnerabilities.
Currently there is no import capability that lets users import specific vulnerabilities: the -n parameter specifies the number of vulnerabilities to be imported, but there is no way to specify one or more vulnerability IDs.

Create statements for malicious packages

One additional source of statements could be the list of known malicious packages maintained at https://github.com/dasfreak/Backstabbers-Knife-Collection.

It contains a file package_index.csv with the following columns: Type, Package Name, Affected Version, Published, Reported, Sample, Injection Component, Obfuscation, Trigger, Conditional, Targeted OS, Objective, Details, Source, Comment, Typo Target, Campaign, Location of malicious snippet. See here for a detailed description of those columns.

The first three columns can be used to create one or more PURLs for artifacts; (some of) the other columns can be used for the description and references. (A minimal sketch of the conversion follows.)
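A minimal sketch of how the first three columns could be turned into PURLs (Python); the mapping from the Type column to PURL ecosystem names is an assumption that would need to be checked against the dataset:

```python
# Sketch: derive a PURL from the first three columns of package_index.csv.
# The ecosystem-name mapping is an assumption, not taken from the dataset docs.
import csv

TYPE_TO_PURL = {"npm": "npm", "pypi": "pypi", "rubygems": "gem"}  # assumed mapping


def purls_from_index(csv_path: str):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ecosystem = TYPE_TO_PURL.get(row["Type"].strip().lower())
            if ecosystem is None:
                continue  # skip ecosystems we do not know how to map
            name = row["Package Name"].strip()
            version = row["Affected Version"].strip()
            yield f"pkg:{ecosystem}/{name}@{version}" if version else f"pkg:{ecosystem}/{name}"
```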
