
project-kb's Introduction

Project KB


Description

The goal of Project KB is to enable the creation, management and aggregation of a distributed, collaborative knowledge base of vulnerabilities affecting open-source software.

Project KB consists of a vulnerability knowledge base as well as a set of tools to support the mining, curation, and management of such data.

Motivations

In order to feed Eclipse Steady with fresh data, we have spent a considerable amount of time over the past few years mining and curating a knowledge base of vulnerabilities that affect open-source components. We know that other parties have been doing the same, in academia as well as in industry. From this experience, we have learnt that, with the growing size of open-source ecosystems and the pace at which new vulnerabilities are discovered, this manual approach cannot scale. We are also increasingly convinced that vulnerability knowledge bases about open source should be open source themselves and adopt the same community-oriented model that governs the rest of the open-source ecosystem.

These considerations pushed us to release our vulnerability knowledge base in early 2019. In June 2020, we took a further step by releasing the kaybee tool to make the creation, aggregation, and consumption of vulnerability data much easier. In late 2020, we also released, as a proof of concept, the prototype prospector, whose goal is to automate the mapping of vulnerability advisories onto their fix commits.

We hope this will encourage more contributors to join our efforts to build a collaborative, comprehensive knowledge base where each party remains in control of the data they produce and of how they aggregate and consume data from the other sources.

Kaybee

Kaybee is a vulnerability data management tool. It makes it possible to fetch vulnerability statements from this repository (or from any other repository) and export them to a number of formats, including a script to import them into a Steady backend.

For details and usage instructions check out the kaybee README.

Prospector

Prospector is a vulnerability data mining tool that aims at reducing the effort needed to find security fixes for known vulnerabilities in open source software repositories. The tool takes a vulnerability description (in natural language) as input and produces a ranked list of commits, in decreasing order of relevance.

For details and usage instructions check out the prospector README.

Vulnerability data

The vulnerability data of Project KB are stored in textual form as a set of YAML files, in the vulnerability-data branch.
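For illustration, here is a minimal sketch (Python) of how such files could be loaded for further processing, assuming a local checkout of the vulnerability-data branch; the directory layout and the vulnerability_id field used below are assumptions, not the authoritative statement schema.

```python
# Minimal sketch: iterate over statement files in a local checkout of the
# vulnerability-data branch. The directory layout and the field names below
# are assumptions for illustration, not the authoritative schema.
from pathlib import Path

import yaml  # pip install pyyaml


def load_statements(root: str):
    """Yield (vulnerability_id, parsed_statement) pairs."""
    for path in Path(root).rglob("*.yaml"):
        with path.open() as f:
            statement = yaml.safe_load(f)
        if isinstance(statement, dict):
            # Fall back to the parent folder name if the field is absent.
            yield statement.get("vulnerability_id", path.parent.name), statement


if __name__ == "__main__":
    for vuln_id, stmt in load_statements("./vulnerability-data"):
        print(vuln_id, sorted(stmt.keys()))
```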

Publications

In early 2019, a snapshot of the knowledge base from Project KB was described in:

If you use the dataset for your research work, please cite it as:

@inproceedings{ponta2019msr,
    author={Serena E. Ponta and Henrik Plate and Antonino Sabetta and Michele Bezzi and Cédric Dangremont},
    title={A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software},
    booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
    year={2019},
    month={May},
}

MSR 2019 DATA SHOWCASE SUBMISSION: please find here the data and the scripts described in that paper

If you wrote a paper that uses the data or the tools from this repository, please let us know (through an issue) and we'll add it to this list.


Credits

EU-funded research projects

The development of Project KB is partially supported by the following projects:

Vulnerability data sources

Vulnerability information from NVD and MITRE might have been used as input for building parts of this knowledge base. See MITRE's CVE Usage license for more information.

Limitations and Known Issues

This project is a work in progress; you can find the list of known issues here.

Currently the vulnerability knowledge base only contains information about vulnerabilities in Java and Python open source components.

Support

For the time being, please use GitHub issues to report bugs, request new features and ask for support.

Contributing

See How to contribute.

project-kb's People

Contributors

amilankovich-slab, bvwells, chicxurug, copernico, daanhommersom, geryxyz, gitter-badger, henrikplate, ichbinfrog, idarav, jonathanbaker7, matteogreek, naramsim, riruk, sacca97, serenaponta, sumeetpatil


project-kb's Issues

Document usage scenarios and corresponding workflows

Creation of statements

As a user who has found information about a particular vulnerability, I want to create a corresponding statement; after checking whether other statements about that vulnerability already exist in one of my sources, I want the statement to be validated and then published to my own vulnerability repository.
After that, I also want to produce a script that imports this new information, reconciled with other statements about the same vulnerability (if they exist), into my Steady backend.

Examples (tentative):

kaybee lookup <VULN_ID>
kaybee create <VULN_ID>
kaybee push --validate
kaybee export  --target steady   # this produces steady.sh
tail -n 200 .kaybee/mergelog
bash ./steady.sh

Automated aggregation of vulnerability data

In an automated script, run periodically, all sources are to be pulled and merged using a suitable policy; for the statements that can be automatically reconciled, a script to import into Steady must be generated and run automatically. For the statements that could not be reconciled, a notification must be produced to trigger manual intervention.

# in crontab
kaybee pull && kaybee merge
kaybee prune -r 30   # purge stale local clones
kaybee export --target steady && bash ./steady.sh

Manual reconciliation

After receiving a notification that a certain group of statements could not be reconciled, I want to inspect the MergeLog to determine the reason. I can then craft a new statement that overrides the necessary elements of the existing (third-party) statements and push it to my own statement repository.

tail -n 200 .kaybee/mergelog
kaybee merge -p manual
kaybee push --validate

GH build action not working

For some reason, some (transitive?) dependencies are failing the build. Note: I cannot reproduce locally.

Commit pre-processor module

Extraction of the existing commit pre-processing (feature extraction) code and its restructuring into a separate module.

Integrate NVD feed endpoint

Re-use this implementation that is part of eclipse/steady to have a local replica of the NVD feeds.

Heads-up: most likely, steady will drop that module in the future; we can just clone it into project-kb and continue from there.

Implement command to create a new statement

A command like create-statement should provide a UI to guide the user in entering the necessary data. This could be implemented as a multi-step form, on the CLI, in a browser-based app, or both.

kaybee import not working

Command with error -

kaybee import -c kaybeeconf.yaml -v
Using config file: /Users/I334616/fork/project-kb/kaybee/testdata/conf/kaybeeconf.yaml
Importing vulnerability data from http://localhost:8033/backend/ (using 0 workers)
2020/06/26 09:37:13 json: cannot unmarshal object into Go value of type []*model.Bug

build documentation automatically

At each PR and push, make sure that the documentation (gh-pages) is built automatically and correctly (optional: check that it contains no dead links).

kaybee config for steady generates java properties file rather than json

kaybee config for steady generates a Java properties metadata file for commits rather than a JSON file.
Example -
kaybee export -c kaybeeconf_steady.yaml -t steady -f .kaybee/imported/HTTPCLIENT-1803/statement.yaml -o test2.sh
./test2.sh
cd HTTPCLIENT-1803
tree -L 2

.
├── 0554271750599756d4946c0d7ba43d04b1a7b2
│   ├── after
│   ├── before
│   └── meta.properties   // this should be a JSON file
└── metadata.json

ensure that conflicting statements are reconciled only once

Scenario

A statement s_1 and a statement s_2, from sources S_1 and S_2 respectively, are conflicting. With some policy (or via manual intervention) they are reconciled and the result is statement s_3.

The next day, kaybee merge is run again: how should the same conflict be handled? Shall we keep a pointer to the last commit that was considered from each source, so that reconciliation is not done again on the exact same set of conflicting statements? (A sketch of this idea follows.)
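One way to implement the "pointer to the last commit considered from each source" idea is a small watermark file updated after every successful merge. The sketch below (Python) is only an illustration; the file name, location, and structure are hypothetical.

```python
# Sketch: remember, per source, the last commit whose statements have already
# been reconciled, so a later merge can skip sources that did not change.
# The file name and structure are hypothetical.
import json
from pathlib import Path

WATERMARK_FILE = Path(".kaybee/merge_watermarks.json")


def load_watermarks() -> dict:
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())
    return {}


def needs_reconciliation(source_url: str, head_commit: str) -> bool:
    """True if the source has advanced past the commit seen at the last merge."""
    return load_watermarks().get(source_url) != head_commit


def record_merge(source_url: str, head_commit: str) -> None:
    marks = load_watermarks()
    marks[source_url] = head_commit
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps(marks, indent=2))
```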

kaybee setup fails

Kaybee setup fails with the error below.
Tested on macOS.

kaybee setup
Running setup...
Non-interactive mode
[+] Running Setup task
2020/06/26 14:15:52 stat /home/*******/devel/project-kb/kaybee/internal/tasks/data/default_config.yaml: no such file or directory

Design ML subsystem (model training + prediction)

Please refer to the diagram below:

[Diagram: prospector-assuremoss]

Note: the label on the database at the bottom of the diagram is wrong: its contents are pre-processed commits, not vectorial representations of commits (which, for most features, depend on the query).

Validate kaybee statements

Kaybee statements should be validated automatically when a pull request is created.
I can think of these basic validations (a rough sketch follows the list):

  1. Validate Git Commit Id
  2. Validate Git branch
  3. Validate PURL for affected artifacts
  4. Validate Git/Svn Repository
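A rough sketch of what such checks could look like (Python); the statement field names and the commit-id/PURL patterns used here are assumptions for illustration, and a real implementation would also verify the data against the actual repositories.

```python
# Sketch of basic, purely syntactic checks on a statement.
# Field names are assumptions; a real validator should also query the
# repository (does the commit/branch actually exist?) before accepting a PR.
import re

SHA1_RE = re.compile(r"^[0-9a-f]{7,40}$")
# Very loose PURL shape check: pkg:type/namespace/name@version
PURL_RE = re.compile(r"^pkg:[a-z0-9.+-]+/.+")


def validate_statement(statement: dict) -> list:
    errors = []
    for fix in statement.get("fixes", []):
        for commit in fix.get("commits", []):
            if not SHA1_RE.match(commit.get("id", "")):
                errors.append(f"invalid commit id: {commit.get('id')!r}")
            if not commit.get("repository", "").startswith(("http", "git@")):
                errors.append(f"suspicious repository URL: {commit.get('repository')!r}")
    for artifact in statement.get("artifacts", []):
        if not PURL_RE.match(artifact.get("id", "")):
            errors.append(f"invalid PURL: {artifact.get('id')!r}")
    return errors
```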

kaybee setup requires config file

The kaybee setup command is supposed to generate a config file; however, it won't run, complaining that no config file exists (unless one already does).

[prospector] Improve getting-started docs

There are several steps that are not documented (including the installation of dependencies that are not in the requirements.txt file):

  • sklearn
  • spacy (+language model to be installed separately)
  • matplotlib
  • streamlit
  • (....)

Specify expected behaviour of a soft policy

The strict policy refuses to reconcile statements about the same vulnerability that come from different sources.

We need another policy that is able to automatically reconcile statements about the same vulnerability.

What is the expected behaviour of such a policy? (One possible interpretation is sketched below.)
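One possible interpretation, purely to make the question concrete: a soft policy could take each conflicting scalar field from the highest-ranked source and merge list-valued fields (e.g. references or notes) as a union. A sketch in Python, with hypothetical field names:

```python
# Sketch of one possible "soft" reconciliation policy: scalar fields are taken
# from the highest-ranked source, list-valued fields are merged as a union.
# The 'source_rank' key is hypothetical.
def soft_merge(statements: list) -> dict:
    """statements: dicts about the same vulnerability, each carrying a 'source_rank'."""
    ordered = sorted(statements, key=lambda s: s.get("source_rank", 0), reverse=True)
    merged = dict(ordered[0])  # start from the most trusted source
    for other in ordered[1:]:
        for key, value in other.items():
            if key not in merged:
                merged[key] = value
            elif isinstance(merged[key], list) and isinstance(value, list):
                merged[key] = merged[key] + [v for v in value if v not in merged[key]]
            # conflicting scalar values: keep the higher-ranked one already in 'merged'
    return merged
```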

Make git parsing more efficient

The current bottleneck in Prospector's runtime is parsing the .git folder into an SQL database. We might take a look at other solutions that parse .git folders, try them out, and learn from them to improve the processing speed of this initial step (a baseline sketch follows the list). Example projects:

  • gitbase seems abandoned
  • gitql looks good, but probably not optimized
  • askgit seems promising, adds github API support (in beta)
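As a baseline to compare such tools against, here is a minimal sketch (Python, no third-party git library) that streams the output of a single git log invocation and bulk-inserts the commit metadata into SQLite; it illustrates the approach, not Prospector's actual code:

```python
# Sketch: parse commit metadata with a single `git log` invocation and
# bulk-insert it into SQLite, avoiding per-commit subprocess calls.
import sqlite3
import subprocess


def index_repository(repo_path: str, db_path: str = "commits.db") -> None:
    fmt = "%H%x1f%an%x1f%at%x1f%s"  # hash, author, timestamp, subject, 0x1f-separated
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{fmt}"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split("\x1f") for line in out.splitlines() if line]

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits "
        "(id TEXT PRIMARY KEY, author TEXT, timestamp INTEGER, message TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO commits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```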

Use "reuse-tool" for licensing info

Generate license and copyright information with the 'Reuse' tool of the Free Software Foundation Europe (FSFE). To adopt external standards and best practices, this tool must be used to generate the copyright and license information.

TODOs:

  • Generate the required information with the 'Reuse' tool as described here.
  • Remove the notices file which was used in the past to store the SAP copyright (the file might be named NOTICES or NOTICES.txt).
  • Remove the License section from the README.md file.

check if a new release exists

kaybee could check whether a newer version exists on GitHub.
Should this be a separate command? Should it also download the new release?
Of course, the user should be able to opt out of / disable the check. (A minimal sketch of such a check follows.)
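A minimal sketch of such a check against the GitHub releases API (Python); the version comparison is deliberately simplified (it only detects a different tag) and the current-version string is a placeholder:

```python
# Sketch: ask the GitHub API for the latest release tag and compare it with
# the running version. Network errors are swallowed on purpose: an update
# check must never break the tool itself.
import json
import urllib.request

RELEASES_URL = "https://api.github.com/repos/sap/project-kb/releases/latest"


def newer_release_available(current_version: str) -> bool:
    try:
        with urllib.request.urlopen(RELEASES_URL, timeout=5) as resp:
            latest = json.load(resp).get("tag_name", "").lstrip("v")
    except Exception:
        return False  # offline or rate-limited: silently skip the check
    return latest != "" and latest != current_version.lstrip("v")


# e.g. newer_release_available("0.6.15") -> True if a different tag was published
```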

extend source specification to include a path

Currently a source is a 3-tuple:

  • repository url
  • branch
  • rank

It should become a 4-tuple and include also:

  • path

Example:

sources:
  - repo: https://github.com/sap/project-kb
    path: vulnerability-data
    branch: master
    rank: 10

This will make it possible to point to specific folders in the git repository and to have separate "sources" in the same branch of the same repo.

database creation fails

If the binary files of the sqlite DB are not present (they should not be in Git), Prospector fails to create them.

Can't run the same analysis twice

When processing the same CVE twice in a row, the second run aborts with the following:

Traceback (most recent call last):
  File "main.py", line 220, in <module>
    plac.call(main)
  File "/home/i064196/.local/share/virtualenvs/prospector-_0gS2ZiV/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/i064196/.local/share/virtualenvs/prospector-_0gS2ZiV/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "main.py", line 153, in main
    advisory_record.gather_candidate_commits()
  File "/home/i064196/devel/project-kb/prospector/rank.py", line 781, in gather_candidate_commits
    self.validate_database_coverage()
  File "/home/i064196/devel/project-kb/prospector/rank.py", line 751, in validate_database_coverage
    database.add_repository_to_database(self.connection, self.repo_url, self.project_name, verbose=self.verbose)
  File "/home/i064196/devel/project-kb/prospector/database.py", line 439, in add_repository_to_database
    {'repo_url':repo_url, 'project_name':project_name})
sqlite3.IntegrityError: UNIQUE constraint failed: repositories.repo_url
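A hedged sketch of the kind of change that would make the second run idempotent (Python/sqlite3); the table and column names are taken from the error message above, everything else is illustrative:

```python
# Sketch: make the repository insert idempotent so that re-running the
# analysis on the same CVE does not trip the UNIQUE constraint.
import sqlite3


def add_repository_if_missing(conn: sqlite3.Connection, repo_url: str, project_name: str) -> None:
    # INSERT OR IGNORE leaves the existing row untouched if repo_url is already present.
    conn.execute(
        "INSERT OR IGNORE INTO repositories (repo_url, project_name) VALUES (?, ?)",
        (repo_url, project_name),
    )
    conn.commit()
```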

New export target to generate changed source code

Add a new target to kaybee export, e.g. create-changed-source-code, to generate a script that:

  • clones the repositories and creates the before and after folders for each vulnerability
  • compresses the created folders into an archive 'changed-source-code.tar.gz'

Note: the commit folders containing before and after should be removed after being compressed.

Compared to the existing steady target, neither metadata.json nor an invocation of kb-importer is required. (A rough sketch of what the generated script could do follows.)
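A rough sketch of what the generated script could do (Python, shelling out to git); the output layout and function name are assumptions:

```python
# Sketch: materialise the "before" and "after" versions of the files touched by
# each fix commit, pack everything into changed-source-code.tar.gz, and then
# remove the intermediate folders (as noted above). Paths are illustrative.
import shutil
import subprocess
import tarfile
from pathlib import Path


def export_changed_sources(repo: str, commits: list, out_dir: str = "export") -> None:
    out = Path(out_dir)
    for commit in commits:
        changed = subprocess.run(
            ["git", "-C", repo, "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        for rev, label in [(f"{commit}^", "before"), (commit, "after")]:
            for path in changed:
                target = out / commit / label / path
                target.parent.mkdir(parents=True, exist_ok=True)
                blob = subprocess.run(
                    ["git", "-C", repo, "show", f"{rev}:{path}"],
                    capture_output=True, check=False,
                )
                if blob.returncode == 0:  # the file may not exist in one of the two revisions
                    target.write_bytes(blob.stdout)

    with tarfile.open(out / "changed-source-code.tar.gz", "w:gz") as tar:
        for commit in commits:
            tar.add(out / commit, arcname=commit)
    for commit in commits:
        shutil.rmtree(out / commit)  # drop the folders once archived
```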

Provide support for interactive statement reconciliation

When merge cannot reconcile conflicting statements, the user will be asked to reconcile them manually.
This could be a separate command, such as kaybee reconcile <vuln-id>, or just a special policy that can be invoked as usual:
kaybee merge -p interactive.

Move go.mod in `kaybee` folder

Clean up the top-level folder and make it independent of kaybee and prospector (no go or python stuff).
Currently the go.mod file of kaybee is in the root folder.

Refactoring: move the export scripts out of the main config file

We could keep in the config file just a setting that says in which folder the tool can find "exporter definitions"; these would be separated into one file per target, so that the main configuration is much leaner. Also, the "standard" exporters could be packaged with the binary itself or created upon kaybee setup.

kaybee merge takes long even with one single repository configured

OS: macOS
kaybee merge takes long even with one single repository
kaybee version 0.6.15

I used a local git clone (file:// URL) to do the tests. kaybeeconf.yaml contains a single repo -
repo: file:///*********

Git cloned file system with tar containing 577 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 11:20:33 CET 2020
......
Fri Nov  6 11:45:36 CET 2020

It took ~25mins

Git cloned file system without tar containing 720 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 12:04:08 CET 2020
......
Fri Nov  6 12:17:47 CET 2020

It took ~13mins

Extract query-dependent commit features

Certain commit features can only be computed based on a particular query (i.e., they depend on the vulnerability advisory record) and as such cannot be pre-computed and persisted in the DB.

We need a module that contains functions to extract such features and to augment datamodel.Commit instances,
so that filter_rank can consider them all.

This functionality could live in the commit-preprocessor package or in filter_rank (I would prefer the former, although in that case a more general name for the package would be advisable).

A few good first example features:

  • does the commit contain a reference to the vuln id in the commit message?
  • does the commit change a path that matches the "filepath" entities in the AdvisoryRecord?
  • distance in time between the commit and the advisory publication date
  • commit falls in the commit interval determined based on the tags that (seem to) correspond to the versions mentioned in the advisory (yes, this is the most complex :-P )

See prospector legacy for more ideas. A rough sketch of a couple of such feature functions follows.
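The sketch below is in Python; the AdvisoryRecord and Commit attributes used here are assumptions for illustration and do not necessarily match the actual datamodel.

```python
# Sketch of query-dependent feature extraction: these values depend on the
# advisory (the "query") and therefore cannot be pre-computed per commit.
# Attribute names are hypothetical.
from dataclasses import dataclass


@dataclass
class AdvisoryRecord:
    vulnerability_id: str     # e.g. "CVE-2020-1234"
    paths: list               # file paths mentioned in the advisory text
    published_timestamp: int  # unix epoch


@dataclass
class Commit:
    message: str
    changed_files: list
    timestamp: int            # unix epoch


def references_vuln_id(commit: Commit, advisory: AdvisoryRecord) -> bool:
    return advisory.vulnerability_id.lower() in commit.message.lower()


def touches_advisory_paths(commit: Commit, advisory: AdvisoryRecord) -> bool:
    return any(p in f for p in advisory.paths for f in commit.changed_files)


def days_from_publication(commit: Commit, advisory: AdvisoryRecord) -> float:
    return abs(commit.timestamp - advisory.published_timestamp) / 86400.0
```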

[prospector] clarify use cases

In view of an architectural overhaul of prospector, we need to clarify which use cases it is meant to serve.
This will guide the design of the APIs of a prospector backend component.

Import specific vulnerabilities

There can be a scenario where a user wants to import specific vulnerabilities.
Currently there is no import capability that lets users import specific vulnerabilities: the -n parameter specifies the number of vulnerabilities to be imported, but there is no way to specify one or more vulnerability IDs.

Create statements for malicious packages

One additional source of statements could be the list of known malicious packages maintained at https://github.com/dasfreak/Backstabbers-Knife-Collection.

It contains a file package_index.csv with the following columns: Type, Package Name, Affected Version, Published, Reported, Sample, Injection Component, Obfuscation, Trigger, Conditional, Targeted OS, Objective, Details, Source, Comment, Typo Target, Campaign, Location of malicious snippet. See here for a detailed description of those columns.

The first three columns can be used to create one or more PURLs for artifacts; (some of) the other columns can be used for the description and references. (A minimal sketch of the conversion follows.)
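A minimal sketch of how the first three columns could be turned into PURLs (Python); the mapping from the Type column to PURL ecosystem names is an assumption that would need to be checked against the dataset:

```python
# Sketch: derive a PURL from the first three columns of package_index.csv.
# The ecosystem-name mapping is an assumption, not taken from the dataset docs.
import csv

TYPE_TO_PURL = {"npm": "npm", "pypi": "pypi", "rubygems": "gem"}  # assumed mapping


def purls_from_index(csv_path: str):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ecosystem = TYPE_TO_PURL.get(row["Type"].strip().lower())
            if ecosystem is None:
                continue  # skip ecosystems we do not know how to map
            name = row["Package Name"].strip()
            version = row["Affected Version"].strip()
            yield f"pkg:{ecosystem}/{name}@{version}" if version else f"pkg:{ecosystem}/{name}"
```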
