
sec-certs's Introduction

Sec-certs

A tool for data scraping and analysis of security certificates from Common Criteria and FIPS 140-2/3 frameworks.


Paper pre-prints

Two papers related to this tool have been accepted for publication. See the arXiv pre-prints below.


Installation

Use Docker with docker pull seccerts/sec-certs, or just pip install -U sec-certs && python -m spacy download en_core_web_sm. For a more detailed description, see the docs.

Usage

There are two main steps in exploring the world of security certificates:

  1. Scraping and processing the data of all the certificates
  2. Exploring and analysing the processed data

For the first step, we currently provide a CLI. For the second step, we provide a simple API that can be used directly inside our Jupyter notebooks or locally, together with fully processed datasets that can be downloaded.

More elaborate usage is described in docs/quickstart. Also, see the example notebooks, either on GitHub or in the docs. From the docs, you can also run our notebooks in Binder.

Data scraping

Run sec-certs cc all for Common Criteria processing and sec-certs fips all for FIPS 140 processing.

Data analysis

Most likely, you don't want to fully process the certification artifacts yourself. Instead, you can use our results and explore them as a data structure. An example snippet follows; for more, see the example notebooks. Tip: these can be run with Binder from our docs.

from sec_certs.dataset import CCDataset

# Fetch the latest processed dataset; certificates are held in dset.certs
dset = CCDataset.from_web_latest()

# Transform the dataset into a pandas DataFrame
df = dset.to_pandas()

# Store the snapshot as json so that you don't have to download it again,
# then load it back from disk
dset.to_json('./latest_cc_snapshot.json')
dset = CCDataset.from_json('./latest_cc_snapshot.json')

# Get certificates with some CVE
vulnerable_certs = [x for x in dset if x.heuristics.related_cves]
df_vulnerable = df.loc[~df.related_cves.isna()]

# Show CVE ids of some vulnerable certificate
print(f"{vulnerable_certs[0].heuristics.related_cves=}")

# Get certificates from 2015 and newer
df_2015_and_newer = df.loc[df.year_from > 2014]

# Plot distribution of years of certification
df.year_from.value_counts().sort_index().plot.line()

sec-certs's People

Contributors

adamjanovsky, dependabot[bot], dmacko232, GeorgeFI, J08nY, KeleranV, mmstanone, petrs


sec-certs's Issues

Fix broken PP IDs parsing

While PPs are parsed from statically downloaded files, PP IDs are not properly processed and matched to certificates.

Create full-fledged CLI for FIPS

The following adjustments should be made to the FIPS entrypoint:

  • The existing CLI in the examples directory should be moved into the root of the repository.
  • The CLI should be adjusted to have an interface similar to the CC CLI, i.e., one argument specifying the set of actions, plus options for additional settings.
  • The readme should introduce two sections: FIPS and CC.
  • FIPS should be documented in the readme.

New API: Standard naming of manufacturers

The old API had a function that (among many other things) attempted to introduce standard naming of certificate manufacturers. Specifically, the function:

  • Defines separators, like , and /
  • Splits the manufacturer name into tokens on those separators
  • If any token is itself a valid manufacturer of some existing certificate, that token is considered a separate manufacturer
  • Then, a similarity search is run on the tokenized lists of manufacturers, and edges are drawn in a dot graph between nodes that could represent the same manufacturer

As an example, one can imagine two distinct manufacturers hidden in the following field:

Oberthur Technologies / NXP Semiconductors GmbH 

I did not find any real use for this functionality, so it is left for further implementation if we ever encounter a use-case. In case we do implement it (a tokenization sketch follows the list below):

  • pandas could be leveraged for data processing
  • the graphviz Python package should be used to create the graphs
  • such a function naturally fits into the analysis section of the tool
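A minimal sketch of the tokenization step described above (the separators are those named in the issue; the function name and the known-manufacturers set are illustrative assumptions):

import re

SEPARATORS = [',', '/']

def tokenize_manufacturer(field, known_manufacturers):
    # Split the manufacturer field on the separators and strip whitespace.
    pattern = '|'.join(re.escape(sep) for sep in SEPARATORS)
    tokens = [token.strip() for token in re.split(pattern, field) if token.strip()]
    # A token that is itself a known manufacturer is treated as a separate one.
    return [token for token in tokens if token in known_manufacturers] or tokens

print(tokenize_manufacturer(
    'Oberthur Technologies / NXP Semiconductors GmbH',
    {'Oberthur Technologies', 'NXP Semiconductors GmbH'},
))  # ['Oberthur Technologies', 'NXP Semiconductors GmbH']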

CI/CD: Publish docker image on release

Since our tool also has some system dependencies (e.g., pdftotext), we should, along with the PyPI package, provide a fully reproducible environment -- a Docker image -- published to a public repository.

  • A GitHub Actions or Travis CI/CD workflow triggered on release should build and push the Docker image to a public repository, so that anyone can pull it.
  • Corresponding badges should be made and put into the readme.
  • Short documentation on how to pull the container should be put into the readme.

CVEDataset handling optimizations

Currently, the CVEDataset object is required for both the CPE-matching and the CVE-matching task. When both tasks are run in sequence, we deserialize the CVEDataset object twice, which takes some time. Adding logic that keeps the CVEDataset open for subsequent use could save some computation time (a sketch below).

Probably low priority; this only adds a few seconds to the computation time.
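A minimal sketch of the caching idea (CVEDataset.from_json and the cve_dataset_path attribute are assumptions about the surrounding code):

class CCDataset:
    _cve_dataset = None  # cached between the CPE-matching and CVE-matching tasks

    def _prepare_cve_dataset(self):
        # Deserialize only once; reuse the in-memory object for the second task.
        if self._cve_dataset is None:
            self._cve_dataset = CVEDataset.from_json(self.cve_dataset_path)
        return self._cve_dataset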

Functionality: detect possibly vulnerable certificates by shared terms in certificate(s)

Example use-case: estID, ROCA, and the ID 163484 eIDAS memo. The memo referenced vulnerable chips, but different ones than those used in the estID cards.

Idea (a sketch of the shared-term step follows the list):

  • input: list of known vulnerable certificates
  • generate --find-affecting graph
  • find term(s) shared by all known vulnerable certificates in their --find-affecting graphs
  • generate graph with potentially affected certificates for these terms using --find-affected
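A minimal sketch of the shared-term step, assuming each certificate exposes the set of terms from its --find-affecting graph (affecting_terms is a hypothetical attribute):

from functools import reduce

def shared_terms(vulnerable_certs):
    # Intersect the term sets of all known vulnerable certificates.
    return reduce(set.intersection, (set(cert.affecting_terms) for cert in vulnerable_certs))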

CC Leaking local paths in serialized errors

E.g.

{'state': {'errors': ['Failed to read metadata of /Users/adam/phd/projects/certificates/datasets/cc_full_dataset/certs/targets/pdf/e6959c2b66202cb8.pdf,

Either add an option to delete the errors (or the whole state) once the dataset is processed, or try to convert the paths to relative ones (a sketch below).
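A minimal sketch of the relative-path option (the root argument stands for the dataset root directory, an assumption about where the absolute prefix comes from):

from pathlib import Path

def relativize_errors(errors, root: Path):
    # Strip the absolute dataset root from stored error messages before serialization.
    return [error.replace(str(root), '.') for error in errors]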

CC: cert_lab not extracted properly

  • When computing heuristics, cert_lab is not extracted properly, and the extraction attempt crashes the program.
  • Also, cc_cli does not process maintenance updates even when the action is specified.
  • We should not attempt to run further actions on files for which some previous action failed.

CI/CD: Publish package to PyPi on release

  • On release (alternatively on push to master, but we would then need to create a dev branch), a GitHub Action or Travis CI/CD job should be triggered that publishes the current version of the tool to PyPI, so that it can be installed with pip install sec-certs
  • Add relevant badges (e.g., see https://thomas-cokelaer.info/blog/2014/08/1013/) to the readme


Search on web does not find match when searched for cert id

When using the web text search for a particular certificate id, the matching certificate is not found.

E.g., BSI-DSZ-CC-0804-2012 (cv act ePasslet/ePKI v3.6) will not find the relevant certificate:
https://seccerts.org/cc/search/?q=BSI-DSZ-CC-0804-2012&cat=abcdefghijklmop&status=any&sort=match

but https://seccerts.org/cc/search/?q=ePasslet%2FePKI%20v3.6&cat=abcdefghijklmop&status=any&sort=match will find it properly
https://seccerts.org/cc/c3a110dda0b5031dc2ca/

Tests: Make resources locally accessible

For now, many of the tests rely on downloads from commoncriteria.org, which is itself quite an unreliable webpage. Apart from tests that check the correctness of such downloads, we should rely more on local resources, so that the tests don't fail due to an inaccessible CC web.

Fix too complex functions

Fix the functions that, according to Flake8, are too complex:

./sec_certs/sample/common_criteria.py:363:5: C901 'CommonCriteriaCert.from_html_row' is too complex (24)
./sec_certs/sample/fips.py:602:5: C901 'FIPSCertificate.parse_cert_file_common' is too complex (15)
./sec_certs/sample/fips.py:753:5: C901 'FIPSCertificate.remove_algorithms' is too complex (14)
./sec_certs/sample/cve.py:96:5: C901 'CVE.from_nist_dict' is too complex (12)
./sec_certs/model/cpe_matching.py:183:5: C901 'CPEClassifier.get_candidate_list_of_vendors' is too complex (11)
./sec_certs/model/dependency_finder.py:23:5: C901 'DependencyFinder._build_cert_references' is too complex (12)
./sec_certs/dataset/fips.py:398:5: C901 'FIPSDataset._validate_id' is too complex (11)

So far, the check for complex functions has been disabled in .flake8; make sure to enable it again once done.

Decide on static analysis tool for this repository

The goal is to:

  • Go through a list of available tools for static analysis of Python code (CodeQL, LGTM, ...)
  • Identify one, or at most two, most plausible choices
  • Incorporate them into the CI/CD process, ideally on each commit. Should that be too heavy, we can stick with running the analysis on each pull request

Note that the security of the repository is not of the utmost importance, as we expect the tool to run in a trusted environment. We mainly hope that the tools may improve our code quality.

Fix PyPi package

It seems that whatever is being pushed to the PyPI repository is not working. To be exact, running

pip3 install -e .
process-certs

works as expected, so the local install is OK. However, trying

pip3 install -U sec-certs
process-certs

exits with the following error:

Traceback (most recent call last):
  File "/Users/adam/.pyenv/versions/3.8.1/bin/process-certs", line 5, in <module>
    from process_certificates import main
ModuleNotFoundError: No module named 'process_certificates'

We should fix the PyPI package.

CC: Limit serialized variables for Maintenance updates or analyze them fully

Maintenance updates now inherit from CommonCriteriaCert and are serialized using its methods. However, since they are not fully analyzed as of now, it makes no sense to serialize many empty variables. A decision should be made to choose one of:

  1. Limit the number of (de)serialized variables (a sketch follows the example below)
  2. Run the full analysis on maintenance updates -- download and analyze the pdfs, ...

Example of the current serialization:

{
    "_type": "CommonCriteriaMaintenanceUpdate",
    "dgst": "cert_822d871f3bbd06d7_update_b93e6033924ed6f3",
    "status": "",
    "category": "",
    "name": "SonicWall SonicOS Enhanced V6.5.4 with VPN and IPS on TZ and SOHO Appliances Security",
    "manufacturer": null,
    "scheme": "",
    "security_level": [
        ""
    ],
    "not_valid_before": null,
    "not_valid_after": null,
    "report_link": "https://www.commoncriteriaportal.org/files/epfiles/st_vid11028-add1.pdf",
    "st_link": "https://www.commoncriteriaportal.org/files/epfiles/st_vid11028-st-1.pdf",
    "cert_link": null,
    "manufacturer_web": null,
    "protection_profiles": [],
    "maintainance_updates": [],
    "state": {
        "_type": "InternalState",
        "st_download_ok": true,
        "report_download_ok": true,
        "st_convert_ok": true,
        "report_convert_ok": true,
        "st_extract_ok": true,
        "report_extract_ok": true,
        "errors": []
    },
    "pdf_data": {
        "_type": "PdfData",
        "report_metadata": null,
        "st_metadata": null,
        "report_frontpage": null,
        "st_frontpage": null,
        "report_keywords": null,
        "st_keywords": null
    },
    "heuristics": {
        "_type": "Heuristics",
        "extracted_versions": null,
        "cpe_matches": null,
        "labeled": false,
        "verified_cpe_matches": null,
        "related_cves": null,
        "cert_lab": null,
        "cert_id": null
    },
    "related_cert_digest": "822d871f3bbd06d7",
    "maintenance_date": "2020-08-17"
}
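A minimal sketch of option 1, assuming serialization is driven through to_dict(); the whitelist keeps only the fields that carry data in the example above, and its name is hypothetical:

class CommonCriteriaMaintenanceUpdate(CommonCriteriaCert):
    # Hypothetical whitelist of attributes worth serializing for updates.
    serialized_attributes = [
        'dgst', 'name', 'report_link', 'st_link',
        'related_cert_digest', 'maintenance_date',
    ]

    def to_dict(self):
        return {key: getattr(self, key) for key in self.serialized_attributes}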

AttributeError: 'FIPSCertificate' object has no attribute 'processed'

Running

fips-certs all

on a new dataset results in:

Traceback (most recent call last):
  File "sec-certs/virt/bin/fips-certs", line 33, in <module>
    sys.exit(load_entry_point('sec-certs', 'console_scripts', 'fips-certs')())
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "sec-certs/fips_cli.py", line 210, in main
    dset.plot_graphs(show=True)
  File "sec-certs/sec_certs/dataset/fips.py", line 544, in plot_graphs
    self.get_dot_graph("full_graph", show=show)
  File "sec-certs/sec_certs/dataset/fips.py", line 488, in get_dot_graph
    processed = self._get_processed_list(connection_list, key)
  File "sec-certs/sec_certs/dataset/fips.py", line 455, in _get_processed_list
    return getattr(self.certs[key], attr).connections
AttributeError: 'FIPSCertificate' object has no attribute 'processed'

Add mechanism for publishing latest CC, FIPS dataset on web

We should add a mechanism that allows the project administrators to publish new versions of the JSON datasets on seccerts.org for other researchers to download. This can be done:

  • manually
  • through some web interface

The CCDataset class now implements the method CCDataset.from_web_latest(), which should fetch the latest dataset published by us into an object. This is handy especially when working with notebooks, as any researcher can write

dset = CCDataset.from_web_latest()
df = dset.to_pandas()

and obtain a table representation of the dataset that they can easily experiment with, draw plots from, etc. The method from_web_latest() references a static URL where the latest dataset should always sit.

Add predictable URL links for certificates based on cert id

The URL of the page for a particular certificate is currently not predictable (e.g., https://seccerts.org/cc/c3a110dda0b5031dc2ca/).

Add a predictable one based on the certificate id (this allows offline generation of links to items with a known certificate ID), e.g.,
https://seccerts.org/cc/BSI-DSZ-CC-0804-2012

This may be realized as a redirect to the other URL for better readability (cert id, name, ...); a sketch follows.
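A minimal sketch of such a redirect, assuming a Flask-style route (the in-memory mapping stands in for a real lookup from certificate id to digest):

from flask import Flask, abort, redirect

app = Flask(__name__)

# Example mapping taken from this issue; a real deployment would query the database.
CERT_ID_TO_DGST = {'BSI-DSZ-CC-0804-2012': 'c3a110dda0b5031dc2ca'}

@app.route('/cc/id/<cert_id>')
def cert_by_id(cert_id):
    dgst = CERT_ID_TO_DGST.get(cert_id)
    if dgst is None:
        abort(404)
    return redirect(f'/cc/{dgst}/')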

Refactor CC tests

Technical debt is slowly accumulating in the tests. They are currently not logically divided into classes; we could benefit from better structure, naming, etc. Also, some functionality is not tested:

  • Test maintenance updates
  • Refactor into better classes
  • Select more appropriate names for the methods
  • Get rid of the unittest, migrate to pytest
  • Set asserts in accordance with pytest, #76

Fix Docker image and add version names

While the Docker image is correctly built and published on DockerHub, it seems that the image won't execute properly. In fact, running

docker run seccerts/sec-certs

exits with the following error

Traceback (most recent call last):
  File "/opt/sec-certs/examples/cc_oop_demo.py", line 1, in <module>
    from sec_certs.dataset import CCDataset
ModuleNotFoundError: No module named 'sec_certs'

Also, the push to DockerHub should fetch the correct version tag from GitHub, pushing not as sec-certs:latest but with the proper version.

CI/CD: Migrate tests to GitHub Actions

Currently, each time a push to this repository is made, tests run against the Travis docker image according to the specification in .travis.yml.

As we will perform other CI/CD operations in GitHub, this component should be moved under GitHub Actions as well.

Add __slots__ workaround for massively spawned dataclasses

Some of the dataclasses that we use (CPE, CVE) are spawned in large quantities and would benefit from a __slots__ implementation. However, prior to Python 3.10, slots are not directly available for dataclasses. We should study https://stackoverflow.com/questions/50180735/how-can-dataclasses-be-made-to-work-better-with-slots and implement a suitable workaround.

As @J08nY suggested, the "make alterations after class object instantiation" approach could be our focus (a sketch below).
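A minimal sketch of the workaround from the linked thread: rebuild the dataclass with __slots__ derived from its fields (shown on a toy CPE class; the real fields differ):

import dataclasses

def add_slots(cls):
    # __slots__ cannot be added after class creation, so build a new class
    # whose __slots__ are exactly the dataclass fields.
    cls_dict = dict(cls.__dict__)
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    cls_dict['__slots__'] = field_names
    for field_name in field_names:
        cls_dict.pop(field_name, None)  # drop class-level defaults
    cls_dict.pop('__dict__', None)      # instances should not get a __dict__
    return type(cls)(cls.__name__, cls.__bases__, cls_dict)

@add_slots
@dataclasses.dataclass
class CPE:
    uri: str
    title: str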

Collect information also directly from ANSSI/BSI pages

For the Common Criteria certificates we currently collect information from:

  1. the csv from the Common Criteria portal at https://www.commoncriteriaportal.org/products/certified_products.csv,
  2. the html from the Common Criteria portal at https://www.commoncriteriaportal.org/products/,
  3. the certificate report documents (PDF) linked from 1. and 2., and
  4. the security target documents (PDF) linked from 1. and 2.

We should probably also collect information directly from the pages of big members of CC that produce the certifications (https://www.commoncriteriaportal.org/ccra/schemes/) like ANSSI and BSI. We can then cross-check this data with the data we collect using our existing method and possibly augment it using this new data source if we see some improvement.

The steps in this task are:

  • Examine the pages of the CC members (e.g. ANSSI, BSI) that produce certifications (linked from https://www.commoncriteriaportal.org/ccra/schemes/) and see which ones have some sort of listing of the products they certified, with at least a minimal amount of information about the certificates.
  • Implement functionality that parses interesting data about the certificates out of the aforementioned pages.
    • Start with ANSSI and BSI pages.
    • Get inspired by the existing codebase and how it works with Certificate objects (but it need not mirror the existing codebase exactly at the start).
    • Also have the ability to export the results to JSON (like the current datasets and certificates).
  • Compare the extracted data from the aforementioned pages with data collected using our current methods.
    • Do we correctly match the certificate id for the certificates?
    • Are the PDFs linked from the pages of the CC members the same as the ones linked from Common Criteria directly?
    • Is there some data that we are missing?
  • Consider a way of enriching our current dataset collected from the CC with the data collected from the aforementioned pages.
    • This only makes sense if there is something we are missing or have wrong.

No such file or directory: nvdcpematch-1.0.json

Running

fips-certs new-run --output fips_dataset --name fips_dataset

after a fresh install of the tool results in:

Traceback (most recent call last):
  File "sec-certs/virt/bin/fips-certs", line 33, in <module>
    sys.exit(load_entry_point('sec-certs', 'console_scripts', 'fips-certs')())
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "sec-certs/fips_cli.py", line 217, in main
    dset.finalize_results()
  File "sec-certs/sec_certs/serialization/json.py", line 51, in inner_func
    result = func(*args, **kwargs)
  File "sec-certs/sec_certs/dataset/fips.py", line 421, in finalize_results
    self.compute_cpe_heuristics()
  File "sec-certs/sec_certs/serialization/json.py", line 51, in inner_func
    result = func(*args, **kwargs)
  File "sec-certs/sec_certs/dataset/dataset.py", line 211, in compute_cpe_heuristics
    return self._compute_cpe_matches()
  File "sec-certs/sec_certs/dataset/dataset.py", line 197, in _compute_cpe_matches
    cve_dset = self._prepare_cve_dataset(False)
  File "sec-certs/sec_certs/dataset/dataset.py", line 175, in _prepare_cve_dataset
    cve_dataset.build_lookup_dict(use_nist_cpe_matching_dict, self.nist_cve_cpe_matching_dset_path)
  File "sec-certs/sec_certs/dataset/cve.py", line 68, in build_lookup_dict
    matching_dict = self.get_nist_cpe_matching_dict(nist_matching_filepath)
  File "sec-certs/sec_certs/dataset/cve.py", line 202, in get_nist_cpe_matching_dict
    with unzipped_path.open('r') as handle:
  File "/usr/lib/python3.9/pathlib.py", line 1252, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/lib/python3.9/pathlib.py", line 1120, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp19xbck6o/nvdcpematch-1.0.json'

Re-running the same command afterwards results in a different error further down the processing pipeline.

Unify FIPS and CC identification.

Currently, a FIPS certificate has:

  • cert_id that is a string type (in JSON) but really represents the integer certificate number.
  • dgst that is equal to the cert_id above.

However, a CC certificate has no cert_id (only an id in heuristics, as there is no guaranteed certificate ID present for CC). A CC certificate also has a dgst that is a hash of some fields of the certificate that will hopefully not change; this dgst is a 16-character hex string.

I propose the following:

  • Change the FIPS cert_id type to integer in the serialized JSON.
  • Change the FIPS dgst to also be a 16-character hex digest, e.g., of a hash of the cert_id, so that this field is unified across the two datasets (a sketch follows below).

I need this because:

  • I am using the cert_id field on the seccerts.org page, and sorting on a string field that contains numbers is just wrong.
  • I am using the dgst field on the seccerts.org page, and having it formatted differently for FIPS/CC causes issues.
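A minimal sketch of the proposed unification (the particular hash is an assumption, not part of the proposal):

import hashlib

def fips_dgst(cert_id: int) -> str:
    # 16-character hex string, matching the format of the CC dgst.
    return hashlib.sha256(str(cert_id).encode('utf-8')).hexdigest()[:16]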

FIPSDataset's get_certs_from_web() returns empty certificates

When creating a FIPS dataset with

from pathlib import Path

from sec_certs.dataset.fips import FIPSDataset

dset: FIPSDataset = FIPSDataset({}, Path('./my_debug_dataset'), 'sample_dataset', 'sample dataset description')
dset.get_certs_from_web(no_download_algorithms=True)

a dataset filled with cert_id: None entries is created. I would expect that parsing the data from the html sources would already populate the FIPSCertificate objects with the relevant data; the resulting dataset should therefore contain cert_id: FIPSCertificate mappings instead of Nones. The current behaviour complicates multiple situations. For example, when one attempts to perform CPE/CVE matching, they must first download and process the pdfs, which is time-consuming.

Can this bug be fixed?

Leftover work with MyPy

Problems:

  • Typing of inherited types (Certificate -> CommonCriteriaCert)
  • TODOs in general
  • Dataset class does not inherit from ComplexSerializableType
  • FIPS Html table parsing reverted (does it work?)
  • Resolve the # type: ignore comments
  • Introduce TypeVar for handling overloaded functions in child classes.

Enforce pandas serialization by abstract class

Currently, there's slight chaos in how pandas serialization should be handled. We could prepare two classes, something like PandasRow and PandasDataFrame, from which our classes (e.g., CommonCriteriaCert, CCDataset) would inherit. The inheritance could be used to enforce a unified interface for DataFrame serialization (a sketch below).
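A minimal sketch of the enforcement via abstract base classes (class and method names are illustrative):

from abc import ABC, abstractmethod

import pandas as pd

class PandasRow(ABC):
    # For objects that serialize into a single row, e.g. a certificate.
    @abstractmethod
    def to_pandas_row(self) -> pd.Series:
        ...

class PandasDataFrame(ABC):
    # For objects that serialize into a whole table, e.g. a dataset.
    @abstractmethod
    def to_pandas(self) -> pd.DataFrame:
        ...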

Unify types of sequences in datasets

Often, types of sequences (mostly lists and sets) are used interchangeably throughout the project. E.g., consider heuristics.cpe_matches of CommonCriteriaCert and FIPSCertificate. We should unify (or even enforce) how these objects are handled.

Special care must be taken when (de)serializing (an illustration below).
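One possible convention for the serialization concern (an illustration, not a decided rule): keep sets in memory, but serialize them as sorted lists so the JSON output stays deterministic:

def serialize_matches(matches):
    # set -> sorted list for stable JSON output
    return sorted(matches) if matches is not None else None

def deserialize_matches(data):
    # list -> set on load
    return set(data) if data is not None else None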

CC: Better exception handling on pdf processing

When running dset.extract_pdf_metadata() on a CCDataset, the PyPDF2 package is called on multiple occasions, and its functions fail here and there. There are some possible improvements to the current processing (a sketch follows the list):

  • If the pdf is encrypted, PyPDF2 attempts to override the encryption to learn the number of pages. If that fails, our whole function returns, ignoring getDocumentInfo(), which need not fail.
  • Many unnecessary warnings are displayed, like: PdfReadWarning: Superfluous whitespace found in object header b'50' b'0' [pdf.py:1665]
  • Some other issues (at least three more distinct ones are present) should be explored and, if possible, fixed.
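A minimal sketch of more forgiving processing, based on the old PyPDF2 1.x API that the warning above comes from (treat the exact imports and calls as assumptions about that version):

import warnings

from PyPDF2 import PdfFileReader
from PyPDF2.utils import PdfReadWarning

def extract_metadata(pdf_path):
    metadata = {}
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', PdfReadWarning)  # hide the superfluous-whitespace noise
        with open(pdf_path, 'rb') as handle:
            reader = PdfFileReader(handle, strict=False)
            try:
                metadata['pages'] = reader.getNumPages()  # may fail on encrypted pdfs
            except Exception:
                metadata['pages'] = None  # keep going instead of returning early
            try:
                info = reader.getDocumentInfo()  # need not fail even if the page count did
                metadata['title'] = info.title if info else None
            except Exception:
                metadata['title'] = None
    return metadata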

Dockerfile not working with MyBinder

We use the following instance on MyBinder.org: https://mybinder.org/v2/gh/crocs-muni/sec-certs/cc-feature-parity?filepath=notebooks%2Fcc_data_exploration.ipynb

Currently, the MyBinder notebook will not start properly (although it will build the Docker image) because it is not compatible with our Dockerfile for unknown reasons. At the same time, once a Dockerfile is present in the root of the repository, one cannot configure MyBinder any other way -- the Dockerfile overrides all other settings.

We should therefore do one of the following:

  1. Adjust the Dockerfile to start MyBinder properly: https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
  2. Move the Dockerfile elsewhere so that it does not interfere with MyBinder

Introduce MyPy into CI/CD

We should finalize typehints across the project and run MyPy to check that we adhere to typing best practices.

CPE dataset: Allow for storage of files along with the CC dataset

Currently, the CPEDataset class has constructors from_web(), from_xml(), and from_json(). However, apart from from_web(), the rest are going to be rarely invoked, as CCDataset neither downloads the xml nor saves the transformed json.

Edit: The same actually holds for CVEDataset

The solution is (a sketch below):

  • On the first invocation of CPEDataset, the CCDataset should download the xml, transform it to json, and store that json along with its files.
  • The CCDataset should always look first for the CPE json file in its directory, to avoid the costly download and transformation into json.
  • To drive the procedures above, CPEDataset should implement a to_json() method.
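A minimal sketch of the look-up-then-download logic as a CCDataset method (cpe_dataset_path is a hypothetical attribute; to_json() is the method proposed above):

class CCDataset:
    def _prepare_cpe_dataset(self):
        # Reuse the locally stored json if present; otherwise download, convert and cache it.
        if self.cpe_dataset_path.exists():
            return CPEDataset.from_json(self.cpe_dataset_path)
        cpe_dset = CPEDataset.from_web()
        cpe_dset.to_json(self.cpe_dataset_path)
        return cpe_dset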

Add mybinder.org example for simple CC data analysis

Introduce a simple mybinder.org Jupyter notebook that will:

  • fetch processed dataset from web
  • transform it into pandas
  • demonstrate one or two things one can do with the pandas DataFrame

This will lower the engagement barrier for users, as they'll be able to play with our data directly from their browser.

CC dataset get_certs_from_web fails

Running the cc-certs command on a fresh dataset fails:

cc-certs all -o cc_dataset

due to the detection of an HTML table row with more than 7 td elements:

2021-12-10 18:50:19,643 - sec_certs.dataset.dataset - INFO - Downloading required csv and html files.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:18<00:00, 39.04s/it]
2021-12-10 18:51:37,734 - sec_certs.dataset.dataset - INFO - Successfully downloaded 2 files, 0 failed.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.50s/it]
2021-12-10 18:51:40,747 - sec_certs.dataset.dataset - INFO - Successfully downloaded 2 files, 0 failed.
2021-12-10 18:51:40,748 - sec_certs.dataset.dataset - INFO - Adding CSV certificates to CommonCriteria dataset.
2021-12-10 18:51:40,871 - sec_certs.dataset.dataset - WARNING - The CSV cc_dataset/web/cc_products_active.csv contains 3 duplicates by the primary key.
2021-12-10 18:51:40,968 - sec_certs.dataset.dataset - INFO - Parsed 1634 certificates from: cc_products_active.csv
Skipping line 922: ',' expected after '"'
Skipping line 923: ',' expected after '"'
Skipping line 972: ',' expected after '"'
Skipping line 973: ',' expected after '"'
Skipping line 997: ',' expected after '"'
Skipping line 998: ',' expected after '"'
2021-12-10 18:51:41,094 - sec_certs.dataset.dataset - WARNING - The CSV cc_dataset/web/cc_products_archived.csv contains 10 duplicates by the primary key.
2021-12-10 18:51:41,248 - sec_certs.dataset.dataset - INFO - Parsed 3209 certificates from: cc_products_archived.csv
2021-12-10 18:51:41,273 - sec_certs.dataset.dataset - INFO - Added 4840 new and merged further 0 certificates to the dataset.
2021-12-10 18:51:41,273 - sec_certs.dataset.dataset - INFO - Adding HTML certificates to CommonCriteria dataset.
2021-12-10 18:51:44,156 - sec_certs.sample.certificate - ERROR - Unexpected number of cells in CC html row.
Traceback (most recent call last):
  File "sec-certs/virt/bin/cc-certs", line 33, in <module>
    sys.exit(load_entry_point('sec-certs', 'console_scripts', 'cc-certs')())
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "sec-certs/virt/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "sec-certs/cc_cli.py", line 71, in main
    dset.get_certs_from_web()
  File "sec-certs/sec_certs/serialization/json.py", line 51, in inner_func
    result = func(*args, **kwargs)
  File "sec-certs/sec_certs/dataset/common_criteria.py", line 238, in get_certs_from_web
    html_certs = self._get_all_certs_from_html(get_active, get_archived)
  File "sec-certs/sec_certs/dataset/common_criteria.py", line 338, in _get_all_certs_from_html
    partial_certs = self._parse_single_html(self.web_dir / file)
  File "sec-certs/sec_certs/dataset/common_criteria.py", line 412, in _parse_single_html
    certs.update(_parse_table(soup, cert_status, key, val))
  File "sec-certs/sec_certs/dataset/common_criteria.py", line 377, in _parse_table
    table_certs = {x.dgst: x for x in [
  File "sec-certs/sec_certs/dataset/common_criteria.py", line 378, in <listcomp>
    CommonCriteriaCert.from_html_row(row, cert_status, category_string) for row in body]}
  File "sec-certs/sec_certs/sample/common_criteria.py", line 388, in from_html_row
    raise
RuntimeError: No active exception to reraise

The code is:

cells = list(row.find_all('td'))
if len(cells) != 7:
    logger.error('Unexpected number of cells in CC html row.')
    raise

Docker: Add the ability to export experimental results out of docker

As @KeleranV pointed out on today's call, the current problem with Docker is that while the cc_oop_demo.py script runs flawlessly, the dataset files, jsons, etc. remain trapped inside the Docker container :).

As of now, the Python script is parametrized by a single local path parameter where all results are stored. This will remain true even when it is replaced by a full-experiment script in the future. Thus, we should decide how to make the experiment results accessible outside of the container for someone who uses Docker (one option below).

We discussed today with @KeleranV that he will investigate the options and pick some suitable one. @J08nY, do you have any opinion on this?
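One common option (an assumption here, not the decided approach) is to bind-mount a host directory into the container, e.g. docker run -v "$(pwd)/results:/results" seccerts/sec-certs, and point the script's output path parameter at /results so the results land on the host.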

Introduce SimpleComplexSerializableType

Since ComplexSerializableType is an abstract type, all child classes must implement the methods to_dict() and from_dict(). But most of the child classes only define these as

def to_dict(self):
    return copy.deepcopy(self.__dict__)

@classmethod
def from_dict(cls, dct: Dict):
    return cls(*tuple(dct.values()))

It would thus be beneficial to create a SimpleComplexSerializableType that implements these and let other classes inherit from it; this would save many LoC (a sketch below).

Also, default to_json() and from_json() implementations should be introduced.
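A minimal sketch of the proposed class (the default json round-trip shown here is an assumption):

import copy
import json

class SimpleComplexSerializableType(ComplexSerializableType):
    # Provides the common defaults so child classes need not repeat them.

    def to_dict(self):
        return copy.deepcopy(self.__dict__)

    @classmethod
    def from_dict(cls, dct):
        return cls(*tuple(dct.values()))

    def to_json(self, path):
        with open(path, 'w') as handle:
            json.dump(self.to_dict(), handle, indent=2)

    @classmethod
    def from_json(cls, path):
        with open(path) as handle:
            return cls.from_dict(json.load(handle))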

Adhere to single code style

As suggested by @GeorgeFI, we could enforce a single code style on the project. As the project is being maintained by more and more people, this is probably a good idea.

Improve to_pandas() function of CCDataset

Multiple enhancements should be made to the to_pandas() function of the CCDataset class. Namely (an illustration below):

  • Cast columns to proper types
  • Unwind complex dataclasses into multiple columns, instead of holding them as Python objects
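An illustrative post-processing of the current to_pandas() output; the column names come from the snippet at the top of this readme:

df = dset.to_pandas()
df['year_from'] = df['year_from'].astype('Int64')  # proper nullable integer type
# Unwind a complex object column into scalar columns instead of holding Python objects:
df['n_related_cves'] = df['related_cves'].map(lambda cves: len(cves) if cves else 0)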
