DeeProb-kit

DeeProb-kit is a unified library written in Python consisting of a collection of deep probabilistic models (DPMs) that are tractable and exact representations for the modelled probability distributions. The availability of a representative selection of DPMs in a single library makes it possible to combine them in a straightforward manner, a common practice in deep learning research nowadays. In addition, it includes efficiently implemented learning techniques, inference routines, statistical algorithms, and provides high-quality fully-documented APIs. The development of DeeProb-kit will help the community to accelerate research on DPMs as well as to standardise their evaluation and better understand how they are related based on their expressivity.

Features

  • Inference algorithms for SPNs. [1, 2]
  • Structure learning algorithms for SPNs. [1, 3, 4, 2, 5]
  • Chow-Liu Trees (CLTs) as SPN leaves. [6]
  • Cutset Networks (CNets) with various learning criteria. [7]
  • Batch Expectation-Maximization (EM) for SPNs with arbitrary leaves. [8, 9]
  • Structural marginalization and pruning algorithms for SPNs.
  • High-order moments computation for SPNs.
  • JSON I/O operations for SPNs and CLTs. [2]
  • Plotting operations based on NetworkX for SPNs and CLTs. [2]
  • Randomized and Tensorized SPNs (RAT-SPNs). [10]
  • Deep Generalized Convolutional SPNs (DGC-SPNs). [11]
  • Masked Autoregressive Flows (MAFs). [12]
  • Real Non-Volume-Preserving (RealNVP) flows. [13]
  • Non-linear Independent Component Estimation (NICE) flows. [14]

The collection of implemented models is summarized in the following table.

| Model       | Description                                        |
|-------------|----------------------------------------------------|
| Binary-CLT  | Binary Chow-Liu Tree (CLT)                         |
| Binary-CNet | Binary Cutset Network (CNet)                       |
| SPN         | Vanilla Sum-Product Network                        |
| MSPN        | Mixed Sum-Product Network                          |
| XPC         | Random Probabilistic Circuit                       |
| RAT-SPN     | Randomized and Tensorized Sum-Product Network      |
| DGC-SPN     | Deep Generalized Convolutional Sum-Product Network |
| MAF         | Masked Autoregressive Flow                         |
| NICE        | Non-linear Independent Components Estimation Flow  |
| RealNVP     | Real-valued Non-Volume-Preserving Flow             |

Installation

The library can be installed either from PyPI or from source.

```shell
# Install from PyPI
pip install deeprob-kit
# Install from the `main` git branch
pip install -e git+https://github.com/deeprob-org/deeprob-kit.git@main#egg=deeprob-kit
```

Project Directories

The documentation is generated automatically by Sphinx using sources stored in the docs directory.

A collection of code examples and experiments can be found in the examples and experiments directories respectively. Moreover, benchmark code can be found in the benchmark directory.

Cite

```bibtex
@misc{loconte2022deeprob,
  doi = {10.48550/ARXIV.2212.04403},
  url = {https://arxiv.org/abs/2212.04403},
  author = {Loconte, Lorenzo and Gala, Gennaro},
  title = {{DeeProb-kit}: a Python Library for Deep Probabilistic Modelling},
  publisher = {arXiv},
  year = {2022}
}
```

References

  1. Peharz et al. On Theoretical Properties of Sum-Product Networks. AISTATS (2015).

  2. Molina, Vergari et al. SPFlow: An Easy and Extensible Library for Deep Probabilistic Learning using Sum-Product Networks. CoRR (2019).

  3. Poon and Domingos. Sum-Product Networks: A New Deep Architecture. UAI (2011).

  4. Molina, Vergari et al. Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains. AAAI (2018).

  5. Di Mauro et al. Sum-Product Network Structure Learning by Efficient Product Nodes Discovery. AIxIA (2018).

  6. Di Mauro, Gala et al. Random Probabilistic Circuits. UAI (2021).

  7. Rahman et al. Cutset Networks: A Simple, Tractable, and Scalable Approach for Improving the Accuracy of Chow-Liu Trees. ECML-PKDD (2014).

  8. Desana and Schnörr. Learning Arbitrary Sum-Product Network Leaves with Expectation-Maximization. CoRR (2016).

  9. Peharz et al. Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits. ICML (2020).

  10. Peharz et al. Probabilistic Deep Learning using Random Sum-Product Networks. UAI (2020).

  11. Van de Wolfshaar and Pronobis. Deep Generalized Convolutional Sum-Product Networks for Probabilistic Image Representations. PGM (2020).

  12. Papamakarios et al. Masked Autoregressive Flow for Density Estimation. NeurIPS (2017).

  13. Dinh et al. Density Estimation using RealNVP. ICLR (2017).

  14. Dinh et al. NICE: Non-linear Independent Components Estimation. ICLR (2015).

deeprob-kit's People

Contributors: fedous, gengala, loreloc, yangyang-pro


deeprob-kit's Issues

Update README.md and fix implicit imports

  • Update the table of implemented models in README.md
  • Add NormalizingFlow abstract class import in flows/models/__init__.py
  • Add RatSpn abstract class import in spn/models/__init__.py
  • Fix the `'type' object is not subscriptable` error when using Sphinx
  • Prepend MIT license information to every source file in deeprob/

Add a string flag "method" on SPN learning wrappers

Add a string flag method to the learn_estimator function (in module deeprob.spn.learning.wrappers) that allows choosing between different SPN learning algorithms.

At the moment, the method flag must support two values: learnspn and learnxpc, corresponding to the LearnSPN and LearnXPC algorithms respectively.
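A string-dispatch wrapper could implement this flag along the following lines. This is a hypothetical sketch: the learner functions below are stand-ins for the actual LearnSPN/LearnXPC routines, and the signature is illustrative, not the library's real API.

```python
# Placeholder learners standing in for the real LearnSPN/LearnXPC routines.
def learn_spn(data, **kwargs):
    return {"algorithm": "learnspn", "data": data}

def learn_xpc(data, **kwargs):
    return {"algorithm": "learnxpc", "data": data}

# Registry mapping flag values to learning algorithms.
_LEARN_METHODS = {"learnspn": learn_spn, "learnxpc": learn_xpc}

def learn_estimator(data, method="learnspn", **kwargs):
    """Learn an SPN structure, choosing the algorithm by a string flag."""
    if method not in _LEARN_METHODS:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {sorted(_LEARN_METHODS)}"
        )
    return _LEARN_METHODS[method](data, **kwargs)
```

A registry dictionary keeps the wrapper open for extension: adding a new learning algorithm only requires registering one more entry.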

Fully differentiable MAFs

The apply_forward method of the AutoregressiveLayer class is not differentiable due to in-place operations. This makes it impossible to train using the flow's sampling direction.
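The usual fix for autograd-breaking updates is to replace in-place writes with out-of-place accumulation, so every intermediate value stays in the computation graph. The sketch below shows the pattern in framework-agnostic Python; the function names are illustrative, not the library's.

```python
# In-place update: mutates its input step by step. In an autograd framework
# such as PyTorch, overwriting tensors like this breaks gradient tracking.
def apply_forward_inplace(x, transforms):
    for i, t in enumerate(transforms):
        x[i] = t(x[i])
    return x

# Out-of-place alternative: builds a fresh sequence instead of mutating,
# which keeps every intermediate value (and hence gradients) alive.
def apply_forward_functional(x, transforms):
    return [t(xi) for t, xi in zip(transforms, x)]
```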

Setup PyLint

Set up the PyLint static code analyser.
Also, set up a GitHub Action that automatically prints a report about code quality.

Refactor Unit Tests

  • Refactor tests to use pytest instead of unittest
  • Add tests for shape checking
  • Introduce Continuous Integration (CI), e.g. a GitHub Action using Codecov on merges to main

Write a README.md file for each sub-directory

Split the README.md file at the root directory into multiple Markdown files discussing the content (and usage) of the scripts in the following directories:

  • benchmark
  • docs
  • examples
  • experiments

On flows, mean and standard deviation of default base distribution are not kept constant during training

When training a normalizing flow with a standard Gaussian base distribution (i.e. the default in_base=None), the mean and standard deviation are not kept constant during training. The expected behavior is that they remain constant.

This is probably due to an incorrect initialization of the mean and standard deviation parameters: https://github.com/deeprob-org/deeprob-kit/blob/main/deeprob/flows/models/base.py#L52-L53.
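In PyTorch terms, the likely fix is to register the constants as buffers (register_buffer) rather than trainable parameters, since only parameters are handed to the optimizer. The toy sketch below imitates that distinction in plain Python to show why it matters; class and method names mimic PyTorch but are not the library's code.

```python
# Minimal, framework-free imitation of the parameter/buffer split.
class Module:
    def __init__(self):
        self._parameters = {}  # values the optimizer updates
        self._buffers = {}     # values saved with the model but never trained

    def register_parameter(self, name, value):
        self._parameters[name] = value

    def register_buffer(self, name, value):
        self._buffers[name] = value

    def parameters(self):
        # Only registered parameters are exposed to the optimizer.
        return list(self._parameters.values())

class StandardGaussianBase(Module):
    def __init__(self):
        super().__init__()
        # Constants of the default base distribution: buffers, not parameters,
        # so training cannot change them.
        self.register_buffer("mean", 0.0)
        self.register_buffer("std", 1.0)
```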

Example `plot_spn.py` raises an error

I was trying to run plot_spn.py, but the code raises an error. Here's the output:

```
Plotting the dummy SPN to spn-dummy.svg ...
Traceback (most recent call last):
  File ".../deeprob-kit/examples/spn_plot.py", line 25, in <module>
    spn.plot_spn(root, spn_filename)
  File ".../miniconda3/envs/deeprob/lib/python3.9/site-packages/deeprob/spn/structure/io.py", line 317, in plot_spn
    pos = nx_pydot.graphviz_layout(graph, prog='dot')
  File ".../miniconda3/envs/deeprob/lib/python3.9/site-packages/networkx/drawing/nx_pydot.py", line 357, in graphviz_layout
    return pydot_layout(G=G, prog=prog, root=root)
  File ".../miniconda3/envs/deeprob/lib/python3.9/site-packages/networkx/drawing/nx_pydot.py", line 406, in pydot_layout
    P = to_pydot(G)
  File ".../miniconda3/envs/deeprob/lib/python3.9/site-packages/networkx/drawing/nx_pydot.py", line 263, in to_pydot
    raise ValueError(
ValueError: Node names and attributes should not contain ":" unless they are quoted with "". For example the string 'attribute:data1' should be written as '"attribute:data1"'. Please refer https://github.com/pydot/pydot/issues/258
```
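A possible workaround, until the node naming is fixed, is to quote any node name containing ':' before the graph reaches pydot (as the error message itself suggests). The helper below is a hypothetical sketch of that idea, not code from the library.

```python
# Hypothetical helper: quote graph node names that pydot rejects unquoted
# (see https://github.com/pydot/pydot/issues/258).
def pydot_safe_name(name):
    name = str(name)
    if ":" in name and not (name.startswith('"') and name.endswith('"')):
        return f'"{name}"'
    return name
```

Applying this helper to every node name (e.g. when building the NetworkX graph in plot_spn) should avoid the ValueError above.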

Feedback on the example running experience

I ran all examples. They are a nice way of testing how the code runs on one's computer and show its capabilities. Below, I provide some points of feedback/suggestions that may improve the experience people have when running the examples. Some of that feedback may pertain or be relevant to other parts of the code base as well.

  1. Often, files are created as part of an example, such as the nice illustrative figures. It would be useful to alert the user of all files being created, so that they are aware of this even if they do not keep an eye on their working folder. Also, some files have unclear purpose (such as the pt files). Clarifying their use when alerting they are created would therefore be useful. (If they are temporary files, delete them at the end of running the example or use the tempfile module.)
  2. The console output provides useful information about the time it takes to run an example. If possible, generalize this to all examples that are not trivially short. (I think the first stage of spn_latent_mnist.py does not.)
  3. The console output numerical values often have a large number of digits displayed. There is little reason to believe that many are actually significant. Furthermore, it makes the output more difficult to read and digest. Ideally, output only significant digits, but if you do not know how many digits are significant, 4 digits in total is a good upper bound (like 57.63 %, 1234, 1.234e6).
  4. Many of the console output numbers have units (s, it/s, batch/s). The international standard is to always have a space between a number and its unit.
  5. Sometimes, JSON output is created either as console output or in files. Try to pretty-print it a bit, to make it easier to scan. If it is not meant to be read, perhaps consider omitting it.
  6. For many of the examples, you generate images, which is great. It would add value to have every example generate some image, even if it is not a sample. Namely, the examples can also provide users of the package inspiration of the type of images that they might generate.
  7. In one case, an image was generated in an interactive window (nvp1d_moons.py) and not in an image file. That is nice. Could it be generalized to all examples, with a fallback to image file generation?
  8. In two cases, the examples automatically downloaded some datasets. While convenient, some users might not expect this, may not like it, or may not have an internet connection. I think it would be more user-friendly to ask first or instruct first where to download the dataset. Furthermore, I saw that MNIST was downloaded from LeCun's original website, who explicitly requests not to do that (“Please refrain from accessing these files from automated scripts with high frequency. Make copies!”); it would be polite to honor that request. In general, make sure to download from permanent repositories if possible instead of possibly non-permanent websites.
  9. The console output lists accuracy percentages. These generally are quite a bit closer to 100 % than to 0 %. Therefore, the initial digit (7, 8, 9) is often not very significant and therefore distracting. It is more user friendly to use error rate instead, so, e.g., [12.49, 8.66, 4.57] instead of [87.51, 91.34, 95.43].

Obviously, these are mostly cosmetic suggestions, so I'd understand that you classify (parts of) this issue as ‘wontfix’.

Create TreeBN class

Most of the code available in BinaryCLT actually works for any tree-shaped Bayesian Network. Therefore, it would be better to create a super-class called TreeBN and then make BinaryCLT a subclass of it.
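A minimal sketch of the proposed hierarchy is shown below. The tree representation (a child-to-parent mapping) and the method names are placeholders for illustration; the real classes carry much more state.

```python
class TreeBN:
    """Generic tree-shaped Bayesian Network.

    The structure is stored as a child -> parent mapping, with None
    marking the root.
    """
    def __init__(self, parents):
        self.parents = dict(parents)

    def root(self):
        # The root is the unique node without a parent.
        return next(v for v, p in self.parents.items() if p is None)

    def children_of(self, node):
        return [v for v, p in self.parents.items() if p == node]

class BinaryCLT(TreeBN):
    """Chow-Liu Tree over binary variables; CLT-specific structure
    learning and parameter estimation would live here."""
```

With this split, any future tree-shaped model (e.g. over categorical variables) could reuse the generic traversal and inference code in TreeBN.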

Introduce multithreaded implementation of forward and backward evaluation of SPNs

  • The forward evaluation (used for EVI, MAR and MPE queries and sampling) can be parallelized by considering a layered topological ordering of the SPN graph. That is, every leaf node can be evaluated in parallel and, after that, every parent node can be computed in parallel as well, and so on.
  • The backward evaluation (used for MPE query and sampling) can be parallelized by considering a layered topological ordering of the SPN graph, as for forward evaluation.
  • Moreover, introduce unit tests ensuring the correctness of the implementation.

A suitable multiprocessing library for this task is joblib, which allows specifying 'threading' as a lightweight backend.
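The layered scheme above can be sketched with the standard library's thread pool (joblib's 'threading' backend behaves similarly). The graph representation and the node_fn callback are assumptions made for illustration, not the library's API.

```python
from concurrent.futures import ThreadPoolExecutor

def topological_layers(children):
    """Group DAG nodes into layers: leaves first, each node after its children.

    `children` maps every node to the list of its child nodes ([] for leaves).
    """
    depth = {}
    def node_depth(n):
        if n not in depth:
            depth[n] = 0 if not children[n] else 1 + max(node_depth(c) for c in children[n])
        return depth[n]
    for n in children:
        node_depth(n)
    layers = {}
    for n, d in depth.items():
        layers.setdefault(d, []).append(n)
    return [layers[d] for d in sorted(layers)]

def forward(children, node_fn):
    """Evaluate node_fn(node, child_values) layer by layer, in parallel."""
    values = {}
    with ThreadPoolExecutor() as pool:
        for layer in topological_layers(children):
            # All nodes within a layer only depend on already-computed
            # values, so they can be evaluated concurrently.
            results = pool.map(
                lambda n: node_fn(n, [values[c] for c in children[n]]), layer
            )
            for n, v in zip(layer, results):
                values[n] = v
    return values
```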

Unclear where experiments folder is

I installed deeprob-kit using pip:

```shell
pip install --user deeprob-kit
```

Now, I want to try out the experiments to see if the code works on my system. However, I do not seem to have the experiments folder, and therefore cannot run them or put the datasets in place. Namely, what I have after installation is the following tree:

```
~/.local/lib/python3.9 $ tree -L 3
.
└── site-packages
    ├── deeprob
    │   ├── __init__.py
    │   ├── __pycache__
    │   ├── context.py
    │   ├── flows
    │   ├── spn
    │   ├── torch
    │   └── utils
    └── deeprob_kit-1.0.0.dist-info
        ├── INSTALLER
        ├── LICENSE
        ├── METADATA
        ├── RECORD
        ├── REQUESTED
        ├── WHEEL
        └── top_level.txt

8 directories, 9 files
```

My impression is that the bundle on PyPI only contains deeprob-kit itself, without any of the other materials. Perhaps putting the experiments folder (and others) under deeprob may provide a solution, but I guess you chose the current structure for a reason. Or perhaps I am looking in the wrong location.

Setup "deeprob-kit-docs" repository or "gh-pages" branch for automatically versioned documentation

Set up a new repository deeprob-org/deeprob-kit-docs or the special branch gh-pages containing versioned documentation.
Refer to sphinx-multiversion for building versioned documentation.
In particular, refer to a fork of sphinx-multiversion that supports sphinx-apidoc and sphinx-autodoc.

Finally, set up a GitHub Action that automatically pushes a new documentation version when:

  1. A push to the main branch is made.
  2. A new tag/release is pushed.

Alternatively, this can also be done using Travis CI.

Implement the FID score for normalizing flows

Implement the FID score for generative models. A suitable module in which to place the fid_score function is deeprob.utils.statistics.

Moreover, include the FID score, alongside the bits-per-pixel (BPP) metric, in the results reported by the normalizing flow experiments.
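As a starting point, the diagonal-covariance special case of the Fréchet distance can be written with the standard library alone; the general case additionally needs a matrix square root (e.g. scipy.linalg.sqrtm). The function below is a simplified sketch, not the proposed deeprob.utils.statistics implementation.

```python
import math

def fid_score_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)); with diagonal
    covariances, the trace term reduces to a per-dimension sum.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term
```

In the actual FID, the two Gaussians are fitted to Inception-network activations of real and generated samples respectively; identical statistics yield a score of zero.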
