hlt-isti / quapy

A framework for Quantification written in Python

License: BSD 3-Clause "New" or "Revised" License

QuaPy

QuaPy is an open source framework for quantification (a.k.a. supervised prevalence estimation, or learning to quantify) written in Python.

QuaPy is based on the concept of "data sample", and provides implementations of the most important aspects of the quantification workflow, such as (baseline and advanced) quantification methods, quantification-oriented model selection mechanisms, evaluation measures, and evaluation protocols used for evaluating quantification methods. QuaPy also makes available commonly used datasets, and offers visualization tools for facilitating the analysis and interpretation of the experimental results.

Last updates:

Version 0.1.8 is released! Major changes can be consulted here.
A detailed wiki is available here.
The developer API documentation is available here

Installation

pip install quapy

Cite QuaPy

If you find QuaPy useful (and we hope you will), please consider citing the original paper in your research:

@inproceedings{moreo2021quapy,
  title={QuaPy: a python-based framework for quantification},
  author={Moreo, Alejandro and Esuli, Andrea and Sebastiani, Fabrizio},
  booktitle={Proceedings of the 30th ACM International Conference on Information \& Knowledge Management},
  pages={4534--4543},
  year={2021}
}

A quick example:

The following script fetches a dataset of tweets, trains, applies, and evaluates a quantifier based on the Adjusted Classify & Count quantification method, using, as the evaluation measure, the Mean Absolute Error (MAE) between the predicted and the true class prevalence values of the test set.

import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('semeval16')

# create an "Adjusted Classify & Count" quantifier
model = qp.method.aggregative.ACC(LogisticRegression())
model.fit(dataset.training)

estim_prevalence = model.quantify(dataset.test.instances)
true_prevalence  = dataset.test.prevalence()

error = qp.error.mae(true_prevalence, estim_prevalence)

print(f'Mean Absolute Error (MAE)={error:.3f}')

Quantification is useful in scenarios characterized by prior probability shift. In other words, there would be little interest in estimating the class prevalence values of the test set if the IID assumption held, since these values would be roughly equal to the class prevalence values of the training set. For this reason, any quantification model should be tested across many samples, including ones whose class prevalence values differ, even substantially, from those found in the training set. QuaPy implements sampling procedures and evaluation protocols that automate this workflow. See the Wiki for detailed examples.
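The need to test across many prevalence values can be illustrated with a small NumPy-only sketch. It does not use QuaPy's own protocol classes: the helper names (draw_sample_indices, classify_and_count), the simulated classifier (80% true-positive rate, 10% false-positive rate), and the sample size are all assumptions made for illustration.

```python
import numpy as np


def draw_sample_indices(labels, prevalence, size, rng):
    """Return indices of a sample whose positive-class prevalence is `prevalence`."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = int(round(prevalence * size))
    return np.concatenate([
        rng.choice(pos, size=n_pos, replace=True),
        rng.choice(neg, size=size - n_pos, replace=True),
    ])


def classify_and_count(predictions):
    """Plain Classify & Count: the fraction of items predicted positive."""
    return float(np.mean(predictions))


rng = np.random.default_rng(0)

# Hypothetical pool of labelled items with roughly balanced classes.
labels = rng.integers(0, 2, size=5000)

# Simulated hard classifier: 80% true-positive rate, 10% false-positive rate.
noise = rng.random(labels.size)
predictions = np.where(labels == 1, noise < 0.8, noise < 0.1).astype(int)

# Evaluate the same quantifier on samples drawn at many different prevalences.
abs_errors = {}
for prevalence in np.linspace(0.0, 1.0, 11):
    idx = draw_sample_indices(labels, prevalence, size=500, rng=rng)
    true_prev = float(labels[idx].mean())
    estim_prev = classify_and_count(predictions[idx])
    abs_errors[round(true_prev, 1)] = abs(estim_prev - true_prev)

for prev, err in abs_errors.items():
    print(f"true prevalence={prev:.1f}  absolute error={err:.3f}")
```

With these simulated rates, the expected Classify & Count estimate is 0.1 + 0.7p for a true prevalence p, so the error is close to zero near p = 1/3 and grows toward both extremes. A single test sample near p = 1/3 would therefore make this quantifier look far better than it is.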

Features

Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization, quantification methods based on structured output learning, HDy, QuaNet, quantification ensembles, among others).
Versatile functionality for performing evaluation based on sampling generation protocols (e.g., APP, NPP, etc.).
Implementation of the most commonly used evaluation metrics (e.g., AE, RAE, NAE, NRAE, SE, KLD, NKLD, etc.).
Datasets frequently used in quantification (textual and numeric), including:
- 32 UCI Machine Learning binary datasets.
- 5 UCI Machine Learning multiclass datasets (new in v0.1.8!).
- 11 Twitter quantification-by-sentiment datasets.
- 3 product reviews quantification-by-sentiment datasets.
- 4 tasks from the LeQua competition (new in v0.1.7!).
- IFCB dataset of plankton water samples (new in v0.1.8!).
Native support for binary and single-label multiclass quantification scenarios.
Model selection functionality that minimizes quantification-oriented loss functions.
Visualization tools for analysing the experimental results.
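As a rough illustration of three of the evaluation metrics named in the list (AE, RAE, KLD), the sketch below computes them on a single pair of prevalence vectors. It is an independent NumPy sketch, not the code in qp.error. The function names are hypothetical, and the additive smoothing with eps = 1 / (2 * sample size) is a convention from the quantification literature that is assumed here rather than taken from this README.

```python
import numpy as np


def smooth(prevs, eps):
    """Additive smoothing that keeps the vector summing to 1."""
    prevs = np.asarray(prevs, dtype=float)
    return (prevs + eps) / (eps * prevs.size + 1.0)


def absolute_error(true_prevs, estim_prevs):
    """AE: mean absolute difference between the two prevalence vectors."""
    true_prevs = np.asarray(true_prevs, dtype=float)
    estim_prevs = np.asarray(estim_prevs, dtype=float)
    return float(np.mean(np.abs(estim_prevs - true_prevs)))


def relative_absolute_error(true_prevs, estim_prevs, eps):
    """RAE: per-class absolute error relative to the smoothed true prevalence."""
    p = smooth(true_prevs, eps)
    p_hat = smooth(estim_prevs, eps)
    return float(np.mean(np.abs(p_hat - p) / p))


def kl_divergence(true_prevs, estim_prevs, eps):
    """KLD between the smoothed true and estimated prevalence vectors."""
    p = smooth(true_prevs, eps)
    p_hat = smooth(estim_prevs, eps)
    return float(np.sum(p * np.log(p / p_hat)))


true_prevs = [0.5, 0.3, 0.2]
estim_prevs = [0.4, 0.4, 0.2]
eps = 1.0 / (2 * 100)  # assumes a sample of 100 items

print("AE :", absolute_error(true_prevs, estim_prevs))
print("RAE:", relative_absolute_error(true_prevs, estim_prevs, eps))
print("KLD:", kl_divergence(true_prevs, estim_prevs, eps))
```

Smoothing matters for RAE and KLD because both divide by a prevalence value, which would otherwise be undefined whenever a class has zero prevalence.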

Requirements

scikit-learn, numpy, scipy
pytorch (for QuaNet)
svmperf patched for quantification (see below)
joblib
tqdm
pandas, xlrd
matplotlib
ucimlrepo

Contributing

If you want to contribute improvements to quapy, please open a pull request against the "devel" branch.

Documentation

The developer API documentation is available here.

Check out our Wiki, in which many examples are provided.

quapy's Issues

Pandas dependency

If you install quapy in a clean pip environment it will install pandas 2.0.x.

In this version of pandas, append was removed so the line df = df.append(series, ignore_index=True) in _prevalence_report in file evaluation.py breaks.

I would either pin the dependency to pandas 1.x or replace the append call with concat.
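A minimal sketch of the concat-based replacement is shown below. The DataFrame contents and column names are hypothetical and are not taken from _prevalence_report. Only the append-to-concat substitution is the point.

```python
import pandas as pd

# Hypothetical report with one existing row; column names are illustrative only.
df = pd.DataFrame([{"true-prev": 0.2, "estim-prev": 0.25}])
series = pd.Series({"true-prev": 0.5, "estim-prev": 0.45})

# pandas < 2.0 only:
#   df = df.append(series, ignore_index=True)
# Equivalent call that works in both pandas 1.x and 2.x:
df = pd.concat([df, series.to_frame().T], ignore_index=True)

print(df)
```

When many rows are added in a loop, collecting them in a list and building the DataFrame once at the end is usually cheaper than concatenating repeatedly.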

Error in LSTMnet

I think there is an error in the function init_hidden:

def init_hidden(self, set_size):
        opt = self.hyperparams
        var_hidden = torch.zeros(opt['lstm_nlayers'], set_size, opt['lstm_hidden_size'])
        var_cell = torch.zeros(opt['lstm_nlayers'], set_size, opt['lstm_hidden_size'])
        if next(self.lstm.parameters()).is_cuda:
            var_hidden, var_cell = var_hidden.cuda(), var_cell.cuda()
        return var_hidden, var_cell

Where it says opt['lstm_hidden_size'], it should be opt['hidden_size'].

Wiki correction

In the last part of the Methods wiki page, where it says:

from classification.neural import NeuralClassifierTrainer, CNNnet

I think it should say:

from quapy.classification.neural import NeuralClassifierTrainer, LSTMnet

Simplify the newcomer first test (suggested by Pablo)

A newcomer runs "pip install quapy" and wants to quickly test whether it is working.
All the examples in the "examples" dir look overwhelming.
The "quick example" in the README downloads 300MB...

It would be desirable to provide a much simpler example for testing the installation.

Using a different gpu than cuda:0

The code seems to be tied to using only 'cuda', which by default selects the first GPU in the system ('cuda:0'). It would be handy to be able to tell the library which CUDA GPU to train on (cuda:0, cuda:1, etc.).

Additional multi-class quantification estimators

Dear all,

What an excellent package – I wish I had known about QuaPy earlier! Albert Ziegler and I were working on a Bayesian quantification estimator (with the aim of quantifying epistemic uncertainty around the prevalence estimate) and in our codebase we have implemented several baselines for multi-class quantification problems, including:

Invariant ratio estimator of Vaz et al. (2019).
Black-box shift estimator of Lipton et al. (2018).

I have been wondering if I should submit a PR to QuaPy implementing any of these methods.
I should also add that for Bayesian quantification we will soon replace the current PyMC backend with JAX and NumPyro (which is faster), and we are planning to add a JAX-based Gibbs sampler, which is a slight extension of the expectation-maximization algorithm.

Please, let me know what you think!

Best wishes,
Pawel

Couldn't train QuaNet on multiclass data

Hi, I am having trouble training a QuaNet quantifier on multiclass (20 classes) data. Everything works fine when my dataset has only 2 classes. It looks like the ACC quantifier is not able to aggregate over more than 2 classes?

The classifier is built and trained with the code below:

classifier = LSTMnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(classifier)
learner.fit(*dataset.training.Xy)

using the default configuration:

{'embedding_size': 100, 'hidden_size': 256, 'repr_size': 100, 'lstm_class_nlayers': 1, 'drop_p': 0.5}

Then I tried to train QuaNet with the following code:

model = QuaNetTrainer(learner, qp.environ['SAMPLE_SIZE'])
model.fit(dataset.training, fit_learner=False)

which showed that QuaNet is built as:

QuaNetModule(
  (lstm): LSTM(120, 64, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (ff_layers): ModuleList(
    (0): Linear(in_features=208, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
  )
  (output): Linear(in_features=512, out_features=20, bias=True)
)

The error then occurred in model.fit().

Attached is the error I get.

Traceback (most recent call last):
  File "quanet-test.py", line 181, in <module>
    model.fit(dataset.training, fit_learner=False)
  File "/home/vickys/.local/lib/python3.6/site-packages/quapy/method/neural.py", line 126, in fit
    self.epoch(train_data_embed, train_posteriors, self.tr_iter, epoch_i, early_stop, train=True)
  File "/home/vickys/.local/lib/python3.6/site-packages/quapy/method/neural.py", line 182, in epoch
    quant_estims = self.get_aggregative_estims(sample_posteriors)
  File "/home/vickys/.local/lib/python3.6/site-packages/quapy/method/neural.py", line 145, in get_aggregative_estims
    prevs_estim.extend(quantifier.aggregate(predictions))
  File "/home/vickys/.local/lib/python3.6/site-packages/quapy/method/aggregative.py", line 238, in aggregate
    return ACC.solve_adjustment(self.Pte_cond_estim_, prevs_estim)
  File "/home/vickys/.local/lib/python3.6/site-packages/quapy/method/aggregative.py", line 246, in solve_adjustment
    adjusted_prevs = np.linalg.solve(A, B)
  File "<array_function internals>", line 6, in solve
  File "/usr/local/lib64/python3.6/site-packages/numpy/linalg/linalg.py", line 394, in solve
    r = gufunc(a, b, signature=signature, extobj=extobj)
ValueError: solve1: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (m,m),(m)->(m) (size 2 is different from 20)

Thank you!
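The ValueError in the traceback says that the matrix and the vector passed to np.linalg.solve disagree on the number of classes (2 versus 20). The sketch below shows the shape requirement of such an adjustment step in isolation. It is not QuaPy's implementation: the function name solve_adjustment_sketch, the explicit shape check, and the final clipping and renormalization are assumptions made for illustration.

```python
import numpy as np


def solve_adjustment_sketch(cond_pred_given_true, cc_prevs):
    """Solve A @ p = b for p.

    A[i, j] estimates P(predicted class = i | true class = j) and b holds the
    Classify & Count prevalence estimates, so A must be (n_classes, n_classes)
    and b must have n_classes entries.
    """
    A = np.asarray(cond_pred_given_true, dtype=float)
    b = np.asarray(cc_prevs, dtype=float)
    n_classes = b.shape[0]
    if A.shape != (n_classes, n_classes):
        raise ValueError(
            f"expected a {n_classes}x{n_classes} matrix, got shape {A.shape}"
        )
    adjusted = np.linalg.solve(A, b)
    # Assumed post-processing: clip to [0, 1] and renormalize to sum to 1.
    adjusted = np.clip(adjusted, 0.0, 1.0)
    return adjusted / adjusted.sum()


# Three-class example: each column of A sums to 1.
A = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.2],
    [0.1, 0.1, 0.7],
])
true_prevs = np.array([0.2, 0.3, 0.5])
cc_prevs = A @ true_prevs  # what Classify & Count would report in expectation

print(solve_adjustment_sketch(A, cc_prevs))
```

Calling the function with a 2x2 matrix and a 20-entry vector raises the explicit ValueError from the shape check instead of the lower-level gufunc error shown in the traceback.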

Parameter fit_learner in QuaNetTrainer (fit method)

The parameter fit_learner is not used in the function:

def fit(self, data: LabelledCollection, fit_learner=True):

and the learner is fitted every time:

self.learner.fit(*classifier_data.Xy)
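One way the flag could be honored is sketched below with hypothetical stand-in classes (ToyLearner, ToyQuantifierTrainer). This is not QuaNetTrainer's code. It only shows the fit call being guarded by the flag so that a pre-trained learner is left untouched.

```python
class ToyLearner:
    """Hypothetical classifier stand-in that counts how often fit() is called."""

    def __init__(self):
        self.n_fits = 0

    def fit(self, X, y):
        self.n_fits += 1
        return self


class ToyQuantifierTrainer:
    """Hypothetical trainer wrapping a learner (not QuaPy code)."""

    def __init__(self, learner):
        self.learner = learner

    def fit(self, X, y, fit_learner=True):
        # Only (re)train the underlying classifier when the caller asks for it.
        if fit_learner:
            self.learner.fit(X, y)
        # ... the quantifier itself would be trained here ...
        return self


X, y = [[0.0], [1.0]], [0, 1]
learner = ToyLearner().fit(X, y)          # pre-trained by the caller
trainer = ToyQuantifierTrainer(learner)
trainer.fit(X, y, fit_learner=False)      # must not retrain the learner

print(learner.n_fits)
```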

Missing dependencies and mismatched dependency versions

A new, plain installation of QuaPy is currently not importable without installing an additional dependency (certifi) and a specific version of another dependency (matplotlib v3.8).

Steps to reproduce:

python -m venv venv
venv/bin/pip install quapy
venv/bin/python -c "import quapy"

Leads to ModuleNotFoundError: No module named 'certifi'.

Calling

venv/bin/pip install certifi
venv/bin/python -c "import quapy"

fixes the first issue but now leads to ImportError: cannot import name 'get_cmap' from 'matplotlib.cm'.

This second issue is resolved by explicitly installing a previous version of matplotlib:

venv/bin/pip install matplotlib==3.8

My suggestion for the certifi issue is to add a corresponding entry to setup.py. My suggestion for the second issue is to use a variant of get_cmap that is available in older versions and in the current version of matplotlib.

quapy breaks the warnings module

First I would like to thank the authors for this very useful library - it helped a colleague and me a lot! But today I stumbled upon the following lines at quapy.data.datasets while testing my code (using with pytest.warns(...):):

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

These lines basically disable the whole warnings module (and caused my tests to fail). This happens globally, as a side effect of merely importing quapy. A better solution is to use the catch_warnings context manager wherever an undesired warning is expected.
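A minimal sketch of the catch_warnings approach follows. The function noisy_dependency is a hypothetical stand-in for whichever third-party call emits the unwanted warning. The point is that warnings are silenced only inside the with block and warnings.warn itself is never replaced.

```python
import warnings


def noisy_dependency():
    """Hypothetical stand-in for a third-party call that emits a warning."""
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42


def call_quietly():
    """Silence warnings only for the duration of this one call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return noisy_dependency()


original_warn = warnings.warn
result = call_quietly()

# The global warnings machinery is untouched after the call.
print(result, warnings.warn is original_warn)
```

Because the filter change is undone when the with block exits, code that runs afterwards (including pytest.warns checks) still sees warnings as usual.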

For anyone who needs a temporary fix before this is resolved:

import warnings
import importlib
import quapy.data.datasets
importlib.reload(warnings)

This reloads the warnings module and thus resets warnings.warn globally.