cblearn / cblearn Goto Github PK

Comparison-based Machine Learning in Python

License: MIT License

Python 100.00%

ordinal embedding machine-learning machinelearning scikit-learn non-metric scaling multidimensional-scaling multidimensional

cblearn's Introduction

cblearn

Comparison-based Machine Learning in Python

Comparison-based Learning algorithms are the Machine Learning algorithms to use when training data contains similarity comparisons ("A and B are more similar than C and D") instead of data points.

Triplet comparisons from human observers help model the perceived similarity of objects. These human triplets are collected in studies, asking questions like "Which of the following bands is most similar to Queen?" or "Which color appears most similar to the reference?".

This library provides an easy-to-use interface for comparison-based learning algorithms. It plays hand-in-hand with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

from cblearn.datasets import make_random_triplets
from cblearn.embedding import SOE
from cblearn.metrics import QueryScorer

X = load_iris().data
triplets = make_random_triplets(X, result_format="list-order", size=1000)

estimator = SOE(n_components=2)
# Measure the fit with scikit-learn's cross-validation
scores = cross_val_score(estimator, triplets, cv=5)
print(f"The 5-fold CV triplet error is {sum(scores) / len(scores)}.")

# Estimate the scale on all triplets
embedding = estimator.fit_transform(triplets)
print(f"The embedding has shape {embedding.shape}.")

Please try the Examples.

Getting Started

Install cblearn as described here and try the examples.

Find a theoretical introduction to comparison-based learning, the datatypes, algorithms, and datasets in the User Guide.

Features

Datasets

cblearn provides utility methods to simplify the loading and conversion of your comparison datasets. In addition, some functions download and load multiple real-world comparisons.

Dataset	Query	#Object	#Response	#Triplet
Vogue Cover	Odd-out Triplet	60	1,107	2,214
Nature Scene	Odd-out Triplet	120	3,355	6,710
Car	Most-Central Triplet	60	7,097	14,194
Material	Standard Triplet	100	104,692	104,692
Food	Standard Triplet	100	190,376	190,376
Musician	Standard Triplet	413	224,792	224,792
Things Image Testset	Odd-out Triplet	1,854	146,012	292,024
ImageNet Images v0.1	Rank 2 from 8	1,000	25,273	328,549
ImageNet Images v0.2	Rank 2 from 8	50,000	384,277	5M

Embedding Algorithms

Algorithm	Default	Pytorch (GPU)	Reference Wrapper
Crowd Kernel Learning (CKL)	X	X
FORTE		X
GNMDS	X	X
Maximum-Likelihood Difference Scaling (MLDS)	X		MLDS (R)
Soft Ordinal Embedding (SOE)	X	X	loe (R)
Stochastic Triplet Embedding (STE/t-STE)	X	X

Contribute

We are happy about your bug reports, questions or suggestions as Github Issues and code or documentation contributions as Github Pull Requests. Please see our Contributor Guide.

Authors and Acknowledgement

cblearn was initiated by current and former members of the Theory of Machine Learning group of Prof. Dr. Ulrike von Luxburg at the University of Tübingen. The leading developer is David-Elias Künstle.

We want to thank all the contributors here on GitHub. This work has been supported by the Machine Learning Cluster of Excellence, funded by EXC number 2064/1 – Project number 390727645. The authors would like to thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting David-Elias Künstle.

License

This library is free under the MIT License conditions. Please cite this library appropriately if it contributes to your scientific publication. We would also appreciate a short email (optionally) to see how our library is being used.

cblearn's People

Contributors

Stargazers

Watchers

Forkers

jfhoelscher trybnetic conzel tml-tuebingen stsievert vivdaddy buzzlightyear726

cblearn's Issues

GNMDS X-Torch Implementation

Is the GNMDS implementation correct? It appears identical to SOE.
https://github.com/dekuenstle/cblearn/blob/d25dae9a6518d7b54d9b18bb58a738b2887ab2a8/cblearn/embedding/_gnmds.py#L141
https://github.com/dekuenstle/cblearn/blob/d25dae9a6518d7b54d9b18bb58a738b2887ab2a8/cblearn/embedding/_soe.py#L137

Add triplet kernel and other methods for triplets -> distances

Psychophysics how-tos

How to choose the number of trials?
How to choose the dimension?
How to simulate your experiment? (incl. noise)
How to analyse the fit and uncertainty of an embedding?
How to choose the right (embedding) algorithm?

Benchmark embedding algorithms

Rename question/answer -> query/response

Separate Continuous Integration Tests

Currently, we have one big test workflow. We should separate it to multiple workflows to keep it more maintainable, making the source of fails more obvious, and probably even speeding up.

I would suggest the following workflows:

Unittests (Core)
Unittests (wrapper: R and Octave bridges)
[probably: Unittests (GPU)]
Code Style
Documentation

This might be a nice issuefor someone interested in getting started with Github Actions + test automation.

Contributor installation instructions broken

The URL in the contributor installation instructions is broken and needs to be changed to: [email protected]:cblearn/cblearn.git

(cc openjournals/joss-reviews#6139)

Bug with estimate_dimensionality_cv

Hi,

There seems to be a bug in the embedding.estimate_dimensionality_cv function. (Using current version 706bfba)

The issue seems to be an issue in the initialization of the embedding. It happens for both SOE and TSTE.

Here is a basic working example.

import cblearn
from cblearn.embedding import SOE
from cblearn.embedding import TSTE
from cblearn.embedding import estimate_dimensionality_cv
from cblearn.datasets import make_random_triplet_indices, make_all_triplet_indices, triplet_response, noisy_triplet_response
import numpy as np

SUBSPACE_DIM = 5
SPACE_DIM = 10
EMBEDDING_DIM = 5
SEED = 42
NUM_TASKS = 25

noisy_points = np.random.randn(NUM_TASKS, SPACE_DIM)
triplets = make_all_triplet_indices(NUM_TASKS, True)
responses = triplet_response(triplets, noisy_points)
estimator = SOE(EMBEDDING_DIM, random_state=SEED)
# Even the following line doesn't work
#estimator = TSTE(EMBEDDING_DIM, random_state=SEED)
dimension = estimate_dimensionality_cv(estimator, responses, test_dimensions=list(range(10)), n_splits=10, n_repeats=1, n_jobs=1, random_state=SEED)
dimension.plot_scores()

The error I get is

Renaming master branch to main

As soon as Github rolls out an seamless way for renaming (ETA end January), we should rename as well.
See https://github.com/github/renaming for more information.

I leave this here mainly as a reminder.

More algorithm implementations

Here we keep track of some comparison-based algorithms that could be interesting to implement.

Clustering: Crowd-median algorithm (Heikinheimo & Ukkonen, 2013)
Embedding: Multi-view embedding (Amid & Ukkonen, 2015)
Embedding: Landmark approaches (e.g., Anderton & Aslam, 2019; Ghosh et al., 2019)
Embedding: Active sampling methods (e.g., based on CKL or LOE)

Align multiple embedding, bootstrap embeddings

Add utility to align and standardize multiple embeddings, e.g. with Procrustes' method.

Add utility to bootstrap queries with any OE algorithm to create aligned embedding

ordinal_embedding.ipynb doesn't work

ordinal_embedding.ipynb example doesn't work.

Predicted 2D embedding: (100, 2)
Predicted 3D embedding: (100, 3)

The estimated embedding can be evaluated from different perspectives.

The procrustes distance is a square distance between the true and the estimated embeddings, where scale, rotation and translation transformations are ignored. This is only possible, if the true embedding is known and the embeddings have the same dimensionality.
The training triplet error is the fraction of training comparisons, which do not comply with the estimated embedding.
The cross-validation triplet error indicates the fraction of unknown triplets which do not comply with the estimated embedding. Note, that 5-fold cross validation requires refitting the model 5 times.
Procrustes distance: 0.00322 in 3d
Training triplet error: 0.126 in 2d vs 0.000 in 3d.
---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/joblib/parallel.py:1423, in Parallel.dispatch_one_batch(self, iterator)
   1422 try:
-> 1423     tasks = self._ready_batches.get(block=False)
   1424 except queue.Empty:
   1425     # slice the iterator n_jobs * batchsize items at a time. If the
   1426     # slice returns less than that, then the current batchsize puts
   (...)
   1429     # accordingly to distribute evenly the last items between all
   1430     # workers.

File ~/anaconda3/envs/cblearn/lib/python3.9/queue.py:168, in Queue.get(self, block, timeout)
    167     if not self._qsize():
--> 168         raise Empty
    169 elif timeout is None:

Empty: 

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[6], line 12
      9 error_3d = 1 - transformer_3d.score(triplets)
     10 print(f"Training triplet error: {error_2d:.3f} in 2d vs {error_3d:.3f} in 3d.")
---> 12 cv_error_2d = 1 - cross_val_score(transformer_3d, triplets, cv=5, n_jobs=-1).mean()
     13 cv_error_3d = 1 - cross_val_score(transformer_3d, triplets, cv=5, n_jobs=-1).mean()
     14 print(f"CV triplet error: {cv_error_2d:.3f} in 2d vs {cv_error_3d:.3f} in 3d.")

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:562, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    559 # To ensure multimetric format is not supported
    560 scorer = check_scoring(estimator, scoring=scoring)
--> 562 cv_results = cross_validate(
    563     estimator=estimator,
    564     X=X,
    565     y=y,
    566     groups=groups,
    567     scoring={"score": scorer},
    568     cv=cv,
    569     n_jobs=n_jobs,
    570     verbose=verbose,
    571     fit_params=fit_params,
    572     pre_dispatch=pre_dispatch,
    573     error_score=error_score,
    574 )
    575 return cv_results["test_score"]

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/utils/_param_validation.py:214, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:309, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, return_indices, error_score)
    306 # We clone the estimator to make sure that all the folds are
    307 # independent, and that it is pickle-able.
    308 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 309 results = parallel(
    310     delayed(_fit_and_score)(
    311         clone(estimator),
    312         X,
    313         y,
    314         scorers,
    315         train,
    316         test,
    317         verbose,
    318         None,
    319         fit_params,
    320         return_train_score=return_train_score,
    321         return_times=True,
    322         return_estimator=return_estimator,
    323         error_score=error_score,
    324     )
    325     for train, test in indices
    326 )
    328 _warn_or_raise_about_fit_failures(results, error_score)
    330 # For callable scoring, the return type is only know after calling. If the
    331 # return type is a dictionary, the error scores can now be inserted with
    332 # the correct key.

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/utils/parallel.py:65, in Parallel.__call__(self, iterable)
     60 config = get_config()
     61 iterable_with_config = (
     62     (_with_config(delayed_func, config), args, kwargs)
     63     for delayed_func, args, kwargs in iterable
     64 )
---> 65 return super().__call__(iterable_with_config)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/joblib/parallel.py:1950, in Parallel.__call__(self, iterable)
   1944 self._call_ref = weakref.ref(output)
   1946 # The first item from the output is blank, but it makes the interpreter
   1947 # progress until it enters the Try/Except block of the generator and
   1948 # reach the first `yield` statement. This starts the aynchronous
   1949 # dispatch of the tasks to the workers.
-> 1950 next(output)
   1952 return output if self.return_generator else list(output)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/joblib/parallel.py:1588, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1586 detach_generator_exit = False
   1587 try:
-> 1588     self._start(iterator, pre_dispatch)
   1589     # first yield returns None, for internal use only. This ensures
   1590     # that we enter the try/except block and start dispatching the
   1591     # tasks.
   1592     yield

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/joblib/parallel.py:1571, in Parallel._start(self, iterator, pre_dispatch)
   1562 def _start(self, iterator, pre_dispatch):
   1563     # Only set self._iterating to True if at least a batch
   1564     # was dispatched. In particular this covers the edge
   (...)
   1568     # was very quick and its callback already dispatched all the
   1569     # remaining jobs.
   1570     self._iterating = False
-> 1571     if self.dispatch_one_batch(iterator):
   1572         self._iterating = self._original_iterator is not None
   1574     while self.dispatch_one_batch(iterator):

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/joblib/parallel.py:1434, in Parallel.dispatch_one_batch(self, iterator)
   1431 n_jobs = self._cached_effective_n_jobs
   1432 big_batch_size = batch_size * n_jobs
-> 1434 islice = list(itertools.islice(iterator, big_batch_size))
   1435 if len(islice) == 0:
   1436     return False

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/utils/parallel.py:61, in <genexpr>(.0)
     56 # Capture the thread-local scikit-learn configuration at the time
     57 # Parallel.__call__ is issued since the tasks can be dispatched
     58 # in a different thread depending on the backend and on the value of
     59 # pre_dispatch and n_jobs.
     60 config = get_config()
---> 61 iterable_with_config = (
     62     (_with_config(delayed_func, config), args, kwargs)
     63     for delayed_func, args, kwargs in iterable
     64 )
     65 return super().__call__(iterable_with_config)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:311, in <genexpr>(.0)
    306 # We clone the estimator to make sure that all the folds are
    307 # independent, and that it is pickle-able.
    308 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    309 results = parallel(
    310     delayed(_fit_and_score)(
--> 311         clone(estimator),
    312         X,
    313         y,
    314         scorers,
    315         train,
    316         test,
    317         verbose,
    318         None,
    319         fit_params,
    320         return_train_score=return_train_score,
    321         return_times=True,
    322         return_estimator=return_estimator,
    323         error_score=error_score,
    324     )
    325     for train, test in indices
    326 )
    328 _warn_or_raise_about_fit_failures(results, error_score)
    330 # For callable scoring, the return type is only know after calling. If the
    331 # return type is a dictionary, the error scores can now be inserted with
    332 # the correct key.

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/base.py:75, in clone(estimator, safe)
     41 """Construct a new unfitted estimator with the same parameters.
     42 
     43 Clone does a deep copy of the model in an estimator
   (...)
     72 found in :ref:`randomness`.
     73 """
     74 if hasattr(estimator, "__sklearn_clone__") and not inspect.isclass(estimator):
---> 75     return estimator.__sklearn_clone__()
     76 return _clone_parametrized(estimator, safe=safe)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/base.py:268, in BaseEstimator.__sklearn_clone__(self)
    267 def __sklearn_clone__(self):
--> 268     return _clone_parametrized(self)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/base.py:106, in _clone_parametrized(estimator, safe)
     98             raise TypeError(
     99                 "Cannot clone object '%s' (type %s): "
    100                 "it does not seem to be a scikit-learn "
    101                 "estimator as it does not implement a "
    102                 "'get_params' method." % (repr(estimator), type(estimator))
    103             )
    105 klass = estimator.__class__
--> 106 new_object_params = estimator.get_params(deep=False)
    107 for name, param in new_object_params.items():
    108     new_object_params[name] = clone(param, safe=False)

File ~/anaconda3/envs/cblearn/lib/python3.9/site-packages/sklearn/base.py:195, in BaseEstimator.get_params(self, deep)
    193 out = dict()
    194 for key in self._get_param_names():
--> 195     value = getattr(self, key)
    196     if deep and hasattr(value, "get_params") and not isinstance(value, type):
    197         deep_items = value.get_params().items()

AttributeError: 'SOE' object has no attribute 'n_init'

Used conda env mentioned in the documentation, so it uses python3.9.

(cc openjournals/joss-reviews#6139)

PyTorch usage not clear

What's the reasoning behind "Note that pytorch might need about 1GB of disk space." on the user guide. When does that happen?

Is it possible to use PyTorch without GPUs? Is it possible to unset CUDA_VISIBLE_DEVICES and set backend="pytorch"?

K-NN Triplet Sampling

PyPI Deploy Action

Github action to deploy releases to PyPI, based on Github release tags.

Workflow:

Mark a release on github with the appropriate semver version tag
Build distribution with the version from tag
Push distribution to PyPI

Todo:

Version from tag
Documentation

Bug: Init generator fails, when printing result message

https://github.com/dekuenstle/cblearn/blob/c72907374f7bdd1b5fb2c19b42b9226b48e9d91b/cblearn/embedding/_soe.py#L129

  File "/mnt/qb/work/wichmann/dkuenstle56/conda/envs/cblearn/lib/python3.10/site-packages/cblearn/embedding/_soe.py", line 162, in fit
    f"{'Retry with another initialization...' if init != inits[-1] else ''}")
TypeError: 'generator' object is not subscriptable

Object-oriented query API - more flexible and maintainable

Comparison queries can ask different questions:

triplet: d(A, B) < d(A, C)
quadrupled: d(A, B) < d(C, D)
odd-one-out: d(B, C) < d(A, B) & d(A, C)
most-central: d(A, B) & d(A, C) < d(B, C)
choose-n-similar: d(A, B1) & d(A, B2) ... & d(A, Bn) < d(A, C1) ... & d(A, Cm)
rank-n-similar: d(A, B1) < d(A, B2) ... < d(A, Bn) < d(A, C1) ... & d(A, Cm)

Notice their relations:

triplet is a special form of quadrupled with A or B = C or D
triplet is a special form of n-similar with n=1, m=1
odd-one-out, choose-n-similar, and rank-n-similar can be represented by multiple triplet

These queries can be represented in different formats that have their own advantages:

list of queries with a column per object. The order of the entries indicates the response
list of queries and a list of responses.
sparse tensor

Even responses, if not implicit in the order of query items, can come in different formats:

True, False of the inequality
-1, 1, 0 -> false, true, undecided of the inequality
index/indices of the selected item
one hot encoding of the selected item

Currently, we assume a query is a triplet and provide preprocessing functions to convert other queries to triplets. These triplets are stores as plain arrays (plus response array) or sparse tensors.
We infer the format, then and convert it in a single utility function. This is neither easy to extend nor to maintain.

I would like to switch to an object oriented API instead. We provide classes for the different questions that provide conversion methods. These classes store data as a list of queries, but can be build from and export to multiple data formats.
Still provide a function to infer the query if possible, so that all functions accept either the Query subclass or the "raw" data formats.

class Query:
    ...
    def to_X():
        ....
    def to_X_y():
        # ask for response format as parameter
        ....
    def to_sparse():
        # ask for response format as parameter
        ....

class Triplet(Query):
    ...
    def to_quadruplet():
         return Quadruplet(self.data[:, [0, 1, 0, 2]])
    
class Quadruplet(Query):
    def to_triplet():
        # check if the rows are actually triplets (A or B = C or D)
        # otherwise throw error
        return Triplet(self.data[:, [col indices...]])

...

Re-design query-response sampling

Separate the index sampling and responses to avoid the argument chains that we currently have (and which would get worse):

index generation (e.g. uniform, k-NN, radius, custom; from embedding or distance matrix?; all, random or active?)
response model (direct, different noise models, lazy/active?)
Object oriented for low-level API (20% use cases), but functions for high-level (80% use cases)?

Add utilities for common combinations of the above (similar to what we already have)

Prepare active and quadrupled queries.

Move query generation to utils?

Docs are missing for some functions

The docs for QueryScorer and query_error are empty. What do those functions do?

Additional similarity datasets

Here is a huge collection of similarity datasets: https://osf.io/ey9vp/wiki/home/

Bug: dataset descr are missing after install

The descr folder is not part of the package assets. We have to add it explicitly.

triplet_formats.py example does not work

I failed to run the triplet_format example. Here is the stacktrace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 62
     58 data = [triplets_ordered, (triplets_boolean, answers_boolean),
     59         (triplets_numeric, answers_numeric), triplet_spmatrix]
     60 formats = ["list-order", "list-boolean", "list-count", "tensor-count"]
---> 62 timings = [
     63     (time_convert_triplet(triplets, to_format),
     64      f"{from_format}->{to_format}")
     65     for from_format, triplets in zip(formats, data)
     66     for to_format in formats
     67 ]
     69 for seconds, desc in sorted(timings):
     70     print(f"{seconds * 1000:.2f}ms {desc}")

Cell In[1], line 63, in <listcomp>(.0)
     58 data = [triplets_ordered, (triplets_boolean, answers_boolean),
     59         (triplets_numeric, answers_numeric), triplet_spmatrix]
     60 formats = ["list-order", "list-boolean", "list-count", "tensor-count"]
     62 timings = [
---> 63     (time_convert_triplet(triplets, to_format),
     64      f"{from_format}->{to_format}")
     65     for from_format, triplets in zip(formats, data)
     66     for to_format in formats
     67 ]
     69 for seconds, desc in sorted(timings):
     70     print(f"{seconds * 1000:.2f}ms {desc}")

Cell In[1], line 54, in time_convert_triplet(triplets, to_format)
     52 def time_convert_triplet(triplets, to_format):
     53     time_start = time.process_time()
---> 54     check_query_response(triplets, result_format=to_format)
     55     return (time.process_time() - time_start)

File ~/test/cblearn/cblearn/utils/_validate_data.py:292, in check_query_response(query, response, result_format, standard)
    290     query = check_tensor_query_response(query, (QueryFormat.TENSOR, input_response_format), standard=False)
    291     query, response = query.coords.T, query.data
--> 292 return check_list_query_response(query, response, (output_query_format, output_response_format),
    293                                  standard=standard)

File ~/test/cblearn/cblearn/utils/_validate_data.py:135, in check_list_query_response(query, response, result_format, standard)
    132     raise ValueError(f"Expects result_format list-..., got {result_format}.")
    134 if response_format is ResponseFormat.ORDER:
--> 135     return check_order_list_query_response(query, response)
    136 elif response_format is ResponseFormat.BOOLEAN:
    137     return check_bool_list_query_response(query, response, standard=standard)

File ~/test/cblearn/cblearn/utils/_validate_data.py:96, in check_order_list_query_response(query, response)
     95 def check_order_list_query_response(query, response):
---> 96     query, response = _check_list_query_response(query, response)
     97     __, input_response_format = data_format(query, response)
     99     if input_response_format is ResponseFormat.COUNT:

File ~/test/cblearn/cblearn/utils/_validate_data.py:15, in _check_list_query_response(query, response)
     13 def _check_list_query_response(query, response):
     14     if response is None:
---> 15         return check_array(query, dtype=np.uint32), None
     16     else:
     17         return check_X_y(query, response, dtype=np.uint32)

File ~/test/cblearn/venv/lib/python3.10/site-packages/sklearn/utils/validation.py:904, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    899 warnings.simplefilter("error", ComplexWarning)
    900 if dtype is not None and xp.isdtype(dtype, "integral"):
    901     # Conversion float -> int should not contain NaN or
    902     # inf (numpy#14412). We cannot use casting='safe' because
    903     # then conversion float -> int would be disallowed.
--> 904     array = _asarray_with_order(array, order=order, xp=xp)
    905     if xp.isdtype(array.dtype, ("real floating", "complex floating")):
    906         _assert_all_finite(
    907             array,
    908             allow_nan=False,
   (...)
    911             input_name=input_name,
    912         )

File ~/test/cblearn/venv/lib/python3.10/site-packages/sklearn/utils/_array_api.py:380, in _asarray_with_order(array, dtype, order, copy, xp)
    378     array = numpy.array(array, order=order, dtype=dtype)
    379 else:
--> 380     array = numpy.asarray(array, order=order, dtype=dtype)
    382 # At this point array is a NumPy ndarray. We convert it to an array
    383 # container that is consistent with the input's namespace.
    384 return xp.asarray(array)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 1000000) + inhomogeneous part.

Run with Ubuntu 22.10, Python 3.10, based on latest version from repo installed following contributor instructions.

(cc openjournals/joss-reviews#6139)

Musician Dataset: Fix triplet index mapping to artists

Largest index is 413, but only 413 artists are available (off-by-1-error)

Separate tests with optional dependencies

Running pytest should not fail, when the optional dependencies aren'T installed (e.g. torch, r_wrapper). Thus, skip by default the relevant tests and include them just in the CI configuration.

Consistent naming of dataset methods

Both the dataset functions themself (...matrix, ...triplets, ...distances, ...similarities?) , but also the result object's arguments (e.g. triplet or data? singular or plural?)

Memory efficient triplet sampling

The current (random) triplet sampling requires a lot of memory if a reasonable number of objects is used.

estimate_dimensionality_cv not in latest pypi version.

Hi,

The current cblearn version (0.1.2) from pypi does not seem to contain embedding.estimate_dimensionality_cv. Does the next release plan to? Or is this a bug?

I got it to work by copying the files over from the github repo, but it would be nice for it to work right off the bat.

Simplify workflows

The current CI workflow is rather complicated, because not just tests cblearn but also cblearn's R bridges.
This slows down the workflow and makes it more unstable (e.g. this issue).
We should move the bridge-tests to a separate workflow.

r_wrapper installation instructions missing

The ordinal embedding example cannot be executed without the r_wrapper. The installation for this is not documented and does not work out-of-the-box on a clean Ubuntu 22.04.

In general, there should be transparency for which features the r-wrapper is required, as well as further information what needs to be installed prior to calling pip install cblearn[r_wrapper].

(cc openjournals/joss-reviews#6139)

More triplet datasets

Eidolon dataset (unpublished, Siavash)

Triplets in the following datasets indicate, that one object is dissimilar to both others (outlier, odd-one-out task).
E.g. "our" triplets (i, j, k) (j, i, k) is identical to one triplet (k | i, j) in these datasets.

human judgements of natural objects: Paper.
Nature and Vogue datasets: Ukkonen et al. (2015)
Judgement and embeddings on ImageNet: Roads and Love (2020)
Material triplet judgements: Queries adaptivly selected using CKL, responses via MTurk

Triplets in the following dataset indicate, which item is the most central.

Car dataset

Configurable distance metrics

Where ever we use distance metrics (usually Euclidean), make this configurable.

Default euclidean. Allow string or function argument and additional parameters.+
Minimum: Minkowski distance, at best every metric of pdist (in sklearn or pytorch).

Evaluate, when this works well and when not.

Clarify when response_map is required for query_from_columns.

If a response_map argument is required whenever 'response_columns' are given, this should be made explicit in the documentation and with an appropriate error message. As it is, one could think that one could omit response_map if responses are already in an appropriate format, resulting in an AttributeError:

 cblearn.preprocessing.query_from_columns([[4, 5, 6, -1]], [0, 1, 2], 3)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_4171/2593374199.py in 
----> 1 cblearn.preprocessing.query_from_columns([[4, 5, 6, -1]], [0, 1, 2], 3)

~/thesis/cblearn/cblearn/preprocessing/_label.py in query_from_columns(data, query_columns, response_columns, response_map, return_transformer)
    173 
    174     if response_columns:
--> 175         inverse_map = {v: k for k, v in response_map.items()}
    176         response_enc = FunctionTransformer(func=np.vectorize(response_map.get),
    177                                            inverse_func=np.vectorize(inverse_map.get), check_inverse=False)

AttributeError: 'NoneType' object has no attribute 'items'

Embedding .score is broken for tensor-count and list-count input

T, T_test = ...  # triplets
assert 1 > SOE(1).fit(T).score(check_triplet_answers(T_test, result_format='list-order'))
assert 1 > SOE(1).fit(T).score(*check_triplet_answers(T_test, result_format='list-count'))
assert 1 > SOE(1).fit(T).score(*check_triplet_answers(T_test, result_format='list-boolean'))
# but:
assert 1.0 == SOE(1).fit(T).score(check_triplet_answers(T_test, result_format='tensor-count'))

Bug in data ~~check_triplet_answers~~, .predict, or triplet_error

A related problem is, that score of list-count is approx 0.5 * list-boolean

Add more performance metrics

Common metrics in the field are:

k-NN accuracy
(root) Mean Squared Error of embedding- but align with Procrustes
Procrustes disparity
correlation / MSE / ... of distance triangle

SOE is a lot slower than previously

While experimenting, I noticed unusually slow behaviour of SOE, which I didn't notice previously.
After some experimenting, it seems to be the changes made in commit 82fb66c, which introduced some changes to the SOE loss and initialisation mechanisms.

I have used the following code to test my hypothesis:

from cblearn.embedding import SOE
from cblearn.datasets import make_random_triplets
import numpy as np
import time

np.random.seed(42)
x = np.random.random((100, 2))
t, r = make_random_triplets(x, "list-boolean", 2000)
soe = SOE(n_components=2, random_state=2)

start = time.time()
soe.fit_transform(t, r)
print(f"{time.time() - start:.2f} seconds to fit 2000 triplets with SOE.")

This runs at around 4 seconds on the current main (cdbacb3). If commit 82fb66c is reverted (or we can checkout to the commit immediately before it), the runtime drops to 0.2 seconds. This is especially noticeable with larger amounts of data (the slowdown might scale with the number of triplets?)

Tested on an 2020 M1 Macbook Air. cblearn was used with the CPU version, not the Pytorch one.

Validate uniqueness of comparison entries

In the new queries api, we have to add explicit checks for unique items per row.
Eg. add here : https://github.com/cblearn/cblearn/blob/queries/cblearn/core/check_comparison.py

Such a check for uniqueness can be found in the triplet_response function.

Catch invalid dimension and raise error message

An easy mistake in the ordinal embedding estimators and in the estimate_dimensionality_cv function is to add dimensions (components) < 1 (e.g., #71). This results in crude shape errors. Instead, we should catch these inputs and raise a custom ValueError to help the user.


`/home/johannes/miniconda3/envs/thesis/lib/python3.9/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return array(a, dtype, copy=False, order=order)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5640/3369351507.py in 
      9 formats = ["list-order", "list-boolean", "list-count", "tensor-count"]
     10 
---> 11 timings = [
     12     (time_convert_triplet(triplets, to_format),
     13      f"{from_format}->{to_format}")

/tmp/ipykernel_5640/3369351507.py in (.0)
     10 
     11 timings = [
---> 12     (time_convert_triplet(triplets, to_format),
     13      f"{from_format}->{to_format}")
     14     for from_format, triplets in zip(formats, data)

/tmp/ipykernel_5640/3369351507.py in time_convert_triplet(triplets, to_format)
      1 def time_convert_triplet(triplets, to_format):
      2     time_start = time.process_time()
----> 3     check_query_response(triplets, result_format=to_format)
      4     return (time.process_time() - time_start)
      5 

~/thesis/cblearn/cblearn/utils/_validate_data.py in check_query_response(query, response, result_format, standard)
    290             query = check_tensor_query_response(query, (QueryFormat.TENSOR, input_response_format), standard=False)
    291             query, response = query.coords.T, query.data
--> 292         return check_list_query_response(query, response, (output_query_format, output_response_format),
    293                                          standard=standard)

~/thesis/cblearn/cblearn/utils/_validate_data.py in check_list_query_response(query, response, result_format, standard)
    133 
    134     if response_format is ResponseFormat.ORDER:
--> 135         return check_order_list_query_response(query, response)
    136     elif response_format is ResponseFormat.BOOLEAN:
    137         return check_bool_list_query_response(query, response, standard=standard)

~/thesis/cblearn/cblearn/utils/_validate_data.py in check_order_list_query_response(query, response)
     94 
     95 def check_order_list_query_response(query, response):
---> 96     query, response = _check_list_query_response(query, response)
     97     __, input_response_format = data_format(query, response)
     98 

~/thesis/cblearn/cblearn/utils/_validate_data.py in _check_list_query_response(query, response)
     13 def _check_list_query_response(query, response):
     14     if response is None:
---> 15         return check_array(query, dtype=np.uint32), None
     16     else:
     17         return check_X_y(query, response, dtype=np.uint32)

~/miniconda3/envs/thesis/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/miniconda3/envs/thesis/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    665                     # inf (numpy#14412). We cannot use casting='safe' because
    666                     # then conversion float -> int would be disallowed.
--> 667                     array = np.asarray(array, order=order)
    668                     if array.dtype.kind == 'f':
    669                         _assert_all_finite(array, allow_nan=False,

~/miniconda3/envs/thesis/lib/python3.9/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
    100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
    101 
--> 102     return array(a, dtype, copy=False, order=order)
    103 
    104 

ValueError: could not broadcast input array from shape (1000000,3) into shape (1000000,)

cblearn / cblearn Goto Github PK

cblearn's Introduction

cblearn

Comparison-based Machine Learning in Python

Getting Started

Features

Datasets

Embedding Algorithms

Contribute

Authors and Acknowledgement

License

cblearn's People

Contributors

Stargazers

Watchers

Forkers

cblearn's Issues

Recommend Projects

Recommend Topics

Recommend Org