
copairs's Issues

Allow absolute cosine similarity

Sometimes both strong positive and strong negative correlations are meaningful, so we should allow for that.

One way to do this is to add a boolean absolute_value argument to pairwise_cosine. The current implementation:

def pairwise_cosine(x_sample: np.ndarray, y_sample: np.ndarray) -> np.ndarray:
    x_norm = x_sample / np.linalg.norm(x_sample, axis=1)[:, np.newaxis]
    y_norm = y_sample / np.linalg.norm(y_sample, axis=1)[:, np.newaxis]
    c_sim = np.sum(x_norm * y_norm, axis=1)
    return c_sim

The proposed change:

def pairwise_cosine(x_sample: np.ndarray, y_sample: np.ndarray, absolute_value: bool = False) -> np.ndarray:
    x_norm = x_sample / np.linalg.norm(x_sample, axis=1)[:, np.newaxis]
    y_norm = y_sample / np.linalg.norm(y_sample, axis=1)[:, np.newaxis]
    c_sim = np.sum(x_norm * y_norm, axis=1)
    if absolute_value:
        c_sim = np.abs(c_sim)
    return c_sim
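
For illustration, a quick check of the proposed flag (using the function above; the values are made up):

import numpy as np

# With absolute_value=True, strongly anti-correlated profiles score the same
# as strongly correlated ones.
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[-1.0, -2.0, -3.0]])
print(pairwise_cosine(x, y))                       # [-1.]
print(pairwise_cosine(x, y, absolute_value=True))  # [1.]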

Adding `pd.query()` style selections raises index error

Description

This issue concerns implementing pd.query()-based selection for the run_pipeline() function. I encountered this challenge while working on a notebook, and I've documented the problem in a corresponding PR for visibility and collaboration.

To reproduce the error, I've created a small dataset and a standalone notebook. You can find the test data here, which I used in the referenced notebook; the reproduce_error.ipynb notebook provides the code to recreate the issue.

The error that I receive is this:

KeyError                                  Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:115, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    105 def run_pipeline(meta,
    106                  feats,
    107                  pos_sameby,
   (...)
    112                  batch_size=20000,
    113                  seed=0) -> pd.DataFrame:
    114     columns = flatten_str_list(pos_sameby, pos_diffby, neg_sameby, neg_diffby)
--> 115     validate_pipeline_input(meta, feats, columns)
    117     # Critical!, otherwise the indexing wont work
    118     meta = meta.reset_index(drop=True).copy()

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:99, in validate_pipeline_input(meta, feats, columns)
     98 def validate_pipeline_input(meta, feats, columns):
---> 99     if meta[columns].isna().any(axis=None):
    100         raise ValueError('metadata columns should not have null values.')
    101     if len(meta) != len(feats):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:3899, in DataFrame.__getitem__(self, key)
   3897     if is_iterator(key):
   3898         key = list(key)
-> 3899     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3901 # take() does not accept boolean indexers
   3902 if getattr(indexer, "dtype", None) == bool:

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
   6111 else:
   6112     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
   6116 keyarr = self.take(indexer)
   6117 if isinstance(key, Index):
   6118     # GH 42790 - Preserve name from an Index

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6178, in Index._raise_if_missing(self, key, indexer, axis_name)
   6175     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6178 raise KeyError(f"{not_found} not in index")

KeyError: "['Metadata_is_control == 0'] not in index"

The root cause traces to the validate_pipeline_input() function, which does not recognize pd.query()-style selections. Attempting to bypass the issue by commenting out the validation leads to a subsequent problem, shown below.

ValueError                                Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:120, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    118 meta = meta.reset_index(drop=True).copy()
    119 logger.info('Indexing metadata...')
--> 120 matcher = create_matcher(meta, pos_sameby, pos_diffby, neg_sameby,
    121                          neg_diffby)
    123 logger.info('Finding positive pairs...')
    124 pos_pairs = matcher.get_all_pairs(sameby=pos_sameby, diffby=pos_diffby)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:61, in create_matcher(obs, pos_sameby, pos_diffby, neg_sameby, neg_diffby, multilabel_col)
     59 if multilabel_col:
     60     return MatcherMultilabel(obs, columns, multilabel_col, seed=0)
---> 61 return Matcher(obs, columns, seed=0)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:77, in Matcher.__init__(self, dframe, columns, seed, max_size)
     73         elems = rng.choice(elems, max_size)
     74     return elems
     76 mappers = [
---> 77     reverse_index(dframe[col]).apply(clip_list) for col in dframe
     78 ]
     80 # Create a column order based on the number of potential row matches
     81 # Useful to solve queries with more than one sameby
     82 n_pairs = {}

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:22, in reverse_index(col)
     20 def reverse_index(col: pd.Series) -> pd.Series:
     21     '''Build a reverse_index for a given column in the DataFrame'''
---> 22     return pd.Series(col.groupby(col).indices, name=col.name)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:8869, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
   8866 if level is None and by is None:
   8867     raise TypeError("You have to supply one of 'by' and 'level'")
-> 8869 return DataFrameGroupBy(
   8870     obj=self,
   8871     keys=by,
   8872     axis=axis,
   8873     level=level,
   8874     as_index=as_index,
   8875     sort=sort,
   8876     group_keys=group_keys,
   8877     observed=observed,
   8878     dropna=dropna,
   8879 )

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1278, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
   1275 self.dropna = dropna
   1277 if grouper is None:
-> 1278     grouper, exclusions, obj = get_grouper(
   1279         obj,
   1280         keys,
   1281         axis=axis,
   1282         level=level,
   1283         sort=sort,
   1284         observed=False if observed is lib.no_default else observed,
   1285         dropna=self.dropna,
   1286     )
   1288 if observed is lib.no_default:
   1289     if any(ping._passed_categorical for ping in grouper.groupings):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1020, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
   1015         in_axis = False
   1017     # create the Grouping
   1018     # allow us to passing the actual Grouping as the gpr
   1019     ping = (
-> 1020         Grouping(
   1021             group_axis,
   1022             gpr,
   1023             obj=obj,
   1024             level=level,
   1025             sort=sort,
   1026             observed=observed,
   1027             in_axis=in_axis,
   1028             dropna=dropna,
   1029         )
   1030         if not isinstance(gpr, Grouping)
   1031         else gpr
   1032     )
   1034     groupings.append(ping)
   1036 if len(groupings) == 0 and len(obj):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:601, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna, uniques)
    599 if getattr(grouping_vector, "ndim", 1) != 1:
    600     t = str(type(grouping_vector))
--> 601     raise ValueError(f"Grouper for '{t}' not 1-dimensional")
    603 grouping_vector = index.map(grouping_vector)
    605 if not (
    606     hasattr(grouping_vector, "__len__")
    607     and len(grouping_vector) == len(index)
    608 ):

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

It appears that bypassing the validation steps results in a failure to construct the Grouper class.
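
One possible direction, sketched below with a hypothetical helper (not the copairs API), is for the validation to distinguish plain column names from query-style expressions and apply the latter as filters instead of indexing by them:

import re

# Hypothetical helper: treat entries containing comparison operators as
# pd.query()-style conditions rather than column names.
QUERY_CHARS = re.compile(r"[=<>!~]")

def split_columns(columns):
    """Separate plain column names from query-style expressions."""
    plain = [c for c in columns if not QUERY_CHARS.search(c)]
    queries = [c for c in columns if QUERY_CHARS.search(c)]
    return plain, queries

# Sketch of how validation could then proceed:
# plain, queries = split_columns(columns)
# if meta[plain].isna().any(axis=None):
#     raise ValueError("metadata columns should not have null values.")
# for q in queries:
#     meta = meta.query(q)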

Using the repo

If you would like to explore and test the issue, please feel free to use the dedicated repository I've set up. Here are the steps to get started:

git clone https://github.com/WayScience/Mitocheck-MAP-analysis.git && cd Mitocheck-MAP-analysis
conda env create -f map_env.yaml
conda activate map

These commands will clone the repository, set up the required conda environment using the provided map_env.yaml file, and activate the environment, respectively.

A profile's average precision is 1 if it contains any NaNs

The problem

All results come from this notebook

If a profile contains even a single NaN, every pairwise similarity will also be NaN. Here is an example:
[screenshot omitted]

Here is an example where the profile that we are computing AP for has no NaN, but some of its pairs do:
[screenshot omitted]

When all pairs (pos and neg) have NaN similarities, copairs returns an AP of 1. This results in any profile with even a single NaN getting an AP of 1 (see ZMYND10_Arg340Gln and UBQLN2_Pro506Thr rows):
[screenshot omitted]

What to do?

As @alxndrkalinin pointed out, we probably don't want to enforce any NaN-handling strategy inside the pairwise similarity calculations, because this can get complex.

But maybe this also shouldn't happen without any warning. Users could easily have a small number of NaNs in their data and never notice, because the mAP values would look normal (between 0 and 1) but would be biased upwards whenever some of the individual AP values are 1.

I suggest we add one simple check for NaNs in the feats input in the validate_pipeline_input function, and give users a warning describing this behavior so they know to resolve the NaNs. Another (more complicated) solution would be to change the tie strategy when computing average precision from the ranked list, such that the average precision is NaN instead of 1 when everything is NaN and therefore tied.
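
A minimal sketch of the suggested check (hypothetical helper and warning text, not the current copairs code):

import warnings

import numpy as np

def check_feats_for_nans(feats: np.ndarray) -> None:
    # A single NaN feature propagates through all pairwise similarities for
    # that profile and currently yields an AP of 1.
    if np.isnan(feats).any():
        warnings.warn(
            "feats contains NaN values; profiles with NaNs produce NaN "
            "similarities, which currently yield an AP of 1. Resolve NaNs "
            "before running the pipeline."
        )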

compute.compute_ap_contiguos returns NaNs silently

When using map.run_pipeline(), if some samples are present only once in the dataset, compute.ap_contiguos silently returns NaNs. This is due to compute.to_cutoffs being applied to an array of length 1, so the line

cutoffs[0], cutoffs[1:] = 0, counts.cumsum()[:-1]

always returns [0].

These zeros propagate to num_pos:

num_pos = np.add.reduceat(rel_k_list, cutoffs)

and things are then divided by zero, since this vector is used as a divisor:

ap_scores = np.add.reduceat(pr_k * rel_k_list, cutoffs) / num_pos
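
The failure mode can be reproduced in isolation (toy values, assuming a sample that appears once and has no positive matches):

import numpy as np

rel_k_list = np.zeros(1)        # no relevant pairs for this sample
cutoffs = np.array([0])         # to_cutoffs applied to a length-1 array
num_pos = np.add.reduceat(rel_k_list, cutoffs)   # array([0.])
pr_k = np.ones(1)
with np.errstate(invalid="ignore"):
    ap_scores = np.add.reduceat(pr_k * rel_k_list, cutoffs) / num_pos
print(ap_scores)                # [nan]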

It was a bit puzzling to figure out why my test with a small number of plates was plagued with NaNs, so it is probably worth adding a warning when one element (or perhaps a fraction of elements, e.g., 1/4) appears only once.

Rename groupby

Rename groupby to sameby or another name to avoid confusion with pandas groupby.

[bug] AP of 1.0 should have min p-value

When the calculated AP is 1.0, the corresponding p-value is too large.

For example, for 2 positive profiles, 16 controls, null size 1000, and AP = 1, the p-value is 0.061938. This is incorrect: the proportion of the null distribution to the right of 1.0 should be 0, so the p-value should be p = (num + 1) / (null_size + 1) = 1/1001 ≈ 0.000999.
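
As a toy illustration (not copairs internals) of the expected behavior:

import numpy as np

# Nothing in the null distribution can exceed an AP of 1.0, so the count of
# null values at or above the observed AP should be 0 and the p-value should
# take its minimum possible value.
null_size = 1000
null_aps = np.linspace(0.0, 0.9, null_size)  # toy null distribution
observed_ap = 1.0
num = int((null_aps >= observed_ap).sum())   # 0
p_value = (num + 1) / (null_size + 1)        # 1/1001 ≈ 0.000999
print(p_value)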

accepting pandas `query` conditions in sameby/diffby

We discussed earlier that sometimes filtering needs to be done as part of creating pairs, and this could be achieved by allowing sameby/diffby to accept not just column names (e.g., A) but also pandas.DataFrame.query syntax (e.g., A == 1).

I considered two possible options for implementing it:

  1. Pass just column names to the Matcher constructor and then filter the data based on the conditions at pair-creation time, inside get_all_pairs. This approach was motivated by the idea of building the generic reverse index once and reusing it for any subset of the data under different filters (i.e., call Matcher once, then call get_all_pairs with different filters). The issue with this option is that the index has already been created by the time the condition filters the data, so after filtering the index contains (a) incorrect keys, i.e., filtered column values that have been dropped, and (b) incorrect row numbers for other columns, corresponding to dropped rows. Correctly subsetting the original index turned out not to be easy, and it's not clear whether in most cases the index would actually have to be rebuilt from scratch. That's why I switched to the second option.

  2. The alternative approach applies the filter at Matcher initialization, so the index is built on pre-filtered data. This avoids requesting invalid rows and the need to rebuild the index, but it limits get_all_pairs calls to the filtered subset of the original data, and a trivial implementation returns pair indices that do not match the original dataframe. To address that, I proposed storing the original index in the Matcher state and using it to convert pair indices back to the original range (see the sketch below).
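
A minimal sketch of option 2, with hypothetical names (not the copairs API):

from typing import Optional

import pandas as pd

class FilteredMatcher:
    def __init__(self, dframe: pd.DataFrame, query: Optional[str] = None):
        # Filter before building any index, and remember the original row
        # labels so pairs can be translated back.
        if query is not None:
            dframe = dframe.query(query)
        self.original_index = dframe.index.to_numpy()
        self.dframe = dframe.reset_index(drop=True)

    def to_original(self, pairs):
        """Map (i, j) pairs over the filtered frame to original row labels."""
        return [(self.original_index[i], self.original_index[j]) for i, j in pairs]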

Track potential discrepancies between `matric` and `copairs`

@alxndrkalinin noticed discrepancies between matric and evalzoo.

The edges and similarities are identical, but the metrics differ.

We suspect there's a bug in matric in some edge cases.

We think this is a bug in matric and not copairs because something looks funky in the indices of the metrics_level_1_0 files – they don't match the indices of the collated_sim file.

Here's how we'd test it. The example below does not show any discrepancies, but some examples that @alxndrkalinin had did show discrepancies.

l_1_0 <- read_parquet("results/10547453/metrics_level_1_0_non_rep.parquet")
csim <- read_parquet("results/10547453/collatedsim.parquet")
l_1_0_id1 <- l_1_0 %>% distinct(id1)
csim_id1 <- csim %>% distinct(id1)
# we expect all the `id1` in `metrics_level_1_0_non_rep` to be present in `collatedsim`
stopifnot(l_1_0_id1 %>% anti_join(csim_id1, by = "id1") %>% nrow() == 0)

Pair creation interface

There are a few suggestions regarding the interface.

  1. Right now, pos/neg + sameby/diffby are required twice: in the Matcher constructor and in the create_all_pairs method call. This seems redundant, since these column lists are usually the same; in fact, Matcher uses them to remove all other columns from the dataframe that are not listed in any of them. If create_all_pairs is called with a column that was not present at matcher instantiation, it fails with an obscure TypeError: '<' not supported between instances of 'NoneType' and 'int'.

One option is to save these column lists on the Matcher instance and use them by default in the create_all_pairs call. To separate getting positive and negative pairs, we can either add a new argument to create_all_pairs (type=pos/neg) or make create_all_pairs create both positive and negative pairs using the saved column lists (with a method for each, so they can still be called separately if needed). The all_pairs method can still take column lists as arguments in case the user wants to override the defaults with subsets (see the first sketch after this list).

Alternatively, we could separate Matcher from pair creation. IIUC, Matcher is needed to store the data state plus the index, which could be returned as an object that is then passed into a create_pairs function.

  2. There is no option to define negative pairs as anything that is not positive.

This could be the default result when calling the method that creates negative pairs with empty column lists.

  3. Currently, copairs does not support `any` semantics, for either sameby or diffby.

Options for an `any` interface could be the following:

  • introducing new parameters sameby_any/diffby_any that are passed to the Matcher constructor and to the create_all_pairs method call
  • changing the existing sameby/diffby so that they can also accept a dict with keys all and any (if a str or list is passed, it is treated as all; see the second sketch after this list)

There are probably more and better options, but this is what I could think of for now.
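
To make the first and third suggestions concrete, here are hypothetical sketches (the names are not the copairs API; get_all_pairs is the existing Matcher call):

class PairedMatcher:
    # Store the column lists at construction so pair creation does not
    # repeat them; each kind of pair gets its own method.
    def __init__(self, matcher, pos_sameby, pos_diffby, neg_sameby, neg_diffby):
        self.matcher = matcher
        self.pos = (pos_sameby, pos_diffby)
        self.neg = (neg_sameby, neg_diffby)

    def pos_pairs(self):
        sameby, diffby = self.pos
        return self.matcher.get_all_pairs(sameby=sameby, diffby=diffby)

    def neg_pairs(self):
        sameby, diffby = self.neg
        return self.matcher.get_all_pairs(sameby=sameby, diffby=diffby)

def normalize_sameby(sameby):
    # The dict-based `any` option: a str or list keeps the current "all"
    # semantics, while a dict may specify both "all" and "any" keys.
    if isinstance(sameby, str):
        return {"all": [sameby], "any": []}
    if isinstance(sameby, list):
        return {"all": sameby, "any": []}
    return {"all": list(sameby.get("all", [])), "any": list(sameby.get("any", []))}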
