
copairs's Issues

Allow absolute cosine similarity

Sometimes both strong positive and strong negative correlations are meaningful, so we should allow for that.

One way to do this is to add a boolean absolute_value argument to pairwise_cosine. The current implementation:

def pairwise_cosine(x_sample: np.ndarray, y_sample: np.ndarray) -> np.ndarray:
    x_norm = x_sample / np.linalg.norm(x_sample, axis=1)[:, np.newaxis]
    y_norm = y_sample / np.linalg.norm(y_sample, axis=1)[:, np.newaxis]
    c_sim = np.sum(x_norm * y_norm, axis=1)
    return c_sim

The proposed change:

def pairwise_cosine(x_sample: np.ndarray, y_sample: np.ndarray, absolute_value: bool = False) -> np.ndarray:
    x_norm = x_sample / np.linalg.norm(x_sample, axis=1)[:, np.newaxis]
    y_norm = y_sample / np.linalg.norm(y_sample, axis=1)[:, np.newaxis]
    c_sim = np.sum(x_norm * y_norm, axis=1)
    if absolute_value:
        c_sim = np.abs(c_sim)
    return c_sim
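
For illustration, a quick check of the proposed flag (using the function above; the values are made up):

import numpy as np

# With absolute_value=True, strongly anti-correlated profiles score the same
# as strongly correlated ones.
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[-1.0, -2.0, -3.0]])
print(pairwise_cosine(x, y))                       # [-1.]
print(pairwise_cosine(x, y, absolute_value=True))  # [1.]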

Adding `pd.query()` style selections raises index error

Description

This issue concerns implementing pd.query()-based selection for the run_pipeline() function. I encountered this challenge while working on a notebook, and I've documented the problem in a corresponding PR for visibility and collaboration.

To reproduce the error, I've created a small dataset and a standalone notebook. You can find the test data here, which I used in the referenced notebook; the reproduce_error.ipynb notebook provides the code to recreate the issue.

The error that I receive is this:

KeyError                                  Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:115, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    105 def run_pipeline(meta,
    106                  feats,
    107                  pos_sameby,
   (...)
    112                  batch_size=20000,
    113                  seed=0) -> pd.DataFrame:
    114     columns = flatten_str_list(pos_sameby, pos_diffby, neg_sameby, neg_diffby)
--> 115     validate_pipeline_input(meta, feats, columns)
    117     # Critical!, otherwise the indexing wont work
    118     meta = meta.reset_index(drop=True).copy()

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:99, in validate_pipeline_input(meta, feats, columns)
     98 def validate_pipeline_input(meta, feats, columns):
---> 99     if meta[columns].isna().any(axis=None):
    100         raise ValueError('metadata columns should not have null values.')
    101     if len(meta) != len(feats):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:3899, in DataFrame.__getitem__(self, key)
   3897     if is_iterator(key):
   3898         key = list(key)
-> 3899     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3901 # take() does not accept boolean indexers
   3902 if getattr(indexer, "dtype", None) == bool:

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
   6111 else:
   6112     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
   6116 keyarr = self.take(indexer)
   6117 if isinstance(key, Index):
   6118     # GH 42790 - Preserve name from an Index

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6178, in Index._raise_if_missing(self, key, indexer, axis_name)
   6175     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6178 raise KeyError(f"{not_found} not in index")

KeyError: "['Metadata_is_control == 0'] not in index"

The root cause traces to the validate_pipeline_input() function, which does not recognize pd.query()-style selections. Attempting to bypass the issue by commenting out the validation leads to a subsequent problem, shown below.

ValueError                                Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:120, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    118 meta = meta.reset_index(drop=True).copy()
    119 logger.info('Indexing metadata...')
--> 120 matcher = create_matcher(meta, pos_sameby, pos_diffby, neg_sameby,
    121                          neg_diffby)
    123 logger.info('Finding positive pairs...')
    124 pos_pairs = matcher.get_all_pairs(sameby=pos_sameby, diffby=pos_diffby)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:61, in create_matcher(obs, pos_sameby, pos_diffby, neg_sameby, neg_diffby, multilabel_col)
     59 if multilabel_col:
     60     return MatcherMultilabel(obs, columns, multilabel_col, seed=0)
---> 61 return Matcher(obs, columns, seed=0)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:77, in Matcher.__init__(self, dframe, columns, seed, max_size)
     73         elems = rng.choice(elems, max_size)
     74     return elems
     76 mappers = [
---> 77     reverse_index(dframe[col]).apply(clip_list) for col in dframe
     78 ]
     80 # Create a column order based on the number of potential row matches
     81 # Useful to solve queries with more than one sameby
     82 n_pairs = {}

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:22, in reverse_index(col)
     20 def reverse_index(col: pd.Series) -> pd.Series:
     21     '''Build a reverse_index for a given column in the DataFrame'''
---> 22     return pd.Series(col.groupby(col).indices, name=col.name)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:8869, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
   8866 if level is None and by is None:
   8867     raise TypeError("You have to supply one of 'by' and 'level'")
-> 8869 return DataFrameGroupBy(
   8870     obj=self,
   8871     keys=by,
   8872     axis=axis,
   8873     level=level,
   8874     as_index=as_index,
   8875     sort=sort,
   8876     group_keys=group_keys,
   8877     observed=observed,
   8878     dropna=dropna,
   8879 )

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1278, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
   1275 self.dropna = dropna
   1277 if grouper is None:
-> 1278     grouper, exclusions, obj = get_grouper(
   1279         obj,
   1280         keys,
   1281         axis=axis,
   1282         level=level,
   1283         sort=sort,
   1284         observed=False if observed is lib.no_default else observed,
   1285         dropna=self.dropna,
   1286     )
   1288 if observed is lib.no_default:
   1289     if any(ping._passed_categorical for ping in grouper.groupings):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1020, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
   1015         in_axis = False
   1017     # create the Grouping
   1018     # allow us to passing the actual Grouping as the gpr
   1019     ping = (
-> 1020         Grouping(
   1021             group_axis,
   1022             gpr,
   1023             obj=obj,
   1024             level=level,
   1025             sort=sort,
   1026             observed=observed,
   1027             in_axis=in_axis,
   1028             dropna=dropna,
   1029         )
   1030         if not isinstance(gpr, Grouping)
   1031         else gpr
   1032     )
   1034     groupings.append(ping)
   1036 if len(groupings) == 0 and len(obj):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:601, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna, uniques)
    599 if getattr(grouping_vector, "ndim", 1) != 1:
    600     t = str(type(grouping_vector))
--> 601     raise ValueError(f"Grouper for '{t}' not 1-dimensional")
    603 grouping_vector = index.map(grouping_vector)
    605 if not (
    606     hasattr(grouping_vector, "__len__")
    607     and len(grouping_vector) == len(index)
    608 ):

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

It appears that bypassing the validation steps results in a failure to construct the Grouper class.
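
One possible direction, sketched below with a hypothetical helper (not the copairs API), is for the validation to distinguish plain column names from query-style expressions and apply the latter as filters instead of indexing by them:

import re

# Hypothetical helper: treat entries containing comparison operators as
# pd.query()-style conditions rather than column names.
QUERY_CHARS = re.compile(r"[=<>!~]")

def split_columns(columns):
    """Separate plain column names from query-style expressions."""
    plain = [c for c in columns if not QUERY_CHARS.search(c)]
    queries = [c for c in columns if QUERY_CHARS.search(c)]
    return plain, queries

# Sketch of how validation could then proceed:
# plain, queries = split_columns(columns)
# if meta[plain].isna().any(axis=None):
#     raise ValueError("metadata columns should not have null values.")
# for q in queries:
#     meta = meta.query(q)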

Using the repo

If you would like to explore and test the issue, please feel free to use the dedicated repository I've set up. Here are the steps to get started:

git clone https://github.com/WayScience/Mitocheck-MAP-analysis.git && cd Mitocheck-MAP-analysis
conda env create -f map_env.yaml
conda activate map

These commands will clone the repository, set up the required conda environment using the provided map_env.yaml file, and activate the environment, respectively.

A profile's average precision is 1 if it contains any NaNs

The problem

All results come from this notebook

If a profile contains even a single NaN, every pairwise similarity will also be NaN. Here is an example:
[screenshot omitted]

Here is an example where the profile that we are computing AP for has no NaN, but some of its pairs do:
[screenshot omitted]

When all pairs (pos and neg) have NaN similarities, copairs returns an AP of 1. This results in any profile with even a single NaN getting an AP of 1 (see ZMYND10_Arg340Gln and UBQLN2_Pro506Thr rows):
[screenshot omitted]

What to do?

As @alxndrkalinin pointed out, we probably don't want to enforce any NaN-handling strategy inside the pairwise similarity calculations, because this can get complex.

But maybe this also shouldn't happen without any warning. Users could easily have a small number of NaNs in their data and never notice, because the mAP values would look normal (between 0 and 1) but would be biased upwards whenever some of the individual AP values are 1.

I suggest we add one simple check for NaNs in the feats input in the validate_pipeline_input function, and give users a warning describing this behavior so they know to resolve the NaNs. Another (more complicated) solution would be to change the tie strategy when computing average precision from the ranked list, such that the average precision is NaN instead of 1 when everything is NaN and therefore tied.
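
A minimal sketch of the suggested check (hypothetical helper and warning text, not the current copairs code):

import warnings

import numpy as np

def check_feats_for_nans(feats: np.ndarray) -> None:
    # A single NaN feature propagates through all pairwise similarities for
    # that profile and currently yields an AP of 1.
    if np.isnan(feats).any():
        warnings.warn(
            "feats contains NaN values; profiles with NaNs produce NaN "
            "similarities, which currently yield an AP of 1. Resolve NaNs "
            "before running the pipeline."
        )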

compute.compute_ap_contiguos returns NaNs silently

When using map.run_pipeline(), if some samples are present only once in the dataset, compute.ap_contiguos silently returns NaNs. This is due to compute.to_cutoffs being applied to an array of length 1, so the line

cutoffs[0], cutoffs[1:] = 0, counts.cumsum()[:-1]

always returns [0].

These zeros propagate to num_pos:

num_pos = np.add.reduceat(rel_k_list, cutoffs)

and things are then divided by zero, since this vector is used as a divisor:

ap_scores = np.add.reduceat(pr_k * rel_k_list, cutoffs) / num_pos
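
The failure mode can be reproduced in isolation (toy values, assuming a sample that appears once and has no positive matches):

import numpy as np

rel_k_list = np.zeros(1)        # no relevant pairs for this sample
cutoffs = np.array([0])         # to_cutoffs applied to a length-1 array
num_pos = np.add.reduceat(rel_k_list, cutoffs)   # array([0.])
pr_k = np.ones(1)
with np.errstate(invalid="ignore"):
    ap_scores = np.add.reduceat(pr_k * rel_k_list, cutoffs) / num_pos
print(ap_scores)                # [nan]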

It was a bit puzzling to figure out why my test with a small number of plates was plagued with NaNs, so it is probably worth adding a warning when one element (or perhaps a fraction of elements, e.g., 1/4) appears only once.

Rename groupby

Rename groupby to sameby or another name to avoid confusion with pandas groupby.

[bug] AP of 1.0 should have min p-value

When the calculated AP is 1.0, the corresponding p-value is too large.

For example, for 2 positive profiles, 16 controls, null size 1000, and AP = 1, the p-value is 0.061938. This is incorrect: the proportion of the null distribution to the right of 1.0 should be 0, so the p-value should be p = (num + 1) / (null_size + 1) = 1/1001 ≈ 0.000999.
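
As a toy illustration (not copairs internals) of the expected behavior:

import numpy as np

# Nothing in the null distribution can exceed an AP of 1.0, so the count of
# null values at or above the observed AP should be 0 and the p-value should
# take its minimum possible value.
null_size = 1000
null_aps = np.linspace(0.0, 0.9, null_size)  # toy null distribution
observed_ap = 1.0
num = int((null_aps >= observed_ap).sum())   # 0
p_value = (num + 1) / (null_size + 1)        # 1/1001 ≈ 0.000999
print(p_value)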

accepting pandas `query` conditions in sameby/diffby

We discussed earlier that sometimes filtering needs to be done as part of creating pairs, and this could be achieved by allowing sameby/diffby to accept not just column names (e.g., A) but also pandas.DataFrame.query syntax (e.g., A == 1).

I considered two possible options for implementing it:

  1. Pass just column names to the Matcher constructor and then filter the data based on the conditions at pair-creation time, inside get_all_pairs. This approach was motivated by the idea of building the generic reverse index once and reusing it for any subset of the data under different filters (i.e., call Matcher once, then call get_all_pairs with different filters). The issue with this option is that the index has already been created by the time the condition filters the data, so after filtering the index contains (a) incorrect keys, i.e., filtered column values that have been dropped, and (b) incorrect row numbers for other columns, corresponding to dropped rows. Correctly subsetting the original index turned out not to be easy, and it's not clear whether in most cases the index would actually have to be rebuilt from scratch. That's why I switched to the second option.

  2. The alternative approach applies the filter at Matcher initialization, so the index is built on pre-filtered data. This avoids requesting invalid rows and the need to rebuild the index, but it limits get_all_pairs calls to the filtered subset of the original data, and a trivial implementation returns pair indices that do not match the original dataframe. To address that, I proposed storing the original index in the Matcher state and using it to convert pair indices back to the original range (see the sketch below).
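
A minimal sketch of option 2, with hypothetical names (not the copairs API):

from typing import Optional

import pandas as pd

class FilteredMatcher:
    def __init__(self, dframe: pd.DataFrame, query: Optional[str] = None):
        # Filter before building any index, and remember the original row
        # labels so pairs can be translated back.
        if query is not None:
            dframe = dframe.query(query)
        self.original_index = dframe.index.to_numpy()
        self.dframe = dframe.reset_index(drop=True)

    def to_original(self, pairs):
        """Map (i, j) pairs over the filtered frame to original row labels."""
        return [(self.original_index[i], self.original_index[j]) for i, j in pairs]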

Track potential discrepancies between `matric` and `copairs`

@alxndrkalinin noticed discrepancies between matric and evalzoo.

The edges and similarities are identical, but the metrics differ.

We suspect there's a bug in matric in some edge cases.

We think this is a bug in matric and not copairs because something looks funky in the indices of the metrics_level_1_0 files – they don't match the indices of the collated_sim file.

Here's how we'd test it. The example below does not show any discrepancies, but some examples that @alxndrkalinin had did show discrepancies.

l_1_0 <- read_parquet("results/10547453/metrics_level_1_0_non_rep.parquet")
csim <- read_parquet("results/10547453/collatedsim.parquet")
l_1_0_id1 <- l_1_0 %>% distinct(id1)
csim_id1 <- csim %>% distinct(id1)
# we expect all the `id1` in `metrics_level_1_0_non_rep` to be present in `collatedsim`
stopifnot(l_1_0_id1 %>% anti_join(csim_id1, by = "id1") %>% nrow() == 0)

Pair creation interface

There are a few suggestions regarding the interface.

  1. Right now, pos/neg + sameby/diffby are required twice: in the Matcher constructor and in the create_all_pairs method call. This seems redundant, since these column lists are usually the same; in fact, Matcher uses them to remove all other columns from the dataframe that are not listed in any of them. If create_all_pairs is called with a column that was not present at matcher instantiation, it fails with an obscure TypeError: '<' not supported between instances of 'NoneType' and 'int'.

One option is to save these column lists on the Matcher instance and use them by default in the create_all_pairs call. To separate getting positive and negative pairs, we can either add a new argument to create_all_pairs (type=pos/neg) or make create_all_pairs create both positive and negative pairs using the saved column lists (with a method for each, so they can still be called separately if needed). The all_pairs method can still take column lists as arguments in case the user wants to override the defaults with subsets (see the first sketch after this list).

Alternatively, we could separate Matcher from pair creation. IIUC, Matcher is needed to store the data state plus the index, which could be returned as an object that is then passed into a create_pairs function.

  2. There is no option to define negative pairs as anything that is not positive.

This could be the default result when calling the method that creates negative pairs with empty column lists.

  3. Currently, copairs does not support `any` semantics, for either sameby or diffby.

Options for an `any` interface could be the following:

  • introducing new parameters sameby_any/diffby_any that are passed to the Matcher constructor and to the create_all_pairs method call
  • changing the existing sameby/diffby so that they can also accept a dict with keys all and any (if a str or list is passed, it is treated as all; see the second sketch after this list)

There are probably more and better options, but this is what I could think of for now.
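
To make the first and third suggestions concrete, here are hypothetical sketches (the names are not the copairs API; get_all_pairs is the existing Matcher call):

class PairedMatcher:
    # Store the column lists at construction so pair creation does not
    # repeat them; each kind of pair gets its own method.
    def __init__(self, matcher, pos_sameby, pos_diffby, neg_sameby, neg_diffby):
        self.matcher = matcher
        self.pos = (pos_sameby, pos_diffby)
        self.neg = (neg_sameby, neg_diffby)

    def pos_pairs(self):
        sameby, diffby = self.pos
        return self.matcher.get_all_pairs(sameby=sameby, diffby=diffby)

    def neg_pairs(self):
        sameby, diffby = self.neg
        return self.matcher.get_all_pairs(sameby=sameby, diffby=diffby)

def normalize_sameby(sameby):
    # The dict-based `any` option: a str or list keeps the current "all"
    # semantics, while a dict may specify both "all" and "any" keys.
    if isinstance(sameby, str):
        return {"all": [sameby], "any": []}
    if isinstance(sameby, list):
        return {"all": sameby, "any": []}
    return {"all": list(sameby.get("all", [])), "any": list(sameby.get("any", []))}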
