morar's Issues (swarchal/morar)

morar.DataFrame.drop is borked, self counts as an argument

TypeError: drop() takes 1 positional argument but 2 were given

change tests from nose to pytest

Add morar.DataFrame.aggregate() method

Stop using deprecated .ix in tests

tests/test_stats.py::test_mad_dataframe_row
  /home/scott/code/morar/tests/test_stats.py:38: DeprecationWarning: 
  .ix is deprecated. Please use
  .loc for label based indexing or
  .iloc for positional indexing
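
A minimal sketch of the replacement the warning asks for, using a made-up frame rather than the actual test data: `.loc` selects by label, `.iloc` by position, and the mixed lookup that `.ix` used to allow can be written by translating the position into a label first.

```python
import pandas as pd

# Illustrative data only; not taken from tests/test_stats.py
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["r0", "r1", "r2"],
)

# Label-based: the row labelled "r1"
row_by_label = df.loc["r1"]

# Position-based: the second row, whatever its label is
row_by_position = df.iloc[1]

# Mixed lookup (row by position, column by label), which .ix used to accept
value = df.loc[df.index[1], "b"]
```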

outliers.get_outlier_index needs feature_data kwargs

utils.get_metadata should be able to handle metadata in middle of strings

At the moment it will miss columns such as Image_Metadata_compound, if the prefix is set to "Metadata".
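
One possible approach, sketched with a hypothetical helper (`get_metadata_columns` is not part of morar): match the prefix as an underscore-delimited token anywhere in the column name instead of only at the start.

```python
import re


def get_metadata_columns(columns, prefix="Metadata"):
    """Hypothetical helper: return the column names that contain `prefix`
    as an underscore-delimited token, at the start or in the middle."""
    pattern = re.compile(r"(^|_)" + re.escape(prefix) + r"(_|$)")
    return [col for col in columns if pattern.search(col)]


columns = ["Metadata_well", "Image_Metadata_compound", "Cells_AreaShape_Area"]
metadata_cols = get_metadata_columns(columns)
```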

Add extra to utils.img_to_metadata

Want to be able to pass a string or a list of strings as an argument for column prefixes which are classed as metadata, e.g. extra=["Nuclei", "Cells"]
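
A rough sketch of how such an argument could be accepted. Both helpers below are hypothetical and say nothing about how utils.img_to_metadata is actually written; they only show a string and a list being treated alike.

```python
def normalise_prefixes(prefix="Metadata", extra=None):
    """Hypothetical helper: combine the main prefix with `extra`,
    which may be None, a single string, or a list of strings."""
    if extra is None:
        extra = []
    elif isinstance(extra, str):
        extra = [extra]
    return [prefix] + list(extra)


def is_metadata(column, prefixes):
    """True if the column name starts with any of the given prefixes."""
    return column.startswith(tuple(prefixes))


prefixes = normalise_prefixes(extra=["Nuclei", "Cells"])
```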

Find outliers with PCA

Calculate PCA, find outliers in PCA space and return indices which can then be dropped in the original dataset.
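
A minimal sketch of the idea, not a proposed implementation. The function name, the distance-from-median rule and the `n_mads` cutoff are all assumptions made for illustration; the issue does not say how outliers in PCA space should be defined.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_outlier_index(features, n_components=2, n_mads=5.0):
    """Hypothetical sketch: return the index labels of rows that lie far
    from the bulk of the data in principal component space."""
    scores = PCA(n_components=n_components).fit_transform(features.to_numpy())
    # Distance of every row from the median point in PCA space
    centre = np.median(scores, axis=0)
    distance = np.linalg.norm(scores - centre, axis=1)
    # Assumed rule: flag rows more than n_mads MADs above the median distance
    mad = np.median(np.abs(distance - np.median(distance)))
    cutoff = np.median(distance) + n_mads * mad
    return features.index[distance > cutoff]


rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)
features.iloc[10] = 50.0  # plant one obvious outlier

outliers = pca_outlier_index(features)
cleaned = features.drop(index=outliers)  # drop from the original dataset
```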

calling morar.utils.drop() results in a TypeError

~/anaconda3/lib/python3.7/site-packages/morar/dataframe.py in dropna(self, **kwargs)
     96     def dropna(self, **kwargs):
     97         """dropna via pandas.DataFrame.dropna"""
---> 98         _check_inplace(kwargs)
     99         pandas_df = pd.DataFrame(self)
    100         result = pandas_df.dropna(**kwargs)

TypeError: _check_inplace() takes 0 positional arguments but 1 was given

stats.find_low_var doesn't work with all nan columns

aggregate will lose text columns that are not metadata

write a wrapper for principal component calculation

Already imports scikit-learn so might as well use sklearn.decomposition.PCA

PCA on just feature data
concat principal components with metadata
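
The two steps above could look roughly like this. `pca_with_metadata` and its parameters are hypothetical names, and the sketch assumes metadata columns are identified by a name prefix.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_with_metadata(df, metadata_prefix="Metadata", n_components=2):
    """Hypothetical wrapper: run PCA on the feature columns only, then
    column-bind the principal components with the metadata columns."""
    is_meta = df.columns.str.startswith(metadata_prefix)
    metadata = df.loc[:, is_meta]
    features = df.loc[:, ~is_meta]
    scores = PCA(n_components=n_components).fit_transform(features.to_numpy())
    pc_names = [f"PC{i + 1}" for i in range(n_components)]
    components = pd.DataFrame(scores, columns=pc_names, index=df.index)
    return pd.concat([components, metadata], axis=1)


rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(6, 3)), columns=["f1", "f2", "f3"])
data["Metadata_well"] = ["A01", "A02", "A03", "B01", "B02", "B03"]

result = pca_with_metadata(data)
```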

export common functions directly into top-level namespace

morar.DataFrame.dropna should return morar.DataFrame not pandas.DataFrame

impute seems to break if all columns are featuredata

aggregate.aggregate has no option to change metadata prefix

Should be able to pass an argument to utils.get_metadata and utils.get_featuredata

add robust_z_score

aggregation module

stats.scale_features should return metadata as well as featuredata

Modify dataframe.featuredata or dataframe.metadata

It would be nice to be able to modify the featuredata or metadata using its attribute.

feature_subset = ["feature_1", "feature_2"]
my_data.featuredata = my_data[feature_subset]

or

my_data.featuredata = my_data.featuredata[feature_subset]

Though at the moment, this produces the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-48-3e99f8310ea1> in <module>()
----> 1 df.featuredata = df.featredata[previous_features]

/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
   2670             if name in self._info_axis:
   2671                 return self[name]
-> 2672             return object.__getattribute__(self, name)
   2673 
   2674     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'featredata'

In [49]: df.featuredata = df.featuredata[previous_features]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   2702                 else:
-> 2703                     object.__setattr__(self, name, value)
   2704             except (AttributeError, TypeError):

AttributeError: can't set attribute

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-49-1c984f1ae98a> in <module>()
----> 1 df.featuredata = df.featuredata[previous_features]

/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   2703                     object.__setattr__(self, name, value)
   2704             except (AttributeError, TypeError):
-> 2705                 object.__setattr__(self, name, value)
   2706 
   2707     # ----------------------------------------------------------------------

AttributeError: can't set attribute
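
"can't set attribute" is what Python raises when a property has no setter. A minimal sketch of the requested behaviour, using a hypothetical stand-in class (`FeatureFrame` wraps a pandas DataFrame and is not how morar.DataFrame is written):

```python
import pandas as pd


class FeatureFrame:
    """Hypothetical stand-in for morar.DataFrame, for illustration only."""

    def __init__(self, data, metadata_prefix="Metadata"):
        self.data = data
        self.metadata_prefix = metadata_prefix

    @property
    def metadata(self):
        is_meta = self.data.columns.str.startswith(self.metadata_prefix)
        return self.data.loc[:, is_meta]

    @property
    def featuredata(self):
        is_meta = self.data.columns.str.startswith(self.metadata_prefix)
        return self.data.loc[:, ~is_meta]

    @featuredata.setter
    def featuredata(self, new_features):
        # Keep the existing metadata and swap in the new feature columns
        self.data = pd.concat([new_features, self.metadata], axis=1)


my_data = FeatureFrame(pd.DataFrame({
    "feature_1": [1.0, 2.0],
    "feature_2": [3.0, 4.0],
    "feature_3": [5.0, 6.0],
    "Metadata_well": ["A01", "A02"],
}))

feature_subset = ["feature_1", "feature_2"]
my_data.featuredata = my_data.featuredata[feature_subset]
```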

p_normalise should be able to use division normalisation

read and aggregate

Read in each object.csv file and create median aggregates by ImageNumber.

If files are large:

read in chunks; aggregate by ImageNumber
store aggregated intermediates
read in intermediates
row-bind into single dataframe
remove intermediates
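
The steps above can be sketched as follows. `aggregate_in_chunks` is a hypothetical name, and the sketch assumes the rows of the CSV are sorted by ImageNumber so that an image split across two chunks can be carried over and aggregated in one piece.

```python
import os
import tempfile

import pandas as pd


def aggregate_in_chunks(path, by="ImageNumber", chunksize=100_000):
    """Hypothetical sketch: median-aggregate a large CSV chunk by chunk.
    Assumes the file is sorted by `by`."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        intermediates = []
        carry = None  # rows of the last, possibly incomplete, group
        for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
            if carry is not None:
                chunk = pd.concat([carry, chunk])
            # The final group in a chunk may continue in the next chunk
            last_group = chunk[by].iloc[-1]
            carry = chunk[chunk[by] == last_group]
            complete = chunk[chunk[by] != last_group]
            if not complete.empty:
                # Store the aggregated intermediate on disk
                out = os.path.join(tmp_dir, f"part_{i}.csv")
                agg = complete.groupby(by, as_index=False).median(numeric_only=True)
                agg.to_csv(out, index=False)
                intermediates.append(out)
        # Read the intermediates back and row-bind into a single dataframe
        parts = [pd.read_csv(p) for p in intermediates]
        if carry is not None:
            parts.append(carry.groupby(by, as_index=False).median(numeric_only=True))
        result = pd.concat(parts, ignore_index=True)
    # Leaving the with-block removes the intermediates
    return result
```

A call would look like `aggregate_in_chunks("object.csv", chunksize=50_000)`, with the path and chunk size chosen to suit the data.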

latest scikit-learn and pandas require python>=3.5

normalise error : "ValueError: operands could not be broadcast together with shapes (18000,) (20,)"

normalise should return the whole dataframe, not just featuredata

Stop using deprecated Imputer from sklearn

tests/test_utils.py::test_impute
tests/test_utils.py::test_impute_mean
tests/test_utils.py::test_impute_with_metadata
  /home/scott/anaconda3/lib/python3.7/site-packages/sklearn/utils/deprecation.py:66:
    DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version
    0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
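
A minimal sketch of the replacement named in the warning, on made-up data. `SimpleImputer` always works column-wise, so the old `axis` argument has no equivalent.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only; not taken from tests/test_utils.py
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Previously: from sklearn.preprocessing import Imputer
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)  # NaNs replaced by the column median
```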

morar.DataFrame.groupby groups to keep morar.DataFrame attributes

No idea how to implement this, but it would be nice.

parallelise normalise functions

Work is done by splitting the dataframe into groups via .groupby() and looping through each group. The groups are independent, so this could be done in parallel to speed things up.

e.g., from Stack Overflow (updated for Python 3):

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def tmpFunc(df):
    # Work on a copy so the original group is left untouched
    df = df.copy()
    df['c'] = df.a + df.b
    return df

def applyParallel(dfGrouped, func):
    # Run func on every group in its own worker, then row-bind the results
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))

    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))

    # ideal version (does not work, groupby objects have no applyParallel method):
    # print(df.groupby(df.index).applyParallel(tmpFunc))

Multi-indexing & databases.

Dataframes with multi-indexed columns are unlikely to convert easily to sqlite tables, so there needs to be some option to flatten the column names (a simple paste?).

Ideally would like to automatically flatten columns to store as a database table, and convert back to multi-index columns when reading in as a dataframe from the database.
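
A small sketch of that round trip. The two helpers and the `__` separator are assumptions; the separator must not occur inside any column level name, otherwise the split on the way back is ambiguous.

```python
import sqlite3

import pandas as pd

SEP = "__"  # assumed separator, must not appear in any level name


def flatten_columns(df, sep=SEP):
    """Hypothetical helper: paste MultiIndex column levels into one string."""
    flat = df.copy()
    flat.columns = [sep.join(levels) for levels in df.columns]
    return flat


def unflatten_columns(df, sep=SEP):
    """Hypothetical helper: split flat names back into MultiIndex columns."""
    nested = df.copy()
    nested.columns = pd.MultiIndex.from_tuples(
        [tuple(name.split(sep)) for name in df.columns]
    )
    return nested


df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples(
        [("Cells", "AreaShape_Area"), ("Nuclei", "AreaShape_Area")]
    ),
)

con = sqlite3.connect(":memory:")
flatten_columns(df).to_sql("objects", con, index=False)
restored = unflatten_columns(pd.read_sql("SELECT * FROM objects", con))
con.close()
```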

store object files as MultiIndex DataFrame

Likely to have columns of identical names in different objects. Should try to use MultiIndex dataframes in pandas.

i.e.

              object 1                |                object 2
         featuredata | metadata       |          featuredata | metadata

data = morar.DataFrame(data, metadata_prefix="Image_Metadata")
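
In plain pandas, one way to get this layout is to concatenate the per-object frames with `keys`, which become the top column level. The frames below are made up, and the sketch uses two column levels (object, column name) rather than the three in the diagram above.

```python
import pandas as pd

# Illustrative per-object tables with identical column names
cells = pd.DataFrame({
    "AreaShape_Area": [10.0, 12.0],
    "Metadata_well": ["A01", "A02"],
})
nuclei = pd.DataFrame({
    "AreaShape_Area": [3.0, 4.0],
    "Metadata_well": ["A01", "A02"],
})

# The keys become the top level of the column MultiIndex
objects = pd.concat([cells, nuclei], axis=1, keys=["Cells", "Nuclei"])

# Identically named columns are now distinguished by their object
nuclei_area = objects[("Nuclei", "AreaShape_Area")]
```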
