Issues from swarchal/morar

i.e

data = morar.DataFrame(data, metadata_prefix="Image_Metadata")

impute seems to break if all columns are featuredata

latest scikit-learn and pandas require python>=3.5

Add morar.DataFrame.aggregate() method

morar.DataFrame should have an impute method

morar.DataFrame.dropna should return morar.DataFrame not pandas.DataFrame

outliers.get_outlier_index needs feature_data kwargs

stats.scale_features should return metadata as well as featuredata

Add extra to utils.img_to_metadata

It should be possible to pass a string or a list of strings naming the column prefixes that are classed as metadata, e.g. extra=["Nuclei", "Cells"].

morar.DataFrame.groupby groups to keep morar.DataFrame attributes

No idea how to implement this, but it would be nice.

read and aggregate

Read in each object.csv file and create median aggregates by ImageNumber.

If files are large:

read in chunks; aggregate by ImageNumber
store aggregated intermediates
read in intermediates
row-bind into single dataframe
remove intermediates
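
The steps above could be sketched roughly as follows. This is a minimal sketch, not morar's implementation: the function name `aggregate_in_chunks` and its parameters are hypothetical, and it assumes the rows of each file are sorted by ImageNumber, so that only the last group in a chunk can continue into the next chunk. That group is carried over rather than aggregated early, because a median of per-chunk medians would not equal the true median.

```python
import os
import tempfile

import pandas as pd


def aggregate_in_chunks(path, by="ImageNumber", chunksize=10000):
    """Median-aggregate a large CSV by `by`, reading it one chunk at a time.

    Assumption: rows are sorted by `by`.
    """
    frames = []
    carry = None  # rows of the last, possibly incomplete, group
    with tempfile.TemporaryDirectory() as tmp_dir:
        intermediates = []
        for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
            if carry is not None:
                chunk = pd.concat([carry, chunk], ignore_index=True)
            last = chunk[by].iloc[-1]
            carry = chunk[chunk[by] == last]
            complete = chunk[chunk[by] != last]
            if complete.empty:
                continue
            # store the aggregated intermediate on disk
            out = os.path.join(tmp_dir, "agg_{}.csv".format(i))
            agg = complete.groupby(by, as_index=False).median(numeric_only=True)
            agg.to_csv(out, index=False)
            intermediates.append(out)
        # read the intermediates back in; the directory is removed on exit
        frames = [pd.read_csv(f) for f in intermediates]
    if carry is not None:
        frames.append(carry.groupby(by, as_index=False).median(numeric_only=True))
    # row-bind into a single dataframe
    return pd.concat(frames, ignore_index=True)
```

Non-numeric columns (including string metadata) are dropped by `numeric_only=True`, so metadata would need to be handled separately.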

p_normalise should be able to use division normalisation

aggregation module

aggregate.aggregate has no option to change metadata prefix

Should be able to pass an argument to utils.get_metadata and utils.get_featuredata

morar.DataFrame.drop is borked, self counts as an argument

TypeError: drop() takes 1 positional argument but 2 were given

add robust_z_score
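
One common definition of a robust z-score replaces the mean with the median and the standard deviation with the scaled median absolute deviation (MAD). The sketch below assumes that definition and a one-dimensional input. The signature is hypothetical and may not match what the stats module would use.

```python
import pandas as pd


def robust_z_score(x):
    """(x - median) / (1.4826 * MAD) for a 1-D array-like.

    The factor 1.4826 makes the MAD comparable to the standard deviation
    for normally distributed data. If the MAD is zero the result contains
    inf or NaN values.
    """
    x = pd.Series(x, dtype=float)
    median = x.median()
    mad = (x - median).abs().median()
    return (x - median) / (1.4826 * mad)
```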

Find outliers with PCA

Calculate PCA, find outliers in PCA space and return indices which can then be dropped in the original dataset.
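
A rough sketch of this idea is given below. It computes the principal component scores with a plain NumPy SVD rather than a library PCA class, and it flags a row as an outlier when its robust z-score on any retained component exceeds a threshold. The outlier rule, the default threshold, and the function name `pca_outlier_index` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def pca_outlier_index(featuredata, n_components=2, threshold=3.5):
    """Return index labels of rows that are outliers in PCA space."""
    X = featuredata.to_numpy(dtype=float)
    X = X - X.mean(axis=0)  # centre each feature
    # principal component scores from the SVD of the centred data
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    # robust z-score per component (median and scaled MAD)
    med = np.median(scores, axis=0)
    mad = np.median(np.abs(scores - med), axis=0)
    z = np.abs(scores - med) / (1.4826 * mad)
    is_outlier = (z > threshold).any(axis=1)
    return featuredata.index[is_outlier]
```

The returned labels can be passed straight to `drop` on the original dataset, e.g. `clean = data.drop(pca_outlier_index(features))`.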

utils.get_metadata should be able to handle metadata in middle of strings

At the moment it will miss columns such as Image_Metadata_compound if the prefix is set to "Metadata".
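
The difference between prefix matching and substring matching can be illustrated with a small helper. `find_metadata_columns` and its `prefix` flag are hypothetical names used only for this sketch, not the signature of utils.get_metadata.

```python
def find_metadata_columns(columns, metadata_string="Metadata", prefix=False):
    """Return the column names classed as metadata.

    prefix=True  : only names that start with metadata_string
    prefix=False : names that contain metadata_string anywhere
    """
    if prefix:
        return [c for c in columns if c.startswith(metadata_string)]
    return [c for c in columns if metadata_string in c]
```

With substring matching, a feature whose name happens to contain the string would also be classed as metadata, so the choice of string matters.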

Stop using deprecated Imputer from sklearn

tests/test_utils.py::test_impute
tests/test_utils.py::test_impute_mean
tests/test_utils.py::test_impute_with_metadata
  /home/scott/anaconda3/lib/python3.7/site-packages/sklearn/utils/deprecation.py:66:
    DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version
    0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
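
As the warning suggests, the replacement is `sklearn.impute.SimpleImputer`. A minimal sketch of an impute function built on it is shown below. The function name and signature are assumptions and may differ from the existing utils code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute(featuredata, strategy="mean"):
    """Fill missing values column-wise and keep the index and column names.

    Note: SimpleImputer discards columns that are entirely NaN, in which
    case rebuilding the dataframe with the original column names fails.
    """
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    values = imputer.fit_transform(featuredata)
    return pd.DataFrame(values, columns=featuredata.columns, index=featuredata.index)
```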

change tests from nose to pytest

normalise error : "ValueError: operands could not be broadcast together with shapes (18000,) (20,)"

Modify dataframe.featuredata or dataframe.metadata

It would be nice to be able to modify the featuredata or metadata through its attribute.

feature_subset = ["feature_1", "feature_2"]
my_data.featuredata = my_data[feature_subset]

or

my_data.featuredata = my_data.featuredata[feature_subset]

Though at the moment, this produces the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-48-3e99f8310ea1> in <module>()
----> 1 df.featuredata = df.featredata[previous_features]

/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
   2670             if name in self._info_axis:
   2671                 return self[name]
-> 2672             return object.__getattribute__(self, name)
   2673 
   2674     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'featredata'

In [49]: df.featuredata = df.featuredata[previous_features]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   2702                 else:
-> 2703                     object.__setattr__(self, name, value)
   2704             except (AttributeError, TypeError):

AttributeError: can't set attribute

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-49-1c984f1ae98a> in <module>()
----> 1 df.featuredata = df.featuredata[previous_features]

/home/scott/.local/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   2703                     object.__setattr__(self, name, value)
   2704             except (AttributeError, TypeError):
-> 2705                 object.__setattr__(self, name, value)
   2706 
   2707     # ----------------------------------------------------------------------

AttributeError: can't set attribute
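
"can't set attribute" is the error Python raises when assigning to a property that has no setter, so one possible fix is to give the property a setter. The sketch below uses a hypothetical stand-in class (`FeatureFrame`) and assumes that feature columns are those without "Metadata" in their name. It is not morar's actual class.

```python
import pandas as pd


class FeatureFrame(pd.DataFrame):
    """Hypothetical stand-in for morar.DataFrame."""

    @property
    def _constructor(self):
        # keep the subclass when pandas builds new objects from this one
        return FeatureFrame

    def _feature_columns(self):
        return [c for c in self.columns if "Metadata" not in c]

    @property
    def featuredata(self):
        return self[self._feature_columns()]

    @featuredata.setter
    def featuredata(self, value):
        # drop feature columns missing from `value`, then overwrite the rest
        dropped = [c for c in self._feature_columns() if c not in value.columns]
        self.drop(columns=dropped, inplace=True)
        for col in value.columns:
            self[col] = value[col]
```

The setter modifies the dataframe in place and leaves the metadata columns untouched.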

Stop using deprecated .ix in tests

tests/test_stats.py::test_mad_dataframe_row
  /home/scott/code/morar/tests/test_stats.py:38: DeprecationWarning: 
  .ix is deprecated. Please use
  .loc for label based indexing or
  .iloc for positional indexing
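
For reference, the two replacements differ only in what they index by. The dataframe below is made up for illustration and is not taken from tests/test_stats.py.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["y", "a"]    # label based: row "y", column "a"
by_position = df.iloc[1, 0]    # positional: second row, first column
```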

morar.DataFrame.dropna is calling drop

Calling morar.utils.drop() results in a TypeError:

~/anaconda3/lib/python3.7/site-packages/morar/dataframe.py in dropna(self, **kwargs)
     96     def dropna(self, **kwargs):
     97         """dropna via pandas.DataFrame.dropna"""
---> 98         _check_inplace(kwargs)
     99         pandas_df = pd.DataFrame(self)
    100         result = pandas_df.dropna(**kwargs)

TypeError: _check_inplace() takes 0 positional arguments but 1 was given

Multi-indexing & databases.

Dataframes with multi-indexed columns are unlikely to convert easily to sqlite tables, so there needs to be an option to flatten the column names (a simple paste?).

Ideally would like to automatically flatten columns to store as a database table, and convert back to multi-index columns when reading in as a dataframe from the database.
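
A minimal sketch of such a round trip is shown below. It joins the column levels with a separator before writing and splits them again after reading. The helper names and the "__" separator are assumptions, and the approach breaks if a level name already contains the separator.

```python
import sqlite3

import pandas as pd

SEP = "__"  # assumed not to occur in any column level name


def flatten_columns(df, sep=SEP):
    """Join MultiIndex column levels into single strings."""
    flat = df.copy()
    flat.columns = [sep.join(map(str, col)) for col in df.columns]
    return flat


def unflatten_columns(df, sep=SEP):
    """Split flattened column names back into a MultiIndex."""
    nested = df.copy()
    nested.columns = pd.MultiIndex.from_tuples(
        [tuple(c.split(sep)) for c in df.columns]
    )
    return nested


columns = pd.MultiIndex.from_product([["Cells", "Nuclei"], ["Area", "Metadata_well"]])
df = pd.DataFrame([[1.0, "A1", 2.0, "A1"]], columns=columns)

conn = sqlite3.connect(":memory:")
flatten_columns(df).to_sql("objects", conn, index=False)
restored = unflatten_columns(pd.read_sql("SELECT * FROM objects", conn))
conn.close()
```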

stats.hampel on a Series returns an array with the wrong number of dimensions

normalise should return the whole dataframe, not just featuredata

feature_selection.find_correlation should only act on featuredata

write a wrapper for principal component calculation

Already imports scikit-learn so might as well use sklearn.decomposition.PCA

PCA on just feature data
concat principal components with metadata
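
The two steps above might look something like this. The function name `pca`, the `metadata_string` parameter, and the substring rule for identifying metadata columns are assumptions made for the sketch.

```python
import pandas as pd
from sklearn.decomposition import PCA


def pca(df, metadata_string="Metadata", n_components=2):
    """Run PCA on the feature columns and re-attach the metadata columns."""
    meta_cols = [c for c in df.columns if metadata_string in c]
    feature_cols = [c for c in df.columns if metadata_string not in c]
    components = PCA(n_components=n_components).fit_transform(df[feature_cols])
    pc_names = ["PC{}".format(i + 1) for i in range(n_components)]
    pcs = pd.DataFrame(components, columns=pc_names, index=df.index)
    # concatenate principal components with the metadata
    return pd.concat([pcs, df[meta_cols]], axis=1)
```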

export common functions directory into toplevel namespace

store object files as MultiIndex DataFrame

Different objects are likely to have columns with identical names, so we should try to use MultiIndex dataframes in pandas.

i.e.

              object 1                |                object 2
         featuredata | metadata       |          featuredata | metadata
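
In pandas, one way to build column levels like this is `pd.concat` with `keys`. The sketch below uses made-up object tables and only two levels (object, then column name), which is simpler than the layout drawn above.

```python
import pandas as pd

# hypothetical per-object tables with identical column names
cells = pd.DataFrame({"Area": [10, 12], "Metadata_well": ["A1", "A2"]})
nuclei = pd.DataFrame({"Area": [3, 4], "Metadata_well": ["A1", "A2"]})

# the keys become the outer column level, so the names no longer clash
objects = pd.concat([cells, nuclei], axis=1, keys=["Cells", "Nuclei"])

nuclei_area = objects[("Nuclei", "Area")]
cells_only = objects["Cells"]  # selects all columns of one object
```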

parallelise normalise functions

Work is done by splitting the dataframe into groups via .groupby() and looping through each group. The groups could be processed in parallel to speed things up.

e.g. from Stack Overflow:

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed


def tmpFunc(df):
    # return a copy with the new column rather than modifying the group in place
    return df.assign(c=df.a + df.b)


def applyParallel(dfGrouped, func):
    # run func on each group in a separate worker, then row-bind the results
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped
    )
    return pd.concat(retLst)


if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))

    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))

    # ideal version (does not work, GroupBy has no applyParallel method):
    # df.groupby(df.index).applyParallel(tmpFunc)
