rushter / heamy

Stars: 550 | Watchers: 16 | Forks: 114 | Size: 92 KB

A set of useful tools for competitive data science.

Home Page: http://heamy.readthedocs.io/en/latest/

License: MIT License

Makefile 3.12% Python 96.88%

machine-learning data-science stacking

heamy's Introduction

Twitter: @rushter
Blog: https://rushter.com/blog/

heamy's People

Contributors

Stargazers

Watchers

Forkers

caiotaniguchi shannonyu zihaovgw ompanda ferrine mathkann libardo1 vyraun collawolley plantsgo ash-datalytica tshilidzimudau laol777 hhh920406 lyq617 yanghaha11514 tdilcy haoxuu benjamesbabala falconzyx chagge everyonelijin lidaguo ajoeajoe amorgun prokopyev ntopi kesjien kvr777 nkhuyu selvamshan futurev ruting1 dayeren scofieldyoo tongli12 zwt233 kevinwkc xuliwu whmnoe4j aabdrabbou xzstanford joejiong karammawas xiaomaohoujiao2 afcarl kriswu618 babylls ranjiththavamaniraj ringwraith liyi19950329 mejihero tkainazarov godkillok zhangjm12 joey10huawei karangautam afaist triper1022 lyj555 tkazusa alexanderspiridonov bezova ailzy dl-talent2g sumitsidana mengd2 carrychang charygao jorgeporca lidadreamer mlbyte aihill rafmacalaba yuanjie-ai magnieet huanzhang999 bigbearstar xxzcool h4rr9 kcostya marksendong kartik-nighania lxs0202 neverload phamcuong92 priyatamnayak xuejunwinner ctcome boyle-coffee limingbei angelacy kmori1229 sharov-am maxvv lpcteste toraaglobal tuzhirun juandag97 leumastai

heamy's Issues

Using different feature set for each model

It is advised to use different feature sub-sets across the models for diversity.

Is this possible with heamy?
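To make the idea concrete, here is a minimal sketch of stacking where each base model sees its own column subset. It uses scikit-learn only and is not heamy's API; the model names, the column choices, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=120, n_features=6, noise=0.1, random_state=0)

# Each base model is paired with its own (hypothetical) feature subset.
models = {
    "ridge": (Ridge(), [0, 1, 2, 3]),
    "tree": (DecisionTreeRegressor(max_depth=4, random_state=0), [2, 3, 4, 5]),
}

# Out-of-fold predictions: every row is predicted by a model that
# did not see that row during fitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
meta_features = np.column_stack([
    cross_val_predict(estimator, X[:, columns], y, cv=cv)
    for estimator, columns in models.values()
])

# The second-level model is trained on the stacked predictions.
stacker = LinearRegression().fit(meta_features, y)
```

The same folds are reused for every base model, so the out-of-fold columns line up row by row even though the models were trained on different features.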

Scipy sparse matrices support

Heamy does not seem to support sparse matrices at the moment.

When I create a dataset where X_train and X_test are scipy sparse matrices, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-cc350d1da8a6> in <module>()
      1 pipeline = ModelsPipeline(*classifiers)
----> 2 pipeline.stack()

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/pipeline.pyc in stack(self, k, stratify, shuffle, seed, full_test, add_diff)
    131 
    132         for model in self.models:
--> 133             result = model.stack(k=k, stratify=stratify, shuffle=shuffle, seed=seed, full_test=full_test)
    134             train_df = pd.DataFrame(result.X_train, columns=generate_columns(result.X_train, model.name))
    135             test_df = pd.DataFrame(result.X_test, columns=generate_columns(result.X_test, model.name))

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/estimator.pyc in stack(self, k, stratify, shuffle, seed, full_test)
    245         if self.use_cache:
    246             pdict = {'k': k, 'stratify': stratify, 'shuffle': shuffle, 'seed': seed, 'full_test': full_test}
--> 247             dhash = self._dhash(pdict)
    248             c = Cache(dhash, prefix='s')
    249             if c.available:

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/estimator.pyc in _dhash(self, params)
    132         """Get hash of the dictionary object."""
    133         m = hashlib.new('md5')
--> 134         m.update(self.hash.encode('utf-8'))
    135         for key in sorted(params.keys()):
    136             h_string = ('%s-%s' % (key, params[key])).encode('utf-8')

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/estimator.pyc in hash(self)
     78                 m.update(h_string)
     79             m.update(self.estimator_name.encode('utf-8'))
---> 80             m.update(self.dataset.hash.encode('utf-8'))
     81 
     82             if not self._is_class:

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/dataset.pyc in hash(self)
    235             m = hashlib.new('md5')
    236             if self._preprocessor is None:
--> 237                 m.update(numpy_buffer(self._X_train))
    238                 m.update(numpy_buffer(self._y_train))
    239                 if self._X_test is not None:

/home/agrigorev/anaconda2/lib/python2.7/site-packages/heamy/cache.pyc in numpy_buffer(ndarray)
     55         ndarray = ndarray.values
     56 
---> 57     if ndarray.flags.c_contiguous:
     58         obj_c_contiguous = ndarray
     59     elif ndarray.flags.f_contiguous:

/home/agrigorev/anaconda2/lib/python2.7/site-packages/scipy/sparse/base.pyc in __getattr__(self, attr)
    523             return self.getnnz()
    524         else:
--> 525             raise AttributeError(attr + " not found")
    526 
    527     def transpose(self):

AttributeError: flags not found

The matrices are obtained via DictVectorizer from sklearn.

As a temporary workaround, I use X.toarray().
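The traceback above ends in numpy_buffer reading ndarray.flags, an attribute that scipy sparse matrices do not expose. One possible direction, sketched here under assumptions, is to hash the CSR components directly instead of densifying. The function name matrix_digest is hypothetical and is not part of heamy.

```python
import hashlib

import numpy as np
from scipy import sparse


def matrix_digest(X):
    """Return an MD5 hex digest for a dense array or a scipy sparse matrix.

    Hypothetical helper for illustration; not heamy's implementation.
    """
    m = hashlib.md5()
    if sparse.issparse(X):
        # Work on a canonical CSR copy so equal matrices hash equally.
        X = X.tocsr(copy=True)
        X.sum_duplicates()
        X.sort_indices()
        m.update(("%s-%s" % (X.shape, X.dtype)).encode("utf-8"))
        for part in (X.data, X.indices, X.indptr):
            m.update(np.ascontiguousarray(part).tobytes())
    else:
        X = np.ascontiguousarray(X)
        m.update(("%s-%s" % (X.shape, X.dtype)).encode("utf-8"))
        m.update(X.tobytes())
    return m.hexdigest()


dense = np.array([[1.0, 0.0], [0.0, 2.0]])
a = sparse.csr_matrix(dense)
b = sparse.csr_matrix(dense.copy())
c = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 3.0]]))
```

Hashing data, indices, and indptr keeps memory use proportional to the number of stored entries, which matters for the wide matrices DictVectorizer tends to produce.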

Algorithms references

@rushter, thanks for your library!

Could you please add references (articles, etc.) that describe exactly which algorithms heamy implements?

Thanks.

[question] feature encoding - save state

Hi,

Does this really one-hot encode features to the same labels every time? I am thinking of deploying my model as an API, where new labels may show up.

Your implementation seems to handle only the existing labels; there is no way to raise an error on, or ignore, newly arriving labels.
It is also unclear to me how the labels are stored in a pipeline so that new incoming data can be encoded with the fitted labels.

train[column] = train[column].astype('category', categories=categories)
test[column] = test[column].astype('category', categories=categories)
# from: https://github.com/rushter/heamy/blob/master/heamy/feature.py
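As a rough illustration of what saving the encoding state could look like, the sketch below fixes the category list from the training data and reuses it for incoming rows, so an unseen label becomes an all-zero row instead of a new column. It relies on pandas' CategoricalDtype, since, as far as I know, recent pandas releases no longer accept the astype('category', categories=...) form quoted above. The column name, the data, and the encode helper are assumptions for illustration, not heamy code.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"color": ["red", "green", "red"]})
new = pd.DataFrame({"color": ["green", "blue"]})  # "blue" was never seen in training

# The saved state: the ordered category list learned from the training data.
categories = sorted(train["color"].unique())
color_dtype = CategoricalDtype(categories=categories)


def encode(df, column, dtype):
    """One-hot encode `column` against a fixed category list (hypothetical helper)."""
    values = df[column].astype(dtype)  # labels outside `dtype` become NaN
    return pd.get_dummies(values, prefix=column)


train_encoded = encode(train, "color", color_dtype)
new_encoded = encode(new, "color", color_dtype)
```

Persisting `categories` (for example alongside the model file) is what lets a deployed service produce the same columns later. To reject unseen labels instead of ignoring them, one could check `values.isna()` against the raw column and raise.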

Configure Cache Path

How is it possible to set a custom path for cached objects?

class used in a function without importing it first

I tried the "heamy" module as shown in this example and it works as expected.

https://github.com/rushter/heamy/blob/master/examples/walkthrough.ipynb

I expected line 24 to fail, because the Sequential class is not imported anywhere in the script. I would guess the mlp_model function should not complete without an error. What am I missing?
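One likely explanation, offered as a guess about this case, is Python's late name lookup: a name used inside a function body is resolved only when the function is called, and it is looked up in the module (or notebook) globals at that moment. The sketch below shows the behaviour with a stand-in class; build_model and the dummy Sequential are hypothetical and unrelated to the notebook's actual code.

```python
def build_model():
    # `Sequential` is not resolved here at definition time;
    # it is looked up in the module globals only when this line runs.
    return Sequential()


# Defining the function succeeded. Calling it before the name exists fails.
try:
    build_model()
except NameError as exc:
    first_error = str(exc)


class Sequential:
    """Stand-in for a class that gets imported or defined later."""


# The same function now works, because the global name exists at call time.
model = build_model()
```

So a function that references an unimported class fails only if it is actually called, and only if nothing (such as an earlier notebook cell) has put that name into the global namespace in the meantime.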

How to predict the result of your stacking process

Hi,

First, thanks for your work. I'm pretty excited to test it and play with it!

I searched the documentation and examples but couldn't find it. I would like to use my stacking process to predict results (like .predict() in scikit-learn).

I made a notebook to illustrate my problem.

I'm pretty sure I'm missing something...
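For comparison, this is the fit-then-predict workflow the question describes, sketched with scikit-learn's StackingRegressor. It is not heamy's API, and the choice of base models and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,  # base-model predictions for the final estimator come from 5-fold CV
)

stack.fit(X_train, y_train)

# Predicting on unseen rows runs the base models, then the final estimator.
predictions = stack.predict(X_new)
```

The fitted object keeps both levels together, which is what makes a single predict() call on new data possible.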
