How to use GroupKFold? about lofo-importance HOT 10 CLOSED

aerdem4 commented on May 24, 2024

How to use GroupKFold?

from lofo-importance.

Comments (10)

aerdem4 commented on May 24, 2024

You can actually provide any (train_index, test_index) iterator to the cv parameter. sklearn's cross_validate function accepts both KFold objects and iterators (the output of a KFold object's split method) as inputs. An example would be:

lofo_imp = LOFOImportance(dataset, cv=GroupKFold(4).split(X, y, groups), scoring="roc_auc")
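To make the "(train_index, test_index) iterator" concrete, here is a minimal sketch of what GroupKFold's split produces. The toy arrays X, y and groups are made up for illustration and are not from this issue.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 12 rows belonging to 4 groups of 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)

# split() yields (train_index, test_index) pairs of integer row positions.
folds = list(GroupKFold(n_splits=4).split(X, y, groups))

for train_idx, test_idx in folds:
    # A group never appears on both sides of the same split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

n_folds = len(folds)
```

Each pair holds positional indices into X, so the data passed to split should be in the same row order as the data the model is later evaluated on.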

aerdem4 commented on May 24, 2024

Newer sklearn versions seem to have problems with iterables in cross_validate. Converting the iterable to a list is a workaround:

lofo_imp = LOFOImportance(dataset, cv=list(GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id'])), scoring="neg_mean_absolute_error")
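One likely reason the list conversion helps: split returns a one-shot generator, and the traceback in this thread shows get_importance calling _get_cv_score repeatedly, so a generator may already be empty on a later call. The sketch below only demonstrates the generator behaviour on made-up toy data. It does not call LOFO itself.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 8 rows in 4 groups of 2 rows each.
X = np.zeros((8, 1))
y = np.zeros(8)
groups = np.repeat([0, 1, 2, 3], 2)

# A generator can be iterated only once.
gen = GroupKFold(n_splits=4).split(X, y, groups)
first_pass = sum(1 for _ in gen)   # consumes all folds
second_pass = sum(1 for _ in gen)  # nothing left to iterate

# A list can be iterated any number of times.
folds = list(GroupKFold(n_splits=4).split(X, y, groups))
list_first = sum(1 for _ in folds)
list_second = sum(1 for _ in folds)
```

Here first_pass counts every fold while second_pass finds none, which is consistent with the "got 0" and "list index out of range" errors reported in this thread.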

RainFung commented on May 24, 2024

Thanks. It would be good to add some documentation about this.

graceyangfan commented on May 24, 2024

@aerdem4 I get this error when using GroupKFold:

In
cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params)

ValueError: not enough values to unpack (expected 3, got 0)

BartlomiejSkwira commented on May 24, 2024

Sklearn's cross_validate function (which lofo-importance uses in LOFOImportance._get_cv_score) has a groups keyword argument. I forked this repo and added it there. You can see it in this PR BartlomiejSkwira#1 (it's a work in progress and still requires tests).

@aerdem4 would it be a good PR candidate for your repo?
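For reference, this is roughly how the groups keyword of cross_validate is used in plain sklearn, outside of LOFO. The data and the LogisticRegression model below are made up for illustration. This is a sketch of the sklearn call only, not of the code in the PR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate

# Toy data (hypothetical): 40 rows, 8 groups of 5 rows, alternating labels.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 20)
X = np.column_stack([y + rng.normal(scale=0.1, size=40), rng.normal(size=40)])
groups = np.repeat(np.arange(8), 5)

# With a splitter object and groups=..., cross_validate calls split() itself,
# so no generator has to be created or stored by the caller.
cv_results = cross_validate(
    LogisticRegression(),
    X,
    y,
    groups=groups,
    cv=GroupKFold(n_splits=4),
    scoring="accuracy",
)
n_scores = len(cv_results["test_score"])
```

Because the splitter object is passed rather than its split output, the folds are regenerated on every call to cross_validate.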

aerdem4 commented on May 24, 2024

@BartlomiejSkwira GroupKFold is supported with the workaround above. Your PR looks nice, but it only covers one out of many validation schemes. From a minimalistic point of view, I think keeping the repo without special cases may be better. But if you have an idea for including the most common validation schemes in a generic way, you are welcome.

BartlomiejSkwira commented on May 24, 2024

@aerdem4 This workaround didn't work for me. I would get:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    ...
<my code calling lofo_importance.get_importance()>
   ...
  File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 85, in get_importance
    lofo_cv_scores.append(self._get_cv_score(feature_to_remove=f))
  File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 59, in _get_cv_score
    cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params, groups=self.groups)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 260, in cross_validate
    results = _aggregate_score_dicts(results)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 1675, in _aggregate_score_dicts
    for key in scores[0]
IndexError: list index out of range

I wonder if I used it correctly, here is how I used lofo:

from sklearn import ensemble, model_selection, pipeline
from lofo import Dataset, LOFOImportance

pipe = pipeline.Pipeline(steps=[("cls", ensemble.RandomForestClassifier(random_state=RANDOM_STATE))])
cv = model_selection.GroupKFold(n_splits=N_SPLITS)
search = model_selection.GridSearchCV(
    pipe,
    param_grid,
    n_jobs=-1,
    scoring=scoring,
    cv=cv,
    verbose=0,
    refit=True,
)
search.fit(X, y, groups=groups)
dataset = Dataset(
    df=df,
    target="some_target",
    features=attribute_columns,
)

# define the validation scheme and scorer.
lofo_importance = LOFOImportance(
    dataset,
    cv=cv.split(X, y, groups),
    scoring=scoring,
    model=search.best_estimator_,
    n_jobs=n_jobs,
    # groups=groups,
)

# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_importance.get_importance()  # this line throws an exception

aerdem4 commented on May 24, 2024

Can you check the length of the list generated by cv.split just before feeding it to LOFO? The functions you call beforehand can mutate cv, and cv.split may then return an empty list.
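A check like this could be wrapped in a small helper. The sketch below uses a hypothetical function name, materialize_folds, and a toy generator in place of cv.split. Neither is part of lofo-importance or sklearn.

```python
def materialize_folds(split_iter, expected_n_splits):
    """Turn a split iterator into a reusable list and verify the fold count."""
    folds = list(split_iter)
    if len(folds) != expected_n_splits:
        raise ValueError(
            f"expected {expected_n_splits} folds, got {len(folds)}"
        )
    return folds


def toy_split():
    # Stand-in for cv.split(...): yields (train_index, test_index) pairs.
    yield [0, 1], [2, 3]
    yield [2, 3], [0, 1]


gen = toy_split()
folds = materialize_folds(gen, expected_n_splits=2)

# Passing the same, already consumed generator again is caught by the check.
try:
    materialize_folds(gen, expected_n_splits=2)
    exhausted_detected = False
except ValueError:
    exhausted_detected = True
```

The returned list can then be passed as the cv argument, and an empty or short fold list fails early with a clear message instead of surfacing deep inside cross_validate.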

aerdem4 commented on May 24, 2024

@graceyangfan How do you use GroupKFold? The way I suggested? Can you check the input or share reproducible code?

Quetzalcohuatl commented on May 24, 2024

Getting the same error as Grace.

lofo_imp = LOFOImportance(dataset, cv=GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id']), scoring="neg_mean_absolute_error")

ValueError: not enough values to unpack (expected 3, got 0)
