Timeseries Cross Validation about hyperparameter_hunter HOT 6 OPEN

jmrichardson commented on May 25, 2024 1

Timeseries Cross Validation

Comments (6)

HunterMcGushion commented on May 25, 2024

Thanks for opening this, and thank you for the example code and traceback!

It looks like the issue stems from the fact that cv_params is used to initialize the cv_type class. There's currently no method of providing extra arguments to the split method of cv_type, which is where PurgedWalkForwardCV seems to expect pred_times and eval_times.
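A minimal sketch of the limitation described above, using hypothetical stand-ins (`ToySplitter`, `run_cv`) rather than HH's actual internals or the real timeseriescv classes. It assumes only what the comment states: `cv_params` reach the constructor, and `split` is called without the extra arguments.

```python
class ToySplitter:
    """Hypothetical splitter whose split() needs extra per-sample arguments."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def split(self, X, y=None, pred_times=None, eval_times=None):
        if pred_times is None or eval_times is None:
            raise TypeError("split() requires pred_times and eval_times")
        half = len(X) // 2
        yield list(range(half)), list(range(half, len(X)))


def run_cv(cv_type, cv_params, X, y=None):
    """Hypothetical driver: cv_params only ever reach the constructor."""
    cv = cv_type(**cv_params)
    # Nothing here can forward pred_times / eval_times to split()
    return list(cv.split(X, y))


try:
    run_cv(ToySplitter, dict(n_splits=2), X=list(range(10)))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Because the driver only calls `split(X, y)`, the splitter never sees the timestamps it needs, whatever is placed in `cv_params`.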

I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?

At the risk of making myself sound like a dummy, I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?

jmrichardson commented on May 25, 2024

"At the risk of making myself sound like a dummy"

Lol! That is not possible!

"I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?"

They are not features in the DataFrame. They are distinct pandas Series holding the timestamps of when an equity trade is made. In finance, we typically want to use a walk-forward analysis in which we remove training samples whose eval times fall after the validation prediction times. These samples are removed based on the pred_times and eval_times that PurgedWalkForwardCV needs as arguments. In my case, I have a training set and two series of timestamps, pred_times and eval_times, whose date indexes all match.
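The purging idea can be sketched on toy data. This is an illustration only, not timeseriescv's implementation: it uses plain `datetime` lists instead of pandas Series, the fold boundaries are made up, and the strict `<` comparison is an assumed convention.

```python
from datetime import datetime, timedelta

# Toy data: five samples, each predicted on consecutive days
# and evaluated two days after its prediction time.
start = datetime(2020, 1, 1)
pred_times = [start + timedelta(days=i) for i in range(5)]
eval_times = [t + timedelta(days=2) for t in pred_times]

train_idx = [0, 1, 2]  # hypothetical training fold
valid_idx = [3, 4]     # hypothetical validation fold

# Earliest prediction time in the validation fold
valid_start = min(pred_times[i] for i in valid_idx)

# Purge: keep only training samples whose evaluation ends
# before the validation fold begins.
purged_train_idx = [i for i in train_idx if eval_times[i] < valid_start]

print(purged_train_idx)
```

Samples 1 and 2 are dropped because their evaluation windows reach into the validation period, which would otherwise leak information into training.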

"I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?"

I created 3 pickle files for the data set (X_train, eval_times, pred_times):

https://github.com/jmrichardson/data

Here's a sample of how to make the splits:

from timeseriescv.cross_validation import PurgedWalkForwardCV
cv = PurgedWalkForwardCV(n_splits=5)

count = 0
for train_set, test_set in cv.split(X_train, pred_times=pred_times, eval_times=eval_times, split_by_time=False):
    count += 1
    print(count)

Thank you so much for your help :)

HunterMcGushion commented on May 25, 2024

Sorry about the delay! The TL;DR version of my findings is that I don’t think HH can support time-series CV right now. Here’s the long version:

I was able to throw together a quick-and-dirty subclass of PurgedWalkForwardCV that got past the error you posted. I’ll include the code for it below, along with your snippet, slightly modified to use the new subclass. It got past the TypeError and successfully made predictions and evaluations for the first three folds, then failed because it expected data for all n_splits when there was none. As I mentioned, I’m no expert on time-series forecasting, so I only just realized that OOF predictions won’t be generated for all of the training data because of how time-series CV schemes work. In the end, the Experiment tried to keep going past the third fold, on to a fourth and fifth because n_splits=5; however, there are actually only 3 splits (as your second snippet shows).

So I don’t think we can get this working properly just by using some combination of custom cv_type classes and lambda_callbacks. I believe a new Experiment class would need to be added specifically to deal with time-series data. Because the Experiment class is built out of very modular callback classes that it dynamically inherits at instantiation, I think we could get by with just some new predictor/evaluator callbacks that make predictions and evaluations for the appropriate number of splits in a time-series problem. If you’re interested in contributing, I’d love to work together to add support for this. At the moment, I’m still not familiar enough with time-series problems to get the job done myself. Either way, I’d love to see this added soon!

My quick and dirty PurgedWalkForwardCV subclass code follows, along with the output.

from hyperparameter_hunter import Environment, CVExperiment
from timeseriescv.cross_validation import PurgedWalkForwardCV
from xgboost import XGBClassifier
import pandas as pd

class UglyPurgedWalkForwardCV(PurgedWalkForwardCV):
    def __init__(self, pred_times=None, eval_times=None, split_by_time=False, **kwargs):
        """Override initialization to receive the three extra kwargs expected by 
        :meth:`split`. Mangle the attribute names to avoid any possible 
        collisions with the original attributes of :class:`PurgedWalkForwardCV`"""
        self.__pred_times = pred_times
        self.__eval_times = eval_times
        self.__split_by_time = split_by_time
        super().__init__(**kwargs)

    def split(self, X, y=None, **kwargs):
        """Override `split` to look more like SKLearn's CV classes, and fetch the 
        mangled attributes set on initialization, rather than expecting them here"""
        return super().split(
            X,
            y,
            pred_times=self.__pred_times,
            eval_times=self.__eval_times,
            split_by_time=self.__split_by_time
        )

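if __name__ == "__main__":
    # train_data_path, pred_times_path, and eval_times_path are not defined in this
    # snippet; they should point to the three pickle files shared above
    # (X_train, pred_times, eval_times)
    data_df = pd.read_pickle(train_data_path)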
    p_times = pd.read_pickle(pred_times_path)
    e_times = pd.read_pickle(eval_times_path)

    env = Environment(
        train_dataset=data_df,
        target_column="bin",
        results_path="HyperparameterHunterAssets",
        metrics=["roc_auc_score"],
        cv_type=UglyPurgedWalkForwardCV,
        cv_params=dict(n_splits=5, pred_times=p_times, eval_times=e_times),
    )

    exp = CVExperiment(XGBClassifier)

Output/error traceback:

Cross-Experiment Key:   'pncMgwGMRAZMDReR0Sd8_6PF3817sHiugD2LI1SYXGI='
<18:28:05> Initialized Experiment: 'ee1aed75-4d3c-4d0d-8d51-9e0aa448ee97'
<18:28:05> Hyperparameter Key:     'o-wi1kDtaizmgwFvBNdNJDF7X_OJBQPoj0iG3Gne-SM='
<18:28:05>
<18:28:05> R0-f0-r-  |  OOF(roc_auc_score=0.43833)  |  Time: 0.04179 s
<18:28:05> R0-f1-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.04892 s
<18:28:05> R0-f2-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.05401 s
<18:28:05> Uncaught exception!   RuntimeError: generator raised StopIteration
Traceback (most recent call last):
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 792, in <genexpr>
    yield (next(indices) for _ in range(cv_params["n_splits"]))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "time_series_cv_example.py", line 82, in <module>
    execute()
  File "time_series_cv_example.py", line 78, in execute
    exp = CVExperiment(XGBClassifier)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 749, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 595, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 303, in __init__
    self.experiment_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 335, in experiment_workflow
    self.execute()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 607, in execute
    self.cross_validation_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 623, in cross_validation_workflow
    for self._fold, (self.train_index, self.validation_index) in enumerate(rep_indices):
RuntimeError: generator raised StopIteration
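The RuntimeError in the traceback above can be reproduced without HH. The sketch below assumes a hypothetical splitter (`toy_splits`) that yields only 3 folds while 5 are requested. Calling `next()` on an exhausted iterator inside a generator expression raises StopIteration, which Python 3.7+ converts to RuntimeError (PEP 479).

```python
from itertools import islice


def toy_splits(n_available):
    """Hypothetical splitter that yields fewer folds than requested."""
    for fold in range(n_available):
        yield fold


requested = 5  # plays the role of cv_params["n_splits"]
indices = toy_splits(n_available=3)

# Same shape as the expression in the traceback: next() is called
# a fixed number of times, regardless of how many folds exist.
rep_indices = (next(indices) for _ in range(requested))

seen = []
try:
    for fold in rep_indices:
        seen.append(fold)
    error = None
except RuntimeError as exc:
    error = str(exc)

print(seen, error)

# One possible alternative: islice stops quietly when the splitter runs out.
safe = list(islice(toy_splits(n_available=3), requested))
print(safe)
```

The `islice` variant only avoids the crash. As noted above, the prediction and evaluation steps would still need to handle a fold count that differs from n_splits.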

I'd love to hear your thoughts on adding support for time-series problems! Sorry it's not working at the moment, though!

jmrichardson commented on May 25, 2024

Thank you so much for your effort to get this to work. I should be able to work with the sklearn TimeSeriesSplit method, so it's not a huge deal. I will definitely look into how I can help to support timeseriescv. I am working on a project at the moment that is taking a considerable amount of time, so I'm not sure I can look into it for a few weeks. I will definitely keep you posted on how it goes.

On a side note, this package is really nice. I really appreciate you sharing this to the community! It will definitely be part of my toolbox going forward!

HunterMcGushion commented on May 25, 2024

Does SKLearn's TimeSeriesSplit work with HH? If so, I should probably add an example using it... Thanks a lot for your support!
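For reference, a small sketch of how scikit-learn's TimeSeriesSplit behaves on its own, using made-up toy data. It does not show whether HH accepts it as a cv_type, which is the open question here. It only shows that `split` takes no extra timestamp arguments and yields exactly n_splits folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 time-ordered samples with a single feature
X = np.arange(12).reshape(-1, 1)

cv = TimeSeriesSplit(n_splits=5)
folds = list(cv.split(X))  # split() needs only X here

for train_idx, valid_idx in folds:
    # Each training window ends before its validation window starts
    print(train_idx, valid_idx)

print(len(folds))
```

Because the number of folds matches n_splits, this scheme would not exhaust the index generator the way the 3-of-5 case in the traceback did.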

jmrichardson commented on May 25, 2024

Sorry for the delay; I have been traveling and will be for the next couple of weeks. I recall that it accepted the parameters (i.e., the lack of event times). However, I don't recall whether I fully tested sklearn's TimeSeriesSplit. I will give it a shot soon and report back. Thanks again for your help on this.
