Comments (6)
Thanks for opening this, and thank you for the example code and traceback!
It looks like the issue stems from the fact that cv_params is used to initialize the cv_type class. There's currently no method of providing extra arguments to the split method of cv_type, which is where PurgedWalkForwardCV seems to expect pred_times and eval_times.
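To make the mismatch concrete, here is a minimal sketch in plain Python. ToySplitter and run_cv are hypothetical stand-ins, not the real timeseriescv or HyperparameterHunter code; they only assume the calling pattern described above, where cv_params go to the constructor and split is called with the data alone.

```python
class ToySplitter:
    """Hypothetical splitter whose `split` needs extra, data-dependent arguments."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, *, pred_times, eval_times):
        # Both keyword-only arguments are required, so split(X, y) alone fails
        fold_size = len(X) // (self.n_splits + 1)
        for k in range(1, self.n_splits + 1):
            train = list(range(0, k * fold_size))
            test = list(range(k * fold_size, (k + 1) * fold_size))
            yield train, test


def run_cv(cv_type, cv_params, X, y=None):
    """Assumed calling pattern: cv_params reach __init__, never split."""
    cv = cv_type(**cv_params)
    return list(cv.split(X, y))


X = list(range(8))
try:
    run_cv(ToySplitter, {"n_splits": 3}, X)
    outcome = "ok"
except TypeError as error:
    outcome = f"TypeError: {error}"
print(outcome)
```

In this toy setup, the only place a caller can supply pred_times and eval_times is the constructor, which is what any workaround has to exploit.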
I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?
At the risk of making myself sound like a dummy, I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?
from hyperparameter_hunter.
At the risk of making myself sound like a dummy
Lol! That is not possible!
I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?
They are not features in the DataFrame. They are distinct pandas Series of timestamps marking when an equity trade is made. In finance, we typically want to use a walk-forward analysis in which we remove training samples whose eval times fall after the validation prediction times. These samples are removed based on the pred_times and eval_times passed as arguments to PurgedWalkForwardCV. In my case, I have a training set and two series of timestamps for pred_times and eval_times, where the date indexes all match.
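As a rough illustration of the purging rule described above, here is a minimal sketch in plain Python. The function purge_train_indices and the toy timestamps are made up for this example, and the handling of the exact boundary (kept when the eval time equals the first validation prediction time) is an assumption; the real PurgedWalkForwardCV works on pandas Series and may differ in detail.

```python
from datetime import datetime, timedelta


def purge_train_indices(train_idx, test_idx, pred_times, eval_times):
    """Drop training samples whose evaluation time is later than the earliest
    prediction time in the test set (illustrative purging rule only)."""
    first_test_pred = min(pred_times[i] for i in test_idx)
    return [i for i in train_idx if eval_times[i] <= first_test_pred]


# Toy data (made up): one sample per day, each evaluated 3 days after prediction
start = datetime(2020, 1, 1)
pred_times = [start + timedelta(days=d) for d in range(10)]
eval_times = [t + timedelta(days=3) for t in pred_times]

train_idx = list(range(0, 7))   # samples 0..6 come first in time
test_idx = list(range(7, 10))   # samples 7..9 are held out for validation

purged = purge_train_indices(train_idx, test_idx, pred_times, eval_times)
print(purged)  # [0, 1, 2, 3, 4]
```

Samples 5 and 6 are dropped because their outcomes are only known after the first validation prediction is made, so training on them would leak information.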
I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?
I created 3 pickle files for the data set (X_train, eval_times, pred_times):
https://github.com/jmrichardson/data
Here's a sample of how to make the splits:
from timeseriescv.cross_validation import PurgedWalkForwardCV

cv = PurgedWalkForwardCV(n_splits=5)
count = 0
for train_set, test_set in cv.split(X_train, pred_times=pred_times, eval_times=eval_times, split_by_time=False):
    count += 1
    print(count)
Thank you so much for your help :)
from hyperparameter_hunter.
Sorry about the delay! The TL;DR version of my findings is that I don’t think HH can support time-series CV right now. Here’s the long version:
I was able to throw together a quick/dirty subclass of PurgedWalkForwardCV that got past the error you posted. I’ll include the code for it below, along with your snippet, slightly modified to use the new subclass. Anyways, it got past the TypeError and successfully made predictions and evaluations for folds, until it started expecting data for all n_splits when there was none. As I mentioned, I’m no expert on time-series forecasting, so I just realized that OOF predictions won’t actually be generated for all of the training data because of how time-series CV schemes work. In the end, the Experiment tried to keep going past the third fold, on to a fourth and fifth because n_splits=5; however, there are actually only 3 splits (as your second snippet shows).
So I don’t think we can get this working properly just by using some combination of custom cv_type classes and lambda_callbacks. I believe a new Experiment class would need to be added to specifically deal with time-series data. Because the Experiment class is built out of very modular callback classes that it dynamically inherits at instantiation, I think we could get by with just some new predictors/evaluators callbacks that are updated to make predictions and evaluations for the appropriate number of splits for a time-series problem. If you’re interested in contributing, I’d love to work together to add support for this. At the moment, I’m still not familiar enough with time-series problems to be able to get the job done myself. Either way, I’d love to see this added soon!
My quick and dirty PurgedWalkForwardCV subclass code follows, along with the output.
from hyperparameter_hunter import Environment, CVExperiment
from timeseriescv.cross_validation import PurgedWalkForwardCV
from xgboost import XGBClassifier
import pandas as pd


class UglyPurgedWalkForwardCV(PurgedWalkForwardCV):
    def __init__(self, pred_times=None, eval_times=None, split_by_time=False, **kwargs):
        """Override initialization to receive the three extra kwargs expected by
        :meth:`split`. Mangle the attribute names to avoid any possible
        collisions with the original attributes of :class:`PurgedWalkForwardCV`"""
        self.__pred_times = pred_times
        self.__eval_times = eval_times
        self.__split_by_time = split_by_time
        super().__init__(**kwargs)

    def split(self, X, y=None, **kwargs):
        """Override `split` to look more like SKLearn's CV classes, and fetch the
        mangled attributes set on initialization, rather than expecting them here"""
        return super().split(
            X,
            y,
            pred_times=self.__pred_times,
            eval_times=self.__eval_times,
            split_by_time=self.__split_by_time,
        )


if __name__ == "__main__":
    # train_data_path, pred_times_path, and eval_times_path are not defined in
    # this snippet; point them at the three pickle files linked above
    data_df = pd.read_pickle(train_data_path)
    p_times = pd.read_pickle(pred_times_path)
    e_times = pd.read_pickle(eval_times_path)

    env = Environment(
        train_dataset=data_df,
        target_column="bin",
        results_path="HyperparameterHunterAssets",
        metrics=["roc_auc_score"],
        cv_type=UglyPurgedWalkForwardCV,
        cv_params=dict(n_splits=5, pred_times=p_times, eval_times=e_times),
    )

    exp = CVExperiment(XGBClassifier)
Output/error traceback:
Cross-Experiment Key: 'pncMgwGMRAZMDReR0Sd8_6PF3817sHiugD2LI1SYXGI='
<18:28:05> Initialized Experiment: 'ee1aed75-4d3c-4d0d-8d51-9e0aa448ee97'
<18:28:05> Hyperparameter Key: 'o-wi1kDtaizmgwFvBNdNJDF7X_OJBQPoj0iG3Gne-SM='
<18:28:05>
<18:28:05> R0-f0-r- | OOF(roc_auc_score=0.43833) | Time: 0.04179 s
<18:28:05> R0-f1-r- | OOF(roc_auc_score=0.50000) | Time: 0.04892 s
<18:28:05> R0-f2-r- | OOF(roc_auc_score=0.50000) | Time: 0.05401 s
<18:28:05> Uncaught exception! RuntimeError: generator raised StopIteration
Traceback (most recent call last):
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 792, in <genexpr>
    yield (next(indices) for _ in range(cv_params["n_splits"]))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "time_series_cv_example.py", line 82, in <module>
    execute()
  File "time_series_cv_example.py", line 78, in execute
    exp = CVExperiment(XGBClassifier)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 749, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 595, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 303, in __init__
    self.experiment_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 335, in experiment_workflow
    self.execute()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 607, in execute
    self.cross_validation_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 623, in cross_validation_workflow
    for self._fold, (self.train_index, self.validation_index) in enumerate(rep_indices):
RuntimeError: generator raised StopIteration
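The RuntimeError at the end of the traceback is how Python (3.7 and later, per PEP 479) reports a StopIteration that escapes from inside a generator. The sketch below reproduces that behavior with a stand-in iterator of three splits and a requested count of five, then shows one possible way to stop cleanly with itertools.islice. The names fake_splits and n_splits are illustrative only; this is not the code in experiments.py.

```python
from itertools import islice


def fake_splits():
    """Stand-in for a CV splitter that yields only three (train, test) pairs."""
    for fold in range(3):
        yield [fold], [fold + 1]


n_splits = 5  # requested count, larger than what the splitter yields

# Calling next() on an exhausted iterator inside a generator expression lets
# StopIteration escape the generator, which Python converts to RuntimeError.
indices = fake_splits()
eager = (next(indices) for _ in range(n_splits))
try:
    list(eager)
    error_name = None
except RuntimeError as error:
    error_name = type(error).__name__

# islice stops at whichever comes first: n_splits items or exhaustion.
safe = list(islice(fake_splits(), n_splits))

print(error_name, len(safe))  # RuntimeError 3
```

Stopping cleanly would only avoid the crash; anything downstream that still expects n_splits folds of predictions would need to be adjusted as well.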
I'd love to hear your thoughts on adding support for time-series problems! Sorry it's not working at the moment, though!
from hyperparameter_hunter.
Thank you so much for your effort to get this to work. I should be able to work with the sklearn TimeSeriesSplit method, so it's not a huge deal. I will definitely look into how I can help to support timeseriescv. I am working on a project at the moment that is taking a considerable amount of time, so I'm not sure I can look into it for a few weeks. I will definitely keep you posted on how it goes.
On a side note, this package is really nice. I really appreciate you sharing this to the community! It will definitely be part of my toolbox going forward!
from hyperparameter_hunter.
Does SKLearn's TimeSeriesSplit work with HH? If so, I should probably add an example using it... Thanks a lot for your support!
from hyperparameter_hunter.
Sorry for the delay, I have been traveling and will be for the next couple of weeks. I recall that it accepted the parameters (i.e., the lack of event times). However, I don't recall whether I completely tested the sklearn TimeSeriesSplit. I will give it a shot soon and report back. Thanks again for your help on this.
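For reference while testing, here is a small plain-Python sketch of an expanding-window split. It mirrors the general idea behind sklearn's TimeSeriesSplit but is not its implementation, and expanding_window_splits is a name made up for this example. It illustrates two points from the discussion above: a scheme like this yields exactly the requested number of folds, and the earliest samples never land in a test window, so they get no out-of-fold prediction.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index lists where each training window contains
    every sample that precedes its test window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(expanding_window_splits(n_samples=10, n_splits=3))
for train, test in folds:
    print(train, test)

# Samples that never appear in any test window get no out-of-fold prediction
never_tested = sorted(set(range(10)) - {i for _, test in folds for i in test})
print(len(folds), never_tested)  # 3 [0, 1, 2, 3]
```

Whether HH handles the missing out-of-fold predictions for those early samples is a separate question that this sketch does not answer.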
from hyperparameter_hunter.