royalhaskoningdhv / sam
Python package for time series analysis and machine learning
License: MIT License
No longer used in feature engineering classes. Make sure all references/dependencies are resolved.
You can supply the sam models with quantile values outside of the [0, 1] range, but this will lead to -Inf losses. It may be good to add a check that all quantile values lie within the [0, 1] range and, if not, raise an error.
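A minimal sketch of such a check, assuming the quantiles are passed as a plain sequence of floats. The function name `validate_quantiles` is hypothetical and not part of sam:

```python
from typing import Sequence


def validate_quantiles(quantiles: Sequence[float]) -> None:
    """Raise a ValueError if any quantile lies outside the [0, 1] range."""
    # NaN fails the chained comparison, so it is reported as invalid too.
    invalid = [q for q in quantiles if not 0 <= q <= 1]
    if invalid:
        raise ValueError(f"All quantiles must lie in [0, 1], got: {invalid}")
```

Calling this at the start of the model's init or fit would turn the silent -Inf loss into an early, readable error.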
Replacing missing values is most convenient with a sklearn.impute_BaseImputer. For time series data, forward fill / interpolation / etc. are the most common methods to impute missing values, but for those techniques there are no transformers in scikit-learn.
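As an illustration, a forward-fill transformer could look roughly like this. It is a sketch built on scikit-learn's public `BaseEstimator` and `TransformerMixin` and on `pandas.DataFrame.ffill`. The class name `ForwardFillImputer` and its `limit` parameter are assumptions, not existing sam or scikit-learn API:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ForwardFillImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that forward-fills missing values."""

    def __init__(self, limit=None):
        # Maximum number of consecutive missing values to fill (None = no limit).
        self.limit = limit

    def fit(self, X, y=None):
        # Forward fill is stateless, so there is nothing to learn here.
        return self

    def transform(self, X):
        # Returns a new DataFrame; leading missing values stay missing.
        return pd.DataFrame(X).ffill(limit=self.limit)
```

Because it follows the fit/transform interface, it can be placed in a scikit-learn Pipeline like any other imputer.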
The visualization module needs some refactoring; right now it seems like a random collection of (useful) functions.
This might also be an opportunity to drop the seaborn dependency and switch to the object-oriented (OO) style of matplotlib.
Similar to MLPTimeseriesRegressor. Support for recurrent neural networks (LSTM/GRU) would be nice to have.
We can use sam.models.create_keras_quantile_rnn() and sam.preprocessing.RNNReshaper.
Linear model with a similar interface as sam.models.BaseTimeseriesRegressor.
The KNMI and regenradar APIs sometimes do not work. A retry decorator could make the functions in sam more robust.
Especially for the KNMI functions this might resolve the problem of failing unit tests by chance.
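A retry decorator along these lines might be a starting point. This is a sketch using only the standard library. The name `retry` and the parameters `max_attempts`, `delay` and `exceptions` are illustrative assumptions, not existing sam API:

```python
import functools
import time


def retry(max_attempts=3, delay=1.0, exceptions=(Exception,)):
    """Retry the decorated function when it raises one of `exceptions`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Out of attempts: re-raise the last error unchanged.
                    if attempt == max_attempts:
                        raise
                    # Wait a fixed time before the next attempt.
                    time.sleep(delay)

        return wrapper

    return decorator
```

Restricting `exceptions` to connection and timeout errors keeps genuine bugs (for example a wrong argument) from being retried.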
The optional geom argument (string) can only have a certain length. If too long, the request will fail.
We need to raise a warning to inform users why an error occurs (the API does not handle that well).
e.g. the feature_engineer arg is not used in SPCRegressor. Supporting unused kwargs makes it possible to replace objects without changing parameters.
The score function calculates the tilted/pinball loss without using the joint_tilted_loss function included in this package. It would be a lot cleaner to use the internal function instead of calculating the loss in multiple places.
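For reference, the tilted/pinball loss for a single quantile can be written in a few lines of NumPy. This is a generic sketch of the standard formula, not the package's joint_tilted_loss (whose signature is not shown here). The name `pinball_loss` is an assumption:

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile):
    """Mean tilted (pinball) loss for one quantile in [0, 1]."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return float(np.mean(np.maximum(quantile * error, (quantile - 1) * error)))
```

Having one such function and calling it from both training and scoring avoids the two computations drifting apart.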
The class is too big and not the same format as the new feature engineering pipeline:
Right now, most init parameters of the sam models can be None, with some exceptions like rolling_window_size. It would be nice if this could also be None, for when you don't want any rolling features.
Models may only raise a warning if there is no timecol or datetime index. Technically this can still work, but it is then impossible to validate that the data is monospaced (equally spaced in time).
Do we want to raise an error instead?
SamQuantileMLP / TimeseriesMLP or something different?
Option 1: *TimeseriesRegressor
BaseTimeseriesRegressor
ConstantTimeseriesRegressor
LinearTimeseriesRegressor
MLPTimeseriesRegressor
Option 2: *QuantileForecaster
BaseQuantileForecaster
ConstantQuantileForecaster
LinearQuantileForecaster
MLPQuantileForecaster
Other suggestions are welcome.
Right now SamQuantileMLP doesn't pass the check, since it doesn't accept numpy arrays as input. This probably requires a relatively small addition, so it also works with numpy arrays that contain a Time column
read_regenradar does not support long time periods because of API timeouts.
Solution: get data in batches and concatenate the results.
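The batching step could be sketched as follows. The helper `split_period` is hypothetical and only splits the requested period into consecutive windows. How each window is passed to read_regenradar depends on its actual signature, which is not shown here:

```python
from datetime import datetime, timedelta


def split_period(start, end, batch_size):
    """Yield consecutive (batch_start, batch_end) windows covering [start, end)."""
    if batch_size <= timedelta(0):
        raise ValueError("batch_size must be positive")
    batch_start = start
    while batch_start < end:
        # The last window is shortened so it never runs past `end`.
        batch_end = min(batch_start + batch_size, end)
        yield batch_start, batch_end
        batch_start = batch_end
```

Each window can then be requested separately and the resulting dataframes combined with `pandas.concat`, taking care not to duplicate rows on the window boundaries.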
RHDHV has done several projects where we built validation pipelines to validate sensor data. These should also be added to SAM to make it easier to share the work we did there.
Currently we don't use GitHub releases, but it would be nice if we used the semantic release GitHub action to automatically post new releases to GitHub, including the changelog.
See: https://python-semantic-release.readthedocs.io/en/latest/#getting-started
Having two signals measuring the same thing (+ some independent noise), we want to be able to align them using e.g. the cross-correlation. It should be possible to have signals of unequal length. It should be possible to do this for numpy arrays, as well as pandas data frames, where not only the signals of interest, but the whole dataframes are aligned according to the specified alignment columns for each data frame.
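A minimal sketch of the lag estimation for NumPy arrays, using `numpy.correlate` in "full" mode so that the two signals may have unequal length. The function name `estimate_lag` and the sign convention are assumptions made for this illustration:

```python
import numpy as np


def estimate_lag(a, b):
    """Estimate how many samples `a` is delayed relative to `b`.

    A positive result means a[n] resembles b[n - lag].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Remove the means so a constant offset does not dominate the correlation.
    corr = np.correlate(a - a.mean(), b - b.mean(), mode="full")
    # In "full" mode, index 0 corresponds to a lag of -(len(b) - 1).
    return int(np.argmax(corr)) - (len(b) - 1)
```

For pandas dataframes, the estimated lag could then be applied to one of the frames (for example with `DataFrame.shift`) before joining on the index. That part is left out of the sketch.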
All the docstring examples should run without any doctest errors; right now that's not the case.
DoD checklist
python -m pytest --doctest-modules should succeed
sklearn has deprecated get_feature_names() in favor of get_feature_names_out(). This is blocking when updating to a new sklearn version, so we should update these functions; the behaviour stays the same.
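During the transition, a small compatibility helper could bridge both method names. This is a sketch; the helper name `feature_names_out` is hypothetical, and it assumes the old method can be called without arguments:

```python
def feature_names_out(transformer, input_features=None):
    """Return feature names via the new method, falling back to the old one."""
    if hasattr(transformer, "get_feature_names_out"):
        return list(transformer.get_feature_names_out(input_features))
    # Fallback for transformers that only implement the deprecated method.
    return list(transformer.get_feature_names())
```

Once all sam transformers implement get_feature_names_out(), the fallback branch can be dropped.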
Right now we can add synthetic timeseries and date ranges, but not synthetic sam dataframes; that would be really helpful for testing purposes.
Maybe we should add this check to the basemodel? We always do this check right?
Originally posted by @rubenpeters91 in #49 (comment)
SAM doesn't allow using lagged features of the target while also having predict_ahead==0. This was by design to prevent leaking data; however, there could be a use case where you only want lagged features of the target (not the target itself). This is hard to check, but could be a nice addition.
Regarding a question from Ruud Kassing about roles and responsibilities
What roles are we going to define for maintenance/development/contact. Do we need support from other teams?
Test all docstring examples with pytest --doctest-modules sam/*
An example could look something like this, including
repos:
Building the wheels locally and uploading the dist to PyPI manually works fine, but probably something goes wrong in the workflow.
Somehow the built package is empty, and no modules or functions can be uploaded.
Not critical, because current build was uploaded manually.
ConstantTemplate (the underlying sklearn estimator for ConstantTimeseriesRegressor) only uses the input data X to determine the output shape of the predictions. It shouldn't actually be necessary for X to even contain data, or be array-like at all, as long as it specifies a length (implements the __len__() dunder).
In #75 and #76 I already loosened the validation on X by allowing NaN/Inf values, but the requirements are still needlessly restrictive because of the assumptions in BaseTimeseriesRegressor. I would like the ability at least to pass an empty dataframe X = pd.DataFrame(index=range(100)).
Moreover, it should be noted that scikit-learn actually has DummyRegressor, implementing the same logic as ConstantTemplate. Although I haven't tested it, ConstantTemplate is probably equivalent to DummyRegressor, and if so can be removed completely in favor of the latter.
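For comparison, this is how scikit-learn's DummyRegressor behaves: the contents of X are ignored and only the number of rows determines the output length. The example only illustrates DummyRegressor itself. Whether it matches ConstantTemplate in every respect is, as noted above, untested:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# X carries no information; only y is used to learn the constant.
X_train = np.zeros((4, 1))
y_train = np.array([1.0, 2.0, 3.0, 6.0])

model = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Predictions repeat the training mean once per row of X.
predictions = model.predict(np.zeros((10, 1)))
```

DummyRegressor also offers `strategy="quantile"` and `strategy="constant"`, which may be relevant when comparing it against the quantile outputs of the sam models.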
The type-hint for predict_ahead in SamQuantileMLP is
predict_ahead: Union[int, Sequence[int]] = 1,
But the parent BaseTimeseriesRegressor only supports List:
predict_ahead: Union[int, List[int]] = 1,
>>> predict_ahead = (0,)
>>> isinstance(predict_ahead, Sequence)
True
>>> model = SamQuantileMLP(predict_ahead=predict_ahead)
>>> model.predict_ahead
[(0,)]
>>> model.validate_predict_ahead()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\921266\source\repos\sam\sam\models\base_model.py", line 141, in validate_predict_ahead
    if not all([p >= 0 for p in self.predict_ahead]):
  File "C:\Users\921266\source\repos\sam\sam\models\base_model.py", line 141, in <listcomp>
    if not all([p >= 0 for p in self.predict_ahead]):
TypeError: '>=' not supported between instances of 'tuple' and 'int'
This is caused by the following line:
self.predict_ahead = (
    predict_ahead if isinstance(predict_ahead, List) else [predict_ahead]
)
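One possible fix is to normalize any sequence to a list instead of checking for List only. The helper below is a sketch of that idea. The function name `normalize_predict_ahead` is hypothetical, and in the package the logic would live in the init of BaseTimeseriesRegressor:

```python
from typing import List, Sequence, Union


def normalize_predict_ahead(predict_ahead: Union[int, Sequence[int]]) -> List[int]:
    """Return predict_ahead as a list, accepting an int or any sequence of ints."""
    # Strings are sequences too, so exclude them explicitly.
    if isinstance(predict_ahead, Sequence) and not isinstance(predict_ahead, (str, bytes)):
        return list(predict_ahead)
    return [predict_ahead]
```

With this, `normalize_predict_ahead((0,))` gives `[0]` rather than `[(0,)]`, so the type hint and the runtime behaviour agree. Note that NumPy arrays are not registered as `Sequence` and would need separate handling.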
This should reduce the learning curve of using SAM. The current feature engineer in the SamQuantile models is too complicated.
BuildRollingFeatures is also not a necessity, since pandas rolling functionality provides the same. A way to provide a custom feature engineering function would make the required code for a simple model much simpler.
Having a BaseValidator class in a similar way as BaseTimeseriesRegressor and BaseFeatureEngineer should make adding new techniques easier.
With version 3.0, when using the TimeSeriesMLP, the first rows of the data will be removed in fit because of the rolling features. This no longer happens in the feature engineer, since it can also contain custom functions. This behaviour of course changes when using an imputer. We should consider whether this is what we want.
This is not reflective of SAM code not working, so rather than throwing an error, this should only raise a warning that KNMI API is down.
affected:
When using use_diff_of_y you apparently can't set predict_ahead = [0] in TimeseriesMLP. There are multiple checks for this in the code, and removing the first error will lead to predicting a straight line. Using use_diff_of_y with any other predict_ahead works as expected.
We should update the documentation for release and also fix the autodoc functionality, since currently it seems broken on: https://sam-rhdhv.readthedocs.io/en/latest/