royalhaskoningdhv / sam

Python package for time series analysis and machine learning

License: MIT License

Python 100.00%

anomaly-detection asset-management data-science forecasting machine-learning python time-series

sam's People

harm-nomden-sweco davidswinkels mariocastrogama beck0003 mi2354 yxinjiang

sam's Issues

Remove `decompose_datetime` and `CyclicalMaxes`

No longer used in feature engineering classes. Make sure all references/dependencies are resolved.

Add a check on quantile values

You can supply the sam models with quantile values outside the [0, 1] range, but this leads to -Inf losses. It may be good to add a check that all quantile values lie within the [0, 1] range and, if not, raise an error.
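A minimal sketch of such a check, assuming the quantiles arrive as a plain sequence of floats. The function name `validate_quantiles` is hypothetical and not part of sam:

```python
from typing import Sequence


def validate_quantiles(quantiles: Sequence[float]) -> None:
    """Raise a ValueError if any quantile lies outside the closed interval [0, 1]."""
    # NaN also fails the chained comparison, so it is reported as invalid
    invalid = [q for q in quantiles if not 0 <= q <= 1]
    if invalid:
        raise ValueError(f"Quantiles must lie within [0, 1], got invalid values: {invalid}")


validate_quantiles([0.1, 0.5, 0.9])  # valid input passes silently
```

Calling this at the start of model initialisation or `fit` would surface the problem before any loss is computed.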

Implementation of forward fill / interpolation as a transformer

Replacing missing values is most convenient with a sklearn.impute_BaseImputer. For time series data, forward fill / interpolation / etc. are the most common methods to impute missing values, but for those techniques there are no transformers in scikit-learn.
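As a rough sketch of what such a transformer could look like: the class below follows the scikit-learn `fit`/`transform` convention but is kept as a plain class for brevity. `ForwardFillImputer` and its `limit` parameter are hypothetical names, and a real implementation would probably inherit from the scikit-learn base classes.

```python
import pandas as pd


class ForwardFillImputer:
    """Hypothetical transformer that forward-fills missing values column by column."""

    def __init__(self, limit=None):
        # Maximum number of consecutive missing values to fill; None means no limit
        self.limit = limit

    def fit(self, X: pd.DataFrame, y=None):
        # Forward fill is stateless, so there is nothing to learn
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Leading missing values stay missing because there is nothing to carry forward
        return X.ffill(limit=self.limit)


df = pd.DataFrame({"value": [1.0, None, None, 4.0]})
filled = ForwardFillImputer().fit(df).transform(df)
```

An interpolation variant could follow the same shape, with `X.interpolate()` in `transform`.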

Revise and refactor sam.visualization

The visualization module needs some refactoring, right now it seems like a random collection of (useful) functions.
This might also be an opportunity to get rid of the seaborn dependency and switch to matplotlib's object-oriented style.

RNNTimeseriesRegressor

Similar to MLPTimeseriesRegressor. Support for recurrent neural networks (LSTM/GRU) would be nice to have.

We can use sam.models.create_keras_quantile_rnn() and sam.preprocessing.RNNReshaper

LinearTimeseriesRegressor / TimeseriesLinearRegressor

Linear model with a similar interface as sam.models.BaseTimeseriesRegressor.

retry decorators for knmi / nrr functions

The KNMI and regenradar APIs sometimes do not work. A retry decorator could make the functions in sam more robust.

Especially for the KNMI functions this might resolve the problem of failing unit tests by chance.

https://pypi.org/project/retry/
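To illustrate the idea without depending on the linked package, here is a hand-rolled sketch using only the standard library. The parameter names `tries`, `delay` and `exceptions`, and the `flaky_request` function, are assumptions made for this example:

```python
import functools
import time


def retry(tries=3, delay=1.0, exceptions=(Exception,)):
    """Retry the decorated function up to `tries` times, sleeping `delay` seconds in between."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)

        return wrapper

    return decorator


calls = {"count": 0}


@retry(tries=3, delay=0)
def flaky_request():
    # Hypothetical stand-in for an API call that fails twice before succeeding
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_request()
```

Restricting `exceptions` to connection and timeout errors would keep genuine bugs from being retried.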

Test documentation webhook

test

Add requirement for geom in read_regenradar

The optional geom argument (string) can only have a certain length. If too long, the request will fail.

We need to raise a warning to inform users why an error occurs (the API does not handle that well).

Make model objects interchangeable (by using kwargs)

e.g. feature_engineer arg is not used in SPCRegressor. Supporting unused kwargs makes it possible to replace objects without changing parameters.
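A small sketch of the pattern, using a hypothetical `FlexibleRegressor` class rather than any actual sam model:

```python
class FlexibleRegressor:
    """Hypothetical model that tolerates parameters it does not use."""

    def __init__(self, quantiles=(), **kwargs):
        self.quantiles = tuple(quantiles)
        # Keep unused arguments (e.g. feature_engineer) so they can be inspected later
        self.ignored_params = dict(kwargs)


# The same parameter dictionary can now be passed to different model classes
shared_params = {"quantiles": (0.1, 0.9), "feature_engineer": None}
model = FlexibleRegressor(**shared_params)
```

One trade-off is that misspelled parameter names are silently accepted, so logging or warning about ignored arguments may be worthwhile.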

Refactor MLPTimeseriesRegressor `score` function

The score function calculates the tilted/pinball loss without using the included joint_tilted_loss function of this package. It would be a lot cleaner to use the internal function, instead of calculating it in multiple places
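For reference, the tilted (pinball) loss for a single quantile can be written in a few lines. The sketch below is a plain-Python illustration of the formula, not the package's `joint_tilted_loss`, whose signature is not shown here:

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], quantile: float) -> float:
    """Mean tilted (pinball) loss for a single quantile."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction is weighted by q, over-prediction by (1 - q)
        total += max(quantile * error, (quantile - 1) * error)
    return total / len(y_true)


loss = pinball_loss([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], quantile=0.9)
```

Keeping one such function and calling it from `score` avoids the formula drifting apart between places.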

Update Automatic Feature Engineering class to fit the new Feature engineering pipeline

The class is too big and not the same format as the new feature engineering pipeline:

Update Automatic Feature Engineering class to new format
Update Tutorial
Update docs

Make sure all init parameters of BaseQuantileRegressor can be None

Right now, most init parameters of the sam models can be None, with some exceptions such as rolling_window_size. It would be nice if this could also be None, for when you don't want any rolling features.

Do we want to support models without providing datetimes?

Currently, models only raise a warning if there is no timecol or datetime index. Technically this can still work, but then it is impossible to validate that the data is monospaced (equally spaced in time).

Do we want to raise an error instead?
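To make the trade-off concrete, this is the kind of check that only becomes possible when a datetime index is available. It is a sketch with a hypothetical helper name, `is_equidistant`, not sam's own validation:

```python
import pandas as pd


def is_equidistant(index: pd.DatetimeIndex) -> bool:
    """Return True if all consecutive timestamps are the same distance apart."""
    steps = index.to_series().diff().dropna()
    return steps.nunique() <= 1


regular = pd.date_range("2023-01-01", periods=4, freq="D")
irregular = regular.delete(2)  # drop one timestamp to create a gap
```

Without datetimes, a gap like the one in `irregular` would go unnoticed and lagged features would silently span different time intervals.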

Decide on model naming conventions

SamQuantileMLP / TimeseriesMLP or something different?

Option 1: *TimeseriesRegressor
BaseTimeseriesRegressor
ConstantTimeseriesRegressor
LinearTimeseriesRegressor
MLPTimeseriesRegressor

Option 2: *QuantileForecaster
BaseQuantileForecaster
ConstantQuantileForecaster
LinearQuantileForecaster
MLPQuantileForecaster

Other suggestions are welcome.

Make sure all Sam models pass the `check_estimator` check from sklearn

Right now SamQuantileMLP doesn't pass the check, since it doesn't accept numpy arrays as input. This probably requires a relatively small addition, so it also works with numpy arrays that contain a Time column

Batch processing for downloading regenradar

read_regenradar does not support long time periods because of API time outs.

Solution: get data in batches and concatenate the results.
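The batching step could be sketched as below. `split_period` is a hypothetical helper; the actual download call is left out because the arguments of read_regenradar are not described here.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_period(
    start: datetime, end: datetime, batch_size: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (batch_start, batch_end) windows that together cover [start, end)."""
    batch_start = start
    while batch_start < end:
        batch_end = min(batch_start + batch_size, end)
        yield batch_start, batch_end
        batch_start = batch_end


batches = list(split_period(datetime(2023, 1, 1), datetime(2023, 1, 10), timedelta(days=4)))
# Each window would be passed to the download function and the partial results
# concatenated afterwards, for example with pandas.concat.
```

The last window is shortened to end exactly at `end`, so no data outside the requested period is fetched.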

Add validation rules from other projects from RHDHV

RHDHV has done several projects where we build validation pipelines to validate sensor data. These should also be added to SAM to make it easier to share the work we did there

Use Semantic Release to automate the release to GitHub

Currently we don't use GitHub releases, but it would be nice to use the semantic release GitHub action to automatically post new releases to GitHub, including the changelog.

See: https://python-semantic-release.readthedocs.io/en/latest/#getting-started

Add temporal alignment functionality of two signals

Having two signals measuring the same thing (+ some independent noise), we want to be able to align them using e.g. the cross-correlation. It should be possible to have signals of unequal length. It should be possible to do this for numpy arrays, as well as pandas data frames, where not only the signals of interest, but the whole dataframes are aligned according to the specified alignment columns for each data frame.
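A possible starting point for the numpy case is sketched below, using `numpy.correlate` to estimate the lag between two signals of unequal length. The function name `estimate_lag` and the example signals are assumptions for illustration:

```python
import numpy as np


def estimate_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Return the shift (in samples) of `a` relative to `b` that maximises their cross-correlation.

    A positive result means `a` is delayed: a[n + lag] lines up with b[n].
    """
    a = a - a.mean()
    b = b - b.mean()
    correlation = np.correlate(a, b, mode="full")
    # In "full" mode, index len(b) - 1 corresponds to zero lag
    return int(np.argmax(correlation)) - (len(b) - 1)


rng = np.random.default_rng(0)
b = rng.normal(size=50)
# `a` contains the same signal, delayed by 3 samples and padded to a different length
a = np.concatenate([np.zeros(3), b, np.zeros(7)])
lag = estimate_lag(a, b)
```

For data frames, the estimated lag could then be applied to the whole frame, for instance by shifting its index, so that all columns move together with the alignment column.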

Make sure all DOCSTRING examples work

All the DOCSTRING examples should run without any doctest errors, right now that's not the case

DoD checklist

python -m pytest --doctest-modules should succeed
doctest option is added to unit test workflow
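As a small illustration of what a passing docstring example looks like, the sketch below defines a hypothetical function, `rolling_sum`, and runs its docstring example with the standard-library doctest module. Running pytest with --doctest-modules collects examples of this kind automatically.

```python
import doctest


def rolling_sum(values, window):
    """Sum each run of `window` consecutive values.

    >>> rolling_sum([1, 2, 3, 4], window=2)
    [3, 5, 7]
    """
    return [sum(values[i : i + window]) for i in range(len(values) - window + 1)]


# Parse the docstring into a doctest and run it explicitly
test = doctest.DocTestParser().get_doctest(
    rolling_sum.__doc__, {"rolling_sum": rolling_sum}, "rolling_sum", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

`results` holds the number of failed and attempted examples, which is what a CI step would check.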

Update all transformers to use `get_feature_names_out()` instead of `get_feature_names()`

sklearn has deprecated get_feature_names() in favor of get_feature_names_out(). This blocks updating to a newer sklearn version, so we should update these functions. The behaviour stays the same.

Create github action to publish new PyPI release

Create github action to publish new PyPI release

Add synthetic_samdata() function

Right now we can create synthetic timeseries and dateranges, but not synthetic sam dataframes. That would be really helpful for testing purposes.

Include some more checks and assertions in base model

Maybe we should add this check to the basemodel? We always do this check right?

Originally posted by @rubenpeters91 in #49 (comment)

Use lagged y features for `predict_ahead==0` should be possible

SAM doesn't allow lagged features of the target in combination with predict_ahead==0. This was by design, to prevent data leakage. However, there could be a use case where you only want lagged features of the target (not the target itself). This is hard to check, but could be a nice addition.

How should we define governance rules for this repo?

Regarding a question from Ruud Kassing about roles and responsibilities

What roles are we going to define for maintenance/development/contact. Do we need support from other teams?

Leadership and Governance | Open Source Guides

Add github action for doctest

Test all docstring examples with pytest --doctest-modules sam/*

Consider adding pre-commit hooks with pre-commit package

An example could look something like this, including:

isort for ordering imports consistently
black for formatting in black style
trailing-whitespace to remove trailing whitespace
flake8 for linting (pep8, pyflakes and cyclomatic complexity); should be black compatible
bandit for checking for security vulnerabilities (e.g. writing secrets in code). We want to allow using assert statements (B101) and allow using pickle (B301).

repos:
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
        name: isort
        args: ["--profile", "black"]
  - repo: https://github.com/psf/black.git
    rev: 22.3.0
    hooks:
      - id: black
        name: black
        language_version: python3
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.2.0
    hooks:
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
      - id: mixed-line-ending
      - id: fix-byte-order-marker
      - id: check-executables-have-shebangs
      - id: check-shebang-scripts-are-executable
      - id: check-merge-conflict
      - id: check-symlinks
      - id: check-case-conflict
      - id: check-docstring-first
      - id: check-json
      - id: check-toml
      - id: check-xml
      - id: check-yaml
  - repo: https://gitlab.com/pycqa/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
        additional_dependencies:
          - pyproject-flake8
          - flake8-absolute-import
          - flake8-black
          - flake8-docstrings
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.4
    hooks:
      - id: bandit
        name: bandit
        args: [--skip, "B101,B301"]

Empty package from workflow to PyPI

Building the wheels locally and uploading the dist to PyPI manually works fine, but probably something goes wrong in the workflow.

Somehow the built package is empty, and no modules or functions are uploaded.

Not critical, because current build was uploaded manually.

Simplify ConstantTimeseriesRegressor

ConstantTemplate (the underlying sklearn estimator for ConstantTimeseriesRegressor) only uses the input data X to determine the output shape of the predictions. It shouldn't actually be necessary for X to even contain data, or be array-like at all, as long as it specifies a length (implements the __len__() dunder).

In #75 and #76 I already loosened the validation on X by allowing NaN/Inf values, but the requirements are still needlessly restrictive because of the assumptions in BaseTimeseriesRegressor. I would like the ability at least to pass an empty dataframe X = pd.DataFrame(index=range(100)).

Moreover, it should be noted that scikit-learn actually has DummyRegressor, implementing the same logic as ConstantTemplate. Although I haven't tested it, ConstantTemplate is probably equivalent to DummyRegressor, and if so can be removed completely in favor of the latter.
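The point about X can be illustrated with a deliberately simplified stand-in. `ConstantPredictor` below is hypothetical and predicts the mean of y, which is an assumption made for brevity and not necessarily what ConstantTemplate computes:

```python
class ConstantPredictor:
    """Hypothetical constant model: learns the mean of y and ignores the contents of X."""

    def fit(self, X, y):
        self.constant_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Only the length of X matters, so any object implementing __len__ will do
        return [self.constant_] * len(X)


model = ConstantPredictor().fit(X=range(3), y=[1.0, 2.0, 3.0])
predictions = model.predict(range(100))
```

Here `range(100)` plays the role of the empty dataframe with 100 rows: it carries a length and nothing else.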

BUG: SamQuantileMLP predict_ahead doesn't support Sequence

The type-hint for predict_ahead in SamQuantileMLP is
predict_ahead: Union[int, Sequence[int]] = 1,

But the parent BaseTimeseriesRegressor only supports List:
predict_ahead: Union[int, List[int]] = 1,

>>> predict_ahead = (0,)
>>> isinstance(predict_ahead, Sequence)
True
>>> model = SamQuantileMLP(predict_ahead=predict_ahead)
>>> model.predict_ahead
[(0,)]
>>> model.validate_predict_ahead()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\921266\source\repos\sam\sam\models\base_model.py", line 141, in validate_predict_ahead
    if not all([p >= 0 for p in self.predict_ahead]):
  File "C:\Users\921266\source\repos\sam\sam\models\base_model.py", line 141, in <listcomp>
    if not all([p >= 0 for p in self.predict_ahead]):
TypeError: '>=' not supported between instances of 'tuple' and 'int'

This is caused by the following line:
self.predict_ahead = (
    predict_ahead if isinstance(predict_ahead, List) else [predict_ahead]
)
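One possible fix is to test against Sequence instead of List and convert to a list. The helper below is a sketch with a hypothetical name, `normalize_predict_ahead`; strings are excluded explicitly because they are sequences too.

```python
from typing import List, Sequence, Union


def normalize_predict_ahead(predict_ahead: Union[int, Sequence[int]]) -> List[int]:
    """Return predict_ahead as a list, whether it was given as an int or as any sequence."""
    if isinstance(predict_ahead, Sequence) and not isinstance(predict_ahead, str):
        return list(predict_ahead)
    return [predict_ahead]


from_tuple = normalize_predict_ahead((0,))
from_int = normalize_predict_ahead(2)
```

With this normalisation, the tuple from the example above becomes a list of integers, so the comparison in validate_predict_ahead no longer receives a tuple.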

Improve feature engineering in SamQuantileMLP

This should flatten the learning curve of using SAM. The current feature engineering in the SamQuantile models is too complicated.

BuildRollingFeatures is also not a necessity, since pandas' rolling functionality provides the same. A way to provide a custom feature engineering function would make the code required for a simple model much simpler.
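A custom feature engineering function of this kind could be as short as the sketch below. The function name, the `windows` parameter and the column naming scheme are all assumptions for illustration:

```python
import pandas as pd


def rolling_mean_features(X: pd.DataFrame, windows=(2, 3)) -> pd.DataFrame:
    """Hypothetical custom feature function built on pandas' rolling windows."""
    features = {}
    for column in X.columns:
        for window in windows:
            # The first (window - 1) rows have no full window and therefore become NaN
            features[f"{column}_mean_{window}"] = X[column].rolling(window).mean()
    return pd.DataFrame(features, index=X.index)


X = pd.DataFrame({"flow": [1.0, 2.0, 3.0, 4.0]})
features = rolling_mean_features(X)
```

The NaN rows at the start are a direct consequence of the rolling window, so the model or an imputer has to deal with them.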

Refactor current validation module by adding `BaseValidator` class

Having a BaseValidator class in a similar way as BaseTimeseriesRegressor and BaseFeatureEngineer should make adding new techniques easier.

Consider effect of removing first rows after rolling features

With version 3.0, when using the TimeSeriesMLP, the first rows of the data are removed in fit because of the rolling features. This no longer happens in the feature engineer, since it can also contain custom functions. This behaviour of course changes when using an imputer. We should consider whether this is what we want.

Unit tests will fail when KNMI API returns empty results

This is not reflective of SAM code not working, so rather than throwing an error, this should only raise a warning that KNMI API is down.

affected:

sam.data_sources.weather.knmi.read_knmi
TestWeather.test_read_knmi_hourly
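One way to express "warn instead of fail" is sketched below with the standard-library warnings module. `check_api_result` is a hypothetical helper, and the empty list stands in for an empty API response:

```python
import warnings


def check_api_result(rows, source="KNMI"):
    """Warn, rather than fail, when an external API returns no data."""
    if len(rows) == 0:
        warnings.warn(f"{source} API returned no data; it may be down.", RuntimeWarning)
        return False
    return True


# Record the warning instead of printing it, to show what a test could inspect
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    has_data = check_api_result([])
```

A unit test could use the returned flag to skip its remaining assertions when the API delivers nothing.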

Use `use_diff_of_y` and `predict_ahead == [0]` at the same time

When using use_diff_of_y, you apparently can't set predict_ahead = [0] in TimeseriesMLP. There are multiple checks for this in the code, and removing the first error leads to predicting a straight line. Using use_diff_of_y with any other predict_ahead works as expected.

Update documentation for release

We should update the documentation for release and also fix the autodoc functionality, since currently it seems broken on: https://sam-rhdhv.readthedocs.io/en/latest/

Move important information from General documents to a new "Introduction" section
Remove General documents section
Replace example notebooks with the new examples
Fix autodoc errors so the docs work on readthedocs
Fix tensorflow import so autodoc works for metrics, models and visualization