sktime / sktime

A unified framework for machine learning with time series

Home Page: https://www.sktime.net

License: BSD 3-Clause "New" or "Revised" License

Python 99.64% Shell 0.02% Makefile 0.04% Dockerfile 0.02% Jupyter Notebook 0.27%
time-series machine-learning scikit-learn time-series-classification time-series-regression forecasting time-series-analysis data-science data-mining hacktoberfest

sktime's Introduction

Welcome to sktime

A unified interface for machine learning with time series

🚀 Version 0.28.1 out now! Check out the release notes here.

sktime is a library for time series analysis in Python. It provides a unified interface for multiple time series learning tasks. Currently, this includes time series classification, regression, clustering, annotation, and forecasting. It comes with time series algorithms and scikit-learn compatible tools to build, tune and validate time series models.

Overview
Open Source: BSD 3-clause
Tutorials: Binder · YouTube
Community: Discord · Slack
CI/CD: GitHub Actions · Codecov · Read the Docs
Code: PyPI · conda · Python versions · black
Downloads: PyPI download counts
Citation: Zenodo

📚 Documentation

Documentation
⭐ Tutorials New to sktime? Here's everything you need to know!
📋 Binder Notebooks Example notebooks to play with in your browser.
👩‍💻 User Guides How to use sktime and its features.
✂️ Extension Templates How to build your own estimator using sktime's API.
🎛️ API Reference The detailed reference for sktime's API.
📺 Video Tutorial Our video tutorial from 2021 PyData Global.
🛠️ Changelog Changes and version history.
🌳 Roadmap sktime's software and community development plan.
📝 Related Software A list of related software.

💬 Where to ask questions

Questions and feedback are extremely welcome! We strongly believe in the value of sharing help publicly, as it allows a wider audience to benefit from it.

Type Platforms
πŸ› Bug Reports GitHub Issue Tracker
✨ Feature Requests & Ideas GitHub Issue Tracker
πŸ‘©β€πŸ’» Usage Questions GitHub Discussions Β· Stack Overflow
πŸ’¬ General Discussion GitHub Discussions
🏭 Contribution & Development dev-chat channel · Discord
🌐 Meet-ups and collaboration sessions Discord - Fridays 4 pm UTC, dev/meet-ups channel

💫 Features

Our objective is to enhance the interoperability and usability of the time series analysis ecosystem in its entirety. sktime provides a unified interface for distinct but related time series learning tasks. It features dedicated time series algorithms and tools for composite model building such as pipelining, ensembling, tuning, and reduction, empowering users to apply an algorithm designed for one task to another.

sktime also provides interfaces to related libraries, for example scikit-learn, statsmodels, tsfresh, PyOD, and fbprophet, among others.

Module Status Links
Forecasting stable Tutorial · API Reference · Extension Template
Time Series Classification stable Tutorial · API Reference · Extension Template
Time Series Regression stable API Reference
Transformations stable Tutorial · API Reference · Extension Template
Parameter fitting maturing API Reference · Extension Template
Time Series Clustering maturing API Reference · Extension Template
Time Series Distances/Kernels maturing Tutorial · API Reference · Extension Template
Time Series Alignment experimental API Reference · Extension Template
Annotation experimental Extension Template
Time Series Splitters maturing Extension Template
Distributions and simulation experimental

⏳ Install sktime

For troubleshooting and detailed installation instructions, see the documentation.

  • Operating system: macOS · Linux · Windows 8.1 or higher
  • Python version: Python 3.8, 3.9, 3.10, 3.11, and 3.12 (64-bit only)
  • Package managers: pip · conda (via conda-forge)

pip

Using pip, sktime releases are available as source packages and binary wheels. Available wheels are listed here.

pip install sktime

or, with maximum dependencies,

pip install sktime[all_extras]

For curated sets of soft dependencies for specific learning tasks:

pip install sktime[forecasting]  # for selected forecasting dependencies
pip install sktime[forecasting,transformations]  # forecasters and transformers

or similar. Valid sets are:

  • forecasting
  • transformations
  • classification
  • regression
  • clustering
  • param_est
  • networks
  • annotation
  • alignment

Caveat: in general, not all soft dependencies for a learning task are installed, only a curated selection.

conda

You can also install sktime from conda via the conda-forge channel. The feedstock including the build recipe and configuration is maintained in this conda-forge repository.

conda install -c conda-forge sktime

or, with maximum dependencies,

conda install -c conda-forge sktime-all-extras

(as conda does not support dependency sets, flexible choice of soft dependencies is unavailable via conda)

⚡ Quickstart

Forecasting

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.theta import ThetaForecaster
from sktime.split import temporal_train_test_split
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

y = load_airline()
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = ThetaForecaster(sp=12)  # monthly seasonal periodicity
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
mean_absolute_percentage_error(y_test, y_pred)
>>> 0.08661467738190656

Time Series Classification

from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_arrow_head()
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = TimeSeriesForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
>>> 0.8679245283018868

👋 How to get involved

There are many ways to join the sktime community. We follow the all-contributors specification: all kinds of contributions are welcome - not just code.

Documentation
πŸ’ Contribute How to contribute to sktime.
πŸŽ’ Mentoring New to open source? Apply to our mentoring program!
πŸ“… Meetings Join our discussions, tutorials, workshops, and sprints!
πŸ‘©β€πŸ”§ Developer Guides How to further develop sktime's code base.
🚧 Enhancement Proposals Design a new feature for sktime.
πŸ… Contributors A list of all contributors.
πŸ™‹ Roles An overview of our core community roles.
πŸ’Έ Donate Fund sktime maintenance and development.
πŸ›οΈ Governance How and by whom decisions are made in sktime's community.

πŸ† Hall of fame

Thanks to all our community for all your wonderful contributions, PRs, issues, ideas.


💡 Project vision

  • By the community, for the community -- developed by a friendly and collaborative community.
  • The right tool for the right task -- helping users to diagnose their learning problem and suitable scientific model types.
  • Embedded in state-of-art ecosystems and provider of interoperable interfaces -- interoperable with scikit-learn, statsmodels, tsfresh, and other community favorites.
  • Rich model composition and reduction functionality -- build tuning and feature extraction pipelines, solve forecasting tasks with scikit-learn regressors.
  • Clean, descriptive specification syntax -- based on modern object-oriented design principles for data science.
  • Fair model assessment and benchmarking -- build your models, inspect your models, check your models, and avoid pitfalls.
  • Easily extensible -- easy extension templates to add your own algorithms compatible with sktime's API.

sktime's People

Contributors

achieveordie, aiwalter, benheid, chrisholder, ciaran-g, danbartl, dependabot[bot], eenticott-shell, fkiraly, goastler, guzalbulatova, hazrulakmal, jasonlines, jesellier, khrapovs, kishmanani, lmmentel, ltsaprounis, matthewmiddlehurst, miraep8, mloning, patrickzib, prockenschaub, rnkuhns, sajaysurya, samialavi, thayeylolu, tonybagnall, viktorkaz, yarnabrina


sktime's Issues

Problem with load_gunpoint_dataframe

Thank you for sharing this project !

I was testing load_gunpoint_dataframe and got str as the type of X_train.

I used X_train, y_train = load_gunpoint_dataframe(split='TRAIN', return_X_y=True).

Thanks

tests should not rely on the internet

multiple integration tests rely on downloading data from the internet, which makes connection problems to their source (more specifically, timeseriesclassification.com) a potential failure risk. I suggest removing this implicit outside dependency for cleanliness.

Design/implement multivariate ensembles and composites

I thought it worth spelling out the structure of how people can use and implement classification algorithms in sktime. The balance is between modularity and efficiency. This summarises what we have, but I think it is worth it as we try to widen the development base.

Modular approach:
I propose two classifiers as pipelines
TimeSeriesEnsemble (or time_series_ensemble? need to resolve naming conventions. Another ticket!). This pipeline is a set of transforms and an estimator/classifier as the last element. The logic is that the transforms are applied independently to each member of the ensemble, and the final estimator is the base classifier for the ensemble. The transformers are not applied sequentially. They are applied independently to the data and the results are concatenated. They must be Randomizable, so that a different transform is applied for each ensemble member. The transformers and the base classifier should be seedable for reproducability.

Compiling cython under linux: undefined symbol / no such file or directory

Error: fatal error: Python.h: No such file or directory
OR
Error: .so : undefined symbol: _Py_ZeroStruct

Both are caused by incorrect packages/setup under Linux. You need the python3-dev package installed to fix the former issue, and make sure setup.py is called with python3 rather than python2. To build the Cython extensions, run: python3 setup.py build_ext -i

Cython should then successfully build '.so' and '.c' files

TSC/TSR: implement pipeline building functionality with transformers on y

xpandas provides functionality for fusing series-to-tabular transformers with tabular supervised learning methods.

This should be interfaced, or replicated in the more general interface.

In this interface, it should be possible to chain series-to-series transformers (e.g., truncation) with series-to-tabular transformers to obtain a series-to-tabular transformer, or series-to-series with series-to-tabular with a tabular SL method to obtain a TSC/TSR method.

Conditional on a consolidated interface design as in #5 and #6.

If easy interfacing is not possible, it may also require separate pipeline design, in this case please raise issue of pipeline API design.
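As a rough sketch of the chaining described above, a series-to-series transformer (truncation) can be composed with a series-to-tabular transformer (summary statistics) to feed a tabular supervised learner; `truncate` and `summary_stats` here are hypothetical helpers for illustration, not an actual interface:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def truncate(X, lower, upper):
    # series-to-series: cut each series to the index window [lower, upper)
    return [s[lower:upper] for s in X]

def summary_stats(X):
    # series-to-tabular: one feature row per series
    return np.array([[np.mean(s), np.std(s), np.min(s), np.max(s)] for s in X])

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = [rng.normal(loc=c, size=30) for c in y]    # toy series with class-shifted means

# series-to-series ∘ series-to-tabular -> tabular features for a TSC method
features = summary_stats(truncate(X, 5, 25))
clf = LogisticRegression().fit(features, y)
train_acc = clf.score(features, y)
```

Chaining two series-to-series steps, or stopping before the tabular learner, gives the other compositions mentioned above.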

Implement important series-to-series transformers

Some series-to-series transformers that would be useful.

unfitted, single-series, simple

  • binning/aggregation transformer

Behaviour:
returns the sequence of [aggregator application] (e.g., count) in the bins. Index is start time, end time, or index (from start) of bin, depending on index hyper-parameter

Hyper-parameters:
bin specs - start: time/index, end : time/index, numbins : integer
index - 'start', 'end', or 'bin'
aggregator - function to apply to values within bin, default = count

alternative to bin specs: index sequence

  • truncation transformer

Behaviour:
cuts off any entry in the sequence with index outside [lower, upper]

Hyper-parameters:
lower, upper : time

  • simple equal spacing transformer

Behaviour:
inter-/extrapolates series to the nodes by the specified strategy, e.g., fill in nearest or next (careful with boundaries)

Hyper-parameters:
node specs - start: time/index, end : time/index, numsteps : integer
index - 'start', 'end', or 'bin'
strategy - 'nearest', 'last' , 'next', 'pw_linear'

alternative to node specs: index sequence

  • re-indexing transformer

Behaviour:
changes the index by the strategy indicated in the reindexing parameter
integer = replace with ascending count
field = get from data frame column

Hyper-parameters:
strategy - 'integer', 'field'

  • index extractor transformer

Behaviour:
creates a series from the index of the series

  • NA remover transformer

Behaviour:
removes sequence elements that are numpy.nan

  • padding transformer

Behaviour:
pads a sequence/series with value at start or end until it has the desired length

Hyper-parameters:
where - 'start', 'end'
what - value
length - integer
optional: index treatment

  • NA imputer

Behaviour:
Fills in NA values by the specified strategy

Hyper-parameters:
strategy - 'nearest', 'last' , 'next', 'pw_linear'
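Three of the simple transformers above can be sketched in a few lines of pandas; these are hypothetical helpers for illustration only:

```python
import numpy as np
import pandas as pd

def truncate(s, lower, upper):
    # truncation transformer: drop entries with index outside [lower, upper]
    return s[(s.index >= lower) & (s.index <= upper)]

def pad(s, length, value=0.0, where="end"):
    # padding transformer: extend with `value` at start or end to desired length
    extra = pd.Series([value] * (length - len(s)))
    parts = [s, extra] if where == "end" else [extra, s]
    return pd.concat(parts).reset_index(drop=True)

def impute_na(s, strategy="last"):
    # NA imputer: 'last' -> forward fill, 'next' -> backward fill
    return s.ffill() if strategy == "last" else s.bfill()

s = pd.Series([1.0, np.nan, 3.0, 4.0], index=[0, 1, 2, 3])
truncated = truncate(s, 1, 2)
padded = pad(s, 6)
filled = impute_na(s, strategy="last")
```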

unfitted, single-series, reduction

  • interpolation transformer

Behaviour:
uses a scikit-learn regressor or classifier to interpolate to the specified index set.
Fits series values against series index, and uses the regressor/classifier to predict value from index

Hyper-parameters:
index set
estimator - sklearn regressor

  • Supervised NA imputer

Behaviour:
Fills in NA values by the specified strategy by using a scikit-learn regressor or classifier. Fits non-NA series values against series index, and uses the regressor/classifier to predict value from index

Hyper-parameters:
strategy - 'nearest', 'last' , 'next', 'pw_linear'

  • advanced: exogenous or multi-column versions
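The reduction idea above can be sketched as follows: fit a scikit-learn regressor on (index, value) pairs and predict values at a new index set, which also doubles as a supervised NA imputer. `interpolate_with_regressor` is a hypothetical helper, and `KNeighborsRegressor` is just one possible choice of estimator:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def interpolate_with_regressor(index, values, new_index, estimator=None):
    # fit series values against series index, predict value from index
    est = estimator or KNeighborsRegressor(n_neighbors=2)
    values = np.asarray(values, dtype=float)
    mask = ~np.isnan(values)              # ignoring NAs makes this an imputer too
    est.fit(np.asarray(index)[mask].reshape(-1, 1), values[mask])
    return est.predict(np.asarray(new_index).reshape(-1, 1))

idx = np.arange(10.0)
vals = idx * 2.0
vals[4] = np.nan                          # a missing value to fill in
filled = interpolate_with_regressor(idx, vals, [4.0])
```

Here the two nearest non-NA neighbours (indices 3 and 5, values 6 and 10) are averaged, filling the gap with 8.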

unfitted, multiple-series

Note: the below are "unfitted" since they run on the entire series

  • index homogenization transformer
    Behaviour:
    Looks up the indices for all the series and introduces them for all the series. Fills in values at new nodes by the specified strategy.

Hyper-parameters:
strategy - 'NA', 'nearest', 'last', 'next'

design questions

  • would it make sense to create an "interpolator" class which in predict takes a series and an index sequence and returns the values?
  • does it make sense to expose dedicated index parameter interfaces in some of the above?

TSC/TSR: sklearn-like grid search tuning wrapper for TSC/TSR

wrapper which implements grid search tuning for TSC/TSR methods

the interface should be exactly like GridSearchCV in sklearn, except for the TSC/TSR use case, i.e.:

  • wrapper method with constructor initialization
  • lazy use of data, only in tuned method's fit/predict

Conditional on #3 and a "proper" hyper-parameter interface in #2, since it calls evaluation per hyper-parameter choice in the grid
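A minimal sketch of such a wrapper, assuming tabular features for simplicity; `grid_search` is a hypothetical stand-in for the proposed wrapper, built on sklearn's `cross_val_score`:

```python
import itertools
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def grid_search(estimator, param_grid, X, y, cv=3):
    # lazy use of data: only touched here, at tuning time, as in GridSearchCV
    best_score, best_params = -np.inf, None
    keys = list(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = cross_val_score(clone(estimator).set_params(**params),
                                X, y, cv=cv).mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 12)) + y[:, None]   # flat-feature stand-in for series
params, score = grid_search(KNeighborsClassifier(),
                            {"n_neighbors": [1, 3, 5]}, X, y)
```

A real wrapper would additionally mirror GridSearchCV's constructor-initialization pattern, i.e. take the estimator and grid in `__init__` and expose `fit`/`predict`.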

Workshop Project 1: Deep Learning for Time Series

Not sure if this is the best way to structure this; I'm not that familiar with Git, so advise if there is a better way. The first project is to integrate sktime with keras so that we can reproduce existing research. I've added some suggested tasks on the project following a MoSCoW approach, but please adapt as you see fit. I am keen that we test early on that it all installs properly on Windows, though.

TSC/TSR: orchestration/benchmarking framework

Workflow automation, method evaluation, benchmarking and post-hoc analyses.
Should be straightforward (but slightly tedious) by adapting the mlaut framework interface.
Should be straightforward (but slightly tedious) by adapting mlaut framework interface

Conditional on evaluation framework #3 which in turn is conditional on prediction interface #2

Workshop Project 1: Deep Learning for Time Series Classification

Not sure if this is the best way to structure this; I'm not that familiar with Git, so advise if there is a better way. The first project is to integrate sktime with keras so that we can reproduce existing research. I've added some suggested tasks on the project following a MoSCoW approach, but please adapt as you see fit. I am keen that we test early on that it all installs properly on Windows, though.

naming conventions

I will by default use camel case for classifier names etc. This is just my habit. Can we formalise some naming conventions to help me please? Probably best to copy sklearn?

TSC/TSR: Implement/interface favourite UEA algorithms

Not necessarily in sktime unified interface.
Priority is scalable/efficient and robust implementation which can be interfaced.

Care needs to be taken with writing code in a way such that it includes:

  • a fitting method, to data, which returns a stored model
  • a prediction method, which takes a model and returns predictions given the feature
  • a method interface which explicitly exposes hyper-parameters (e.g., per declaration or as a return dictionary)

load data functionality

currently, loading data is performed in utils.load_data and works for the .ts format. It would be good to bring back the methods to load from ARFF and from long format and add them as methods here

recreate wiki

I've been writing these in issues now.

A nicer (and more persistent) place should be created in the wiki for this, potentially updated after subsequent discussions (mirroring the status quo).

This is to provide a solid source for whenever we wish to write a paper/publication or manual.

(maybe best to assign to me, but let's discuss next meeting)

Design/implementation of time series data container (pandas vs xpandas ... or sth else?)

As @sajaysurya pointed out, we may not at all need xpandas for the core use cases.

This is a high-priority issue to be decided by the w/c Feb 4 meeting, since the data container is a central design decision. This thread is to collect pros/cons until the decision is made. The issue is complete once we decide to remain with, or leave, xpandas as the data container solution for the API (note: in the case of leaving, the issue is only complete once the alternative is implemented in the existing code).

Proximity Forest

thread to discuss the proximity forest implementation. Contributors

  1. George Oastler (UEA)
  2. Jason Lines (UEA)
  3. Francois Petitjean (Monash)
  4. Ahmed Shifas (Monash)

Design/implement tuning for classical forecasting

  • For reduction strategies from classical forecasting to time series regression, tuning is already covered by sklearn's GridSearchCV class.
  • For classical forecasters, like ARIMA, we need low-level tuning meta-estimator(s) similar to GridSearchCV and other available tuning strategies.
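The low-level tuning idea can be sketched with a hand-rolled temporal grid search; `seasonal_naive_forecast` and `tune_sp` are hypothetical illustrations, and a real meta-estimator would use proper backtesting splits rather than a single holdout:

```python
import numpy as np

def seasonal_naive_forecast(y_train, sp, horizon):
    # repeat the last seasonal cycle of length sp for `horizon` steps
    return np.resize(y_train[-sp:], horizon)

def tune_sp(y, sp_grid, horizon=6):
    # temporal (no shuffling!) holdout: last `horizon` points are the test set
    y_train, y_test = y[:-horizon], y[-horizon:]
    def mae(sp):
        return np.mean(np.abs(seasonal_naive_forecast(y_train, sp, horizon) - y_test))
    return min(sp_grid, key=mae)

t = np.arange(60)
y = 10 + 5 * np.sin(2 * np.pi * t / 12)      # period-12 seasonal series
best = tune_sp(y, sp_grid=[4, 6, 12])
```

On the period-12 series above, the grid search recovers sp=12.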

TSC/TSR: evaluation functionality and interface to losses/metrics

implementation of rudimentary evaluation functionality, should include:

  • computation of methods' predictions on test set, after training on training set
  • computation of common average and aggregate losses/scores for classification and regression
  • interface to metrics/scoring functionality as in sklearn
  • at first, preliminary design of the above

Note: not full-blown orchestration and experiment management

Conditional on #2, since it calls the estimator interface, which needs to be consolidated first.
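A preliminary sketch of this evaluation loop, with sklearn metrics passed in as a dictionary; `evaluate` is a hypothetical helper, not a proposed API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate(estimator, X_train, y_train, X_test, y_test, metrics):
    # train on the training set, predict on the test set, report all metrics
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    return {name: fn(y_test, y_pred) for name, fn in metrics.items()}

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5)) + 2 * y[:, None]
results = evaluate(LogisticRegression(), X[::2], y[::2], X[1::2], y[1::2],
                   {"accuracy": accuracy_score, "f1": f1_score})
```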

TimeSeriesForest implementation

To consider in the first instance

  1. basic structure (Tony approach vs Marcus approach)
  2. handling missing values (transform vs bespoke)
  3. Classifying multivariate data (independent vs dependent).

Point 1 is obviously fundamental. To summarise, this is how I see the differences: I have done a minimal scikit-learn implementation where the intervals and summary stats are hard-coded, but the base classifier is configurable through the constructor. Marcus has implemented it with an internal Pipeline, where the transforms are externally configurable through the constructor, but the classifier is hard-coded. I have inherited from RandomForest directly, whereas Marcus has cloned and adapted the base classifiers and the ensembles.

My reasoning is that, whilst a generic configurable TSClassifier with a pipeline is desirable, it should be the base classifier. If we implement a classifier called TSF that directly refers to a paper describing the algorithm, then we should match the paper description. With a general-purpose pipeline, users could build a classifier that is called TSF but is in fact something completely different. I also think transformers may be an over-design for this very simple algorithm, and I would like to test what overhead (if any) they introduce.

Obviously, I prefer mine :) I am of course more than happy to discuss and to go the other way if that is the consensus.

write extension guideline

needs to contain:

  • description of folder structure
  • description of necessary API elements to implement, inheritance
  • for atomic transformers (series-to-tabular, series-to-series; one-series, multi-row)
  • and for atomic classifiers and regressors
  • low and high level interface

Should cython be compiled on the fly?

Reminder for @mloning and me. Cython is currently a dependency, but if you're using non-Cython stuff you still have to compile it (which we know can be very painful). I've seen Cython compiled on the fly, so maybe that should be the deal for ease of use?

GitFlow for UEA PhD mini-projects

This is arising from #9 and #10, and @TonyBagnall 's suggestion of sub-projects.

I propose to have one branch per algorithm in development, then @jasonlines @TonyBagnall to review & approve pull requests into dev.

Collaborators to be added here - I think with the golden Turing subscription to GitHub we have fine-grained control over credentials etc.

[BUG] load_from_ucr_tsv_to_dataframe yields pd.Series whose indices start at 1

Describe the bug

When using sktime.utils.load_data.load_from_ucr_tsv_to_dataframe to load one of UCR's TSV files as a DataFrame whose cells contain Series, the indices of those Series start at 1, as opposed to 0, which is what one would expect.

This leads to problems when fitting the sktime.classifiers.elastic_ensemble.ElasticEnsemble on such a DataFrame since that estimator (as well as some utility components employed by the estimator) expect the Series indices to start at 0.

I have no idea why sktime.utils.load_data.load_from_ucr_tsv_to_dataframe creates 1-indexed Series. For now, I'm working around the problem by manually resetting the indices of all loaded Series after import.

I haven't checked whether other means of loading the data also behave this way.

To Reproduce

from sktime.utils.load_data import load_from_ucr_tsv_to_dataframe
from sktime.classifiers.elastic_ensemble import ElasticEnsemble

X_train, y_train = load_from_ucr_tsv_to_dataframe("GunPoint_TRAIN.tsv")

# what follows is the workaround
#X_train = X_train.applymap(lambda series: series.reset_index(drop=True))

elastic_ensemble = ElasticEnsemble()
elastic_ensemble.fit(X_train, y_train)

Expected behavior

To reiterate, load_from_ucr_tsv_to_dataframe should yield a DataFrame containing Series whose indices always start at 0.

Versions

Linux-4.19.45-1-MANJARO-x86_64-with-arch-Manjaro-Linux

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
NumPy 1.16.4
SciPy 1.2.1
sktime 0.2.0

Prioritize transformers for implementation or interfacing

for supporting #6.

Triaging based on favourite list.
To be put against existing implementations of transformers in tslearn, tsfresh, pyts, numpy.
Make decision of implement/interface/leave it (with priority perhaps).
To add a time estimate for implement/interface.

Create user documentation

  1. Use sphinx with doc strings
  2. Settle on convention how to write doc strings
  3. Disseminate info of how to write doc strings that conforms to it

How to handle missing values

  • transformers for distribution of missing values (see #6)
  • adapting existing algorithms to internally handle missing values if possible
  • add details about missing values in meta-data
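These three directions can be illustrated in pandas terms; a hypothetical sketch, assuming pandas-based series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mask = s.isna()                          # a transformer exposing the NA pattern
meta = {"n_missing": int(mask.sum()),    # meta-data about missing values
        "frac_missing": float(mask.mean())}
cleaned = s.dropna()                     # internal handling: drop (or impute)
```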
