sktime / sktime

A unified framework for machine learning with time series

Home Page: https://www.sktime.net

License: BSD 3-Clause "New" or "Revised" License

Python 99.64% Shell 0.02% Makefile 0.04% Dockerfile 0.02% Jupyter Notebook 0.27%
time-series machine-learning scikit-learn time-series-classification time-series-regression forecasting time-series-analysis data-science data-mining hacktoberfest

sktime's Introduction

Welcome to sktime

A unified interface for machine learning with time series

🚀 Version 0.28.1 out now! Check out the release notes here.

sktime is a library for time series analysis in Python. It provides a unified interface for multiple time series learning tasks. Currently, this includes time series classification, regression, clustering, annotation, and forecasting. It comes with time series algorithms and scikit-learn compatible tools to build, tune and validate time series models.

Overview
Open Source: BSD 3-clause
Tutorials: Binder · YouTube
Community: Discord · Slack
CI/CD: GitHub Actions · Codecov · Read the Docs
Code: PyPI · conda · Python versions · black
Downloads: PyPI download counts
Citation: Zenodo

📚 Documentation

Documentation
⭐ Tutorials New to sktime? Here's everything you need to know!
📋 Binder Notebooks Example notebooks to play with in your browser.
👩‍💻 User Guides How to use sktime and its features.
✂️ Extension Templates How to build your own estimator using sktime's API.
🎛️ API Reference The detailed reference for sktime's API.
📺 Video Tutorial Our video tutorial from 2021 PyData Global.
🛠️ Changelog Changes and version history.
🌳 Roadmap sktime's software and community development plan.
📝 Related Software A list of related software.

💬 Where to ask questions

Questions and feedback are extremely welcome! We strongly believe in the value of sharing help publicly, as it allows a wider audience to benefit from it.

Type Platforms
πŸ› Bug Reports GitHub Issue Tracker
✨ Feature Requests & Ideas GitHub Issue Tracker
πŸ‘©β€πŸ’» Usage Questions GitHub Discussions Β· Stack Overflow
πŸ’¬ General Discussion GitHub Discussions
🏭 Contribution & Development dev-chat channel · Discord
🌐 Meet-ups and collaboration sessions Discord - Fridays 4 pm UTC, dev/meet-ups channel

💫 Features

Our objective is to enhance the interoperability and usability of the time series analysis ecosystem in its entirety. sktime provides a unified interface for distinct but related time series learning tasks. It features dedicated time series algorithms and tools for composite model building such as pipelining, ensembling, tuning, and reduction, empowering users to apply an algorithm designed for one task to another.

sktime also provides interfaces to related libraries, for example scikit-learn, statsmodels, tsfresh, PyOD, and fbprophet, among others.

Module Status Links
Forecasting stable Tutorial · API Reference · Extension Template
Time Series Classification stable Tutorial · API Reference · Extension Template
Time Series Regression stable API Reference
Transformations stable Tutorial · API Reference · Extension Template
Parameter fitting maturing API Reference · Extension Template
Time Series Clustering maturing API Reference · Extension Template
Time Series Distances/Kernels maturing Tutorial · API Reference · Extension Template
Time Series Alignment experimental API Reference · Extension Template
Annotation experimental Extension Template
Time Series Splitters maturing Extension Template
Distributions and simulation experimental

⏳ Install sktime

For troubleshooting and detailed installation instructions, see the documentation.

  • Operating system: macOS · Linux · Windows 8.1 or higher
  • Python version: Python 3.8, 3.9, 3.10, 3.11, and 3.12 (64-bit only)
  • Package managers: pip · conda (via conda-forge)

pip

Using pip, sktime releases are available as source packages and binary wheels. Available wheels are listed here.

pip install sktime

or, with maximum dependencies,

pip install sktime[all_extras]

For curated sets of soft dependencies for specific learning tasks:

pip install sktime[forecasting]  # for selected forecasting dependencies
pip install sktime[forecasting,transformations]  # forecasters and transformers

or similar. Valid sets are:

  • forecasting
  • transformations
  • classification
  • regression
  • clustering
  • param_est
  • networks
  • annotation
  • alignment

Caveat: in general, not all soft dependencies for a learning task are installed, only a curated selection.

conda

You can also install sktime from conda via the conda-forge channel. The feedstock including the build recipe and configuration is maintained in this conda-forge repository.

conda install -c conda-forge sktime

or, with maximum dependencies,

conda install -c conda-forge sktime-all-extras

(as conda does not support dependency sets, flexible choice of soft dependencies is unavailable via conda)

⚡ Quickstart

Forecasting

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.theta import ThetaForecaster
from sktime.split import temporal_train_test_split
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

y = load_airline()
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = ThetaForecaster(sp=12)  # monthly seasonal periodicity
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
mean_absolute_percentage_error(y_test, y_pred)
>>> 0.08661467738190656

Time Series Classification

from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_arrow_head()
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = TimeSeriesForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
>>> 0.8679245283018868

👋 How to get involved

There are many ways to join the sktime community. We follow the all-contributors specification: all kinds of contributions are welcome - not just code.

Documentation
πŸ’ Contribute How to contribute to sktime.
πŸŽ’ Mentoring New to open source? Apply to our mentoring program!
πŸ“… Meetings Join our discussions, tutorials, workshops, and sprints!
πŸ‘©β€πŸ”§ Developer Guides How to further develop sktime's code base.
🚧 Enhancement Proposals Design a new feature for sktime.
πŸ… Contributors A list of all contributors.
πŸ™‹ Roles An overview of our core community roles.
πŸ’Έ Donate Fund sktime maintenance and development.
πŸ›οΈ Governance How and by whom decisions are made in sktime's community.

πŸ† Hall of fame

Thanks to all our community for all your wonderful contributions, PRs, issues, ideas.


💡 Project vision

  • By the community, for the community -- developed by a friendly and collaborative community.
  • The right tool for the right task -- helping users to diagnose their learning problem and suitable scientific model types.
  • Embedded in state-of-art ecosystems and provider of interoperable interfaces -- interoperable with scikit-learn, statsmodels, tsfresh, and other community favorites.
  • Rich model composition and reduction functionality -- build tuning and feature extraction pipelines, solve forecasting tasks with scikit-learn regressors.
  • Clean, descriptive specification syntax -- based on modern object-oriented design principles for data science.
  • Fair model assessment and benchmarking -- build your models, inspect your models, check your models, and avoid pitfalls.
  • Easily extensible -- easy extension templates to add your own algorithms compatible with sktime's API.

sktime's People

Contributors

achieveordie, aiwalter, benheid, chrisholder, ciaran-g, danbartl, dependabot[bot], eenticott-shell, fkiraly, goastler, guzalbulatova, hazrulakmal, jasonlines, jesellier, khrapovs, kishmanani, lmmentel, ltsaprounis, matthewmiddlehurst, miraep8, mloning, patrickzib, prockenschaub, rnkuhns, sajaysurya, samialavi, thayeylolu, tonybagnall, viktorkaz, yarnabrina


sktime's Issues

Problem with load_gunpoint_dataframe

Thank you for sharing this project !

I was testing load_gunpoint_dataframe and got str as the type of X_train.

I used X_train, y_train = load_gunpoint_dataframe(split='TRAIN', return_X_y=True).

Thanks

tests should not rely on the internet

multiple integration tests rely on downloading data from the internet, which makes connection problems to their source (more specifically, timeseriesclassification.com) a potential failure risk. I suggest removing this implicit outside dependency for cleanliness.

Design/implement multivariate ensembles and composites

I thought it worth spelling out the structure of how people can use and implement classification algorithms in sktime. The balance is between modularity and efficiency. This summarises what we have, but I think it is worth it as we try to widen the development base.

Modular approach:
I propose two classifiers as pipelines
TimeSeriesEnsemble (or time_series_ensemble? need to resolve naming conventions. Another ticket!). This pipeline is a set of transforms and an estimator/classifier as the last element. The logic is that the transforms are applied independently to each member of the ensemble, and the final estimator is the base classifier for the ensemble. The transformers are not applied sequentially. They are applied independently to the data and the results are concatenated. They must be Randomizable, so that a different transform is applied for each ensemble member. The transformers and the base classifier should be seedable for reproducability.

Compiling cython under linux: undefined symbol / no such file or directory

Error: fatal error: Python.h: No such file or directory
OR
Error: .so : undefined symbol: _Py_ZeroStruct

Both are caused by incorrect packages/setup under Linux. You need the python3-dev package installed to fix the former issue, and make sure setup.py is called with python3 rather than python2. To build the Cython extensions, run: python3 setup.py build_ext -i

Cython should then successfully build '.so' and '.c' files

TSC/TSR: implement pipeline building functionality with transformers on y

xpandas provides functionality for fusing series-to-tabular transformers with tabular supervised learning methods.

This should be interfaced, or replicated in the more general interface.

In this interface, it should be possible to chain series-to-series transformers (e.g., truncation) with series-to-tabular transformers to obtain a series-to-tabular transformer, or series-to-series with series-to-tabular with a tabular SL method to obtain a TSC/TSR method.

Conditional on a consolidated interface design as in #5 and #6.

If easy interfacing is not possible, it may also require separate pipeline design, in this case please raise issue of pipeline API design.
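As a rough sketch of the chaining described above, a series-to-series transformer (truncation) can be composed with a series-to-tabular transformer (summary statistics) to feed a tabular supervised learner; `truncate` and `summary_stats` here are hypothetical helpers for illustration, not an actual interface:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def truncate(X, lower, upper):
    # series-to-series: cut each series to the index window [lower, upper)
    return [s[lower:upper] for s in X]

def summary_stats(X):
    # series-to-tabular: one feature row per series
    return np.array([[np.mean(s), np.std(s), np.min(s), np.max(s)] for s in X])

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = [rng.normal(loc=c, size=30) for c in y]    # toy series with class-shifted means

# series-to-series ∘ series-to-tabular -> tabular features for a TSC method
features = summary_stats(truncate(X, 5, 25))
clf = LogisticRegression().fit(features, y)
train_acc = clf.score(features, y)
```

Chaining two series-to-series steps, or stopping before the tabular learner, gives the other compositions mentioned above.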

Implement important series-to-series transformers

Some series-to-series transformers that would be useful.

unfitted, single-series, simple

  • binning/aggregation transformer

Behaviour:
returns the sequence of [aggregator application] (e.g., count) in the bins. Index is start time, end time, or index (from start) of bin, depending on index hyper-parameter

Hyper-parameters:
bin specs - start: time/index, end : time/index, numbins : integer
index - 'start', 'end', or 'bin'
aggregator - function to apply to values within bin, default = count

alternative to bin specs: index sequence

  • truncation transformer

Behaviour:
cuts off any entry in the sequence with index outside [lower, upper]

Hyper-parameters:
lower, upper : time

  • simple equal spacing transformer

Behaviour:
inter-/extrapolates series to the nodes by the specified strategy, e.g., fill in nearest or next (careful with boundaries)

Hyper-parameters:
node specs - start: time/index, end : time/index, numsteps : integer
index - 'start', 'end', or 'bin'
strategy - 'nearest', 'last' , 'next', 'pw_linear'

alternative to node specs: index sequence

  • re-indexing transformer

Behaviour:
changes the index by the strategy indicated in the reindexing parameter
integer = replace with ascending count
field = get from data frame column

Hyper-parameters:
strategy - 'integer', 'field'

  • index extractor transformer

Behaviour:
creates a series from the index of the series

  • NA remover transformer

Behaviour:
removes sequence elements that are numpy.nan

  • padding transformer

Behaviour:
pads a sequence/series with value at start or end until it has the desired length

Hyper-parameters:
where - 'start', 'end'
what - value
length - integer
optional: index treatment

  • NA imputer

Behaviour:
Fills in NA values by the specified strategy

Hyper-parameters:
strategy - 'nearest', 'last' , 'next', 'pw_linear'
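Three of the simple transformers above can be sketched in a few lines of pandas; these are hypothetical helpers for illustration only:

```python
import numpy as np
import pandas as pd

def truncate(s, lower, upper):
    # truncation transformer: drop entries with index outside [lower, upper]
    return s[(s.index >= lower) & (s.index <= upper)]

def pad(s, length, value=0.0, where="end"):
    # padding transformer: extend with `value` at start or end to desired length
    extra = pd.Series([value] * (length - len(s)))
    parts = [s, extra] if where == "end" else [extra, s]
    return pd.concat(parts).reset_index(drop=True)

def impute_na(s, strategy="last"):
    # NA imputer: 'last' -> forward fill, 'next' -> backward fill
    return s.ffill() if strategy == "last" else s.bfill()

s = pd.Series([1.0, np.nan, 3.0, 4.0], index=[0, 1, 2, 3])
truncated = truncate(s, 1, 2)
padded = pad(s, 6)
filled = impute_na(s, strategy="last")
```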

unfitted, single-series, reduction

  • interpolation transformer

Behaviour:
uses a scikit-learn regressor or classifier to interpolate to the specified index set.
Fits series values against series index, and uses the regressor/classifier to predict value from index

Hyper-parameters:
index set
estimator - sklearn regressor

  • Supervised NA imputer

Behaviour:
Fills in NA values by the specified strategy by using a scikit-learn regressor or classifier. Fits non-NA series values against series index, and uses the regressor/classifier to predict value from index

Hyper-parameters:
strategy - 'nearest', 'last' , 'next', 'pw_linear'

  • advanced: exogenous or multi-column versions
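The reduction idea above can be sketched as follows: fit a scikit-learn regressor on (index, value) pairs and predict values at a new index set, which also doubles as a supervised NA imputer. `interpolate_with_regressor` is a hypothetical helper, and `KNeighborsRegressor` is just one possible choice of estimator:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def interpolate_with_regressor(index, values, new_index, estimator=None):
    # fit series values against series index, predict value from index
    est = estimator or KNeighborsRegressor(n_neighbors=2)
    values = np.asarray(values, dtype=float)
    mask = ~np.isnan(values)              # ignoring NAs makes this an imputer too
    est.fit(np.asarray(index)[mask].reshape(-1, 1), values[mask])
    return est.predict(np.asarray(new_index).reshape(-1, 1))

idx = np.arange(10.0)
vals = idx * 2.0
vals[4] = np.nan                          # a missing value to fill in
filled = interpolate_with_regressor(idx, vals, [4.0])
```

Here the two nearest non-NA neighbours (indices 3 and 5, values 6 and 10) are averaged, filling the gap with 8.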

unfitted, multiple-series

Note: the below are "unfitted" since they run on the entire series

  • index homogenization transformer
    Behaviour:
    Looks up the indices for all the series and introduces them for all the series. Fills in values at new nodes by the specified strategy.

Hyper-parameters:
strategy - 'NA', 'nearest', 'last', 'next'

design questions

  • would it make sense to create an "interpolator" class which in predict takes a series and an index sequence and returns the values?
  • does it make sense to expose dedicated index parameter interfaces in some of the above?

TSC/TSR: sklearn-like grid search tuning wrapper for TSC/TSR

wrapper which implements grid search tuning for TSC/TSR methods

the interface should be exactly like GridSearchCV in sklearn, except for the TSC/TSR use case, i.e.:

  • wrapper method with constructor initialization
  • lazy use of data, only in tuned method's fit/predict

Conditional on #3 and a "proper" hyper-parameter interface in #2, since it calls evaluation per hyper-parameter choice in the grid
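A minimal sketch of such a wrapper, assuming tabular features for simplicity; `grid_search` is a hypothetical stand-in for the proposed wrapper, built on sklearn's `cross_val_score`:

```python
import itertools
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def grid_search(estimator, param_grid, X, y, cv=3):
    # lazy use of data: only touched here, at tuning time, as in GridSearchCV
    best_score, best_params = -np.inf, None
    keys = list(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = cross_val_score(clone(estimator).set_params(**params),
                                X, y, cv=cv).mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 12)) + y[:, None]   # flat-feature stand-in for series
params, score = grid_search(KNeighborsClassifier(),
                            {"n_neighbors": [1, 3, 5]}, X, y)
```

A real wrapper would additionally mirror GridSearchCV's constructor-initialization pattern, i.e. take the estimator and grid in `__init__` and expose `fit`/`predict`.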

Workshop Project 1: Deep Learning for Time Series

Not sure if this is the best way to structure this; I'm not that familiar with Git, so advise if there is a better way. The first project is to integrate sktime with keras so that we can reproduce existing research. I've added some suggested tasks on the project following a MoSCoW approach, but please adapt as you see fit. I am keen that we test early on that it all installs properly on Windows, though.

TSC/TSR: orchestration/benchmarking framework

Workflow automation, method evaluation, benchmarking and post-hoc analyses.
Should be straightforward (but slightly tedious) by adapting the mlaut framework interface.
Should be straightforward (but slightly tedious) by adapting mlaut framework interface

Conditional on evaluation framework #3 which in turn is conditional on prediction interface #2

Workshop Project 1: Deep Learning for Time Series Classification

Not sure if this is the best way to structure this; I'm not that familiar with Git, so advise if there is a better way. The first project is to integrate sktime with keras so that we can reproduce existing research. I've added some suggested tasks on the project following a MoSCoW approach, but please adapt as you see fit. I am keen that we test early on that it all installs properly on Windows, though.

naming conventions

I will by default use camel case for classifier names etc. This is just my habit. Can we formalise some naming conventions to help me please? Probably best to copy sklearn?

TSC/TSR: Implement/interface favourite UEA algorithms

Not necessarily in sktime unified interface.
Priority is scalable/efficient and robust implementation which can be interfaced.

Care needs to be taken with writing code in a way such that it includes:

  • a fitting method, to data, which returns a stored model
  • a prediction method, which takes a model and returns predictions given the feature
  • a method interface which explicitly exposes hyper-parameters (e.g., per declaration or as a return dictionary)

load data functionality

currently, loading data is performed in utils.load_data and works for the .ts format. It would be good to bring back the methods to load from ARFF and from long format and add them as methods here

recreate wiki

I've been writing these in issues now.

A nicer (and more persistent) place should be created in the wiki for this, potentially updated after subsequent discussions (mirroring the status quo).

This is to provide a solid source for whenever we wish to write a paper/publication or manual.

(maybe best to assign to me, but let's discuss next meeting)

Design/implementation of time series data container (pandas vs xpandas ... or sth else?)

As @sajaysurya pointed out, we may not at all need xpandas for the core use cases.

This is a high-priority issue to be decided by the w/c Feb 4 meeting, since the data container is a central design decision. This thread is to collect pros/cons until the decision is made. The issue is complete once we decide to remain with, or leave, xpandas as the data container solution for the API (note: in the case of leaving, the issue is only complete once the alternative is implemented in the existing code).

Proximity Forest

thread to discuss the proximity forest implementation. Contributors

  1. George Oastler (UEA)
  2. Jason Lines (UEA)
  3. Francois Petitjean (Monash)
  4. Ahmed Shifas (Monash)

Design/implement tuning for classical forecasting

  • For reduction strategies from classical forecasting to time series regression, tuning is already covered by sklearn's GridSearchCV class.
  • For classical forecasters, like ARIMA, we need low-level tuning meta-estimator(s) similar to GridSearchCV and other available tuning strategies.
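The low-level tuning idea can be sketched with a hand-rolled temporal grid search; `seasonal_naive_forecast` and `tune_sp` are hypothetical illustrations, and a real meta-estimator would use proper backtesting splits rather than a single holdout:

```python
import numpy as np

def seasonal_naive_forecast(y_train, sp, horizon):
    # repeat the last seasonal cycle of length sp for `horizon` steps
    return np.resize(y_train[-sp:], horizon)

def tune_sp(y, sp_grid, horizon=6):
    # temporal (no shuffling!) holdout: last `horizon` points are the test set
    y_train, y_test = y[:-horizon], y[-horizon:]
    def mae(sp):
        return np.mean(np.abs(seasonal_naive_forecast(y_train, sp, horizon) - y_test))
    return min(sp_grid, key=mae)

t = np.arange(60)
y = 10 + 5 * np.sin(2 * np.pi * t / 12)      # period-12 seasonal series
best = tune_sp(y, sp_grid=[4, 6, 12])
```

On the period-12 series above, the grid search recovers sp=12.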

TSC/TSR: evaluation functionality and interface to losses/metrics

implementation of rudimentary evaluation functionality, should include:

  • computation of methods' predictions on test set, after training on training set
  • computation of common average and aggregate losses/scores for classification and regression
  • interface to metrics/scoring functionality as in sklearn
  • at first, preliminary design of the above

Note: not full-blown orchestration and experiment management

Conditional on #2, since it calls the estimator interface, which needs to be consolidated first.
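A preliminary sketch of this evaluation loop, with sklearn metrics passed in as a dictionary; `evaluate` is a hypothetical helper, not a proposed API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate(estimator, X_train, y_train, X_test, y_test, metrics):
    # train on the training set, predict on the test set, report all metrics
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    return {name: fn(y_test, y_pred) for name, fn in metrics.items()}

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5)) + 2 * y[:, None]
results = evaluate(LogisticRegression(), X[::2], y[::2], X[1::2], y[1::2],
                   {"accuracy": accuracy_score, "f1": f1_score})
```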

TimeSeriesForest implementation

To consider in the first instance

  1. basic structure (Tony approach vs Marcus approach)
  2. handling missing values (transform vs bespoke)
  3. Classifying multivariate data (independent vs dependent).

Point 1 is obviously fundamental. To summarise, this is how I see the differences: I have done a minimal scikit-learn implementation where the intervals and summary stats are hard-coded, but the base classifier is configurable through the constructor. Marcus has implemented it with an internal Pipeline, where the transforms are externally configurable through the constructor, but the classifier is hard-coded. I have inherited from RandomForest directly, whereas Marcus has cloned and adapted the base classifiers and the ensembles.

My reasoning is that, whilst a generic configurable TSClassifier with a pipeline is desirable, it should be the base classifier. If we implement a classifier called TSF that directly refers to a paper describing the algorithm, then we should match the paper description. With a general-purpose pipeline, users could build a classifier that is called TSF but is in fact something completely different. I also think transformers may be an over-design for this very simple algorithm, and I would like to test what overhead (if any) they introduce.

Obviously, I prefer mine :) I am of course more than happy to discuss and to go the other way if that is the consensus.

write extension guideline

needs to contain:

  • description of folder structure
  • description of necessary API elements to implement, inheritance
  • for atomic transformers (series-to-tabular, series-to-series; one-series, multi-row)
  • and for atomic classifiers and regressors
  • low and high level interface

Should cython be compiled on the fly?

Reminder for @mloning and me. Cython is currently a dependency, but if you're using non-Cython stuff you still have to compile it (which we know can be very painful). I've seen Cython compiled on the fly, so maybe that should be the deal for ease of use?

GitFlow for UEA PhD mini-projects

This is arising from #9 and #10, and @TonyBagnall 's suggestion of sub-projects.

I propose to have one branch per algorithm in development, then @jasonlines @TonyBagnall to review & approve pull requests into dev.

Collaborators to be added here - I think with the golden Turing subscription to GitHub we have fine-grained control over credentials etc.

[BUG] load_from_ucr_tsv_to_dataframe yields pd.Series whose indices start at 1

Describe the bug

When using sktime.utils.load_data.load_from_ucr_tsv_to_dataframe to load one of UCR's TSV files as a DataFrame whose cells contain Series, the indices of those Series start at 1, as opposed to 0, which is what one would expect.

This leads to problems when fitting the sktime.classifiers.elastic_ensemble.ElasticEnsemble on such a DataFrame since that estimator (as well as some utility components employed by the estimator) expect the Series indices to start at 0.

I have no idea why sktime.utils.load_data.load_from_ucr_tsv_to_dataframe creates 1-indexed Series. For now, I'm working around the problem by manually resetting the indices of all loaded Series after import.

I haven't checked whether other means of loading the data also behave this way.

To Reproduce

from sktime.utils.load_data import load_from_ucr_tsv_to_dataframe
from sktime.classifiers.elastic_ensemble import ElasticEnsemble

X_train, y_train = load_from_ucr_tsv_to_dataframe("GunPoint_TRAIN.tsv")

# what follows is the workaround
#X_train = X_train.applymap(lambda series: series.reset_index(drop=True))

elastic_ensemble = ElasticEnsemble()
elastic_ensemble.fit(X_train, y_train)

Expected behavior

To reiterate, load_from_ucr_tsv_to_dataframe should yield a DataFrame containing Series whose indices always start at 0.

Versions

Linux-4.19.45-1-MANJARO-x86_64-with-arch-Manjaro-Linux

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
NumPy 1.16.4
SciPy 1.2.1
sktime 0.2.0

Prioritize transformers for implementation or interfacing

for supporting #6.

Triaging based on favourite list.
To be put against existing implementations of transformers in tslearn, tsfresh, pyts, numpy.
Make decision of implement/interface/leave it (with priority perhaps).
To add a time estimate for implement/interface.

Create user documentation

  1. Use sphinx with doc strings
  2. Settle on convention how to write doc strings
  3. Disseminate info of how to write doc strings that conforms to it

How to handle missing values

  • transformers for distribution of missing values (see #6)
  • adapting existing algorithms to internally handle missing values if possible
  • add details about missing values in meta-data
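These three directions can be illustrated in pandas terms; a hypothetical sketch, assuming pandas-based series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mask = s.isna()                          # a transformer exposing the NA pattern
meta = {"n_missing": int(mask.sum()),    # meta-data about missing values
        "frac_missing": float(mask.mean())}
cleaned = s.dropna()                     # internal handling: drop (or impute)
```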
