business-science / pytimetk


Time series easier, faster, more fun. Pytimetk.

Home Page: https://business-science.github.io/pytimetk/

License: MIT License

Python 96.04% JavaScript 3.96%

pandas polars time time-series timeseries timeseries-analysis

pytimetk's Introduction

pytimetk


Please ⭐ us on GitHub (it takes 2 seconds and means a lot).

Introducing pytimetk: Simplifying Time Series Analysis for Everyone

Time series analysis is fundamental in many fields, from business forecasting to scientific research. While the Python ecosystem offers tools like pandas, they can be verbose and are not optimized for every operation, especially complex time-based aggregations and visualizations.

Enter pytimetk. Crafted with a blend of ease-of-use and computational efficiency, pytimetk significantly simplifies the process of time series manipulation and visualization. By leveraging the polars backend, you can experience speed improvements ranging from 3X to a whopping 3500X. Let's dive into a comparative analysis.

Features/Properties	pytimetk	pandas (+matplotlib)
Speed	🚀 3X to 3500X Faster	🐢 Standard
Code Simplicity	🎉 Concise, readable syntax	📜 Often verbose
`plot_timeseries()`	🎨 2 lines, no customization	🎨 16 lines, customization needed
`summarize_by_time()`	🕐 2 lines, 13.4X faster	🕐 6 lines, 2 for-loops
`pad_by_time()`	⛳ 2 lines, fills gaps in timeseries	❌ No equivalent
`anomalize()`	📈 2 lines, detects and corrects anomalies	❌ No equivalent
`augment_timeseries_signature()`	📅 1 line, all calendar features	🕐 29 lines of `dt` extractors
`augment_rolling()`	🏎️ 10X to 3500X faster	🐢 Slow Rolling Operations

As the table shows, pytimetk is not just about speed; it also simplifies your codebase. For example, summarize_by_time() converts a 6-line, double for-loop routine in pandas into a concise 2-line operation. And with the polars engine, you get results 13.4X faster than pandas!

Similarly, plot_timeseries() dramatically streamlines the plotting process, encapsulating what would typically require 16 lines of matplotlib code into a mere 2-line command in pytimetk, without sacrificing customization or quality. And with plotly and plotnine engines, you can create interactive plots and beautiful static visualizations with just a few lines of code.

For calendar features, pytimetk offers augment_timeseries_signature(), which replaces roughly 30 lines of pandas dt extractions. For rolling features, pytimetk offers augment_rolling(), which is 10X to 3500X faster than pandas. It also offers pad_by_time() to fill gaps in your time series data, and anomalize() to detect and correct anomalies.
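To give a sense of what the dt extractions look like by hand, here is a small plain-pandas sketch. The data and the output column names are made up for illustration and are not necessarily the names augment_timeseries_signature() produces.

```python
import pandas as pd

# Made-up dates spanning a month boundary
df = pd.DataFrame({"date": pd.date_range("2023-01-30", periods=4, freq="D")})

# In plain pandas, each calendar feature needs its own extractor line.
df["year"] = df["date"].dt.year
df["quarter"] = df["date"].dt.quarter
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["wday"] = df["date"].dt.dayofweek  # Monday = 0
df["is_month_end"] = df["date"].dt.is_month_end

print(df)
```

A full calendar signature repeats this pattern for every feature, which is where the line count comes from.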

Join the revolution in time series analysis. Reduce your code complexity, increase your productivity, and harness the speed that pytimetk brings to your workflows.

Explore more at our pytimetk homepage.

Installation

Install the latest stable version of pytimetk using pip:

pip install pytimetk

Alternatively, you can install the development version:

pip install git+https://github.com/business-science/pytimetk.git

Quickstart:

The following simple example tests the summarize_by_time function:

import pytimetk as tk
import pandas as pd

df = tk.datasets.load_dataset('bike_sales_sample')
df['order_date'] = pd.to_datetime(df['order_date'])

df \
    .groupby("category_2") \
    .summarize_by_time(
        date_column='order_date', 
        value_column= 'total_price',
        freq = "MS",
        agg_func = ['mean', 'sum']
    )
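For comparison, a rough plain-pandas counterpart of the call above can be sketched with groupby plus pd.Grouper. The small data frame is made up and only borrows the column names from the example; it is not the bike_sales_sample data, and the output layout may differ from what summarize_by_time returns.

```python
import pandas as pd

# Made-up stand-in data; only the column names mirror the example above.
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-01-11"]
    ),
    "category_2": ["Road", "Road", "Road", "Mountain"],
    "total_price": [100.0, 300.0, 50.0, 80.0],
})

# Group by category and by month start ("MS"), then aggregate the value column.
monthly = (
    df
    .groupby(["category_2", pd.Grouper(key="order_date", freq="MS")])["total_price"]
    .agg(["mean", "sum"])
    .reset_index()
)

print(monthly)
```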

Documentation

Get started with the pytimetk documentation

Developers (Contributors): Installation

To install pytimetk using Poetry, follow these steps:

1. Prerequisites

Make sure you have Python 3.9 or later installed on your system.

2. Install Poetry

To install Poetry, you can use the official installer provided by Poetry. Do not use pip.

3. Clone the Repository

Clone the pytimetk repository from GitHub:

git clone https://github.com/business-science/pytimetk

4. Install Dependencies

Use Poetry to install the package and its dependencies:

poetry install

Or you can create a virtualenv with Poetry and then install the dependencies:

poetry shell
poetry install

🏆 More Coming Soon...

We are in the early stages of development, but the potential for pytimetk in Python is already clear. 🐍

Please ⭐ us on GitHub (it takes 2 seconds and means a lot).
To make requests, please use our Project Roadmap GH Issue #2.
Want to contribute? See our contributing guide here.


pytimetk's Issues

Tests: Data Wrangling Functions

Create pytest tests for Data Wrangling Functions. Use ChatGPT to help.

tk.summarize_by_time()
tk.future_frame()
tk.pad_by_time()

Tests: `plot_timeseries()`

Add tests to make sure that plot_timeseries() functions properly.

Augmentor: Exponential (EWM)

#65 Tracks Augmentors discussion

Plot_timeseries - bug

Creating a ticket for known bugs in plot_timeseries.

Removed tests on these until bugs are fixed.

BUG in plotly with "v" direction

Speed Improvement: polars backends

Running checklist of backends: #77 (comment)

Rolling Regression: Improve `tk.augment_rolling()`

tk.augment_rolling(): Upgrade to handle rolling regressions.
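As a rough idea of what a rolling regression computes, the sketch below fits an ordinary least-squares line in each trailing window with NumPy. The helper name rolling_slope is hypothetical and is not part of pytimetk; it only illustrates the calculation, not how augment_rolling() would implement it.

```python
import numpy as np
import pandas as pd


def rolling_slope(y: pd.Series, x: pd.Series, window: int) -> pd.Series:
    """Hypothetical helper: OLS slope of y on x over each trailing window."""
    slopes = np.full(len(y), np.nan)
    for end in range(window, len(y) + 1):
        xs = x.iloc[end - window:end].to_numpy(dtype=float)
        ys = y.iloc[end - window:end].to_numpy(dtype=float)
        # polyfit with deg=1 returns [slope, intercept]
        slopes[end - 1] = np.polyfit(xs, ys, deg=1)[0]
    return pd.Series(slopes, index=y.index)


# Made-up data on an exact line, so every window should recover slope 2
x = pd.Series(np.arange(10, dtype=float))
y = 2.0 * x + 1.0
slopes = rolling_slope(y, x, window=5)

print(slopes)
```

The first window - 1 entries stay missing because a full window is not yet available.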

Timetk Package Philosophy

The timetk for python basics guide is the best place to start learning about the package philosophy: https://business-science.github.io/pytimetk/guides/02_timetk_concepts.html

Function: Integrate `skimpy`

Integrate skimpy: https://aeturrell.github.io/skimpy/

skim
clean_column_names

Applied Tutorial: Finance Investment Analysis

Showcase:

Augment Rolling
Plot timeseries

`tk.plot_timeseries()`

Implement tk.plot_timeseries() similar to R timetk plot_time_series().

Plotnine Implementation
Plotly Implementation

Lead: Matt Dancho & Samuel Macedo

Documentation Instructions - Quarto and Quartodoc

Documentation Instructions

Create package documentation (docstrings)
Use Quarto and Quartodoc to build the Python package documentation

1. Create Package Documentation

The easiest way to create documentation fast is to use Mintlify Doc Writer for Python

IMPORTANT: Quartodoc uses Numpy Docstring Formatting

You can then highlight a function and select "Generate Docstring".

2. Use Quartodoc & Quarto to generate Package Documentation

Make sure Quarto and Quartodoc are installed.

Quarto: https://quarto.org/
Quartodoc: https://machow.github.io/quartodoc/get-started/overview.html

The main commands are:

# Change directory to /docs folder
cd docs 

# Build the documentation
quartodoc build

# Preview the website
quarto preview

You should now see a website on your localhost.

Making Tutorials

We will eventually need to make some tutorials and documentation. We will cover this later, after we create the core timetk functions.

Publishing Changes

You can just make a pull request with any changes. Once I merge I'll publish.
The command is quarto publish gh-pages, which publishes to the gh-pages branch.

Tests: Augment Functions

Need pytest tests for augment functions. (Use ChatGPT to help.)

`pad_by_time()` "auto" frequency fails.

Refer to this thread: #25 (comment)

New function: `apply_by_time`

summarize_by_time = agg + resample: Simple aggregations to only single columns as a series, highly optimized
apply_by_time = apply + resample: More complex aggregations allowing users to access all columns in the data, less optimized
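The distinction can be sketched in plain pandas. The data and the revenue function are made up for illustration, and the time binning in the second case uses pd.Grouper, which is the groupby form of resampling; this is not necessarily how apply_by_time would be implemented.

```python
import pandas as pd

# Made-up data indexed by date
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-02-10"]),
        "price": [10.0, 20.0, 30.0],
        "quantity": [1, 2, 3],
    }
).set_index("date")

# agg + resample: a simple aggregation of one column, returned as a series.
price_sum = df["price"].resample("MS").sum()


# apply + time bins: the function sees every column of each monthly bin.
def revenue(bin_df: pd.DataFrame) -> float:
    return float((bin_df["price"] * bin_df["quantity"]).sum())


monthly_revenue = df.groupby(pd.Grouper(freq="MS")).apply(revenue)

print(price_sum)
print(monthly_revenue)
```

The second form is more flexible because it can combine columns, at the cost of calling a Python function once per bin.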

`tk.ts_summary()`

Create a tk.ts_summary() function similar to timetk in R: https://business-science.github.io/timetk/reference/tk_summary_diagnostics.html

PyTimeTK Roadmap

Phase 1: MVP Package

Develop a minimal package with the most important functions.

Use this guide: https://py-pkgs.org/03-how-to-package-a-python

Priority 1 - Core Data and Data Frame Operations

summarise_by_time() / summarize_by_time()
Data Sets

Priority 2 - Plot Time Series

plot_time_series() - Not sure if we should go with plotly or altair for interactive mode. I feel we should go with plotnine for non-interactive. Will need smooth_vec().

Priority 3 - Data Wrangling

future_frame() - We will also need tk_make_future_timeseries() and tk_make_timeseries()
pad_by_time()
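The idea behind a future frame can be sketched in plain pandas as follows. The helper name extend_dates and its parameters are hypothetical, and the sketch assumes the last observed date already falls on the requested frequency.

```python
import pandas as pd


def extend_dates(df: pd.DataFrame, date_column: str, length_out: int, freq: str) -> pd.DataFrame:
    """Hypothetical helper: append `length_out` future timestamps to df."""
    last = df[date_column].max()
    # date_range includes `last` itself, so request one extra period and drop it
    future = pd.date_range(start=last, periods=length_out + 1, freq=freq)[1:]
    future_df = pd.DataFrame({date_column: future})
    # Columns missing from future_df (such as the values) become NaN
    return pd.concat([df, future_df], ignore_index=True)


# Made-up daily data
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=3, freq="D"),
    "value": [1.0, 2.0, 3.0],
})
extended = extend_dates(df, "date", length_out=2, freq="D")

print(extended)
```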

Priority 4 - Augment Operations

Note - These functions should overwrite columns that are named the same in the input data frame.

tk.augment_timeseries_signature() - tk.get_timeseries_signature()
tk.augment_holiday_signature() - Uses holidays package
tk.augment_lags() / tk.augment_leads()
tk.augment_rolling()
tk.augment_fourier()
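For the lag and lead operations in the list above, the underlying idea can be sketched with pandas shift inside each group. The data and the output column names are made up for illustration; assigning to an existing column name overwrites it, which matches the note about same-named columns.

```python
import pandas as pd

# Made-up grouped data
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "value": [1, 2, 3, 10, 20, 30],
})

# Lag (positive shift) and lead (negative shift), computed within each group
# so that values never leak from one id into another.
df["value_lag_1"] = df.groupby("id")["value"].shift(1)
df["value_lead_1"] = df.groupby("id")["value"].shift(-1)

print(df)
```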

Priority 5 - TS Features

tk.ts_features()

Phase 2: Expand Functionality

Anomalize in Python

Convert Anomalize R package to tk.anomalize()

Time Series Plotting Utilities

Time Series Inspection, Frequency, and Trend

TS Summary: tk.ts_summary()
Time Scale Template
Automatic Frequency Detection
Automatic Trend Detection

Applied Tutorials

Phase 3: Extend Sklearn

Time Series Splitting / Cross Validation Functionality
Preprocessors & Feature Engineering
Vectorized Functions - Box Cox,
Plot Time Series CV

Phase 4: Fill in Function Gaps Where Needed

Add additional functionality that was not identified in Phases 1-3.

Augment Expanding

Add utils.checks.py: Check for common issues with function inputs

Working on this.
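A check of this kind might look like the sketch below. The function name and error messages are hypothetical and shown only to illustrate the idea of validating inputs early.

```python
import pandas as pd


def check_date_column(data: pd.DataFrame, date_column: str) -> None:
    """Hypothetical check: the column must exist and hold datetimes."""
    if date_column not in data.columns:
        raise ValueError(f"`date_column` '{date_column}' not found in data.")
    if not pd.api.types.is_datetime64_any_dtype(data[date_column]):
        raise TypeError(f"`date_column` '{date_column}' is not a datetime dtype.")


# Made-up data where the dates are still strings
df = pd.DataFrame({"date": ["2023-01-01"], "value": [1]})

message = None
try:
    check_date_column(df, "date")
except TypeError as err:
    message = str(err)

print(message)
```

Failing fast with a clear message is usually easier to debug than an error raised deep inside a resample or rolling call.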

Guide: Augment Functions

Add a guide Augment Functions including:

tk.augment_timeseries_signature()
tk.augment_holiday_signature()
tk.augment_lags() and tk.augment_leads()
tk.augment_rolling()

Will go here Guide: Adding Features (Augmenting): https://business-science.github.io/pytimetk/guides/05_augmenting.html

Example Guide (Data Wrangling): https://business-science.github.io/pytimetk/guides/04_wrangling.html

Package Name Change: `pytimetk`

It looks like timetk is taken on PyPI; pytimetk looks available.

Publish Version 0.1.0 on PyPI

`tk.plot_timeseries`: investigate color_column with `plotly` engine

import pytimetk as tk  # formerly timetk, before the package rename

df = tk.load_dataset('m4_monthly', parse_dates = ['date'])

# Plotly Object: Color Column

fig = (
    df
        .plot_timeseries(
            'date', 'value', 
            color_column = 'id',
            smooth = False,
            y_intercept = 0,
            x_axis_date_labels = "%Y",
            engine = 'plotly',
        )
)
fig

Refactor: Use `typing` `Union`

Refactor code to use typing:Union.
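As an illustration of the kind of annotation this refers to, the sketch below accepts either a single column name or a list of names. The helper normalize_columns is hypothetical and not taken from the package.

```python
from typing import List, Union


def normalize_columns(value_column: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: accept one column name or a list of names."""
    if isinstance(value_column, str):
        return [value_column]
    return list(value_column)


print(normalize_columns("total_price"))
print(normalize_columns(["total_price", "quantity"]))
```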

Speed improvements: Pandarallel

Investigate parallel processing:

augment_rolling
augment_rolling_apply

Any other long running functions?

Pandarallel: https://nalepae.github.io/pandarallel/

Applied Tutorial: Sales CRM Database Analysis (showcase `tk.summarize_by_time` and `tk.plot_timeseries()`)

`tk.augment_fourier()`

Implement tk.augment_fourier() similar to how R timetk tk_augment_fourier() and vec_fourier work.

Lead: Justin Kurland
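Fourier features are pairs of sine and cosine terms evaluated on the time index. The sketch below is a minimal version under stated assumptions: the helper name, the column names, and the choice to measure time in days since the first timestamp are all hypothetical, and R timetk may scale the time index differently.

```python
import numpy as np
import pandas as pd


def fourier_features(dates: pd.Series, period: float, order: int) -> pd.DataFrame:
    """Hypothetical helper: sin/cos pairs for harmonics 1..order."""
    # Elapsed time in days since the first timestamp
    t = (dates - dates.min()).dt.total_seconds() / 86400.0
    out = {}
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        out[f"sin_{k}"] = np.sin(angle)
        out[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(out, index=dates.index)


# Made-up daily dates with a weekly period
dates = pd.Series(pd.date_range("2023-01-01", periods=8, freq="D"))
features = fourier_features(dates, period=7, order=2)

print(features)
```

With a period of 7 days, the eighth row repeats the first, which is the cyclical behavior these features are meant to capture.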

Bug: `plot_timeseries` plotly engine - Bollinger Band Example

Getting a weird bug. It occurs only when the color palette contains duplicated colors.

When colors are duplicated

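import pytimetk as tk  # formerly timetk, before the package rename
import pandas as pd

# Parse the 'date' column explicitly so it is a datetime column
stocks_df = tk.load_dataset("stocks_daily", parse_dates = ['date'])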

# Bollinger Bands
bollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \
    .groupby('symbol') \
    .augment_rolling(
        date_column = 'date',
        value_column = 'adjusted',
        window = 20,
        window_func = ['mean', 'std'],
        center = False
    ) \
    .assign(
        upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],
        lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 2*x['adjusted_rolling_std_win_20']
    )


# Visualize
fig = (bollinger_df

    # zoom in on dates
    .query('date >= "2023-01-01"') 

    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    ) 

    # Group on symbol and visualize
    .groupby("symbol") 
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#18BC9C'],
        smooth = False, 
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly" 
    )
)
fig

When colors are not duplicated.

(bollinger_df

    # zoom in on dates
    .query('date >= "2023-01-01"') 

    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    ) 

    # Group on symbol and visualize
    .groupby("symbol") 
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#000000'],
        smooth = False, 
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly" 
    )
)

Tests: `tk.ts_features()`

Write tests for tk.ts_features()

Website: Fix CSS on Mobile

Version 0.2.0

Guide: Data Wrangling

Add guide on Data Wrangling. Cover functions with examples:

summarize_by_time()
future_frame()
pad_by_time()

Will be housed here: https://business-science.github.io/pytimetk/guides/04_wrangling.html

Lead: Lucas O

Function: integrate pyjanitor

More Augment Functions: logarithmic, polynomial, hilbert, wavelet, short fourier

Per @JustinKurland:

There are a lot of opportunities for more augmentation functions:

tk.augment_logarithmic()
tk.augment_polynomial()
tk.augment_hilbert()
tk.augment_wavelet()
tk.augment_short_fourier() <- This is different than the normal fourier transform in that it breaks a signal into smaller segments to provide a time-varying analysis with adjustable time and frequency resolutions.

These are just a few, but all represent further opportunities to add valuable information that has historically been leveraged in the extant time series and signal processing literature.
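To make the short-time Fourier idea concrete, the sketch below splits a signal into overlapping windowed segments and takes the spectrum of each one. The helper name and its parameters are hypothetical, and the signal is synthetic.

```python
import numpy as np


def short_time_fourier(signal: np.ndarray, window_size: int, hop: int) -> np.ndarray:
    """Hypothetical helper: magnitude spectrum of each windowed segment."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = signal[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(segment)))
    # Rows are time segments, columns are frequency bins
    return np.array(frames)


# Synthetic tone with 8 cycles per 64-sample window
t = np.arange(256)
signal = np.sin(2 * np.pi * 8 * t / 64)
spectrum = short_time_fourier(signal, window_size=64, hop=32)

print(spectrum.shape)
print(spectrum.argmax(axis=1))
```

A larger window_size gives finer frequency resolution and coarser time resolution, which is the adjustable trade-off mentioned above.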

Error in plot_timeseries with engine = Matplotlib

When plotting grouped data, matplotlib returns an image size error.

ValueError: Image size of 140000x100000 pixels is too large. It must be less than 2^16 in each direction.

However, if we explicitly define the width and height, matplotlib works as expected.
Need default plot size to be defined.

Reproducible example:

import pytimetk as tk  # formerly timetk, before the package rename

NOT WORKING

df = tk.load_dataset('m4_monthly', parse_dates = ['date'])

fig = (
    df
        .groupby('id')
        .plot_timeseries(
            'date', 'value', 
            color_column = 'id',
            facet_ncol = 2,
            x_axis_date_labels = "%Y",
            engine = 'matplotlib'
        )
)
fig

WORKING

fig = (
    df
        .groupby('id')
        .plot_timeseries(
            'date', 'value', 
            color_column = 'id',
            facet_ncol = 2,
            x_axis_date_labels = "%Y",
            width = 1200,
            height = 800,
            engine = 'matplotlib'
        )
)
fig
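One possible reading of the error, offered only as a guess: matplotlib sizes figures in inches and multiplies by the DPI, so a width and height meant as pixels would overflow if they were passed through as inches. Under that assumption, a default size could be converted as sketched below. The helper name and the DPI of 100 are hypothetical.

```python
from typing import Tuple


def pixels_to_inches(width_px: int, height_px: int, dpi: int = 100) -> Tuple[float, float]:
    """Hypothetical helper: convert a pixel size to a figsize in inches."""
    return (width_px / dpi, height_px / dpi)


# The size from the working example above, expressed in inches
figsize = pixels_to_inches(1200, 800)

print(figsize)
```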

Pad By Time Grouped - End Behavior

Update pad_by_time behavior for grouped data so that each group is extended to the maximum timestamp across all groups.

Example: groups A and B, where A has values (with gaps) between 1/1/22 and 1/6/22, and B has values between 1/2/22 and 1/5/22.
We expect group B to have values filled in through the latest date across all groups.

In terms of data prep for a global model: if 1/6 is the end of my training data, we would need group B to be extended to 1/6 as well.
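The requested behavior can be sketched in plain pandas. The data is made up to match the example, reading the dates as days in January 2022 (an assumption), and this is not how pad_by_time is implemented.

```python
import pandas as pd

# Made-up data: A spans Jan 1 to Jan 6 with gaps, B spans Jan 2 to Jan 5
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2022-01-01", "2022-01-03", "2022-01-06", "2022-01-02", "2022-01-05"]
    ),
    "value": [1.0, 3.0, 6.0, 2.0, 5.0],
})

global_end = df["date"].max()  # latest date across all groups

parts = []
for name, g in df.groupby("group"):
    # Each group starts at its own first date but ends at the global end date.
    full_index = pd.date_range(g["date"].min(), global_end, freq="D")
    padded_g = (
        g.set_index("date")[["value"]]
        .reindex(full_index)
        .rename_axis("date")
        .reset_index()
    )
    padded_g.insert(0, "group", name)
    parts.append(padded_g)

padded = pd.concat(parts, ignore_index=True)

print(padded)
```

Group B now ends on the same date as group A, with missing values where no observation exists.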

Applied Tutorial: Clustering (showcase `tk.ts_features`)

Quick roadmap corrections

Priority 3 - Augment Operations -> change to Priority 4 - Augment Operations
Note - These functions should overwrite columns that are named the same in the input data frame.

tk_augment_timeseries_signature() - tk_get_timeseries_signature()
tk_augment_lags() / tk_augment_leads() - Will need lag_vec(), lead_vec()
tk_augment_slidify() - May need slidify_vec()
add tk_augment_holiday_signature() and check it once merge request is completed

Tests: `tk.ts_summary()`

Write tests for tk.ts_summary()

Question: Should a more flexible version of summarize_by_time be created?

After reviewing polars and pandas more closely, I'm questioning the separation of value column and Agg functions.

Here's how polars handles aggregations:


df.group_by("a").agg(
    b_sum=pl.sum("b"),
    c_mean_squared=(pl.col("c") ** 2).mean(),
)

https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html
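For reference, pandas offers a similar style through named aggregation, where each output column is declared together with its input column and function. The data below is made up, and the column names simply mirror the polars snippet.

```python
import pandas as pd

# Made-up data with the same column names as the polars snippet
df = pd.DataFrame({
    "a": ["x", "x", "y"],
    "b": [1, 2, 3],
    "c": [1.0, 3.0, 2.0],
})

# Each keyword is an output column: (input column, aggregation)
result = df.groupby("a").agg(
    b_sum=("b", "sum"),
    c_mean_squared=("c", lambda s: (s ** 2).mean()),
)

print(result)
```

Both styles tie the value column to its aggregation, which is the coupling the question is about.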

Plot_timeseries: color column rework

The color_column was not set up properly.

New Data Set: For anomaly tutorial (Expedia Hotels)

New dataset for use with the forthcoming anomaly functionality.

More Rolling Augmentors: Rolling, Exponential (EWM), Expanding

Rolling
Exponential Weighted
Expanding
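The three window types differ only in which past values they weight, as this plain-pandas sketch on made-up data shows. The window and alpha settings are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: fixed-size trailing window
rolling_mean = s.rolling(window=2).mean()

# Exponentially weighted: all past values, with geometrically decaying weights
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

# Expanding: all values seen so far, equally weighted
expanding_mean = s.expanding().mean()

print(pd.DataFrame({
    "rolling": rolling_mean,
    "ewm": ewm_mean,
    "expanding": expanding_mean,
}))
```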

summarize_by_time - corr

corr does not appear to be a valid agg_func value.

For example:

df \
    .groupby("category_2") \
    .summarize_by_time(
        date_column='order_date', 
        value_column= ['total_price'],
        freq = "MS",
        agg_func = ['corr']
    )

Will generate the error:
AttributeError: 'corr' is not a valid function for 'DatetimeIndexResamplerGroupby' object

I think simply modifying the docstring here:

        - "sum": Sum of values
        - "mean": Mean of values
        - "median": Median of values
        - "min": Minimum of values
        - "max": Maximum of values
        - "std": Standard deviation of values
        - "var": Variance of values
        - "first": First value in group
        - "last": Last value in group
        - "count": Count of values
        - "nunique": Number of unique values
        - "corr": Correlation between values <- Just remove

is the simplest solution. I am not entirely sure what the intended use for corr was here anyway: was it meant for comparing against features/covariates, or for comparing from t1 to t2 to t3 ...?

Regardless, we should just tweak the docstring for now.

In addition, the function as currently written includes a 'kind' parameter that defaults to 'timestamp', but the docstring does not mention that 'period' also works. This should be added to the docstring.

anomalize
plot_anomalies
plot_anomaly_decomp

Documentation:

Applied Tutorial: Anomaly Detection

Applied Tutorial: Demand Forecasting (showcase `tk.augment_timeseries_signature`)

Dataset: Please use the walmart_sales dataset, since it's suited to demand forecasting.
Future Frame: Make use of the future frame function to create future dates by ID. This will allow us to show the future forecast.
Plotting: where possible use plot timeseries.

Meta Issue: Technical Trading / Finance Module Wishlist

Need to discuss everything we want to add for version 0.3.0.

Guide: Data Visualization

Need a data visualization guide similar to: Plotting Time Series

tk.plot_timeseries()