6-window-functions-with-pyspark

Introduction

This repository contains the source code for a blog post about window functions in PySpark: "Use these 6 Window Functions to power up your PySpark queries, a comprehensive guide". In that post, I describe how to use these 6 window functions in PySpark.

Window functions

Most of the time we use the SQL module in Spark. We create DataFrames with the DataFrame API, whose optimizers support processing data from a wide range of data sources and running algorithms for Big Data workloads.

In SQL, we have a particular type of operation called a Window Function. This operation calculates a function on a subset of rows that is defined relative to the current row. For each row, a window frame is determined, and the calculation is made over the rows in that frame. Unlike a grouped aggregation, the rows are not collapsed: every input row gets its own result value.
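To make the idea of a per-row frame concrete, here is a minimal plain-Python sketch (not Spark code, and not taken from the repository). The helper `apply_window` and the sample values are hypothetical; the sketch only illustrates that each row gets its own frame and its own result.

```python
# Plain-Python illustration of window-function semantics (not Spark code).
rows = [10, 20, 30, 40, 50]  # hypothetical values, already in window order


def apply_window(values, preceding, following, func):
    """For each row, apply func to the frame [i - preceding, i + following]."""
    results = []
    for i in range(len(values)):
        start = max(0, i - preceding)              # clip the frame at the first row
        end = min(len(values), i + following + 1)  # clip the frame at the last row
        frame = values[start:end]                  # rows visible from the current row
        results.append(func(frame))                # one output value per input row
    return results


# Frame = one row before, the current row, and one row after.
frame_sums = apply_window(rows, preceding=1, following=1, func=sum)
print(frame_sums)
```

The output has exactly as many values as there are input rows, which is the key difference from a grouped aggregation.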

Because Spark has a SQL module, Window Functions are available there too. Combining DataFrames with Window Functions lets us express calculations such as running totals, moving averages, and rankings concisely.

Repository

Getting started

# Create virtualenv
python -m venv .venv 

# Activate virtualenv
. .venv/bin/activate 

# Install dependencies
pip install -r requirements.txt

# Run the code
python most_recent.py

The repository contains the following files:

Aggregate Functions

How to calculate a cumulative sum (running total) 📈

Very easy with a SQL window function! 👇🏻

cumulative_sum.py
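As a rough illustration of what a running total computes, here is a plain-Python sketch. It is not the content of cumulative_sum.py, and the sample data is made up. In PySpark the same result is typically expressed along the lines of `F.sum("amount").over(Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow))`, where the column names are assumptions.

```python
from itertools import accumulate

# Hypothetical daily sales, already sorted by date (the window's ordering).
sales = [("2024-01-01", 100), ("2024-01-02", 50), ("2024-01-03", 75)]

# Frame: every row from the start of the data up to and including the current row.
running_totals = list(accumulate(amount for _, amount in sales))

# Attach the running total to each original row, as a window function would.
cumulative = [
    (day, amount, total)
    for (day, amount), total in zip(sales, running_totals)
]
print(cumulative)
```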

How to calculate a moving average 📈

Filter out the noise to determine the direction of a trend!

moving_average.py
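The sketch below shows, in plain Python, what a trailing moving average over a 3-row frame computes. The window size, the helper name `moving_average`, and the prices are assumptions for illustration, not values from moving_average.py. A PySpark equivalent would usually look like `F.avg("price").over(Window.orderBy("date").rowsBetween(-2, Window.currentRow))`.

```python
# Hypothetical price series, already in date order.
prices = [10.0, 12.0, 11.0, 15.0, 14.0]
WINDOW_SIZE = 3  # assumption: current row plus the two preceding rows


def moving_average(values, size):
    """Average each row with up to (size - 1) preceding rows."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - size + 1): i + 1]  # shorter at the start
        averages.append(sum(frame) / len(frame))
    return averages


smoothed = moving_average(prices, WINDOW_SIZE)
print(smoothed)
```

Note that the first rows average over fewer values because the frame is clipped at the start of the data, which matches how a row-based frame behaves.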

Ranking Functions

Select only the most recent records

Easy way to remove duplicate entries

most_recent.py
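One common way to keep only the most recent record per key is to number the rows within each key from newest to oldest and keep row number 1. The plain-Python sketch below illustrates that idea; the record layout is hypothetical and this is not the code in most_recent.py. In PySpark this is usually written with `F.row_number().over(Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc()))` followed by a filter on the value 1.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (customer_id, updated_at, email)
records = [
    ("c1", "2024-01-01", "old@example.com"),
    ("c2", "2024-01-05", "b@example.com"),
    ("c1", "2024-02-01", "new@example.com"),
]

# Two stable sorts: newest first, then by key, so each key's rows stay newest first.
ordered = sorted(records, key=itemgetter(1), reverse=True)
ordered = sorted(ordered, key=itemgetter(0))

most_recent = []
for _, group in groupby(ordered, key=itemgetter(0)):  # one partition per customer
    for row_number, row in enumerate(group, start=1):
        if row_number == 1:  # keep only the newest row of each partition
            most_recent.append(row)

print(most_recent)
```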

Break your dataset into equal groups

Rank each value in your dataset

rank.py
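The following plain-Python sketch illustrates the two ranking ideas named above: splitting ordered rows into near-equal groups (SQL `NTILE`) and ranking values with ties (SQL `RANK`). The helper names and the scores are hypothetical and are not taken from rank.py. In PySpark the corresponding functions are `F.ntile(n)` and `F.rank()`, each applied with `.over(...)` on an ordered window.

```python
# Hypothetical scores, already sorted from highest to lowest.
scores = [("ann", 90), ("bob", 85), ("cat", 85), ("dan", 70), ("eve", 60)]


def rank_desc(values):
    """RANK semantics: ties share a rank and the next rank skips ahead."""
    return [1 + sum(other > value for other in values) for value in values]


def ntile(row_count, buckets):
    """NTILE semantics: the first (row_count % buckets) groups get one extra row."""
    base, extra = divmod(row_count, buckets)
    tiles = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        tiles.extend([bucket] * size)
    return tiles


ranks = rank_desc([score for _, score in scores])
tiles = ntile(len(scores), 2)
print(list(zip(scores, ranks, tiles)))
```

With five rows and two groups the split cannot be exactly equal, so the first group receives the extra row.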

Value/Analytical Functions

Calculate the difference from preceding rows

Very easy to select preceding or following rows

difference.py
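A difference from the preceding row amounts to pairing each value with the value one row earlier. Here is a minimal plain-Python sketch with made-up readings; it is not the code in difference.py. In PySpark the preceding value is usually obtained with `F.lag("value", 1).over(Window.orderBy("date"))`, and `F.lead` does the same for following rows; the column names here are assumptions.

```python
# Hypothetical readings, already in date order.
readings = [100, 103, 101, 108]

# Shift the series by one row; the first row has no predecessor.
previous = [None] + readings[:-1]

# Subtract the preceding value where one exists.
differences = [
    None if prev is None else current - prev
    for current, prev in zip(readings, previous)
]
print(differences)
```

The first row yields `None` because there is nothing to compare it with, which mirrors the null a lag produces at the start of a partition.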

Get the first and last value of the month

Quickly analyze the start and end of each month

first_last.py
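The sketch below shows in plain Python what "first and last value of the month" means: rows are partitioned by month, ordered by date, and every row is given the first and last value of its partition. The data is hypothetical and this is not the code in first_last.py. In PySpark, `F.first` and `F.last` over a window partitioned by month are the usual tools; with an ordered window the frame generally needs to span the whole partition (for example `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`), otherwise `last` only sees rows up to the current one.

```python
from collections import defaultdict

# Hypothetical daily prices: (date, price)
prices = [
    ("2024-01-02", 10.0),
    ("2024-01-15", 12.0),
    ("2024-01-31", 11.0),
    ("2024-02-01", 11.5),
    ("2024-02-29", 13.0),
]

# Partition by month ("YYYY-MM"), keeping rows in date order inside each month.
by_month = defaultdict(list)
for day, price in sorted(prices):
    by_month[day[:7]].append(price)

# First and last value of each partition.
first_last = {month: (values[0], values[-1]) for month, values in by_month.items()}

# Attach both values to every row, as a window function would.
with_first_last = [(day, price, *first_last[day[:7]]) for day, price in sorted(prices)]
print(with_first_last)
```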

Contributors

mitchellvanrijkom
