Code Monkey home page Code Monkey logo

berserkr's Introduction

berserkr

Screenshot

In old norse berserkr is when a viking warrior entered battle wearing nothing for armor but an animal skin.

In this project, berserkr will tackle and perform massive machine learning at scale without fear

The goal is to find the likelyhood of a purchase for a certain item for a certain customer.

With the best fitting technologies to solve it.

Project design

Screenshot

Data processing

In this project the work is done in batches.

Data format

  • files are compressed as lzo on the bucket side
  • the raw data is then initially cleaned and saved as a parquet.lzo file
  • that raw data is then used by the machine learning parts of the application wich will utilize Apache Arrow.

LZO

Aimed at being very fast, lzop produces files slightly larger than gzip while only requiring a tenth of the CPU use and only slightly higher memory utilization, lzop is one of the fastest compressors available.

PySpark

Apache Spark is the chosen tool used within this project. Spark is quick and very responsive tool to perform data processing with. It provides an analytic engine for large-scale data processing. It is a general distributed in-memory computing framework implemented in scala. Hence spark operates on distributed data collections. However it does not do distributed storage. Spark is proven to be much faster than the popular Apache Hadoop framework. Apache Spark and Hadoop have different purposes but works good together. A high level overview of Apache Spark below:

Screenshot

In this project the work is done in batches.

Apache Parquet

Extracted and transformed files will be saved with the parquet format in S3. And fetched to be used in the machine learning parts of the application.

Parquet is a part of the spark eco-system and therefore been chosen as the format to store files in S3 with once they are prepared and ready.

Delta Lake

This project will try to utilise databricks open source project Delta Lake

  • All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

Apache Arrow

In this project Apache Arrow will be utilised when performing operation with pandas udf in spark.

This will enable better performance in regards to the CPU not having to perform serialization and deserialization.

Some benefits:

  • All systems utilize the same memory format
  • No overhead for cross-system communication
  • Projects can share functionality (eg: Parquet-to-Arrow reader)

A great video explaining this!

Featuretools

Pandas UDF

Since PySpark 2.3.0 pandas UDFs are now available. This opens up amazing opportunities to work with the data with help of pandas to perform machine learning at massive scale.

Here is when Apache Arrow dataformat comes in and play a crucial part. Since we utilize Pandas UDF, we will be able to loose the overhead of going from a dataformat in the Spark JVM to Python. Basically we will have the same dataformat comming in and out from the calculations.

amazing right? =D

Dataset

Dataset has been fetched from

Airflow

Using Apache Airflow to schedule the applications within the project.

I a normal scenario this should be in its own repository but will keep on here just because I want to share this project as an open source example.

Run application

Run manually by first initialise virtualenv

virtualenv venv -p python3.7
source venv/bin/activate

Then install dependency packages:

Go to:

cd scripts

Run:

./run.sh

Explicitly for MacOS users

brew install lzop

subprocess will otherwise fail when running application locally (decompression when reading S3 files).

Local run

cd tools/docker

docker-compose up

A minio server with a default bucket and data has now been created, which can be used.

Online retailer app

Run:

make onlineretailer

Custom S3 endpoints with Spark LZO compression spark

Tests

Section to describe how to test the project.

Unit

make test

Integration

some command

berserkr's People

Contributors

thelin90 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.