
martandsingh / apachespark


This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work. We use PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.

Python 100.00%
apachespark data-analysis data-engineering database databricks datalake deltalake etl-pipeline hadoop hive


Data Engineering Using Azure Databricks

Introduction

This course includes multiple sections. We mainly focus on the Databricks Data Engineer certification exam. We have the following tutorials:

  1. Spark SQL ETL
  2. PySpark ETL

DATASETS

All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets

HOW TO USE?

Follow the article below to learn how to clone this repository to your Databricks workspace.

https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/

Spark SQL

This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, which include:

  1. Create, Select, Update, Delete tables
  2. Create database
  3. Filtering data
  4. Group by & aggregation
  5. Ordering
  6. SQL joins
  7. Common table expression (CTE)
  8. External tables
  9. Sub queries
  10. Views & temp views
  11. UNION, INTERSECT, EXCEPT keywords
  12. Versioning, time travel & optimization
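A few of the topics above (CTEs, grouping, ordering, and the UNION, INTERSECT, and EXCEPT keywords) can be sketched in a handful of statements. The example below is only an illustration: it uses Python's built-in sqlite3 as a stand-in engine so it runs without a cluster, and the table names and rows are hypothetical, not taken from the course datasets. The statements are standard SQL and should carry over to Spark SQL, but they have not been checked against a Databricks workspace here.

```python
import sqlite3

# sqlite3 is a stand-in engine so the sketch runs without a Spark cluster.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical demo tables (not part of the course datasets).
cur.execute("CREATE TABLE orders_2021 (customer TEXT, amount REAL)")
cur.execute("CREATE TABLE orders_2022 (customer TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders_2021 VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("alice", 5.0)],
)
cur.executemany(
    "INSERT INTO orders_2022 VALUES (?, ?)",
    [("bob", 7.0), ("carol", 30.0)],
)

# CTE + GROUP BY + ORDER BY: total spend per customer in 2021.
totals_2021 = cur.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders_2021
        GROUP BY customer
    )
    SELECT customer, total FROM totals ORDER BY total DESC
""").fetchall()


def customers(set_op):
    """Combine the customer lists of both years with the given set operator."""
    sql = f"""
        SELECT customer FROM orders_2021
        {set_op}
        SELECT customer FROM orders_2022
        ORDER BY customer
    """
    return [row[0] for row in cur.execute(sql).fetchall()]


all_customers = customers("UNION")      # customers seen in either year
returning = customers("INTERSECT")      # customers seen in both years
lost = customers("EXCEPT")              # customers seen in 2021 but not 2022
```

UNION, INTERSECT, and EXCEPT all remove duplicate rows, which is why alice appears once even though she has two orders in 2021.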

PySpark ETL

This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform & Load. We will see how to extract data from various sources, process it, and finally load the processed data to our destination.
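The three stages can be sketched without any particular library. The example below is a minimal plain-Python illustration, not the course's PySpark code: the input rows, column names, and the `extract`, `transform`, and `load` helpers are all hypothetical. In the notebooks these stages would typically correspond to reading files, applying DataFrame transformations, and writing the result.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read from files or tables.
RAW_CSV = """order_id,customer,amount
1,alice,10.5
2,bob,
2,bob,
3,carol,7
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop repeated order_ids and fill missing amounts with 0.0."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] in seen:
            continue  # deduplicate on order_id, keeping the first occurrence
        seen.add(row["order_id"])
        amount = float(row["amount"]) if row["amount"] else 0.0
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"],
                "amount": amount,
            }
        )
    return cleaned


def load(rows):
    """Load: write cleaned rows to the destination (here, a CSV string)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["order_id", "customer", "amount"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result = load(transform(extract(RAW_CSV)))
```

Keeping each stage in its own function makes it easy to swap the source or destination without touching the cleaning logic.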

This course includes:

  1. Read files
  2. Schema handling
  3. Handling JSON files
  4. Write files
  5. Basic transformations
  6. Partitioning
  7. Caching
  8. Joins
  9. Missing value handling
  10. Data profiling
  11. Date time functions
  12. String functions
  13. Deduplication
  14. Grouping & aggregation
  15. User defined functions
  16. Ordering data
  17. Case study - sales order analysis

You can download all the notebooks from our GitHub repo: https://github.com/martandsingh/ApacheSpark

Facebook: https://www.facebook.com/codemakerz

email: [email protected]

SETUP folder

You will see the initial_setup and clean_up notebooks called in every notebook. It is mandatory to run both in the defined order. The initial setup notebook creates all the tables and databases required for the demo. After you finish a notebook, run the clean-up notebook to remove all the database objects it created.
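The setup, work, clean-up order can be pictured as a try/finally pattern. The sketch below is an analogy only: it uses sqlite3 instead of Databricks, and the `initial_setup` and `clean_up` functions and the `demo_cars` table are hypothetical stand-ins, not the contents of the actual notebooks.

```python
import sqlite3


def initial_setup(conn):
    """Stand-in for the initial setup notebook: create the demo objects."""
    conn.execute("CREATE TABLE demo_cars (model TEXT, price REAL)")
    conn.execute("INSERT INTO demo_cars VALUES ('hatchback', 5000.0)")


def clean_up(conn):
    """Stand-in for the clean-up notebook: drop the demo objects."""
    conn.execute("DROP TABLE IF EXISTS demo_cars")


def table_names(conn):
    """List the tables that currently exist in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [r[0] for r in rows.fetchall()]


conn = sqlite3.connect(":memory:")
initial_setup(conn)  # step 1: create everything the demo needs
try:
    # step 2: the body of the tutorial notebook would run here
    row_count = conn.execute("SELECT COUNT(*) FROM demo_cars").fetchall()[0][0]
finally:
    clean_up(conn)  # step 3: always remove the demo objects afterwards

remaining = table_names(conn)
```

Running the clean-up in a `finally` block means the demo objects are removed even if a step in the middle fails.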

pyspark_init_setup - this notebook copies the datasets from my GitHub repo to DBFS. It also generates the used car Parquet dataset. All the datasets will be available at

/FileStore/datasets
