This project demonstrates an end-to-end data pipeline built with Azure services, including Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, and Azure Synapse Analytics. Its purpose is to showcase effective handling of the entire data pipelining lifecycle, from ingestion through transformation to analysis.
The project involves the following steps:
- Creation of Storage Account and Resource Group: Setting up the necessary Azure resources to store and process data.
- Data Ingestion: Pulling data from HTTP data sources into Azure Data Factory.
- Data Organization: Creating two folders (`raw_data` and `transformed_data`) in Azure Data Lake Storage Gen2.
- Data Loading: Loading raw data into the `raw_data` folder.
- Data Transformation: Using Azure Databricks with the PySpark framework to transform the data.
- Storing Transformed Data: Loading the transformed data into the `transformed_data` folder.
- Exploratory Data Analysis (EDA): Loading the transformed data into Azure Synapse Analytics for EDA.
By executing these steps, the project aims to demonstrate the comprehensive handling of a data pipeline using Azure services.
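The two-folder layout above can be sketched as ADLS Gen2 file-system URIs. The account and container names below are placeholders for illustration, not values from the project:

```python
# Placeholder names, not taken from the project.
ACCOUNT = "mystorageacct"
CONTAINER = "pipeline-data"

def adls_uri(folder: str, filename: str = "") -> str:
    """Build an abfss:// URI for a path in Azure Data Lake Storage Gen2."""
    path = f"{folder}/{filename}" if filename else folder
    return f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/{path}"

raw_dir = adls_uri("raw_data")
transformed_dir = adls_uri("transformed_data")
# e.g. abfss://pipeline-data@mystorageacct.dfs.core.windows.net/raw_data
```

These are the URIs Databricks and Synapse would read from and write to when the storage account is mounted or accessed directly.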
- Set up a storage account and a resource group in Azure.
- Create containers to store the data.
- Use Azure Data Factory to pull data from HTTP data sources.
- Store the ingested data in the `raw_data` folder in Azure Data Lake Storage Gen2.
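In the project this pull is a Data Factory Copy activity with an HTTP source and an ADLS Gen2 sink. As a minimal local stand-in for that fetch-and-land step (the URL and landing path are hypothetical):

```python
import urllib.request
from pathlib import Path

def ingest_http(source_url: str, landing_dir: str, filename: str) -> Path:
    """Pull a file from an HTTP source and land it in a raw-data folder.

    Local stand-in for the Data Factory Copy activity; in the pipeline
    the sink is the raw_data folder in ADLS Gen2, not a local directory.
    """
    dest = Path(landing_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(source_url) as resp:
        dest.write_bytes(resp.read())
    return dest
```

In Data Factory itself, the equivalent configuration is an HTTP linked service on the source side and an ADLS Gen2 dataset on the sink side.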
- Create two folders in the storage container: `raw_data` and `transformed_data`.
- Ensure the raw data is correctly placed in the `raw_data` folder.
- Load the raw data into the `raw_data` folder in Azure Data Lake Storage Gen2.
- Utilize Azure Databricks and PySpark to transform the raw data.
- The transformation steps are outlined in the provided Jupyter Notebook (`transformation.ipynb`).
- Load the transformed data into the `transformed_data` folder in the storage container.
- Use Azure Synapse Analytics to perform EDA on the transformed data.
- Analyze the data to gain insights and validate the transformation process.
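In the project this EDA runs inside Synapse Analytics, typically as SQL over the files in `transformed_data`. As a language-consistent sketch of the kind of first-pass aggregate involved (the rows are placeholders, and in Synapse this would be a `GROUP BY` query):

```python
from collections import defaultdict

# Placeholder transformed rows; in Synapse the equivalent would be a SQL
# GROUP BY over the transformed_data files.
rows = [
    {"event": "100m", "result": 9.81},
    {"event": "100m", "result": 9.89},
    {"event": "200m", "result": 19.78},
]

# Per-event row count and mean: a typical first EDA aggregate, useful both
# for insight and for sanity-checking the transformation output.
groups = defaultdict(list)
for r in rows:
    groups[r["event"]].append(r["result"])

summary = {k: {"n": len(v), "mean": sum(v) / len(v)} for k, v in groups.items()}
```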
- `transformation.ipynb`: Jupyter Notebook containing the PySpark code used for data transformation.
This project highlights the capability to manage an entire data pipeline using Azure services. Each step, from data ingestion to transformation and analysis, is designed to demonstrate proficiency in handling complex data workflows in a cloud environment.