Code Monkey home page Code Monkey logo

aymane-maghouti / hr-data-pipeline-azure Goto Github PK

View Code? Open in Web Editor NEW
3.0 1.0 0.0 3.07 MB

This project is a comprehensive data engineering solution that extracts HR data from a GitHub repository, performs data transformations using Azure services, and creates an interactive HR dashboard using Power BI. The goal is to enable HR professionals and decision-makers to gain insights from the HR data for better workforce management.

Home Page: https://aymane-maghouti.github.io

Jupyter Notebook 100.00%
azure-blob-storage azure-cloud-services azure-data-factory azure-databricks data-engineering data-extraction data-ingestion data-pipeline data-visualization etl-process power-bi

hr-data-pipeline-azure's Introduction

Human Resources ETL Project

Table of Contents

  1. Project Overview
  2. Technologies Used
  3. Data Pipeline
  4. Repository Structure
  5. How to Run
  6. Dashboard
  7. Acknowledgments
  8. Conclusion
  9. Contacts

Project Overview

This project is a comprehensive data engineering solution that extracts HR data from a GitHub repository, performs data transformations using Azure services, and creates an interactive HR dashboard using Power BI. The goal is to enable HR professionals and decision-makers to gain insights from the HR data for better workforce management.

Technologies Used

  • Azure Data Factory: Used for orchestrating data extraction and loading processes.
  • Azure Blob Storage: Used as the data storage solution.
  • Azure Databricks (PySpark): Used for data transformation, leveraging the power of PySpark for processing.
  • Power BI: Used for data visualization and dashboard creation.
  • GitHub: Hosting the raw HR data.

Data Pipeline

Here is the HR data pipeline :

HR_Data_pipeline

The data pipeline consists of the following stages:

  1. Extract from GitHub Repository: Azure Data Factory is configured to extract HR data from a specified GitHub repository.

  2. Load to Azure Blob Storage: Extracted data is loaded into Azure Blob Storage for storage.

  3. Transform Data with Azure Databricks: Azure Databricks, powered by PySpark, is used to perform data transformations.

  4. Power BI Dashboard: Power BI connects to Azure Blob Storage and retrieves the transformed data. Power Query is used for additional data transformations within Power BI. The HR dashboard is created to provide interactive visualizations and insights into the HR data.

Repository Structure

HR-Data-Pipeline-Azure:.
│   README.md
│
├───dataset
│       HR Analytics Data.csv
│       HR employee data.csv
│       promomtion.csv
│       Retrenchment.csv
│
├───images
│       container.png
│       dashboard_hr.png
│       data_pipeline_data_factory.png
│       exemple-transformed-data.png
│       raw-data.png
│       services_azure.png
│       transformed-data.png
│       workflow_hr.png
│
└───Main
        DataBricks_noteBook.ipynb
        HR_Dashboard.pbix
        Power_BI _HR_Dashboard.pdf

How to Run

Prerequisites

Before running this project, ensure you have the following prerequisites in place:

  • Azure Subscription: You should have an active Azure subscription for using Azure services.
  • GitHub Account: Access to the GitHub repository containing the HR data.
  • Azure Data Factory: Set up an Azure Data Factory instance with appropriate permissions to create and manage pipelines.
  • Azure Databricks: Provision an Azure Databricks workspace and configure it for data transformation with PySpark.
  • Power BI Desktop: Install Power BI Desktop for editing and viewing the HR dashboard.

Here is the Azure services that I used :

azure services

Azure blob storage

create the two folders in your container (raw-data folder for the data comes from the github repo (data source) , transformed-data folder for the transformed data (data destination)).

blob storage

Azure data Factory

Here is the data pipeline created in azure data Factory :

azure data Factory

after running the data pipeline, the data will be loaded into the raw-data folder :

raw-data

Notebook Configuration

To configure the notebooks for data transformation in Azure Databricks, follow these steps:

  1. Open the PySpark notebooks in the Main/ directory.
  2. Update the following parameters in the notebooks with your Azure account information:
    • Client ID: Replace with your Azure Active Directory (AAD) Application Client ID.
    • Secret Key: Replace with your AAD Application Secret Key.
    • Directory (Tenant) ID: Replace with your AAD Directory (Tenant) ID.
    • Container Name: Set the name of the Azure Blob Storage container where the data will be stored.
    • Storage Account Name: Specify the name of your Azure Storage Account.

Make sure to securely manage and store these credentials. (in my case I just put them in the notebook which is not the best practice :), so you can use the azure key vault service to manage the credentials).

after running the script the transformed data will be loaded into the transformed-data folder inside the container :

transformed-data

and here is an example of the data loaded, the file framed in red is the actual data, the other files are just meta-data...

transformed-data exemple

Cleaning Up (Data in Azure)

Cleaning up resources is Optionally:

  1. Azure Blob Storage: Delete the Azure Blob Storage container used for storing intermediate and transformed data.
  2. Azure Data Factory: delete any Azure Data Factory pipelines used for data extraction and loading.
  3. Azure Databricks: Terminate any running clusters in Azure Databricks.

Power Bi connect

Set up Power BI to connect to Azure Blob Storage and open the provided .pbix file in the Main/ directory.

Dashboard

Here is the HR Dashboard created in Power BI:

HR Dashboard

Acknowledgments

  • Special thanks to the open-source communities behind azure cloud ,Python, and Power BI for providing powerful tools for data extraction, transformation, orchestration, and visualization.

Conclusion

This project demonstrates an end-to-end data engineering solution for HR data analysis using Azure services. It streamlines data extraction, transformation, and visualization to provide actionable insights.

Contacts

For any questions or further information, feel free to contact me :)

hr-data-pipeline-azure's People

Contributors

aymane-maghouti avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.