The Centralized Data Warehouse and ML Solution for Banking Analytics is a project that combines a centralized repository for banking data with machine learning algorithms to enable predictive analysis.
Our client, a bank, has various data sources that contain valuable data. However, the data is not well organized, which makes it challenging to analyze and gain insights from. Additionally, the client wants to use the data to build machine learning models that improve their business processes and provide better services to their customers.
AWS Cloud Services :- EC2 Instance, IAM, S3 Bucket, RDS Database, Redshift Cluster, SNS, SageMaker
Containerization & Orchestration Tools :- Docker, Kubernetes
Visualization Tools :- Tableau or Power BI
MLOps Tool :- Kubeflow
Programming Language :- Python
Machine Learning Library :- Scikit-learn
Data Processing :- Apache Spark
Workflow Orchestration :- Apache Airflow
Development Environment :- Jupyter Notebook
Go to S3 -> Create Bucket -> Enter bucket name -> Select your Region
In Object Ownership -> select ACLs enabled
Uncheck Block all public access (allow public access)
Enable Bucket Versioning
Create Bucket
(Note:- Upload the CSV file with 4000+ client records into this bucket.)
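If you prefer to upload the file programmatically instead of through the console, a minimal boto3 sketch looks like this; the bucket name, file name, and region are placeholders for your own values:

import boto3

# Assumes your AWS credentials are already configured (aws configure or environment variables).
s3 = boto3.client("s3", region_name="us-east-2")  # replace with your region

s3.upload_file(
    Filename="client_data.csv",        # local CSV with the 4000+ records
    Bucket="your-client-data-bucket",  # the bucket created above
    Key="client_data.csv",
)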
Search for RDS in the search bar -> Select Databases -> Create database
Choose the Standard create method -> Select MySQL as the Engine type -> In Templates, select Free tier
In Settings, enter the DB instance identifier -> Enter a username -> Enter your password
In Instance configuration, select db.t2.micro or any other instance type as per your requirement
In Storage, select the gp2 type and 20 GB or more storage as per your requirement -> Disable storage autoscaling
In Connectivity -> Select Don't connect to an EC2 compute resource -> For Network type, select IPv4 -> Select the default VPC or your created VPC and subnet -> Select Yes for Public access -> Select your created security group -> In Additional configuration, keep port number 3306
Select Password authentication under Database authentication
In Additional configuration -> Enter the initial database name -> Disable automated backups and keep the other options as they are
Click Create database
Now connect this database to your local MySQL Workbench to insert the client data.
(Note:- Load the CSV file with 500+ records into this database.)
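As an alternative to MySQL Workbench, a small Python sketch can bulk-load the CSV into the RDS database; the endpoint, credentials, database, and table name below are placeholders:

# requires: pip install pandas sqlalchemy pymysql
import pandas as pd
from sqlalchemy import create_engine

# Placeholder RDS endpoint and credentials - replace with your own.
engine = create_engine(
    "mysql+pymysql://admin:[email protected]:3306/bankdb"
)

# Read the 500+ record CSV and write it to a table in the RDS database.
df = pd.read_csv("client_records.csv")
df.to_sql("client_records", con=engine, if_exists="replace", index=False)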
So now let's create our AWS storage and tool setup.
First, we'll create two AWS S3 buckets: one for the staging area and one for the ML team.
Just follow these steps to create each bucket.
Enter bucket name -> staging-area -> Select your Region (S3 bucket names cannot contain underscores and must be globally unique, so add your own suffix if needed)
In Object Ownership -> select ACLs enabled
Uncheck Block all public access (allow public access)
Enable Bucket Versioning
Create Bucket
Enter bucket name -> ml-team -> Select your Region
In Object Ownership -> select ACLs enabled
Uncheck Block all public access (allow public access)
Enable Bucket Versioning
Create Bucket
Both buckets are created.
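The same two buckets can also be created from Python with boto3, which is handy if you want the setup to be repeatable. The region and bucket names below are examples only:

import boto3

region = "us-east-2"  # replace with your region
s3 = boto3.client("s3", region_name=region)

for name in ["staging-area", "ml-team"]:  # replace with your globally unique bucket names
    # For us-east-1, omit CreateBucketConfiguration entirely.
    s3.create_bucket(
        Bucket=name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    # Turn on versioning, matching the console steps above.
    s3.put_bucket_versioning(
        Bucket=name,
        VersioningConfiguration={"Status": "Enabled"},
    )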
Now create a security group for the database and cluster connections:
In Name -> Enter the SG name
In Description -> Enter a specific description
In VPC -> select the default VPC
In Inbound rules, add these extra rules -> 1] All traffic, Source: Anywhere-IPv4
2] Redshift, Source: Anywhere-IPv4
3] MySQL, Source: Anywhere-IPv4
Create Security Group
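If you would rather script the security group, a rough boto3 equivalent of the console steps is sketched below. The group name and VPC ID are placeholders, and the wide-open 0.0.0.0/0 ranges mirror the demo setup above, so tighten them for anything real:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # replace with your region

sg = ec2.create_security_group(
    GroupName="banking-dw-sg",                       # example name
    Description="Demo SG for RDS and Redshift access",
    VpcId="vpc-xxxxxxxx",                            # your default or created VPC ID
)

# Open MySQL (3306) and Redshift (5439) to anywhere, as in the console steps.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 5439, "ToPort": 5439,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
)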
Now create one IAM user with the AdministratorAccess policy.
Download the user's access key and secret key.
Go to AWS Redshift -> Click on Clusters -> Create cluster
In Cluster identifier -> Enter the cluster name -> select the dc2.large node type -> Select 1 as the number of nodes
In Database configurations -> Enter a username -> Enter your password
In Associated IAM roles -> Select the IAM role you created (give full admin access for demo purposes)
Create Cluster
It will take about 10-12 minutes to become available. After that, edit the cluster configuration, select your created security group instead of the default SG, and turn on Publicly accessible.
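Once the cluster is available and publicly accessible, you can sanity-check connectivity from Python with psycopg2; the endpoint, database name, and credentials below are placeholders for the values shown on your cluster's details page:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="redshift-cluster-1.xxxxxx.us-east-2.redshift.amazonaws.com",  # your cluster endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
)

with conn.cursor() as cur:
    cur.execute("SELECT current_date;")
    print(cur.fetchone())

conn.close()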
(Note:- While creating the EC2 instance for Airflow, add one extra inbound rule to its security group - a Custom TCP rule with port 8080.)
Step 1:
sudo apt update
sudo apt upgrade -y
Step 2:
sudo apt install python3-pip
Step 3:
sudo apt install sqlite3
Step 4:
sudo apt install python3.10-venv (If it gives an error then run this - sudo apt install python3.8-venv )
Step 5:
python3 -m venv venv
source venv/bin/activate
Step 6:
i] sudo apt-get install libpq-dev
ii] pip install "apache-airflow[postgres]==2.5.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.7.txt"
(If It Gives an Error then run this - pip install "apache-airflow[postgres]==2.5.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.10.txt" )
iii] airflow db init
Step 7:
i] sudo apt-get install postgresql postgresql-contrib
ii] sudo -i -u postgres
iii] psql
1] CREATE DATABASE airflow;
2] CREATE USER airflow WITH PASSWORD 'airflow';
3] GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
Press Ctrl+D to exit psql, then press Ctrl+D one more time to return to the ubuntu user.
Step 8:
i] cd airflow/
ii] sed -i 's#sqlite:////home/ubuntu/airflow/airflow.db#postgresql+psycopg2://airflow:airflow@localhost/airflow#g' airflow.cfg
iii] sed -i 's#SequentialExecutor#LocalExecutor#g' airflow.cfg
Step 9:
i] airflow db init
ii] airflow users create -u airflow -f airflow -l airflow -r Admin -e [email protected] (you will be prompted to set a password)
Step 10:
i] airflow webserver &
ii] airflow scheduler (run this command in a second terminal session on the same instance)
Now open the Airflow UI in your browser, e.g. ec2-18-220-39-100.us-east-2.compute.amazonaws.com:8080 (replace this DNS with your instance's public DNS and keep port 8080).
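With the webserver and scheduler running, you can drop a minimal DAG into ~/airflow/dags/ to confirm the setup works. This is only a placeholder smoke-test DAG, not the project's actual pipeline:

# ~/airflow/dags/hello_dag.py - minimal smoke-test DAG for Airflow 2.5
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # trigger manually from the UI
    catchup=False,
) as dag:
    hello = BashOperator(task_id="say_hello", bash_command="echo 'Airflow is up'")

If the DAG appears in the UI and runs successfully, the PostgreSQL-backed LocalExecutor setup is working.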
So just follow these installation steps.
Step 1:
Launch an AWS EC2 instance with the Ubuntu OS and the t2.medium instance type, and while creating the instance, attach an IAM role with the AdministratorAccess policy to it. (Note:- For demo purposes we are giving the full AdministratorAccess role.)
Step 2:
Update the packages
sudo apt-get update
sudo apt-get upgrade
Step 3:
Now install pip and Java
1] Install pip
sudo apt install python3-pip
Check the pip version
pip3 --version
2] Install py4j and Java
pip3 install py4j
sudo apt-get install openjdk-8-jdk
Step 4:
Now Install Jupyter Notebook
sudo apt install jupyter-notebook
Step 5:
Now we have to configure the Jupyter Notebook settings.
jupyter notebook --generate-config
After this command, a .py config file is created, which we have to edit.
cd .jupyter
nano jupyter_notebook_config.py
ls
Now, in this file, find the line "#c.NotebookApp.iopub_data_rate_limit", remove the leading #, and increase the value to 100000000:
c.NotebookApp.iopub_data_rate_limit = 100000000
cd
Step 6:
Now install Apache Spark
pip3 install findspark
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
Extract it, move it to the /opt folder, and create a symbolic link
tar xvf spark-*
sudo mv spark-3.3.2-bin-hadoop3 /opt/spark-3.3.2
sudo ln -s /opt/spark-3.3.2 /opt/spark
Now configure Spark, Java, and the PySpark driver to use Jupyter Notebook
nano ~/.bashrc
Add the following lines at the end of the file
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
export PATH=$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip 0.0.0.0 --port=8888'
After saving the file, reload it by running
source ~/.bashrc
Step 7:
Connect to the Jupyter Notebook from your web browser
First, enter the following command
pyspark
(Note:- As you can see, there is a URL given in the last line. Copy the contents of the URL after token=, i.e. c6f2835c731caf65c9413a866113c015ae2a589c0a9abc31 in my case.)
Now open your browser
http://<your public dns>:8888
After the page opens, paste the copied token and create a new password if you want.
Step 8:
To connect Spark to S3 and Redshift, we have to download some JAR files
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.183/aws-java-sdk-1.12.183.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.12.183/aws-java-sdk-core-1.12.183.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar
wget https://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/3.0.0-preview1/spark-redshift_2.11-3.0.0-preview1.jar
wget https://repo1.maven.org/maven2/io/github/spark-redshift-community/spark-redshift_2.12/5.1.0/spark-redshift_2.12-5.1.0.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.12.183/aws-java-sdk-s3-1.12.183.jar
Download all these JAR files into the /opt/spark/jars directory.
Here, also place the Redshift JDBC driver JAR file to make the connection between Spark and the Redshift cluster (you can download it from your created Redshift cluster's page).
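After the JARs are in /opt/spark/jars (where Spark picks them up automatically), a Spark session that can talk to S3 (and, via the Redshift connector, to Redshift) can be started roughly like this. The access keys are those of the IAM user created earlier and are placeholders here:

import findspark
findspark.init("/opt/spark")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("banking-dw")
    # S3A credentials - placeholders, replace with your IAM user's keys.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)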
Step 9:
To stop the Jupyter Notebook service
jupyter notebook stop 8888
Now that the environment is set up, work through the following tasks.
task 1
Create a data extraction pipeline using Apache Spark to extract the client-side data into our staging-area bucket.
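A rough PySpark sketch of this extraction step, assuming the Spark session configured in the previous section and placeholder connection details and bucket names, could look like:

# Read the client table from the RDS MySQL instance over JDBC.
# Note: this also needs the MySQL Connector/J JAR in /opt/spark/jars, which is not in the list above.
client_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://your-rds-endpoint.us-east-2.rds.amazonaws.com:3306/bankdb")
    .option("dbtable", "client_records")
    .option("user", "admin")
    .option("password", "your-password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Land the raw extract in the staging-area bucket as Parquet.
client_df.write.mode("overwrite").parquet("s3a://staging-area/client_records/")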
task 2
Read the data from the S3 staging-area bucket using Apache Spark and load it into our Redshift data warehouse, performing the ETL transformations here.
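Under the same assumptions (Spark session from earlier, placeholder endpoints and bucket names), the load into Redshift with the spark-redshift community connector downloaded above could be sketched as:

# Read the staged data back from S3.
staged_df = spark.read.parquet("s3a://staging-area/client_records/")

# Example transformation step of the ETL process.
clean_df = staged_df.dropDuplicates().na.drop()

# Write into Redshift using the spark-redshift connector; tempdir is a scratch S3 prefix.
(
    clean_df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://your-cluster.xxxxxx.us-east-2.redshift.amazonaws.com:5439/dev?user=awsuser&password=your-password")
    .option("dbtable", "public.client_records")
    .option("tempdir", "s3a://staging-area/redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .mode("overwrite")
    .save()
)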
task 3
In a new Apache Spark notebook, read data from the Redshift data warehouse and load a subset into the ml-team bucket for building the machine learning model.
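A corresponding sketch for this step (same connector, same placeholder names) reads from Redshift and writes the model-ready subset to the ml-team bucket:

# Pull the data needed for modelling out of the warehouse.
features_df = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://your-cluster.xxxxxx.us-east-2.redshift.amazonaws.com:5439/dev?user=awsuser&password=your-password")
    .option("query", "SELECT * FROM public.client_records")  # narrow this to the columns the ML team needs
    .option("tempdir", "s3a://staging-area/redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)

# Hand the dataset over to the ML team via their bucket.
features_df.write.mode("overwrite").option("header", "true").csv("s3a://ml-team/training-data/")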
task 4
Connect a visualization tool such as Power BI or Tableau to the Redshift cluster to analyze the data and gain insights.
task 5
Read the data from the ml-team S3 bucket to build the ML model using Kubeflow.
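The training step itself can be prototyped in plain Python before being wrapped as a Kubeflow pipeline component. The sketch below assumes a hypothetical target column named churn in the CSV exported to the ml-team bucket, and a placeholder object key:

# requires: pip install pandas scikit-learn s3fs
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Read the dataset the Spark job wrote to the ml-team bucket (placeholder file name).
df = pd.read_csv("s3://ml-team/training-data/part-00000.csv")

X = df.drop(columns=["churn"])   # hypothetical target column
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

Once this works locally, the same script can be containerized with Docker and registered as a step in a Kubeflow pipeline running on Kubernetes.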