- client id: Advertiser ID
- pubclient id: Publisher ID
- clickIp: IP address
- clmbuser id: unique user ID
- impr id: unique key for every served impression
- site id: publisher website
- goal id: conversion goal type identification ID
- City id / State id / CountryDim id: geo details
- browser id: browser used to access the publisher on any device on the web
- adslot id: slot ID where the advertisement is displayed on a site (unique across all sites)
- crtd: timestamp of the action
- itmclmb id: image/creative shown
- ispDimId: Internet Service Provider
- devTypeDimId: device type ID
- osVerDimId: OS version
- Since the dataset was imbalanced, it first has to be balanced using undersampling, oversampling, or SMOTE.
- Next comes Exploratory Data Analysis on the dataset, generating insights and facts.
- After that we examine transformations of the columns and try to fit the data to a Gaussian distribution using various techniques.
- After the transformations, it is time for Feature Selection, followed by Model Building and Hyperparameter Tuning.
- The last step is to evaluate the metrics and choose the appropriate model.
- I then carried out model building, hyperparameter tuning, and metric evaluation, and rechecked the results against the FAST AI model.
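The balancing step above can be sketched as a simple random undersampler. This is a minimal, self-contained version on toy data; the `conversion_fraud` label name comes from later in this writeup, and in practice a library such as imblearn (`RandomUnderSampler`, `SMOTE`) would be the more robust choice:

```python
import numpy as np
import pandas as pd

def undersample(df, label_col, random_state=42):
    """Randomly drop majority-class rows so every class ends up
    with as many rows as the minority class."""
    rng = np.random.default_rng(random_state)
    n_min = df[label_col].value_counts().min()
    parts = []
    for cls, grp in df.groupby(label_col):
        idx = rng.choice(grp.index.to_numpy(), size=n_min, replace=False)
        parts.append(df.loc[idx])
    # Shuffle so the classes are interleaved again
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy imbalanced frame standing in for the impression log
df = pd.DataFrame({
    "clickIp": range(100),
    "conversion_fraud": [0] * 90 + [1] * 10,
})
balanced = undersample(df, "conversion_fraud")
print(balanced["conversion_fraud"].value_counts())
```

After undersampling, both classes contribute the same number of rows, at the cost of discarding majority-class data; SMOTE instead synthesises new minority samples and keeps everything.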
After fixing the transformations using the Box-Cox transformation, I moved on to outlier detection and found outliers in two columns. I compared strategies for replacing those outlier values, min-max capping among them; here is the corresponding image, from which we can observe that either the mode or the median is the better choice for filling in the outliers of the "Client ID" column.
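The Box-Cox step is available directly in SciPy. A minimal sketch on a synthetic right-skewed feature (Box-Cox requires strictly positive values, so a real column may need shifting first):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Right-skewed toy feature standing in for one of the dataset's columns
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox fits the lambda that makes the data most Gaussian-like
x_bc, lam = stats.boxcox(x)
print(f"lambda = {lam:.3f}")
print(f"skew before = {stats.skew(x):.2f}, after = {stats.skew(x_bc):.2f}")
```

The fitted lambda is what you would store to apply the same transform to test data (`stats.boxcox(x_test, lmbda=lam)`).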
After fixing all the transformations and outliers, I moved on to modelling, starting with ensemble techniques. The first thing I did was find the value of `ccp_alpha`, a hyperparameter that prunes the decision tree so we can avoid overfitting. (A bagging technique would have been another option.)
Here is the graph of the training and testing curves, which shows how the decision tree behaves as `ccp_alpha` varies.
From the graph above, we observe that setting `ccp_alpha` between 0.005 and 0.015 gives a better version of the decision tree. Here is one sample where I set `ccp_alpha` to 0.015, and this is the resulting tree: previously the tree was huge, with a depth of almost 20 levels, and after pruning it shrank considerably.
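Cost-complexity pruning is built into scikit-learn. A minimal sketch on synthetic data (not the project's dataset) showing how `cost_complexity_pruning_path` yields the candidate `ccp_alpha` values and how a pruned tree compares to an unpruned one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pruning path lists the effective alphas for this training set;
# these are the values to sweep when drawing train/test curves
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.015).fit(X_tr, y_tr)
print("depth:", unpruned.get_depth(), "->", pruned.get_depth())
print("nodes:", unpruned.tree_.node_count, "->", pruned.tree_.node_count)
```

Sweeping `ccp_alpha` over `path.ccp_alphas` and plotting train vs. test accuracy reproduces the kind of curve shown above.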
After this I moved on to feature selection using Mutual Information Gain and the ExtraTrees Classifier, and identified the features that carried the most information. Here is each feature's contribution and its value.
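Both rankings come from scikit-learn. A minimal sketch on synthetic data (feature names `f0`–`f7` are placeholders, not the dataset's columns) computing mutual information and ExtraTrees importances side by side:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=800, n_features=8,
                           n_informative=3, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

# Mutual information between each feature and the label (>= 0, 0 = independent)
mi = mutual_info_classif(X, y, random_state=0)
# Impurity-based importances from an extremely-randomised tree ensemble
imp = ExtraTreesClassifier(random_state=0).fit(X, y).feature_importances_

for n, m, i in sorted(zip(names, mi, imp), key=lambda t: -t[1]):
    print(f"{n}: MI={m:.3f}, ExtraTrees importance={i:.3f}")
```

Features scoring near zero on both measures are the candidates to drop before model building.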
After selecting the best features, I moved on to model building and hyperparameter tuning, and then predictive modelling using FAST AI. Here is the image of the `value_counts` of the `conversion_fraud` column.
Here is the performance of the various machine learning algorithms I trained on the data.
Random Forest seemed great, but it was not: it failed on the test data, probably due to overfitting. I selected the Decision Tree as the model, since it scored well on all the metrics. After reviewing the performance of the models shown above, I went on to try neural networks, and the neural network outperformed them all. Neural networks, when properly regularised, are among the best machine learning models. Here is the summary of my network.
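As an illustration of "properly regularised", here is a minimal sketch with scikit-learn's `MLPClassifier` on synthetic data; this is not the author's FAST AI network, just the same idea, with `alpha` as the L2 penalty doing the regularising:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two small hidden layers; alpha is the L2 regularisation strength
net = MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-3,
                    max_iter=500, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {net.score(X_te, y_te):.3f}")
```

Tuning `alpha` (and dropout or weight decay in a framework like fastai) is what keeps the network from overfitting the way the Random Forest did here.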
This was the initial count of conversion frauds in the given data; now let's see how the model predicts on it. The training data had 966 rows, while the testing data had nearly 450 rows.
The Decision Tree performed well in comparison to the others, as we can see from the plots by comparing the initial and final plots.
The neural network classified users with almost 88% accuracy. If trained with more parameters and optimised regularisation, a neural network can perform even better than the other models; after all, given more data and more training samples, a neural net performs best! :)