CodeGladiators

Problem Statement

Digital advertising is changing at a rapid pace, with a huge increase in the digital audience. At the same time, the success metric of digital advertising is shifting from audience volume (e.g. impression count) to conversions (e.g. lead submissions). This requires greater transparency and control over conversions.

Colombia, the digital advertising arm of Times Internet Limited, has seen significant growth in its digital advertising inventory. It wants to ensure that, in all its conversion-based campaigns, no unfair advantage is given to publishers generating fake leads.

Your task is to segregate the test data into genuine and false conversions by identifying the maximum possible number of leads generated by malicious techniques.

Note:
Joining with the Click Log:
imprId (Click Log) joins to imprid_cr (Conversion Log - Test and Train Data)

(Use the click log data for any additional information required to identify conversion fraud.)

Essential Columns
  • client id: Advertiser ID
  • pubclient id: Publisher ID
  • clickIp: IP address
  • clmbuser id: Unique user ID
  • impr id: Unique key for every served impression
  • site id: Publisher website
  • goal id: Conversion's goal-type identification ID
  • City id / State id / CountryDim id: Geo details
  • browser id: Browser used to access the publisher on any device on the web
  • adslot id: Slot ID where the advertisement is displayed on a site (unique across all sites)
  • crtd: Timestamp of the action
  • itmclmb id: Image/creative shown
  • ispDimId: Internet service provider
  • devTypeDimId: Device type ID
  • osVerDimId: OS version

Download the Dataset from here

My Approach

  1. First, since the dataset was imbalanced, balance it using undersampling, oversampling, or SMOTE (see the sketch after this list).
  2. Next, perform exploratory data analysis on the dataset to generate insights and facts.
  3. Then check the transformations of the columns and try to fit the data to a Gaussian distribution using various techniques.
  4. After the transformations, perform feature selection, followed by model building and hyperparameter tuning.
  5. Work on the evaluation metrics and choose the appropriate model.
  6. Finally, recheck the model building, hyperparameter tuning, and metric evaluation against the FAST AI model.
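
A minimal sketch of step 1 using imbalanced-learn's SMOTE. The target column name conversion_fraud comes from the description below; the file name and the assumption that all features are already numeric are illustrative, not taken from the repository.

```python
# Balance the training data with SMOTE (imbalanced-learn).
import pandas as pd
from imblearn.over_sampling import SMOTE

train = pd.read_csv("train.csv")                      # hypothetical file name
X = train.drop(columns=["conversion_fraud"])          # features (assumed numeric here)
y = train["conversion_fraud"]                         # target: genuine vs. fraudulent conversion

# SMOTE synthesises new minority-class samples instead of merely duplicating rows.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y.value_counts(), y_res.value_counts(), sep="\n")
```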

Observations

After fixing the transformations with the Box-Cox transformation, I moved on to outlier detection and found outliers in two columns. The best way to replace those outlier values was a min-max strategy; from the corresponding image we can observe that either the mode or the median is a better choice for filling in the outliers of the "Client ID" column.
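
A hedged sketch of the two steps above, continuing from the earlier SMOTE snippet: a Box-Cox transform (which requires strictly positive values) and a median-based replacement of outliers. The IQR fences and the example column name are assumptions for illustration.

```python
# Box-Cox transformation and outlier replacement for a single column.
from scipy import stats

def boxcox_column(series):
    """Shift the column to be strictly positive, then apply Box-Cox."""
    shifted = series - series.min() + 1
    transformed, lam = stats.boxcox(shifted)
    return transformed, lam

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside the IQR fences with the column median."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.where(series.between(lower, upper), series.median())

# Hypothetical usage on an assumed column name:
# train["clientId"] = replace_outliers_with_median(train["clientId"])
# transformed, lam = boxcox_column(train["clientId"])
```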

After fixing all the transformations and outliers, I moved on to modelling, starting with ensemble techniques. The first step was to find a value of ccp_alpha, a hyperparameter that prunes the decision tree so that we can avoid overfitting (a bagging technique could also have been used). Below is the graph of the training and testing curves, which shows how the decision tree behaves.

From the above graph we observe that setting ccp_alpha between [0.005, 0.015] gives a better version of the decision tree. Below is one sample where I set ccp_alpha to 0.015, along with the resulting tree; previously the tree was very large, with a depth of almost 20 levels, and pruning reduced it considerably. A sketch of this pruning step follows.
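
A sketch of the cost-complexity pruning step using scikit-learn, continuing from the earlier snippets (X_res, y_res). The train/test split ratio is an assumption; the ccp_alpha value of 0.015 comes from the text above.

```python
# Enumerate candidate ccp_alpha values, then refit a pruned tree with the chosen one.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X_res, y_res, test_size=0.3, random_state=42)

# Effective alphas of the unpruned tree, used to draw the training/testing curves.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  train={tree.score(X_tr, y_tr):.3f}  test={tree.score(X_te, y_te):.3f}")

# The value picked from the graph: a far shallower tree than the ~20-level unpruned one.
pruned = DecisionTreeClassifier(ccp_alpha=0.015, random_state=42).fit(X_tr, y_tr)
print("pruned depth:", pruned.get_depth())
```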

After this I moved on to feature selection using mutual information gain and an ExtraTrees classifier, and identified the features that carried the most information; the feature contributions and their values are shown below.
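
A sketch of the two feature-selection views mentioned above, assuming X_tr and y_tr from the earlier snippets and that X_tr is still a DataFrame with named columns.

```python
# Rank features by mutual information and by ExtraTrees importances.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import ExtraTreesClassifier

mi = pd.Series(mutual_info_classif(X_tr, y_tr, random_state=42), index=X_tr.columns)
print("Mutual information:\n", mi.sort_values(ascending=False).head(10))

et = ExtraTreesClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
imp = pd.Series(et.feature_importances_, index=X_tr.columns)
print("ExtraTrees importances:\n", imp.sort_values(ascending=False).head(10))
```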

Final Overview of Model

Results

After selecting the best features, I moved on to model building and hyperparameter tuning, and then to predictive modelling using FAST AI. Below is an image of the value_counts of the conversion_fraud column.
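
A hedged sketch of a fastai tabular model for this task. Only the conversion_fraud target comes from the text; the categorical column names, the validation split, and the batch size are assumptions, not the author's exact setup.

```python
# fastai (v2) tabular learner on the training DataFrame from the earlier sketch.
from fastai.tabular.all import *

dls = TabularDataLoaders.from_df(
    train,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=["pubClientId", "siteId", "browserId", "devTypeDimId"],  # assumed categoricals
    cont_names=[],                                                     # no continuous features in this sketch
    y_names="conversion_fraud",
    y_block=CategoryBlock(),
    valid_idx=list(range(800, 966)),    # last rows of the 966-row training set, as an assumed split
    bs=64,
)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5)
```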

Here is the performance of the various machine learning algorithms on which I trained my data.

Random Forest seemed great, but it was not: it failed on the test data, possibly due to overfitting. I selected the decision tree as the final model, since it scored well on all the metrics.
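
A sketch of the train-versus-test comparison that exposes this kind of overfitting, continuing from the earlier snippets; the Random Forest settings are illustrative.

```python
# Compare training and testing accuracy to spot overfitting.
from sklearn.ensemble import RandomForestClassifier

models = {
    "decision_tree": DecisionTreeClassifier(ccp_alpha=0.015, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: train={model.score(X_tr, y_tr):.3f}  test={model.score(X_te, y_te):.3f}")
```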

After reviewing the performance of the models shown above, I tried a neural network, and it outperformed all of them. Neural networks, when regularised properly, are excellent models; the summary of my network is shown below.
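
A hedged sketch of a small regularised network of the kind described above, using Keras; the layer sizes, dropout rates, and L2 strength are illustrative, not the author's exact architecture.

```python
# Small binary classifier with dropout and L2 regularisation.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(X_tr.shape[1],)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # genuine vs. fraudulent conversion
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_tr, y_tr, validation_data=(X_te, y_te), epochs=30, batch_size=32)
model.summary()
```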

Accuracy (left) and loss (right) curves for neural network training

Initial Counts (Given In Training Data)

This was the initial count of conversion frauds in the given data; now let's see how the model predicts on it. The training data had 966 rows, while the testing data had nearly 450 rows.

Final Counts (Prediction Stage, Testing the Model on Testing Data)

The decision tree performed well in comparison to the others, as we can see by comparing the initial plots with the final plots.

Neural Network Predictions

The neural network achieved roughly 88% accuracy in classifying the conversions. If trained with more parameters and optimised regularisation, a neural network can perform even better than the other models; after all, given more data and more training samples, a neural net performs best! :)

Result

Decision Tree - 84.3%
Neural Network - 90%

Libraries Used
