
stack-overflow-tag-predictions's Introduction

Stackoverflow Tag Prediction

Business Problem

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
Stack Overflow is something that every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the top eight most discussed topics on the site are Java, JavaScript, C#, PHP, Android, jQuery, Python, and HTML.

Problem Statement

Suggest tags based on the content of a question posted on Stack Overflow.

Data Source

Source: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/

Real World / Business Objectives and Constraints

Predict as many tags as possible with high precision and recall. Incorrect tags could impact the customer experience on Stack Overflow. There are no strict latency constraints.

Data Overview

Train.csv contains 4 columns: Id, Title, Body, Tags.
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv: 6.75 GB
Number of rows in Train.csv: 6,034,195
The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering (such as removing closed questions) has been performed.

Key Performance Metric

Micro-Averaged F1-Score (Mean F Score): The F1 score can be interpreted as a weighted average of precision and recall, where the F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.

'Micro f1 score': Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.

'Macro f1 score': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

https://www.kaggle.com/wiki/MeanFScore
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
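
As an illustration, here is a minimal sketch of both averages using scikit-learn's f1_score on a small made-up multi-label indicator matrix (the y_true and y_pred arrays below are hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground-truth and predicted tag matrices: one row per question,
# one column per tag, 1 = tag present.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Micro: count TP/FP/FN globally over all labels (better under label imbalance).
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
# Macro: unweighted mean of per-label F1 (ignores label imbalance).
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```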

Prerequisites

You need to have the following software and libraries installed on your machine before running this project.

  1. Python 3
  2. Anaconda: it installs the IPython/Jupyter notebook and most of the required libraries, such as scikit-learn, pandas, seaborn, matplotlib, NumPy, and SciPy.
  3. Word Cloud: https://pypi.org/project/wordcloud/

Author

  • Yasin Shah

stack-overflow-tag-predictions's People

Contributors

dethebera · honeybhardwaj · imsiddhant07 · pratikshitvas · technocolabs100 · vatsalmehta-3009


stack-overflow-tag-predictions's Issues

Data Preparation

Remember, we have 1 million questions and 42k tags, and training on that much data would be very slow and difficult, so I decided to work with a small subset of tags. Let C = {the 42k tags} and let C1 be a subset of C. To find the smallest useful subset C1, we can use the tag counts we plotted earlier: since we know how many times each tag occurs, keeping the most frequently occurring tags covers the maximum number of questions. After checking several values, I found that the top 5,500 most frequently occurring tags cover almost 99% of the questions.
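
A rough sketch of this coverage check (hypothetical variable names, built on the tag counts mentioned above):

```python
from collections import Counter

# Hypothetical per-question tag lists standing in for the real 1M questions.
questions_tags = [["php", "mysql"], ["c#", ".net"], ["php", "arrays"], ["java"]]
tag_counts = Counter(tag for tags in questions_tags for tag in tags)

def coverage(top_k):
    top = {tag for tag, _ in tag_counts.most_common(top_k)}
    # One reasonable definition of "covered": at least one of a question's
    # tags falls inside the top-k subset.
    covered = sum(any(t in top for t in tags) for tags in questions_tags)
    return covered / len(questions_tags)

print(coverage(3))
# On the full dataset, coverage(5500) comes out to roughly 0.99.
```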

Data Visualisation of Dataset

I would like to do some data visualisation of the dataset to get more insight into the data we are going to work with.
Please assign this to me if possible 😀

Contribution for GSSOC

Hi, I would love to contribute to this project during GSSOC 21. I have a couple of ideas about how we can go about it. Could you tell me a bit more about the progress on the project and what sort of features you expect us to implement?

Creating Contributing.md

Hi,
I would like to create the Contributing.md file for this repo since it has not been created yet. The file will have all the necessary details one should know to contribute to this project.

Regards
Honey Bhardwaj
GSSOC participant

Train and Test Split

You have to divide the 1 million data points randomly into train and test sets. We end up with (799999, 5500) points as train data and (200000, 5500) points as test data.
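
A minimal sketch of this 80/20 split, assuming `questions` holds the cleaned question text and `tag_matrix` the (1M x 5500) binary tag matrix built with CountVectorizer (both names are hypothetical):

```python
from sklearn.model_selection import train_test_split

# questions: array-like of 1M preprocessed question strings (assumed)
# tag_matrix: sparse (1,000,000 x 5500) binary label matrix (assumed)
x_train, x_test, y_train, y_test = train_test_split(
    questions, tag_matrix, test_size=0.20, random_state=42
)
print(y_train.shape, y_test.shape)   # roughly (800000, 5500) and (200000, 5500)
```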

Data loading Using SQLite DB

To process the data further, we use the simplest and most widely used database, SQLite. It is a powerful, lightweight, open-source database, and the best part is that Python ships with SQLite support built in.
More about SQLite: https://www.sqlite.org/index.html
So instead of working from the CSV file directly, we pump all of the data into an SQLite database and use SQL queries for further processing.
We take all the data in the CSV file and append it to the database 'train.db'. We can then use this SQLite database for further processing, as sketched below.
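
A small sketch of loading the CSV into SQLite with pandas (the table name `data` is an assumption; the file and database names mirror the description above):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("train.db")

# Read the 6.75 GB Train.csv in chunks so it fits in memory, appending each
# chunk to a table called "data".
for chunk in pd.read_csv("Train.csv", chunksize=100_000):
    chunk.to_sql("data", conn, if_exists="append", index=False)

# Later, rows can be pulled back with plain SQL, e.g. a 1M-row sample:
sample = pd.read_sql_query("SELECT Title, Body, Tags FROM data LIMIT 1000000", conn)
conn.close()
```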

Data Preprocessing

You have to follow the steps below to preprocess the data:
i. Sampled 1M data points because of computing and memory limitations.
ii. Separated code snippets from the Body.
iii. Removed special characters from the question title and description (but not from the code).
iv. Removed stop words (except 'C').
v. Removed HTML tags using regular expressions.
vi. Converted all characters to lowercase.
vii. Used SnowballStemmer to stem the words.
Below are example questions after preprocessing.

Now create a new database called 'Processed.db' and load the preprocessed data into it. A condensed sketch of these preprocessing steps is shown below.
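
A condensed sketch of steps ii-vii (a hypothetical helper, not the exact project code; it uses NLTK's stop word list and SnowballStemmer):

```python
import re
from nltk.corpus import stopwords          # may require nltk.download("stopwords")
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
# Keep 'c' so that the C language is not stripped out along with the stop words.
stop_words = set(stopwords.words("english")) - {"c"}

def preprocess(title, body):
    # Separate code snippets from the rest of the body.
    code = " ".join(re.findall(r"<code>(.*?)</code>", body, flags=re.DOTALL))
    text = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL)
    # Remove the remaining HTML tags and special characters, then lowercase.
    text = re.sub(r"<.*?>", " ", title + " " + text)
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # Drop stop words (except 'c') and stem whatever is left.
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words), code

print(preprocess("How to use <b>pointers</b> in C?",
                 "<p>See this:</p><code>int *p;</code>"))
```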

Analysis of Tags

Tags are our class labels. Since we are trying to predict them, we should dive deep and understand them well. After removing all the duplicate data, we are left with 4.2 million data points and 42k unique tags.
The number of times a tag appears is an interesting thing to understand, so I counted it and put the counts into a dictionary. Looking at the resulting counts, the ".a" tag appears in 18 questions, the ".app" tag appears in 37 questions, and so on. Remember, a tag never repeats within the same question.
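
A small sketch of this tag-frequency count; the text describes a dictionary of counts, and CountVectorizer with binary=True gives the same thing via column sums (the tag strings below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical values of the space-separated Tags column.
tag_strings = [".a spring", ".app ios", ".a c#"]
vectorizer = CountVectorizer(tokenizer=lambda text: text.split(), binary=True)
tag_matrix = vectorizer.fit_transform(tag_strings)

# Column sums = number of questions each tag appears in.
counts = dict(zip(vectorizer.get_feature_names_out(), tag_matrix.sum(axis=0).A1))
print(counts)   # '.a' appears in 2 questions, '.app' in 1, and so on
```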

Revamping README.md for better documentation

Greetings! As a contributor for GSSOC'21,
I would like to revamp the README.md documentation for clarity, with better information and grammar.
My contribution will focus on:

  • a properly structured format for easy navigation (links and flow of interaction)
  • dynamic visualization in the README (GIFs and image media)
  • better link presentation
  • categorized headings and sub-headings for clarity
  • a contributors highlight
  • installation guides (how-tos, do's and don'ts)
  • open-source contribution procedures (forking, committing, branch checkouts, etc.)

I will get started once this issue #30 is assigned.

Creating Code of Conduct

Hi,
I would like to create CODEOFCONDUCT.md for this repo since it has not been created yet.
Regards
Honey Bhardwaj

Multi-Label Classification

We can solve binary and multi-class classification problems with the help of algorithms like Logistic Regression, Support Vector Machines, Random Forest, etc. So, to solve the present multi-label classification problem, we need to convert it into a binary or a multi-class classification problem.
Imagine we have three questions with tags.
The total number of labels is 4 (t1, t2, t3, t4), so we can convert each question into a binary vector of size 4.
Binary Vector Representation
We can use CountVectorizer to convert all our tags into binary vectors, as sketched below.
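
A small sketch of this binary vector representation for three hypothetical questions with labels t1..t4:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Space-separated tags for three hypothetical questions.
question_tags = ["t1 t2", "t2 t3 t4", "t1 t4"]
vectorizer = CountVectorizer(tokenizer=lambda text: text.split(), binary=True)
y = vectorizer.fit_transform(question_tags)

print(vectorizer.get_feature_names_out())   # ['t1' 't2' 't3' 't4']
print(y.toarray())
# [[1 1 0 0]
#  [0 1 1 1]
#  [1 0 0 1]]
```

Each row is one question and each column is one tag, so this binary matrix becomes the multi-label target for the classifiers that follow.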

Add Issue and PR Template

I would like to add issue and PR templates to this project. The issue template will explain the type of issue and its description, and the PR template will describe the type of PR, the checklist it obeys, and its description.

Kindly assign this to me as a part of GirlScript Summer of Code. I would love to do it.

Logistic Regression: One vs Rest Classifier

We have very high-dimensional data, and in the binary-relevance representation we need to build many models. To tackle this, we take the help of Logistic Regression with a One-vs-Rest classifier. The classifier takes each of the labels in turn and trains 5500 logistic regression models, one per tag. Training a logistic regression model is very cheap compared to other models like Support Vector Machines (SVM) or Random Forest, and it performs really well on high-dimensional data.
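
A minimal sketch of this setup with scikit-learn, assuming `x_train`/`x_test` are the vectorized (e.g. TF-IDF) question matrices and `y_train`/`y_test` the binary tag matrices from the previous steps (hypothetical names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

# One-vs-Rest fits one logistic regression per tag, i.e. 5500 models in total.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
print("Micro F1:", f1_score(y_test, y_pred, average="micro"))
```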
