
stack-overflow-tag-predictions's Introduction

Stackoverflow Tag Prediction

Business Problem

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
Stack Overflow is something that every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the top eight most discussed topics on the site are Java, JavaScript, C#, PHP, Android, jQuery, Python, and HTML.

Problem Statement

Suggest tags based on the content of a question posted on Stack Overflow.

Data Source

Source: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/

Real World / Business Objectives and Constraints

Predict as many tags as possible with high precision and recall. Incorrect tags could impact the customer experience on Stack Overflow. There are no strict latency constraints.

Data Overview

Train.csv contains 4 columns: Id, Title, Body, Tags.
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv: 6.75 GB
Number of rows in Train.csv: 6,034,195
The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering (such as removing closed questions) has been performed.

Key Performance Metric

Micro-Averaged F1-Score (Mean F Score): The F1 score can be interpreted as a weighted average of precision and recall, where the F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.

'Micro f1 score': Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.

'Macro f1 score': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

https://www.kaggle.com/wiki/MeanFScore
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
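
As an illustration, here is a minimal sketch of both averages using scikit-learn's f1_score on a small made-up multi-label indicator matrix (the y_true and y_pred arrays below are hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground-truth and predicted tag matrices: one row per question,
# one column per tag, 1 = tag present.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Micro: count TP/FP/FN globally over all labels (better under label imbalance).
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
# Macro: unweighted mean of per-label F1 (ignores label imbalance).
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```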

Prerequisites

You need to have the following software and libraries installed on your machine before running this project.

  1. Python 3
  2. Anaconda: it installs the IPython/Jupyter notebook and most of the required libraries, such as scikit-learn, pandas, seaborn, matplotlib, NumPy, and SciPy.
  3. Word Cloud: https://pypi.org/project/wordcloud/

Author

  • Yasin Shah

stack-overflow-tag-predictions's People

Contributors

dethebera · honeybhardwaj · imsiddhant07 · pratikshitvas · technocolabs100 · vatsalmehta-3009


stack-overflow-tag-predictions's Issues

Data Preparation

Remember, we have 1 million questions and 42k tags, and training on that much data would be very slow and difficult, so I decided to work with a small subset of tags. Let C = {the 42k tags} and let C1 be a subset of C. To find the smallest useful subset C1, we can use the tag counts we plotted earlier: since we know how many times each tag occurs, keeping the most frequently occurring tags covers the maximum number of questions. After checking several values, I found that the top 5,500 most frequently occurring tags cover almost 99% of the questions.
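
A rough sketch of this coverage check (hypothetical variable names, built on the tag counts mentioned above):

```python
from collections import Counter

# Hypothetical per-question tag lists standing in for the real 1M questions.
questions_tags = [["php", "mysql"], ["c#", ".net"], ["php", "arrays"], ["java"]]
tag_counts = Counter(tag for tags in questions_tags for tag in tags)

def coverage(top_k):
    top = {tag for tag, _ in tag_counts.most_common(top_k)}
    # One reasonable definition of "covered": at least one of a question's
    # tags falls inside the top-k subset.
    covered = sum(any(t in top for t in tags) for tags in questions_tags)
    return covered / len(questions_tags)

print(coverage(3))
# On the full dataset, coverage(5500) comes out to roughly 0.99.
```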

Data Visualisation of Dataset

I would like to do some data visualisation of the dataset to get more insight into the data we are going to work with.
Please assign this to me if possible 😀

Contribution for GSSOC

Hi, I would love to contribute to this project during GSSOC 21. I have a couple of ideas about how we can go about it. Could you tell me a bit more about the progress on the project and what sort of features you expect us to implement?

Creating Contributing.md

Hi,
I would like to create the Contributing.md file for this repo since it has not been created yet. The file will have all the necessary details one should know to contribute to this project.

Regards
Honey Bhardwaj
GSSOC participant

Train and Test Split

You have to divide the 1 million data points randomly into train and test sets. We end up with (799999, 5500) points as train data and (200000, 5500) points as test data.
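
A minimal sketch of this 80/20 split, assuming `questions` holds the cleaned question text and `tag_matrix` the (1M x 5500) binary tag matrix built with CountVectorizer (both names are hypothetical):

```python
from sklearn.model_selection import train_test_split

# questions: array-like of 1M preprocessed question strings (assumed)
# tag_matrix: sparse (1,000,000 x 5500) binary label matrix (assumed)
x_train, x_test, y_train, y_test = train_test_split(
    questions, tag_matrix, test_size=0.20, random_state=42
)
print(y_train.shape, y_test.shape)   # roughly (800000, 5500) and (200000, 5500)
```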

Data loading Using SQLite DB

To process the data further, we use the simplest and most widely used database, SQLite. It is a powerful, lightweight, open-source database, and the best part is that Python ships with SQLite support built in.
More about SQLite: https://www.sqlite.org/index.html
So instead of working from the CSV file directly, we pump all of the data into an SQLite database and use SQL queries for further processing.
We take all the data in the CSV file and append it to the database 'train.db'. We can then use this SQLite database for further processing, as sketched below.
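
A small sketch of loading the CSV into SQLite with pandas (the table name `data` is an assumption; the file and database names mirror the description above):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("train.db")

# Read the 6.75 GB Train.csv in chunks so it fits in memory, appending each
# chunk to a table called "data".
for chunk in pd.read_csv("Train.csv", chunksize=100_000):
    chunk.to_sql("data", conn, if_exists="append", index=False)

# Later, rows can be pulled back with plain SQL, e.g. a 1M-row sample:
sample = pd.read_sql_query("SELECT Title, Body, Tags FROM data LIMIT 1000000", conn)
conn.close()
```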

Data Preprocessing

You have to follow the steps below to preprocess the data:
i. Sampled 1M data points because of computing and memory limitations.
ii. Separated code snippets from the Body.
iii. Removed special characters from the question title and description (but not from the code).
iv. Removed stop words (except 'C').
v. Removed HTML tags using regular expressions.
vi. Converted all characters to lowercase.
vii. Used SnowballStemmer to stem the words.
Below are example questions after preprocessing.

Now create a new database called 'Processed.db' and load the preprocessed data into it. A condensed sketch of these preprocessing steps is shown below.
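
A condensed sketch of steps ii-vii (a hypothetical helper, not the exact project code; it uses NLTK's stop word list and SnowballStemmer):

```python
import re
from nltk.corpus import stopwords          # may require nltk.download("stopwords")
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
# Keep 'c' so that the C language is not stripped out along with the stop words.
stop_words = set(stopwords.words("english")) - {"c"}

def preprocess(title, body):
    # Separate code snippets from the rest of the body.
    code = " ".join(re.findall(r"<code>(.*?)</code>", body, flags=re.DOTALL))
    text = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL)
    # Remove the remaining HTML tags and special characters, then lowercase.
    text = re.sub(r"<.*?>", " ", title + " " + text)
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # Drop stop words (except 'c') and stem whatever is left.
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words), code

print(preprocess("How to use <b>pointers</b> in C?",
                 "<p>See this:</p><code>int *p;</code>"))
```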

Analysis of Tags

Tags are our class labels. Since we are trying to predict them, we should dive deep and understand them well. After removing all the duplicate data, we are left with 4.2 million data points and 42k unique tags.
The number of times a tag appears is an interesting thing to understand, so I counted it and put the counts into a dictionary. Looking at the resulting counts, the ".a" tag appears in 18 questions, the ".app" tag appears in 37 questions, and so on. Remember, a tag never repeats within the same question.
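
A small sketch of this tag-frequency count; the text describes a dictionary of counts, and CountVectorizer with binary=True gives the same thing via column sums (the tag strings below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical values of the space-separated Tags column.
tag_strings = [".a spring", ".app ios", ".a c#"]
vectorizer = CountVectorizer(tokenizer=lambda text: text.split(), binary=True)
tag_matrix = vectorizer.fit_transform(tag_strings)

# Column sums = number of questions each tag appears in.
counts = dict(zip(vectorizer.get_feature_names_out(), tag_matrix.sum(axis=0).A1))
print(counts)   # '.a' appears in 2 questions, '.app' in 1, and so on
```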

Revamping README.md for better documentation

Greetings! As a contributor for GSSOC'21,
I would like to revamp the README.md documentation for clarity, with better information and grammar.
My contribution will focus on:

  • a properly structured format for easy navigation (links and flow of interaction)
  • dynamic visualization in the README (GIFs and image media)
  • better link presentation
  • categorized headings and sub-headings for clarity
  • a contributors highlight
  • installation guides (how-tos, do's and don'ts)
  • open-source contribution procedures (forking, committing, branch checkouts, etc.)

I will get started once this issue #30 is assigned.

Creating Code of Conduct

Hi,
I would like to create CODEOFCONDUCT.md for this repo since it has not been created yet.
Regards
Honey Bhardwaj

Multi-Label Classification

We can solve binary and multi-class classification problems with the help of algorithms like Logistic Regression, Support Vector Machines, Random Forest, etc. So, to solve the present multi-label classification problem, we need to convert it into a binary or a multi-class classification problem.
Imagine we have three questions with tags.
The total number of labels is 4 (t1, t2, t3, t4), so we can convert each question into a binary vector of size 4.
Binary Vector Representation
We can use CountVectorizer to convert all our tags into binary vectors, as sketched below.
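
A small sketch of this binary vector representation for three hypothetical questions with labels t1..t4:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Space-separated tags for three hypothetical questions.
question_tags = ["t1 t2", "t2 t3 t4", "t1 t4"]
vectorizer = CountVectorizer(tokenizer=lambda text: text.split(), binary=True)
y = vectorizer.fit_transform(question_tags)

print(vectorizer.get_feature_names_out())   # ['t1' 't2' 't3' 't4']
print(y.toarray())
# [[1 1 0 0]
#  [0 1 1 1]
#  [1 0 0 1]]
```

Each row is one question and each column is one tag, so this binary matrix becomes the multi-label target for the classifiers that follow.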

Add Issue and PR Template

I would like to add issue and PR templates to this project. The issue template will explain the type of issue and its description, and the PR template will describe the type of PR, the checklist it obeys, and its description.

Kindly assign this to me as a part of GirlScript Summer of Code. I would love to do it.

Logistic Regression: One vs Rest Classifier

We have very high-dimensional data, and in the binary-relevance representation we need to build many models. To tackle this, we take the help of Logistic Regression with a One-vs-Rest classifier. The classifier takes each of the labels in turn and trains 5500 logistic regression models, one per tag. Training a logistic regression model is very cheap compared to other models like Support Vector Machines (SVM) or Random Forest, and it performs really well on high-dimensional data.
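
A minimal sketch of this setup with scikit-learn, assuming `x_train`/`x_test` are the vectorized (e.g. TF-IDF) question matrices and `y_train`/`y_test` the binary tag matrices from the previous steps (hypothetical names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

# One-vs-Rest fits one logistic regression per tag, i.e. 5500 models in total.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
print("Micro F1:", f1_score(y_test, y_pred, average="micro"))
```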
