technocolabs100 / stack-overflow-tag-predictions

Tag Prediction from Stack Overflow Questions

Jupyter Notebook 100.00%

stack-overflow-tag-predictions's Issues

Add GitHub Action to auto-label issues

Add a GitHub Action that automatically applies the gssoc21 label to an issue when a contributor asks to work on it under gssoc21.

Train and Test Split

Randomly divide our 1 million data points into a train set and a test set. This gives (799999, 5500) points as train data and (200000, 5500) points as test data.
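A minimal sketch of such a random 80/20 split, assuming scikit-learn's `train_test_split` is used (the issue does not name a function). The toy arrays `X` and `Y` are hypothetical stand-ins for the real feature matrix and the 5500-column tag matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 questions and 4 tag columns
# (the issue works with about 1M rows and 5500 tag columns).
X = np.arange(10).reshape(-1, 1)                           # hypothetical features
Y = np.random.RandomState(0).randint(0, 2, size=(10, 4))   # hypothetical binary tag matrix

# Random 80/20 split; random_state only makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(y_train.shape, y_test.shape)
```

Each row lands in exactly one of the two sets, so the test set stays unseen during training.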

Creating Code of Conduct

Hi,
I would like to create CODEOFCONDUCT.md for this repo since it has not been created yet.
Regards,
Honey Bhardwaj

Data Visualisation of Dataset

I would like to do data visualisation of the dataset to get more insights about the data we are going to work with.
Do assign this to me if possible 😀

Revamping README.md for better documentation

Greetings, as a contributor for GSSOC'21,
I would like to revamp the README.md documentation for clarity, with better information and grammar.
What my contribution will focus on:

a properly structured format for easy navigation (links and flow of interaction)
dynamic visuals in the README (GIFs and images)
better presentation of links
categorized headings and sub-headings for clarity
a contributors highlight
installation guides (how-tos, dos and don'ts)
open-source contribution procedures (forking, committing, branch checkouts, etc.)

I will get started once this issue, #30, is assigned.

We can solve binary and multi-class classification problems with algorithms like Logistic Regression, Support Vector Machines, Random Forest, etc. So to solve the present multi-label classification problem, we need to convert it into a binary or a multi-class classification problem.
Imagine we have three questions with tags.
There are 4 labels in total (t1, t2, t3, t4), so we can convert each question's tags into a vector of size 4.
Binary Vector Representation
We can use CountVectorizer to convert all our tags into binary vectors.
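A small sketch of that conversion with scikit-learn's `CountVectorizer`. The three tag strings below are hypothetical, since the example questions are not shown here; splitting tags on whitespace is also an assumption about how the tags are stored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tags for three questions, space-separated.
tags = ["t1 t2", "t3", "t1 t4"]

# binary=True records presence (0/1) instead of counts; the tokenizer splits on
# whitespace so each tag is kept whole. token_pattern=None only silences the
# "token_pattern is unused" warning.
vectorizer = CountVectorizer(
    tokenizer=lambda s: s.split(),
    token_pattern=None,
    binary=True,
    lowercase=False,
)
Y = vectorizer.fit_transform(tags)

# Column order follows the sorted vocabulary: t1, t2, t3, t4.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(Y.toarray())
# [[1 1 0 0]
#  [0 0 1 0]
#  [1 0 0 1]]
```

Each column of `Y` can then be treated as its own binary classification target.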

Data Loading Using SQLite DB

For further processing, we use SQLite, one of the simplest and most widely used databases. It is a powerful, lightweight, open-source database. The best part is that Python ships with built-in SQLite support (the sqlite3 module).
More about SQLite (https://www.sqlite.org/index.html)
So instead of working on the CSV file directly, we load all of the data into an SQLite database and use SQL queries for further processing.
We take all the data in the CSV file and append it to the database ‘train.db’. We can then use this SQLite database for the rest of the pipeline.
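A rough sketch of the CSV-to-SQLite step using only the standard library. The function name `load_csv_into_sqlite`, the table name `data`, and the sample columns (`Id`, `Title`, `Body`, `Tags`) are assumptions for illustration; an in-memory database stands in for ‘train.db’.

```python
import csv
import io
import sqlite3

def load_csv_into_sqlite(csv_file, conn, table="data", batch_size=1000):
    """Append every CSV row to `table`, inserting in batches to limit memory use."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert_sql, batch)
    conn.commit()

# Hypothetical two-row sample; the real input would be the full CSV file.
sample_csv = io.StringIO(
    "Id,Title,Body,Tags\n"
    "1,How to sort?,<p>text</p>,python sorting\n"
    "2,Null pointer,<p>text</p>,java\n"
)
conn = sqlite3.connect(":memory:")  # use sqlite3.connect("train.db") for a file on disk
load_csv_into_sqlite(sample_csv, conn)

row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(row_count)
```

Later steps can then pull what they need with plain SQL, for example `SELECT Tags FROM data`.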

Logistic Regression: One vs Rest Classifier

We have very high-dimensional data, and in the binary representation we need to build many models. To tackle this, we use Logistic Regression with a One vs Rest classifier. The classifier trains one logistic regression model per label, 5500 models in total. Training a Logistic Regression model is cheap and easy compared to other models like Support Vector Machines (SVM), Random Forest, etc., and it performs really well on high-dimensional data.
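A toy sketch of the One vs Rest setup with scikit-learn. The six questions and their tags are made up, and TF-IDF features are an assumption, since the featurization is not described here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: question text and space-separated tags.
questions = [
    "how to sort a list in python",
    "python dictionary comprehension example",
    "java null pointer exception in main",
    "java interface versus abstract class",
    "sql join two tables on id",
    "run sql query from python script",
]
tags = ["python", "python", "java", "java", "sql", "python sql"]

# Features (TF-IDF is an assumption) and a binary tag matrix with one column per tag.
X = TfidfVectorizer().fit_transform(questions)
Y = CountVectorizer(
    tokenizer=lambda s: s.split(), token_pattern=None, binary=True
).fit_transform(tags).toarray()

# One independent logistic regression model is trained per tag column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

predictions = clf.predict(X)
print(len(clf.estimators_), predictions.shape)
```

With 5500 tag columns the same code would train 5500 binary models, which is why a cheap base model matters.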

Would love to contribute to this project. [GSSOC-21']

Can I start by collecting a dataset of "TAGS" for posts on StackOverflow?
I am a participant in gssoc21'.

Adding Contributor section in README (automatically)

I would like to add a feature that updates the contributor list in the README automatically. I will start working on this issue as soon as I get assigned.

Preprocessing using NLP topics

I would like to work on this issue and cleanse the data (text mining).

[Question]

Is this repository active?

Analysis of Tags

Tags are our class labels. Since we are trying to predict them, we should dig deep and understand them well. After removing all the duplicated data, we are left with 4.2 million data points and 42k unique tags.
The number of times each tag appears is worth understanding, so I counted the occurrences and stored them in a dictionary. If we observe the table below, the “.a” tag appeared in 18 questions, the “.app” tag appeared in 37 questions, and so on. Remember, a tag never appears twice in the same question.
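The counting step can be sketched with `collections.Counter`, which is a dictionary subclass. The tag strings below are hypothetical, and space-separated tags per question are an assumption.

```python
from collections import Counter

# Hypothetical tag strings, one per question.
tag_strings = ["python list", "java", "python pandas", "java spring"]

# Because a tag never repeats within a question, the count of a tag equals
# the number of questions it appears in.
tag_counts = Counter(tag for tags in tag_strings for tag in tags.split())

print(tag_counts["python"], len(tag_counts))
```

`tag_counts.most_common(n)` then gives the `n` most frequent tags with their counts.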

Cleaning and Preprocessing of Questions asked on Stack Overflow

Preprocess the questions: separate the code snippets from the body, remove special characters from the question title and description, and stem the words using a Python library.
Please assign this issue to me.

Data Preprocessing

Follow the steps below for further processing:
i. Sample 1M data points because of computing and memory limitations.
ii. Separate code snippets from the Body.
iii. Remove special characters from the question title and description (not in code).
iv. Remove stop words (except ‘C’).
v. Remove HTML tags using regular expressions.
vi. Convert all characters to lowercase.
vii. Use SnowballStemmer to stem the words.
Below are example questions after preprocessing.

Finally, create a new database called ‘Processed.db’ and load the preprocessed data into it.
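The steps above can be sketched roughly as follows. Several details are assumptions, not the project's actual code: the tiny `STOP_WORDS` set, treating "except ‘C’" as "drop one-letter tokens other than c", the `questions_processed` table name, and the step order (HTML tags are stripped before special characters so the tag regex still sees intact tags). If NLTK is not installed, the sketch falls back to no stemming.

```python
import re
import sqlite3

try:
    # SnowballStemmer is the stemmer named in the steps above.
    from nltk.stem.snowball import SnowballStemmer
    stem = SnowballStemmer("english").stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: no stemming.
    def stem(word):
        return word

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "am", "an", "and", "i", "in", "is", "of", "the", "this", "to"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)
HTML_TAG_RE = re.compile(r"<[^>]+>")

def preprocess(title, body):
    """Return (cleaned_text, code) for one question."""
    code = "\n".join(CODE_RE.findall(body))       # ii. pull code snippets out
    text = title + " " + CODE_RE.sub(" ", body)
    text = HTML_TAG_RE.sub(" ", text)             # v. strip HTML tags
    text = text.lower()                           # vi. lowercase
    text = re.sub(r"[^a-z0-9]+", " ", text)       # iii. drop special characters
    words = [
        stem(w)                                   # vii. stem
        for w in text.split()
        if w not in STOP_WORDS and (len(w) > 1 or w == "c")  # iv. keep 'c'
    ]
    return " ".join(words), code

title = "How to read a file in C?"
body = '<p>I am using <b>fopen</b> like this:</p><pre><code>FILE *f = fopen("a.txt", "r");</code></pre>'
clean_text, code = preprocess(title, body)

# Store the result; use sqlite3.connect("Processed.db") for a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions_processed (question TEXT, code TEXT)")
conn.execute("INSERT INTO questions_processed VALUES (?, ?)", (clean_text, code))
conn.commit()
stored = conn.execute("SELECT question, code FROM questions_processed").fetchone()
print(stored)
```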

Update README.md file

I have made some grammatical edits to the README.md file. Kindly refer to #32

Creating Contributing.md

Hi,
I would like to create the contribution.md file for this repo since it has not been created yet. The file will have all the necessary details one should know to contribute to this project.

Regards
Honey Bhardwaj
GSSOC participant

Contribution for GSSOC

Hi, I would love to contribute to this project during GSSOC 21. I have a couple ideas about how we can go about it. Could you tell me a bit more about the progress on the project and what sort of features you expect us to implement?

error in documentation

remove grammatical mistakes

Data Preparation

Remember, we have 1 million questions with 42k tags, and training on this amount of data would be slow and difficult. So I decided to consider a small subset of tags. Let’s say C = {42k tags} and C1 is a subset of C. To find the smallest useful subset C1, we can use the tag counts that we plotted earlier. We know how many times each tag occurs, so by taking the most frequently occurring tags we can cover the maximum number of questions. After trying many values, I found that the top 5500 most frequently occurring tags cover almost 99% of the questions.
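A small sketch of the coverage check described above. It assumes a question counts as "covered" when at least one of its tags is in the chosen subset; the function name `questions_covered` and the toy tag lists are hypothetical.

```python
from collections import Counter

def questions_covered(tag_lists, top_k):
    """Fraction of questions that have at least one tag among the top_k most frequent tags."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    top_tags = {tag for tag, _ in counts.most_common(top_k)}
    covered = sum(1 for tags in tag_lists if top_tags.intersection(tags))
    return covered / len(tag_lists)

# Hypothetical toy data: one list of tags per question.
tag_lists = [
    ["python", "pandas"],
    ["python"],
    ["java"],
    ["java", "spring"],
    ["haskell"],
]

for k in (1, 2, 5):
    print(k, questions_covered(tag_lists, k))
# 1 0.4
# 2 0.8
# 5 1.0
```

Running the same function over a range of `top_k` values shows where coverage flattens out, which is how a cutoff such as 5500 can be picked.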

Add Issue and PR Template

I would like to add issue and PR templates to this project. The issue template will explain the type of issue and its description; the PR template will cover the type of PR, the checklist it follows, and its description.

Kindly assign this to me as part of GirlScript Summer of Code. I would love to do it.

Multi-Label Text Classification

I can implement this using Multi-Label Text Classification. Please assign this to me.
