stack-overflow-tag-predictions's Issues

Train and Test Split

Divide the 1 million data points randomly into a train set and a test set (roughly 80/20). This gives (799999, 5500) points as train data and (200000, 5500) points as test data.
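
A minimal sketch of such a random split, using only the Python standard library. The function name `random_split`, the 20% test fraction, and the fixed seed are assumptions for illustration; the issue does not say how the split was actually produced.

```python
import math
import random

def random_split(n_points, test_fraction=0.2, seed=42):
    """Randomly partition the indices 0..n_points-1 into (train, test) lists."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = math.ceil(test_fraction * n_points)  # round the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = random_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

With 999,999 points and a 20% test fraction, this rounding rule yields 799,999 train and 200,000 test points, which matches the shapes quoted above.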

Creating Code of Conduct

Hi,
I would like to create CODEOFCONDUCT.md for this repo since it has not been created yet.
Regards,
Honey Bhardwaj

Data Visualisation of Dataset

I would like to do data visualisation of the dataset to get more insights about the data we are going to work with.
Do assign this to me if possible 😀

Revamping README.md for better documentation

Greetings, as a contributor for GSSOC'21,
I would like to revamp the README.md documentation for clarity, with better information and grammar.
What my contribution will focus on:

  • properly structured format for easy navigation (links and flow of interaction)
  • add dynamic visualization to the readme (gifs & image media)
  • better links presentation
  • categorized heading and sub-heading for clarity
  • contributors highlight
  • installation guides (how to's, do's & don'ts)
  • open-source contribution procedures (forking, committing, branch checkouts, etc)

I will get started once this issue (#30) is assigned.

Multi-Label Classification

Algorithms such as Logistic Regression, Support Vector Machines, and Random Forest solve binary and multi-class classification problems. To apply them to the present multi-label problem, we need to convert it into binary or multi-class classification problems.
Imagine we have three questions with tags, and the total set of labels is 4 (t1, t2, t3, t4). We can then represent the tags of each question as a vector of size 4.
Binary Vector Representation
We can use CountVectorizer to convert all our tags into binary vectors.

Data loading Using SQLite DB

For further processing, we used SQLite, one of the simplest and most widely used databases. It is powerful, lightweight, and open source. The best part is that Python ships with SQLite support built in (the sqlite3 module).
More about SQLite: https://www.sqlite.org/index.html
So instead of working with the CSV file directly, we load all of the data into a SQLite database and use SQL queries for further processing.
We took all the data in the CSV file and appended it to the database ‘train.db’. The resulting SQLite database is used for all further processing.
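
The loading step can be sketched with the standard-library `csv` and `sqlite3` modules. The column names (Id, Title, Body, Tags), the table name `data`, and the two sample rows are assumptions for illustration; an in-memory database stands in for ‘train.db’ so the sketch is self-contained.

```python
import csv
import io
import sqlite3

# Stand-in for the real CSV file; the column names are an assumption.
csv_file = io.StringIO(
    "Id,Title,Body,Tags\n"
    '1,"How to sort a list?","<p>Example body</p>","python list"\n'
    '2,"What is a pointer?","<p>Another body</p>","c pointers"\n'
)

# ":memory:" keeps the sketch self-contained; pass "train.db" to write to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Id INTEGER, Title TEXT, Body TEXT, Tags TEXT)")

# Stream rows from the CSV into the table without holding them all in memory.
reader = csv.DictReader(csv_file)
rows = ((r["Id"], r["Title"], r["Body"], r["Tags"]) for r in reader)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
conn.commit()

# From here on, SQL queries replace scans over the CSV file.
n_rows = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(n_rows)  # 2
```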

Logistic Regression: One vs Rest Classifier

We have very high-dimensional data, and the binary representation requires building many models. To tackle this, we use Logistic Regression with a One-vs-Rest classifier. The classifier trains one logistic regression model per label, 5500 models in total. Training a Logistic Regression model is cheap and simple compared with models such as Support Vector Machines (SVM) or Random Forest, and it performs well on high-dimensional data.
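
A toy sketch of the One-vs-Rest setup using scikit-learn. The data below is made up (6 questions, 3 features, 3 labels), and default `LogisticRegression` settings are an assumption; the real problem would use sparse text features and 5500 label columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 questions, 3 features, 3 labels (all values are made up).
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [0, 1, 1], [1, 0, 1]])
Y = X.copy()  # label j is set exactly when feature j is present

# One independent binary logistic regression is fitted per label column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

print(len(clf.estimators_))  # one fitted model per label
print(clf.predict(X).shape)  # one binary prediction per (question, label)
```

Because each label gets its own binary model, the training cost grows linearly with the number of labels, which is why a cheap base model matters here.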

Analysis of Tags

Tags are our class labels. Since we are trying to predict them, we should dive deep and understand them well. After removing all duplicated data we are left with 4.2 million data points and 42k unique tags.
The number of times each tag appears is worth understanding, so I counted the occurrences and stored them in a dictionary. In the table below, the “.a” tag appears in 18 questions, the “.app” tag appears in 37 questions, and so on. Remember, a tag never appears twice in the same question.
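
A minimal sketch of the counting step with `collections.Counter`. The tag strings are hypothetical, and storing tags as space-separated strings is an assumption.

```python
from collections import Counter

# Hypothetical tag strings, one per question, tags separated by spaces.
question_tags = ["c# .net", "java", "c# linq", ".net"]

tag_counts = Counter()
for tags in question_tags:
    # A tag never repeats within one question, so each occurrence
    # corresponds to exactly one question.
    tag_counts.update(tags.split())

print(dict(tag_counts))
```

Since tags do not repeat within a question, the count for a tag equals the number of questions that carry it.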

Data Preprocessing

The following steps were applied to preprocess the data:
i. Sampled 1M data points because of computing and memory limitations.
ii. Separated code snippets from the Body.
iii. Removed special characters from the question title and description (not from the code).
iv. Removed stop words (except ‘C’).
v. Removed HTML tags using regular expressions.
vi. Converted all characters to lowercase.
vii. Used SnowballStemmer to stem the words.
Below we can find the example questions after preprocessing.

Finally, we created a new database called ‘Processed.db’ and loaded the preprocessed data into it.
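
Steps ii to vi can be sketched with the standard library alone. The function name `preprocess`, the tiny stop-word list, the assumption that code sits inside `<code>` tags, and the exact regular expressions are all illustrative choices, not the project's actual code. Stemming (step vii) is left out because it needs a third-party stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a full list).
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "it", "i", "how", "but"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def preprocess(title, body):
    """Return (cleaned_text, code_snippets) for one question."""
    # ii. Separate code snippets from the body.
    code_snippets = CODE_RE.findall(body)
    body_without_code = CODE_RE.sub(" ", body)
    # v. Remove the remaining HTML tags with a regular expression.
    text = TAG_RE.sub(" ", title + " " + body_without_code)
    # vi. Lowercase everything.
    text = text.lower()
    # iii. Replace special characters with spaces.
    text = re.sub(r"[^a-z0-9]", " ", text)
    # iv. Drop stop words, but always keep 'c' (the language name).
    words = [w for w in text.split() if w == "c" or w not in STOP_WORDS]
    return " ".join(words), code_snippets

cleaned, code = preprocess(
    "How to read a file in C?",
    '<p>I tried <code>fopen("a.txt")</code> but it fails.</p>',
)
print(cleaned)  # read file c tried fails
print(code)     # ['fopen("a.txt")']
```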

Creating Contributing.md

Hi,
I would like to create the contribution.md file for this repo since it has not been created yet. The file will have all the necessary details one should know to contribute to this project.

Regards
Honey Bhardwaj
GSSOC participant

Contribution for GSSOC

Hi, I would love to contribute to this project during GSSOC 21. I have a couple of ideas about how we can go about it. Could you tell me a bit more about the progress on the project and what sort of features you expect us to implement?

Data Preparation

Remember that we have 1 million questions with 42k tags, and training on this many labels would be very expensive. So I decided to consider a smaller subset of tags. Let C be the set of all 42k tags and C1 a subset of C. To find a small subset C1, we can use the tag counts that we plotted earlier. Since we know how often each tag occurs, taking the most frequently occurring tags lets us cover the maximum number of questions. After trying many values, I found that the top 5500 most frequently occurring tags cover almost 99% of the questions.
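
The coverage check can be sketched as follows. Here a question counts as covered when at least one of its tags is among the k most frequent tags; that definition, the function name `coverage`, and the sample data are assumptions for illustration.

```python
from collections import Counter

def coverage(question_tags, k):
    """Fraction of questions that keep at least one tag among the k most frequent tags."""
    counts = Counter(tag for tags in question_tags for tag in tags)
    top_tags = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in question_tags if top_tags.intersection(tags))
    return covered / len(question_tags)

# Hypothetical data: one list of tags per question.
questions = [["java", "spring"], ["python"], ["java"], ["c#"], ["python", "pandas"]]
for k in (2, 5):
    print(k, coverage(questions, k))
```

Sweeping k over a range of values and reading off where the coverage flattens is one way to pick a cutoff such as 5500.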

Add Issue and PR Template

I would like to add issue and PR templates to this project. The issue template will cover the type of issue and its description; the PR template will cover the type of PR, the checklist it satisfies, and its description.

Kindly assign this to me as a part of GirlScript Summer of Code. I would love to do it.
