Hey, this is a starter repo for ML with basic datasets.
Added data preprocessing for the text and standard datasets.
Added the Regression folder to the repository.
From today it will contain outputs with graphs.

Data Preprocessing

Data preprocessing is a core subject in Machine Learning.
It is often said that about 80% of data science work is data preprocessing.
So let me introduce two types of data that need to be preprocessed into numbers.

Categorical Variable
There are mainly 2 types
- Ordinal
- Nominal

Ordinal

These are the ones which can be compared or ordered.
For example, we can order grade A and grade B, i.e.
Grade A > Grade B, since the marks in grade A are higher than in grade B.
They are handled by simply mapping each category to its rank.
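
As a minimal sketch of that mapping, assuming three hypothetical grades A, B and C (the names `GRADE_RANK` and `encode_ordinal` are made up for this example and are not part of the repo):

```python
# Hypothetical grade-to-rank mapping; a larger number means a better grade.
GRADE_RANK = {"C": 1, "B": 2, "A": 3}

def encode_ordinal(values, ranking):
    """Replace each ordered category with its integer rank."""
    return [ranking[v] for v in values]

grades = ["A", "C", "B", "A"]
encoded = encode_ordinal(grades, GRADE_RANK)
print(encoded)  # [3, 1, 2, 3]
```

Because the ranks keep the order, comparisons such as "A is better than B" still hold on the encoded numbers.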

Nominal

These are the ones which cannot be compared or ordered; all categories sit at the same level.
The best example is gender: we cannot rank one gender above another.
They are one-hot encoded, e.g. Male => [1, 0] and
Female => [0, 1], where the vector positions follow the category order [Male, Female].
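
One-hot encoding can be sketched in plain Python as below. The helper name `one_hot` is an assumption for illustration; the category order [Male, Female] is taken from the example above.

```python
def one_hot(values, categories):
    """Return one 0/1 vector per value, with a single 1 at the category's position."""
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return vectors

categories = ["Male", "Female"]
encoded = one_hot(["Male", "Female", "Female"], categories)
print(encoded)  # [[1, 0], [0, 1], [0, 1]]
```

No vector is "bigger" than another here, which is the point: the encoding does not invent an order the data does not have.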

Regression

In Statistics, Regression is defined as the method of
obtaining correlations or a mapping such that F(x) ~= Y,
i.e. an estimate for the general population.

But let's look at it from a simple human perspective,
not the statistical one. Let's say you are at a party
and a rather interesting game pops up: you need to guess
the number of balls in a jar without counting, opening, or in any way
touching the jar. (The closest number wins, and the winner gets a good gift.)
You need that gift.

So you think of a way to guess a number. The way
to do this is to take note of the previous guesses and make an estimate
from the mean of the people who came back with wrong answers.
This is basically sandwiching in on the right answer before you make your guess.
The activity you just performed makes you the winner. Why? Because of stats.
This activity is estimation, and that is what we do in regression.
Like the jar-winning method, we take a guess from the available information,
calculate how far off we are, then look at the next person (number)
with the closest guess and guess again until we reach the lowest error.
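
The "guess, measure the error, guess again" loop can be sketched with gradient descent on a straight line y = w*x + b. Gradient descent is one common way to do this and is an assumption here, not necessarily the method used in the Regression folder; the function name `fit_line`, the learning rate, and the toy data are also made up for the example.

```python
def fit_line(xs, ys, lr=0.02, steps=5000):
    """Fit y = w*x + b by repeatedly nudging w and b to reduce the mean squared error."""
    w, b = 0.0, 0.0  # first guess
    n = len(xs)
    for _ in range(steps):
        # How far off is the current guess for every point?
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Move the guess a small step in the direction that lowers the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2 and 1
```

Each pass through the loop plays the role of one more person at the party: the previous error tells you which way to move the next guess.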

NLP - Text Classification

It is a field of AI in which we try to process natural language (the languages we speak: Hindi, English, etc.) and draw the necessary insights from it, just like we do ourselves.

Preprocessing

As a computer does not understand natural language, we encode it into numbers. We do this with the help of a tokenizer. A more advanced form is a count vectorizer (for example, CountVectorizer in scikit-learn), which counts the tokens in each text and converts the text into a vector of numbers.
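
To show the idea of turning text into count vectors without any library, here is a minimal sketch. It splits on whitespace only, and the helper names `build_vocabulary` and `count_vector` are assumptions for this example; a real tokenizer handles many more cases.

```python
def build_vocabulary(documents):
    """Map every distinct lowercase word to a fixed column index."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def count_vector(document, vocab):
    """Count how often each vocabulary word occurs in the document."""
    vec = [0] * len(vocab)
    for word in document.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

docs = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocabulary(docs)  # columns: cat, dog, on, sat, the
vectors = [count_vector(d, vocab) for d in docs]
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```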

Preprocessing also includes removing stopwords (e.g. ['here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each']) and punctuation (e.g. '!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'), as both usually carry little importance when drawing insights from text.
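
A small sketch of this cleaning step, using only the stopwords listed above and Python's built-in `string.punctuation`. The function name `clean_text` is made up for the example, and a real stopword list would be much longer.

```python
import string

# Only the example stopwords from the text above; real lists are longer.
STOPWORDS = {"here", "there", "when", "where", "why", "how", "all", "any", "both", "each"}

def clean_text(text, stopwords=STOPWORDS):
    """Strip punctuation, lowercase the text, and drop stopwords."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in stopwords]

tokens = clean_text("Why, here are all the results!")
print(tokens)  # ['are', 'the', 'results']
```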

Further, if we want to know which words are important and which are not, we can use TF-IDF (short for term frequency–inverse document frequency), a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Python libraries provide implementations of it.
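
To make the statistic concrete, here is a sketch of the plain textbook formula: tf is the share of a document's words that are the given word, and idf is log(N / number of documents containing the word). Libraries often use smoothed variants, so their numbers may differ; the function name `tf_idf` is an assumption for this example.

```python
import math

def tf_idf(documents):
    """Return one {word: score} dict per document, using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

docs = ["the cat sat", "the dog barked"]
scores = tf_idf(docs)
# "the" appears in every document, so its score is 0; "cat" is specific to the first one.
print(scores[0]["the"], scores[0]["cat"] > 0)
```

A word that shows up everywhere gets a score of zero, which matches the intuition that it tells us nothing about any single document.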

Classifier

There are many classifiers, but the ones which are frequently used are:

1. Naive bayes classifier

Naive Bayes is a family of statistical algorithms we can make use of when doing text classification. One of the members of that family is Multinomial Naive Bayes (MNB). One of its main advantages is that you can get really good results when not much data is available (~ a couple of thousand tagged samples) and computational resources are scarce.

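All you need to know is that Naive Bayes is based on Bayes’s Theorem and treats the words in a text as independent of each other given the class (the "naive" assumption).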
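
For the curious, a tiny Multinomial Naive Bayes can be sketched in plain Python. This is an illustration under assumptions, not the code in this repo: the function names `train_nb` and `predict_nb`, the add-one (Laplace) smoothing, and the four toy reviews are all made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (text, label). Collect the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_nb(text, model):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the class
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train_nb(samples)
print(predict_nb("great acting", model))        # pos
print(predict_nb("awful boring movie", model))  # neg
```

Working in log probabilities avoids multiplying many small numbers together, which would otherwise underflow on longer texts.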

2. SVM Classifier

Support Vector Machines (SVM) is just one out of many algorithms we can choose from when doing text classification. Like Naive Bayes, SVM doesn’t need much training data to start providing accurate results. Although it needs more computational resources than Naive Bayes, SVM can often achieve more accurate results.

3. Deep Learning

Deep learning is a set of algorithms and techniques inspired by how the human brain works. The two main deep learning architectures used in text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

On the one hand, deep learning algorithms require much more training data than traditional machine learning algorithms, often up to millions of tagged examples. On the other hand, traditional machine learning algorithms such as SVM and NB reach a certain threshold where adding more training data doesn’t improve their accuracy. In contrast, deep learning classifiers tend to keep getting better the more data you feed them.
Word embedding methods such as Word2Vec or GloVe are also used to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms (transfer learning).
