machine learning self-study programs
SONAR Rock vs. Mine Prediction with Python: uses a logistic regression model to classify each input as either Rock or Mine.
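A minimal sketch of the idea, not the project's actual code: logistic regression trained by plain gradient descent on a tiny made-up dataset. The two-feature toy values, the function names, and the learning rate and epoch count are all assumptions for illustration. The real SONAR data has many more features per sample and would normally be handled with a library.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear output
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    # Probability >= 0.5 is labelled Mine ("M"), otherwise Rock ("R").
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "M" if p >= 0.5 else "R"

# Made-up stand-in for sonar readings (hypothetical values, 2 features only).
X = [[0.10, 0.20], [0.15, 0.10], [0.20, 0.25],
     [0.80, 0.90], [0.85, 0.70], [0.90, 0.95]]
y = [0, 0, 0, 1, 1, 1]  # 0 = Rock, 1 = Mine

w, b = train_logistic_regression(X, y)
print(predict(w, b, [0.12, 0.18]))
print(predict(w, b, [0.88, 0.80]))
```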
Hierarchical clustering: clusters have a tree-like structure, or a parent-child relationship.
- Agglomerative: bottom-up approach. Begin with each element as a separate cluster and merge them into successively larger clusters.
- Divisive: top-down approach. Begin with the whole set and divide it into successively smaller clusters.
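Partitional clustering: objects are divided directly into a set of clusters rather than arranged in a tree.
- K-Means: division of objects into clusters such that each object is in exactly one cluster, not several.
- Fuzzy C-Means: division of objects into clusters such that each object can belong to multiple clusters.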
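The bottom-up (agglomerative) merging described above can be sketched as follows. This is only an illustration under stated assumptions: points are one-dimensional, the distance between clusters is single linkage (the closest pair of points), and the function names are made up for this note.

```python
def single_linkage(c1, c2):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (assumes n_clusters >= 1)."""
    # Bottom-up: start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        # Find the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

points = [1.0, 1.2, 5.0, 5.3, 9.8, 10.0]
result = agglomerative(points, n_clusters=3)
print(result)
```

A divisive version would run in the opposite direction: start from one cluster holding all points and split repeatedly.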
Distance measure: the distance measure determines the similarity between two elements and influences the shape of the clusters.
- Euclidean distance measure: the ordinary straight-line distance between two points in Euclidean space.
- Squared Euclidean distance measure: this metric uses the same equation as the Euclidean distance but does not take the square root.
- Manhattan distance measure: the simple sum of the horizontal and vertical components, that is, the distance between two points measured along axes at right angles.
- Cosine distance measure: the cosine similarity measures the angle between the two vectors.
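The four measures above can be written out in a few lines of Python. This is a sketch with made-up function names. The cosine distance is taken here as 1 minus the cosine similarity, which is one common convention and an assumption of this example. It is undefined for zero vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same sum as Euclidean, without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on the angle between a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0 (perpendicular vectors)
```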
Decision Tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
- Classification: a classification tree determines a set of logical if-then conditions to classify the input. For example, discriminating between three types of flowers based on certain features.
- Regression: a regression tree is used when the target variable is numerical or continuous in nature. A regression model is fit to the target variable using each of the independent variables, and each split is made based on the sum of squared errors.
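To make the "split based on the sum of squared errors" idea concrete, here is a rough sketch for a single numeric feature. The function names and the toy data are assumptions for illustration. A real regression tree repeats this search over every feature and recurses on each side of the split.

```python
def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the split x <= threshold with the lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal feature values
        # Candidate threshold: midpoint between neighbouring feature values.
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Made-up data: the target jumps from 5 to 20 between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, total = best_split(xs, ys)
print(threshold, total)
```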
Advantages of Decision Tree
- Simple to understand, interpret, and visualize
- Little effort required for data preparation
- Can handle both numerical and categorical data
- Nonlinear parameters don't affect its performance
Disadvantages of Decision Tree
- Overfitting: occurs when the algorithm captures noise in the data.
- High variance: the model can become unstable due to small variations in the data.
- Low bias: a highly complicated decision tree tends to have a low bias, which makes it difficult for the model to work with new data.
Advantages of Random Forest
- Reduced overfitting:
  - Use of multiple trees reduces the risk of overfitting
  - Training time is less
- High accuracy:
  - Runs efficiently on large databases
  - For large data, it produces highly accurate predictions
- Estimates missing data:
  - Random Forest can maintain accuracy when a large proportion of data is missing
Random Forest (or random decision forest) is a method that operates by constructing multiple decision trees during the training phase. The decision of the majority of the trees is chosen by the random forest as the final decision.
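The majority-vote step can be sketched as below. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and `forest_predict` is a made-up name. Building the trees is not shown here.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class predicted by the majority of the trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees: each is one simple rule.
trees = [
    lambda x: "Mine" if x[0] > 0.5 else "Rock",
    lambda x: "Mine" if x[1] > 0.4 else "Rock",
    lambda x: "Mine" if x[0] + x[1] > 1.2 else "Rock",
]

sample = [0.7, 0.3]
# The first tree votes Mine, the other two vote Rock, so the forest says Rock.
print(forest_predict(trees, sample))
```

Using an odd number of trees avoids ties in a two-class problem.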
Decision Tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
Entropy: the measure of randomness or unpredictability in the dataset.
Information gain: the measure of the decrease in entropy after the dataset is split.
Leaf node: carries the classification or the decision.
Decision node: has two or more branches.
Root node: the topmost decision node.
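Entropy and information gain as defined above can be computed directly. A small sketch, assuming Shannon entropy in bits (log base 2) and class labels given as a plain list. The function names and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["Rock", "Rock", "Mine", "Mine"]
split = [["Rock", "Rock"], ["Mine", "Mine"]]  # a split that separates the classes perfectly

print(entropy(parent))                  # 1.0 (evenly mixed, maximum for two classes)
print(information_gain(parent, split))  # 1.0 (all of the uncertainty is removed)
```

A decision node would pick the split with the highest information gain among the candidates.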