The dataset contains three classes (Iris-setosa, Iris-versicolor, Iris-virginica). Samples are classified based on the length and width of their sepals and petals.
We convert the Species labels to integer values for classification.
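A minimal sketch of this label conversion is shown below. The integer codes (0, 1, 2) and the names `SPECIES_TO_INT` and `encode_species` are assumptions made for illustration; the project may use a different mapping.

```python
# Hypothetical mapping from species name to integer code.
# The specific codes are an assumption, not taken from the project.
SPECIES_TO_INT = {
    "Iris-setosa": 0,
    "Iris-versicolor": 1,
    "Iris-virginica": 2,
}


def encode_species(labels):
    """Replace each species name with its integer code."""
    return [SPECIES_TO_INT[name] for name in labels]


# Small illustrative input, not rows from the actual dataset file.
species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
encoded = encode_species(species)
print(encoded)
```

With Pandas, the same dictionary could be passed to `Series.map` on the Species column.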
- Pandas
- NumPy
- Matplotlib
- Seaborn
- scikit-learn
We use Euclidean distance to find a sample's nearest neighbours, then classify the sample according to the classes of the points at the smallest distances.
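The idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the majority vote among the `k` closest points, and the toy measurements are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label.
    scored = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k smallest distances.
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    # The most common label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): sepal length, sepal width,
# petal length, petal width, with integer class labels.
train_X = [
    (5.0, 3.4, 1.5, 0.2), (4.9, 3.1, 1.4, 0.2),
    (6.0, 2.8, 4.5, 1.4), (5.8, 2.7, 4.1, 1.2),
    (6.7, 3.1, 5.6, 2.2), (6.4, 3.0, 5.5, 2.0),
]
train_y = [0, 0, 1, 1, 2, 2]

print(knn_predict(train_X, train_y, (5.1, 3.5, 1.4, 0.2), k=3))
```

In scikit-learn, `KNeighborsClassifier` provides this behaviour, with the number of neighbours set through its `n_neighbors` parameter.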
- We start at the tree root and split the data on the feature that results in the largest information gain (IG).
- We can then repeat this splitting procedure at each child node until the leaves are pure. This means that the samples at each leaf node all belong to the same class.
- We may set a limit on the depth of the tree to prevent overfitting. We compromise on purity here somewhat as the final leaves may still have some impurity.
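The split selection described above can be sketched as follows. The sketch assumes entropy as the impurity measure and binary threshold splits on a single numeric feature. The text does not specify either, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (gain, threshold)."""
    best = (0.0, None)
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        gain = information_gain(labels, left, right)
        if gain > best[0]:
            best = (gain, threshold)
    return best


# Toy petal lengths and labels, made up for illustration.
petal_length = [1.4, 1.5, 1.3, 4.5, 4.1, 5.6]
labels = [0, 0, 0, 1, 1, 2]

gain, threshold = best_threshold(petal_length, labels)
print(gain, threshold)
```

A full tree would apply `best_threshold` to every feature at each node, split on the winner, and recurse until the leaves are pure or a depth limit is reached. In scikit-learn, `DecisionTreeClassifier` exposes these choices through its `criterion` and `max_depth` parameters.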
Using n_neighbors = 11 in our model, we get an accuracy of 1.0 and an F1 score of 0.93.
Confusion Matrix
Using default values for our model, we get an accuracy of 0.96 and an F1 score of 0.97.
Confusion Matrix
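For reference, the sketch below shows how a confusion matrix, accuracy, and F1 score relate to one another. It assumes macro-averaged F1, since the averaging used for the scores above is not stated, and the labels are made up for illustration; they are not the project's predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(matrix[c]) - tp                       # truly c, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print(accuracy(cm), macro_f1(cm))
```

scikit-learn offers `confusion_matrix`, `accuracy_score`, and `f1_score` in `sklearn.metrics` for the same purpose.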