K-Means clustering is applied to each of the 56 datasets with K ranging from 2 to 10, and the optimal number of clusters is selected using the Silhouette Coefficient and the Davies–Bouldin index. The scores obtained for the different K values are visualised using boxplots.
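
The K sweep described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the synthetic three-blob data, the helper name `sweep_k`, and the `n_init` and `seed` settings are assumptions made for the example and are not taken from the study.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def sweep_k(X, k_values=range(2, 11), seed=0):
    """Fit K-Means for each K and return {K: (silhouette, davies_bouldin)}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    return scores


# Hypothetical stand-in for one dataset: three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

scores = sweep_k(X)

# Silhouette Coefficient: higher is better. Davies-Bouldin index: lower is better.
best_by_silhouette = max(scores, key=lambda k: scores[k][0])
best_by_davies_bouldin = min(scores, key=lambda k: scores[k][1])
```

Repeating the sweep for every dataset and collecting the per-K scores gives the samples from which boxplots of the two indices could be drawn.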
The best features are selected using the Gini split, the rank-sum test, and PCA. The dimensionality of the datasets is thereby reduced for subsequent classification.
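
One way the three techniques might be applied is sketched below, assuming scikit-learn and SciPy. The exact procedures used in the study are not specified here, so several choices are assumptions: Gini importance from a decision tree stands in for the Gini split criterion, the Wilcoxon rank-sum test is run per feature between two classes, and the synthetic data and the choice of three retained features or components are purely illustrative.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary-labelled data with 10 features.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
n_keep = 3  # assumed number of features/components to retain

# Gini-based ranking: impurity decrease accumulated per feature in a tree.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
gini_importance = tree.feature_importances_
top_gini = np.argsort(gini_importance)[::-1][:n_keep]

# Rank-sum ranking: smaller p-value means the two classes differ more.
p_values = np.array(
    [ranksums(X[y == 0, j], X[y == 1, j]).pvalue for j in range(X.shape[1])]
)
top_ranksum = np.argsort(p_values)[:n_keep]

# PCA: project onto the leading principal components (unsupervised).
X_pca = PCA(n_components=n_keep).fit_transform(X)

# Reduced datasets for the downstream classifiers.
X_gini = X[:, top_gini]
X_ranksum = X[:, top_ranksum]
```

Note that the first two methods keep a subset of the original features, whereas PCA produces new features that are linear combinations of all the originals.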
For classification, three variants of the naïve Bayes classifier (Bernoulli, Multinomial and Gaussian) and two variants of the decision tree are applied to the data. Boxplots are used to compare the accuracy and F-measure of the classifier models.
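
The classifier comparison can be sketched as below, assuming scikit-learn. The two decision tree variants are not named in the text, so the Gini and entropy criteria are used here as an assumption. The synthetic data, the 5-fold cross-validation, and the shift that makes the features non-negative for the Multinomial model are likewise illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary-labelled data standing in for one reduced dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# MultinomialNB requires non-negative inputs, so shift each column to start at 0.
X_nonneg = X - X.min(axis=0)

# Each entry pairs a model with the feature matrix it is evaluated on.
models = {
    "BernoulliNB": (BernoulliNB(), X),  # binarises features at 0.0 by default
    "MultinomialNB": (MultinomialNB(), X_nonneg),
    "GaussianNB": (GaussianNB(), X),
    "DecisionTree (gini)": (
        DecisionTreeClassifier(criterion="gini", random_state=0), X),
    "DecisionTree (entropy)": (
        DecisionTreeClassifier(criterion="entropy", random_state=0), X),
}

# Per-fold accuracy and F-measure for every model.
results = {}
for name, (model, features) in models.items():
    cv = cross_validate(model, features, y, cv=5, scoring=("accuracy", "f1"))
    results[name] = (cv["test_accuracy"], cv["test_f1"])

for name, (acc, f1) in results.items():
    print(f"{name}: accuracy={acc.mean():.3f}, f-measure={f1.mean():.3f}")
```

The per-fold arrays in `results` (or the per-dataset scores, when the loop is repeated over all datasets) are the kind of samples a boxplot would summarise for each classifier.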