The provided dataset consists of 150 samples in total, split across two files, irirs_train.csv and irirs_test.csv, containing 130 and 20 samples, respectively. Because this is the Iris flower dataset, I assumed the following column names:
- Column 1: Sepal Length in cm
- Column 2: Sepal Width in cm
- Column 3: Petal Length in cm
- Column 4: Petal Width in cm
- Column 5: Species (Iris-Setosa, Iris-Versicolor or Iris-Virginica)
Findings
- Data cleaning and normalization: No missing or null values were found in the input dataset. I computed a min-max scaler to normalize the data before passing it to the algorithm.
- Correlation analysis: Outcomes of the correlation analysis:
  • Setosa petal lengths and widths are much smaller than those of Versicolor and Virginica.
  • There is a strong linear relationship between all the variables except sepal width, whose relationship is much weaker and negative. The table below identifies trends between variables: depending on the strength of the relationship, it assigns each pair of variables a number between -1 and 1.
  • Looking at the correlation table below, three variables (sepal length, petal length and petal width) have a strong linear relationship with species_id. These variables are likely to be strong predictors of the species of a given sample.
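
The cleaning, normalization and correlation steps described above could be sketched roughly as follows. This is a minimal illustration, not the exact code used for the report: it assumes pandas, uses the column names assumed earlier, and encodes the species as integers 0, 1, 2 in sorted order to obtain species_id (the report does not state the actual encoding). The helper names `min_max_normalize` and `correlation_with_species` are hypothetical, and the six rows are illustrative values only, not the contents of the provided files.

```python
import pandas as pd

# Column names as assumed in this report (the files are taken to have no header).
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
FEATURES = COLUMNS[:4]


def min_max_normalize(df, columns):
    """Rescale each listed column to [0, 1] using its own min and max."""
    out = df.copy()
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        # Guard against a constant column, which would divide by zero.
        out[col] = 0.0 if hi == lo else (df[col] - lo) / (hi - lo)
    return out


def correlation_with_species(df):
    """Pearson correlation matrix of the features plus an encoded species_id."""
    encoded = df.copy()
    # Assumed encoding: species labels mapped to 0, 1, 2 in sorted order.
    encoded["species_id"] = encoded["species"].astype("category").cat.codes
    return encoded[FEATURES + ["species_id"]].corr(method="pearson")


# Illustrative rows only. The real data would be loaded with something like
# pd.read_csv("irirs_train.csv", header=None, names=COLUMNS).
sample = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "Iris-Setosa"],
        [4.9, 3.0, 1.4, 0.2, "Iris-Setosa"],
        [7.0, 3.2, 4.7, 1.4, "Iris-Versicolor"],
        [6.4, 3.2, 4.5, 1.5, "Iris-Versicolor"],
        [6.3, 3.3, 6.0, 2.5, "Iris-Virginica"],
        [5.8, 2.7, 5.1, 1.9, "Iris-Virginica"],
    ],
    columns=COLUMNS,
)

missing = int(sample.isnull().sum().sum())   # total count of missing cells
normalized = min_max_normalize(sample, FEATURES)
corr = correlation_with_species(normalized)

print("missing values:", missing)
print(corr["species_id"].round(2))
```

Pearson correlation is unchanged by min-max scaling, so the correlation table is the same whether it is computed before or after normalization. In a train/test setup, the min and max would normally be taken from the training file and reused for the test file.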