Data Cleaning Process
The aim is to read the given data, perform data cleaning, and save the cleaned data to a file.
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Data cleaning is not simply about erasing data, but rather about finding a way to maximize a dataset's accuracy without necessarily deleting information.
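As a rough illustration of cleaning without deleting, the sketch below repairs formatting, removes only exact duplicates, and fills a missing value instead of dropping the row. The records, the column names name and score, and the choice of the mean as the fill value are all hypothetical and are not taken from the given data.

```python
import pandas as pd

# Hypothetical records: inconsistent formatting, a duplicate, and a missing score
raw = pd.DataFrame({
    "name": [" alice", "Bob ", "bob", "Carol"],
    "score": [90.0, 75.0, 75.0, None],
})

clean = raw.copy()
# Repair the formatting instead of discarding the rows
clean["name"] = clean["name"].str.strip().str.title()
# After normalization the two Bob rows are identical, so one of them is redundant
clean = clean.drop_duplicates().reset_index(drop=True)
# Fill the missing score with the column mean rather than dropping the row
clean["score"] = clean["score"].fillna(clean["score"].mean())

print(clean)
```

Only the redundant duplicate row is removed; every other record survives with corrected or filled values.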
STEP 1: Read the given Data
STEP 2: Get the information about the data
STEP 3: Handle the null values in the data (remove them or fill them in)
STEP 4: Save the cleaned data to a file
STEP 5: Remove outliers using the IQR
STEP 6: Use the z-score to remove outliers
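import pandas as pd
# Read the data and display it
df = pd.read_csv("/content/SAMPLEIDS.csv")
df
# First 7 rows and last 2 rows
print(df.head(7))
print(df.tail(2))
# Column types, non-null counts, and summary statistics
df.info()
print(df.describe())
# Number of missing values and of distinct values in each column
df.isnull().sum()
df.nunique()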
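# Fill missing TOTAL values with the column mean.
# Assigning the result back is used because fillna(..., inplace=True)
# on a selected column may not update df in recent pandas versions.
mn = df["TOTAL"].mean()
mn
df["TOTAL"] = df["TOTAL"].fillna(mn)
df
# Fill missing M4 values with the column minimum
# (named m4_min so that the built-in min() is not shadowed)
m4_min = df["M4"].min()
m4_min
df["M4"] = df["M4"].fillna(m4_min)
df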
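STEP 3 and STEP 4 mention removing null values and saving the cleaned data, so a minimal sketch of both is given here. It assumes a small hypothetical DataFrame in place of SAMPLEIDS.csv (only the column names TOTAL and M4 come from the code above), and the output file name cleaned_data.csv is also an assumption.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file; only the column names are from the text
df = pd.DataFrame({
    "M4": [40.0, None, 35.0, 48.0],
    "TOTAL": [180.0, 150.0, None, 195.0],
})

# Option 1: remove every row that contains a null value
dropped = df.dropna()

# Option 2: keep all rows and fill the gaps (mean for TOTAL, minimum for M4)
filled = df.fillna({"TOTAL": df["TOTAL"].mean(), "M4": df["M4"].min()})

# Save the cleaned data; index=False keeps the row index out of the file
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_data.csv")  # hypothetical name
filled.to_csv(out_path, index=False)

# Read the file back to confirm that no null values remain
check = pd.read_csv(out_path)
print(check.isnull().sum())
```

Dropping rows is simpler but loses the other values in those rows; filling keeps every row at the cost of introducing estimated values.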
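import pandas as pd
import seaborn as sns
# Sample ages that contain a few extreme values
age = [1,3,28,27,25,92,30,39,40,50,26,24,29,94]
af = pd.DataFrame(age)
af
# Plot the data before removing outliers
sns.boxplot(data=af)
sns.scatterplot(data=af)
# Quartiles and interquartile range (IQR)
q1 = af.quantile(0.25)
q2 = af.quantile(0.50)
q3 = af.quantile(0.75)
iqr = q3 - q1
iqr
# Lower and upper fences: values outside them are treated as outliers
low = q1 - 1.5 * iqr
low
high = q3 + 1.5 * iqr
high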
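# Values outside the fences become NaN
af = af[((af >= low) & (af <= high))]
af
# dropna() returns a new DataFrame, so assign the result back
af = af.dropna()
af
# Plot the data again after removing outliers
sns.boxplot(data=af)
sns.scatterplot(data=af)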
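The IQR steps above can also be wrapped in a small reusable function. The sketch below is one possible way to do it on a pandas Series; the function name remove_outliers_iqr and the parameter k are hypothetical, and the plots are left out so that it runs without seaborn.

```python
import pandas as pd


def remove_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return values[values.between(low, high)]


age = pd.Series([1, 3, 28, 27, 25, 92, 30, 39, 40, 50, 26, 24, 29, 94])
kept = remove_outliers_iqr(age)
print(kept.tolist())
```

For this sample the quartiles are 25.25 and 39.75, so the fences are 3.5 and 61.5, and the values 1, 3, 92, and 94 are removed.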
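import numpy as np
from scipy import stats
# Sample data for the z-score method
data = [1,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96,99,102,105]
df = pd.DataFrame(data)
df
# Absolute z-score of every value
z = np.abs(stats.zscore(df))
z
# Keep rows whose |z| is below 3 (a commonly used cutoff, assumed here)
df = df[(z < 3).all(axis=1)]
df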
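To see how a z-score cutoff behaves on this sample, the sketch below computes the z-scores with pandas and NumPy only and keeps the values below a threshold. The function name remove_outliers_zscore, the default threshold of 3, and the extra value 1000 are assumptions made for illustration; the population standard deviation (ddof=0) is used because that is the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd


def remove_outliers_zscore(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Keep the values whose absolute z-score is below the threshold."""
    # Population standard deviation (ddof=0), matching scipy.stats.zscore's default
    z = (values - values.mean()) / values.std(ddof=0)
    return values[np.abs(z) < threshold]


# Same values as the sample list: 1 followed by 12, 15, ..., 105
data = pd.Series([1] + list(range(12, 106, 3)))
kept = remove_outliers_zscore(data)
print(len(data), len(kept))

# Hypothetical extreme value, added only to show the filter removing something
with_extreme = pd.concat([data, pd.Series([1000])], ignore_index=True)
kept_extreme = remove_outliers_zscore(with_extreme)
print(1000 in kept_extreme.values)
```

With a threshold of 3, no value in the sample list is far enough from the mean to be removed, while the hypothetical value 1000 is filtered out.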
Thus the given program executed successfully.