ex-no-10-ds's Introduction

EX-NO-10-DATA SCIENCE PROCESS ON COMPLEX DATASET

AIM:

To perform the data science process on a complex dataset and save the data to a file.

ALGORITHM:

STEP-1:

Read the given Data.

STEP-2:

Clean the Data Set using Data Cleaning Process.

STEP-3:

Apply Feature Generation/Feature Selection Techniques on the data set.

STEP-4:

Apply EDA/data visualization techniques to all the features of the dataset.

CODE:

Data Cleaning Process:

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from google.colab import files

uploaded = files.upload()

df = pd.read_csv("Air Quality.csv")

df.head(10)

df.info()

df.describe()

df.isnull().sum()
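
The `df.isnull().sum()` call only counts the missing values in each column. A minimal sketch of one way to act on those counts is shown here on a small made-up frame. The column names mirror the dataset, but the values and the choice of median/mode imputation are assumptions for illustration, not the method used in this exercise.

```python
import numpy as np
import pandas as pd

# Toy data standing in for "Air Quality.csv"; all values are invented.
toy = pd.DataFrame({
    "Country": ["India", None, "Brazil", "India"],
    "AQI Value": [51.0, 41.0, np.nan, 66.0],
    "PM2.5 AQI Value": [51.0, np.nan, 41.0, 66.0],
})

missing_before = toy.isnull().sum()

cleaned = toy.copy()
# Numeric columns: fill gaps with the column median (less sensitive to outliers than the mean).
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())
# Non-numeric columns: fill gaps with the most frequent value.
for col in cleaned.select_dtypes(exclude="number").columns:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

missing_after = cleaned.isnull().sum()
print(missing_before, missing_after, sep="\n\n")
```

Dropping incomplete rows with `dropna()` is an equally valid choice when only a few rows are affected.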

Handling Outliers:

q1=df['PM2.5 AQI Value'].quantile(0.25)

q3=df['PM2.5 AQI Value'].quantile(0.75)

IQR=q3-q1

print("First quantile:",q1," Third quantile:",q3," IQR: ",IQR,"\n")

lower=q1-1.5*IQR

upper=q3+1.5*IQR

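# Keep only the rows that fall inside the IQR fences (the non-outliers)
inliers=df[(df['PM2.5 AQI Value']>=lower)&(df['PM2.5 AQI Value']<=upper)]

from scipy.stats import zscore

# Use the absolute z-score so that extreme low values are dropped as well
z=inliers[np.abs(zscore(inliers['PM2.5 AQI Value']))<3]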

print("Cleaned Data: \n")

print(z)
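
The IQR rule used here can be wrapped in a small reusable helper. The sketch below is illustrative only: `iqr_filter` is a hypothetical name, and the sample values are invented so that one obvious outlier is present.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return only the rows whose `column` value lies inside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the >= / <= comparison.
    return frame[frame[column].between(lower, upper)]

# Invented sample: seven ordinary readings and one extreme value.
toy = pd.DataFrame({"PM2.5 AQI Value": [40, 42, 45, 47, 50, 52, 55, 500]})
kept = iqr_filter(toy, "PM2.5 AQI Value")
print(len(toy), "rows before,", len(kept), "rows after")
```

The multiplier `k=1.5` is the conventional default; a larger value keeps more of the tail.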

EDA Techniques:

df.skew(numeric_only=True)

df.kurtosis(numeric_only=True)

sns.boxplot(x="Ozone AQI Value",data=df)

sns.countplot(x="AQI Value",data=df)

sns.histplot(df["AQI Value"],kde=True)  # distplot is deprecated in seaborn; histplot with kde=True is the replacement

sns.histplot(df["NO2 AQI Value"])

sns.displot(df["CO AQI Value"])

sns.scatterplot(x=df['AQI Value'],y=df['NO2 AQI Value'])

states=df.loc[:,["AQI Category","AQI Value"]]

states=states.groupby(by=["AQI Category"]).sum().sort_values(by="AQI Value")

plt.figure(figsize=(17,7))

sns.barplot(x=states.index,y="AQI Value",data=states)

plt.xlabel("AQI Category")

plt.ylabel("AQI Value")

plt.show()

df.corr(numeric_only=True)

sns.heatmap(df.corr(numeric_only=True),annot=True)
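
Skewness and correlation are only defined for numeric columns, and recent pandas releases raise an error when text columns such as the category labels are included. A minimal sketch of selecting the numeric columns first, on invented data:

```python
import pandas as pd

# Invented rows; the column names follow the dataset used in this exercise.
toy = pd.DataFrame({
    "AQI Category": ["Good", "Moderate", "Good", "Unhealthy"],
    "AQI Value": [30, 75, 45, 160],
    "NO2 AQI Value": [1, 4, 2, 9],
})

# Keep only the numeric columns before computing the statistics.
numeric = toy.select_dtypes(include="number")
skewness = numeric.skew()
corr = numeric.corr()

print(skewness)
print(corr)
```

The resulting `corr` frame can be passed directly to `sns.heatmap`.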

Feature Generation:

from sklearn.preprocessing import LabelEncoder,OrdinalEncoder

from sklearn.preprocessing import OneHotEncoder

le=LabelEncoder()

df['AQI']=le.fit_transform(df['AQI Value'])

df

AQI=['Good','Moderate','Unhealthy for Sensitive Groups','Unhealthy','Very Unhealthy','Hazardous']  # ordered from least to most severe

enc=OrdinalEncoder(categories=[AQI])

enc.fit_transform(df[['AQI Category']])

df['AQI CATEGORY']=enc.fit_transform(df[['AQI Category']])

df

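ohe=OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases use sparse=False

df1=df.copy()  # df1 is not defined earlier in the listing; assumed here to be a working copy of df

enc=pd.DataFrame(ohe.fit_transform(df1[['CO AQI Category']]))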

df1=pd.concat([df1,enc],axis=1)

df1
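
`OrdinalEncoder` assigns integer codes in the order the categories are listed, so the list order determines what the codes mean. A small sketch with invented rows, assuming the severity order of the US EPA AQI scale (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed severity order, from least to most severe.
order = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
         "Unhealthy", "Very Unhealthy", "Hazardous"]

# Invented rows for illustration.
toy = pd.DataFrame({"AQI Category": ["Moderate", "Good", "Hazardous", "Unhealthy"]})

encoder = OrdinalEncoder(categories=[order])
# fit_transform returns a 2-D array; ravel() flattens it into a single column.
toy["AQI Category Code"] = encoder.fit_transform(toy[["AQI Category"]]).ravel()
print(toy)
```

With this ordering a larger code always means worse air quality, which is what makes the encoded column usable as an ordinal feature.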

Feature Transformation:

import statsmodels.api as sm

import scipy.stats as stats

from sklearn.preprocessing import QuantileTransformer

from sklearn.preprocessing import PowerTransformer

sm.qqplot(df1['AQI Value'],fit=True,line='45')

plt.show()

transformer=PowerTransformer("yeo-johnson")

df1['NO2 AQI Value']=pd.DataFrame(transformer.fit_transform(df1[['NO2 AQI Value']]))

sm.qqplot(df1['NO2 AQI Value'],line='45')

plt.show()

qt=QuantileTransformer(output_distribution='normal')

df1['AQI Value']=pd.DataFrame(qt.fit_transform(df1[['AQI Value']]))

sm.qqplot(df1['AQI Value'],line='45')

plt.show()
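
Besides the Q-Q plots, the effect of a transformation can be checked numerically by comparing skewness before and after. The sketch below does this for the Yeo-Johnson transform on synthetic right-skewed data; the exponential sample is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample standing in for a pollutant column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"NO2 AQI Value": rng.exponential(scale=5.0, size=500)})

pt = PowerTransformer(method="yeo-johnson")
# fit_transform expects a 2-D input and returns a 2-D array.
transformed = pt.fit_transform(toy[["NO2 AQI Value"]]).ravel()

skew_before = toy["NO2 AQI Value"].skew()
skew_after = pd.Series(transformed).skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero after the transform suggests the distribution has become more symmetric.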

Data Visualization:

sns.barplot(x="CO AQI Category",y="CO AQI Value",data=df1)

plt.xticks(rotation = 90)

plt.show()

sns.lineplot(x="CO AQI Value",y="NO2 AQI Category",data=df1,hue="AQI Category",style="AQI Category")

sns.scatterplot(x="AQI Value",y="NO2 AQI Value",hue="AQI Category",data=df1)

sns.relplot(data=df1,x="CO AQI Category",y="CO AQI Value",hue="CO AQI Category")

sns.histplot(data=df1, x="PM2.5 AQI Value", hue="PM2.5 AQI Category",element="step", stat="density")
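
The aim also calls for saving the processed data to a file, a step the listing does not show. A minimal sketch using `DataFrame.to_csv`, where the output file name and the sample rows are assumptions:

```python
import os
import tempfile
import pandas as pd

# Invented rows standing in for the processed frame (df1 in the listing).
toy = pd.DataFrame({
    "AQI Value": [51, 41, 66],
    "AQI Category": ["Moderate", "Good", "Moderate"],
})

# Hypothetical output location; in Colab a plain file name would work as well.
out_path = os.path.join(tempfile.gettempdir(), "air_quality_cleaned.csv")

# index=False keeps the row index out of the file.
toy.to_csv(out_path, index=False)

# Read the file back to confirm that the contents survived the round trip.
restored = pd.read_csv(out_path)
print(restored)
```

In Google Colab, `files.download(out_path)` from `google.colab` can then be used to copy the file to the local machine.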

OUTPUT:

Data Cleaning Process:

Screenshot (92)

Screenshot (94)

Screenshot (93)

Screenshot (95)

Handling Outliers:

Screenshot (96)

EDA Techniques:

Screenshot (98)

Screenshot (99)

Screenshot (100)

Screenshot (101)

Screenshot (102)

Screenshot (103)

Screenshot (104)

Screenshot (105)

Screenshot (106)

Screenshot (107)

Screenshot (108)

Feature Generation:

Screenshot (109)

Screenshot (110)

Screenshot (111)

Feature Transformation:

Screenshot (112)

Screenshot (113)

Screenshot (114)

Data Visualization:

Screenshot (115)

Screenshot (116)

Screenshot (117)

Screenshot (119)

Screenshot (118)

RESULT:

Thus, the data science process was performed on a complex dataset and the output was verified successfully.
