
Dealing with Categorical Variables

Introduction

We now understand the intuition behind multiple linear regression. Great! However, as we start digging into bigger data sets with more predictors, we'll come across predictors that are slightly different from what we've seen before. Welcome to the wondrous world of categorical variables!

Objectives

You will be able to:

  • Understand what categorical variables are
  • Understand the need to create dummy variables for categorical predictors
  • Use Pandas and Scikit-Learn to create dummy variables

The auto-mpg data

In this section, we'll use the auto-mpg data to illustrate several elements of multiple linear regression. The auto-mpg data set contains technical specifications of cars and is often used by aspiring data scientists who want to practice linear regression with multiple predictors. Generally, the mpg column (for "miles per gallon") is the dependent variable, and what we want to know is how the other columns (the "predictors") in the data set affect mpg. Let's have a look at the data:

import pandas as pd

data = pd.read_csv("auto-mpg.csv")
data['horsepower'] = data['horsepower'].astype(str).astype(int) # don't worry about this for now
data.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0         140    3449          10.5          70       1                ford torino
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null int64
weight          392 non-null int64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 27.6+ KB

Except for "car name", every other column seems to be a candidate predictor for miles per gallon.

What are categorical variables?

Now let's take a closer look at the column "origin".

print(data["origin"].describe())
count    392.000000
mean       1.576531
std        0.805518
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        3.000000
Name: origin, dtype: float64
print(data["origin"].nunique())
3

Values range from 1 to 3; in fact, the only values that occur in the data set are 1, 2 and 3! It turns out that "origin" is a so-called categorical variable. It does not represent a continuous quantity but refers to a location: say, 1 may stand for the US, 2 for Europe, and 3 for Asia (note: for this data set the actual meaning is not disclosed).

So, categorical variables are exactly what they sound like: they represent categories instead of numerical features. Note that, even though that's not the case here, these features are often stored as text values which represent various levels of the observations. An example of this is gender: it can be recorded as "M" ("Male") or "F" ("Female"), etc.

Identifying categorical variables

As categorical variables need to be treated in a particular manner, as you'll see later on, you need to make sure to identify which variables are categorical. In some cases this is easy (e.g. when they are stored as strings); in other cases they are stored as numbers, and the fact that they are categorical is not immediately apparent. Note that this may not be trivial. A first step is to use the .info() function, which reports the data type of each column (strings, integers, etc.), and .describe(), which summarizes the value ranges. But even then, continuous variables might have been imported as strings, or categorical variables as integers, so it's very important to really have a look at your data. This is illustrated in the scatter plots below.

import matplotlib.pyplot as plt

# scatter plots of mpg against the continuous predictors
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(16,3))
for xcol, ax in zip(['acceleration', 'displacement', 'horsepower', 'weight'], axes):
    data.plot(kind='scatter', x=xcol, y='mpg', ax=ax, alpha=0.4, color='b')

# scatter plots of mpg against the categorical predictors
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,3))
for xcol, ax in zip(['cylinders', 'model year', 'origin'], axes):
    data.plot(kind='scatter', x=xcol, y='mpg', ax=ax, alpha=0.4, color='b')

(figure: scatter plots of mpg against each predictor)

In the upper four plots, we plotted the scatter plots for the continuous variables, and in the lower three, the scatter plots for the categorical variables. You can tell the structure looks very different: instead of a fairly homogeneous "cloud" of points, the categorical variables generate vertical lines at their discrete values. Another plot type that may be useful to look at is the histogram.

import warnings
warnings.filterwarnings('ignore')

# plot histograms of all numeric columns at once
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
data.hist(ax=ax);

(figure: histograms of each numeric column)

Yet another giveaway is the number of unique values per column: a categorical variable typically has only a handful, as the sketch below shows.
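As a quick check (a minimal sketch; data is the auto-mpg DataFrame loaded above), you can count the distinct values in every column at once:

data.nunique()  # cylinders, model year and origin have far fewer unique values than the continuous columns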

Transforming categorical variables

When you want to use categorical variables in regression models, they need to be transformed. There are two approaches to this:

    1. Perform label encoding
    2. Create dummy variables / one-hot encoding

Label encoding

Let's illustrate label encoding and dummy creation with the following Pandas Series with 3 categories: "USA", "EU" and "ASIA".

origin = ["USA", "EU", "EU", "ASIA","USA", "EU", "EU", "ASIA", "ASIA", "USA"]
origin_series = pd.Series(origin)

Now, what happens when calling .dtype on this Series?
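Since the Series holds plain strings, Pandas reports its generic object dtype (shown as dtype('O'), NumPy's shorthand for object):

origin_series.dtype
dtype('O')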

You'll want to make sure Python recognizes these strings as categories. This can be done as follows:

cat_origin = origin_series.astype('category')
cat_origin
0     USA
1      EU
2      EU
3    ASIA
4     USA
5      EU
6      EU
7    ASIA
8    ASIA
9     USA
dtype: category
Categories (3, object): [ASIA, EU, USA]

Note how the dtype here is category and the 3 categories are detected.

Sometimes you'll want to represent your labels as numbers. This is called label encoding.

Label encoding assigns each category an integer between 0 and (number_of_categories - 1). There are several ways to do this; one way is using .cat.codes:

cat_origin.cat.codes
0    2
1    1
2    1
3    0
4    2
5    1
6    1
7    0
8    0
9    2
dtype: int8
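If you ever need to map these integer codes back to their labels, a small sketch using the Series' categories attribute does the trick:

dict(enumerate(cat_origin.cat.categories))
{0: 'ASIA', 1: 'EU', 2: 'USA'}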

Another way is to use scikit-learn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

origin_encoded = lb_make.fit_transform(cat_origin)
origin_encoded
array([2, 1, 1, 0, 2, 1, 1, 0, 0, 2])

Note that while .cat.codes can only be used on variables that have been converted using .astype('category'), this is not a requirement for LabelEncoder, which works on plain string columns as well.
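For instance, reusing the raw origin_series from above (a minimal sketch; the alphabetical ordering ASIA=0, EU=1, USA=2 is what LabelEncoder produces by default):

lb_make.fit_transform(origin_series)
array([2, 1, 1, 0, 2, 1, 1, 0, 0, 2])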

Creating Dummy Variables

Another way to transform categorical variables is one-hot encoding, also known as creating "dummy variables". The idea is to convert each category into a new column and assign a 1 or 0 to that column. Several libraries support one-hot encoding; we'll cover two here:

pd.get_dummies(cat_origin)
   ASIA  EU  USA
0     0   0    1
1     0   1    0
2     0   1    0
3     1   0    0
4     0   0    1
5     0   1    0
6     0   1    0
7     1   0    0
8     1   0    0
9     0   0    1

See how the category labels have become the column names! Another method is using the LabelBinarizer in scikit-learn:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
origin_dummies = lb.fit_transform(cat_origin)
# LabelBinarizer returns a NumPy array, so convert it back to a DataFrame.
# Note: on some scikit-learn versions fit_transform errors on a category-dtype
# Series; passing the plain origin_series instead avoids this.
origin_dum_df = pd.DataFrame(origin_dummies, columns=lb.classes_)

The advantage of using dummies is that, whatever algorithm you use, the numerical values cannot be misinterpreted as continuous. Going forward, it's important to know that for linear regression (and most other algorithms in scikit-learn), one-hot encoding is required when adding categorical variables to a regression model!
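One caveat worth knowing (a side note beyond the code in this lesson): because the dummy columns for one variable always sum to 1, keeping all of them makes them perfectly collinear with a regression intercept. pd.get_dummies offers a drop_first parameter that drops one redundant column to avoid this:

pd.get_dummies(cat_origin, drop_first=True)  # keeps EU and USA; ASIA becomes the implicit baseline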

Back to our auto-mpg data

Let's go ahead and convert our "cylinders", "model year" and "origin" columns to dummies:

cyl_dummies = pd.get_dummies(data["cylinders"], prefix="cyl")
yr_dummies = pd.get_dummies(data["model year"], prefix="yr")
orig_dummies = pd.get_dummies(data["origin"], prefix="orig")

Next, we'll remove the original columns from our data and add the dummy columns instead:

data = data.drop(["cylinders","model year","origin"], axis=1)
data = pd.concat([data, cyl_dummies, yr_dummies, orig_dummies], axis=1)
data.head()
    mpg  displacement  horsepower  weight  acceleration                   car name  cyl_3  cyl_4  cyl_5  cyl_6  ...  yr_76  yr_77  yr_78  yr_79  yr_80  yr_81  yr_82  orig_1  orig_2  orig_3
0  18.0         307.0         130    3504          12.0  chevrolet chevelle malibu      0      0      0      0  ...      0      0      0      0      0      0      0       1       0       0
1  15.0         350.0         165    3693          11.5          buick skylark 320      0      0      0      0  ...      0      0      0      0      0      0      0       1       0       0
2  18.0         318.0         150    3436          11.0         plymouth satellite      0      0      0      0  ...      0      0      0      0      0      0      0       1       0       0
3  16.0         304.0         150    3433          12.0              amc rebel sst      0      0      0      0  ...      0      0      0      0      0      0      0       1       0       0
4  17.0         302.0         140    3449          10.5                ford torino      0      0      0      0  ...      0      0      0      0      0      0      0       1       0       0

5 rows × 27 columns
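With everything now numeric except "car name", the data is ready for modeling. The lesson stops short of fitting a model, so the following is only a minimal sketch of what that next step could look like with scikit-learn's LinearRegression:

from sklearn.linear_model import LinearRegression

# "car name" is a free-text identifier, not a predictor, so we exclude it
X = data.drop(["mpg", "car name"], axis=1)
y = data["mpg"]

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R-squared on the training data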

Summary

Great! In this lesson, you learned about categorical variables and how to prepare them for inclusion in a multiple linear regression model.

