Dummy Variable Trap in Regression Models


Using categorical data in multiple regression models is a powerful way to include non-numeric data types in a regression model. Categorical data refers to values that represent categories - values drawn from a fixed, unordered set, for instance gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables - variables taking the value 1 or 0 to indicate the presence or absence of a category.

When including dummy variables in a regression model, however, one should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear - that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.

To demonstrate the Dummy Variable Trap, take the case of gender (male/female) as an example. Including a dummy variable for each category is redundant (if male is 0, female is 1, and vice-versa); doing so results in the following linear model:

y ~ b + {0|1} male + {0|1} female

Represented in matrix form:


    | y1 |
    | y2 |
Y = | y3 |
    | ...|
    | yn |

    | 1   m1   f1 |
    | 1   m2   f2 |
X = | 1   m3   f3 |
    | ... ...  ... |
    | 1   mn   fn |

In the above model, the sum of the category dummy variables in each row is equal to the intercept value of that row - in other words there is perfect multicollinearity (one value can be predicted from the others). Intuitively, there is a duplicate category: if we dropped the male column, the same information is still inherently defined in the female column (a female value of zero indicates male, and vice-versa).
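To make the dependence concrete, here is a quick check in R (a minimal sketch, not from the original post; the male/female values and the name X_gender are made up for illustration):

# Hypothetical design matrix: an intercept plus male and female dummy columns
X_gender <- cbind(b = 1,
                  male   = c(1, 1, 1, 0, 0, 0),
                  female = c(0, 0, 0, 1, 1, 1))

# The two dummy columns sum to the intercept column, so the columns are
# linearly dependent and t(X_gender) %*% X_gender cannot be inverted
all(X_gender[, "male"] + X_gender[, "female"] == X_gender[, "b"])  # TRUE
qr(X_gender)$rank  # 2 rather than 3: the matrix is rank deficient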

The solution to the dummy variable trap is to drop one of the dummy variables (or alternatively, drop the intercept constant) - if there are m categories, use m-1 in the model. The category left out can be thought of as the reference, and the fitted values of the remaining categories represent the change from this reference.

As an example, let's take data containing three categories - C1, C2, and C3:

C1  C2  C3     y
 1   0   0  12.4
 1   0   0  11.9
 0   1   0   8.3
 0   1   0   8.1
 0   0   1   5.4
 0   0   1   6.2

Using R, we can fit this model in several ways, but for demonstration I'll use the ordinary least squares solution in matrix form:

B = (X^T X)^-1 X^T Y


> Y
   1    2    3    4    5    6 
12.4 11.9  8.3  8.1  5.4  6.2 
> X
  C1 C2 C3 b
1  1  0  0 1
2  1  0  0 1
3  0  1  0 1
4  0  1  0 1
5  0  0  1 1
6  0  0  1 1
> solve(t(X) %*% X)
Error in solve.default(t(X) %*% X) : 
  Lapack routine dgesv: system is exactly singular: U[4,4] = 0

Whoops, the matrix cannot be inverted because it is singular. To fix the issue, we can remove the intercept or, alternatively, remove one of the dummy variable columns:


> X = X[,-1]
> X
  C2 C3 b
1  0  0 1
2  0  0 1
3  1  0 1
4  1  0 1
5  0  1 1
6  0  1 1
> solve(t(X) %*% X) %*% t(X) %*% Y
    [,1]
C2 -3.95
C3 -6.35
b  12.15

The calculated values are now referenced to the dropped dummy variable (in this case C1). In other words, if the category is C2, its fitted value is 3.95 less than the reference (in this example the reference value, the intercept b, is 12.15).
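The other fix mentioned above, dropping the intercept instead of a dummy column, can be sketched in the same way (a minimal sketch, not from the original post; X_cellmeans is a hypothetical name, and Y is the response vector printed earlier). With no intercept, each coefficient is simply the mean of y within its category:

# Rebuild the design matrix with all three dummy columns but no intercept
X_cellmeans <- cbind(C1 = c(1, 1, 0, 0, 0, 0),
                     C2 = c(0, 0, 1, 1, 0, 0),
                     C3 = c(0, 0, 0, 0, 1, 1))
solve(t(X_cellmeans) %*% X_cellmeans) %*% t(X_cellmeans) %*% Y
# Expected: C1 = 12.15, C2 = 8.20, C3 = 5.80 - the group means of y

These agree with the earlier fit: 12.15, 12.15 - 3.95 = 8.20, and 12.15 - 6.35 = 5.80, just parameterized differently.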

In some cases it may be necessary (or educational) to program dummy variables directly into a model. In most cases, however, a statistical package such as R can do the math for you - in R, categories can be represented by factors, letting R deal with the details:


> a
     y C
1 12.4 1
2 11.9 1
3  8.3 2
4  8.1 2
5  5.4 3
6  6.2 3
# Column C is a factor column
> class(a[,2])
[1] "factor"
> lm(y ~ ., a)

Call:
lm(formula = y ~ ., data = a)

Coefficients:
(Intercept)           C2           C3  
      12.15        -3.95        -6.35 

Factors produce the same answer as programming the dummy variables directly (above).
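For completeness, the data frame a printed above can be constructed as follows (an assumed construction, since the original post only shows the printed object), and model.matrix() reveals the design matrix R builds from the factor - an intercept plus m-1 dummy columns, with C1 serving as the reference level:

# Assumed construction of the example data frame
a <- data.frame(y = c(12.4, 11.9, 8.3, 8.1, 5.4, 6.2),
                C = factor(c(1, 1, 2, 2, 3, 3)))

# Inspect the design matrix R builds behind the scenes: an intercept
# and two dummy columns (C2, C3), with level C1 dropped as the reference
model.matrix(y ~ ., a)

Dropping the reference level is exactly the manual fix applied earlier, which is why lm() sidesteps the dummy variable trap automatically.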




Comments

  • Atik Khatri   -   July, 27, 2018

    I have one doubt regarding avoiding the dummy variable trap. If we have multiple categorical columns in our data, like Gender: {Male, Female} and Cities: {LA, NY, SF}, do we have to remove one column from each category to avoid the trap? e.g. removing Male and removing LA.

  • Jagadeesh Kotra   -   August, 12, 2018

    Atik Khatri, Removing LA and Male for example should be enough.

  • S M   -   August, 21, 2018

    Hi, thanks for the article. I had a similar question: to use the example from the earlier comment, where we have more than one categorical column, let's say gender and city; if we drop one level from each of these, how do we interpret their individual effects? To be clear: if male and London were dropped, how do I interpret the effect of just male or just London individually, since both of these pieces of information are now part of the intercept? Thanks a ton.

  • A D   -   December, 19, 2019

    Can somebody answer S M's question? I am also really interested in this!

  • Hovanes Gasparian   -   April, 1, 2020

    To answer the question above, you basically have 2 reference categories now: if you dropped male and LA, then your other coefficients will be interpreted in comparison to males in LA: so a female in NY, or a female in SF, is compared to a male in LA.

  • Ricardo Sasso   -   May, 12, 2020

    Hello, I've been watching a free course on Introduction to Computational Thinking and Data Science by MIT OCW and I found an interesting situation in which the professor made a logistic regression model to predict whether a passenger died or not on the Titanic. The professor uses data with cabin classes {C1, C2 and C3} and uses all 3 variables to fit the model. Wasn't he supposed to fall into the dummy variable trap?

  • Ricardo Sasso   -   May, 12, 2020

    I forgot to mention the link to the course

    https://www.youtube.com/watch?v=eg8DJYwdMyg

  • John Howell   -   June, 25, 2020

    @Ricardo - Not necessarily. There are a lot of ways to code variables that avoid the dummy trap. In the Titanic example it sounds like the professor was using what is called the cell means model. You can include all three cabin classes in the model as long as there is not also an intercept. In this case the coefficients of the model represent the average survival rate for each of the cabin classes. There is no intercept and thus no baseline reference from which to take differences. Using dummy coding with an intercept, the omitted cabin class is the reference and each coefficient represents the difference between the reference category and that cabin class.

    To explain more fully:

    The cell means model uses a design matrix that looks like:

    C1 C2 C3
    1  0  0
    0  1  0
    0  0  1

    Notice there is no intercept and there is not any multicollinearity since C1 + C2 != C3. There is also no base case so each coefficient is just the predicted value for the dependent variable for that cabin class. (In linear regression this is just the mean for each cabin class.)

    The dummy coding version of the model would be:

    Int C1 C2
    1   1  0
    1   0  1
    1   0  0

    Notice there are still three rows and three columns even though we dropped the last column to avoid the dummy trap. In this setting Cabin Class 3 serves as the reference level since it is the variable that was dropped. In this case the interpretation of the coefficients is: the intercept is the prediction for the dependent variable assuming that the observation is not in C1 and not in C2 (i.e. it must be in Cabin Class 3, since the categories are exhaustive). The coefficient for C1 represents the adjustment that needs to be made to the intercept (C3) to predict those in Cabin Class 1. This is the difference between Cabin Class 1 and Cabin Class 3. C2 has a similar interpretation.

    The reason we use dummy coding is not just to avoid multicollinearity (there are lots of ways to do that). It is so that we have coefficients with the interpretation that is natural for our model. (In your example from the Titanic I would use dummy coding because I'm more interested in the differences between cabin classes than in the overall survival rate for each cabin class.)

  • Viggo TW   -   December, 11, 2020

    Correct me if I am wrong, but this is only an issue if you want the model to not contain a bias term. If you do, models like linear regression would "absorb" the collinear effect in the bias.

  • Vasanth Sadasivan   -   May, 5, 2021

    There is a typo: 8.1 should be 3.1.

  • Aaron H   -   February, 1, 2023

    Is it easier to screen your variables based on variance inflation factor (VIF) scores in a real-world setting? That way, you address multicollinearity. Using one column instead of two assumes prior knowledge of the data and how it's related to itself, and that assumption could be wrong.


