
Dummy coding is used when categorical variables (e.g., sex, geographic location, ethnicity) are of interest in prediction. It provides one way of using categorical predictor variables in various kinds of estimation models, such as linear regression. Dummy coding uses only 1s and 0s to convey all the necessary information on group membership. With this kind of coding, the researcher enters a 1 to indicate that a person is a member of a category, and a 0 otherwise.

Dummy codes are a series of numbers assigned to indicate group membership in any mutually exclusive and exhaustive category. Category membership is indicated in one or more columns of 0s and 1s. For example, a researcher could code sex as 1 = female, 0 = male or 1 = male, 0 = female. In this case the researcher would have a column variable indicating status as male or female. In general, with k groups there will be k-1 coded variables. Each of the dummy-coded variables uses 1 degree of freedom, so k groups have k-1 degrees of freedom, just as in analysis of variance (ANOVA). Consider the following example, in which there are four observations within each of the four groups:

Group   G1   G2   G3   G4
         1    2    5   10
         3    3    6   10
         2    4    4    9
         2    3    5   11
Mean     2    3    5   10

For this example we need to create three dummy-coded variables. We will call them d1, d2, and d3. For d1, every observation in Group 1 will be coded as 1 and observations in all other groups will be coded as 0. We will code d2 with 1 if the observation is in Group 2 and zero otherwise. For d3, observations in Group 3 will be coded 1 and zero for the other groups. There is no d4; it is not needed because d1 through d3 have all the information needed to determine which observation is in which group.

Here is how the data look after dummy coding:

Values   Group   d1   d2   d3
     1       1    1    0    0
     3       1    1    0    0
     2       1    1    0    0
     2       1    1    0    0
     2       2    0    1    0
     3       2    0    1    0
     4       2    0    1    0
     3       2    0    1    0
     5       3    0    0    1
     6       3    0    0    1
     4       3    0    0    1
     5       3    0    0    1
    10       4    0    0    0
    10       4    0    0    0
     9       4    0    0    0
    11       4    0    0    0

Note that every observation in Group 1 has the dummy-coded value of 1 for d1 and 0 for the others. Those in Group 2 have 1 for d2 and 0 otherwise, and for Group 3, d3 equals 1 with 0 for the others. Observations in Group 4 have all 0s on d1, d2, and d3. These three dummy variables contain all the information needed to determine which observations are included in which group. If you are in Group 2, then d2 is equal to 1 while d1 and d3 are 0. The group with all 0s is known as the reference group, which in this example is Group 4.
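The coding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names dummy_code and decode are hypothetical and are not part of the original entry, and the group labels 1 through 4 stand in for the four groups of the example.

```python
def dummy_code(groups, reference):
    """Return (coded_levels, rows): k-1 indicator columns for k groups.

    Hypothetical helper for illustration; the reference group gets all 0s.
    """
    levels = sorted(set(groups))
    coded_levels = [g for g in levels if g != reference]
    rows = [[1 if g == level else 0 for level in coded_levels] for g in groups]
    return coded_levels, rows


def decode(row, coded_levels, reference):
    """Recover the group label from one row of dummy codes."""
    for value, level in zip(row, coded_levels):
        if value == 1:
            return level
    return reference  # all 0s means the reference group


# Four observations in each of four groups; Group 4 is the reference group.
groups = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
coded_levels, rows = dummy_code(groups, reference=4)
names = [f"d{level}" for level in coded_levels]  # d1, d2, d3

# The three columns are enough to recover every observation's group.
recovered = [decode(row, coded_levels, reference=4) for row in rows]
```

Here recovered equals the original list of group labels, which is the sense in which d1 through d3 carry all the group information and a fourth column would be redundant.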

Dummy Coding in ANOVA

Using nominal data in prediction requires a numeric coding scheme such as dummy codes, because predictors need to be represented quantitatively and nominal data lack this quality. Once the data are coded properly, the analysis can be interpreted in a manner similar to traditional ANOVA.
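To make the link to ANOVA concrete, the following sketch regresses the sixteen values from the four-group example on d1, d2, and d3 by ordinary least squares. It is an illustration rather than the entry's own procedure: the function solve_normal_equations is a hypothetical helper that solves the normal equations with exact fractions so that no external library is assumed.

```python
from fractions import Fraction

values = [1, 3, 2, 2, 2, 3, 4, 3, 5, 6, 4, 5, 10, 10, 9, 11]
groups = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4

# Design matrix: intercept column plus d1, d2, d3 (Group 4 is the reference).
X = [[1] + [1 if g == level else 0 for level in (1, 2, 3)] for g in groups]


def solve_normal_equations(X, y):
    """Least-squares coefficients from (X'X) b = X'y via Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y] in exact rational arithmetic.
    A = [
        [Fraction(sum(row[i] * row[j] for row in X)) for j in range(p)]
        + [Fraction(sum(row[i] * yi for row, yi in zip(X, y)))]
        for i in range(p)
    ]
    for col in range(p):
        # Move a row with a nonzero pivot into place, then normalize it.
        pivot = next(r for r in range(col, p) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        # Eliminate this column from every other row.
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


b0, b1, b2, b3 = solve_normal_equations(X, values)
print(b0, b1, b2, b3)  # prints: 10 -8 -7 -5

# Intercept plus a dummy coefficient reproduces that group's mean.
group_means = [b0 + b1, b0 + b2, b0 + b3, b0]
```

The intercept equals the mean of the reference group (10 for Group 4), and each coefficient is the difference between a group's mean and the reference mean, so group_means recovers the means 2, 3, 5, and 10 from the first table.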

Suppose we have three groups of people (single, married, and divorced) and we want to estimate their life satisfaction. In the following table, the first column identifies the single group (observations of single status are dummy coded as 1, and 0 otherwise), and the second column identifies the married group (observations of married status are dummy coded as 1, and 0 otherwise). The divorced group is left over, meaning this group is the reference group. However, the overall results will be the same no matter which group we select as the reference group.

...
