R-squared ($R^2$) is a statistic that measures the amount of variance accounted for in the relationship between two (or more) variables. Sometimes $R^2$ is called the coefficient of determination, and it is given as the square of a correlation coefficient.

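Given paired variables $(X_i, Y_i)$, $i = 1, \dots, n$, a linear model that explains the relationship between the variables is given by

$$Y_i = \beta_0 + \beta_1 X_i + e_i,$$

where $e_i$ is a mean-zero error. The parameters of the linear model can be estimated using the least squares method, and the estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively. The parameters are estimated by minimizing the sum of squared residuals between the variable $Y_i$ and the model $\beta_0 + \beta_1 X_i$, that is,

$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.$$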

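It can be shown that the least squares estimates are

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$

where $\bar{X}$ and $\bar{Y}$ are the sample means and the sample cross-covariance $S_{xy}$ is defined as

$$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}),$$

with $S_{xx}$ defined analogously by replacing $Y$ with $X$.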
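The "it can be shown" step can be sketched as follows. This is the standard calculus argument, not a derivation taken from the entry; the symbol $Q$ for the sum of squared residuals is introduced here only for illustration. Any common normalizing constant in $S_{xy}$ and $S_{xx}$ cancels in the ratio.

```latex
\begin{align*}
Q(\beta_0, \beta_1) &= \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \\
\frac{\partial Q}{\partial \beta_0}
  &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0
  \;\Longrightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \\
\frac{\partial Q}{\partial \beta_1}
  &= -2 \sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i) = 0
  \;\Longrightarrow\;
  \hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sum_{i=1}^{n} (X_i - \bar{X})^2}
  = \frac{S_{xy}}{S_{xx}}
\end{align*}
```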

Statistical packages such as SAS, S-PLUS, and R provide routines for obtaining the least squares estimates. The estimated model is denoted as

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i.$$
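The entry names SAS, S-PLUS, and R; as a language-neutral illustration, the closed-form estimates can also be computed by hand. The following is a minimal sketch in plain Python. The function name `least_squares_fit` and the toy data are hypothetical and not part of the entry.

```python
from statistics import fmean


def least_squares_fit(x, y):
    """Return (b0, b1), the least squares intercept and slope of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = fmean(x), fmean(y)
    # Sums of cross-products and squares about the means. Any common
    # normalizing constant (1/n or 1/(n-1)) cancels in the ratio below.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data, not taken from the entry.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]  # the estimated model at each X_i
print(round(b0, 4), round(b1, 4))    # prints: 2.2 0.6
```

The list `fitted` holds the values of the estimated model at each observed X, which are the quantities needed for the sums of squares.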

With the above notation, the sum of squared errors (SSE), or the sum of squared residuals, is given by

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

SSE measures the amount of variability in Y that is not explained by the model. Then how does one measure the amount of variability in Y that is explained by the model? To answer this question, one needs to know the total variability present in the data. The total sum of squares (SST) is the measure of total variation in the Y variable and is defined as

$$\mathrm{SST} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

where $\bar{Y}$ is the sample mean of the Y values, that is,

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$

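Because SSE is the minimum of the sum of squared residuals over all linear models, and the constant model $Y = \bar{Y}$ (the case $\beta_1 = 0$) is one such model whose sum of squared residuals equals SST, SSE can never exceed SST. The amount of variability explained by the model is then SST − SSE, which is denoted as the regression sum of squares (SSR), that is,

$$\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE}.$$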
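A short worked identity may make the decomposition more concrete. This is the standard argument for the least squares line, stated here as a sketch and not quoted from the entry. Writing $\hat{Y}_i$ for the fitted values, the total variation splits into an unexplained part and an explained part:

```latex
\begin{align*}
\mathrm{SST}
  &= \sum_{i=1}^{n} \bigl[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\bigr]^2 \\
  &= \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{\mathrm{SSE}}
   + \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{\mathrm{SSR}}
   + 2 \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}).
\end{align*}
```

The cross term vanishes because $\hat{Y}_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X})$ and the least squares residuals satisfy $\sum (Y_i - \hat{Y}_i) = 0$ and $\sum X_i (Y_i - \hat{Y}_i) = 0$. Hence SSR is itself a sum of squares and cannot be negative.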

The ratio SSR/SST = (SST − SSE)/SST measures the proportion of variability explained by the model. The coefficient of determination ($R^2$) is defined as this ratio:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
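The three sums of squares and their ratio can be computed directly from the definitions. The sketch below assumes a simple linear model fitted by least squares; the function name `sums_of_squares` and the toy data are hypothetical.

```python
def sums_of_squares(x, y):
    """Fit y = b0 + b1*x by least squares and return (sse, sst, ssr, r2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    if sst == 0:
        raise ValueError("y must not be constant")
    ssr = sst - sse                                         # explained
    return sse, sst, ssr, ssr / sst


# Hypothetical toy data, not taken from the entry.
sse, sst, ssr, r2 = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(sse, 4), round(sst, 4), round(ssr, 4), round(r2, 4))
# prints: 2.4 6.0 3.6 0.6
```

For this toy data set, 60% of the variation in Y is accounted for by the fitted line.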

The coefficient of determination is thus the ratio of the variation explained by the model to the total variation present in Y. Note that the coefficient of determination ranges between 0 and 1. The $R^2$ value is interpreted as the proportion of variation in Y that is explained by the model. $R^2 = 1$ indicates that the model exactly explains the variability in Y, and hence the fitted line must pass through every measurement $(X_i, Y_i)$. On the other hand, $R^2 = 0$ indicates that the model does not explain any variability in Y. As a rule of thumb, an $R^2$ value larger than .5 is usually considered to indicate a strong relationship.
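The two boundary cases, and the statement that the coefficient of determination equals a squared correlation coefficient, can be checked numerically. The sketch below is illustrative only: the helper names `r_squared` and `pearson_r` and all three toy data sets are hypothetical.

```python
from math import sqrt


def r_squared(x, y):
    """Return 1 - SSE/SST for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


def pearson_r(x, y):
    """Return the sample (Pearson) correlation coefficient of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]
perfect = [3 + 2 * xi for xi in x]  # every point lies on one line
flat = [2, 1, 0, 1, 2]              # symmetric about x = 3, so S_xy = 0
noisy = [2, 4, 5, 4, 5]             # partial linear relationship

print(r_squared(x, perfect), r_squared(x, flat))
print(r_squared(x, noisy), pearson_r(x, noisy) ** 2)
```

The first data set should give a value of 1 and the second a value of 0, while for the third the two printed numbers should agree up to floating-point rounding.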

Case Study and Data

Consider the following paired measurements from Moore and McCabe (1989), based on occupational mortality records from 1970 to 1972 in England and Wales. The figures represent smoking rates and deaths from lung cancer for a number of occupational groups.

Smoking index   Lung cancer mortality index
77              84
137             116
117             123
94              128
116             155
102             101
111             118
93              113
88              104
102             88
91              104
104             129
107             86
112             96
113             144
110             139
125             113
133             146
115             128
105             115
87              79
91              85
100             120
76              60
66              51

For a set of occupational groups, the first variable is the smoking index (average 100), and the second variable is the lung cancer mortality index (average 100). Suppose we are interested in determining how much the lung cancer mortality index (Y variable) is influenced by the smoking index (X variable). Figure 1 shows the scatterplot of the smoking index versus the lung cancer mortality index. The straight line is the estimated linear model, and it is given

...
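As a sketch of how the quantities defined in this entry could be computed for the case-study data, the following plain Python applies the least squares formulas with the smoking index as X and the lung cancer mortality index as Y. The variable names are hypothetical, and the entry itself refers to packages such as SAS, S-PLUS, and R, not to this code.

```python
# Case-study data as listed in the table (25 occupational groups).
smoking = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
           112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]
mortality = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
             96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]

n = len(smoking)
x_bar = sum(smoking) / n
y_bar = sum(mortality) / n

# Least squares slope and intercept (normalizing constants cancel).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(smoking, mortality))
s_xx = sum((x - x_bar) ** 2 for x in smoking)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sums of squares and the coefficient of determination.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(smoking, mortality))
sst = sum((y - y_bar) ** 2 for y in mortality)
ssr = sst - sse
r2 = ssr / sst

print(f"intercept = {b0:.3f}, slope = {b1:.4f}")
print(f"SSE = {sse:.1f}, SST = {sst:.1f}, SSR = {ssr:.1f}, R^2 = {r2:.3f}")
```

The printed $R^2$ is the proportion of the variation in the mortality index that the fitted straight line accounts for.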
