4. The Quality of the Regression Equation (1/3)

When analysing a linear regression, one may first ask the question, "How good a fit is the derived regression equation?" This question can be addressed using the Coefficient of Determination, or the $R^2$ value.

You will recall that the least squares fitting is designed to minimise the Sums of Squares of the residuals from the fitted line, $\hat{y} = b_0 + b_1 x$; that is, it minimises $\sum (\Delta y)^2$ where $\Delta y = y - \hat{y}$.

This value is called the Sums of Squares due to Error or SSE.

$$\mathrm{SSE} = \sum (\Delta y)^2 = \sum (y - \hat{y})^2 = \sum \bigl( y - (b_0 + b_1 x) \bigr)^2$$
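As a minimal sketch of this calculation, the following uses NumPy with a small set of hypothetical data values; `np.polyfit` provides the least-squares coefficients $b_1$ and $b_0$, and the SSE is then summed directly from the residuals.

```python
import numpy as np

# Hypothetical data values for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit: polyfit with degree 1 returns [b1, b0].
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x           # fitted values, y-hat = b0 + b1*x
residuals = y - y_hat         # the residuals, delta-y = y - y-hat
sse = np.sum(residuals ** 2)  # Sums of Squares due to Error (SSE)
print(sse)
```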

Now there are two other types of residuals that we can consider:

  • residuals from the mean of the observations and
  • residuals between the fitted function and the mean value
where these are related as follows:

        Residual from the mean =
                Residual between the function and the mean
                        + Residual due to the error

$$y - \bar{y} = (\hat{y} - \bar{y}) + (y - \hat{y})$$

This equation applies to each of the residuals individually, and we can also use the summed and squared version of it, because the cross-product terms vanish.

When each side is squared and summed, cross-product terms of the form $\sum (\hat{y} - \bar{y})(y - \hat{y})$ appear; if these were non-zero they would destroy the relationship, but for a least squares fit the residuals are uncorrelated with the fitted values, so these product terms are all zero.

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

In this equation, $\mathrm{SSE} = \sum (y - \hat{y})^2$. The term on the left-hand side is the Total Sums of Squares, $\mathrm{SST} = \sum (y - \bar{y})^2$. The remaining term is the Sums of Squares due to the Regression, $\mathrm{SSR} = \sum (\hat{y} - \bar{y})^2$. So we have:

SST=SSR+SSE
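The decomposition can be checked numerically. The sketch below, again with hypothetical data values, computes all three sums of squares from a least-squares fit and confirms that SST equals SSR plus SSE to within floating-point rounding.

```python
import numpy as np

# Hypothetical data values for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.8, 3.1, 4.9, 5.2, 6.8])

b1, b0 = np.polyfit(x, y, 1)  # least-squares coefficients
y_hat = b0 + b1 * x           # fitted values
y_bar = y.mean()              # mean of the observations

sst = np.sum((y - y_bar) ** 2)      # Total Sums of Squares
ssr = np.sum((y_hat - y_bar) ** 2)  # Sums of Squares due to the Regression
sse = np.sum((y - y_hat) ** 2)      # Sums of Squares due to Error

# For a least-squares fit the decomposition SST = SSR + SSE holds.
print(sst, ssr + sse)
```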

The SST and SSE are relatively easy to calculate and so we usually use these to compute the Coefficient of Determination, R2.

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$$

You will recall that the least squares method minimises SSE. If SSE = 0, as occurs when the data are a perfect fit to the regression line, then $R^2 = 1.0$. When the regression provides no assistance at all, then SSE = SST and $R^2 = 0.0$. You can thus see that $0.0 \le R^2 \le 1.0$, and that the closer the Coefficient of Determination is to 1.0, the better the regression is at explaining, or accounting for, the variation in the y data values relative to the x values.
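Both extremes can be illustrated in code. The sketch below (hypothetical data values) computes $R^2$ as $(\mathrm{SST} - \mathrm{SSE})/\mathrm{SST}$: perfectly linear data give an $R^2$ of 1.0, while noisy data give a value between 0.0 and 1.0.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of Determination for a simple linear regression."""
    b1, b0 = np.polyfit(x, y, 1)        # least-squares fit
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)      # Sums of Squares due to Error
    sst = np.sum((y - y.mean()) ** 2)   # Total Sums of Squares
    return 1.0 - sse / sst              # (SST - SSE) / SST

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Perfect linear data: every point lies on y = 2x + 1, so SSE = 0.
y_perfect = 2 * x + 1
# The same line with hypothetical noise added to each observation.
y_noisy = y_perfect + np.array([0.3, -0.2, 0.1, -0.4, 0.2])

print(r_squared(x, y_perfect))  # approximately 1.0
print(r_squared(x, y_noisy))    # between 0.0 and 1.0
```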