
Ordinary Least Squares Regression

The treatment of errors has a long tradition, beginning with attempts to combine repeated measurements in astronomy and geodesy in the early 18th century. In 1805, Adrien-Marie Legendre introduced the method of least squares as a tool for fitting models with specification errors to data collected to determine the shape and circumference of the earth. Specifying the earth's shape to be a sphere, he had to estimate three parameters using five observations from the 1795 survey of the French meridian arc. With three unknowns and five equations, any estimate of the unknown parameters led to errors when fitted to the five observations. He then proposed to choose those estimates that make “the sum of squares of the errors a minimum” (Legendre, 1805, pp. 72–73).

A formal statistical theory of errors was developed by Gauss in 1809 and Laplace in 1810. The method of least squares was shown to possess many desirable statistical properties. Over more than 200 years, a method invented to deal with experimental errors in the physical sciences has become universal and is now used, with little or no modification, in the biological and social sciences.

A scientific method in the biological sciences often involves stating a causal relationship between observable variables and specifying a statistical model to estimate that relationship and test hypotheses about it.

Three common medical decision problems involving statistical methods are screening, diagnosis, and treatment. Data used in statistical analysis include medical history, clinical symptoms, and laboratory tests. For many medical conditions, there is no perfect test comparable to an X-ray for detecting a bone fracture. Decisions have to be made using one or more associated, observable factors.

Two problems arise with this approach: (1) How does one formulate a decision rule using the associated factors? and (2) Since no decision rule will be perfect, how is one to compare the decision rules, that is, the errors associated with these rules?

The ordinary least squares (OLS) regression method provides a solution. Suppose the medical condition is type 2 diabetes, and the gold standard is the oral glucose tolerance test. For a screening rule, we want to use readily available data on risk factors such as age, gender, body mass index, race, and so on, to predict blood glucose and identify high-risk individuals for follow-up tests. Any function of the risk factors will provide an estimate of blood glucose and hence be useful in diagnosing diabetes. Errors associated with the estimates are calculated using the observed blood glucose. The OLS method can be used to select a set of weights to combine the risk factors and estimate blood glucose as follows: For every set of weights, there will be corresponding predicted values of blood glucose. Prediction errors are the differences between the observed and predicted values. One can then calculate the sum of squares of these errors and choose the set of weights with the smallest sum.
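The weight-selection procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the two risk factors (age and body mass index), the weights used to generate the data, and the helper names `sum_squared_errors` and `fit_ols` are all invented for the example and do not come from any real study.

```python
import numpy as np

def sum_squared_errors(X, y, weights):
    """Sum of squared prediction errors for one candidate set of weights."""
    errors = y - X @ weights          # observed minus predicted
    return float(errors @ errors)

def fit_ols(X, y):
    """Weights that minimize the sum of squared errors (least squares fit)."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

# Synthetic, purely illustrative data (assumption: not real patient data).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 80, n)          # years
bmi = rng.uniform(18, 40, n)          # body mass index
X = np.column_stack([np.ones(n), age, bmi])    # intercept + two risk factors
generating_weights = np.array([60.0, 0.5, 1.2])  # invented for the simulation
glucose = X @ generating_weights + rng.normal(0, 10, n)

# OLS picks the weights with the smallest sum of squared errors.
ols_weights = fit_ols(X, glucose)
ols_sse = sum_squared_errors(X, glucose, ols_weights)

# Any other set of weights gives a larger sum of squared errors.
other_weights = ols_weights + np.array([1.0, 0.05, -0.1])
other_sse = sum_squared_errors(X, glucose, other_weights)
```

Comparing `ols_sse` with `other_sse` shows the defining property: among all candidate weights, the OLS weights yield the least sum of squared prediction errors on the observed data.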

Why square the errors and sum? Why not simply sum the errors? A simple sum of errors will be 0 whenever the positive errors exactly cancel the negative errors, however large each may be, and hence will be misleading. On the other hand, the sum of squares of errors will be 0 if and only if all the errors are 0, that is, only if there are no errors.
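A small numerical illustration of this point, using made-up error values chosen only so that they cancel:

```python
# Hypothetical prediction errors: one positive, two negative, cancelling exactly.
errors = [3.0, -1.0, -2.0]

simple_sum = sum(errors)                    # 0.0, despite three nonzero errors
sum_of_squares = sum(e ** 2 for e in errors)  # 9 + 1 + 4 = 14.0

# The sum of squares is 0 only when every individual error is 0.
no_errors = [0.0, 0.0, 0.0]
sum_of_squares_no_errors = sum(e ** 2 for e in no_errors)  # 0.0
```

The simple sum suggests a perfect fit even though every prediction is off, while the sum of squares registers each error regardless of its sign.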

...
