
Classification and regression tree models, also known as recursive partitioning or CART™, are a class of nonparametric regression models that have become increasingly popular in epidemiology and biomedical data analysis, as well as in computer science and data mining. These models gained prominence when the methodology was formalized by Leo Breiman and colleagues in their book Classification and Regression Trees. The subsequent availability of commercial software (e.g., Salford Systems, Inc.) and academic freeware (the R ‘tree’ and ‘rpart’ functions) for fitting these models helped make the approach practical for data analysis. One of the most common uses of classification and regression tree models in epidemiology is to develop predictive rules for diagnosis; other uses include developing screening guidelines and creating prognostic models.

The goal of regression and classification is to fit a mathematical model that takes categorical or continuous input (independent or predictor variables) and returns a predicted value for an output (dependent or outcome variable). To take a simple example, the analyst may want to predict a person's weight as a function of their height using a simple linear regression model where height is the input and weight is the output. In collecting data, the analyst will have measured height and weight on numerous people and is likely to have several individuals with nearly or exactly the same height. For this subset, the weights will follow a distribution, with some people being heavier or lighter than others. A simple linear regression of weight on height gives a formula such that, for a particular height, the model returns the ‘expected’ or ‘mean’ weight for individuals at that height. The concept of the ‘expected’ or ‘mean’ weight for individuals at a particular height is essential for understanding regression. In statistics, a simple linear regression gives the conditional mean of Weight for a given value of Height = h, which we write as E(Weight|Height = h) = b0 + b1 × h, where E(Weight|Height = h) is the expected value of weight for a person with height h, and b0 and b1 are the parameters of the equation used to make this prediction.
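The conditional-mean idea can be sketched with a least-squares fit of weight on height. The heights and weights below are invented purely for illustration; the formulas for b0 and b1 are the standard least-squares estimates.

```python
# Least-squares fit of E(Weight | Height = h) = b0 + b1 * h.
# Heights (cm) and weights (kg) are made-up illustrative data.
heights = [150, 160, 165, 170, 175, 180, 185]
weights = [52, 58, 62, 68, 72, 79, 84]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Slope: b1 = Cov(Height, Weight) / Var(Height); intercept: b0 = mean_w - b1 * mean_h
b1 = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) \
     / sum((h - mean_h) ** 2 for h in heights)
b0 = mean_w - b1 * mean_h

def expected_weight(h):
    """Conditional mean weight, E(Weight | Height = h), under the fitted line."""
    return b0 + b1 * h
```

One property worth noting: the fitted line always passes through the point of means, so expected_weight(mean_h) equals the mean weight exactly.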

A similar concept applies to the classification problem, where the input variables are used to predict the group to which an individual or some other object belongs. Classical statistical approaches to this problem include discriminant analysis, although applied data analysts often use logistic regression instead. In classification, the idea of conditional expectation, that is, E(Weight|Height = h), is replaced with a statement of the probability of belonging to one of the classes. The simplest problem considers two classes, say, patients who either responded to treatment or did not. In a treatment study comparing response rates for patients receiving a drug versus patients receiving a placebo, the analyst would fit models to obtain estimates of the conditional probabilities of response, for example, Prob(Patient responded|Patient received drug) versus Prob(Patient responded|Patient received placebo).
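The conditional probabilities in the two-arm example can be estimated directly as response proportions within each arm. The trial records below are hypothetical and exist only to make the calculation concrete.

```python
# Empirical estimates of Prob(responded | drug) and Prob(responded | placebo)
# from invented trial records of (treatment arm, responded?) pairs.
trial = [
    ("drug", True), ("drug", True), ("drug", False), ("drug", True),
    ("placebo", False), ("placebo", True), ("placebo", False), ("placebo", False),
]

def response_rate(arm):
    """Conditional probability of response given treatment arm,
    estimated as the proportion of responders within that arm."""
    responses = [responded for a, responded in trial if a == arm]
    return sum(responses) / len(responses)
```

With these made-up counts, response_rate("drug") is 3/4 and response_rate("placebo") is 1/4; a model such as logistic regression would produce smoothed versions of the same conditional probabilities.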

Nonparametric recursive partitioning has the same goals as regression and classification, as above, but does not assume a particular parametric model. A nonparametric approach allows more flexibility in fitting the model to fluctuations in the data, but at the cost of computational simplicity and formal hypothesis tests (e.g., testing a coefficient for significance). In the example above, where weight was regressed on height, a linear relationship might be reasonable for a homogeneous population, for instance, within a particular age range and gender group. It might become less reasonable if all ages from infants to the elderly were included, or if outliers such as weight lifters and marathon runners were included. (While a more complex parametric model with more terms and interactions might fit the data well, confirming that the parametric model is correct cannot always be done easily, especially as the number of input variables increases.)
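A single step of recursive partitioning can be sketched in one dimension: choose the cut point on height that minimizes the within-node sum of squared errors of weight, then predict the mean weight on each side. This is a simplified sketch of one split, not a full tree-growing algorithm, and the data (young children versus adults) are invented for illustration.

```python
# One recursive-partitioning step for regressing weight on height:
# pick the threshold minimizing total within-node sum of squared errors (SSE).

def sse(ys):
    """Sum of squared deviations from the node mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the SSE-minimizing split."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds at observed x values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Invented data mixing children and adults: a line fits poorly,
# but one split separates the two groups cleanly.
heights = [100, 105, 110, 170, 175, 180]
weights = [18, 20, 22, 70, 74, 78]
t, left_mean, right_mean = best_split(heights, weights)
```

A full CART fit would apply best_split recursively within each resulting node until a stopping rule is met, which is where the flexibility of the method comes from.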

...
