
Biserial correlation coefficients are measures of association that apply when one of the observed variables takes on only two numerical values (a binary variable) and the other variable is a measurement or a score. There are several biserial coefficients, with the appropriate choice depending on the underlying statistical model for the data. The point biserial correlation and Pearson's biserial correlation are arguably the best known and most commonly used coefficients in practice. We will focus on these two coefficients but will also discuss other approaches.

Karl Pearson developed the sample biserial correlation coefficient in the early 1900s to estimate the correlation ρYZ between two measurements Z and Y when Z is not directly observed. Instead of Z, data are collected on a binary variable X with X = 0 if Z falls below a threshold level and X = 1 otherwise. The numerical values assigned to X do not matter provided the smaller value identifies when Z is below the threshold. In many settings, the latent variable Z is a conceptual construct and not measurable. The sample point biserial correlation estimates the correlation ρYX between Y and a binary variable X without reference to an underlying latent variable Z.
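The latent-threshold setup can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: (Y, Z) is drawn as standard bivariate normal with correlation `rho`, and the names `simulate`, `pearson_r`, `rho`, and `threshold` are hypothetical, not taken from the source. It dichotomizes Z into X at the threshold and compares the correlation of Y with Z against the correlation of Y with X, including a recoding of X to the values 2 and 5.

```python
import math
import random

def pearson_r(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def simulate(n=20000, rho=0.6, threshold=0.0, seed=1):
    """Draw (Y, Z) as bivariate normal (an assumption) and dichotomize Z into X."""
    rng = random.Random(seed)
    ys, zs, xs = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # Y = rho*Z + noise gives corr(Y, Z) = rho with unit variances.
        y = rho * z + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        ys.append(y)
        zs.append(z)
        xs.append(0 if z < threshold else 1)  # X = 0 when Z is below the threshold
    return ys, zs, xs

ys, zs, xs = simulate()
r_yz = pearson_r(ys, zs)   # correlation with the (normally unobserved) latent Z
r_yx = pearson_r(ys, xs)   # correlation with the observed binary X
# Recoding X as 2/5 instead of 0/1 keeps the ordering, so r is unchanged.
r_yx_recoded = pearson_r(ys, [5 if x == 1 else 2 for x in xs])

print(round(r_yz, 3), round(r_yx, 3), round(r_yx_recoded, 3))
```

In this simulation the correlation computed from X is smaller in magnitude than the one computed from Z, which is why the two population quantities ρYZ and ρYX need to be kept distinct.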

We will use S. Karelitz and colleagues' data on 38 infants to illustrate these ideas. A listing of the data is given in Table 1. The response Y is a child's IQ score at age 3, whereas X = 1 if the child's speech developmental level at age 3 is high, and X = 0 otherwise. The (population) biserial correlation ρYZ is a reasonable measure of association when X is a surrogate for a latent continuum Z of speech levels. The (population) point biserial correlation ρYX is more relevant when the relationship between IQ and the underlying Z scale is not of interest, or when the latent scale cannot be justified.

The Point Biserial Correlation

Assume that a random sample (y1, x1), (y2, x2), …, (yn, xn) of n observations is selected from the (Y, X) population, where Y is continuous and X is binary. Let sYX be the sample covariance between the yi and the xi, and let sY² and sX² be the sample variances of the yi and the xi, respectively. The population correlation ρYX between Y and X is estimated by the sample point biserial correlation coefficient, which is simply the product-moment correlation between the Y and X samples:

$$ r_{YX} = \frac{s_{YX}}{s_Y \, s_X} $$

Table 1 Data for a Sample of 38 Children
X = 0   Y:   87   90   94   94   97  103  103  104  106  108  109
            109  109  112  119  132
X = 1   Y:  100  103  103  106  112  113  114  114  118  119  120
            120  124  133  135  135  136  141  155  157  159  162
Note: X = speech developmental level (0 = low; 1 = high), and Y = IQ score.

The sample point biserial estimator rYX can also be expressed as

$$ r_{YX} = \frac{(\bar{Y}_1 - \bar{Y}_0)\,[\hat{p}(1 - \hat{p})]^{1/2}}{s_Y}, $$

where Ȳ1 and Ȳ0 are the average y values from sampled pairs with xi = 1 and xi = 0, respectively, and p̂ is the proportion of observations that have xi = 1.

The equivalence between the two expressions for rYX requires that the sample variances and covariances be computed using a divisor of n and not the usual divisor of n − 1.
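The equivalence can be checked numerically on the Table 1 data. The following is a minimal sketch, not a definitive implementation. It takes the second expression to be the standard mean-difference form (Ȳ1 − Ȳ0)[p̂(1 − p̂)]^(1/2) / sY, which is an assumption about the expression as printed in the original. The IQ values are those read from Table 1, and the variable names are illustrative.

```python
import math

# IQ scores read from Table 1, grouped by speech developmental level X.
y0 = [87, 90, 94, 94, 97, 103, 103, 104, 106, 108, 109,
      109, 109, 112, 119, 132]                          # X = 0 (low)
y1 = [100, 103, 103, 106, 112, 113, 114, 114, 118, 119, 120,
      120, 124, 133, 135, 135, 136, 141, 155, 157, 159, 162]  # X = 1 (high)

y = y0 + y1
x = [0] * len(y0) + [1] * len(y1)
n = len(y)

def mean(values):
    return sum(values) / len(values)

# Sample moments with divisor n (not n - 1), as the equivalence requires.
ybar, xbar = mean(y), mean(x)
s_yx = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / n
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)

# Form 1: product-moment correlation between the Y and X samples.
r_product_moment = s_yx / (s_y * s_x)

# Form 2: mean-difference form (assumed standard expression).
p_hat = mean(x)  # proportion of observations with x = 1
r_mean_diff = (mean(y1) - mean(y0)) * math.sqrt(p_hat * (1 - p_hat)) / s_y

print(round(r_product_moment, 4), round(r_mean_diff, 4))
```

The two values agree up to floating-point rounding. Replacing the divisor n with n − 1 in `s_y` alone would break the agreement for the second form, whereas the product-moment form is unaffected as long as the same divisor is used for the covariance and both variances.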

The first careful analysis of the properties of rYX was provided by Robert Tate in the mid-1950s. He derived the large-sample distribution of rYX assuming that the conditional distributions of Y given X = 1 and given X = 0 are normal with potentially different means but the same variance. Tate showed that T = (n − 2)^(1/2) rYX / (1 − rYX²)^(1/2) is equal to the usual two-sample Student t statistic for comparing Ȳ1 to Ȳ0 and that the hypothesis ρYX = 0 can be tested using the p value from the two-sample t test. For ρYX ≠ 0, large-sample hypothesis tests and confidence intervals can be based on a normal approximation to rYX, with estimated

...
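Tate's identity between T = (n − 2)^(1/2) rYX / (1 − rYX²)^(1/2) and the two-sample Student t statistic can be checked on the Table 1 data. The sketch below assumes the pooled-variance (equal-variance) form of the t statistic, in line with the equal-variance assumption stated above. The variable names are illustrative, and the data are the values read from Table 1.

```python
import math

# IQ scores read from Table 1, grouped by speech developmental level X.
y0 = [87, 90, 94, 94, 97, 103, 103, 104, 106, 108, 109,
      109, 109, 112, 119, 132]                          # X = 0 (low)
y1 = [100, 103, 103, 106, 112, 113, 114, 114, 118, 119, 120,
      120, 124, 133, 135, 135, 136, 141, 155, 157, 159, 162]  # X = 1 (high)

n0, n1 = len(y0), len(y1)
n = n0 + n1
m0, m1 = sum(y0) / n0, sum(y1) / n1

# Pooled-variance two-sample t statistic with n - 2 degrees of freedom.
ss_within = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
pooled_var = ss_within / (n - 2)
t_two_sample = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Point biserial correlation as the product-moment correlation of (y, x).
y = y0 + y1
x = [0] * n0 + [1] * n1
ybar, xbar = sum(y) / n, sum(x) / n
s_yx = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x))
s_yy = sum((yi - ybar) ** 2 for yi in y)
s_xx = sum((xi - xbar) ** 2 for xi in x)
r = s_yx / math.sqrt(s_yy * s_xx)

# Tate's statistic computed from r alone.
t_from_r = math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)

print(round(t_from_r, 4), round(t_two_sample, 4))
```

The two statistics coincide up to rounding, so a p value for the hypothesis ρYX = 0 could be obtained from a t distribution with n − 2 degrees of freedom using either one.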
