
Scatterplot

A scatterplot is a graphic representation of the relationship between two or three variables. Each observation is represented by a point in n-dimensional space, where n is the number of variables. The most common type of scatterplot involves two variables, with each observation located by its bivariate coordinates, usually denoted by X and Y. Trivariate plots are also common. Representing data in more than three dimensions requires multiple bivariate and/or trivariate plots. In practice, then, scatterplots fall into two types: bivariate and trivariate.

Figure 1 presents a bivariate scatterplot and Figure 2 a trivariate scatterplot, both drawn from manipulated data points.

Scatterplots in Data Analysis

Scatterplots can be used to represent variables that have a linear relationship, a nonlinear relationship, or no relationship at all. They often give researchers the information necessary to decide whether a linear model should be fitted to their data. This is particularly important if they plan to use a statistical technique that assumes linearity, such as ordinary least squares regression. The scatterplot in Figure 3, for example, would show a researcher that fitting a linear model to those data would be misleading.
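The risk of fitting a linear model without first looking at the data can be sketched numerically. The example below is an illustration only: the data are hypothetical and are not those of Figure 3. Y is completely determined by X through a curve, yet the sample covariance, the quantity on which both the correlation coefficient and the least squares slope are built, is zero, so a linear summary would suggest no relationship at all.

```python
# Hypothetical data (not from the article's figures):
# Y depends entirely on X, but the dependence is curved, not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample covariance of X and Y. It is the numerator of the Pearson
# correlation coefficient and of the least squares slope.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

print("sample covariance:", cov_xy)
```

A scatterplot of these points would show the U-shaped pattern immediately, which is exactly the information the linear summary misses.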

Correlation and Regression

Correlation is a good way to examine the linear relationship between two variables, for example, a person’s weight and height, or a student’s high school grade point average and SAT/ACT score. The strength of the relationship between two variables is usually described in terms of the correlation coefficient (also known as the Pearson correlation coefficient), which ranges from −1 to 1. A scatterplot is often used to give researchers a graphic view of what tends to happen to one score when the other score increases or decreases.
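As a minimal sketch of how the Pearson correlation coefficient can be computed, the function below divides the sum of the cross-products of deviations from the means by the square root of the product of the two sums of squared deviations. The function name pearson_r and the height and weight values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg) for six people.
heights = [150, 160, 165, 172, 180, 188]
weights = [52, 58, 63, 70, 77, 85]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

Reversing the direction of one variable (for example, replacing each weight by its negative) changes the sign of the coefficient but not its magnitude.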

Figure 1 Bivariate Scatterplot

Figure 2 Trivariate Scatterplot

When a pair of variables is at hand, it is fairly easy to draw the scatterplot by plotting one score on the vertical axis and the other on the horizontal axis. Figures 4 through 7 provide examples of variable pairs with correlations of −0.8, −0.3, 0.3, and 0.8, respectively.

A positive correlation describes the situation in which an increase in variable X is associated with an increase in variable Y, whereas a negative correlation implies that an increase in variable X is associated with a decrease in variable Y. However, a correlation of 0.8 is not stronger than a correlation of −0.8. The strength of a relationship is given by the magnitude of the coefficient; the sign indicates only its direction.

In certain extreme situations, all of the dots fall on a straight line. This is called a perfect correlation. Figures 8 and 9 show perfect positive and perfect negative correlations, respectively.

The line passing through all the dots in Figures 8 and 9 is the line of best fit. In real-world data analysis, the line of best fit does not necessarily pass through every dot in a scatterplot. Rather, it is the line that lies as close as possible to the dots taken together. The vertical distances between the line and the dots are called residuals. In statistics, the least squares method is used to obtain the regression line, which is the line that minimizes the sum of the squared residuals. For this reason, the line of best fit is generally also referred to as the regression line.
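The least squares method described above can be sketched for the two-variable case as follows. The slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The function names fit_line and sum_squared_residuals and the data values are hypothetical, introduced only for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line for Y on X."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when X is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the dots and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying close to, but not exactly on, a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Any other line, such as one with a slightly steeper slope,
# gives a larger sum of squared residuals.
steeper = sum_squared_residuals(xs, ys, slope + 0.1, intercept)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"sum of squared residuals: fitted line {best:.3f}, steeper line {steeper:.3f}")
```

When the dots fall exactly on a straight line, as in a perfect correlation, every residual is zero and the fitted line passes through all of them.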

...
