Linear Regression and Correlation
NOTE: If the applet below does not appear, you may need to download the latest version of Java or enable Java applets in your browser (in IE: Tools/Internet Options/Advanced/Java)
Linear Regression Theory
Consider the problem of fitting a straight line through a set of data points: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The straight line can be parameterized as:
$$y = a_0 + a_1 x + e$$
where $a_0$ and $a_1$ are the parameters and $e$ is the error or residual.
Note: this problem is called linear regression because the equation above is linear in the coefficients $a_0$ and $a_1$, and not because the equation happens to represent a straight line. Fitting the quadratic $y = a_0 + a_1 x + a_2 x^2 + e$ would also be a linear regression problem.
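To illustrate the point about linearity in the coefficients, here is a minimal sketch that fits a quadratic of the form y = a0 + a1*x + a2*x^2 by solving a linear system (the normal equations). The function name `fit_polynomial`, the coefficient names, and the data are assumptions made up for this example; they do not come from the applet.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) a = A^T y.

    Returns the coefficients [a0, a1, ..., a_degree]. The unknowns enter
    linearly, so a linear solve is all that is needed.
    """
    m = degree + 1
    # Entry (j, k) of A^T A is sum(x^(j+k)); entry j of A^T y is sum(x^j * y).
    ata = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    aty = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(m)]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution.
    coeffs = [0.0] * m
    for j in range(m - 1, -1, -1):
        tail = sum(aug[j][k] * coeffs[k] for k in range(j + 1, m))
        coeffs[j] = (aug[j][m] - tail) / aug[j][j]
    return coeffs


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
a0, a1, a2 = fit_polynomial(xs, ys, degree=2)
print(a0, a1, a2)
```

Calling the same function with `degree=1` gives a straight-line fit, which suggests that the line and the quadratic are two instances of the same linear problem.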
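To find the straight line that best fits the data set, we use a least-squares formulation:
Minimize $S_r$, where
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
For a straight line curve fit, the sum of squares of the residuals, $S_r$, reaches a minimum where:
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
and
$$a_0 = \bar{y} - a_1 \bar{x}$$
with $\bar{x}$ and $\bar{y}$ denoting the means of the $x_i$ and $y_i$.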
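The closed-form least-squares solution for a straight line can be sketched in a few lines of Python. The function name `fit_line`, the symbols `a0` (intercept) and `a1` (slope), and the data are assumptions chosen for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        # All x values coincide: the slope of y = f(x) is undefined.
        raise ValueError("x values must not all be equal")

    a1 = (n * sum_xy - sum_x * sum_y) / denom   # slope
    a0 = sum_y / n - a1 * sum_x / n             # intercept: mean(y) - a1 * mean(x)
    return a0, a1


# Made-up data scattered around a line with slope close to 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a0, a1 = fit_line(xs, ys)
print("intercept:", a0, "slope:", a1)
```

The guard on `denom` covers the degenerate case of a vertical stack of points, for which no line of the form y = f(x) can be fitted.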
To express how good a fit this best-fit line is, one can consider the coefficient of determination:
$$r^2 = \frac{S_t - S_r}{S_t}$$
where $S_t$ is the total sum of squares of the residuals between the data points and the mean:
$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
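As a rough numerical illustration of the coefficient of determination, the sketch below compares the sum of squared residuals about a fitted line with the sum of squared residuals about the mean. The function name `r_squared`, the data, and the line parameters (intercept 0.05, slope 1.99, which are the least-squares values for this particular made-up data set) are all assumptions for the example.

```python
def r_squared(xs, ys, a0, a1):
    """Coefficient of determination for the line y = a0 + a1 * x."""
    y_mean = sum(ys) / len(ys)
    # Total sum of squares: spread of the data around its mean.
    s_t = sum((y - y_mean) ** 2 for y in ys)
    # Residual sum of squares: spread of the data around the line.
    s_r = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))
    return (s_t - s_r) / s_t


# Made-up data and its least-squares line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_line = r_squared(xs, ys, a0=0.05, a1=1.99)

# A horizontal line through the mean explains none of the variability.
r2_mean = r_squared(xs, ys, a0=sum(ys) / len(ys), a1=0.0)
print(r2_line, r2_mean)
```

The second call shows the baseline case: describing the data by its mean alone leaves the two sums of squares equal, so the ratio is zero.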
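The correlation $r$ is the square root of $r^2$ and can also be expressed as:
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2} \, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$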
The correlation is an indication of how well the data points fit the straight line. A perfect fit with a positive slope corresponds to a correlation of 1 (a perfect fit with a negative slope gives -1); in either case the line explains 100% of the variability in the data. A correlation of 0 indicates that the line explains 0% of the variability; that is, the explanation is no better than characterizing the data set by its mean.
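The correlation can also be computed directly from sums over the data, without fitting the line first. The following is a minimal sketch; the function name `correlation` and the three small data sets are assumptions made up to show the limiting cases of 1, -1, and 0.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient computed from raw sums over the data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Points exactly on a rising line, on a falling line, and in a symmetric "V".
r_up = correlation([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
r_down = correlation([1.0, 2.0, 3.0, 4.0], [9.0, 7.0, 5.0, 3.0])
r_none = correlation([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
print(r_up, r_down, r_none)
```

Note that the sketch divides by zero if all x values or all y values are identical; in that case the correlation is undefined.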
For additional details about linear regression, review "Chapter 17: Least Squares Regression" in Numerical Methods for Engineers by Chapra and Canale. A succinct on-line overview can be found at Wikipedia.
You can also develop a better understanding of these concepts by exploring them interactively with the applet below.
Start Exploring!
In the interactive window below you can perform the following operations:
- Move a data point: click-and-drag a data point to a different position
- Add a data point: ctrl-click in the location where you want to add it
- Remove a data point: ctrl-click on the data point to be removed
- Toggle regression curves: click on the blue or red button for y=f(x) and x=f(y), respectively
- Retrieve a pre-programmed data set: click on one of the buttons S0 through S3
- Print this entire page: click the printer button at the top right of this web-page
Learn by Exploring
- Create a data set of at least 10 points with a correlation of 0.9. How would you characterize the "shape" of your data set?
- Repeat the previous task with correlation values of -0.5 and 0.0.
- How does the shape change with different values of the correlation?
- What is the value of correlation for a data set consisting of only two points? Explain.
- How is the slope of the regression line related to the correlation value?
- Now, click on the red button to turn on the display of the linear regression for x = f(y). Is the regression x=f(y) [in red] the same as the regression y=f(x) [in blue]? Explain.
- Create two data sets, one for which the difference between y=f(x) and x=f(y) is as small as possible, and one for which the difference is as large as possible. Describe your solutions.
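If you would like to check your observations outside the applet, the sketch below computes both regressions for a data set: y = f(x), which minimizes vertical residuals, and x = f(y), which minimizes horizontal residuals. The helper `fit_line`, the variable names, and the data are assumptions for illustration and are not taken from the applet's implementation.

```python
def fit_line(us, vs):
    """Least-squares line v = b0 + b1 * u; returns (b0, b1)."""
    n = len(us)
    sum_u = sum(us)
    sum_v = sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_uu = sum(u * u for u in us)
    b1 = (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)
    b0 = sum_v / n - b1 * sum_u / n
    return b0, b1


# Made-up data set; replace with your own points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a0, a1 = fit_line(xs, ys)   # y = a0 + a1 * x  (regression of y on x)
c0, c1 = fit_line(ys, xs)   # x = c0 + c1 * y  (regression of x on y)

# Rewrite x = c0 + c1 * y in the form y = ... so both slopes share the same axes.
slope_y_on_x = a1
slope_x_on_y_as_y = 1.0 / c1
print("y = f(x) slope:", slope_y_on_x)
print("x = f(y) slope, drawn in the same axes:", slope_x_on_y_as_y)
```

Swapping the argument order is all that distinguishes the two fits, so the same helper serves both regressions.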
Contributors : Ivan Lee and Chris Paredis
(c) Ivan Lee and Chris Paredis 2006
Last modified 08/13/2006 02:48 PM