Linear Regression

Least Squares Regression

Introduction

Suppose that we have a set of data on two variables (x, y). For example, we could have data on both the age and weight of a group of individuals, data on height and average hours of sleep, etc. We may hypothesize that the variable y depends on the variable x. In particular, we may guess that there is a linear relationship between our two variables; in math, we say y = mx + b. Our goal is to find the set of parameters, m and b, that best fit our data. It turns out that there is a unique set of parameters m and b that best fit our data. All that means is that there is only one m and only one b that will give us the best formula y = mx + b.
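To make this setup concrete, here is a minimal sketch (in Python with NumPy; the parameter values, noise level, and variable names are made up for illustration) of data that follows a linear relationship plus some scatter:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up "true" parameters; in practice these are exactly what we do not know.
true_m, true_b = 2.0, 1.0

x = np.linspace(0, 10, 20)                           # values of the first variable
y = true_m * x + true_b + rng.normal(0, 1, x.size)   # second variable, measured with some noise

# The regression problem: given only the (x, y) pairs, estimate m and b.
```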

Of course, sometimes our best-fit line will be good and other times it will not be so good.

In this section, we will learn how to find the least-squares regression line. In the section on correlation you have already learned how to quantify how good the fit is!

Finding the Slope and the Intercept

To find the slope:

m = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^2 \rangle - \langle x \rangle^2}

This may look intimidating at first, but remember that the notation \langle x \rangle just means the expectation value of x.
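Written out explicitly for a data set of n points (x_i, y_i), each bracketed quantity is simply an average over the data:

\langle x \rangle = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \langle xy \rangle = \frac{1}{n}\sum_{i=1}^{n} x_i y_i, \qquad \langle x^2 \rangle = \frac{1}{n}\sum_{i=1}^{n} x_i^2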

To find the y-intercept:

b = \langle y \rangle - m \langle x \rangle
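As a sketch of how these two formulas translate into code (using Python with NumPy; the function name and the small data set are made up for illustration), we can compute the averages directly and compare against NumPy's built-in polynomial fit:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return the slope m and intercept b of the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # m = (<xy> - <x><y>) / (<x^2> - <x>^2)
    m = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
    # b = <y> - m<x>
    b = np.mean(y) - m * np.mean(x)
    return m, b

# Made-up data for illustration
x = [1, 2, 3, 4]
y = [2, 3, 5, 7]
m, b = least_squares_fit(x, y)
print(m, b)                  # 1.7 0.0
print(np.polyfit(x, y, 1))   # NumPy's own degree-1 fit agrees: [1.7, ~0.0]
```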

An Example
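As a small illustration (made-up data, the same four points used in the code sketch above), take the points (1, 2), (2, 3), (3, 5), (4, 7). The averages are

\langle x \rangle = \frac{1+2+3+4}{4} = 2.5, \quad \langle y \rangle = \frac{2+3+5+7}{4} = 4.25, \quad \langle xy \rangle = \frac{2+6+15+28}{4} = 12.75, \quad \langle x^2 \rangle = \frac{1+4+9+16}{4} = 7.5

so the slope and intercept are

m = \frac{12.75 - (2.5)(4.25)}{7.5 - (2.5)^2} = \frac{2.125}{1.25} = 1.7, \qquad b = 4.25 - (1.7)(2.5) = 0

and the least-squares line is y = 1.7x.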

How it Works

The least-squares method works by minimizing the square of the error. Error is defined as the vertical distance from the actual data point to the corresponding point on the least-squares line.

Note that we want the total sum of the squares of the errors to be minimized, not just individual errors. We square the errors to avoid cancellation between positive and negative errors. One line may minimize the error with respect to a certain data point, but we are interested in minimizing the total error, as the sketch below illustrates.
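To see this numerically, here is a brief sketch (Python with NumPy, reusing the same made-up points as above) that compares the total squared error of the least-squares line with that of another line; any other choice of m and b gives a larger total:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 7.0])

def total_squared_error(m, b):
    """Sum of squared vertical distances between the data and the line y = m*x + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)

print(total_squared_error(1.7, 0.0))  # least-squares line: total error ~ 0.30
print(total_squared_error(1.0, 1.0))  # passes exactly through the first two points, yet total error = 5.0
```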

