Linear Regression

Least Squares Regression

Introduction

Suppose that we have a set of data on two variables $(x, y)$. For example, we could have data on the age and weight of a group of individuals, data on height and average hours of sleep, and so on. We may hypothesize that the variable $y$ depends on the variable $x$. In other words, we may guess that there is a linear relationship between our two variables. In math, we write $y = mx + b$. Our goal is to find the parameters $m$ and $b$ that best fit our data. It turns out that there is a unique pair of parameters that does so: only one value of $m$ and one value of $b$ give us the best-fitting formula $y = mx + b$.

Of course, sometimes our best-fit line will describe the data well, and other times it will not.

In this section, we will learn how to find the least-squares regression line. In the section on correlation, you have already learned how to quantify how good the fit is!
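As a preview of where we are headed, NumPy (covered in an earlier module) can find the best-fit slope and intercept for us. Here is a minimal sketch, with made-up sleep and height data used purely for illustration:

```python
import numpy as np

# Made-up data: average hours of sleep (x) and height in inches (y) for five people.
x = np.array([6.0, 6.5, 7.0, 8.0, 9.0])
y = np.array([63, 66, 65, 68, 70])

# np.polyfit with degree 1 fits a line y = m*x + b and returns the pair [m, b].
m, b = np.polyfit(x, y, 1)
print(m, b)
```

In the rest of this section we will see how these two numbers are actually computed.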

Finding the Slope and the Intercept

To find the slope:

$$m = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^2 \rangle - \langle x \rangle^2}$$

This may look intimidating at first, but remember that the notation $\langle x \rangle$ just means the expectation value (the average) of $x$.

To find the y-intercept:

$$b = \langle y \rangle - m \langle x \rangle$$
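Here is a minimal sketch of these two formulas in Python, assuming NumPy is available and the data are stored in two arrays (the function name is just for illustration):

```python
import numpy as np

def least_squares_fit(x, y):
    """Return the slope m and intercept b of the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # <xy>, <x>, <y>, and <x^2> below are the average values used in the formulas above.
    m = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
    b = np.mean(y) - m * np.mean(x)
    return m, b
```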

An Example
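As a small, made-up illustration (the numbers are purely hypothetical), suppose we record the ages and weights of five people and feed them to the `least_squares_fit` sketch above:

```python
x = [20, 30, 40, 50, 60]       # ages in years (made-up data)
y = [120, 140, 155, 160, 180]  # weights in pounds (made-up data)

m, b = least_squares_fit(x, y)
# Here <x> = 40, <y> = 151, <xy> = 6320, and <x^2> = 1800,
# so m = (6320 - 40*151) / (1800 - 40**2) = 1.4 and b = 151 - 1.4*40 = 95.
# The best-fit line is therefore y = 1.4x + 95.
print(m, b)
```

According to this made-up data, each additional year of age is associated with roughly 1.4 extra pounds of weight.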

How it Works

The least-squares method works by minimizing the sum of the squared errors. The error for a data point is the vertical distance from that point to the point on the least-squares line at the same value of $x$.

Note that we want to minimize the total sum of the squared errors, not just individual errors. We square the errors so that positive and negative errors cannot cancel each other out when we add them up. One particular line may have a very small error for a certain data point, but we are interested in minimizing the total error across all of the data.
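To make this concrete, here is a small sketch (reusing the made-up age and weight data from the example above) that computes the total squared error for the least-squares line and for one alternative line; the least-squares line comes out with the smaller total:

```python
import numpy as np

def sum_squared_errors(x, y, m, b):
    """Total squared vertical distance between the data and the line y = m*x + b."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    errors = y - (m * x + b)       # vertical errors, positive or negative
    return np.sum(errors ** 2)     # squaring keeps them from cancelling when summed

x = [20, 30, 40, 50, 60]
y = [120, 140, 155, 160, 180]

print(sum_squared_errors(x, y, 1.4, 95))   # least-squares line: total error 60.0
print(sum_squared_errors(x, y, 1.5, 90))   # passes exactly through the first and last
                                           # points, but its total error is larger (75.0)
```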

Weighted Least Squares Regression

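The idea behind weighted least squares is that some data points may count more than others: each point $(x_i, y_i)$ gets a weight $w_i$ (for example, reflecting how reliable that measurement is), and we minimize the weighted sum of squared errors $\sum_i w_i (y_i - (mx_i + b))^2$. The slope and intercept formulas above still apply, as long as every expectation value $\langle \cdot \rangle$ is computed as a weighted average. Here is a minimal sketch of that idea in Python, with an illustrative function name and made-up weights:

```python
import numpy as np

def weighted_least_squares_fit(x, y, w):
    """Slope and intercept of the line minimizing the weighted sum of squared errors."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    def avg(v):
        # Weighted expectation value <v>: each point counts in proportion to its weight.
        return np.sum(w * v) / np.sum(w)
    m = (avg(x * y) - avg(x) * avg(y)) / (avg(x ** 2) - avg(x) ** 2)
    b = avg(y) - m * avg(x)
    return m, b

# With all weights equal, this reduces to the ordinary least-squares fit above.
x = [20, 30, 40, 50, 60]
y = [120, 140, 155, 160, 180]
print(weighted_least_squares_fit(x, y, [1, 1, 1, 1, 1]))   # same m and b as before
print(weighted_least_squares_fit(x, y, [1, 1, 1, 1, 5]))   # the last point counts more
```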