Opportunity Through Data Textbook

Intro to Graphs

How do we visualize relationships between values?

Last updated 4 years ago

KEY TERMS

  • Qualitative Data: data that is not made up of numbers. For example - collections of interviews, observations, etc.

  • Quantitative Data: data that is made up of numbers. For example - distance, height, time

  • Categorical Data: data that is separated into categories (groups). For example - eye color

  • Continuous Data: data that is measured on a range. For example - weight

What is a graph?

A graph lets us show the relationship between sets of data or variables visually. Each type of graph is useful for representing different types of data, all of which we will go into as we work through this chapter.

Line graphs

So, what is a line graph? In its simplest form, it is just as it sounds: a straight line! A straight line is defined by a linear function, usually written in the form y = mx + b, where y is the dependent variable (what we're observing changes in), x is the independent variable (what we're changing), m is the slope (the steepness of the line), and b is the intercept (the value of y where the line crosses the y-axis, at x = 0).

The independent variable is displayed on the x-axis, and the dependent variable is on the y-axis.

Here is an example of a line graph.

Before moving on, make sure you know what each part of the equation in the image above means!

Line graphs can also be quadratic (i.e. of the form y = ax^2 + bx + c), which means they look like a 'U' and have a curved shape.

They can also be to the third, fourth, fifth... etc. degree of x! The higher the degree, the more complex the graph, so we will mostly be focusing on linear graphs in this course. Line graphs are very useful for studying chronological trends and patterns. A chronological trend is shown by a line graph whose x-axis is time and whose y-axis is the variable changing over time, so that the graph as a whole shows the change in our dependent variable over time.

CHECKPOINT 1:

What is the value of y = -(1/2)x + 12 when x = 20?

y = -(1/2)(20) + 12

y = -10 + 12

y = 2

What is the value of x that causes the equation in question 1 to take on the value -4? Try to plug your answer back in to make sure it's correct!

-4 = -(1/2)x + 12

-4 - 12 = -(1/2)x

-16 = -(1/2)x

32 = x

So the value of x that causes y to equal -4 is 32.
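
Both checkpoint answers can be double-checked with a few lines of Python. This is a minimal sketch: the function names `line` and `solve_for_x` are made up for illustration, and the default slope and intercept are the ones from the checkpoint equation.

```python
def line(x, m=-0.5, b=12):
    """Evaluate the linear function y = m*x + b."""
    return m * x + b


def solve_for_x(y, m=-0.5, b=12):
    """Rearrange y = m*x + b into x = (y - b) / m (only works when m is not 0)."""
    return (y - b) / m


print(line(20))         # 2.0
print(solve_for_x(-4))  # 32.0
print(line(32))         # -4.0, plugging the answer back in
```

Plugging the solved x back into `line` is the same "check your work" step the question asks for.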

Scatter plot

Like line graphs, scatter plots represent data points in (x, y) coordinate pairs. Unlike line graphs, scatter plots do not connect the different data points. Whether you choose a line graph or a scatter plot depends on how much data you have, how close the points are to each other, and how many variables you are comparing.

Later on in this course we will learn about regression, which will allow us to draw a straight line through a scatter plot so we can better understand the trends of the data. For now, you don't have to think about combining them, but see if you can try to guess how the two will go together!

Think about what correlations can be shown by scatter plots. These can be positive, negative, or nonexistent (i.e. no correlation).

Below are three examples of scatter plots. Assume that each graph depicts lemonade sales (on the y-axis) in relation to another variable (on the x-axis).

The first example shows the relationship between temperature and lemonade sales. As temperature rises, lemonade sales also rise. Therefore, there is a positive correlation.

The second example shows the relationship between distance from an elementary school and lemonade sales. As distance increases, lemonade sales decrease. Therefore, there is a negative correlation.

The third example shows the relationship between the number of butterflies in the air and lemonade sales. Since butterflies do not affect lemonade buying behavior, there is no correlation.
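
One way to put a number on "positive", "negative", or "no" correlation is the Pearson correlation coefficient, which this chapter does not cover but which ranges from -1 to 1. The sketch below computes it by hand. All of the lemonade numbers are invented for illustration and are not the data behind the plots.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y move together, and how much each varies on its own
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


# Made-up example data (assumptions, not taken from the figures)
temperature = [60, 65, 70, 75, 80, 85]
sales_by_temperature = [12, 15, 19, 22, 27, 30]

distance = [1, 2, 3, 4, 5, 6]
sales_by_distance = [30, 26, 21, 17, 12, 9]

butterflies = [1, 2, 3, 4, 5, 6]
sales_by_butterflies = [10, 20, 15, 15, 20, 10]

r_temperature = correlation(temperature, sales_by_temperature)
r_distance = correlation(distance, sales_by_distance)
r_butterflies = correlation(butterflies, sales_by_butterflies)

print(r_temperature > 0)   # True: positive correlation
print(r_distance < 0)      # True: negative correlation
print(r_butterflies == 0)  # True: no correlation in this made-up data
```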

CHECKPOINT 2:

How would you categorize the correlation shown in the scatter plot below?

There is no correlation between head circumference and height, according to the data in the plot.

Different types of data

Due to how coordinate systems are set up, both of your variables have to be numerical to be plotted on a line graph or scatter plot. However, data comes in all shapes and forms: it can be qualitative (descriptive, usually through words) or quantitative (represented by numbers, or numerical). Try to think of examples of data that are qualitative, and others that are quantitative.

Data can also be part of a categorical distribution, meaning that it can be placed in different categories. A category is like a box with a label that you can put certain objects in if they belong. If you are collecting data about people's eye color, you might have these categories: brown, green, and blue. When you collect data from a person, you put them in one of these categories.
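
In Python, sorting responses into labeled boxes and counting what lands in each one can be sketched with `collections.Counter` from the standard library. The survey responses below are hypothetical, invented only to show the idea.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration)
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Counter puts each response in its category and tallies the totals
counts = Counter(eye_colors)

print(counts["brown"])  # 3
print(counts["blue"])   # 2
print(counts["green"])  # 1
```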

Continuous data, on the other hand, are data measured on a continuous scale, such as weight and height: people can take on any value within a wide range. You aren't either five feet tall or six feet tall -- you can be anywhere in between, and outside of, that range!

If we were to conduct a survey and find the number and type of pet that each person in a classroom has, we would not be able to represent the data on a line graph or scatter plot. Both axes must be quantitative for line graphs and scatter plots. We need a new type of graph, which we will introduce in the next section.

CHECKPOINT 3:

Is eye color a categorical or a continuous variable? What about biological sex? Age? Distance?

Eye color: Categorical. There are well-defined categories individuals can fall under.

Sex: Categorical. People are almost entirely either biologically male or biologically female.

Age: Continuous. Age can take any value within a range rather than falling into a fixed set of categories.

Distance: Continuous. Similar idea to why height is continuous and not categorical.

Bar charts

Bar charts are visualizations of categorical distributions. Each bar represents one category, with its height being proportional to the frequency of that category. The bars are evenly spaced and are of equal thickness. It is up to you how you'd like to order the categories: you can choose to have them go from the category with the highest frequency to the category with the lowest, or you can choose to order them based on alphabetical order.
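
To make the two ordering choices concrete, here is a rough sketch that draws a bar chart as plain text, one `#` per unit of frequency. The function `text_bar_chart` and the pet counts are hypothetical and exist only for this example. A real chart would normally be drawn with a plotting library.

```python
def text_bar_chart(frequencies, order="frequency"):
    """Return one text line per category, with a bar of '#' marks.

    order="frequency" sorts from the highest frequency to the lowest;
    any other value sorts the categories alphabetically.
    """
    if order == "frequency":
        items = sorted(frequencies.items(), key=lambda pair: pair[1], reverse=True)
    else:
        items = sorted(frequencies.items())
    # Pad labels to the same width so every bar starts in the same column
    label_width = max(len(name) for name in frequencies)
    return [f"{name.ljust(label_width)} | {'#' * count}" for name, count in items]


# Hypothetical pet counts (made up for illustration)
pets = {"dog": 7, "cat": 5, "fish": 2}

for row in text_bar_chart(pets):
    print(row)
# dog  | #######
# cat  | #####
# fish | ##
```

Calling `text_bar_chart(pets, order="alphabetical")` lists cat, dog, fish instead. The bars themselves do not change, only their order.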

The bar chart below shows the profits for a certain store (x-axis) along with the state that the store is in (y-axis). Virginia has the most profits, and North Carolina has negative profits.

CHECKPOINT 4:

How do you feel about the bar chart below? Does it convey the data without confusion?

While we can understand the data shown in the graph, it is not a good bar chart. The bars are different thicknesses, and the spacing between the bars is also variable. There are also no labels on the axes to help us understand the variables being shown.

Histograms

What if your data isn't categorical, but you would like to display it like a bar chart? For example, if you want to graph the distribution of height among a group of people, you need to create height categories. We can do this using histograms. Histograms show how often the values in our data fall into each range.

While histograms and bar charts look very similar, they have different functionalities.

The Horizontal Axis

The horizontal axis of histograms is divided into bins that are contiguous (share a border). Bins include data at the left endpoint, but not at the right endpoint. This is called an endpoint convention. The bins can be of different widths depending on the data you have. In the example above all the bins are the same width, approximately 1/2 a unit. Later on in this course you will learn how to adjust the bins and the scale to make the graph more readable depending on the data you have.
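
The endpoint convention can be sketched in code as follows. The helper `bin_index` and the bin edges are assumptions made for this example (half-unit bins, like the ones described above). Each bin includes its left edge and excludes its right edge.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return i such that edges[i] <= value < edges[i + 1], or None if out of range.

    edges must be sorted from smallest to largest.
    """
    if value < edges[0] or value >= edges[-1]:
        return None
    # bisect_right counts how many edges are <= value
    return bisect_right(edges, value) - 1


# Hypothetical bin edges: three contiguous bins, each 1/2 a unit wide
edges = [0, 0.5, 1.0, 1.5]

print(bin_index(0.49, edges))  # 0: falls in [0, 0.5)
print(bin_index(0.5, edges))   # 1: the left endpoint belongs to [0.5, 1.0)
print(bin_index(1.5, edges))   # None: the right endpoint of the last bin is excluded
```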

The Vertical Axis

The vertical axis of histograms is frequently the density scale. This is true of the histogram on the right of Figure 7. The height of each bar is the percent of elements that fall into the bin, relative to the width of the bin. Let's see how to calculate the height of the bin directly to the right of 0 in Figure 7.

Calculating the height

We know that N = 1000 (it's written on the axis), and we can see that the count for the bin we are interested in is approximately 200. That means that this bin contains 200/1000 = 0.2 of the total data points in the data set. The width of the bin is 1/2, so the height is 0.2/0.5 = 0.4 on the density scale. This can be seen in the graph on the right.

Don't forget that the height of the bin depends on the width!
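
The same calculation can be written as a small Python function. This is a sketch using the numbers from the example above. The name `density_height` is invented for illustration.

```python
def density_height(count, total, width):
    """Height of a histogram bar on the density scale.

    First find the proportion of all data points in the bin,
    then divide by the bin's width.
    """
    proportion = count / total
    return proportion / width


height = density_height(count=200, total=1000, width=0.5)
print(height)  # 0.4

# Multiplying the height by the width recovers the proportion in the bin
area = height * 0.5
print(area)  # 0.2
```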

Density versus count

So why would we ever display data using density instead of using counts, if the histogram looks the same in both cases? The answer is that it doesn't always! When bins have different widths, the height of a count-based bar can be deceiving, because a wider bin covers a larger range and so tends to contain more data points. For example, consider the population below:

Age       Count
0-5       12
5-10      20
10-25     30
25-50     110
50-100    10

If we were to combine the last two bins and create a graph that shows count instead of density, the bar for ages 25 to 100 would have a height of 120 -- significantly higher than anything else! We also wouldn't be able to tell how differently the data is distributed between ages 25-50 and ages 50-100.
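
To see what the density scale changes, the sketch below computes the density of every bin in the age table, plus the combined 25-100 bin. The counts come from the table. The helper name `density` is an assumption made for this example.

```python
# (left edge, right edge, count) for each bin in the age table
bins = [(0, 5, 12), (5, 10, 20), (10, 25, 30), (25, 50, 110), (50, 100, 10)]
total = sum(count for _, _, count in bins)


def density(left, right, count, total):
    """Proportion of all data points in the bin, divided by the bin's width."""
    return (count / total) / (right - left)


for left, right, count in bins:
    print(f"{left}-{right}: count={count}, density={density(left, right, count, total):.4f}")

# Combining the last two bins gives the largest count (120)...
combined = density(25, 100, 110 + 10, total)
# ...but its density is lower than that of the much narrower 5-10 bin
print(combined < density(5, 10, 20, total))  # True
```

On the density scale the wide combined bar is no longer the tallest, because its 120 people are spread over 75 years of age rather than 5.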

Histogram calculations and defining properties

Histograms have two main properties:

  • The bins are scaled and contiguous

  • The area of each bar is proportional to the number of entries in its bin

The area of a rectangle is given by area = height ⋅ width, so:

area of bar = height of bar ⋅ width of bin

However, this is not the only way we can use the formula. It can be rewritten as the following if we divide both sides by the width of the bar.

height of bar = area of bar / width of bin

CHECKPOINT 5:

What is the area of the bar when the number in set is 3?

The height of the bar is 4. The width of the bar is 1.

And, the area = height * width,

so:

area = 4 * 1 = 4%

Find the area of the rectangles with the following dimensions:

a. width = 6 cm, height = 7 cm

b. width = 7.5 in., height = 2 in.

a. Area = width * height = 6 * 7 = 42 cm^2

b. Area = width * height = 7.5 * 2 = 15 in^2

Bar charts vs histograms

Some key differences:

  • Bar charts display one quantity per category. They are often used to display the distributions of categorical variables. Histograms display the distributions of quantitative variables.

  • All the bars in a bar chart have the same width, and there is an equal amount of space between consecutive bars. The bars of a histogram can have different widths, and they are contiguous.

  • The lengths (or heights, if the bars are drawn vertically) of the bars in a bar chart are proportional to the value for each category. The heights of bars in a histogram measure densities; the areas of bars in a histogram are proportional to the numbers of entries in the bins.

Source: Lugezi.com
Figure 7