Intro to Graphs

How do we visualize relationships between values?

KEY TERMS

  • Qualitative Data: data that is not made up of numbers. For example - collections of interviews, observations, etc.

  • Quantitative Data: data that is made up of numbers. For example - distance, height, time

  • Categorical Data: data that is separated into categories (groups). For example - eye color

  • Continuous Data: data that is measured on a range. For example - weight

What is a graph?

A graph lets us show the relationship between sets of data or variables visually. Each type of graph is useful for representing different types of data, all of which we will go into as we work through this chapter.

Line graphs

So, what is a line graph? In its simplest form, it's as it sounds: a straight line! A straight line is defined by a linear function, usually in the form of y = mx + b, where y is the dependent variable (the one we observe changes in), x is the independent variable (the one we change), m is the slope (the steepness of the line), and b is the intercept (the value of y where the line crosses the y-axis, at x = 0).

The independent variable is displayed on the x-axis, and the dependent variable is on the y-axis.
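
As a small illustration of y = mx + b, here is a minimal Python sketch. The function name line_value and the sample slope and intercept are made up for this example; they are not taken from the chapter's figure.

```python
def line_value(x, m, b):
    """Return y = m*x + b for a given x, slope m, and intercept b."""
    return m * x + b

# Hypothetical line with slope 2 and intercept 1.
slope, intercept = 2, 1

# At x = 0 the line meets the y-axis, so y equals the intercept.
print(line_value(0, slope, intercept))  # 1

# Each step of 1 in x raises y by the slope.
print(line_value(3, slope, intercept))  # 7
print(line_value(4, slope, intercept))  # 9
```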

Here is an example of a line graph.

Before moving on, make sure you know what each part of the equation in the image above means!

Line graphs can also be quadratic (i.e. of the form y = ax^2 + bx + c), which means they look like a 'U' and have a curved shape.
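
To see the 'U' shape in numbers, the sketch below evaluates a quadratic at a few values of x. The function name quadratic_value and the coefficients are assumptions chosen only for illustration.

```python
def quadratic_value(x, a, b, c):
    """Return y = a*x^2 + b*x + c."""
    return a * x**2 + b * x + c

# Hypothetical quadratic y = x^2 - 4x + 3 (a = 1, b = -4, c = 3).
ys = [quadratic_value(x, 1, -4, 3) for x in range(0, 5)]

# The values fall, reach a lowest point, then rise again: a 'U'.
print(ys)  # [3, 0, -1, 0, 3]
```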

They can also be to the third, fourth, fifth... etc. degree of x! The higher the degree, the more complex the graph, so we will mostly be focusing on linear graphs in this course. Line graphs are very useful for studying chronological trends and patterns. A chronological trend is a line graph whose x-axis shows time and whose y-axis shows the variable that is changing, so that the graph as a whole shows the change in our dependent variable over time.

CHECKPOINT 1:

What is the value of y = -(1/2)x + 12 when x = 20?

Scatter plot

Like line graphs, scatter plots represent data points in (x, y) coordinate pairs. Unlike line graphs, scatter plots do not connect the different data points. Whether you choose a line graph or a scatter plot depends on how much data you have, how close the points are to each other, and how many variables you are comparing.

Later on in this course we will learn about regression, which will allow us to draw a straight line through a scatter plot so we can better understand the trends of the data. For now, you don't have to think about combining them, but see if you can try to guess how the two will go together!

Think about what correlations can be shown by scatter plots. These can be positive, negative, or nonexistent (i.e. no correlation).

Below are three examples of scatter plots. Assume that each graph depicts lemonade sales (on the y-axis) in relation to another variable (on the x-axis).

The first example shows the relationship between temperature and lemonade sales. As temperature rises, lemonade sales also rise. Therefore, there is a positive correlation.

The second example shows the relationship between distance from an elementary school and lemonade sales. As distance increases, lemonade sales decrease. Therefore, there is a negative correlation.

The third example shows the relationship between the number of butterflies in the air and lemonade sales. Since butterflies do not affect lemonade buying behavior, there is no correlation.
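
One common way to put a number on the direction of a correlation is the Pearson correlation coefficient. This chapter does not cover it, so treat the sketch below only as an illustration. All of the lemonade numbers are invented for the example, and the helper name pearson_r is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: positive, negative, or near zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: sales rise with temperature.
temperature = [60, 65, 70, 75, 80, 85]
sales_by_temp = [12, 15, 19, 22, 27, 30]

# Made-up data: sales fall with distance from the school.
distance = [1, 2, 3, 4, 5, 6]
sales_by_distance = [30, 26, 21, 17, 12, 9]

# Made-up data: sales show no pattern as butterflies increase.
butterflies = [1, 2, 3, 4, 5]
sales_by_butterflies = [10, 13, 9, 13, 10]

print(pearson_r(temperature, sales_by_temp) > 0)              # True: positive
print(pearson_r(distance, sales_by_distance) < 0)             # True: negative
print(abs(pearson_r(butterflies, sales_by_butterflies)) < 0.1)  # True: none
```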

CHECKPOINT 2:

How would you categorize the correlation shown in the scatter plot below?

Different types of data

Because of how coordinate systems are set up, both variables in a line graph or scatter plot have to be numerical. However, data comes in all shapes and forms: it can be qualitative (descriptive, usually through words) or quantitative (represented by numbers, or numerical). Try to think of examples of data that are qualitative, and others that are quantitative.

Data can also be part of a categorical distribution, meaning that it can be placed in different categories. A category is like a box with a label that you can put certain objects in if they belong. If you are collecting data about people's eye color, you might have these categories: brown, green, and blue. When you collect data from a person, you put them in one of these categories.

Continuous data, on the other hand, are data that are on a rolling scale, such as weight and height: people take on a wide range of values within the extremes of these values. You aren't either five feet tall or six feet tall -- you can be anywhere in between, and outside of, that range!

If we were to conduct a survey and find the number and type of pet that each person in a classroom has, we would not be able to represent the data on a line graph or a scatter plot, because both axes must be quantitative for those graphs. We need a new type of graph, which we will introduce in the next section.

CHECKPOINT 3:

Is eye color a categorical or a continuous variable? What about biological sex? Age? Distance?

Bar charts

Bar charts are visualizations of categorical distributions. Each bar represents one category, with its height being proportional to the frequency of that category. The bars are evenly spaced and are of equal thickness. It is up to you how you'd like to order the categories: you can choose to have them go from the category with the highest frequency to the category with the lowest, or you can choose to order them based on alphabetical order.
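
Before drawing a bar chart, you need the frequency of each category. The sketch below counts categories with Python's collections.Counter and shows both orderings described above. The survey responses are invented for illustration, and the text bars are only a stand-in for a real chart.

```python
from collections import Counter

# Hypothetical survey responses (one eye color per person).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Count how many times each category appears.
counts = Counter(eye_colors)

# Order 1: highest frequency first.
by_frequency = counts.most_common()
print(by_frequency)  # [('brown', 3), ('blue', 2), ('green', 1)]

# Order 2: alphabetical by category name.
alphabetical = sorted(counts.items())
print(alphabetical)  # [('blue', 2), ('brown', 3), ('green', 1)]

# A rough text version of the chart: one '#' per person in the category.
for category, count in by_frequency:
    print(f"{category:<6} {'#' * count}")
```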

The bar chart below shows the profits for a certain store (x-axis) along with the state that the store is in (y-axis). Virginia has the most profits, and North Carolina has negative profits.

CHECKPOINT 4:

How do you feel about the bar chart below? Does it convey the data without confusion?

Histograms

What if your data isn't categorical, but you would like to display it like a bar chart? For example, if you want to graph the distribution of height among a group of people, you need to create height categories. We can do this using histograms. Histograms show how often values appear in our data.

While histograms and bar charts look very similar, they have different functionalities.

The Horizontal Axis

The horizontal axis of histograms is divided into bins that are contiguous (share a border). Bins include data at the left endpoint, but not at the right endpoint. This is called an endpoint convention. The bins can be of different widths depending on the data you have. In the example above all the bins are the same width, approximately 1/2 a unit. Later on in this course you will learn how to adjust the bins and the scale to make the graph more readable depending on the data you have.
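
The endpoint convention can be sketched in a few lines of Python. The function name bin_index and the bin edges below are assumptions for illustration; plotting tools may treat the final right endpoint differently.

```python
from bisect import bisect_right

def bin_index(value, edges):
    """Return i such that edges[i] <= value < edges[i + 1], or None.

    Each bin includes its left endpoint but not its right endpoint.
    """
    if value < edges[0] or value >= edges[-1]:
        return None  # value falls outside every bin
    return bisect_right(edges, value) - 1

# Hypothetical contiguous bins of width 0.5: [0, 0.5), [0.5, 1.0), [1.0, 1.5)
edges = [0, 0.5, 1.0, 1.5]

print(bin_index(0.49, edges))  # 0
print(bin_index(0.5, edges))   # 1, because 0.5 is the left endpoint of the second bin
print(bin_index(1.5, edges))   # None, because 1.5 is the right endpoint of the last bin
```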

The Vertical Axis

The vertical axis of histograms is frequently the density scale. This is true of the histogram on the right of Figure 7. The height of each bar is the percent of elements that fall into the bin, relative to the width of the bin. Let's see how to calculate the height of the bin directly to the right of 0 in Figure 7.

Calculating the height

We know that N = 1000 (it's written on the axis), and we can see that the count for the bin we are looking at is approximately 200. That means that this bin contains 200/1000 = 0.2 of the total data points in the data set. The width of the bin is 1/2, so the height is 0.2/0.5 = 0.4, that is, 0.4 of the data set per unit of width. This can be seen in the graph on the right.
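
The same calculation can be written as a short Python sketch. The function name density_height is an assumption; the numbers are the ones used in the calculation above.

```python
def density_height(count, total, width):
    """Height of a histogram bar on the density scale."""
    fraction = count / total   # share of all data points in this bin
    return fraction / width    # share per unit of bin width

# Bin with about 200 of the N = 1000 points and a width of 1/2.
print(density_height(200, 1000, 0.5))  # 0.4

# The same 200 points in a bin twice as wide give a bar half as tall.
print(density_height(200, 1000, 1.0))  # 0.2
```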

Don't forget that the height of the bin depends on the width!

Density versus count

So why would we ever display data using density instead of counts, if the histogram looks the same in both cases? The answer is that it doesn't always! When bins have different widths, the height of a count bar can be deceiving: a wider bin covers a larger range, so more data points fall into it. For example, consider the population below:

Age       Count
0-5       12
5-10      20
10-25     30
25-50     110
50-100    10

If we were to combine the last two bins and create a graph that shows count instead of density, the bar for people between 25 and 100 would have a height of 120, significantly higher than anything else! We wouldn't be able to see how differently people are distributed between ages 25-50 and 50-100.
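
To compare the two scales on this population, the sketch below computes the density height of each bin as the percent of people per year of age. The counts and bin edges come from the table above; the function name percent_per_year is an assumption.

```python
# Age bins and counts from the table: (lower edge, upper edge, count).
bins = [(0, 5, 12), (5, 10, 20), (10, 25, 30), (25, 50, 110), (50, 100, 10)]
total = sum(count for _, _, count in bins)

def percent_per_year(lower, upper, count, total):
    """Height on the density scale: percent of people per year of age."""
    return 100 * count / total / (upper - lower)

for lower, upper, count in bins:
    height = percent_per_year(lower, upper, count, total)
    print(f"{lower}-{upper}: count = {count}, density = {height:.2f}% per year")

# Combine the last two bins into one bin from 25 to 100.
merged_count = 110 + 10
merged_height = percent_per_year(25, 100, merged_count, total)
print(f"25-100: count = {merged_count}, density = {merged_height:.2f}% per year")
```

On the count scale the merged 25-100 bar is the tallest, but on the density scale it is shorter than the 0-5, 5-10, and 10-25 bars, because its 120 people are spread over 75 years.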

Histogram calculations and defining properties

Histograms have two main properties:

  • The bins are drawn to scale and are contiguous

  • The area is proportional to the number of entries in each bin

The area of a rectangle is given by area = height ⋅ width, so for each bar:

area of bar = height of bar ⋅ width of bin

However, this is not the only way we can use the formula. If we divide both sides by the width of the bin, it can be rewritten as:

height of bar = area of bar / width of bin

CHECKPOINT 5:

What is the area of the bar when the number in set is 3?

Bar charts vs histograms

Some key differences:

  • Bar charts display one quantity per category. They are often used to display the distributions of categorical variables. Histograms display the distributions of quantitative variables.

  • All the bars in a bar chart have the same width, and there is an equal amount of space between consecutive bars. The bars of a histogram can have different widths, and they are contiguous.

  • The lengths (or heights, if the bars are drawn vertically) of the bars in a bar chart are proportional to the value for each category. The heights of bars in a histogram measure densities; the areas of bars in a histogram are proportional to the numbers of entries in the bins.
