Matplotlib

How can we effectively visualize data?

KEY TERMS:

  • Matplotlib : a Python plotting package used to visualize data.

  • Visualizations : graphs, maps, charts, or other representations of data that give us another way to think about it.

Introduction to Matplotlib

In this section, we will learn more about the package called Matplotlib. Matplotlib is a package within Python that is used to make graphs. To import the package, we typically use the following code:

import matplotlib
%matplotlib inline

The first line imports the package, as we have seen before with NumPy. The second line is a Jupyter Notebook "magic" command rather than ordinary Python: it ensures that graphs are displayed directly underneath the code you write within a cell of the Notebook.

Together, this code helps us create visualizations within Jupyter Notebooks. Visualizations are graphs, maps, charts, or other representations of data that give us another way to think about it.

Within the Matplotlib package, there is a group of functions under the group name pyplot. This group of functions helps us create the visualizations that you have made in the past such as histograms, line graphs, bar charts, and scatterplots. The code used to access these functions is the following:

import matplotlib.pyplot as plt

Writing matplotlib.pyplot specifies which part of the package we want, and the second part, as plt, tells Python to let us refer to that part by the shorter name plt. We will use plt as we make graphs in the future.
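
To see that plt is only a shorter name, here is a minimal sketch (not part of the original notebook) that imports the module both ways and checks that the two names point to the same thing. The names same_module and same_function are made up for illustration.

```python
import matplotlib
import matplotlib.pyplot as plt

# plt is an alias: both names refer to the same module object
same_module = plt is matplotlib.pyplot
print(same_module)

# so plt.scatter and matplotlib.pyplot.scatter are the same function
same_function = plt.scatter is matplotlib.pyplot.scatter
print(same_function)
```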

Guide to Plotting

A key part of any graph is its labels. Every graph you make should have appropriate axis labels and a title. Let's walk through some different plots to understand how to label our graphs.

For the following few examples, we will use a data set about an ice cream shop!

Scatterplot

If we want to analyze the relationship between two variables, we can make a scatterplot. To create a scatterplot, we will use the plt.scatter() function from pyplot. Let's look at the relationship between the number of scoops sold per month and the average temperature of that month.

scoops_per_month = [49, 86, 115, 342, 942, 1113, 1407, 1812, 1002, 400, 102, 51]
average_temp = [40, 52, 60, 62, 65, 70, 73, 75, 68, 57, 53, 41]

plt.scatter(scoops_per_month, average_temp)
plt.show()

Now we have a scatterplot! However, there is something missing on this plot... this plot needs labels! In order to add labels, we can use the following functions:

  • plt.xlabel() -- to label x axis

  • plt.ylabel() -- to label y axis

  • plt.title() -- to give your plot a title

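# Redraw the scatterplot, then add the labels before showing it
plt.scatter(scoops_per_month, average_temp)
plt.xlabel("Number of Scoops Sold per Month")
plt.ylabel("Average Monthly Temperature")
plt.title("Scoops Sold & Temperature")
plt.show()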
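
The plt functions above always act on the current figure. As an alternative sketch, Matplotlib also lets you create the figure and its axes explicitly with plt.subplots() and label the axes object directly. The names fig and ax are conventional choices here, not something the original example uses.

```python
import matplotlib.pyplot as plt

scoops_per_month = [49, 86, 115, 342, 942, 1113, 1407, 1812, 1002, 400, 102, 51]
average_temp = [40, 52, 60, 62, 65, 70, 73, 75, 68, 57, 53, 41]

# Create one figure containing one set of axes
fig, ax = plt.subplots()

# Draw the points and label the plot through the axes object
ax.scatter(scoops_per_month, average_temp)
ax.set_xlabel("Number of Scoops Sold per Month")
ax.set_ylabel("Average Monthly Temperature")
ax.set_title("Scoops Sold & Temperature")
plt.show()
```

Both styles produce the same graph. The explicit style can be easier to follow once a notebook contains several plots.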

Line Graphs

Another way to visualize data is a line graph! Line graphs are typically used to study chronological trends and patterns. To create a line graph, we can use the plt.plot() function from pyplot. Let's use a line graph to visualize the number of scoops sold over time.

scoops_per_month = [49, 86, 115, 342, 942, 1113, 1407, 1812, 1002, 400, 102, 51]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

plt.plot(months, scoops_per_month)
plt.xlabel("Month")
plt.ylabel("Number of Scoops Sold")
plt.title("Number of Scoops Sold Over Time")
plt.show()

Using our line graph, can you find an association between certain months and the number of scoops sold? With this line graph, we can see that the number of scoops sold increases from January to August and decreases from August to December.
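
The trend described above can also be checked directly from the data. This is a small sketch in plain Python. The names peak_index, peak_month, rises_to_peak, and falls_after_peak are made up for illustration.

```python
scoops_per_month = [49, 86, 115, 342, 942, 1113, 1407, 1812, 1002, 400, 102, 51]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

# Find the month with the most scoops sold
peak_index = scoops_per_month.index(max(scoops_per_month))
peak_month = months[peak_index]

# Split the year at the peak (the peak month belongs to both halves)
before = scoops_per_month[:peak_index + 1]
after = scoops_per_month[peak_index:]

# Sales should rise every month up to the peak and fall every month after it
rises_to_peak = all(a < b for a, b in zip(before, before[1:]))
falls_after_peak = all(a > b for a, b in zip(after, after[1:]))

print(peak_month, rises_to_peak, falls_after_peak)
```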

Bar Chart

Bar charts are useful for categorical distributions, that is, data that can be broken into categories (rather than being numerical data). A bar chart displays one bar for each category.

In our data set, we have information about the number of scoops sold for each flavor of ice cream. In this example, the flavor of ice cream can be considered the category, and thus the bar chart below will display one bar per flavor of ice cream. The height of the bar refers to the frequency of the category, which in this case refers to the number of scoops sold for each flavor.

flavors = ["chocolate", "vanilla", "strawberry", "oreo"]
count_per_flavor = [43, 51, 19, 24]

plt.bar(flavors, count_per_flavor)
plt.xlabel("Flavor")
plt.ylabel("Scoops Sold")
plt.title("Scoops Sold Per Flavor")
plt.show()

Using the bar graph above, we can easily compare the frequency of each of the flavors!
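
Comparisons are often easier when the bars are ordered by height. The sketch below sorts the flavors from most to least scoops sold before plotting. The sorting step and the names pairs, sorted_flavors, and sorted_counts are additions for illustration, not part of the original example.

```python
import matplotlib.pyplot as plt

flavors = ["chocolate", "vanilla", "strawberry", "oreo"]
count_per_flavor = [43, 51, 19, 24]

# Pair each flavor with its count and sort from largest to smallest count
pairs = sorted(zip(flavors, count_per_flavor), key=lambda pair: pair[1], reverse=True)
sorted_flavors = [flavor for flavor, count in pairs]
sorted_counts = [count for flavor, count in pairs]

plt.bar(sorted_flavors, sorted_counts)
plt.xlabel("Flavor")
plt.ylabel("Scoops Sold")
plt.title("Scoops Sold Per Flavor (Sorted)")
plt.show()
```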

Histogram

A histogram looks similar to a bar chart; however, there are very important distinctions between the two that we will examine. To do so, let's first make a histogram using data on the number of scoops of ice cream each customer ordered.

scoops_per_customer = [1, 5, 4, 3, 1, 1, 1, 2, 3, 3, 4, 9, 1, 4, 4, 10, 11, 3, 5, 5, 5, 7, 1, 2, 3, 4, 8, 6, 6, 4, 2, 2, 1, 3, 8, 2, 3, 1, 1, 3]

plt.hist(scoops_per_customer)
plt.xlabel("Number of Scoops per Customer")
plt.ylabel("Count")
plt.show()

The histogram above helps us visualize where on the number line the data are most concentrated. We did not specify the bins, so 10 were created automatically. The width of a bin is the range of values it contains. Here the data run from 1 to 11, so each of the 10 bins has a width of 1. For example, the bin [1, 2) contains all values x where 1 <= x < 2. The height of each bin is the frequency of the values in that bin.
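
To check the bins yourself, note that plt.hist() returns the counts and the bin edges it used. The sketch below prints them for the default of 10 bins. The names counts, edges, and patches are our own. One detail worth knowing: every bin is half-open like [1, 2) except the last one, which also includes its right edge, so the final bin here is [10, 11].

```python
import matplotlib.pyplot as plt

scoops_per_customer = [1, 5, 4, 3, 1, 1, 1, 2, 3, 3, 4, 9, 1, 4, 4, 10, 11, 3, 5, 5,
                       5, 7, 1, 2, 3, 4, 8, 6, 6, 4, 2, 2, 1, 3, 8, 2, 3, 1, 1, 3]

# plt.hist returns the bar heights, the bin edges, and the drawn bar objects
counts, edges, patches = plt.hist(scoops_per_customer)
plt.xlabel("Number of Scoops per Customer")
plt.ylabel("Count")
plt.show()

# 11 edges mark the boundaries of 10 bins
print(list(edges))

# Number of customers that fall in each bin
print(list(counts))
```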

Now let's see what happens when we specify the number of bins. Furthermore, we will set density=True.

plt.hist(scoops_per_customer, bins=7, density=True)
plt.xlabel("Number of Scoops per Customer")
plt.ylabel("Density")
plt.show()

As you can see from the graph above, there are now only 7 bins. Furthermore, what happened when we set density=True? The y-axis no longer shows the count of values in each bin. Instead, the bar heights are scaled so that the total area of all the bars equals 1, which means the proportion of values in a bin is its height multiplied by its width. For example, the left-most bin spans 1 to about 2.43 (a width of about 1.43) and has a y-value of around 0.245, so it contains about 0.245 × 1.43 ≈ 0.35, or 35%, of the customers. These are the 14 of 40 customers who ordered 1 or 2 scoops.
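
With density=True, the bar heights are densities, so the proportion of values in a bin is its height times its width. The sketch below recovers those proportions from the values that plt.hist() returns. The names densities, widths, and proportions are our own.

```python
import matplotlib.pyplot as plt

scoops_per_customer = [1, 5, 4, 3, 1, 1, 1, 2, 3, 3, 4, 9, 1, 4, 4, 10, 11, 3, 5, 5,
                       5, 7, 1, 2, 3, 4, 8, 6, 6, 4, 2, 2, 1, 3, 8, 2, 3, 1, 1, 3]

# With density=True the first returned value holds densities, not counts
densities, edges, patches = plt.hist(scoops_per_customer, bins=7, density=True)
plt.xlabel("Number of Scoops per Customer")
plt.ylabel("Density")
plt.show()

# Width of each bin is the gap between neighbouring edges
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]

# Proportion of customers in each bin = bar height * bar width
proportions = [d * w for d, w in zip(densities, widths)]

# Share of customers in the left-most bin, then the total over all bins
print(round(proportions[0], 2))
print(round(sum(proportions), 2))
```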
