Introduction to Data Visualization

What constitutes an effective visualization?

Earlier, we talked about how data science helps us find answers to questions using information gathered from the world. As data scientists, part of our job is to communicate these answers effectively. This is where data visualizations come in handy: they can help us present a lot of information quickly, effectively, and in an engaging manner.

Here is a simple example of the usefulness of data visualizations. The following table shows ice cream preferences in a class of 10 people.

Ice Cream Flavor    Number of People
Chocolate           4
Strawberry          3
Vanilla             3

Below is a bar graph of the above data.

Do you prefer the table or the graph?
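As a rough sketch of how the table above turns into a bar graph, the snippet below draws a text-only version using nothing but the Python standard library. The function name `text_bar_chart` and the `#` symbol are illustrative choices, not part of any plotting library; in practice a plotting package would normally draw the chart.

```python
# Ice cream preferences from the table above (10 people in total).
preferences = {"Chocolate": 4, "Strawberry": 3, "Vanilla": 3}


def text_bar_chart(counts, symbol="#"):
    """Return one line per category, with a bar whose length equals the count.

    Hypothetical helper for illustration only.
    """
    width = max(len(label) for label in counts)  # pad labels so the bars line up
    lines = []
    for label, count in counts.items():
        lines.append(f"{label:<{width}} | {symbol * count} {count}")
    return lines


for line in text_bar_chart(preferences):
    print(line)
```

Even in this crude form, the longer bar for Chocolate stands out faster than the number 4 does in the table.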

Let's take a look at another example. Here is a table of drivers who were stopped by the police and subsequently searched.

As we can see, all this information is hard to digest at once. Instead, we can use a bar graph to present just the parts of the table we want to convey.

This bar graph reveals that Black and Hispanic drivers are more likely to be searched than White drivers, demonstrating bias.

Your choice of data visualization depends on what you want to represent and the information you want to convey. For example, if you want to compare values within a dataset, you might use a bar graph (as above) or a line graph. However, if you want to show the composition of something, you might use a pie chart. Below is a pie chart showing how many men and women are in Congress. As you can see, we have quite a way to go before women are properly represented!

If we want to look at trends or possible correlations, we can use scatter plots. The one shown below illustrates a correlation between years of education and income.

You can also compare trends between classes by overlaying graphs. The example below compares the total number of cases of COVID-19 over time among different age groups.

Overlaying Graphs

If we want to compare quantities in two or more cases, we can use an overlaid bar graph. The following graph compares the median weekly earnings of men and women by race.

The bar graph is better suited to our analysis than a scatter plot in this case for two reasons:

  • We are more interested in comparing quantities than in exploring trends or correlations.

  • Since the bar graph compares discrete categories, our x-axis has labels rather than numerical values, and a scatter plot requires numerical values on both axes.

As you can see, different types of data call for different visualizations, depending on the aim of the exploration, the types of data available, and what we want the visualization to communicate.
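The guidance in this section can be summarized as a small lookup, sketched below. The goal names, the `CHART_FOR_GOAL` table, and the `suggest_charts` helper are hypothetical and only restate the pairings described above; they are a memory aid rather than a rule that covers every dataset. The sample x-values are made up for illustration.

```python
# Goals and chart types as described in this section (names are illustrative).
CHART_FOR_GOAL = {
    "compare values": ["bar graph", "line graph"],
    "show composition": ["pie chart"],
    "explore trends or correlation": ["scatter plot"],
    "compare quantities across groups": ["overlaid bar graph"],
}


def is_numeric(values):
    """True if every value is an int or float (booleans are excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def suggest_charts(goal, x_values):
    """Suggest chart types for a goal, given the values planned for the x-axis."""
    options = CHART_FOR_GOAL[goal]
    if not is_numeric(x_values):
        # A scatter plot needs numerical values on both axes,
        # so it is ruled out when the x-axis holds labels.
        options = [chart for chart in options if chart != "scatter plot"]
    return options


# Hypothetical x-values: years of education (numeric) vs. group labels.
print(suggest_charts("explore trends or correlation", [12, 16, 18]))
print(suggest_charts("show composition", ["Men", "Women"]))
```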

Area Principle

The area principle states that the area a part of a graph occupies should be proportional to the value it represents.

Although this graph looks cool in 3D, it actually violates the area principle because the area of the bars does not reflect the data they represent. The 3D perspective distorts the apparent size of the bars.
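To make the principle concrete, here is a small arithmetic sketch. It illustrates a different distortion from the 3D effect described above: scaling a bar's width along with its height, as often happens when pictures are used as bars. The function `area_ratio` and the values 4 and 2 are hypothetical and chosen only for illustration.

```python
def area_ratio(value_a, value_b, scale_width=False):
    """Ratio of the drawn areas of two bars representing value_a and value_b.

    Heights are proportional to the values. If scale_width is True, the
    widths are scaled by the same factor, as when an image is enlarged.
    """
    height_ratio = value_a / value_b
    width_ratio = height_ratio if scale_width else 1.0
    return height_ratio * width_ratio


# Fixed-width bars: twice the value gives twice the area.
print(area_ratio(4, 2))
# Width scaled too: twice the value gives four times the area.
print(area_ratio(4, 2, scale_width=True))
```

In the second case the larger value looks four times as big even though it is only twice as big, which is the kind of mismatch the area principle warns against.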

Style Guide

In terms of style, it is recommended to adhere to the following guidelines:

  • Use consistent colors on the graphs, especially when trying to illustrate changes over time.

  • Use horizontal labels on the X-axis to improve readability.

  • Start the Y-axis at 0 and ensure a uniform scale to prevent your graph from being misleading.
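The last guideline can be checked with simple arithmetic. The sketch below compares how tall two bars are drawn relative to each other when the Y-axis starts at 0 versus at a higher value. The helper `drawn_height_ratio` and the values 52, 50, and 48 are hypothetical and used only for illustration.

```python
def drawn_height_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the Y-axis begins at axis_start."""
    if axis_start >= min(value_a, value_b):
        raise ValueError("axis_start must be below both values")
    return (value_a - axis_start) / (value_b - axis_start)


# Axis starts at 0: the bars differ by about 4%, matching the data.
print(drawn_height_ratio(52, 50))
# Axis starts at 48: the first bar is drawn twice as tall as the second.
print(drawn_height_ratio(52, 50, axis_start=48))
```

A 4% difference in the data becomes a 100% difference on the page once the axis is truncated, which is why starting at 0 keeps the graph from being misleading.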
