Groupby and Join

How can we aggregate data?

Key Terms:

Aggregation functions: This term refers to all functions that group rows together in some way to return a single value.

Now that we know how to use a Pandas DataFrame to capture our data, suppose we want to be able to look only at rows that satisfy some condition we want to impose. Or, perhaps, we want to combine the data from two DataFrames into one so that we can analyze them together. Thankfully, Pandas provides several handy functions that allow us to do both of these with ease!

Groupby

There may be situations in which we want to split our data into groups and then perform some tasks with those groups. Pandas provides an in-built function to do exactly that!

df.groupby()

Groupby operations are usually a combination of the following steps:

  1. Splitting the data: we will split the data into certain groups based on criteria that we specify.

  2. Applying a function: Each group independently has a function applied to it. Often, this function is count, which will count the number of members in each group, or sum, which will some attribute of each member in the group and give us the total. These are called aggregation functions. We might also choose to filter some of the groups in this step - for example, groups with less than 3 members might not be useful to us in some situation, so we can use this step to simply discard them.

  3. Combining the data: lastly, we will combine the data, usually back into a DataFrame so that we can view the results and perform further operations on it.

The groupby function has some parameters that allow us to customize how it operates on our data. We'll go over some of the more important parameters here, but you can find the complete list of parameters and what they do in the documentation we have provided for you.

  • by: This parameter defines which column (or even multiple columns!) or function we'd like to group by.

  • axis: This parameter determines whether the groupby function operates over columns (which is what you'll generally use) or over rows.

  • sort: This parameter allows you to change the order in which pandas sorts the groups. Note that it preserves the order of rows within each group.

Let us look at an example. Suppose some reviewers have left ratings on two movies, and we want to know how many people rated each movie without having to manually count. How can we accomplish this using groupby?

Take a minute to think of an approach and then look at the example below!

>>> df1 = pd.DataFrame({
        "Name": ["Alice", "Bob", "Courtney", "Delilah", "Elias", "Frank"],
        "Movie": ["Joker", "The Farewell", "Joker", "Joker", "The Farewell", "Joker"]
    })
>>> df1.groupby("Movie").count(
                    Name
Movie
Joker               4
The Farewell        2

As we can see, the table indicates the name of the movie and the number of people who rated it in the "Name" column, which got aggregated.

Join

What if we want to combine two, or even more, large dataframes into a single dataframe? It would be inefficient to manually add rows to a dataframe especially if we have a large number of dataframes to combine. Joins provide a powerful way to automate this process.

Joins are always called in the following manner:

firstDataFrame.join(secondDataFrame, other_parameters)

The relevant columns of secondDataFrame will be added to the existing columns of firstDataFrame. There are other parameters we can change to manipulate how this join occurs.

  • on: This parameter defines the column or list of columns that we want to compare between the two dataframes.

  • how: There are a number of different types of joins that we can execute: 'left', 'right', 'outer', and 'inner'. This parameter allows us to specify which type we would like to perform.

favorite_movies = pd.DataFrame({
"Name": ["Alice", "Bob", "Courtney", "Delilah", "Elias", "Frank"],
"Movie": ["Joker", "The Farewell", "Joker", "Joker", "The Farewell", "Joker"]
})

movie_genres = pd.DataFrame({
"Genre": ["Crime/Drama", "Comedy", "Crime/Drama", "Crime/Drama", "Comedy", "Crime/Drama"]
})
favorite_movies.join(movie_genres)

Last updated