Introduction to DataFrames

What is a DataFrame, why do we need one, and how do we use it?

Key Terms

  • DataFrame: a table-like data structure within the Pandas framework

  • Series: the data structure that Pandas uses to represent each row and column within a DataFrame

  • index: a label for each row that identifies that row; ideally it is unique, so that no two rows share the same index

  • label: the column names of a DataFrame

What is a DataFrame?

Pandas is a popular data science package because it provides easy ways to explore and manipulate data. To do this, Pandas represents data in a data structure called a DataFrame. DataFrames are 2-dimensional data structures made of labeled rows and columns (of type Series, which are a special kind of data structure similar to lists!). Each row in a DataFrame represents an object and each column represents a characteristic of that object.

DataFrames look very similar to the dictionary tables that we learned about in Module 2 (take a look back at Module 2 if you’ve forgotten what dictionary tables are). In fact, we can easily convert our dictionary tables into DataFrames! However, the advantage of using DataFrames is that Pandas includes many helpful functions that you can call on these DataFrames in order to quickly access and analyze data.

Let’s say you have a bunch of information about everyone in your class. This information can be organized into rows and columns, where each row represents one of your classmates and each column tells you something about them (e.g. name, height, age, hometown, or number of siblings). We could load all of this data into a DataFrame and use a few of the many functions Pandas has to sort and find information about our class. For example, we could find out which student has the most siblings, which two people live the farthest away from each other, or what the average student height is. Using a Pandas DataFrame, these answers could all be found without writing a lot of code!
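
To make the class example concrete, here is a minimal sketch. The classmates, their heights, and their sibling counts are all made up for illustration, and the column names are assumptions. Creating a DataFrame with pd.DataFrame() is covered in "How do we make a DataFrame?" further down.

```python
import pandas as pd

# Hypothetical classmates: every name and number here is invented.
classmates = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Chen', 'Dee'],
    'Height(cm)': [160, 175, 168, 171],
    'Siblings': [1, 3, 0, 2],
})

# Average height across all students.
average_height = classmates['Height(cm)'].mean()

# idxmax() gives the row index of the largest value in the column;
# we then look up that row's name.
most_siblings_row = classmates['Siblings'].idxmax()
most_siblings_name = classmates.loc[most_siblings_row, 'Name']

print(average_height)
print(most_siblings_name)
```

Each question takes one or two lines because Pandas already provides the column-wide calculations.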

Importing Pandas

In order to use Pandas, we first have to import the Pandas package. A common way to import Pandas is to import the whole package and to give it the abbreviation pd, like so:

>>> import pandas as pd

Remember that we add as pd just for the sake of simplicity! It's much faster to type pd.<some function> each time we want to use Pandas instead of pandas.<some function>.

Note: If you’re having trouble with importing and using Python packages, please review Module 5!

How do we make a DataFrame?

One way we can create a DataFrame is by converting data from a dictionary table into a DataFrame. Let's say we have the fruit_table from our Tables chapter.

>>> fruit_table
{'Fruit': ['apple', 'orange', 'peach', 'banana'], 
'Price($)': [1.49, 1.49, 2.49, 1.29]}

This table organizes the names and prices of four types of fruits into four rows and two columns. We can load this data into a DataFrame by passing the name of the dictionary as an argument to the pd.DataFrame() function.

>>> df = pd.DataFrame(fruit_table)

We now have a DataFrame named df which has the same data as fruit_table. The DataFrame has four rows and two columns (just like fruit_table) and the columns are named after the keys in our dictionary table, ‘Fruit’ and ‘Price($)’. Now, we have all of the Pandas DataFrame functions at our disposal! If we want to learn something about our data, we can call many of these functions on our DataFrame df using dot notation: df.<function name>.
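
As a quick illustration of dot notation, the sketch below rebuilds df from fruit_table and asks a few simple questions about it. The particular attributes and functions chosen here (shape, columns, max) are just examples; Pandas offers many more.

```python
import pandas as pd

fruit_table = {'Fruit': ['apple', 'orange', 'peach', 'banana'],
               'Price($)': [1.49, 1.49, 2.49, 1.29]}
df = pd.DataFrame(fruit_table)

# shape is (number of rows, number of columns).
print(df.shape)

# The column labels come from the dictionary keys.
print(list(df.columns))

# Functions can also be called on a single column, here the highest price.
print(df['Price($)'].max())
```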

How do we import a DataFrame?

Data scientists also often work with pre-existing datasets contained in Comma Separated Value files, or CSV files. In order to work with this data, we first need to import the data from the CSV file into a DataFrame. Luckily, Pandas provides an easy function for us to do this!

Let’s say that our fruit data was contained in a CSV file called “fruit_table.csv”. We can call pd.read_csv(<CSV file path>) and pass in the name of the CSV file as the argument, like so:

>>> df_from_csv = pd.read_csv("fruit_table.csv")

Note that if you are running this command in a Jupyter Notebook, you will either have to make sure that the CSV file is in the same directory as your Notebook or specify the full path to the CSV if it is in another directory.

Now, df_from_csv is a DataFrame containing the same information as the CSV file we were given!

Note: The pd.read_csv() function has many more parameters that we can specify to tell Pandas exactly how we want it to read and format the data. You can take a look at the Pandas documentation to learn more!
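
Here is a small sketch of one such parameter, index_col. So that the example runs without a file on disk, it reads from an in-memory string that stands in for fruit_table.csv; the assumed layout (a header row, then one fruit per line) is a guess at what that file would contain.

```python
import io
import pandas as pd

# Stand-in for the contents of fruit_table.csv (assumed layout).
csv_text = "Fruit,Price($)\napple,1.49\norange,1.49\npeach,2.49\nbanana,1.29\n"

# read_csv accepts a file path or a file-like object such as StringIO.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# index_col tells Pandas to use one of the columns as the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='Fruit')

print(df_from_csv)
print(df_indexed)
```

With a real file, you would pass the file path in place of io.StringIO(csv_text).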

How can we interact with data in a DataFrame?

You can access any column of a DataFrame by referring to the string that names the column (also called the column label) in brackets. For instance, df['Fruit'] would return the column of fruit names, while df['Price($)'] would return the column of prices.
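
A short sketch of column access, using the same fruit data. It also shows that the object returned for a single column is a Series, the data structure described in the Key Terms.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})

fruits = df['Fruit']      # the column of fruit names
prices = df['Price($)']   # the column of prices

# A single column is a Series, not a plain Python list.
print(type(fruits))

# tolist() converts a Series to an ordinary list when one is needed.
print(fruits.tolist())
```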

Using this same pattern, we can also add columns to our DataFrame. Let's add an index to each row so that every fruit has its own ID number to access all the information about it. We can name this column "ID" and number the fruits so that the first row is index 0, the next row is index 1, and so on, up to our last row. Take a second to think about how you would implement it before looking at the next section.

df['ID'] = range(len(df))
#remember that range(len(df)) produces the integers
#from 0 to len(df) - 1

Another way to do this uses the DataFrame's index attribute, and the result is the same: df['ID'] = df.index
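
To check that the two approaches really agree, the sketch below adds both columns side by side. The second column name, ID_from_index, is made up here purely so the two results can be compared.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})

# Approach 1: number the rows with range.
df['ID'] = range(len(df))

# Approach 2: copy the existing row index into a column.
# (ID_from_index is a hypothetical name used only for this comparison.)
df['ID_from_index'] = df.index

# True when every row has the same value in both columns.
same = bool((df['ID'] == df['ID_from_index']).all())
print(same)
```

The two only match because a freshly created DataFrame is indexed 0, 1, 2, and so on by default.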

Because we know that we have one row per fruit, we can set our Fruit column to be our index. We would do this using a Pandas function: df.set_index('Fruit', inplace=True). The second argument, inplace=True, tells Pandas to modify df directly instead of returning a new DataFrame, so it has the same effect as running df = df.set_index('Fruit').
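
The sketch below compares the two forms on the fruit data. The name by_fruit is chosen here just to hold the result of the version that returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})

# Without inplace, set_index returns a new DataFrame and leaves df alone.
by_fruit = df.set_index('Fruit')
print(list(df.columns))        # df still has a Fruit column

# With inplace=True, df itself is changed and nothing is returned.
df.set_index('Fruit', inplace=True)
print(df.equals(by_fruit))     # both now hold the same data

# Rows can now be looked up by fruit name.
print(df.loc['peach', 'Price($)'])
```

After the index is set, Fruit is no longer an ordinary column, so df['Fruit'] would no longer work on this DataFrame.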

Now, we can access the rows and columns of our DataFrame by using the index. For the examples below, let's go back to our original fruits DataFrame (before adding the ID column or setting Fruit as the index), where our rows are indexed by numbers, so df looks like this:

   Fruit   Price($) 
0   apple   1.49    
1   orange  1.49    
2   peach   2.49    
3   banana  1.29    

There are three ways to access rows and columns by indexing.

1. loc: df.loc[<arguments>] is really flexible and can return any part of the DataFrame. Its arguments are labels: one or multiple row indices, optionally followed by one or multiple column headings.

df.loc[0] 
>> Fruit    apple
   Price($) 1.49
#returns the info from the first row

df.loc[[0, 2]]
>>   Fruit   Price($) 
 0   apple   1.49    
 2   peach   2.49 
# returns the info from the rows indexed at 0 and 2

df.loc[3, 'Price($)']
>> 1.29
# returns the value of row 3 under the `Price($)` heading

2. iloc: df.iloc[<arguments>] is similar to loc, but it takes integer positions (counting from 0) instead of labels. Because our rows are labeled 0 through 3, each row's label matches its position, so the first two examples above would return the same results if you were to use iloc instead of loc.

3. using [ ]: df[<arguments>] with a column name, or a list of column names, gives you access to those columns.

df['Fruit'] #returns the Fruit column as a Series
>> 0     apple
1    orange
2     peach
3    banana
Name: Fruit, dtype: object

df[['Fruit', 'Price($)']] #returns a DataFrame with both columns (here, our entire DataFrame)
>>  Fruit   Price($) 
0   apple   1.49    
1   orange  1.49    
2   peach   2.49    
3   banana  1.29       
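
The difference between loc and iloc only shows up when the row labels are not simply 0, 1, 2, and so on. The sketch below, which assumes a copy of the fruit data indexed by fruit name (called by_fruit here for illustration), looks up the same row both ways.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})
by_fruit = df.set_index('Fruit')

# loc looks rows up by label: the row whose index is 'peach'.
peach_by_label = by_fruit.loc['peach']

# iloc looks rows up by position: the third row, counting from 0.
peach_by_position = by_fruit.iloc[2]

print(peach_by_label.equals(peach_by_position))

# iloc also accepts a list of positions.
first_and_third = by_fruit.iloc[[0, 2]]
print(list(first_and_third.index))

# by_fruit.loc[2] would raise a KeyError here, because 2 is a position,
# not one of the row labels.
```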

We can also use these three approaches to change the values of our DataFrame. For example, if the price of peaches increases by $1, we would want to update it. There are multiple ways to access the data, so this is just one way to update the price:

df.loc[2, 'Price($)'] = df.loc[2, 'Price($)'] + 1

The right-hand side first gets the current price ($2.49) and adds 1 to it; the result is then stored back in the same cell.
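
Here is the peach update as a runnable sketch on the fruit data, followed by a check that only that one cell changed. The rounding when printing is only there to hide floating-point noise.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})

# Read the current peach price (row 2), add 1, and store it back.
df.loc[2, 'Price($)'] = df.loc[2, 'Price($)'] + 1

# The shorthand df.loc[2, 'Price($)'] += 1 would do the same thing.

new_peach_price = df.loc[2, 'Price($)']
print(round(new_peach_price, 2))

# The other prices are untouched.
print(df.loc[[0, 1, 3], 'Price($)'].tolist())
```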

How can we select certain data from a DataFrame?

We can also use Pandas to select certain portions of data! You can select some number of rows from a DataFrame by giving it a list of Trues and Falses, one per row: rows marked True are kept and rows marked False are left out.

For example, if we only wanted to select the first and third row of our DataFrame, we could do the following:

selection = df[[True, False, True, False]]
selection
   Fruit   Price($) 
0   apple   1.49    
2   peach   2.49    

On its own this isn't very useful, but it lets us do something powerful: we can select rows from a table based on whether or not a certain condition is true.

How might this work? If we select a column from our table, we can use a comparison operator to turn it into a Series of Trues and Falses!

The general structure for filtering a table by a condition based on the values in a given column is dataframe_name[dataframe_name[column_name] ? value], where ? is a comparison operator (e.g. <, >, or ==).

In the example below, we filter our fruit DataFrame to keep only the fruits that cost less than $2.

>>> prices = df['Price($)']
>>> prices
0    1.49
1    1.49
2    2.49
3    1.29
Name: Price($), dtype: float64

>>> prices < 2
0     True
1     True
2    False
3     True
Name: Price($), dtype: bool

>>> df[prices < 2]
    Fruit  Price($)
0   apple      1.49
1  orange      1.49
3  banana      1.29

>>> df[df['Price($)'] < 2]
    Fruit  Price($)
0   apple      1.49
1  orange      1.49
3  banana      1.29
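
The same pattern works with other comparison operators and with other columns. The sketch below shows a few variations on the fruit data; the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'peach', 'banana'],
                   'Price($)': [1.49, 1.49, 2.49, 1.29]})

# == keeps rows where the value matches exactly (note the double equals sign).
costs_149 = df[df['Price($)'] == 1.49]

# >= keeps rows at or above a threshold.
two_dollars_or_more = df[df['Price($)'] >= 2]

# Conditions can also be written on text columns.
not_banana = df[df['Fruit'] != 'banana']

print(costs_149)
print(two_dollars_or_more)
print(not_banana)
```

In each case the filtered rows keep their original index, which is why the results are not renumbered from 0.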

Summary

  • A DataFrame is a table-like data structure that lets us use Pandas functions to learn more about our data.

  • Use Pandas by running import pandas as pd at the start of your code.

  • Create a DataFrame by converting a dictionary table to a DataFrame: dataframe_name = pd.DataFrame(dictionary_name)

  • Import data from a CSV file into a DataFrame using pd.read_csv(<CSV file path>)

  • Access and change values of your DataFrame by indexing directly with [ ], which takes in labels to access columns; with iloc, which takes in integer positions to access rows; or with loc, which can take combinations of row indices and column labels to access any part of your DataFrame.
