Working with Data

There are a variety of tools that can be helpful when working with real-world data; in practice, data is often missing values, and you might also want to work with string data.

Null Data

In the real world, data isn't always perfect. Sometimes, values are missing! For example, if you run a survey and people don't fill out a field, you won't have their data. Instead, you'll have null values in that column of your table. In Python, a missing value can be written as None; Pandas typically stores and displays missing values as NaN.
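As a minimal sketch of what null values look like in practice, the snippet below builds a small made-up reviews table (the data is hypothetical; the column names rating and reviews are borrowed from the examples in this section) and counts the missing values in each column with isna.

```python
import numpy as np
import pandas as pd

# Hypothetical data: None and np.nan both mark a missing entry.
dataset = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0, 4.0],
    "reviews": ["Great noodles!", "Too salty.", None, "Friendly staff."],
})

# isna() returns a table of booleans: True wherever a value is missing.
missing_mask = dataset.isna()

# Summing the booleans counts the missing values in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Checking these counts first is a quick way to decide whether filling or dropping values makes more sense for a given column.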

Let's take a dataset of restaurant reviews! Some people may leave reviews that have a star rating, but no description; others may leave reviews with descriptions, but no star ratings. What can we do in this case?

There are a few methods that might prove helpful to us here!

fillna

DataFrame.fillna and Series.fillna take in values to fill holes in a dataset.

DataFrame.fillna takes in values for the whole table, while Series.fillna takes in values for a specific series, and can be easier when trying to fill values in a specific column.

Let's try filling any review where none is present with the string "No Review", and let's try filling any leftover null values with 0.

>>> import pandas as pd
>>> dataset = pd.read_csv('data.csv')
>>> dataset.head()
# TODO(shayna) fill in with an example
>>> dataset['reviews'] = dataset['reviews'].fillna('No Review')
>>> dataset.head()
# TODO(shayna) fill in with an example
>>> dataset = dataset.fillna(0)  # fillna returns a new table, so assign it back
>>> dataset.head()
# TODO(shayna) fill in with an example
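Since data.csv isn't shown here, the following is a self-contained sketch of the same idea on a tiny made-up table (the rows are hypothetical). It also shows one possible variation: passing a dict to DataFrame.fillna so each column gets its own fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the restaurant review data.
dataset = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0],
    "reviews": ["Great noodles!", "Too salty.", None],
})

# Series.fillna returns a new Series; assign it back to keep the change.
dataset["reviews"] = dataset["reviews"].fillna("No Review")

# A dict passed to DataFrame.fillna sets a separate fill value per column.
# The result is a new table; `dataset` itself is left unchanged.
filled = dataset.fillna({"rating": 0})
print(filled)
```

Keep in mind that filling missing ratings with 0 will pull down any average computed afterwards, so this choice depends on what you plan to measure.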

dropna

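Maybe we only want reviews with star ratings, or maybe we don't want any rows with missing values at all. To get rid of rows with missing values, we can use DataFrame.dropna, which by default drops every row that contains at least one missing value. To drop rows only when the star rating is missing, pass subset=['rating'].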

>>> dataset.head()
# TODO(shayna) fill in with an example
>>> dataset = dataset.dropna()  # dropna returns a new table, so assign it back
>>> dataset.head()
# TODO(shayna) fill in with an example
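Here is a small sketch contrasting the two cases, again on made-up data: dropping every row with any missing value versus dropping only the rows that lack a star rating. The subset parameter is part of DataFrame.dropna; the table itself is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the restaurant review data.
dataset = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0, 4.0],
    "reviews": ["Great noodles!", "Too salty.", None, "Friendly staff."],
})

# Default behavior: drop a row if any of its columns is missing.
complete_rows = dataset.dropna()

# Drop a row only if its rating is missing; missing review text is allowed.
rated_rows = dataset.dropna(subset=["rating"])

print(complete_rows)
print(rated_rows)
```

Both calls return new tables and leave the original untouched, which is why the results are stored under new names.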

Data Analysis Tools

Some other analysis tools may also come in handy while you're analyzing data:

nlargest, nsmallest

If we only want the descriptions from the 10 most positive or 10 most negative reviews, we can get those rows of a dataset using nlargest and nsmallest.

>>> dataset.head()
# TODO(shayna) fill in with an example
>>> positive = dataset.nlargest(10, 'rating')
>>> negative = dataset.nsmallest(10, 'rating')
>>> positive.head()
# TODO(shayna) fill in with an example
>>> negative.head()
# TODO(shayna) fill in with an example
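To see what these calls return, here is a minimal sketch on a five-row made-up table, asking for the top and bottom two rows instead of ten. The data is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the restaurant review data.
dataset = pd.DataFrame({
    "rating": [4.0, 1.0, 5.0, 2.0, 3.0],
    "reviews": ["Good", "Awful", "Excellent", "Poor", "Fine"],
})

# nlargest returns the rows with the highest ratings, highest first.
top_two = dataset.nlargest(2, "rating")

# nsmallest returns the rows with the lowest ratings, lowest first.
bottom_two = dataset.nsmallest(2, "rating")

# Pull out just the descriptions from each result.
top_descriptions = top_two["reviews"].tolist()
bottom_descriptions = bottom_two["reviews"].tolist()
print(top_descriptions, bottom_descriptions)
```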

max, min, mean

We can get basic descriptive statistics from Series very easily!

>>> ratings = dataset['rating']  # statistics need the numeric rating column
>>> ratings.max()
5.0
>>> ratings.min()
1.0
>>> ratings.mean()
4.328
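One detail worth knowing is how these statistics treat missing values. The sketch below uses a made-up Series of ratings with one missing entry; by default, max, min, and mean skip missing values rather than counting them.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one of which is missing.
ratings = pd.Series([5.0, 1.0, np.nan, 4.0], name="rating")

highest = ratings.max()
lowest = ratings.min()

# The mean is taken over the three present values: (5 + 1 + 4) / 3.
average = ratings.mean()

print(highest, lowest, average)
```

This is why filling missing ratings with 0 beforehand changes the mean, while leaving them as missing does not.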

Analyzing Strings

A lot of the time, we have descriptive data -- if we have lots of reviews, how do we pull information from them?

There are lots of useful methods that Pandas provides to work with strings! All of these are reached through the .str accessor on a Series, so we use them on individual columns.

One especially useful one is str.contains:

str.contains

What if we want a measure of how good the restaurant's noodles are? We don't know what each person ordered, but we can approximate by only looking at reviews that mention noodles! To do this, we can use str.contains, which tells you whether each element in a Series contains a specific word or phrase. By default the match is case-sensitive and the pattern is treated as a regular expression.

>>> dataset['rating'].mean()
4.328
>>> noodles = dataset['reviews'].str.contains('noodles')  # True/False per review
>>> with_noodles = dataset[noodles]  # keep only the reviews that mention noodles
>>> with_noodles['rating'].mean()
3.780

From this, we can tell that reviews mentioning noodles average a lower rating than reviews overall, which suggests the noodles probably aren't as good as the other dishes at the restaurant!
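The comparison can be sketched end to end on a small made-up table. This version makes two assumptions beyond the example above: it passes case=False so that "Noodles" and "noodles" both match, and na=False so that a missing review counts as not mentioning noodles. Both are real parameters of str.contains; the data is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the restaurant review data.
dataset = pd.DataFrame({
    "rating": [5.0, 2.0, 3.0, 4.0],
    "reviews": [
        "Loved the dumplings",
        "Noodles were soggy",
        None,
        "Great noodles, slow service",
    ],
})

# Boolean Series: True where the review text mentions noodles.
mentions_noodles = dataset["reviews"].str.contains("noodles", case=False, na=False)

# Boolean indexing keeps only the rows where the mask is True.
with_noodles = dataset[mentions_noodles]

# Compare the average rating of noodle reviews to the overall average.
noodle_mean = with_noodles["rating"].mean()
overall_mean = dataset["rating"].mean()
print(noodle_mean, overall_mean)
```

Without na=False, a missing review would produce a missing value in the mask, which cannot be used directly to select rows.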

Conclusion

With these tools, you can analyze datasets in depth and get more out of them! There are also lots more that might be helpful in a given scenario; to see all of them, take a look at the Pandas documentation.
