Working with Data
Pandas offers a variety of tools that help when working with real-world data. In practice, data often has missing values, and you might also want to work with string data.
Null Data
In the real world, data isn't always perfect. Sometimes, values are missing! For example, if you run a survey and people don't fill out a field, you won't have their data. Instead, you'll have null values in that column of your table. These values are represented by None in Python.
Let's take a dataset of restaurant reviews! Some people may leave reviews that have a star rating, but no description; others may leave reviews with descriptions, but no star ratings. What can we do in this case?
There are a few methods that might prove helpful to us here!
fillna
DataFrame.fillna and Series.fillna take in values to fill holes in a dataset. DataFrame.fillna takes in values for the whole table, while Series.fillna takes in values for a specific series, and can be easier when trying to fill values in a specific column.
Let's try filling any missing review description with the string "No Review", and then filling any leftover null values with 0.
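Here is a minimal sketch of both fills. The reviews table and its column names (stars, description) are made up for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical review data; the column names are assumptions for illustration.
reviews = pd.DataFrame({
    "stars": [5.0, None, 3.0, None],
    "description": ["Great noodles!", "Too salty", None, None],
})

# Series.fillna: fill the holes in a single column.
reviews["description"] = reviews["description"].fillna("No Review")

# DataFrame.fillna: fill every null that is still left in the table.
filled = reviews.fillna(0)
print(filled)
```

Filling the description column first matters here: if we called reviews.fillna(0) right away, the missing descriptions would become 0 as well.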
dropna
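Maybe we only want reviews with star ratings, or maybe we don't want rows with any missing values at all! To get rid of rows with missing values, we can use DataFrame.dropna. By default, it drops every row that contains at least one null value; its subset parameter limits the check to specific columns, such as the star rating.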
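As a quick sketch of both cases, again using a made-up reviews table (the column names are assumptions):

```python
import pandas as pd

# Hypothetical review data; the column names are assumptions for illustration.
reviews = pd.DataFrame({
    "stars": [5.0, None, 3.0, 4.0],
    "description": ["Great noodles!", "Too salty", None, "Solid lunch spot"],
})

# Keep only the reviews that have a star rating.
rated = reviews.dropna(subset=["stars"])

# Keep only the rows with no missing values in any column.
complete = reviews.dropna()

print(rated)
print(complete)
```

Note that dropna returns a new table and leaves reviews unchanged, so we store the result in a new variable.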
Data Analysis Tools
Some other analysis tools may also come in handy while you're analyzing data:
nlargest, nsmallest
If we only want the descriptions from the 10 most positive or 10 most negative reviews, we can get those rows of a dataset using nlargest and nsmallest.
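A small sketch, assuming a hypothetical reviews table with stars and description columns. We ask for 2 rows instead of 10 only to keep the example short.

```python
import pandas as pd

# Hypothetical review data; the column names are assumptions for illustration.
reviews = pd.DataFrame({
    "stars": [5, 1, 3, 4, 2],
    "description": ["Amazing", "Awful", "Fine", "Good", "Meh"],
})

# Rows with the highest star ratings, highest first.
top = reviews.nlargest(2, "stars")["description"]

# Rows with the lowest star ratings, lowest first.
bottom = reviews.nsmallest(2, "stars")["description"]

print(top.tolist())
print(bottom.tolist())
```

The first argument is how many rows to return, and the second is the column to rank by.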
max, min, mean
We can get basic descriptive statistics from a Series very easily, using its max, min, and mean methods!
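For instance, on a made-up Series of star ratings with one missing value:

```python
import pandas as pd

# Hypothetical star ratings; the last rating is missing.
stars = pd.Series([5, 1, 3, 4, 2, None])

highest = stars.max()
lowest = stars.min()
average = stars.mean()

print(highest, lowest, average)
```

By default, these methods skip missing values, so the average here is taken over the five ratings that are present.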
Analyzing Strings
A lot of the time, we have descriptive data -- if we have lots of reviews, how do we pull information from them?
There are lots of useful methods that Pandas provides to work with strings! All of these are called on Series, so we use them on individual columns.
One especially useful one is str.contains:
str.contains
What if we want a measure of how good the restaurant's noodles are? We don't know what each person ordered -- but we can approximate by only looking at reviews that mention noodles! To do this, we can use str.contains, which tells you if each element in a Series contains a specific word or phrase.
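Here is a sketch of that comparison. The reviews below are invented for illustration, so the numbers only show the technique and are not real results.

```python
import pandas as pd

# Hypothetical review data; the column names are assumptions for illustration.
reviews = pd.DataFrame({
    "stars": [2, 5, 3, 4],
    "description": [
        "The noodles were soggy",
        "Best dumplings in town",
        None,
        "Noodles were okay, great soup",
    ],
})

# True for each review whose description mentions noodles.
# case=False ignores capitalization; na=False counts missing descriptions as "no".
mentions_noodles = reviews["description"].str.contains("noodles", case=False, na=False)

# Compare the average rating of noodle reviews against all reviews.
noodle_mean = reviews.loc[mentions_noodles, "stars"].mean()
overall_mean = reviews["stars"].mean()

print(noodle_mean, overall_mean)
```

The result of str.contains is a Series of True and False values, which we can use to select only the matching rows before averaging their star ratings.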
From this, we can tell that the noodles probably aren't as good as the other dishes at the restaurant!
Conclusion
With these tools, you can analyze datasets in depth and get more out of them! There are also lots more that might be helpful in a given scenario; to see all of them, take a look at the Pandas documentation.