Module 5

Working with Data

There are a variety of tools that can be helpful when working with real-world data; in practice, data is often missing values, and you might also want to work with string data.

Null Data

In the real world, data isn't always perfect. Sometimes, values are missing! For example, if you run a survey and people skip a field, you won't have their data; instead, you'll have null values in that column of your table. Python represents a missing value as None, and pandas typically displays missing entries as NaN ("not a number").

Let's take a dataset of restaurant reviews! Some people may leave reviews that have a star rating, but no description; others may leave reviews with descriptions, but no star ratings. What can we do in this case?

There are a few methods that might prove helpful to us here!

fillna

DataFrame.fillna and Series.fillna take in values to fill holes in a dataset.

DataFrame.fillna fills values across the whole table, while Series.fillna fills values in a single column, which is often more convenient when only one column needs filling.

Let's try filling any review where no text is present with the string "No Review", and then filling any leftover null values with 0.

>>> dataset = pd.read_csv('data.csv')
>>> dataset['reviews'] = dataset['reviews'].fillna('No Review')  # fill missing review text
>>> dataset = dataset.fillna(0)  # fillna returns a new DataFrame, so assign it back
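The transcript above assumes a real data.csv on disk. To see fillna behave end to end, here's a minimal sketch using a small made-up table (the column names and values are hypothetical, just for illustration):

```python
import pandas as pd

# A tiny, made-up reviews table with two holes in it.
df = pd.DataFrame({
    'rating': [5.0, None, 3.0],
    'review': ['Great!', 'Okay place.', None],
})

# Fill missing review text in just the 'review' column...
df['review'] = df['review'].fillna('No Review')
# ...then fill any remaining nulls (here, the missing rating) with 0.
df = df.fillna(0)

print(df['review'].tolist())  # ['Great!', 'Okay place.', 'No Review']
print(df['rating'].tolist())  # [5.0, 0.0, 3.0]
```

Notice that fillna returns a new object rather than modifying the original, which is why the results are assigned back.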

dropna

Maybe we only want reviews with star ratings -- or maybe we don't want rows with any missing values at all! To get rid of rows with missing values, we can use DataFrame.dropna, which by default drops every row that contains a null value.

>>> dataset = dataset.dropna()  # dropna returns a new DataFrame, so assign it back
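Here's the same idea on a small made-up table (hypothetical data for illustration): only the row with no missing values survives.

```python
import pandas as pd

# A tiny, made-up reviews table; rows 1 and 2 each have a null.
df = pd.DataFrame({
    'rating': [5.0, None, 3.0],
    'review': ['Great!', 'Okay place.', None],
})

# dropna keeps only rows with no missing values at all.
complete = df.dropna()

print(len(complete))                 # 1
print(complete['review'].tolist())   # ['Great!']
```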

Data Analysis Tools

Some other analysis tools may also come in handy while you're analyzing data:

nlargest, nsmallest

If we only want the descriptions from the 10 most positive or 10 most negative reviews, we can get those rows of a dataset using nlargest and nsmallest.

>>> positive = dataset.nlargest(10, 'rating')
>>> negative = dataset.nsmallest(10, 'rating')
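On a small made-up table (hypothetical values, for illustration), you can see that nlargest returns rows sorted best-first and nsmallest worst-first:

```python
import pandas as pd

# Made-up ratings for five reviews.
df = pd.DataFrame({
    'rating': [5.0, 1.0, 4.0, 2.0, 3.0],
    'review': ['a', 'b', 'c', 'd', 'e'],
})

top2 = df.nlargest(2, 'rating')      # the two highest-rated rows, best first
bottom2 = df.nsmallest(2, 'rating')  # the two lowest-rated rows, worst first

print(top2['rating'].tolist())     # [5.0, 4.0]
print(bottom2['rating'].tolist())  # [1.0, 2.0]
```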

max, min, mean

We can get basic descriptive statistics from Series very easily!

>>> ratings = dataset['rating']
>>> ratings.max()
5.0
>>> ratings.min()
1.0
>>> ratings.mean()
4.328
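You can try these methods on any Series; here's a minimal sketch with made-up ratings (the real dataset's values will differ):

```python
import pandas as pd

# Five hypothetical star ratings.
ratings = pd.Series([5.0, 1.0, 4.0, 2.0, 3.0])

print(ratings.max())   # 5.0
print(ratings.min())   # 1.0
print(ratings.mean())  # 3.0
```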

Analyzing Strings

A lot of the time, we have descriptive data -- if we have lots of reviews, how do we pull information from them?

There are lots of useful methods that Pandas provides to work with strings! All of these are called on Series, so we use them on individual columns.

One especially useful one is str.contains:

str.contains

What if we want a measure of how good the restaurant's noodles are? We don't know what each person ordered -- but we can approximate by looking only at reviews that mention noodles! To do this, we can use str.contains, which returns a boolean Series telling you whether each element contains a specific word or phrase; we can then use that Series to filter the table.

>>> dataset['rating'].mean()
4.328
>>> noodles = dataset['reviews'].str.contains('noodles')
>>> with_noodles = dataset[noodles]
>>> with_noodles['rating'].mean()
3.780
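The same filtering pattern works end to end on a small made-up table (hypothetical reviews, for illustration). One detail worth knowing: str.contains is case-sensitive by default, so passing case=False also catches "Noodles".

```python
import pandas as pd

# A tiny, made-up reviews table.
df = pd.DataFrame({
    'rating': [5.0, 2.0, 4.0],
    'review': ['Loved the noodles', 'Noodles were soggy', 'Great soup'],
})

# Boolean Series: does each review mention noodles (ignoring case)?
mentions = df['review'].str.contains('noodles', case=False)

# Average rating among just the noodle reviews.
print(df[mentions]['rating'].mean())  # 3.5
```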

From this, we can tell that the noodles probably aren't as good as the other dishes at the restaurant!

Conclusion

With these tools, you can analyze datasets in depth and get more out of them! There are also lots more that might be helpful in a given scenario; to see all of them, take a look at the Pandas documentation.
