title: Intro to Data Cleaning
duration: 1:5
creator:
  name: Lucy Williams
  city: DC

Intro to Data Cleaning

Week 2 | Lesson 2.3

LEARNING OBJECTIVES

After this lesson, you will be able to:

  • Inspect data types
  • Clean up a column using df.apply()
  • Know when to use .value_counts() in your code

INSTRUCTOR PREP

Before this lesson, instructors will need to:

  • Read in / Review any dataset(s) & starter/solution code
  • Generate a brief slide deck

LESSON GUIDE

TIMING | TYPE | TOPIC
5 min | Introduction | Inspect data types, df.apply(), .value_counts()
20 min | Demo /Guided Practice | Inspect data types
20 min | Demo /Guided Practice | df.apply()
20 min | Demo /Guided Practice | .value_counts()
20 min | Independent Practice
5 min | Conclusion

Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category, and object.

df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below.

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting Series is in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
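
As a quick illustration of that description, here is a minimal sketch. The `colors` Series and its values are made up for this example; `dropna=False` is the standard option for keeping missing values in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: repeated labels plus one missing value
colors = pd.Series(["red", "blue", "red", np.nan, "red", "blue", "green"])

# Default behavior: most frequent value first, missing values left out
counts = colors.value_counts()
print(counts)

# dropna=False keeps the missing value as its own row
counts_with_na = colors.value_counts(dropna=False)
print(counts_with_na)
```

Comparing the two printed results shows where the missing value went, which is often the first thing to check when cleaning a column.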

Demo /Guided Practice: Inspect data types (20 mins)

Let's create a small DataFrame from a dictionary, with a different data type in each column.

The demo code is in the code folder, which contains all the code in this lesson as a Jupyter notebook. It may be easier to use...

In an IPython notebook, type:

import pandas as pd
import numpy as np
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3, dtype='int8')))
dft

There is a really easy way to see what kind of dtypes are in each column.

dft.dtypes

If a pandas object contains data of multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

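To get the number of columns of each type in a DataFrame, count the values of dtypes. (Older pandas versions offered get_dtype_counts() for this; it was deprecated in pandas 0.25 and removed in 1.0.)

dft.dtypes.value_counts()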

You can do a lot more with dtypes; the pandas documentation on dtypes covers the details.

Check: Why do you think it might be important to know what kind of dtypes you're working with?
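
One possible way to illustrate the check is a column of numbers that was read in as text. The sketch below uses a made-up `raw` DataFrame with a hypothetical `price` column; `pd.to_numeric` is the standard pandas converter.

```python
import pandas as pd

# Hypothetical raw data: prices stored as text rather than numbers
raw = pd.DataFrame({"price": ["9", "10", "100"]})
print(raw.dtypes)  # price is a text column, not a numeric one

# Text sorts character by character, so "10" and "100" come before "9"
sorted_as_text = raw["price"].sort_values().tolist()

# After converting to a numeric dtype, sorting follows numeric order
raw["price_num"] = pd.to_numeric(raw["price"])
sorted_as_numbers = raw["price_num"].sort_values().tolist()

print(sorted_as_text)
print(sorted_as_numbers)
```

The values look the same on screen in both columns, which is why checking dtypes before sorting, summing, or comparing is worth the few seconds it takes.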

Demo /Guided Practice: df.apply() (20 mins)

Let's create a small data frame.

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

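Use df.apply to find the square root of all the values. Because np.random.randn produces negative numbers as well as positive ones, the square roots of the negative entries come back as NaN.

df.apply(np.sqrt)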

Find the mean of all of the columns.

df.apply(np.mean, axis=0)

Find the mean of all of the rows.

df.apply(np.mean, axis=1)
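
The calls above pass NumPy functions to apply. The same pattern works with a function you write yourself, which is how apply is typically used to clean up a column. The sketch below is only an illustration: the `people` DataFrame, the `city` column, and the `clean_city` helper are all hypothetical, and it uses Series.apply, the single-column counterpart of df.apply.

```python
import pandas as pd

# Hypothetical messy data: inconsistent whitespace and capitalization
people = pd.DataFrame({"city": ["  dc", "NYC ", "Dc", " nyc"]})

def clean_city(value):
    """Strip surrounding whitespace and upper-case a single value."""
    return value.strip().upper()

# Series.apply calls clean_city once for each value in the column
people["city_clean"] = people["city"].apply(clean_city)
print(people)
```

Writing the cleaned values to a new column keeps the original data available for comparison while you check the result.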

Check: How would you find the standard deviation (std) of the columns and rows?

Demo /Guided Practice: .value_counts() (20 mins)

Let's create a random array of 50 integers ranging from 0 to 6 (the upper bound passed to np.random.randint is exclusive).

data = np.random.randint(0, 7, size = 50)

Convert the array into a series.

s = pd.Series(data)

How many of each number are there in the Series? Enter value_counts().

# the top-level pd.value_counts(s) is deprecated in recent pandas; call the method on the Series
s.value_counts()
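
value_counts also takes a few options that are handy when exploring a column. The sketch below rebuilds a similar Series with a seeded generator so the run is repeatable (the seed and the variable names `counts`, `shares`, and `by_value` are choices made for this example) and shows `normalize=True` along with sorting by value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)               # seeded so the run is repeatable
s = pd.Series(rng.integers(0, 7, size=50))   # 50 integers from 0 through 6

counts = s.value_counts()                    # most frequent value first
shares = s.value_counts(normalize=True)      # fractions that sum to 1
by_value = s.value_counts().sort_index()     # ordered by value, not by count

print(counts)
print(shares)
print(by_value)
```

Sorting by index is useful when the values themselves have a natural order, such as ratings or days of the week.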

Independent Practice: Topic (20 minutes)

  • Use the sales.csv data set, which we've seen a few times in previous lessons
  • Inspect the data types
  • You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
  • Use .value_counts to count the values of one column of the dataset

Bonus

  • Add 3 to column 2
  • Use .value_counts for each column of the dataset
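
For instructors who want a starting point, here is a rough sketch of the steps above. It runs on a small stand-in DataFrame, because the actual columns of sales.csv are not shown in this lesson; the `units` and `region` columns and their values are hypothetical.

```python
import pandas as pd

# Stand-in for sales.csv; the real column names and values will differ
sales = pd.DataFrame({"units": [3, 5, 3, 8], "region": ["E", "W", "E", "E"]})

print(sales.dtypes)                      # inspect the data types

first_col = sales.columns[0]             # "column 1", selected by position
sales[first_col] = sales[first_col].apply(lambda x: x + 1)

print(sales[first_col].value_counts())   # counts for one column

# Bonus-style step: value_counts for every column, collected in a dict
all_counts = {col: sales[col].value_counts() for col in sales.columns}
```

With the real file, the stand-in DataFrame would be replaced by a call to pd.read_csv on sales.csv.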

Conclusion (5 mins)

So far we've used pandas to look at the head and tail of a data set. We've also taken a look at summary stats and different data types. We've selected and sliced data too. Today we added inspecting data types, df.apply, and .value_counts to our pandas arsenal. Nice!