Data basics: loading data, looking at your data, basic commands
pandas
import pandas as pd
import pylab as pl
import numpy as np
import re
np.sum
<function numpy.core.fromnumeric.sum>
We're going to use the `read_csv` function in pandas.
?pd.read_csv
! head -n 2 ./data/credit-training.csv
SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,0.76612660900000007,45,2,0.8029821290000001,9120.0,13,0,6,0,2.0
df = pd.read_csv("./data/credit-training.csv")
df
Our data is represented by a DataFrame. You can think of a DataFrame as a giant spreadsheet which you can program: a collection of Series (or columns) with a common set of commands that make managing data in Python super easy.
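To make that idea concrete, here's a minimal sketch of building a small DataFrame by hand from a dict of columns (toy values for illustration, not the credit data):

```python
import pandas as pd

# Each dict key becomes a column (a Series); the values become rows
toy = pd.DataFrame({
    "age": [45, 40, 38],
    "monthly_income": [9120.0, 2600.0, 3042.0],
})
print(toy.shape)             # (3, 2) -- 3 rows, 2 columns
print(toy.columns.tolist())  # ['age', 'monthly_income']
```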
#barf!!!
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 0 to 149999
Data columns (total 11 columns):
SeriousDlqin2yrs                        150000 non-null values
RevolvingUtilizationOfUnsecuredLines    150000 non-null values
age                                     150000 non-null values
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null values
DebtRatio                               150000 non-null values
MonthlyIncome                           120269 non-null values
NumberOfOpenCreditLinesAndLoans         150000 non-null values
NumberOfTimes90DaysLate                 150000 non-null values
NumberRealEstateLoansOrLines            150000 non-null values
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null values
NumberOfDependents                      146076 non-null values
dtypes: float64(4), int64(7)
#you'll learn more about head() later
df.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300 | 5 | 0 | 0 | 0 | 0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
type(df)
pandas.core.frame.DataFrame
You can think of a Series as a column in your data. A Series will have a particular datatype associated with it. Datatypes can be integers, strings, floating point numbers, etc.
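A quick illustration of Series dtypes (toy values, not from the dataset). Note that a missing value promotes an integer column to floats:

```python
import pandas as pd

ints = pd.Series([1, 0, 0])
floats = pd.Series([1, 0, None])  # None becomes NaN, forcing float64

print(ints.dtype)    # int64
print(floats.dtype)  # float64
```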
print df.SeriousDlqin2yrs.head()
type(df.SeriousDlqin2yrs)
0    1
1    0
2    0
3    0
4    0
Name: SeriousDlqin2yrs, dtype: int64
pandas.core.series.Series
df.dtypes
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
Many of these commands will be familiar for those of you coming from a UNIX/Linux background. If not, don't sweat it, these commands are super easy, but extremely helpful for "taking a look around" at your data.
df['DebtRatio']
df.DebtRatio
0              0.802982
1              0.121876
2              0.085113
3              0.036050
4              0.024926
5              0.375607
6           5710.000000
7              0.209940
8             46.000000
9              0.606291
10             0.309476
11             0.531529
12             0.298354
13             0.382965
14           477.000000
...
149985         0.324962
149986         0.080384
149987         0.055692
149988         0.347924
149989         0.001408
149990         0.609779
149991         0.477658
149992      4132.000000
149993         0.000000
149994         0.404293
149995         0.225131
149996         0.716562
149997      3870.000000
149998         0.000000
149999         0.249908
Name: DebtRatio, Length: 150000, dtype: float64
df.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300 | 5 | 0 | 0 | 0 | 0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
df.head(1)
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
df.SeriousDlqin2yrs.head()
0    1
1    0
2    0
3    0
4    0
Name: SeriousDlqin2yrs, dtype: int64
df.tail()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
149995 | 0 | 0.040674 | 74 | 0 | 0.225131 | 2100 | 4 | 0 | 1 | 0 | 0 |
149996 | 0 | 0.299745 | 44 | 0 | 0.716562 | 5584 | 4 | 0 | 1 | 0 | 2 |
149997 | 0 | 0.246044 | 58 | 0 | 3870.000000 | NaN | 18 | 0 | 1 | 0 | 0 |
149998 | 0 | 0.000000 | 30 | 0 | 0.000000 | 5716 | 4 | 0 | 0 | 0 | 0 |
149999 | 0 | 0.850283 | 64 | 0 | 0.249908 | 8158 | 8 | 0 | 2 | 0 | 0 |
df.RevolvingUtilizationOfUnsecuredLines.tail()
149995    0.040674
149996    0.299745
149997    0.246044
149998    0.000000
149999    0.850283
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
df.describe()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 120269.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 146076.000000 |
mean | 0.066840 | 6.048438 | 52.295207 | 0.421033 | 353.005076 | 6670.221237 | 8.452760 | 0.265973 | 1.018240 | 0.240387 | 0.757222 |
std | 0.249746 | 249.755371 | 14.771866 | 4.192781 | 2037.818523 | 14384.674215 | 5.145951 | 4.169304 | 1.129771 | 4.155179 | 1.115086 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.029867 | 41.000000 | 0.000000 | 0.175074 | 3400.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.154181 | 52.000000 | 0.000000 | 0.366508 | 5400.000000 | 8.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.559046 | 63.000000 | 0.000000 | 0.868254 | 8249.000000 | 11.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 |
max | 1.000000 | 50708.000000 | 109.000000 | 98.000000 | 329664.000000 | 3008750.000000 | 58.000000 | 98.000000 | 54.000000 | 98.000000 | 20.000000 |
df.age.describe(percentile_width=25)
count    150000.000000
mean         52.295207
std          14.771866
min           0.000000
37.5%        47.000000
50%          52.000000
62.5%        57.000000
max         109.000000
dtype: float64
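Note that `percentile_width` comes from the older pandas API used in this notebook; newer pandas removed it in favor of an explicit `percentiles` argument. A small sketch of the modern equivalent, assuming a recent pandas (toy values, not the credit data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# percentiles=[0.375, 0.625] reproduces the 37.5% / 62.5% bands
# that percentile_width=25 used to give you
desc = s.describe(percentiles=[0.375, 0.625])
print(desc)
```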
df.NumberOfDependents.unique()
array([ 2., 1., 0., nan, 3., 4., 5., 6., 8., 7., 20., 10., 9., 13.])
df.NumberOfDependents.nunique()
13
pd.value_counts(df.NumberOfDependents)
df.NumberOfDependents.value_counts()
0     86902
1     26316
2     19522
3      9483
4      2862
5       746
6       158
7        51
8        24
10        5
9         5
20        1
13        1
dtype: int64
pd.value_counts(df.NumberOfDependents, ascending=True)
13        1
20        1
9         5
10        5
8        24
7        51
6       158
5       746
4      2862
3      9483
2     19522
1     26316
0     86902
dtype: int64
pd.value_counts(df.NumberOfDependents, sort=False)
0     86902
1     26316
2     19522
3      9483
4      2862
5       746
6       158
7        51
8        24
9         5
10        5
13        1
20        1
dtype: int64
#chain value_counts together with head() to give you the top 3
pd.value_counts(df.NumberOfDependents).head(3)
0    86902
1    26316
2    19522
dtype: int64
pd.value_counts(df.NumberOfDependents).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x106e8e250>
pd.crosstab(df.NumberOfTimes90DaysLate, df.SeriousDlqin2yrs)
SeriousDlqin2yrs | 0 | 1 |
---|---|---|
NumberOfTimes90DaysLate | ||
0 | 135108 | 6554 |
1 | 3478 | 1765 |
2 | 779 | 776 |
3 | 282 | 385 |
4 | 96 | 195 |
5 | 48 | 83 |
6 | 32 | 48 |
7 | 7 | 31 |
8 | 6 | 15 |
9 | 5 | 14 |
10 | 3 | 5 |
11 | 2 | 3 |
12 | 1 | 1 |
13 | 2 | 2 |
14 | 1 | 1 |
15 | 2 | 0 |
17 | 0 | 1 |
96 | 1 | 4 |
98 | 121 | 143 |
Use `pd.crosstab` to make a table with customers' ages down the lefthand column and their number of dependents across the top.
pd.crosstab(df.age, df.NumberOfDependents)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84 entries, 0 to 107
Data columns (total 13 columns):
0.0     84 non-null values
1.0     84 non-null values
2.0     84 non-null values
3.0     84 non-null values
4.0     84 non-null values
5.0     84 non-null values
6.0     84 non-null values
7.0     84 non-null values
8.0     84 non-null values
9.0     84 non-null values
10.0    84 non-null values
13.0    84 non-null values
20.0    84 non-null values
dtypes: int64(13)
Let's fix the formatting of the column names. I personally like snake_case (and so does Python). I found this handy function on Stack Overflow for converting camelCase to snake_case.
Now we can apply the camel_to_snake function to each column name.
%load https://gist.github.com/glamp/6529725/raw/e38ffd2fc4cb840be21098486ffe5df991946736/camel_to_snake.py
def camel_to_snake(column_name):
"""
converts a string that is camelCase into snake_case
Example:
print camel_to_snake("javaLovesCamelCase")
> java_loves_camel_case
See Also:
http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case
"""
s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', column_name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
camel_to_snake("javaLovesCamelCase")
'java_loves_camel_case'
df.columns = [camel_to_snake(col) for col in df.columns]
df.columns.tolist()
['serious_dlqin2yrs', 'revolving_utilization_of_unsecured_lines', 'age', 'number_of_time30-59_days_past_due_not_worse', 'debt_ratio', 'monthly_income', 'number_of_open_credit_lines_and_loans', 'number_of_times90_days_late', 'number_real_estate_loans_or_lines', 'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents']
pandas (like R) uses a system of boolean indexing. What this means is that when selecting particular rows or columns in your dataset, you can pass in a Series (or array) of True/False values, and pandas keeps only the rows marked True.
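A tiny sketch of boolean indexing on a toy frame (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({"income": [9120, 2600, 3042, 63588]})

mask = toy.income > 5000   # a boolean Series: one True/False per row
subset = toy[mask]         # keeps only the rows flagged True

print(len(subset))  # 2
```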
df['monthly_income'].head()
df.monthly_income.head()
0     9120
1     2600
2     3042
3     3300
4    63588
Name: monthly_income, dtype: float64
df[['monthly_income', 'serious_dlqin2yrs']].head()
monthly_income | serious_dlqin2yrs | |
---|---|---|
0 | 9120 | 1 |
1 | 2600 | 0 |
2 | 3042 | 0 |
3 | 3300 | 0 |
4 | 63588 | 0 |
columns_i_want = ['monthly_income', 'serious_dlqin2yrs']
df[columns_i_want].head()
monthly_income | serious_dlqin2yrs | |
---|---|---|
0 | 9120 | 1 |
1 | 2600 | 0 |
2 | 3042 | 0 |
3 | 3300 | 0 |
4 | 63588 | 0 |
df.newcolumn = 1
# this will throw an error
df['newcolumn']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-37-320c417e7351> in <module>() 1 df.newcolumn = 1 2 # this will throw an error ----> 3 df['newcolumn'] /usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key) 2001 # get column 2002 if self.columns.is_unique: -> 2003 return self._get_item_cache(key) 2004 2005 # duplicate columns /usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item) 665 return cache[item] 666 except Exception: --> 667 values = self._data.get(item) 668 res = self._box_item_values(item, values) 669 cache[item] = res /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item) 1653 def get(self, item): 1654 if self.items.is_unique: -> 1655 _, block = self._find_block(item) 1656 return block.get(item) 1657 else: /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item) 1933 1934 def _find_block(self, item): -> 1935 self._check_have(item) 1936 for i, block in enumerate(self.blocks): 1937 if item in block: /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item) 1940 def _check_have(self, item): 1941 if item not in self.items: -> 1942 raise KeyError('no item named %s' % com.pprint_thing(item)) 1943 1944 def reindex_axis(self, new_axis, method=None, axis=0, copy=True): KeyError: u'no item named newcolumn'
df['one'] = 1
df.one.head()
0    1
1    1
2    1
3    1
4    1
Name: one, dtype: int64
del df['one']
df.monthly_income > 5000
0          True
1         False
2         False
3         False
4          True
5         False
6         False
7         False
8         False
9          True
10        False
11         True
12         True
13         True
14        False
...
149985    False
149986    False
149987    False
149988     True
149989     True
149990    False
149991     True
149992    False
149993    False
149994    False
149995    False
149996     True
149997    False
149998     True
149999     True
Name: monthly_income, Length: 150000, dtype: bool
gt_5k = df[df.monthly_income > 5000]
print len(gt_5k),"people with monthly_income > 5000"
df[df.monthly_income > 5000].head()
64410 people with monthly_income > 5000
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
9 | 0 | 0.189169 | 57 | 0 | 0.606291 | 23684 | 9 | 0 | 4 | 0 | 2 |
11 | 0 | 0.018798 | 51 | 0 | 0.531529 | 6501 | 7 | 0 | 2 | 0 | 2 |
12 | 0 | 0.010352 | 46 | 0 | 0.298354 | 12454 | 13 | 0 | 2 | 0 | 2 |
df.ix[40:45, 0:2]
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | |
---|---|---|
40 | 0 | 0.719985 |
41 | 0 | 0.072898 |
42 | 0 | 0.469057 |
43 | 0 | 0.718705 |
44 | 0 | 0.368876 |
45 | 0 | 0.000603 |
mask = (df.monthly_income > 5000) & (df.serious_dlqin2yrs==1)
df[mask].head()
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
13 | 1 | 0.964673 | 40 | 3 | 0.382965 | 13700 | 9 | 3 | 1 | 1 | 2 |
74 | 1 | 0.133063 | 49 | 1 | 0.182882 | 10257 | 9 | 0 | 2 | 0 | 3 |
162 | 1 | 1.046279 | 47 | 1 | 1.104301 | 5416 | 6 | 0 | 2 | 0 | 1 |
184 | 1 | 0.085338 | 51 | 0 | 0.477504 | 6200 | 11 | 0 | 1 | 0 | 0 |
mask = (df.age >= 35) & (df.serious_dlqin2yrs==0) & (df.number_of_open_credit_lines_and_loans < 10)
len(df[mask])==76151
True
HINT: use quantile
mask = (df.monthly_income == df.monthly_income.quantile(0.90)) & (df.serious_dlqin2yrs==0)
len(df[mask])
147
One of the most frustrating parts of data science can be handling null or missing data. pandas has a lot of built-in features that make it super easy to handle missing data. The first thing we need to do is determine which fields have missing data. To do that we're going to use `pd.melt`.
Depending on the problem you're solving, you may need to rotate between having your data in wide/long format.
Wide data is probably what you think of when the word "spreadsheet" comes to mind. We're talking about data in which each row represents 1 datapoint and each value is in a particular column. This is well suited for things like modeling and producing summary statistics.

I find that having data in long format is often best for doing the same task against multiple variables. Things like plotting distributions of each variable, making frequency tables, or, in our case, determining what portion of a dataframe's variables are null.
`pd.melt` converts data from wide to long format:
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
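That example can be run as-is; here's a self-contained version with the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"],
                   "B": [1, 3, 5],
                   "C": [2, 4, 6]})

# id_vars stay as identifier columns; value_vars get unpivoted
# into (variable, value) pairs
melted = pd.melt(df, id_vars=["A"], value_vars=["B"])
print(melted)
#    A variable  value
# 0  a        B      1
# 1  b        B      3
# 2  c        B      5
```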
?pd.melt
# By not specifying id_vars, we're going to melt EVERYTHING
df_lng = pd.melt(df)
# now our data is a series of (key, value) rows.
#think of when you've done this in Excel so that you can
#create a pivot table
df_lng.head()
variable | value | |
---|---|---|
0 | serious_dlqin2yrs | 1 |
1 | serious_dlqin2yrs | 0 |
2 | serious_dlqin2yrs | 0 |
3 | serious_dlqin2yrs | 0 |
4 | serious_dlqin2yrs | 0 |
null_variables = df_lng.value.isnull()
null_variables.sum()
29731
# crosstab creates a frequency table between 2 variables
# it's going to automatically enumerate the possibilities between
# the two Series and show you a count of occurrences
#in each possible bucket
pd.crosstab(df_lng.variable, null_variables)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
# let's abstract that code into a function so we can easily
# recalculate it
def print_null_freq(df):
"""
for a given DataFrame, calculates how many values for
each variable is null and prints the resulting table to stdout
"""
df_lng = pd.melt(df)
null_variables = df_lng.value.isnull()
return pd.crosstab(df_lng.variable, null_variables)
print_null_freq(df)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
serious_dlqin2yrs variable value
0 1 age 45
1 0 age 40
2 0 age 38
3 0 age 30
4 0 age 49
... ... ... ...
299999 1 debt_ratio 0.423
300000 0 debt_ratio 0.8923
Only include values for `age` and `debt_ratio`:
melted = pd.melt(..., id_vars=[...], value_vars=[...])
print len(melted)==300000
print melted.variable.unique()==np.array(['age', 'debt_ratio'])
True [ True True]
s = pd.Series([1, 2, None, 4])
s
0     1
1     2
2   NaN
3     4
s.fillna(3)
0    1
1    2
2    3
3    4
s.ffill()
0    1
1    2
2    2
3    4
s.bfill()
0    1
1    2
2    4
3    4
s.fillna(s.mean())
0    1.000000
1    2.000000
2    2.333333
3    4.000000
If you look at `df` you can see that there are 2 columns which don't have a full 150,000 values: `monthly_income` and `number_of_dependents`. In order to incorporate these variables into our analysis, we need to specify how to treat these missing values.
For `number_of_dependents`, let's keep things simple and intuitive: if someone didn't specify how many dependents they had, then let's assume it's because they don't have any to begin with.
df.number_of_dependents = df.number_of_dependents.fillna(0)
# proof that the number_of_dependents no longer contains nulls
print_null_freq(df)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
Taking a look at `monthly_income`, we see that it's a bit more complicated than `number_of_dependents`. We have a few options for replacing missing data. We could set it to the mean or median of the dataset, but this might skew our distribution. We could also set it to 0, but this might not be right either. Instead we're going to use a technique called imputation. We'll go into this more after we take a look at scikit-learn.
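As a point of comparison before we get to real imputation, here's what the naive median-fill option mentioned above looks like (hypothetical income values, not the credit data):

```python
import numpy as np
import pandas as pd

income = pd.Series([3400.0, 5400.0, np.nan, 8249.0, np.nan])

# Naive baseline: replace each missing income with the median of
# the observed values. Simple, but it compresses the distribution
# around the center -- part of why imputation can do better.
filled = income.fillna(income.median())
print(filled.isnull().sum())  # 0
```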
df.monthly_income.describe()
count     120269.000000
mean        6670.221237
std        14384.674215
min            0.000000
25%         3400.000000
50%         5400.000000
75%         8249.000000
max      3008750.000000
df.to_csv("./data/credit-data-post-import.csv", index=False)
We used `read_csv` to load data into Python, and `head`, `tail`, `unique`, and other functions to inspect our data.