Data basics: loading data, looking at your data, basic commands
pandas
import pandas as pd
import pylab as pl
import numpy as np
import re
np.sum
<function numpy.core.fromnumeric.sum>
We're going to use the `read_csv` function in pandas.
?pd.read_csv
! head -n 2 ./data/credit-training.csv
SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,0.76612660900000007,45,2,0.8029821290000001,9120.0,13,0,6,0,2.0
df = pd.read_csv("./data/credit-training.csv")
df
Our data is represented by a DataFrame. You can think of a DataFrame as a giant spreadsheet which you can program: a collection of Series (or columns) with a common set of commands that make managing data in Python super easy.
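To make that idea concrete, here's a minimal sketch of building a small DataFrame by hand from a dict of columns (toy values for illustration, not the credit data):

```python
import pandas as pd

# Each dict key becomes a column (a Series); the values become rows
toy = pd.DataFrame({
    "age": [45, 40, 38],
    "monthly_income": [9120.0, 2600.0, 3042.0],
})
print(toy.shape)             # (3, 2) -- 3 rows, 2 columns
print(toy.columns.tolist())  # ['age', 'monthly_income']
```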
#barf!!!
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 0 to 149999
Data columns (total 11 columns):
SeriousDlqin2yrs                        150000 non-null values
RevolvingUtilizationOfUnsecuredLines    150000 non-null values
age                                     150000 non-null values
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null values
DebtRatio                               150000 non-null values
MonthlyIncome                           120269 non-null values
NumberOfOpenCreditLinesAndLoans         150000 non-null values
NumberOfTimes90DaysLate                 150000 non-null values
NumberRealEstateLoansOrLines            150000 non-null values
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null values
NumberOfDependents                      146076 non-null values
dtypes: float64(4), int64(7)
#you'll learn more about head() later
df.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300 | 5 | 0 | 0 | 0 | 0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
type(df)
pandas.core.frame.DataFrame
You can think of a Series as a column in your data. A Series will have a particular datatype associated with it. Datatypes can be integers, strings, floating point numbers, etc.
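A quick illustration of Series dtypes (toy values, not from the dataset). Note that a missing value promotes an integer column to floats:

```python
import pandas as pd

ints = pd.Series([1, 0, 0])
floats = pd.Series([1, 0, None])  # None becomes NaN, forcing float64

print(ints.dtype)    # int64
print(floats.dtype)  # float64
```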
print df.SeriousDlqin2yrs.head()
type(df.SeriousDlqin2yrs)
0    1
1    0
2    0
3    0
4    0
Name: SeriousDlqin2yrs, dtype: int64
pandas.core.series.Series
df.dtypes
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
Many of these commands will be familiar for those of you coming from a UNIX/Linux background. If not, don't sweat it, these commands are super easy, but extremely helpful for "taking a look around" at your data.
df['DebtRatio']
df.DebtRatio
0              0.802982
1              0.121876
2              0.085113
3              0.036050
4              0.024926
5              0.375607
6           5710.000000
7              0.209940
8             46.000000
9              0.606291
10             0.309476
11             0.531529
12             0.298354
13             0.382965
14           477.000000
...
149985         0.324962
149986         0.080384
149987         0.055692
149988         0.347924
149989         0.001408
149990         0.609779
149991         0.477658
149992      4132.000000
149993         0.000000
149994         0.404293
149995         0.225131
149996         0.716562
149997      3870.000000
149998         0.000000
149999         0.249908
Name: DebtRatio, Length: 150000, dtype: float64
df.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300 | 5 | 0 | 0 | 0 | 0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
df.head(1)
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
df.SeriousDlqin2yrs.head()
0    1
1    0
2    0
3    0
4    0
Name: SeriousDlqin2yrs, dtype: int64
df.tail()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
149995 | 0 | 0.040674 | 74 | 0 | 0.225131 | 2100 | 4 | 0 | 1 | 0 | 0 |
149996 | 0 | 0.299745 | 44 | 0 | 0.716562 | 5584 | 4 | 0 | 1 | 0 | 2 |
149997 | 0 | 0.246044 | 58 | 0 | 3870.000000 | NaN | 18 | 0 | 1 | 0 | 0 |
149998 | 0 | 0.000000 | 30 | 0 | 0.000000 | 5716 | 4 | 0 | 0 | 0 | 0 |
149999 | 0 | 0.850283 | 64 | 0 | 0.249908 | 8158 | 8 | 0 | 2 | 0 | 0 |
df.RevolvingUtilizationOfUnsecuredLines.tail()
149995    0.040674
149996    0.299745
149997    0.246044
149998    0.000000
149999    0.850283
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
df.describe()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 120269.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 146076.000000 |
mean | 0.066840 | 6.048438 | 52.295207 | 0.421033 | 353.005076 | 6670.221237 | 8.452760 | 0.265973 | 1.018240 | 0.240387 | 0.757222 |
std | 0.249746 | 249.755371 | 14.771866 | 4.192781 | 2037.818523 | 14384.674215 | 5.145951 | 4.169304 | 1.129771 | 4.155179 | 1.115086 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.029867 | 41.000000 | 0.000000 | 0.175074 | 3400.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.154181 | 52.000000 | 0.000000 | 0.366508 | 5400.000000 | 8.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.559046 | 63.000000 | 0.000000 | 0.868254 | 8249.000000 | 11.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 |
max | 1.000000 | 50708.000000 | 109.000000 | 98.000000 | 329664.000000 | 3008750.000000 | 58.000000 | 98.000000 | 54.000000 | 98.000000 | 20.000000 |
df.age.describe(percentile_width=25)
count    150000.000000
mean         52.295207
std          14.771866
min           0.000000
37.5%        47.000000
50%          52.000000
62.5%        57.000000
max         109.000000
dtype: float64
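Note that `percentile_width` comes from the older pandas API used in this notebook; newer pandas removed it in favor of an explicit `percentiles` argument. A small sketch of the modern equivalent, assuming a recent pandas (toy values, not the credit data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# percentiles=[0.375, 0.625] reproduces the 37.5% / 62.5% bands
# that percentile_width=25 used to give you
desc = s.describe(percentiles=[0.375, 0.625])
print(desc)
```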
df.NumberOfDependents.unique()
array([ 2., 1., 0., nan, 3., 4., 5., 6., 8., 7., 20., 10., 9., 13.])
df.NumberOfDependents.nunique()
13
pd.value_counts(df.NumberOfDependents)
df.NumberOfDependents.value_counts()
0     86902
1     26316
2     19522
3      9483
4      2862
5       746
6       158
7        51
8        24
10        5
9         5
20        1
13        1
dtype: int64
pd.value_counts(df.NumberOfDependents, ascending=True)
13        1
20        1
9         5
10        5
8        24
7        51
6       158
5       746
4      2862
3      9483
2     19522
1     26316
0     86902
dtype: int64
pd.value_counts(df.NumberOfDependents, sort=False)
0     86902
1     26316
2     19522
3      9483
4      2862
5       746
6       158
7        51
8        24
9         5
10        5
13        1
20        1
dtype: int64
#chain value_counts together with head() to give you the top 3
pd.value_counts(df.NumberOfDependents).head(3)
0    86902
1    26316
2    19522
dtype: int64
pd.value_counts(df.NumberOfDependents).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x106e8e250>
pd.crosstab(df.NumberOfTimes90DaysLate, df.SeriousDlqin2yrs)
SeriousDlqin2yrs | 0 | 1 |
---|---|---|
NumberOfTimes90DaysLate | ||
0 | 135108 | 6554 |
1 | 3478 | 1765 |
2 | 779 | 776 |
3 | 282 | 385 |
4 | 96 | 195 |
5 | 48 | 83 |
6 | 32 | 48 |
7 | 7 | 31 |
8 | 6 | 15 |
9 | 5 | 14 |
10 | 3 | 5 |
11 | 2 | 3 |
12 | 1 | 1 |
13 | 2 | 2 |
14 | 1 | 1 |
15 | 2 | 0 |
17 | 0 | 1 |
96 | 1 | 4 |
98 | 121 | 143 |
Use `pd.crosstab` to make a table with customers' ages down the lefthand column and their number of dependents across the top.
pd.crosstab(df.age, df.NumberOfDependents)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84 entries, 0 to 107
Data columns (total 13 columns):
0.0     84 non-null values
1.0     84 non-null values
2.0     84 non-null values
3.0     84 non-null values
4.0     84 non-null values
5.0     84 non-null values
6.0     84 non-null values
7.0     84 non-null values
8.0     84 non-null values
9.0     84 non-null values
10.0    84 non-null values
13.0    84 non-null values
20.0    84 non-null values
dtypes: int64(13)
Let's fix the formatting of the column names. I personally like snake_case (and so does Python). I found this handy function on Stack Overflow for converting camelCase to snake_case.
Now we can apply the camel_to_snake function to each column name.
%load https://gist.github.com/glamp/6529725/raw/e38ffd2fc4cb840be21098486ffe5df991946736/camel_to_snake.py
def camel_to_snake(column_name):
"""
converts a string that is camelCase into snake_case
Example:
print camel_to_snake("javaLovesCamelCase")
> java_loves_camel_case
See Also:
http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case
"""
s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', column_name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
camel_to_snake("javaLovesCamelCase")
'java_loves_camel_case'
df.columns = [camel_to_snake(col) for col in df.columns]
df.columns.tolist()
['serious_dlqin2yrs', 'revolving_utilization_of_unsecured_lines', 'age', 'number_of_time30-59_days_past_due_not_worse', 'debt_ratio', 'monthly_income', 'number_of_open_credit_lines_and_loans', 'number_of_times90_days_late', 'number_real_estate_loans_or_lines', 'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents']
pandas (like R) uses a system of boolean indexing. What this means is that when selecting particular rows or columns in your dataset, you can pass in a Series (or array) of True/False values, and pandas keeps only the rows marked True.
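A tiny sketch of boolean indexing on a toy frame (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({"income": [9120, 2600, 3042, 63588]})

mask = toy.income > 5000   # a boolean Series: one True/False per row
subset = toy[mask]         # keeps only the rows flagged True

print(len(subset))  # 2
```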
df['monthly_income'].head()
df.monthly_income.head()
0     9120
1     2600
2     3042
3     3300
4    63588
Name: monthly_income, dtype: float64
df[['monthly_income', 'serious_dlqin2yrs']].head()
monthly_income | serious_dlqin2yrs | |
---|---|---|
0 | 9120 | 1 |
1 | 2600 | 0 |
2 | 3042 | 0 |
3 | 3300 | 0 |
4 | 63588 | 0 |
columns_i_want = ['monthly_income', 'serious_dlqin2yrs']
df[columns_i_want].head()
monthly_income | serious_dlqin2yrs | |
---|---|---|
0 | 9120 | 1 |
1 | 2600 | 0 |
2 | 3042 | 0 |
3 | 3300 | 0 |
4 | 63588 | 0 |
df.newcolumn = 1
# this will throw an error
df['newcolumn']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-37-320c417e7351> in <module>() 1 df.newcolumn = 1 2 # this will throw an error ----> 3 df['newcolumn'] /usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key) 2001 # get column 2002 if self.columns.is_unique: -> 2003 return self._get_item_cache(key) 2004 2005 # duplicate columns /usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item) 665 return cache[item] 666 except Exception: --> 667 values = self._data.get(item) 668 res = self._box_item_values(item, values) 669 cache[item] = res /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item) 1653 def get(self, item): 1654 if self.items.is_unique: -> 1655 _, block = self._find_block(item) 1656 return block.get(item) 1657 else: /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item) 1933 1934 def _find_block(self, item): -> 1935 self._check_have(item) 1936 for i, block in enumerate(self.blocks): 1937 if item in block: /usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item) 1940 def _check_have(self, item): 1941 if item not in self.items: -> 1942 raise KeyError('no item named %s' % com.pprint_thing(item)) 1943 1944 def reindex_axis(self, new_axis, method=None, axis=0, copy=True): KeyError: u'no item named newcolumn'
df['one'] = 1
df.one.head()
0    1
1    1
2    1
3    1
4    1
Name: one, dtype: int64
del df['one']
df.monthly_income > 5000
0          True
1         False
2         False
3         False
4          True
5         False
6         False
7         False
8         False
9          True
10        False
11         True
12         True
13         True
14        False
...
149985    False
149986    False
149987    False
149988     True
149989     True
149990    False
149991     True
149992    False
149993    False
149994    False
149995    False
149996     True
149997    False
149998     True
149999     True
Name: monthly_income, Length: 150000, dtype: bool
gt_5k = df[df.monthly_income > 5000]
print len(gt_5k),"people with monthly_income > 5000"
df[df.monthly_income > 5000].head()
64410 people with monthly_income > 5000
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
9 | 0 | 0.189169 | 57 | 0 | 0.606291 | 23684 | 9 | 0 | 4 | 0 | 2 |
11 | 0 | 0.018798 | 51 | 0 | 0.531529 | 6501 | 7 | 0 | 2 | 0 | 2 |
12 | 0 | 0.010352 | 46 | 0 | 0.298354 | 12454 | 13 | 0 | 2 | 0 | 2 |
df.ix[40:45, 0:2]
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | |
---|---|---|
40 | 0 | 0.719985 |
41 | 0 | 0.072898 |
42 | 0 | 0.469057 |
43 | 0 | 0.718705 |
44 | 0 | 0.368876 |
45 | 0 | 0.000603 |
mask = (df.monthly_income > 5000) & (df.serious_dlqin2yrs==1)
df[mask].head()
serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
13 | 1 | 0.964673 | 40 | 3 | 0.382965 | 13700 | 9 | 3 | 1 | 1 | 2 |
74 | 1 | 0.133063 | 49 | 1 | 0.182882 | 10257 | 9 | 0 | 2 | 0 | 3 |
162 | 1 | 1.046279 | 47 | 1 | 1.104301 | 5416 | 6 | 0 | 2 | 0 | 1 |
184 | 1 | 0.085338 | 51 | 0 | 0.477504 | 6200 | 11 | 0 | 1 | 0 | 0 |
mask = (df.age >= 35) & (df.serious_dlqin2yrs==0) & (df.number_of_open_credit_lines_and_loans < 10)
len(df[mask])==76151
True
HINT: use quantile
mask = (df.monthly_income == df.monthly_income.quantile(0.90)) & (df.serious_dlqin2yrs==0)
len(df[mask])
147
One of the most frustrating parts of data science can be handling null or missing data. pandas has a lot of built-in features that make it super easy to handle missing data. The first thing we need to do is determine which fields have missing data. To do that we're going to use `pd.melt`.
Depending on the problem you're solving, you may need to rotate between having your data in wide/long format.
Wide data is probably what you think of when the word "spreadsheet" comes to mind. We're talking about data in which each row represents 1 datapoint and each value is in a particular column. This is well suited for things like modeling and producing summary statistics.

I find that having data in long format is often best for doing the same task against multiple variables. Things like plotting distributions of each variable, making frequency tables, or, in our case, determining what portion of a dataframe's variables are null.
`pd.melt` converts data from wide to long format:
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
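That example can be run as-is; here's a self-contained version with the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"],
                   "B": [1, 3, 5],
                   "C": [2, 4, 6]})

# id_vars stay as identifier columns; value_vars get unpivoted
# into (variable, value) pairs
melted = pd.melt(df, id_vars=["A"], value_vars=["B"])
print(melted)
#    A variable  value
# 0  a        B      1
# 1  b        B      3
# 2  c        B      5
```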
?pd.melt
# By not specifying id_vars, we're going to melt EVERYTHING
df_lng = pd.melt(df)
# now our data is a series of (key, value) rows.
#think of when you've done this in Excel so that you can
#create a pivot table
df_lng.head()
variable | value | |
---|---|---|
0 | serious_dlqin2yrs | 1 |
1 | serious_dlqin2yrs | 0 |
2 | serious_dlqin2yrs | 0 |
3 | serious_dlqin2yrs | 0 |
4 | serious_dlqin2yrs | 0 |
null_variables = df_lng.value.isnull()
null_variables.sum()
29731
# crosstab creates a frequency table between 2 variables
# it's going to automatically enumerate the possibilities between
# the two Series and show you a count of occurrences
#in each possible bucket
pd.crosstab(df_lng.variable, null_variables)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
# let's abstract that code into a function so we can easily
# recalculate it
def print_null_freq(df):
"""
for a given DataFrame, calculates how many values for
each variable is null and prints the resulting table to stdout
"""
df_lng = pd.melt(df)
null_variables = df_lng.value.isnull()
return pd.crosstab(df_lng.variable, null_variables)
print_null_freq(df)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
serious_dlqin2yrs variable value
0 1 age 45
1 0 age 40
2 0 age 38
3 0 age 30
4 0 age 49
... ... ... ...
299999 1 debt_ratio 0.423
300000 0 debt_ratio 0.8923
Only include values for `age` and `debt_ratio`:
melted = pd.melt(..., id_vars=[...], value_vars=[...])
print len(melted)==300000
print melted.variable.unique()==np.array(['age', 'debt_ratio'])
True [ True True]
s = pd.Series([1, 2, None, 4])
s
0     1
1     2
2   NaN
3     4
s.fillna(3)
0    1
1    2
2    3
3    4
s.ffill()
0    1
1    2
2    2
3    4
s.bfill()
0    1
1    2
2    4
3    4
s.fillna(s.mean())
0    1.000000
1    2.000000
2    2.333333
3    4.000000
If you look at `df` you can see that there are 2 columns which don't have a full 150,000 values: `monthly_income` and `number_of_dependents`. In order to incorporate these variables into our analysis, we need to specify how to treat these missing values.
For `number_of_dependents`, let's keep things simple and intuitive: if someone didn't specify how many dependents they had, then let's assume it's because they don't have any to begin with.
df.number_of_dependents = df.number_of_dependents.fillna(0)
# proof that the number_of_dependents no longer contains nulls
print_null_freq(df)
value | False | True |
---|---|---|
variable | ||
age | 150000 | 0 |
debt_ratio | 150000 | 0 |
monthly_income | 120269 | 29731 |
number_of_dependents | 150000 | 0 |
number_of_open_credit_lines_and_loans | 150000 | 0 |
number_of_time30-59_days_past_due_not_worse | 150000 | 0 |
number_of_time60-89_days_past_due_not_worse | 150000 | 0 |
number_of_times90_days_late | 150000 | 0 |
number_real_estate_loans_or_lines | 150000 | 0 |
revolving_utilization_of_unsecured_lines | 150000 | 0 |
serious_dlqin2yrs | 150000 | 0 |
Taking a look at `monthly_income`, we see that it's a bit more complicated than `number_of_dependents`. We have a few options for replacing missing data. We could set it to the mean or median of the dataset, but this might skew our distribution. We could also set it to 0, but this might not be right either. Instead we're going to use a technique called imputation. We'll go into this more after we take a look at scikit-learn.
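As a point of comparison before we get to real imputation, here's what the naive median-fill option mentioned above looks like (hypothetical income values, not the credit data):

```python
import numpy as np
import pandas as pd

income = pd.Series([3400.0, 5400.0, np.nan, 8249.0, np.nan])

# Naive baseline: replace each missing income with the median of
# the observed values. Simple, but it compresses the distribution
# around the center -- part of why imputation can do better.
filled = income.fillna(income.median())
print(filled.isnull().sum())  # 0
```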
df.monthly_income.describe()
count     120269.000000
mean        6670.221237
std        14384.674215
min            0.000000
25%         3400.000000
50%         5400.000000
75%         8249.000000
max      3008750.000000
df.to_csv("./data/credit-data-post-import.csv", index=False)
We used `read_csv` to load data into Python, and `head`, `tail`, `unique`, and other functions to inspect our data.