Example of cleaning up data in Pandas¶

This example is actually the exercises provided by the following link: A2. Linear Regression - Data Exploration - Lending Club, which is part of the site for Learn Data Science.

First import ipython settings¶

In [43]:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Import the data to be cleaned and inspect them¶

In [44]:

df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv')

In [45]:

df.head()

Out[45]:

	Amount.Requested	Amount.Funded.By.Investors	Interest.Rate	Loan.Length	Loan.Purpose	Debt.To.Income.Ratio	State	Home.Ownership	Monthly.Income	FICO.Range	Open.CREDIT.Lines	Revolving.CREDIT.Balance	Inquiries.in.the.Last.6.Months	Employment.Length
81174	20000	20000	8.90%	36 months	debt_consolidation	14.90%	SC	MORTGAGE	6541.67	735-739	14	14272	2	< 1 year
99592	19200	19200	12.12%	36 months	debt_consolidation	28.36%	TX	MORTGAGE	4583.33	715-719	12	11140	1	2 years
80059	35000	35000	21.98%	60 months	debt_consolidation	23.81%	CA	MORTGAGE	11500.00	690-694	14	21977	1	2 years
15825	10000	9975	9.99%	36 months	debt_consolidation	14.30%	KS	MORTGAGE	3833.33	695-699	10	9346	0	5 years
33182	12000	12000	11.71%	36 months	credit_card	18.78%	NJ	RENT	3195.00	695-699	11	14469	0	9 years

In [46]:

df.dtypes

Out[46]:

Amount.Requested                    int64
Amount.Funded.By.Investors        float64
Interest.Rate                      object
Loan.Length                        object
Loan.Purpose                       object
Debt.To.Income.Ratio               object
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

We will find out that several columns are not regonized as numeric.

Fix the Interest.Rate and Debt.To.Income.Ratio¶

First check whether any of the rate doesn't contains %

In [47]:

len(df[ ( df['Interest.Rate'].str.contains('%') == False) ] ) , len(df[ ( df['Debt.To.Income.Ratio'].str.contains('%') == False) ] )

Out[47]:

(0, 0)

In [48]:

remove_percent_converter = lambda x: float(x.replace('%', ''))

Fix Loan.Length¶

First check whether any of the loan doesn't contains "months".

In [49]:

len(df[df['Loan.Length'].str.contains('months') == False])

Out[49]:

In [50]:

remove_month_converter = lambda x: float(x.replace('months', ''))

Apply the converters¶

In [51]:

clean_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv'
	, converters={'Interest.Rate': remove_percent_converter
	,'Debt.To.Income.Ratio': remove_percent_converter
	, 'Loan.Length': remove_month_converter})

Fix FICO.Score¶

For FICO, we are trying to get the lower part of the Range as our preferred value. We will do it by:

Creating a function to get the lower part of the Range
Create a new column to store this value

In [52]:

fico_function = lambda x: float(x.split('-')[0])
clean_df['FICO.Score'] = clean_df['FICO.Range'].map(fico_function)

Finalize the result¶

Rename column¶

In [53]:

clean_df = clean_df.rename(columns={'Amount.Requested':'Loan.Amount'})

Create a new index column¶

In [54]:

clean_df['NewIndex'] = pd.Series(range(1, clean_df['Amount.Funded.By.Investors'].count()+1), index=clean_df.index)
clean_df = clean_df.set_index('NewIndex')

Filter only the columns we need¶

In [55]:

final_columns = ["Interest.Rate","FICO.Score","Loan.Length","Monthly.Income","Loan.Amount"]
final_df = clean_df[final_columns]

Remove extremes (strange monthly income)¶

In [56]:

final_df = final_df[final_df['Monthly.Income'] < 100000]

Comparing to the expected results¶

Getting the expected results and format it

In [57]:

checking_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loanf.csv')
checking_df = checking_df.sort_index()
checking_df.index.name = 'NewIndex'

Some magic¶

To see the difference of our cleaned data to the expected one.

In [62]:

ne_stacked = pd.concat([final_df, checking_df])
ne_stacked = ne_stacked.reset_index(drop=True)
df_gpby = ne_stacked.groupby(list(ne_stacked.columns))

idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

ne_stacked.reindex(idx)

Out[62]:

	Interest.Rate	FICO.Score	Loan.Length	Monthly.Income	Loan.Amount
4349	7.62	805	36	102750	33000
2864	7.43	800	36	NaN	5000