This example is actually the exercises provided by the following link: A2. Linear Regression - Data Exploration - Lending Club, which is part of the site for Learn Data Science.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv')
df.head()
Amount.Requested | Amount.Funded.By.Investors | Interest.Rate | Loan.Length | Loan.Purpose | Debt.To.Income.Ratio | State | Home.Ownership | Monthly.Income | FICO.Range | Open.CREDIT.Lines | Revolving.CREDIT.Balance | Inquiries.in.the.Last.6.Months | Employment.Length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81174 | 20000 | 20000 | 8.90% | 36 months | debt_consolidation | 14.90% | SC | MORTGAGE | 6541.67 | 735-739 | 14 | 14272 | 2 | < 1 year |
99592 | 19200 | 19200 | 12.12% | 36 months | debt_consolidation | 28.36% | TX | MORTGAGE | 4583.33 | 715-719 | 12 | 11140 | 1 | 2 years |
80059 | 35000 | 35000 | 21.98% | 60 months | debt_consolidation | 23.81% | CA | MORTGAGE | 11500.00 | 690-694 | 14 | 21977 | 1 | 2 years |
15825 | 10000 | 9975 | 9.99% | 36 months | debt_consolidation | 14.30% | KS | MORTGAGE | 3833.33 | 695-699 | 10 | 9346 | 0 | 5 years |
33182 | 12000 | 12000 | 11.71% | 36 months | credit_card | 18.78% | NJ | RENT | 3195.00 | 695-699 | 11 | 14469 | 0 | 9 years |
df.dtypes
Amount.Requested int64 Amount.Funded.By.Investors float64 Interest.Rate object Loan.Length object Loan.Purpose object Debt.To.Income.Ratio object State object Home.Ownership object Monthly.Income float64 FICO.Range object Open.CREDIT.Lines float64 Revolving.CREDIT.Balance float64 Inquiries.in.the.Last.6.Months float64 Employment.Length object dtype: object
We will find out that several columns are not regonized as numeric.
First check whether any of the rate doesn't contains %
len(df[ ( df['Interest.Rate'].str.contains('%') == False) ] ) , len(df[ ( df['Debt.To.Income.Ratio'].str.contains('%') == False) ] )
(0, 0)
remove_percent_converter = lambda x: float(x.replace('%', ''))
First check whether any of the loan doesn't contains "months".
len(df[df['Loan.Length'].str.contains('months') == False])
0
remove_month_converter = lambda x: float(x.replace('months', ''))
clean_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loansData.csv'
, converters={'Interest.Rate': remove_percent_converter
,'Debt.To.Income.Ratio': remove_percent_converter
, 'Loan.Length': remove_month_converter})
For FICO, we are trying to get the lower part of the Range as our preferred value. We will do it by:
fico_function = lambda x: float(x.split('-')[0])
clean_df['FICO.Score'] = clean_df['FICO.Range'].map(fico_function)
clean_df = clean_df.rename(columns={'Amount.Requested':'Loan.Amount'})
clean_df['NewIndex'] = pd.Series(range(1, clean_df['Amount.Funded.By.Investors'].count()+1), index=clean_df.index)
clean_df = clean_df.set_index('NewIndex')
final_columns = ["Interest.Rate","FICO.Score","Loan.Length","Monthly.Income","Loan.Amount"]
final_df = clean_df[final_columns]
final_df = final_df[final_df['Monthly.Income'] < 100000]
Getting the expected results and format it
checking_df = pd.read_csv('https://cdn.rawgit.com/benedict-chan/pytest/master/datasets/loanf.csv')
checking_df = checking_df.sort_index()
checking_df.index.name = 'NewIndex'
To see the difference of our cleaned data to the expected one.
ne_stacked = pd.concat([final_df, checking_df])
ne_stacked = ne_stacked.reset_index(drop=True)
df_gpby = ne_stacked.groupby(list(ne_stacked.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
ne_stacked.reindex(idx)
Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount | |
---|---|---|---|---|---|
4349 | 7.62 | 805 | 36 | 102750 | 33000 |
2864 | 7.43 | 800 | 36 | NaN | 5000 |