The information in a regular bank statement is presented in a very unhelpful way - simply sorted by date, with no analysis of the items to assist you to understand them.
This notebook is a quick and dirty way to process a bank statement downloaded as a .csv file and extract useful information from it so that you can:
It does this by:
tom.paton@gmail.com / https://tompaton.com/
An iPython notebook is a good way to perform this task as all of the code is processing the data and there isn't any need to building a UI or add configuration options.
To install iPython:
$ sudo apt-get install ipython-notebook python-matplotlib python-scipy python-pandas python-sympy python-nose
Run using:
$ ipython notebook --pylab inline
First, the python environment needs to be configured:
import re
import matplotlib.pyplot as plt
import matplotlib.ticker as pltticker
import numpy as np
import pandas as pd
from datetime import date, timedelta
def compose(*funcs):
return lambda x: reduce(lambda acc, f: f(acc), funcs, x)
pd.options.display.max_colwidth = 0
pd.options.display.show_dimensions = False
def line_break(s, n=40): # line break every n chars
return re.sub('(.{%d})' % n, '\\1\n', s)
This function will be used to plot the histograms of varying items. I'm not familiar with matplotlib so there are probably better ways to do this.
def expense_item_hist(row):
amounts = pd.Series(row['amounts']).map(abs)
# histogram of amounts
counts, bins, patches = plt.hist(amounts, color=['lightgray'])
# colour current amount bar red
for patch, rightside, leftside in zip(patches, bins[1:], bins[:-1]):
if leftside <= abs(row['Amount']) <= rightside:
patch.set_facecolor('red')
# add probability density estimate
amounts.plot(kind="kde", secondary_y=True, style=':')
plt.axis('off')
# desc on top row
# date - amount - catg on second row
plt.title('%s\n%s' % (line_break(row['desc']),
row['Entered Date'].date()),
loc='left')
plt.title('$%.2f' % abs(row['Amount']),
loc='center')
plt.title(row['catg'], loc='right')
def expense_histograms(table, ncols=3, width=16, height=3, hspace=1):
nrows = ceil(len(table.index) / float(ncols))
plt.figure(figsize=(width, nrows*height))
plt.subplots_adjust(hspace=hspace)
# TODO: axis scale in $10, $100, $1000, $10000 increments
#loc = pltticker.MaxNLocator()
for i, (idx, row) in enumerate(table.iterrows()):
ax = plt.subplot(nrows, ncols, i+1)
#ax.xaxis.set_major_locator(loc)
expense_item_hist(row)
You'll need to supply a .csv file downloaded from your bank and edit the column names referenced in the code to match.
TODO: you'll probably get a new .csv file each time you check your statement, all of which will need to be loaded in and any duplicates where the files overlap will need to be removed.
statement_csv = "data/eg-statement.csv"
!head {statement_csv}
statement = pd.read_csv(statement_csv, parse_dates=[0, 1], dayfirst=True)
statement['Entered Date'].describe()
count 257 unique 171 first 2013-04-01 00:00:00 last 2014-05-08 00:00:00 top 2013-09-03 00:00:00 freq 5 Name: Entered Date, dtype: object
The processing basically strips reference and receipt numbers from the transaction descriptions so that items to the same payee can be matched together.
At the bottom of the notebook, any left over descriptions will be listed so you can see if additional clean up substitutions are required.
#statement.head()
#statement['Transaction Description'].value_counts()
def re_sub(regex, repl): return lambda desc: re.sub(regex, repl, desc)
clean_up = [lambda desc: desc.lower(),
re_sub('\s+', ' '),
re_sub('atm#\d+', '<atm>'),
re_sub('atm owner fee \d+', '<atm owner fee>'),
re_sub('pos( including cash out)?\s*#\d+', '<pos>'),
re_sub('receipt no \d+', '<receipt no>'),
re_sub('crn#\d+', '<crn>'),
re_sub('\(?ref\.?\d+\)?', '<ref>'),
re_sub('(au|gb|us)#\d+', '<card>')]
statement['desc'] = statement['Transaction Description'].map(compose(*clean_up))
#statement.head()
It also helps if similar items are grouped together, so you'll want to add your own bills etc to this list:
def categorise(desc):
if '<atm>' in desc: return 'atm'
if '<pos>' in desc: return 'pos'
if 'paypal' in desc: return 'paypal'
if 'bpay' in desc: return 'bpay'
if '<atm owner fee>' in desc: return 'fee'
if 'transaction fee' in desc: return 'fee'
if 'visa access card fee' in desc: return 'fee'
if 'member net transfer' in desc: return 'transfer'
if 'visa-internode' in desc: return 'bill'
if 'visa-linode.com' in desc: return 'bill'
if 'visa-citylink' in desc: return 'bill'
if 'visa-virgin mobile' in desc: return 'bill'
return ''
statement['catg'] = statement['desc'].map(categorise)
This code finds recurring items by grouping matching descriptions and then splits the statement into debits, credits etc.
stats = statement.groupby('desc')['Amount'].agg({#'total' : np.sum, 'average' : np.mean,
#'min': np.min, 'max': np.max,
'count': np.count_nonzero,
'amounts': lambda amounts: list(amounts)})
last_date = statement['Entered Date'].describe()['last']
start_date = last_date - timedelta(days=15)
table = statement.join(stats, on='desc')[statement['Entered Date'].gt(start_date)]
cols = ['Entered Date', 'catg', 'desc', 'Amount']
table.sort(['catg'], inplace=True)
new = table['count'].eq(1) # only one instance of item description seen
not_new = new.eq(False)
debit = table['Amount'].lt(0)
credit = table['Amount'].gt(0)
# only consider it recurring if we've seen it more than twice,
# otherwise it could be a doubled-up transaction we want to catch
table['recurring'] = [amounts.count(amount) > 1 and len(amounts) > 2
for amount, amounts in zip(table['Amount'], table['amounts'])]
recurring = table['recurring'].eq(True)
not_recurring = table['recurring'].eq(False) & not_new
TODO: It would be nice to be able to add notes on items to avoid wasting time investigating the weird ones more than once. This could be a simple dict of dict of Entered Date+Transaction Description --> string
This is the useful bit, and you have to ask yourself: why can't the bank do something like this?
The first thing to check for are debits that haven't appeared before - separating them out makes it easy.
table[cols][debit & new]
Entered Date | catg | desc | Amount | |
---|---|---|---|---|
1 | 2014-05-06 | visa-masters 7722 hawthorn east<card><ref> | -70.78 | |
4 | 2014-05-04 | visa-act*trailsplus macedon surry hills <card><ref> | -50.00 | |
8 | 2014-05-02 | visa-target melbourne city vic <card><ref> | -29.00 | |
14 | 2014-04-24 | visa-australia post 328874 carnegie <card><ref> | -15.60 | |
9 | 2014-05-01 | bpay | internet bpay to yarra valley water - biller code 123456 - <receipt no> <crn> | -114.68 |
13 | 2014-04-24 | paypal | visa-paypal *hl5pts 123456789 <card><ref> | -199.30 |
11 | 2014-04-30 | pos | <pos>-jaycar electronics melbourne au | -69.95 |
These payees have been seen before but the specific amount hasn't. The charts let you see at a glance which are truly problematic and which are normal variations.
expense_histograms(table[cols+['amounts']][debit & not_recurring])
These items have all been seen before, mostly here so everything is included, but it may be worth checking that recurring payments haven't been missed accidentally.
An item must be seen 3 or more times to show here, so that accidentally doubled up payments aren't ignored.
table[cols][debit & recurring]
Entered Date | catg | desc | Amount | |
---|---|---|---|---|
7 | 2014-05-02 | atm | <atm>-melbourne - 330 collins street | -200.00 |
2 | 2014-05-06 | bill | visa-virgin mobile h/o macquarie prk<card><ref> | -19.19 |
10 | 2014-04-30 | fee | visa access card fee <va -123456 > | -1.65 |
The same breakdown but this time for credits.
table[cols][credit & new]
expense_histograms(table[cols+['amounts']][credit & not_recurring])
table[cols][credit & recurring]
TODO: might be nice to show details of each (non-unique) item desc - average, totals, charts, frequency...
The following are the unique item descriptions, check through these to see if any additional steps need to be added to clean_up.
left_overs = sorted(statement['desc'].unique())
len(left_overs), left_overs
(95, ['<atm owner fee> cba\\melbourne au', '<atm owner fee> westpacbox hill 2 o/s box hill 3 au', '<atm>-cba\\melbourne au', '<atm>-chadstone - 1341 dandenong rd', ...snip... 'visa-ww petrol 3093 camberwell <card><ref>'])