import pandas as pd
s = pd.read_csv('sample.csv')
Here's some rather messy data:
s
Category | Wholenum | Decimal | Year | |
---|---|---|---|---|
0 | A | 2 | 0.6 | 20014 |
1 | A | <1 | 5 | 2013 |
2 | B | 3 | 4 | 2012 |
3 | B | four | -2 | 2011 |
4 | C | -5 | >0 | 210 |
5 | c | NaN | NaN | 2009 |
6 rows × 4 columns
To install the pandawash extension, run:
%install_ext https://github.com/takluyver/pandawash/raw/master/pandawash.py
%load_ext pandawash
The first column is supposed to contain a limited set of categories, but the names are inconsistent. %groupcol
creates a cell like the one below, which you edit to make some values equivalent. Here, we've replaced lowercase c
with capital C
.
%groupcol s.Category
# Modify the mapping below to group equivalent values
s.Category = s.Category.map({
'A': 'A',
'B': 'B',
'C': 'C',
'c': 'C',
})
Some of the values in our numeric columns aren't proper numbers. %numbercol
finds which values can't be converted to the specified type (int
or float
), and creates code to replace them. You fill in the replacement values.
%numbercol s.Wholenum int
# Fill in the replacements below
s.Wholenum = s.Wholenum\
.fillna(value=0)\
.replace('four', '4')\
.replace('<1', '1')\
.astype(int)
%numbercol s.Decimal float
# Fill in the replacements below
s.Decimal = s.Decimal\
.replace('>0', '0.1')\
.astype(float)
Some of the years are clearly unreasonable. Using %boundscheckcol
, we can clean up anything outside a defined range.
%boundscheckcol s.Year 1900-2100
# Fill in the replacements below
s.Year = s.Year\
.replace(20014, 2014)\
.replace(210, 2010)
Here's the cleaned up DataFrame. The numeric columns now have appropriate datatypes, so you can easily use them in calculations.
s
Category | Wholenum | Decimal | Year | |
---|---|---|---|---|
0 | A | 2 | 0.6 | 2014 |
1 | A | 1 | 5.0 | 2013 |
2 | B | 3 | 4.0 | 2012 |
3 | B | 4 | -2.0 | 2011 |
4 | C | -5 | 0.1 | 2010 |
5 | C | 0 | NaN | 2009 |
6 rows × 4 columns