In [1]:

import pandas as pd
s = pd.read_csv('sample.csv')

Here's some rather messy data:

In [2]:

Out[2]:

	Category	Wholenum	Decimal	Year
0	A	2	0.6	20014
1	A	<1	5	2013
2	B	3	4	2012
3	B	four	-2	2011
4	C	-5	>0	210
5	c	NaN	NaN	2009

6 rows × 4 columns

To install the pandawash extension, run:

%install_ext https://github.com/takluyver/pandawash/raw/master/pandawash.py

In [3]:

%load_ext pandawash

The first column is supposed to contain a limited set of categories, but the names are inconsistent. %groupcol creates a cell like the one below, which you edit to make some values equivalent. Here, we've replaced lowercase c with capital C.

In [4]:

%groupcol s.Category

In [5]:

# Modify the mapping below to group equivalent values
s.Category = s.Category.map({
  'A': 'A',
  'B': 'B',
  'C': 'C',
  'c': 'C',
})

Some of the values in our numeric columns aren't proper numbers. %numbercol finds which values can't be converted to the specified type (int or float), and creates code to replace them. You fill in the replacement values.

In [6]:

%numbercol s.Wholenum int

In [7]:

# Fill in the replacements below
s.Wholenum = s.Wholenum\
    .fillna(value=0)\
    .replace('four', '4')\
    .replace('<1', '1')\
  .astype(int)

In [8]:

%numbercol s.Decimal float

In [9]:

# Fill in the replacements below
s.Decimal = s.Decimal\
    .replace('>0', '0.1')\
  .astype(float)

Some of the years are clearly unreasonable. Using %boundscheckcol, we can clean up anything outside a defined range.

In [10]:

%boundscheckcol s.Year 1900-2100

In [11]:

# Fill in the replacements below
s.Year = s.Year\
  .replace(20014, 2014)\
  .replace(210, 2010)

Here's the cleaned up DataFrame. The numeric columns now have appropriate datatypes, so you can easily use them in calculations.

In [12]:

Out[12]:

	Category	Wholenum	Decimal	Year
0	A	2	0.6	2014
1	A	1	5.0	2013
2	B	3	4.0	2012
3	B	4	-2.0	2011
4	C	-5	0.1	2010
5	C	0	NaN	2009

6 rows × 4 columns