%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
# redefining the example objects
# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Setting the index to the country names:
countries = countries.set_index('country')
countries
One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex than in numpy. We now have to distinguish between selection by label and selection by position.
data[] provides some convenience shortcuts
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
countries['area']
or multiple columns:
countries[['area', 'population']]
But slicing accesses the rows:
countries['France':'Netherlands']
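So as a summary, [] provides the following convenience shortcuts:
Series, selecting a label: s[label]
DataFrame, selecting a single column or multiple columns: df['col'] or df[['col1', 'col2']]
DataFrame, selecting rows with a label slice or a boolean mask: df['row_label1':'row_label2'] or df[mask]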
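The shortcuts summarized here can be checked on a small frame. The following is a minimal sketch that rebuilds a reduced version of the countries frame so the block runs on its own; the variable names area, cols, rows and big are only illustrative.

```python
import pandas as pd

# Reduced copy of the notebook's example frame, indexed by country name.
countries = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# A single label selects one column and returns a Series.
area = countries['area']

# A list of labels selects several columns and returns a DataFrame.
cols = countries[['area', 'population']]

# A slice selects rows; a label slice includes its end label.
rows = countries['France':'Netherlands']

# A boolean mask also selects rows.
big = countries[countries['area'] > 100000]
```

Note that the same brackets act on columns for labels and lists of labels, but on rows for slices and masks, which is why the more explicit indexers are often easier to read.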
loc and iloc
When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
loc: selection by label
iloc: selection by position
These indexers address the different dimensions of the frame:
df.loc[row_indexer, column_indexer]
df.iloc[row_indexer, column_indexer]
Selecting a single element:
countries.loc['Germany', 'area']
But the row or column indexer can also be a list, a slice, a boolean array, ...
countries.loc['France':'Germany', ['area', 'population']]
Selecting by position with iloc works similarly to indexing numpy arrays:
countries.iloc[0:2,1:3]
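One difference between the two indexers is easy to miss: a label slice with loc includes its end label, while a positional slice with iloc excludes its end position, as in numpy. The following is a minimal sketch on a rebuilt copy of the countries frame; by_label and by_position are illustrative names.

```python
import pandas as pd

# Rebuilt copy of the notebook's example frame, indexed by country name.
countries = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Label slice: both 'Belgium' and 'France' are included.
by_label = countries.loc['Belgium':'France', ['area', 'capital']]

# Positional slice: rows 0 and 1, columns 1 and 2; the end positions are excluded.
by_position = countries.iloc[0:2, 1:3]

# Both expressions select the same two rows and two columns.
same = by_label.equals(by_position)
```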
The different indexing methods can also be used to assign data:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10
countries2
Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL).
The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
countries['area'] > 100000
countries[countries['area'] > 100000]
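Several conditions can be combined into a single mask with & (and), | (or) and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The following is a minimal sketch on a rebuilt copy of the countries frame; the threshold values and the names mask, selected and capitals are only illustrative.

```python
import pandas as pd

# Rebuilt copy of the notebook's example frame, indexed by country name.
countries = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Element-wise "and" of two boolean Series; the result has one entry per row.
mask = (countries['area'] > 100000) & (countries['population'] < 70)

# The mask can be used with [] to keep all columns ...
selected = countries[mask]

# ... or with loc to pick rows and a column in one step.
capitals = countries.loc[mask, 'capital']
```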
isin and string methods
The isin method of Series is very useful for selecting rows whose values appear in a given list:
s = countries['capital']
s.isin?
s.isin(['Berlin', 'London'])
This can then be used to filter the dataframe with boolean indexing:
countries[countries['capital'].isin(['Berlin', 'London'])]
Let's say we want to select all data for which the capital starts with a 'B'. For a plain Python string, we could use the startswith method:
'Berlin'.startswith('B')
In pandas, these are available on a Series through the str namespace:
countries['capital'].str.startswith('B')
For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
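As with isin, a string method returns a boolean Series, so it can be used directly as a mask. The following is a minimal sketch on a rebuilt copy of the capital column; starts_with_b and b_capitals are illustrative names.

```python
import pandas as pd

# Rebuilt copy of the capital column of the example frame.
countries = pd.DataFrame(
    {'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# One boolean value per row: True where the capital starts with 'B'.
starts_with_b = countries['capital'].str.startswith('B')

# Use the boolean Series as a mask to keep only the matching rows.
b_capitals = countries[starts_with_b]
```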
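# assignment through a single loc call modifies countries itself
countries.loc['Belgium', 'capital'] = 'Ghent'
countries
# chained indexing: whether countries is updated depends on the pandas version and settings
countries['capital']['Belgium'] = 'Antwerp'
countries
# chained indexing on a boolean selection: the assignment goes to a temporary copy, countries is unchanged
countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels'
countries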
How to avoid this?
Use loc instead of chained indexing if possible!
Or copy explicitly if you don't want to change the original data.
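Both pieces of advice can be sketched as follows. The example uses a small rebuilt frame so that it runs on its own; the replacement values and the name subset are only illustrative.

```python
import pandas as pd

# Small stand-in for the example frame.
countries = pd.DataFrame(
    {'capital': ['Brussels', 'Paris', 'Berlin']},
    index=['Belgium', 'France', 'Germany'])

# One loc call with a boolean row indexer and a column label:
# the assignment is applied to countries itself.
countries.loc[countries['capital'] == 'Brussels', 'capital'] = 'Antwerp'

# An explicit copy makes it clear that later changes stay in the subset.
subset = countries[countries['capital'] != 'Paris'].copy()
subset.loc['Germany', 'capital'] = 'Bonn'
```

After running this, countries holds 'Antwerp' for Belgium and still 'Berlin' for Germany, while only subset holds 'Bonn'.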