Overview of indexing semantics when using [] (__getitem__).
This does not yet cover all the special cases (duplicate labels, non-monotonic indexes, contiguous or non-contiguous labels, etc.).
import pandas as pd
import numpy as np
print "pandas: ", pd.__version__
pandas: 0.15.2-254-g85703a7
s_int = pd.Series(range(5), index=[0,1,2,3,4])
s_int2 = pd.Series(range(5), index=[0,1,4,5,6])
s_float = pd.Series(range(5), index=[0.0,0.1,0.2,0.3,0.4])
s_date = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))
s_string = pd.Series(range(5), index=list('abcde'))
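For an integer axis, slicing with [] is integer-location based (the end point is excluded, as with .iloc, whereas .ix and .loc slice by label and include it):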
print s_int[0:3]
print s_int.ix[0:3]
print s_int.loc[0:3]
print s_int.iloc[0:3]
0 0 1 1 2 2 dtype: int64
0 0 1 1 2 2 3 3 dtype: int64
0 0 1 1 2 2 3 3 dtype: int64
0 0 1 1 2 2 dtype: int64
s_int2[0:3]
0 0 1 1 4 2 dtype: int64
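Because [] switches between positional and label-based slicing depending on the index, it can help to spell out the intent. A minimal sketch, assuming only the .loc and .iloc accessors already used above; by_position and by_label are illustrative names:

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])

# Positional slice: the first three elements, end point excluded.
by_position = s_int2.iloc[0:3]

# Label slice: every label between 0 and 3, end point included.
by_label = s_int2.loc[0:3]

print(list(by_position.index))  # labels found at positions 0, 1, 2
print(list(by_label.index))     # labels that fall within [0, 3]
```

Since label 3 does not exist in s_int2, the two slices select different elements, which is exactly the ambiguity that [] hides.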
But for a float axis, slicing is label based only:
print s_float[0:3]
print s_float[0:0.3]
0.0 0 0.1 1 0.2 2 0.3 3 0.4 4 dtype: int64
0.0 0 0.1 1 0.2 2 0.3 3 dtype: int64
print s_float.ix[0:3]
0.0 0 0.1 1 0.2 2 0.3 3 0.4 4 dtype: int64
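To make the float case unambiguous, the same two accessors can be used. This is only a sketch under the assumption that .loc and .iloc behave as in the cells above; the variable names are illustrative:

```python
import pandas as pd

s_float = pd.Series(range(5), index=[0.0, 0.1, 0.2, 0.3, 0.4])

# Label slice on the float index: both end labels are included.
by_label = s_float.loc[0.0:0.3]

# Positional slice: the first three elements, whatever their labels are.
by_position = s_float.iloc[0:3]

print(len(by_label), len(by_position))
```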
For other index types, slicing is integer-location based when the slice bounds are integers:
s_date[0:3]
2012-01-01 0 2012-01-02 1 2012-01-03 2 Freq: D, dtype: int64
s_string[0:3]
a 0 b 1 c 2 dtype: int64
and label based when the slice bounds have the same type as the index labels:
s_date["2012-01-01":"2012-01-03"]
2012-01-01 0 2012-01-02 1 2012-01-03 2 Freq: D, dtype: int64
s_string["a":"c"]
a 0 b 1 c 2 dtype: int64
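The two modes shown for datetime and string indexes can also be requested explicitly. A small sketch, assuming the same series as above; all result names are made up for illustration:

```python
import pandas as pd

s_date = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))
s_string = pd.Series(range(5), index=list('abcde'))

# Integer bounds: positional, end point excluded.
first_three_dates = s_date.iloc[0:3]
first_three_strings = s_string.iloc[0:3]

# Bounds of the index's own type: label based, end point included.
dates_by_label = s_date.loc['2012-01-01':'2012-01-03']
strings_by_label = s_string.loc['a':'c']

print(first_three_dates.equals(dates_by_label))
print(first_three_strings.equals(strings_by_label))
```

Here both spellings happen to select the same three elements, but only because the label range covers exactly the first three positions.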
Summary for slicing:
So the behaviour is equivalent to .ix, except on an integer axis with integer slice bounds, where the two are swapped: [] is integer-location based there, whereas .ix on an integer axis is always label based, with no fallback to integer location.
print s_int[4]
print s_int2[4]
4
2
print s_int2[3]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-9a3eae08e13d> in <module>()
----> 1 print s_int2[3]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    511     def __getitem__(self, key):
    512         try:
--> 513             result = self.index.get_value(self, key)
    514
    515             if not np.isscalar(result):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_value(self, series, key)
   1458
   1459         try:
-> 1460             return self._engine.get_value(s, k)
   1461         except KeyError as e1:
   1462             if len(self) > 0 and self.inferred_type in ['integer','boolean']:

index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3035)()
index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2805)()
index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()
hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6713)()
hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6654)()

KeyError: 3L
print s_float[2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-47cabc265c95> in <module>()
----> 1 print s_float[2]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    511     def __getitem__(self, key):
    512         try:
--> 513             result = self.index.get_value(self, key)
    514
    515             if not np.isscalar(result):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_value(self, series, key)
   2761
   2762         k = _values_from_object(key)
-> 2763         loc = self.get_loc(k)
   2764         new_values = _values_from_object(series)[loc]
   2765

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   2818         except (TypeError, NotImplementedError):
   2819             pass
-> 2820         return super(Float64Index, self).get_loc(key, method=method)
   2821
   2822     @property

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()
index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()
hashtable.pyx in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9226)()
hashtable.pyx in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9167)()

KeyError: 2.0
print s_float[0.2]
2
s_date["2012-01-03"]
2
s_date[2]
2
s_string["c"]
2
s_string[2]
2
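The single-label lookups above can be written so that the lookup mode no longer depends on the index type. A minimal sketch using .loc, .iloc and Series.get; the names on the left-hand side are illustrative only:

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])
s_string = pd.Series(range(5), index=list('abcde'))

value_at_label_4 = s_int2.loc[4]      # element whose label is 4
value_at_position_4 = s_int2.iloc[4]  # fifth element (its label is 6)

# .get returns a default instead of raising KeyError for a missing label.
missing = s_int2.get(3, default=None)

# Positional access on a string index, without relying on a fallback in [].
third_string_value = s_string.iloc[2]

print(value_at_label_4, value_at_position_4, missing, third_string_value)
```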
Summary for single label:
s_int[[3,4]]
3 3 4 4 dtype: int64
print s_int2[[3,4]]
print s_int2.loc[[3,4]]
3 NaN 4 2 dtype: float64
3 NaN 4 2 dtype: float64
print s_int2[[2, 3]]
print s_int2.loc[[2,3]]
2 NaN 3 NaN dtype: float64
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-369b4454ff12> in <module>()
      1 print s_int2[[2, 3]]
----> 2 print s_int2.loc[[2,3]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in __getitem__(self, key)
   1197             return self._getitem_tuple(key)
   1198         else:
-> 1199             return self._getitem_axis(key, axis=0)
   1200
   1201     def _getitem_axis(self, key, axis=0):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1311                     raise ValueError('Cannot index with multidimensional key')
   1312
-> 1313                 return self._getitem_iterable(key, axis=axis)
   1314
   1315             # nested tuple slicing

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _getitem_iterable(self, key, axis)
    926     def _getitem_iterable(self, key, axis=0):
    927         if self._should_validate_iterable(axis):
--> 928             self._has_valid_type(key, axis)
    929
    930         labels = self.obj._get_axis(axis)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _has_valid_type(self, key, axis)
   1259
   1260                 raise KeyError("None of [%s] are in the [%s]" %
-> 1261                                (key, self.obj._get_axis_name(axis)))
   1262
   1263             return True

KeyError: 'None of [[2, 3]] are in the [index]'
So with [], using a list is a pure reindex: even if none of the labels in the list is found, you just get an all-NaN series (in contrast with loc, where at least one label must be found).
print s_int2[[8,9]]
8 NaN 9 NaN dtype: float64
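The reindex-like behaviour can be requested explicitly with Series.reindex, which does not depend on how [] resolves a list of labels. A minimal sketch, assuming the same s_int2 as above; partly_missing and all_missing are illustrative names:

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])

# Label 3 is absent, so it comes back as NaN; label 4 keeps its value.
partly_missing = s_int2.reindex([3, 4])

# None of the labels exist: the result is an all-NaN series.
all_missing = s_int2.reindex([8, 9])

print(partly_missing)
print(all_missing)
```

Spelling it as reindex also documents that missing labels are expected, rather than leaving that to the behaviour of [] in a particular pandas version.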
s_float[[2,3]]
2 NaN 3 NaN dtype: float64
s_float.ix[[2,3]]
2 NaN 3 NaN dtype: float64
s_float[[0.2,0.3]]
0.2 2 0.3 3 dtype: int64
So for a float index too, a list is a pure, label-based reindex.
For a datetime index, there is also an integer-location fallback:
s_date[[2,3]]
2012-01-03 2 2012-01-04 3 dtype: int64
But now the positions cannot be out of bounds (which follows iloc):
s_date[[3,9]]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-a6e08ef49279> in <module>()
----> 1 s_date[[3,9]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    551             key = check_bool_indexer(self.index, key)
    552
--> 553         return self._get_with(key)
    554
    555     def _get_with(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in _get_with(self, key)
    585                     return self.reindex(key)
    586                 else:
--> 587                     return self._get_values(key)
    588             elif key_type == 'boolean':
    589                 return self._get_values(key)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in _get_values(self, indexer)
    621                                      fastpath=True).__finalize__(self)
    622         except Exception:
--> 623             return self.values[indexer]
    624
    625     def __setitem__(self, key, value):

IndexError: index 9 is out of bounds for axis 1 with size 5
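The positional behaviour on a datetime index, including the bounds check, matches what .iloc does when it is called directly. A small sketch; out_of_bounds_raised is just an illustrative flag:

```python
import pandas as pd

s_date = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))

# Positions 2 and 3 exist, so this selects the third and fourth elements.
in_bounds = s_date.iloc[[2, 3]]

try:
    s_date.iloc[[3, 9]]
    out_of_bounds_raised = False
except IndexError:
    # Position 9 does not exist in a series of length 5.
    out_of_bounds_raised = True

print(list(in_bounds.values), out_of_bounds_raised)
```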
And apparently indexing with strings does not work when they are passed in a list:
s_date
2012-01-01 0 2012-01-02 1 2012-01-03 2 2012-01-04 3 2012-01-05 4 Freq: D, dtype: int64
s_date[['2012-01-03']]
2012-01-03 NaN dtype: float64
s_date['2012-01-03']
2
s_date[['2012-01-03', '2012-01-04']]
2012-01-03 NaN 2012-01-04 NaN dtype: float64
_.index
Index([u'2012-01-03', u'2012-01-04'], dtype='object')
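One way to sidestep the string handling is to convert the strings to timestamps before selecting. This is a sketch of a workaround, not a statement about what [] should do; wanted and selected are illustrative names:

```python
import pandas as pd

s_date = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))

# Convert the strings to timestamps first, then select by label.
wanted = pd.to_datetime(['2012-01-03', '2012-01-04'])
selected = s_date.loc[wanted]

print(selected)
```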
s_string[[2,3]]
c 2 d 3 dtype: int64
s_string[["c", "f"]]
c 2 f NaN dtype: float64
Summary for indexing with list of labels:
This mainly follows ix, apart from points 2 and 3
s_int[[True, False, True, False, True]]
0 0 2 2 4 4 dtype: int64
The boolean list does not need to have the correct length (the same holds for ix/loc/iloc):
s_int[[True, False, True, False, True, False]]
0 0 2 2 4 4 dtype: int64
s_float[[True, False, True, False, True]]
0.0 0 0.2 2 0.4 4 dtype: int64
s_date[[True, False, True, False, True]]
2012-01-01 0 2012-01-03 2 2012-01-05 4 dtype: int64
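Rather than relying on a hand-written list whose length may or may not be checked, a boolean mask can be derived from the series itself, so its length and index always match. A minimal sketch; mask and the result names are illustrative:

```python
import pandas as pd

s_int = pd.Series(range(5), index=[0, 1, 2, 3, 4])

# A boolean Series computed from the data has the same length and index.
mask = s_int % 2 == 0

even_values = s_int[mask]
same_with_loc = s_int.loc[mask]

print(list(even_values.index))
```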
Summary for boolean indexing:
df = pd.DataFrame(np.arange(25).reshape(5,5))
df2 = pd.DataFrame(np.arange(25).reshape(5,5), columns=list('abcde'))
df3 = pd.DataFrame(np.arange(25).reshape(5,5), columns=[0.0,0.1,0.2,0.3,0.4])
df3b = pd.DataFrame(np.arange(25).reshape(5,5), columns=[0.0,0.1,0.2,0.3,0.4], index=[0.0,0.1,0.2,0.3,0.4])
df
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
| 2 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 |
| 4 | 20 | 21 | 22 | 23 | 24 |
Single label: 'information' axis (axis=1):
df[0]
0 0 1 5 2 10 3 15 4 20 Name: 0, dtype: int32
df[5]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-0c7cf4ee5b30> in <module>()
----> 1 df[5]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1782             return self._getitem_multilevel(key)
   1783         else:
-> 1784             return self._getitem_column(key)
   1785
   1786     def _getitem_column(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_column(self, key)
   1789         # get column
   1790         if self.columns.is_unique:
-> 1791             return self._get_item_cache(key)
   1792
   1793         # duplicate columns & possible reduce dimensionaility

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\generic.pyc in _get_item_cache(self, item)
   1075         res = cache.get(item)
   1076         if res is None:
-> 1077             values = self._data.get(item)
   1078             res = self._box_item_values(item, values)
   1079             cache[item] = res

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\internals.pyc in get(self, item, fastpath)
   2829
   2830             if not isnull(item):
-> 2831                 loc = self.items.get_loc(item)
   2832             else:
   2833                 indexer = np.arange(len(self.items))[isnull(self.items)]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()
index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()
hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6713)()
hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6654)()

KeyError: 5L
But there is no fallback to integer location when the columns are non-numeric:
df2[2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-67ed97008229> in <module>()
----> 1 df2[2]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1782             return self._getitem_multilevel(key)
   1783         else:
-> 1784             return self._getitem_column(key)
   1785
   1786     def _getitem_column(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_column(self, key)
   1789         # get column
   1790         if self.columns.is_unique:
-> 1791             return self._get_item_cache(key)
   1792
   1793         # duplicate columns & possible reduce dimensionaility

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\generic.pyc in _get_item_cache(self, item)
   1075         res = cache.get(item)
   1076         if res is None:
-> 1077             values = self._data.get(item)
   1078             res = self._box_item_values(item, values)
   1079             cache[item] = res

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\internals.pyc in get(self, item, fastpath)
   2829
   2830             if not isnull(item):
-> 2831                 loc = self.items.get_loc(item)
   2832             else:
   2833                 indexer = np.arange(len(self.items))[isnull(self.items)]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()
index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()
hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11377)()
hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11330)()

KeyError: 2
df2.ix[:,2]
0 2 1 7 2 12 3 17 4 22 Name: c, dtype: int32
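For a single column, the label-based and positional lookups can be kept apart explicitly. A minimal sketch with the same df2 as above, using .iloc for the positional case; by_label and by_position are illustrative names:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('abcde'))

by_label = df2['c']           # the column labelled 'c'
by_position = df2.iloc[:, 2]  # the third column, whatever its label

print(by_label.equals(by_position))
```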
Slicing: rows (axis=0):
df[0:2]
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
df3[0:2]
| | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
df3b[0:2]
| | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 |
|---|---|---|---|---|---|
| 0.0 | 0 | 1 | 2 | 3 | 4 |
| 0.1 | 5 | 6 | 7 | 8 | 9 |
| 0.2 | 10 | 11 | 12 | 13 | 14 |
| 0.3 | 15 | 16 | 17 | 18 | 19 |
| 0.4 | 20 | 21 | 22 | 23 | 24 |
And this seems to follow the same peculiarities as series[]
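As with a series, the row slice on the float-indexed frame can be made explicit. A small sketch, assuming the df3b defined above; labels, first_two_rows and rows_up_to_0_1 are illustrative names:

```python
import numpy as np
import pandas as pd

labels = [0.0, 0.1, 0.2, 0.3, 0.4]
df3b = pd.DataFrame(np.arange(25).reshape(5, 5), columns=labels, index=labels)

first_two_rows = df3b.iloc[0:2]     # positional, end point excluded
rows_up_to_0_1 = df3b.loc[0.0:0.1]  # label based, end label included

print(first_two_rows.equals(rows_up_to_0_1))
```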
List of indexers is again axis=1:
df[[1,2]]
| | 1 | 2 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 6 | 7 |
| 2 | 11 | 12 |
| 3 | 16 | 17 |
| 4 | 21 | 22 |
But now all labels must be present (no pure reindex as with series):
df[[1,6]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-52-4be65b88f086> in <module>()
----> 1 df[[1,6]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1776         if isinstance(key, (Series, np.ndarray, Index, list)):
   1777             # either boolean or fancy integer index
-> 1778             return self._getitem_array(key)
   1779         elif isinstance(key, DataFrame):
   1780             return self._getitem_frame(key)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_array(self, key)
   1820             return self.take(indexer, axis=0, convert=False)
   1821         else:
-> 1822             indexer = self.ix._convert_to_indexer(key, axis=1)
   1823             return self.take(indexer, axis=1, convert=True)
   1824

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1138                 mask = check == -1
   1139                 if mask.any():
-> 1140                     raise KeyError('%s not in index' % objarr[mask])
   1141
   1142                 return _values_from_object(indexer)

KeyError: '[6] not in index'
df.loc[:,[1,6]]
| | 1 | 6 |
|---|---|---|
| 0 | 1 | NaN |
| 1 | 6 | NaN |
| 2 | 11 | NaN |
| 3 | 16 | NaN |
| 4 | 21 | NaN |
And there is also a fallback to integer location when the columns are non-numeric:
df2[[1,2]]
| | b | c |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 6 | 7 |
| 2 | 11 | 12 |
| 3 | 16 | 17 |
| 4 | 21 | 22 |
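Both behaviours for a list of columns, reindexing with missing labels and positional selection, have explicit spellings. A minimal sketch with the frames defined above; with_missing and by_position are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('abcde'))

# Label 6 is absent, so reindex adds it as an all-NaN column.
with_missing = df.reindex(columns=[1, 6])

# Second and third columns by position, independent of their labels.
by_position = df2.iloc[:, [1, 2]]

print(list(with_missing.columns), list(by_position.columns))
```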
Boolean indexing is again row (axis = 0) oriented:
df[[True, False, False, True, False]]
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 3 | 15 | 16 | 17 | 18 | 19 |
df2[[True, False, False, True, False]]
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 3 | 15 | 16 | 17 | 18 | 19 |
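The row orientation of a boolean indexer can be stated explicitly by passing the mask to .loc. A small sketch; row_mask and selected_rows are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5))

# One boolean per row; True keeps the row.
row_mask = np.array([True, False, False, True, False])
selected_rows = df.loc[row_mask]

print(list(selected_rows.index))
```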
Summary for DataFrames:
This is as documented (only the boolean case is not explicitly documented, I think).
For the rest (on the chosen axis), it follows the same semantics as [] on a series, but: