Overview of indexing semantics when using [] (__getitem__).

This overview does not yet cover all the extra special cases (such as duplicate labels, non-monotonic indexes, contiguous versus non-contiguous indexes, etc.).

In [1]:

import pandas as pd
import numpy as np

In [2]:

print "pandas: ", pd.__version__

pandas:  0.15.2-254-g85703a7

In [3]:

s_int = pd.Series(range(5), index=[0,1,2,3,4])
s_int2 = pd.Series(range(5), index=[0,1,4,5,6])

In [4]:

s_float = pd.Series(range(5), index=[0.0,0.1,0.2,0.3,0.4])

In [5]:

s_date = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))

In [6]:

s_string = pd.Series(range(5), index=list('abcde'))

Slicing

Slicing is integer location based for an integer axis:

In [7]:

print s_int[0:3]
print s_int.ix[0:3]
print s_int.loc[0:3]
print s_int.iloc[0:3]

0    0
1    1
2    2
dtype: int64
0    0
1    1
2    2
3    3
dtype: int64
0    0
1    1
2    2
3    3
dtype: int64
0    0
1    1
2    2
dtype: int64

In [8]:

s_int2[0:3]

Out[8]:

0    0
1    1
4    2
dtype: int64
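
To make the two possible readings of `0:3` explicit on the non-contiguous index, the sketch below repeats the slice with `.iloc` (positions, end excluded) and `.loc` (labels, end included). It was not run on the development build shown in this notebook, and the names `by_position` and `by_label` are introduced here only for illustration.

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])

# Positional slice: the first three rows, whatever their labels are.
by_position = s_int2.iloc[0:3]

# Label slice: every label between 0 and 3 inclusive (only 0 and 1 exist).
by_label = s_int2.loc[0:3]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

The plain `s_int2[0:3]` result above matches the `.iloc` spelling, which is what "integer location based" means here.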

But for an axis with a float type, slicing is label based only:

In [9]:

print s_float[0:3]
print s_float[0:0.3]

0.0    0
0.1    1
0.2    2
0.3    3
0.4    4
dtype: int64
0.0    0
0.1    1
0.2    2
0.3    3
dtype: int64

In [10]:

print s_float.ix[0:3]

0.0    0
0.1    1
0.2    2
0.3    3
0.4    4
dtype: int64
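
For comparison, here is a minimal sketch of the same float-indexed series sliced with the explicit accessors, so that the label-based and position-based results can be seen side by side. It is an illustration under the assumption that `.loc` and `.iloc` behave as they do elsewhere in this notebook; it was not run on this build.

```python
import pandas as pd

s_float = pd.Series(range(5), index=[0.0, 0.1, 0.2, 0.3, 0.4])

# Label slice: keeps labels from 0 up to and including 0.3.
by_label = s_float.loc[0:0.3]

# Positional slice: keeps the first three rows.
by_position = s_float.iloc[0:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```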

For other index types, slicing is logically integer location based when the slice bounds are integers:

In [11]:

s_date[0:3]

Out[11]:

2012-01-01    0
2012-01-02    1
2012-01-03    2
Freq: D, dtype: int64

In [12]:

s_string[0:3]

Out[12]:

a    0
b    1
c    2
dtype: int64

and label based when the slice bounds are of the index's own type:

In [13]:

s_date["2012-01-01":"2012-01-03"]

Out[13]:

2012-01-01    0
2012-01-02    1
2012-01-03    2
Freq: D, dtype: int64

In [14]:

s_string["a":"c"]

Out[14]:

a    0
b    1
c    2
dtype: int64

Summary for slicing:

Slicing with integer bounds is:
- integer location based
- except on a float index, where it is label based
Slicing with bounds of any other type is always label based, provided the type is appropriate for the index.

So you can say that the behaviour is equivalent to .ix, except that integer bounds on an integer index are treated the opposite way: [] uses integer locations, whereas .ix on an integer axis is always label based, with no fallback to integer location.
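
The slicing rules summarised above can be written down as a small decision function. This is only a toy model of what the cells in this section show, not pandas code: the function name `slice_mode` and the category strings are hypothetical and chosen for this sketch.

```python
def slice_mode(index_kind, bounds_kind):
    """Return 'position' or 'label' for series[start:stop].

    index_kind: 'integer', 'float', or 'other' (for example datetime or string).
    bounds_kind: 'integer', or 'label' for bounds of the index's own type.
    Both arguments are hypothetical categories used only in this sketch.
    """
    if bounds_kind == "integer":
        # Integer bounds are positional, except on a float index.
        return "label" if index_kind == "float" else "position"
    # Bounds of the appropriate type for the index are matched as labels.
    return "label"


print(slice_mode("integer", "integer"))
print(slice_mode("float", "integer"))
print(slice_mode("other", "label"))
```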

Single label

In [15]:

print s_int[4]
print s_int2[4]

4
2

In [16]:

print s_int2[3]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-9a3eae08e13d> in <module>()
----> 1 print s_int2[3]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    511     def __getitem__(self, key):
    512         try:
--> 513             result = self.index.get_value(self, key)
    514 
    515             if not np.isscalar(result):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_value(self, series, key)
   1458 
   1459         try:
-> 1460             return self._engine.get_value(s, k)
   1461         except KeyError as e1:
   1462             if len(self) > 0 and self.inferred_type in ['integer','boolean']:

index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3035)()

index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2805)()

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()

hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6713)()

hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6654)()

KeyError: 3L

In [17]:

print s_float[2]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-47cabc265c95> in <module>()
----> 1 print s_float[2]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    511     def __getitem__(self, key):
    512         try:
--> 513             result = self.index.get_value(self, key)
    514 
    515             if not np.isscalar(result):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_value(self, series, key)
   2761 
   2762         k = _values_from_object(key)
-> 2763         loc = self.get_loc(k)
   2764         new_values = _values_from_object(series)[loc]
   2765 

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   2818         except (TypeError, NotImplementedError):
   2819             pass
-> 2820         return super(Float64Index, self).get_loc(key, method=method)
   2821 
   2822     @property

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438 
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()

hashtable.pyx in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9226)()

hashtable.pyx in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9167)()

KeyError: 2.0

In [18]:

print s_float[0.2]

In [19]:

s_date["2012-01-03"]

Out[19]:

In [20]:

s_date[2]

Out[20]:

In [21]:

s_string["c"]

Out[21]:

In [22]:

s_string[2]

Out[22]:

Summary for single label:

Indexing with a single label is always label based.
However, there is a fallback to integer location based indexing, except for integer and float indexes.
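
Because the fallback depends on the index type, the explicit accessors are a way to state which lookup is meant. The following is a minimal sketch, not run on this build; the variable names are introduced here for illustration.

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])
s_string = pd.Series(range(5), index=list("abcde"))

# Label lookup: the row whose label is 4.
value_at_label = s_int2.loc[4]

# Positional lookup: the fifth row, whose label is 6.
value_at_position = s_int2.iloc[4]

# On the string index, label "c" and position 2 refer to the same row.
same_row = s_string.loc["c"] == s_string.iloc[2]

# With .loc, a missing label raises KeyError; there is no positional fallback.
try:
    s_int2.loc[3]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
```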

List of labels

In [23]:

s_int[[3,4]]

Out[23]:

3    3
4    4
dtype: int64

In [24]:

print s_int2[[3,4]]
print s_int2.loc[[3,4]]

3   NaN
4     2
dtype: float64
3   NaN
4     2
dtype: float64

In [25]:

print s_int2[[2, 3]]
print s_int2.loc[[2,3]]

2   NaN
3   NaN
dtype: float64

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-369b4454ff12> in <module>()
      1 print s_int2[[2, 3]]
----> 2 print s_int2.loc[[2,3]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in __getitem__(self, key)
   1197             return self._getitem_tuple(key)
   1198         else:
-> 1199             return self._getitem_axis(key, axis=0)
   1200 
   1201     def _getitem_axis(self, key, axis=0):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1311                     raise ValueError('Cannot index with multidimensional key')
   1312 
-> 1313                 return self._getitem_iterable(key, axis=axis)
   1314 
   1315             # nested tuple slicing

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _getitem_iterable(self, key, axis)
    926     def _getitem_iterable(self, key, axis=0):
    927         if self._should_validate_iterable(axis):
--> 928             self._has_valid_type(key, axis)
    929 
    930         labels = self.obj._get_axis(axis)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _has_valid_type(self, key, axis)
   1259 
   1260                 raise KeyError("None of [%s] are in the [%s]" %
-> 1261                                (key, self.obj._get_axis_name(axis)))
   1262 
   1263             return True

KeyError: 'None of [[2, 3]] are in the [index]'

So with [], using a list is a pure reindex: even if no label in the list is found, you just get an all-NaN series (in contrast with loc, where at least one label must be found).

In [26]:

print s_int2[[8,9]]

8   NaN
9   NaN
dtype: float64
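
If the "pure reindex" reading is right, the same results should come from calling `reindex` directly. A minimal sketch under that assumption, not run on this build:

```python
import pandas as pd

s_int2 = pd.Series(range(5), index=[0, 1, 4, 5, 6])

# Labels that exist keep their value; labels that do not exist become NaN.
partly_found = s_int2.reindex([3, 4])

# No label found at all: the result is all NaN rather than an error.
none_found = s_int2.reindex([8, 9])

print(partly_found)
print(none_found)
```

The upcast to float64 seen in the outputs above is what makes room for the NaN entries.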

In [27]:

s_float[[2,3]]

Out[27]:

2   NaN
3   NaN
dtype: float64

In [38]:

s_float.ix[[2,3]]

Out[38]:

2   NaN
3   NaN
dtype: float64

In [28]:

s_float[[0.2,0.3]]

Out[28]:

0.2    2
0.3    3
dtype: int64

So for a float index too, it is a pure, label-based reindex.

A datetime index also has the integer location fallback:

In [29]:

s_date[[2,3]]

Out[29]:

2012-01-03    2
2012-01-04    3
dtype: int64

But now the positions cannot be out of bounds (which follows iloc):

In [30]:

s_date[[3,9]]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-a6e08ef49279> in <module>()
----> 1 s_date[[3,9]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in __getitem__(self, key)
    551             key = check_bool_indexer(self.index, key)
    552 
--> 553         return self._get_with(key)
    554 
    555     def _get_with(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in _get_with(self, key)
    585                     return self.reindex(key)
    586                 else:
--> 587                     return self._get_values(key)
    588             elif key_type == 'boolean':
    589                 return self._get_values(key)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.pyc in _get_values(self, indexer)
    621                                      fastpath=True).__finalize__(self)
    622         except Exception:
--> 623             return self.values[indexer]
    624 
    625     def __setitem__(self, key, value):

IndexError: index 9 is out of bounds for axis 1 with size 5

And apparently indexing with a string does not work when using lists:

In [31]:

s_date

Out[31]:

2012-01-01    0
2012-01-02    1
2012-01-03    2
2012-01-04    3
2012-01-05    4
Freq: D, dtype: int64

In [32]:

s_date[['2012-01-03']]

Out[32]:

2012-01-03   NaN
dtype: float64

In [33]:

s_date['2012-01-03']

Out[33]:

In [34]:

s_date[['2012-01-03', '2012-01-04']]

Out[34]:

2012-01-03   NaN
2012-01-04   NaN
dtype: float64

In [35]:

_.index

Out[35]:

Index([u'2012-01-03', u'2012-01-04'], dtype='object')
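
The object-dtype index above suggests the strings were never parsed into timestamps. One way to sidestep this, sketched here as an assumption rather than as the intended usage, is to convert the strings with `pd.to_datetime` before indexing, so that timestamps are compared with timestamps.

```python
import pandas as pd

s_date = pd.Series(range(5), index=pd.date_range("2012-01-01", periods=5))

# Parse the strings first; the result is a DatetimeIndex, not a list of str.
wanted = pd.to_datetime(["2012-01-03", "2012-01-04"])

# Label-based selection with already-parsed timestamps.
selected = s_date.loc[wanted]

print(selected)
```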

In [36]:

s_string[[2,3]]

Out[36]:

c    2
d    3
dtype: int64

In [37]:

s_string[["c", "f"]]

Out[37]:

c     2
f   NaN
dtype: float64

Summary for indexing with list of labels:

It is primarily label based, but:
- There is a fallback to integer location, except for an integer or float axis
- It is a pure reindex: even if no label in the list is found, you just get an all-NaN series (in contrast with loc, where at least one label must be found)
- String parsing for a datetime index does not seem to work

This mainly follows ix, apart from points 2 and 3.

Boolean indexing

In [39]:

s_int[[True, False, True, False, True]]

Out[39]:

0    0
2    2
4    4
dtype: int64

The mask does not need to be of the correct length (the same holds for ix/loc/iloc):

In [40]:

s_int[[True, False, True, False, True, False]]

Out[40]:

0    0
2    2
4    4
dtype: int64

In [41]:

s_float[[True, False, True, False, True]]

Out[41]:

0.0    0
0.2    2
0.4    4
dtype: int64

In [42]:

s_date[[True, False, True, False, True]]

Out[42]:

2012-01-01    0
2012-01-03    2
2012-01-05    4
dtype: int64

Summary for boolean indexing:

This is simple: it just works as expected.
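
Since an over-long mask is accepted silently in the cell above, a caller who wants that case reported could check the length first. The helper below is a hypothetical sketch (`select_with_mask` is not a pandas function) and was not run on this build.

```python
import numpy as np
import pandas as pd


def select_with_mask(series, mask):
    """Select rows by boolean mask, rejecting masks of the wrong length."""
    mask = np.asarray(mask, dtype=bool)
    if len(mask) != len(series):
        raise ValueError(
            "mask has %d entries, series has %d" % (len(mask), len(series))
        )
    # A boolean array of matching length selects rows along the index.
    return series.loc[mask]


s_int = pd.Series(range(5), index=[0, 1, 2, 3, 4])
picked = select_with_mask(s_int, [True, False, True, False, True])
print(picked.index.tolist())
```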

Specialties for DataFrames

In [43]:

df = pd.DataFrame(np.arange(25).reshape(5,5))
df2 = pd.DataFrame(np.arange(25).reshape(5,5), columns=list('abcde'))
df3 = pd.DataFrame(np.arange(25).reshape(5,5), columns=[0.0,0.1,0.2,0.3,0.4])
df3b = pd.DataFrame(np.arange(25).reshape(5,5), columns=[0.0,0.1,0.2,0.3,0.4], index=[0.0,0.1,0.2,0.3,0.4])

In [44]:

df

Out[44]:

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14
3	15	16	17	18	19
4	20	21	22	23	24

Single label: 'information' axis (axis=1):

In [45]:

df[0]

Out[45]:

0     0
1     5
2    10
3    15
4    20
Name: 0, dtype: int32

In [46]:

df[5]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-0c7cf4ee5b30> in <module>()
----> 1 df[5]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1782             return self._getitem_multilevel(key)
   1783         else:
-> 1784             return self._getitem_column(key)
   1785 
   1786     def _getitem_column(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_column(self, key)
   1789         # get column
   1790         if self.columns.is_unique:
-> 1791             return self._get_item_cache(key)
   1792 
   1793         # duplicate columns & possible reduce dimensionaility

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\generic.pyc in _get_item_cache(self, item)
   1075         res = cache.get(item)
   1076         if res is None:
-> 1077             values = self._data.get(item)
   1078             res = self._box_item_values(item, values)
   1079             cache[item] = res

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\internals.pyc in get(self, item, fastpath)
   2829 
   2830             if not isnull(item):
-> 2831                 loc = self.items.get_loc(item)
   2832             else:
   2833                 indexer = np.arange(len(self.items))[isnull(self.items)]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438 
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()

hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6713)()

hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6654)()

KeyError: 5L

But there is no fallback to integer location when the column index is non-numeric:

In [47]:

df2[2]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-67ed97008229> in <module>()
----> 1 df2[2]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1782             return self._getitem_multilevel(key)
   1783         else:
-> 1784             return self._getitem_column(key)
   1785 
   1786     def _getitem_column(self, key):

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_column(self, key)
   1789         # get column
   1790         if self.columns.is_unique:
-> 1791             return self._get_item_cache(key)
   1792 
   1793         # duplicate columns & possible reduce dimensionaility

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\generic.pyc in _get_item_cache(self, item)
   1075         res = cache.get(item)
   1076         if res is None:
-> 1077             values = self._data.get(item)
   1078             res = self._box_item_values(item, values)
   1079             cache[item] = res

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\internals.pyc in get(self, item, fastpath)
   2829 
   2830             if not isnull(item):
-> 2831                 loc = self.items.get_loc(item)
   2832             else:
   2833                 indexer = np.arange(len(self.items))[isnull(self.items)]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.pyc in get_loc(self, key, method)
   1435         """
   1436         if method is None:
-> 1437             return self._engine.get_loc(_values_from_object(key))
   1438 
   1439         indexer = self.get_indexer([key], method=method)

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3706)()

index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3586)()

hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11377)()

hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11330)()

KeyError: 2

In [56]:

df2.ix[:,2]

Out[56]:

0     2
1     7
2    12
3    17
4    22
Name: c, dtype: int32

Slicing: rows (axis=0):

In [48]:

df[0:2]

Out[48]:

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9

In [49]:

df3[0:2]

Out[49]:

	0.0	0.1	0.2	0.3	0.4
0	0	1	2	3	4
1	5	6	7	8	9

In [50]:

df3b[0:2]

Out[50]:

	0.0	0.1	0.2	0.3	0.4
0.0	0	1	2	3	4
0.1	5	6	7	8	9
0.2	10	11	12	13	14
0.3	15	16	17	18	19
0.4	20	21	22	23	24

And this seems to follow the same peculiarities as series[]
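
To separate the two readings of a row slice on the float-indexed frame, the sketch below uses the explicit accessors. It is an illustration only, not run on this build; `first_two_rows` and `up_to_label` are names introduced here.

```python
import numpy as np
import pandas as pd

labels = [0.0, 0.1, 0.2, 0.3, 0.4]
df3b = pd.DataFrame(np.arange(25).reshape(5, 5), columns=labels, index=labels)

# Positional: the first two rows.
first_two_rows = df3b.iloc[0:2]

# Label based: rows whose label lies between 0 and 0.2 inclusive.
up_to_label = df3b.loc[0:0.2]

print(first_two_rows.index.tolist())
print(up_to_label.index.tolist())
```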

A list of indexers again selects along axis=1:

In [51]:

df[[1,2]]

Out[51]:

	1	2
0	1	2
1	6	7
2	11	12
3	16	17
4	21	22

But now all labels must be present (no pure reindex as with series):

In [52]:

df[[1,6]]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-52-4be65b88f086> in <module>()
----> 1 df[[1,6]]

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in __getitem__(self, key)
   1776         if isinstance(key, (Series, np.ndarray, Index, list)):
   1777             # either boolean or fancy integer index
-> 1778             return self._getitem_array(key)
   1779         elif isinstance(key, DataFrame):
   1780             return self._getitem_frame(key)

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\frame.pyc in _getitem_array(self, key)
   1820             return self.take(indexer, axis=0, convert=False)
   1821         else:
-> 1822             indexer = self.ix._convert_to_indexer(key, axis=1)
   1823             return self.take(indexer, axis=1, convert=True)
   1824 

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1138                 mask = check == -1
   1139                 if mask.any():
-> 1140                     raise KeyError('%s not in index' % objarr[mask])
   1141 
   1142                 return _values_from_object(indexer)

KeyError: '[6] not in index'

In [53]:

df.loc[:,[1,6]]

Out[53]:

	1	6
0	1	NaN
1	6	NaN
2	11	NaN
3	16	NaN
4	21	NaN
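
The NaN-filled column that `.loc` produces above can also be requested explicitly through `reindex` on the columns. A minimal sketch, assuming the goal is exactly this "keep what exists, fill the rest with NaN" result; it was not run on this build.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5))

# Column 1 exists; column 6 does not and is filled with NaN.
widened = df.reindex(columns=[1, 6])

print(widened)
```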

And there is also a fallback to integer location:

In [54]:

df2[[1,2]]

Out[54]:

	b	c
0	1	2
1	6	7
2	11	12
3	16	17
4	21	22

Boolean indexing is again row (axis = 0) oriented:

In [55]:

df[[True, False, False, True, False]]

Out[55]:

	0	1	2	3	4
0	0	1	2	3	4
3	15	16	17	18	19

In [57]:

df2[[True, False, False, True, False]]

Out[57]:

	a	b	c	d	e
0	0	1	2	3	4
3	15	16	17	18	19

Summary for DataFrames:

It uses the 'information' axis (axis 1) for:
- single labels
- list of labels
It uses the rows (axis 0) for:
- slicing
- boolean indexing

This is as documented (only the boolean case is not explicitly documented, I think).

For the rest (on the chosen axis), it follows the same semantics as [] on a series, but:

- for a list of labels, all labels must now be present (no pure reindex as with series)
- for single labels, there is no fallback to integer location for a non-numeric index (although a list of labels does fall back ...)
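
Because [] on a DataFrame switches axis depending on the kind of key, the axis can be stated explicitly with `.loc` and `.iloc`. The sketch below is an illustration under that assumption, not run on this build; the variable names are introduced here.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list("abcde"))

# Columns (axis 1): one by label, one by position, several by label.
col_by_label = df2.loc[:, "c"]
col_by_position = df2.iloc[:, 2]
two_cols = df2.loc[:, ["b", "c"]]

# Rows (axis 0): a boolean mask with one entry per row.
mask = [True, False, False, True, False]
masked_rows = df2.loc[mask]

print(two_cols.columns.tolist())
print(masked_rows.index.tolist())
```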