In [1]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

Let's say hello to Pandas. The convention is use pd in the import.

In [2]:

import pandas as pd

Pandas Series Object¶

Array/list, one dimensional object. The easiest way is to initialize it with a list.

In [3]:

cities = ['London', 'New York', 'Berlin', 'Toronto']

In [4]:

s =pd.Series(cities)
s

Out[4]:

0      London
1    New York
2      Berlin
3     Toronto
dtype: object

We see that we have the values in the list, but also and index, in this case going from 0 to 3. That is the default index.

In [5]:

s.index

Out[5]:

Int64Index([0, 1, 2, 3], dtype=int64)

In [6]:

s.values

Out[6]:

array(['London', 'New York', 'Berlin', 'Toronto'], dtype=object)

We can reference the values by using the appropiate index. In this case there is not much difference with a numpy array.

In [7]:

s[1]

Out[7]:

'New York'

You can assign you own index. Which brings the power to the structure. Later it will be clear why.

In [8]:

s = pd.Series(cities, index = ['A', 'B', 'C', 'D'])
s

Out[8]:

A      London
B    New York
C      Berlin
D     Toronto
dtype: object

In this case we have used labels A to D. And we can now reference the data in the Series by using the label.

In [9]:

s['D']

Out[9]:

'Toronto'

Bogotá###¶

[Bogotá Image by Jorge Díaz](http://en.wikipedia.org/wiki/File:Bogot%C3%A1_de_noche.jpg)

Let us try to look at a more meaningful series. The population of Bogotá (the capital of Colombia) as evolving in time. Data taken from wikipedia. In this case it makes sense that the label is the year, and the value is the corresponding population. Then we can easily retrieve by year. Previously we used a list, but we can also use a dictionary to initialize a Series Object. In this case the keys will be the index, and the value will be the population in that year:

In [41]:

population_bogota = {1800:21964, 
                     1912:121257, 
                     1951:715250, 
                     1964:1697311, 
                     1973:2855065, 
                     1985:4236490, 
                     1999:6276428, 
                     2012:7571345}

In [42]:

series_bogota = pd.Series(population_bogota)
series_bogota

Out[42]:

1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
dtype: int64

When you are working with different series it may be useful to include some meta-data on the series. Like the name of the series itself, and of the index.

In [43]:

series_bogota.name = 'Bogota population'
series_bogota.index.name = 'year'

In [44]:

series_bogota

Out[44]:

year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

In [45]:

series_bogota.index

Out[45]:

Int64Index([1800, 1912, 1951, 1964, 1973, 1985, 1999, 2012], dtype=int64)

In [46]:

series_bogota.values

Out[46]:

array([  21964,  121257,  715250, 1697311, 2855065, 4236490, 6276428,
       7571345])

We can obviously reference specific values by using the index

In [47]:

series_bogota[1800]

Out[47]:

Note that the values in the series are numpy arrays. This is part of what makes pandas fast and can be very useful when working with other libraries.

In [48]:

type(series_bogota.values)

Out[48]:

numpy.ndarray

The index, on the other hand is a pandas Object. There is a hierarchy of Index objects that includes specific types for time indexes, hierichical index and other types of indexes.

In [49]:

type(series_bogota.index)

Out[49]:

pandas.core.index.Int64Index

We can query som information based on values. When did the population of bogota went above 1.000.000

In [50]:

series_bogota[series_bogota > 1000000]

Out[50]:

year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

I need the population from the 60's on

In [51]:

series_bogota[series_bogota.index > 1960]

Out[51]:

year
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64

What is actually going on with this way of querying information? Let us see what the bit inside the brackets yields.

In [52]:

series_bogota.index > 1965

Out[52]:

array([False, False, False, False,  True,  True,  True,  True], dtype=bool)

This means that we can also query using an array of booleans, perhaps handy if we have complex programatic conditions.

In [53]:

series_bogota[[True, False, True]]

Out[53]:

year
1800     21964
1951    715250
Name: Bogota population, dtype: int64

You can also query based on the index value itself.

In [54]:

series_bogota[[1973, 2012, 2011]]

Out[54]:

1973    2855065
2012    7571345
2011        NaN
Name: Bogota population, dtype: float64

Note that we get NaN, for 2011, because this label is not in the original index. This is part of the automatic handling of missing values, and will turn out to be extremely valuable when working with real data.

Lets add a ficticious value for 2011.

In [55]:

series_bogota = series_bogota.set_value(2011, 6500000)
series_bogota

Out[55]:

year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
2011    6500000
Name: Bogota population, dtype: int64

You can apply functions to each element of the series. Remember, this is close to numpy.

In [56]:

millions = lambda x: x/1000000.0
series_bogota.apply(millions)

Out[56]:

year
1800    0.021964
1912    0.121257
1951    0.715250
1964    1.697311
1973    2.855065
1985    4.236490
1999    6.276428
2012    7.571345
2011    6.500000
Name: Bogota population, dtype: float64

Finally, sometimes we need to have a quick idea of what the data looks like. For this we can use the function describe.

In [57]:

series_bogota.describe()

Out[57]:

count          9.000000
mean     3332790.000000
std      2926351.983191
min        21964.000000
25%       715250.000000
50%      2855065.000000
75%      6276428.000000
max      7571345.000000
dtype: float64

Or even cooler, plots

In [58]:

pd.Series.plot(series_bogota, kind='bar')

Out[58]:

<matplotlib.axes.AxesSubplot at 0x10993e150>

In [59]:

series_bogota.sort()
pd.Series.plot(series_bogota/1000000.0, kind='bar')

Out[59]:

<matplotlib.axes.AxesSubplot at 0x109a4b350>

What about population change?

In [60]:

series_bogota.pct_change()

Out[60]:

year
1800         NaN
1912    4.520716
1951    4.898629
1964    1.373032
1973    0.682111
1985    0.483851
1999    0.481516
2011    0.035621
2012    0.164822
Name: Bogota population, dtype: float64

In [61]:

series_bogota.pct_change().plot(kind='bar')
plt.ylabel('percentage change')

Out[61]:

<matplotlib.text.Text at 0x109d0ccd0>

Note that there is a lot of missing data. Is there a way to solve this quickly?

Reindex + Interpolation

In [62]:

series_bogota = series_bogota.reindex(range(1800, 2014))

In [63]:

series_bogota

Out[63]:

1800    21964
1801      NaN
1802      NaN
1803      NaN
1804      NaN
1805      NaN
1806      NaN
1807      NaN
1808      NaN
1809      NaN
1810      NaN
1811      NaN
1812      NaN
1813      NaN
1814      NaN
...
1999    6276428
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011    6500000
2012    7571345
2013        NaN
Name: Bogota population, Length: 214, dtype: float64

In [64]:

series_bogota = series_bogota.interpolate('values')
series_bogota

Out[64]:

1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    29942.901786
1810    30829.446429
1811    31715.991071
1812    32602.535714
1813    33489.080357
1814    34375.625000
...
1999    6276428
2000    6295059
2001    6313690
2002    6332321
2003    6350952
2004    6369583
2005    6388214
2006    6406845
2007    6425476
2008    6444107
2009    6462738
2010    6481369
2011    6500000
2012    7571345
2013    7571345
Name: Bogota population, Length: 214, dtype: float64

In [65]:

(series_bogota/1000000.0).plot()

Out[65]:

<matplotlib.axes.AxesSubplot at 0x109dc0750>

In [66]:

(series_bogota).pct_change().plot()

Out[66]:

<matplotlib.axes.AxesSubplot at 0x109dc6e90>

Finally, let us look at adding series.

Cali###¶

[Cali Image by Carlos Dussán Gómez](http://en.wikipedia.org/wiki/File:Gatonegro_Cali_Noche_desde_Cristo_Rey.jpg)

In [67]:

population_cali = {1809: 7546, 1938:101883, 1973:991549, 1985:1429026, 2013:2319684}

In [68]:

series_cali = pd.Series(population_cali)
series_cali.name = 'Cali (Colombia) Population'
series_cali.index.name = 'year'
series_cali

Out[68]:

year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64

In [69]:

(series_cali + series_bogota)

Out[69]:

1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809    37488.901786
1810             NaN
1811             NaN
1812             NaN
1813             NaN
1814             NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    9891029
Length: 214, dtype: float64

Probably the most meaningful way to add the series is if they share an index. So I can reindex cali with the index of Bogotá

In [70]:

series_cali

Out[70]:

year
1809       7546
1938     101883
1973     991549
1985    1429026
2013    2319684
Name: Cali (Colombia) Population, dtype: int64

In [71]:

series_cali = series_cali.reindex(series_bogota.index)
series_cali

Out[71]:

1800     NaN
1801     NaN
1802     NaN
1803     NaN
1804     NaN
1805     NaN
1806     NaN
1807     NaN
1808     NaN
1809    7546
1810     NaN
1811     NaN
1812     NaN
1813     NaN
1814     NaN
...
1999        NaN
2000        NaN
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005        NaN
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010        NaN
2011        NaN
2012        NaN
2013    2319684
Name: Cali (Colombia) Population, Length: 214, dtype: float64

In [72]:

len(series_bogota) == len(series_cali)

Out[72]:

True

In [73]:

np.alltrue(series_bogota.index == series_cali.index) 

Out[73]:

True

In [74]:

series_cali = series_cali.interpolate('values')
series_cali

Out[74]:

1800             NaN
1801             NaN
1802             NaN
1803             NaN
1804             NaN
1805             NaN
1806             NaN
1807             NaN
1808             NaN
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64

In [75]:

series_cali = series_cali.fillna(0.0)
series_cali

Out[75]:

1800        0.000000
1801        0.000000
1802        0.000000
1803        0.000000
1804        0.000000
1805        0.000000
1806        0.000000
1807        0.000000
1808        0.000000
1809     7546.000000
1810     8277.294574
1811     9008.589147
1812     9739.883721
1813    10471.178295
1814    11202.472868
...
1999    1874355.000000
2000    1906164.214286
2001    1937973.428571
2002    1969782.642857
2003    2001591.857143
2004    2033401.071429
2005    2065210.285714
2006    2097019.500000
2007    2128828.714286
2008    2160637.928571
2009    2192447.142857
2010    2224256.357143
2011    2256065.571429
2012    2287874.785714
2013    2319684.000000
Name: Cali (Colombia) Population, Length: 214, dtype: float64

Now finally add the two series

In [76]:

series_bogota + series_cali

Out[76]:

1800    21964.000000
1801    22850.544643
1802    23737.089286
1803    24623.633929
1804    25510.178571
1805    26396.723214
1806    27283.267857
1807    28169.812500
1808    29056.357143
1809    37488.901786
1810    39106.741002
1811    40724.580219
1812    42342.419435
1813    43960.258652
1814    45578.097868
...
1999    8150783.000000
2000    8201223.214286
2001    8251663.428571
2002    8302103.642857
2003    8352543.857143
2004    8402984.071429
2005    8453424.285714
2006    8503864.500000
2007    8554304.714286
2008    8604744.928571
2009    8655185.142857
2010    8705625.357143
2011    8756065.571429
2012    9859219.785714
2013    9891029.000000
Length: 214, dtype: float64

In [77]:

series_bogota.plot(label='Bogota population')
series_cali.plot(label='Cali population')
plt.legend(loc='best')

Out[77]:

<matplotlib.legend.Legend at 0x109fd1950>

I do not vouch for the statistics here, it is just an example of the tool.

Pandas Data Frame¶

In [78]:

df = pd.DataFrame({'bogotá':series_bogota, 'cali':series_cali})

In [79]:

df

Out[79]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214 entries, 1800 to 2013
Data columns (total 2 columns):
bogotá    214  non-null values
cali      214  non-null values
dtypes: float64(2)

In [80]:

df.head()

Out[80]:

	bogotá	cali
1800	21964.000000	0
1801	22850.544643	0
1802	23737.089286	0
1803	24623.633929	0
1804	25510.178571	0

In [81]:

pd.options.display.float_format = '{:20,.2f}'.format
df.index.name = 'year'

In [82]:

df.tail()

Out[82]:

	bogotá	cali
year
2009	6,462,738.00	2,192,447.14
2010	6,481,369.00	2,224,256.36
2011	6,500,000.00	2,256,065.57
2012	7,571,345.00	2,287,874.79
2013	7,571,345.00	2,319,684.00

Elements of DataFrame

In [83]:

df.index

Out[83]:

Int64Index([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype=int64)

In [84]:

df.columns

Out[84]:

Index([u'bogotá', u'cali'], dtype=object)

In [85]:

np.shape(df.values)

Out[85]:

(214, 2)

In [86]:

df['population difference'] = df['bogotá'] - df['cali']

In [88]:

df.tail()

Out[88]:

	bogotá	cali	population difference
year
2009	6,462,738.00	2,192,447.14	4,270,290.86
2010	6,481,369.00	2,224,256.36	4,257,112.64
2011	6,500,000.00	2,256,065.57	4,243,934.43
2012	7,571,345.00	2,287,874.79	5,283,470.21
2013	7,571,345.00	2,319,684.00	5,251,661.00

In [87]:

df.describe()

Out[87]:

	bogotá	cali	population difference
count	214.00	214.00	214.00
mean	1,269,633.63	440,945.03	828,688.61
std	2,057,142.86	658,404.11	1,404,249.56
min	21,964.00	0.00	21,964.00
25%	69,172.50	39,905.78	29,266.72
50%	116,381.00	78,847.22	37,533.78
75%	1,376,252.60	654,746.87	721,505.72
max	7,571,345.00	2,319,684.00	5,283,470.21

In [89]:

df['population difference'].tail()

Out[89]:

year
2009           4,270,290.86
2010           4,257,112.64
2011           4,243,934.43
2012           5,283,470.21
2013           5,251,661.00
Name: population difference, dtype: float64

In [90]:

df[df.index >1990]

Out[90]:

	bogotá	cali	population difference
year
1991	5,110,749.14	1,619,881.29	3,490,867.86
1992	5,256,459.00	1,651,690.50	3,604,768.50
1993	5,402,168.86	1,683,499.71	3,718,669.14
1994	5,547,878.71	1,715,308.93	3,832,569.79
1995	5,693,588.57	1,747,118.14	3,946,470.43
1996	5,839,298.43	1,778,927.36	4,060,371.07
1997	5,985,008.29	1,810,736.57	4,174,271.71
1998	6,130,718.14	1,842,545.79	4,288,172.36
1999	6,276,428.00	1,874,355.00	4,402,073.00
2000	6,295,059.00	1,906,164.21	4,388,894.79
2001	6,313,690.00	1,937,973.43	4,375,716.57
2002	6,332,321.00	1,969,782.64	4,362,538.36
2003	6,350,952.00	2,001,591.86	4,349,360.14
2004	6,369,583.00	2,033,401.07	4,336,181.93
2005	6,388,214.00	2,065,210.29	4,323,003.71
2006	6,406,845.00	2,097,019.50	4,309,825.50
2007	6,425,476.00	2,128,828.71	4,296,647.29
2008	6,444,107.00	2,160,637.93	4,283,469.07
2009	6,462,738.00	2,192,447.14	4,270,290.86
2010	6,481,369.00	2,224,256.36	4,257,112.64
2011	6,500,000.00	2,256,065.57	4,243,934.43
2012	7,571,345.00	2,287,874.79	5,283,470.21
2013	7,571,345.00	2,319,684.00	5,251,661.00

In [91]:

df.plot()

Out[91]:

<matplotlib.axes.AxesSubplot at 0x10a2c04d0>

In [92]:

!ls *.csv

cali_and_bogota.csv metropolitan.csv    patients.csv

In [93]:

df.to_csv('cali_and_bogota.csv')

In [94]:

!head cali_and_bogota.csv

year,bogotá,cali,population difference
1800,21964.0,0.0,21964.0
1801,22850.54464285714,0.0,22850.54464285714
1802,23737.089285714286,0.0,23737.089285714286
1803,24623.633928571428,0.0,24623.633928571428
1804,25510.178571428572,0.0,25510.178571428572
1805,26396.723214285714,0.0,26396.723214285714
1806,27283.267857142855,0.0,27283.267857142855
1807,28169.8125,0.0,28169.8125
1808,29056.357142857145,0.0,29056.357142857145

There are many ways to initialize DataFrames

Let's be more cosmopolitan (if we still have 7 more minutes)¶

In [95]:

!cat metropolitan.csv

Metropolitan area,Country,Population,Area
Tokyo,Japan,32450000,8014
Seoul,South Korea,20550000,5076
Mexico City,Mexico,20450000,7346
New York City,United States,19750000,17884
Mumbai-Bombay,India,19200000,2350
Jakarta,Indonesia,18900000,5100
São Paulo,Brazil,18850000,8479
New Delhi,India,18600000,3182
Osaka-Kobe-Kyoto,Japan,17375000,6930
Shanghai,China,16650000,5177
Manila,Philippines,16300000,2521
Hong Kong,Hong Kong/China,15800000,3051
Los Angeles,United States,15250000,10780
Kolkata,India,15100000,1785
Moscow,Russia,15000000,14925
Cairo,Egypt,14450000,1600
Buenos Aires,Argentina,13170000,10888
London,United Kingdom,12875000,11391
Beijing,China,12500000,6562
Karachi,Pakistan,11800000,1100

In [96]:

df = pd.read_csv('metropolitan.csv')
df

Out[96]:

	Metropolitan area	Country	Population	Area
0	Tokyo	Japan	32450000	8014
1	Seoul	South Korea	20550000	5076
2	Mexico City	Mexico	20450000	7346
3	New York City	United States	19750000	17884
4	Mumbai-Bombay	India	19200000	2350
5	Jakarta	Indonesia	18900000	5100
6	São Paulo	Brazil	18850000	8479
7	New Delhi	India	18600000	3182
8	Osaka-Kobe-Kyoto	Japan	17375000	6930
9	Shanghai	China	16650000	5177
10	Manila	Philippines	16300000	2521
11	Hong Kong	Hong Kong/China	15800000	3051
12	Los Angeles	United States	15250000	10780
13	Kolkata	India	15100000	1785
14	Moscow	Russia	15000000	14925
15	Cairo	Egypt	14450000	1600
16	Buenos Aires	Argentina	13170000	10888
17	London	United Kingdom	12875000	11391
18	Beijing	China	12500000	6562
19	Karachi	Pakistan	11800000	1100

In [97]:

df = pd.read_csv('metropolitan.csv', index_col='Metropolitan area')
df

Out[97]:

	Country	Population	Area
Metropolitan area
Tokyo	Japan	32450000	8014
Seoul	South Korea	20550000	5076
Mexico City	Mexico	20450000	7346
New York City	United States	19750000	17884
Mumbai-Bombay	India	19200000	2350
Jakarta	Indonesia	18900000	5100
São Paulo	Brazil	18850000	8479
New Delhi	India	18600000	3182
Osaka-Kobe-Kyoto	Japan	17375000	6930
Shanghai	China	16650000	5177
Manila	Philippines	16300000	2521
Hong Kong	Hong Kong/China	15800000	3051
Los Angeles	United States	15250000	10780
Kolkata	India	15100000	1785
Moscow	Russia	15000000	14925
Cairo	Egypt	14450000	1600
Buenos Aires	Argentina	13170000	10888
London	United Kingdom	12875000	11391
Beijing	China	12500000	6562
Karachi	Pakistan	11800000	1100

In [98]:

df.describe()

Out[98]:

	Population	Area
count	20.00	20.00
mean	17,251,000.00	6,707.05
std	4,485,562.40	4,619.33
min	11,800,000.00	1,100.00
25%	14,862,500.00	2,918.50
50%	16,475,000.00	5,869.50
75%	18,975,000.00	9,054.25
max	32,450,000.00	17,884.00

In [99]:

df.dtypes

Out[99]:

Country       object
Population     int64
Area           int64
dtype: object

In [100]:

df['Population']= df['Population'].apply(float)
df['Area']= df['Area'].apply(float)

In [101]:

df.dtypes

Out[101]:

Country        object
Population    float64
Area          float64
dtype: object

Three biggest metro areas¶

In [102]:

df.sort('Population')
df.head(3)

Out[102]:

	Country	Population	Area
Metropolitan area
Tokyo	Japan	32,450,000.00	8,014.00
Seoul	South Korea	20,550,000.00	5,076.00
Mexico City	Mexico	20,450,000.00	7,346.00

Area largest than 10.0000 km2

In [103]:

df[df['Area']> 10000]

Out[103]:

	Country	Population	Area
Metropolitan area
New York City	United States	19,750,000.00	17,884.00
Los Angeles	United States	15,250,000.00	10,780.00
Moscow	Russia	15,000,000.00	14,925.00
Buenos Aires	Argentina	13,170,000.00	10,888.00
London	United Kingdom	12,875,000.00	11,391.00

In [104]:

df[(df['Area']> 10000) & (df['Population'] > 20000)]

Out[104]:

	Country	Population	Area
Metropolitan area
New York City	United States	19,750,000.00	17,884.00
Los Angeles	United States	15,250,000.00	10,780.00
Moscow	Russia	15,000,000.00	14,925.00
Buenos Aires	Argentina	13,170,000.00	10,888.00
London	United Kingdom	12,875,000.00	11,391.00

In [105]:

df['Density'] = df['Population']/df['Area']
df.sort('Density', ascending=False)['Density'].plot(kind='bar')

Out[105]:

<matplotlib.axes.AxesSubplot at 0x109faac10>

In [106]:

df.groupby('Country').head()

Out[106]:

		Country	Population	Area	Density
Country	Metropolitan area
Argentina	Buenos Aires	Argentina	13,170,000.00	10,888.00	1,209.59
Brazil	São Paulo	Brazil	18,850,000.00	8,479.00	2,223.14
China	Shanghai	China	16,650,000.00	5,177.00	3,216.15
China	Beijing	China	12,500,000.00	6,562.00	1,904.91
Egypt	Cairo	Egypt	14,450,000.00	1,600.00	9,031.25
Hong Kong/China	Hong Kong	Hong Kong/China	15,800,000.00	3,051.00	5,178.63
India	Mumbai-Bombay	India	19,200,000.00	2,350.00	8,170.21
	New Delhi	India	18,600,000.00	3,182.00	5,845.38
	Kolkata	India	15,100,000.00	1,785.00	8,459.38
Indonesia	Jakarta	Indonesia	18,900,000.00	5,100.00	3,705.88
Japan	Tokyo	Japan	32,450,000.00	8,014.00	4,049.16
Japan	Osaka-Kobe-Kyoto	Japan	17,375,000.00	6,930.00	2,507.22
Mexico	Mexico City	Mexico	20,450,000.00	7,346.00	2,783.83
Pakistan	Karachi	Pakistan	11,800,000.00	1,100.00	10,727.27
Philippines	Manila	Philippines	16,300,000.00	2,521.00	6,465.69
Russia	Moscow	Russia	15,000,000.00	14,925.00	1,005.03
South Korea	Seoul	South Korea	20,550,000.00	5,076.00	4,048.46
United Kingdom	London	United Kingdom	12,875,000.00	11,391.00	1,130.28
United States	New York City	United States	19,750,000.00	17,884.00	1,104.34
United States	Los Angeles	United States	15,250,000.00	10,780.00	1,414.66

In [107]:

type(df.groupby('Country'))

Out[107]:

pandas.core.groupby.DataFrameGroupBy

In [108]:

print df.groupby('Country').sum().sort('Density', ascending=False).head()

                      Population                 Area              Density
Country                                                                   
India              52,900,000.00             7,317.00            22,474.98
Pakistan           11,800,000.00             1,100.00            10,727.27
Egypt              14,450,000.00             1,600.00             9,031.25
Japan              49,825,000.00            14,944.00             6,556.38
Philippines        16,300,000.00             2,521.00             6,465.69

All the data in this notebook was taken from wikipedia.