%pylab inline
Populating the interactive namespace from numpy and matplotlib
Let's say hello to pandas. The convention is to import it as pd.
import pandas as pd
A Series is a one-dimensional, array-like object. The easiest way to create one is from a list.
cities = ['London', 'New York', 'Berlin', 'Toronto']
s = pd.Series(cities)
s
0      London
1    New York
2      Berlin
3     Toronto
dtype: object
We see that we have the values in the list, but also an index, in this case going from 0 to 3. That is the default index.
s.index
Int64Index([0, 1, 2, 3], dtype=int64)
s.values
array(['London', 'New York', 'Berlin', 'Toronto'], dtype=object)
We can reference the values by using the appropriate index. In this case there is not much difference from a NumPy array.
s[1]
'New York'
You can assign your own index, which is what gives the structure its power. Later it will be clear why.
s = pd.Series(cities, index = ['A', 'B', 'C', 'D'])
s
A      London
B    New York
C      Berlin
D     Toronto
dtype: object
In this case we have used labels A to D. And we can now reference the data in the Series by using the label.
s['D']
'Toronto'
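Once the index holds labels, it helps to be explicit about whether we are selecting by label or by position. The following is a minimal sketch using the same list of cities; the variable names are only for illustration.

```python
import pandas as pd

cities = ['London', 'New York', 'Berlin', 'Toronto']
labelled = pd.Series(cities, index=['A', 'B', 'C', 'D'])

# .loc selects by label, .iloc selects by integer position.
by_label = labelled.loc['D']
by_position = labelled.iloc[3]
print(by_label, by_position)
```

Both expressions point at the last element here, but only .loc keeps working if the order of the rows changes.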
Let us look at a more meaningful series: the population of Bogotá (the capital of Colombia) over time, with data taken from Wikipedia. Here it makes sense for the label to be the year and the value to be the corresponding population, so that we can easily retrieve a value by year. Previously we used a list, but we can also initialize a Series from a dictionary. In this case the keys become the index and the values become the population in each year:
population_bogota = {1800:21964,
1912:121257,
1951:715250,
1964:1697311,
1973:2855065,
1985:4236490,
1999:6276428,
2012:7571345}
series_bogota = pd.Series(population_bogota)
series_bogota
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
dtype: int64
When you are working with several series it may be useful to attach some metadata to each one, such as the name of the series itself and the name of its index.
series_bogota.name = 'Bogota population'
series_bogota.index.name = 'year'
series_bogota
year
1800      21964
1912     121257
1951     715250
1964    1697311
1973    2855065
1985    4236490
1999    6276428
2012    7571345
Name: Bogota population, dtype: int64
series_bogota.index
Int64Index([1800, 1912, 1951, 1964, 1973, 1985, 1999, 2012], dtype=int64)
series_bogota.values
array([ 21964, 121257, 715250, 1697311, 2855065, 4236490, 6276428, 7571345])
We can reference specific values by using the index.
series_bogota[1800]
21964
Note that the values of the series are stored in a NumPy array. This is part of what makes pandas fast, and it can be very useful when working with other libraries.
type(series_bogota.values)
numpy.ndarray
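Because the values live in a NumPy array, NumPy functions can be applied directly. The sketch below uses two made-up numbers (not the Bogotá data) to show the difference between working on .values and working on the Series itself.

```python
import numpy as np
import pandas as pd

population = pd.Series({1800: 100.0, 1912: 1000.0})  # made-up values

# .values exposes the underlying NumPy array (the labels are dropped).
raw = population.values

# NumPy functions applied to the Series itself keep the index.
logged = np.log10(population)
print(type(raw).__name__, logged.index.tolist())
```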
The index, on the other hand, is a pandas object. There is a hierarchy of Index classes that includes specific types for time indexes, hierarchical indexes and others.
type(series_bogota.index)
pandas.core.index.Int64Index
We can query information based on values. When did the population of Bogotá go above 1,000,000?
series_bogota[series_bogota > 1000000]
year 1964 1697311 1973 2855065 1985 4236490 1999 6276428 2012 7571345 Name: Bogota population, dtype: int64
Now I need the population from the 1960s onwards.
series_bogota[series_bogota.index > 1960]
year 1964 1697311 1973 2855065 1985 4236490 1999 6276428 2012 7571345 Name: Bogota population, dtype: int64
What is actually going on with this way of querying information? Let us see what the bit inside the brackets yields.
series_bogota.index > 1960
array([False, False, False, True, True, True, True, True], dtype=bool)
This means that we can also query using an array of booleans, which is handy if we have complex programmatic conditions.
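# The mask needs one entry per element; recent pandas versions reject a shorter list.
series_bogota[[True, False, True, False, False, False, False, False]]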
year 1800 21964 1951 715250 Name: Bogota population, dtype: int64
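Boolean masks can also be stored in variables and combined, which is usually easier than typing the booleans by hand. A small sketch, assuming a shortened version of the Bogotá series; the mask names are only illustrative.

```python
import pandas as pd

population = pd.Series({1951: 715250, 1964: 1697311, 1973: 2855065, 1985: 4236490})

# Each comparison yields a boolean array with one entry per element.
over_one_million = population > 1000000
before_1980 = population.index < 1980

# Masks combine element-wise with & (and), | (or) and ~ (not).
selected = population[over_one_million & before_1980]
print(selected)
```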
You can also query based on the index value itself.
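# reindex tolerates labels that are not in the index; recent pandas versions
# raise a KeyError if such a label is passed inside a list to [].
series_bogota.reindex([1973, 2012, 2011])
1973    2855065
2012    7571345
2011        NaN
Name: Bogota population, dtype: float64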
Note that we get NaN for 2011 because this label is not in the original index. This is part of the automatic handling of missing values, and will turn out to be extremely valuable when working with real data.
Let's add a fictitious value for 2011.
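To find out afterwards which labels were missing, we can ask the result where it holds NaN. This is only a sketch on a two-element series; reindex and isna are standard pandas methods, the variable names are mine.

```python
import pandas as pd

population = pd.Series({1973: 2855065, 2012: 7571345})

# reindex returns a new Series; labels absent from the original get NaN,
# which forces the dtype from int64 to float64.
requested = population.reindex([1973, 2012, 2011])
missing_years = requested[requested.isna()].index.tolist()
print(requested.dtype, missing_years)
```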
# Series.set_value was removed from pandas; assigning through .loc adds the new label.
series_bogota.loc[2011] = 6500000
series_bogota
year 1800 21964 1912 121257 1951 715250 1964 1697311 1973 2855065 1985 4236490 1999 6276428 2012 7571345 2011 6500000 Name: Bogota population, dtype: int64
You can apply functions to each element of the series. Remember, this is close to numpy.
millions = lambda x: x/1000000.0
series_bogota.apply(millions)
year 1800 0.021964 1912 0.121257 1951 0.715250 1964 1.697311 1973 2.855065 1985 4.236490 1999 6.276428 2012 7.571345 2011 6.500000 Name: Bogota population, dtype: float64
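For simple arithmetic like this, apply is not strictly needed, since operators already work element by element on a Series. A quick sketch comparing both routes on two of the values above:

```python
import pandas as pd

population = pd.Series({1999: 6276428, 2012: 7571345})

# Element-wise function application ...
via_apply = population.apply(lambda x: x / 1000000.0)
# ... and the equivalent vectorized arithmetic on the whole Series.
via_division = population / 1000000.0

print(via_apply)
print(via_division)
```

The vectorized form is usually the faster of the two on large series, and apply remains useful for functions that cannot be expressed with operators.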
Finally, sometimes we need a quick idea of what the data looks like. For this we can use the describe method.
series_bogota.describe()
count 9.000000 mean 3332790.000000 std 2926351.983191 min 21964.000000 25% 715250.000000 50% 2855065.000000 75% 6276428.000000 max 7571345.000000 dtype: float64
Or even cooler, plots
series_bogota.plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x10993e150>
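# Series.sort() was removed from pandas; sort_values() returns a copy sorted by value.
series_bogota = series_bogota.sort_values()
(series_bogota / 1000000.0).plot(kind='bar')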
<matplotlib.axes.AxesSubplot at 0x109a4b350>
What about population change?
series_bogota.pct_change()
year 1800 NaN 1912 4.520716 1951 4.898629 1964 1.373032 1973 0.682111 1985 0.483851 1999 0.481516 2011 0.035621 2012 0.164822 Name: Bogota population, dtype: float64
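It may help to see what pct_change computes. The sketch below rebuilds it by hand with shift on three of the values; it is meant as an illustration of the formula, not as a description of how pandas implements it internally.

```python
import pandas as pd

population = pd.Series({1985: 4236490, 1999: 6276428, 2012: 7571345})

# Each value is compared with the previous one: (x[i] - x[i-1]) / x[i-1].
manual = (population - population.shift(1)) / population.shift(1)
builtin = population.pct_change()

# The first element has no predecessor, so both results start with NaN.
print(pd.concat({'manual': manual, 'builtin': builtin}, axis=1))
```

Note that the result is a fraction, so a value of 0.48 means a 48% increase over the previous entry, regardless of how many years lie in between.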
series_bogota.pct_change().plot(kind='bar')
plt.ylabel('percentage change')
<matplotlib.text.Text at 0x109d0ccd0>
Note that there is a lot of missing data. Is there a way to solve this quickly?
Reindex + Interpolation
series_bogota = series_bogota.reindex(range(1800, 2014))
series_bogota
1800 21964 1801 NaN 1802 NaN 1803 NaN 1804 NaN 1805 NaN 1806 NaN 1807 NaN 1808 NaN 1809 NaN 1810 NaN 1811 NaN 1812 NaN 1813 NaN 1814 NaN ... 1999 6276428 2000 NaN 2001 NaN 2002 NaN 2003 NaN 2004 NaN 2005 NaN 2006 NaN 2007 NaN 2008 NaN 2009 NaN 2010 NaN 2011 6500000 2012 7571345 2013 NaN Name: Bogota population, Length: 214, dtype: float64
series_bogota = series_bogota.interpolate('values')
series_bogota
1800 21964.000000 1801 22850.544643 1802 23737.089286 1803 24623.633929 1804 25510.178571 1805 26396.723214 1806 27283.267857 1807 28169.812500 1808 29056.357143 1809 29942.901786 1810 30829.446429 1811 31715.991071 1812 32602.535714 1813 33489.080357 1814 34375.625000 ... 1999 6276428 2000 6295059 2001 6313690 2002 6332321 2003 6350952 2004 6369583 2005 6388214 2006 6406845 2007 6425476 2008 6444107 2009 6462738 2010 6481369 2011 6500000 2012 7571345 2013 7571345 Name: Bogota population, Length: 214, dtype: float64
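The two steps are easier to follow on a tiny example. This sketch uses made-up numbers and method='index', which, like 'values', interpolates linearly using the index as the x-coordinate.

```python
import pandas as pd

sparse = pd.Series({1800: 100.0, 1804: 500.0, 1805: 700.0})  # made-up values

# Reindexing onto every year introduces NaN for the years without data.
yearly = sparse.reindex(range(1800, 1806))

# method='index' interpolates linearly using the index values as x-coordinates.
filled = yearly.interpolate(method='index')
print(filled)
```

The years 1801 to 1803 are filled along the straight line between the 1800 and 1804 values.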
(series_bogota/1000000.0).plot()
<matplotlib.axes.AxesSubplot at 0x109dc0750>
(series_bogota).pct_change().plot()
<matplotlib.axes.AxesSubplot at 0x109dc6e90>
Finally, let us look at adding series.
population_cali = {1809: 7546, 1938:101883, 1973:991549, 1985:1429026, 2013:2319684}
series_cali = pd.Series(population_cali)
series_cali.name = 'Cali (Colombia) Population'
series_cali.index.name = 'year'
series_cali
year 1809 7546 1938 101883 1973 991549 1985 1429026 2013 2319684 Name: Cali (Colombia) Population, dtype: int64
(series_cali + series_bogota)
1800 NaN 1801 NaN 1802 NaN 1803 NaN 1804 NaN 1805 NaN 1806 NaN 1807 NaN 1808 NaN 1809 37488.901786 1810 NaN 1811 NaN 1812 NaN 1813 NaN 1814 NaN ... 1999 NaN 2000 NaN 2001 NaN 2002 NaN 2003 NaN 2004 NaN 2005 NaN 2006 NaN 2007 NaN 2008 NaN 2009 NaN 2010 NaN 2011 NaN 2012 NaN 2013 9891029 Length: 214, dtype: float64
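The NaN values appear because + aligns the two series on their index and only labels present in both get a number. As an aside, Series.add accepts a fill_value for the missing side. A minimal sketch with made-up numbers:

```python
import pandas as pd

a = pd.Series({1809: 10, 2013: 30})  # made-up values
b = pd.Series({1800: 1, 1809: 2})

# The + operator aligns on the index; labels present in only one Series give NaN.
plain_sum = a + b

# Series.add with fill_value treats a label missing from one side as that value.
filled_sum = a.add(b, fill_value=0)
print(filled_sum)
```

Whether treating a missing year as zero is meaningful depends on the data; for populations it usually is not, which is why interpolation is the more sensible choice here.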
Probably the most meaningful way to add the series is to have them share an index, so I can reindex Cali with the index of Bogotá.
series_cali
year 1809 7546 1938 101883 1973 991549 1985 1429026 2013 2319684 Name: Cali (Colombia) Population, dtype: int64
series_cali = series_cali.reindex(series_bogota.index)
series_cali
1800 NaN 1801 NaN 1802 NaN 1803 NaN 1804 NaN 1805 NaN 1806 NaN 1807 NaN 1808 NaN 1809 7546 1810 NaN 1811 NaN 1812 NaN 1813 NaN 1814 NaN ... 1999 NaN 2000 NaN 2001 NaN 2002 NaN 2003 NaN 2004 NaN 2005 NaN 2006 NaN 2007 NaN 2008 NaN 2009 NaN 2010 NaN 2011 NaN 2012 NaN 2013 2319684 Name: Cali (Colombia) Population, Length: 214, dtype: float64
len(series_bogota) == len(series_cali)
True
np.all(series_bogota.index == series_cali.index)
True
series_cali = series_cali.interpolate('values')
series_cali
1800 NaN 1801 NaN 1802 NaN 1803 NaN 1804 NaN 1805 NaN 1806 NaN 1807 NaN 1808 NaN 1809 7546.000000 1810 8277.294574 1811 9008.589147 1812 9739.883721 1813 10471.178295 1814 11202.472868 ... 1999 1874355.000000 2000 1906164.214286 2001 1937973.428571 2002 1969782.642857 2003 2001591.857143 2004 2033401.071429 2005 2065210.285714 2006 2097019.500000 2007 2128828.714286 2008 2160637.928571 2009 2192447.142857 2010 2224256.357143 2011 2256065.571429 2012 2287874.785714 2013 2319684.000000 Name: Cali (Colombia) Population, Length: 214, dtype: float64
series_cali = series_cali.fillna(0.0)
series_cali
1800 0.000000 1801 0.000000 1802 0.000000 1803 0.000000 1804 0.000000 1805 0.000000 1806 0.000000 1807 0.000000 1808 0.000000 1809 7546.000000 1810 8277.294574 1811 9008.589147 1812 9739.883721 1813 10471.178295 1814 11202.472868 ... 1999 1874355.000000 2000 1906164.214286 2001 1937973.428571 2002 1969782.642857 2003 2001591.857143 2004 2033401.071429 2005 2065210.285714 2006 2097019.500000 2007 2128828.714286 2008 2160637.928571 2009 2192447.142857 2010 2224256.357143 2011 2256065.571429 2012 2287874.785714 2013 2319684.000000 Name: Cali (Colombia) Population, Length: 214, dtype: float64
Now we can finally add the two series.
series_bogota + series_cali
1800 21964.000000 1801 22850.544643 1802 23737.089286 1803 24623.633929 1804 25510.178571 1805 26396.723214 1806 27283.267857 1807 28169.812500 1808 29056.357143 1809 37488.901786 1810 39106.741002 1811 40724.580219 1812 42342.419435 1813 43960.258652 1814 45578.097868 ... 1999 8150783.000000 2000 8201223.214286 2001 8251663.428571 2002 8302103.642857 2003 8352543.857143 2004 8402984.071429 2005 8453424.285714 2006 8503864.500000 2007 8554304.714286 2008 8604744.928571 2009 8655185.142857 2010 8705625.357143 2011 8756065.571429 2012 9859219.785714 2013 9891029.000000 Length: 214, dtype: float64
series_bogota.plot(label='Bogota population')
series_cali.plot(label='Cali population')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x109fd1950>
I do not vouch for the statistics here; this is just an example of the tool.
df = pd.DataFrame({'bogotá':series_bogota, 'cali':series_cali})
df
<class 'pandas.core.frame.DataFrame'> Int64Index: 214 entries, 1800 to 2013 Data columns (total 2 columns): bogotá 214 non-null values cali 214 non-null values dtypes: float64(2)
df.head()
bogotá | cali | |
---|---|---|
1800 | 21964.000000 | 0 |
1801 | 22850.544643 | 0 |
1802 | 23737.089286 | 0 |
1803 | 24623.633929 | 0 |
1804 | 25510.178571 | 0 |
pd.options.display.float_format = '{:20,.2f}'.format
df.index.name = 'year'
df.tail()
year | bogotá | cali |
---|---|---|
2009 | 6,462,738.00 | 2,192,447.14 |
2010 | 6,481,369.00 | 2,224,256.36 |
2011 | 6,500,000.00 | 2,256,065.57 |
2012 | 7,571,345.00 | 2,287,874.79 |
2013 | 7,571,345.00 | 2,319,684.00 |
Elements of a DataFrame
df.index
Int64Index([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype=int64)
df.columns
Index([u'bogotá', u'cali'], dtype=object)
np.shape(df.values)
(214, 2)
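The same three pieces (index, columns, values) can be inspected on a much smaller frame. This sketch builds one from two short series with made-up numbers, to show how the row index is formed when the series do not cover the same years.

```python
import pandas as pd

bogota = pd.Series({2011: 10, 2012: 20})  # made-up values
cali = pd.Series({2012: 5, 2013: 7})

# Each dictionary key becomes a column; the row index is the union of the Series indexes.
frame = pd.DataFrame({'bogota': bogota, 'cali': cali})

print(frame.index.tolist())    # row labels
print(frame.columns.tolist())  # column labels
print(frame.values.shape)      # shape of the underlying 2-D NumPy array
```

Years that appear in only one of the series end up as NaN in the other column.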
df['population difference'] = df['bogotá'] - df['cali']
df.tail()
year | bogotá | cali | population difference |
---|---|---|---|
2009 | 6,462,738.00 | 2,192,447.14 | 4,270,290.86 |
2010 | 6,481,369.00 | 2,224,256.36 | 4,257,112.64 |
2011 | 6,500,000.00 | 2,256,065.57 | 4,243,934.43 |
2012 | 7,571,345.00 | 2,287,874.79 | 5,283,470.21 |
2013 | 7,571,345.00 | 2,319,684.00 | 5,251,661.00 |
df.describe()
bogotá | cali | population difference | |
---|---|---|---|
count | 214.00 | 214.00 | 214.00 |
mean | 1,269,633.63 | 440,945.03 | 828,688.61 |
std | 2,057,142.86 | 658,404.11 | 1,404,249.56 |
min | 21,964.00 | 0.00 | 21,964.00 |
25% | 69,172.50 | 39,905.78 | 29,266.72 |
50% | 116,381.00 | 78,847.22 | 37,533.78 |
75% | 1,376,252.60 | 654,746.87 | 721,505.72 |
max | 7,571,345.00 | 2,319,684.00 | 5,283,470.21 |
df['population difference'].tail()
year 2009 4,270,290.86 2010 4,257,112.64 2011 4,243,934.43 2012 5,283,470.21 2013 5,251,661.00 Name: population difference, dtype: float64
df[df.index >1990]
year | bogotá | cali | population difference |
---|---|---|---|
1991 | 5,110,749.14 | 1,619,881.29 | 3,490,867.86 |
1992 | 5,256,459.00 | 1,651,690.50 | 3,604,768.50 |
1993 | 5,402,168.86 | 1,683,499.71 | 3,718,669.14 |
1994 | 5,547,878.71 | 1,715,308.93 | 3,832,569.79 |
1995 | 5,693,588.57 | 1,747,118.14 | 3,946,470.43 |
1996 | 5,839,298.43 | 1,778,927.36 | 4,060,371.07 |
1997 | 5,985,008.29 | 1,810,736.57 | 4,174,271.71 |
1998 | 6,130,718.14 | 1,842,545.79 | 4,288,172.36 |
1999 | 6,276,428.00 | 1,874,355.00 | 4,402,073.00 |
2000 | 6,295,059.00 | 1,906,164.21 | 4,388,894.79 |
2001 | 6,313,690.00 | 1,937,973.43 | 4,375,716.57 |
2002 | 6,332,321.00 | 1,969,782.64 | 4,362,538.36 |
2003 | 6,350,952.00 | 2,001,591.86 | 4,349,360.14 |
2004 | 6,369,583.00 | 2,033,401.07 | 4,336,181.93 |
2005 | 6,388,214.00 | 2,065,210.29 | 4,323,003.71 |
2006 | 6,406,845.00 | 2,097,019.50 | 4,309,825.50 |
2007 | 6,425,476.00 | 2,128,828.71 | 4,296,647.29 |
2008 | 6,444,107.00 | 2,160,637.93 | 4,283,469.07 |
2009 | 6,462,738.00 | 2,192,447.14 | 4,270,290.86 |
2010 | 6,481,369.00 | 2,224,256.36 | 4,257,112.64 |
2011 | 6,500,000.00 | 2,256,065.57 | 4,243,934.43 |
2012 | 7,571,345.00 | 2,287,874.79 | 5,283,470.21 |
2013 | 7,571,345.00 | 2,319,684.00 | 5,251,661.00 |
df.plot()
<matplotlib.axes.AxesSubplot at 0x10a2c04d0>
!ls *.csv
cali_and_bogota.csv metropolitan.csv patients.csv
df.to_csv('cali_and_bogota.csv')
!head cali_and_bogota.csv
year,bogotá,cali,population difference 1800,21964.0,0.0,21964.0 1801,22850.54464285714,0.0,22850.54464285714 1802,23737.089285714286,0.0,23737.089285714286 1803,24623.633928571428,0.0,24623.633928571428 1804,25510.178571428572,0.0,25510.178571428572 1805,26396.723214285714,0.0,26396.723214285714 1806,27283.267857142855,0.0,27283.267857142855 1807,28169.8125,0.0,28169.8125 1808,29056.357142857145,0.0,29056.357142857145
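Writing and reading back can be tried without touching the disk. The sketch below uses an in-memory buffer in place of a file and a two-row frame with made-up numbers; to_csv and read_csv are the same functions used in this notebook.

```python
import io
import pandas as pd

frame = pd.DataFrame({'bogota': [1.5, 2.5], 'cali': [0.5, 1.0]},  # made-up values
                     index=pd.Index([2012, 2013], name='year'))

# to_csv writes the index as the first column; an in-memory buffer stands in for a file.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col tells read_csv to turn that first column back into the index.
restored = pd.read_csv(buffer, index_col='year')
print(restored)
```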
There are many ways to initialize DataFrames. One of the most common is to read a CSV file.
!cat metropolitan.csv
Metropolitan area,Country,Population,Area Tokyo,Japan,32450000,8014 Seoul,South Korea,20550000,5076 Mexico City,Mexico,20450000,7346 New York City,United States,19750000,17884 Mumbai-Bombay,India,19200000,2350 Jakarta,Indonesia,18900000,5100 São Paulo,Brazil,18850000,8479 New Delhi,India,18600000,3182 Osaka-Kobe-Kyoto,Japan,17375000,6930 Shanghai,China,16650000,5177 Manila,Philippines,16300000,2521 Hong Kong,Hong Kong/China,15800000,3051 Los Angeles,United States,15250000,10780 Kolkata,India,15100000,1785 Moscow,Russia,15000000,14925 Cairo,Egypt,14450000,1600 Buenos Aires,Argentina,13170000,10888 London,United Kingdom,12875000,11391 Beijing,China,12500000,6562 Karachi,Pakistan,11800000,1100
df = pd.read_csv('metropolitan.csv')
df
Metropolitan area | Country | Population | Area | |
---|---|---|---|---|
0 | Tokyo | Japan | 32450000 | 8014 |
1 | Seoul | South Korea | 20550000 | 5076 |
2 | Mexico City | Mexico | 20450000 | 7346 |
3 | New York City | United States | 19750000 | 17884 |
4 | Mumbai-Bombay | India | 19200000 | 2350 |
5 | Jakarta | Indonesia | 18900000 | 5100 |
6 | São Paulo | Brazil | 18850000 | 8479 |
7 | New Delhi | India | 18600000 | 3182 |
8 | Osaka-Kobe-Kyoto | Japan | 17375000 | 6930 |
9 | Shanghai | China | 16650000 | 5177 |
10 | Manila | Philippines | 16300000 | 2521 |
11 | Hong Kong | Hong Kong/China | 15800000 | 3051 |
12 | Los Angeles | United States | 15250000 | 10780 |
13 | Kolkata | India | 15100000 | 1785 |
14 | Moscow | Russia | 15000000 | 14925 |
15 | Cairo | Egypt | 14450000 | 1600 |
16 | Buenos Aires | Argentina | 13170000 | 10888 |
17 | London | United Kingdom | 12875000 | 11391 |
18 | Beijing | China | 12500000 | 6562 |
19 | Karachi | Pakistan | 11800000 | 1100 |
df = pd.read_csv('metropolitan.csv', index_col='Metropolitan area')
df
Metropolitan area | Country | Population | Area |
---|---|---|---|
Tokyo | Japan | 32450000 | 8014 |
Seoul | South Korea | 20550000 | 5076 |
Mexico City | Mexico | 20450000 | 7346 |
New York City | United States | 19750000 | 17884 |
Mumbai-Bombay | India | 19200000 | 2350 |
Jakarta | Indonesia | 18900000 | 5100 |
São Paulo | Brazil | 18850000 | 8479 |
New Delhi | India | 18600000 | 3182 |
Osaka-Kobe-Kyoto | Japan | 17375000 | 6930 |
Shanghai | China | 16650000 | 5177 |
Manila | Philippines | 16300000 | 2521 |
Hong Kong | Hong Kong/China | 15800000 | 3051 |
Los Angeles | United States | 15250000 | 10780 |
Kolkata | India | 15100000 | 1785 |
Moscow | Russia | 15000000 | 14925 |
Cairo | Egypt | 14450000 | 1600 |
Buenos Aires | Argentina | 13170000 | 10888 |
London | United Kingdom | 12875000 | 11391 |
Beijing | China | 12500000 | 6562 |
Karachi | Pakistan | 11800000 | 1100 |
df.describe()
Population | Area | |
---|---|---|
count | 20.00 | 20.00 |
mean | 17,251,000.00 | 6,707.05 |
std | 4,485,562.40 | 4,619.33 |
min | 11,800,000.00 | 1,100.00 |
25% | 14,862,500.00 | 2,918.50 |
50% | 16,475,000.00 | 5,869.50 |
75% | 18,975,000.00 | 9,054.25 |
max | 32,450,000.00 | 17,884.00 |
df.dtypes
Country object Population int64 Area int64 dtype: object
df['Population'] = df['Population'].astype(float)
df['Area'] = df['Area'].astype(float)
df.dtypes
Country object Population float64 Area float64 dtype: object
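# DataFrame.sort was removed from pandas; sort_values returns a sorted copy, so we keep the result.
df = df.sort_values('Population', ascending=False)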
df.head(3)
Metropolitan area | Country | Population | Area |
---|---|---|---|
Tokyo | Japan | 32,450,000.00 | 8,014.00 |
Seoul | South Korea | 20,550,000.00 | 5,076.00 |
Mexico City | Mexico | 20,450,000.00 | 7,346.00 |
Areas larger than 10,000 km²
df[df['Area']> 10000]
Metropolitan area | Country | Population | Area |
---|---|---|---|
New York City | United States | 19,750,000.00 | 17,884.00 |
Los Angeles | United States | 15,250,000.00 | 10,780.00 |
Moscow | Russia | 15,000,000.00 | 14,925.00 |
Buenos Aires | Argentina | 13,170,000.00 | 10,888.00 |
London | United Kingdom | 12,875,000.00 | 11,391.00 |
df[(df['Area']> 10000) & (df['Population'] > 20000)]
Metropolitan area | Country | Population | Area |
---|---|---|---|
New York City | United States | 19,750,000.00 | 17,884.00 |
Los Angeles | United States | 15,250,000.00 | 10,780.00 |
Moscow | Russia | 15,000,000.00 | 14,925.00 |
Buenos Aires | Argentina | 13,170,000.00 | 10,888.00 |
London | United Kingdom | 12,875,000.00 | 11,391.00 |
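When combining conditions on a DataFrame, the parentheses around each comparison matter. A small sketch on a made-up frame (the city labels and numbers are invented for the example):

```python
import pandas as pd

cities = pd.DataFrame(
    {'Population': [100, 250, 300], 'Area': [10, 50, 20]},  # made-up values
    index=['small', 'sprawling', 'dense'],
)

# Each comparison needs its own parentheses because & binds tighter than >.
large_and_populous = cities[(cities['Area'] > 15) & (cities['Population'] > 200)]
print(large_and_populous.index.tolist())
```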
df['Density'] = df['Population']/df['Area']
df.sort_values('Density', ascending=False)['Density'].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x109faac10>
df.groupby('Country').head()
Country | Metropolitan area | Country | Population | Area | Density |
---|---|---|---|---|---|
Argentina | Buenos Aires | Argentina | 13,170,000.00 | 10,888.00 | 1,209.59 |
Brazil | São Paulo | Brazil | 18,850,000.00 | 8,479.00 | 2,223.14 |
China | Shanghai | China | 16,650,000.00 | 5,177.00 | 3,216.15 |
China | Beijing | China | 12,500,000.00 | 6,562.00 | 1,904.91 |
Egypt | Cairo | Egypt | 14,450,000.00 | 1,600.00 | 9,031.25 |
Hong Kong/China | Hong Kong | Hong Kong/China | 15,800,000.00 | 3,051.00 | 5,178.63 |
India | Mumbai-Bombay | India | 19,200,000.00 | 2,350.00 | 8,170.21 |
India | New Delhi | India | 18,600,000.00 | 3,182.00 | 5,845.38 |
India | Kolkata | India | 15,100,000.00 | 1,785.00 | 8,459.38 |
Indonesia | Jakarta | Indonesia | 18,900,000.00 | 5,100.00 | 3,705.88 |
Japan | Tokyo | Japan | 32,450,000.00 | 8,014.00 | 4,049.16 |
Japan | Osaka-Kobe-Kyoto | Japan | 17,375,000.00 | 6,930.00 | 2,507.22 |
Mexico | Mexico City | Mexico | 20,450,000.00 | 7,346.00 | 2,783.83 |
Pakistan | Karachi | Pakistan | 11,800,000.00 | 1,100.00 | 10,727.27 |
Philippines | Manila | Philippines | 16,300,000.00 | 2,521.00 | 6,465.69 |
Russia | Moscow | Russia | 15,000,000.00 | 14,925.00 | 1,005.03 |
South Korea | Seoul | South Korea | 20,550,000.00 | 5,076.00 | 4,048.46 |
United Kingdom | London | United Kingdom | 12,875,000.00 | 11,391.00 | 1,130.28 |
United States | New York City | United States | 19,750,000.00 | 17,884.00 | 1,104.34 |
United States | Los Angeles | United States | 15,250,000.00 | 10,780.00 | 1,414.66 |
type(df.groupby('Country'))
pandas.core.groupby.DataFrameGroupBy
print(df.groupby('Country').sum().sort_values('Density', ascending=False).head())
Country         Population       Area    Density
India        52,900,000.00   7,317.00  22,474.98
Pakistan     11,800,000.00   1,100.00  10,727.27
Egypt        14,450,000.00   1,600.00   9,031.25
Japan        49,825,000.00  14,944.00   6,556.38
Philippines  16,300,000.00   2,521.00   6,465.69
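One thing to keep in mind when reading this table: sum() also adds up the Density column, so the value for a country with several cities is the sum of the per-city densities, not the density of the combined area. The sketch below, on made-up numbers, shows one possible way to compute a combined density instead; the column name 'Combined density' is my own.

```python
import pandas as pd

cities = pd.DataFrame({
    'Country': ['X', 'X', 'Y'],  # made-up data
    'Population': [1000.0, 3000.0, 500.0],
    'Area': [10.0, 10.0, 5.0],
})
cities['Density'] = cities['Population'] / cities['Area']

totals = cities.groupby('Country')[['Population', 'Area', 'Density']].sum()

# Summing the Density column adds up per-city densities: 100 + 300 = 400 for X.
# Dividing total population by total area gives the combined density: 4000 / 20 = 200.
totals['Combined density'] = totals['Population'] / totals['Area']
print(totals)
```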
All the data in this notebook was taken from Wikipedia.