Chapter 3, example 3

In this example, we will download and analyze data about a large number of cities around the world and their populations. The data was created by MaxMind and is available for free at http://www.maxmind.com.

We first download the Zip file and uncompress it into a folder. The Zip file is about 40 MB, so downloading it may take a while.

In [1]:

import urllib2, zipfile

In [2]:

url = 'http://ipython.rossant.net/'

In [3]:

filename = 'cities.zip'

In [4]:

downloaded = urllib2.urlopen(url + filename)

In [5]:

folder = 'data'

In [6]:

mkdir data

In [7]:

with open(filename, 'wb') as f:
    f.write(downloaded.read())

In [8]:

with zipfile.ZipFile(filename) as zf:
    zf.extractall(folder)
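The cells above rely on urllib2, which exists only in Python 2. As a minimal sketch, assuming Python 3 (where urlopen lives in urllib.request), the same two steps could be wrapped as follows. The function names download and extract are illustrative and not part of the original example.

```python
import os
import zipfile
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def download(url, filename):
    """Save the resource found at `url` into the local file `filename`."""
    with urlopen(url) as response, open(filename, 'wb') as f:
        f.write(response.read())


def extract(filename, folder):
    """Extract a Zip archive into `folder` and return the member names."""
    os.makedirs(folder, exist_ok=True)  # plays the role of `mkdir data`
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(folder)
        return archive.namelist()
```

With these helpers, the steps would read download(url + filename, filename) followed by extract(filename, folder).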

Now, we're going to use Pandas to load the CSV file that has just been extracted. The read_csv function of Pandas can open any CSV file.

In [9]:

import pandas as pd

In [10]:

filename = 'data/worldcitiespop.txt'

In [11]:

data = pd.read_csv(filename)

Now, let's explore the newly created data object.

In [12]:

type(data)

Out[12]:

pandas.core.frame.DataFrame

The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like an Excel spreadsheet). Like a NumPy array, the shape attribute returns the shape of the table. But unlike a NumPy array, the DataFrame object has a richer structure, and in particular the keys method returns the names of the different columns.

In [13]:

data.shape, data.keys()

Out[13]:

((3173958, 7),
 Index([Country, City, AccentCity, Region, Population, Latitude, Longitude], dtype=object))
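To experiment with this structure without loading the full file, a few rows can be parsed from an in-memory string. The sketch below assumes Python 3 with Pandas installed. The rows are copied from outputs shown in this example, and the names sample_csv and sample are illustrative.

```python
import io
import pandas as pd

# A few rows copied from this example's outputs (a stand-in for the full file).
sample_csv = """Country,City,AccentCity,Region,Population,Latitude,Longitude
zw,zuzumba,Zuzumba,6,,-20.033333,27.933333
zw,zvishavane,Zvishavane,7,79876,-20.333333,30.033333
us,new york,New York,NY,8107916,40.714167,-74.006389
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(sample_csv))

print(type(sample))         # the DataFrame class
print(sample.shape)         # (number of rows, number of columns)
print(list(sample.keys()))  # column names
print(sample.dtypes)        # each column carries its own type
```

The empty Population field in the first row is read as NaN, which is why that column is stored as floating-point numbers.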

We can see that data has more than 3 million rows and seven columns, including the country, city, population, and GPS coordinates of each city. The head and tail methods give a quick look at the beginning and the end of the table, respectively.

In [14]:

data.tail()

Out[14]:

	Country	City	AccentCity	Region	Population	Latitude	Longitude
3173953	zw	zimre park	Zimre Park	4	NaN	-17.866111	31.213611
3173954	zw	ziyakamanas	Ziyakamanas	0	NaN	-18.216667	27.950000
3173955	zw	zizalisari	Zizalisari	4	NaN	-17.758889	31.010556
3173956	zw	zuzumba	Zuzumba	6	NaN	-20.033333	27.933333
3173957	zw	zvishavane	Zvishavane	7	79876	-20.333333	30.033333

We can see that most of these cities have NaN values as populations. The reason is that the population is not available for all cities in the data set, and Pandas handles those missing values transparently.
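As a small illustration of this handling, the sketch below rebuilds the Population column of the five rows returned by tail and applies a few standard Series methods. It assumes Python 3 with NumPy and Pandas installed.

```python
import numpy as np
import pandas as pd

# Populations of the five cities in the tail output, with their index labels.
population = pd.Series([np.nan, np.nan, np.nan, np.nan, 79876],
                       index=[3173953, 3173954, 3173955, 3173956, 3173957])

print(population.isnull())  # True where the value is missing
print(population.count())   # number of non-missing values
print(population.sum())     # NaN entries are skipped by default
print(population.dropna())  # only the rows with a known population
```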

We'll see in the next sections what we can actually do with these data.

Each column of the DataFrame object can be accessed by its name. In IPython, tab completion notably suggests the columns as attributes of the object. Here we get the Series containing the names of all cities (AccentCity is the full name of the city, with uppercase characters and accents).

In [15]:

data.AccentCity

Out[15]:

0                  Aixàs
1             Aixirivali
2             Aixirivall
3              Aixirvall
4               Aixovall
5                Andorra
6       Andorra la Vella
7        Andorra-Vieille
8                Andorre
9     Andorre-la-Vieille
10       Andorre-Vieille
11             Ansalonga
12                 Anyós
13                 Arans
14               Arinsal
...
3173943                Zandi
3173944              Zanyika
3173945           Zemalapala
3173946            Zemandana
3173947              Zemanda
3173948           Zibalonkwe
3173949          Zibunkululu
3173950                 Ziga
3173951    Zikamanas Village
3173952             Zimbabwe
3173953           Zimre Park
3173954          Ziyakamanas
3173955           Zizalisari
3173956              Zuzumba
3173957           Zvishavane
Name: AccentCity, Length: 3173958

This column is an instance of the Series class. We can access specific rows using indexing. In the following example, we get the name of the city at index 30000 (that is, the 30001st city, since indexing is 0-based):

In [16]:

data.AccentCity[30000]

Out[16]:

'Howasiyan'
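Attribute access is a shortcut for dictionary-style access with square brackets, which is also available when a column name is not a valid Python identifier. The sketch below, which assumes Python 3 with Pandas installed, shows both forms on three rows taken from the tail output; the variable names are illustrative.

```python
import pandas as pd

# Two columns of three rows from the tail output shown earlier.
sample = pd.DataFrame({
    'City': ['zimre park', 'ziyakamanas', 'zizalisari'],
    'AccentCity': ['Zimre Park', 'Ziyakamanas', 'Zizalisari'],
})

by_attribute = sample.AccentCity  # attribute access
by_key = sample['AccentCity']     # dictionary-style access

print(type(by_key))                 # a Pandas Series
print(by_attribute.equals(by_key))  # both forms return the same column
print(by_key[1])                    # element whose index label is 1
```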

So we can access an element when we know its index. But how can we obtain a city from its name? For example, we'd like to obtain the population and GPS coordinates of New York. One possibility would be to loop through all cities and check their names, but that would be extremely slow, because Python loops over millions of elements are not optimized at all. Pandas and NumPy offer a much more elegant and efficient approach called boolean indexing. It involves two steps that typically occur on the same line of code. First, we create an array of boolean values indicating, for each element, whether it satisfies a condition (here, whether the city name is New York). Then, we pass this array of booleans as an index to our original array: the result is the subset of the full array containing only the elements that correspond to True. For example:

In [17]:

data[data.AccentCity=='New York'],

Out[17]:

(        Country      City AccentCity Region  Population   Latitude   Longitude
998166       gb  new york   New York     H7         NaN  53.083333   -0.150000
1087431      hn  new york   New York     16         NaN  14.800000  -88.366667
1525856      jm  new york   New York      9         NaN  18.250000  -77.183333
1525857      jm  new york   New York     10         NaN  18.116667  -77.133333
1893972      mx  new york   New York      5         NaN  16.266667  -93.233333
2929399      us  new york   New York     FL         NaN  30.838333  -87.200833
2946036      us  new york   New York     IA         NaN  40.851667  -93.259722
2951120      us  new york   New York     KY         NaN  36.988889  -88.952500
2977571      us  new york   New York     MO         NaN  39.685278  -93.926667
2986561      us  new york   New York     NM         NaN  35.058611 -107.526667
2990572      us  new york   New York     NY     8107916  40.714167  -74.006389
3029084      us  new york   New York     TX         NaN  32.167778  -95.668889,)
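The two steps of boolean indexing can be written out separately, as in the following sketch. It assumes Python 3 with Pandas installed and uses a handful of rows copied from this example's outputs. The names mask, new_yorks and big_apple are illustrative.

```python
import pandas as pd

# Three of the "New York" rows shown above, plus one other city.
sample = pd.DataFrame(
    {'AccentCity': ['New York', 'New York', 'New York', 'Zvishavane'],
     'Region': ['NM', 'NY', 'TX', '7'],
     'Population': [float('nan'), 8107916, float('nan'), 79876]},
    index=[2986561, 2990572, 3029084, 3173957])

# Step 1: build one boolean per row.
mask = sample.AccentCity == 'New York'
# Step 2: keep only the rows where the mask is True.
new_yorks = sample[mask]

# Conditions combine element-wise with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses.
big_apple = sample[(sample.AccentCity == 'New York') & (sample.Region == 'NY')]

print(new_yorks)
print(big_apple)
```

Combining conditions this way narrows the result to a single row without having to read the index off the printed table.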

The same syntax works in NumPy and Pandas. Here, we find a dozen cities named New York, but only one happens to be in the state of New York. To access a single element with Pandas, we can use the .ix attribute (for index; .ix was removed in later Pandas versions in favor of .loc and .iloc):

In [18]:

ny = 2990572
data.ix[ny]

Out[18]:

Country             us
City          new york
AccentCity    New York
Region              NY
Population     8107916
Latitude      40.71417
Longitude    -74.00639
Name: 2990572
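The number 2990572 is an index label, not necessarily a position: once rows have been filtered, the two no longer coincide. As a sketch for Pandas versions that provide .loc (label-based) and .iloc (position-based), the difference looks as follows. The small table reuses three rows from the output of the boolean indexing step, and Python 3 is assumed.

```python
import pandas as pd

# Three rows from the New York selection, keeping their original index labels.
sample = pd.DataFrame(
    {'Region': ['NM', 'NY', 'TX'],
     'Latitude': [35.058611, 40.714167, 32.167778],
     'Longitude': [-107.526667, -74.006389, -95.668889]},
    index=[2986561, 2990572, 3029084])

ny = 2990572
by_label = sample.loc[ny]     # the row whose index label is 2990572
by_position = sample.iloc[1]  # the second row, whatever its label

print(by_label)
print(by_label.equals(by_position))
```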

Now, let's turn the Population column, which is a Series object, into a pure NumPy array. We go from the Pandas world to NumPy (keeping in mind that Pandas is built on top of NumPy). We'll mostly work with the population counts of all cities.

In [19]:

population = array(data.Population)

In [20]:

population.shape

Out[20]:

(3173958,)

The population array is a one-dimensional vector with the populations of all cities (or NaN if the population is not available). The population of New York can be accessed in NumPy with basic indexing:

In [21]:

population[ny]

Out[21]:

8107916.0
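The notebook calls array directly because it runs in IPython's pylab mode. In a plain script, the conversion might look like the sketch below, which assumes Python 3 with NumPy and Pandas installed and uses two values from the tail output. It also shows why the population is printed as a float: NaN is a floating-point value, so the whole array is stored as floats.

```python
import numpy as np
import pandas as pd

# Last two populations of the tail output, with their index labels.
population_series = pd.Series([np.nan, 79876.0], index=[3173956, 3173957])

# Convert the Series into a plain NumPy array (labels are dropped).
population = np.array(population_series)

print(type(population), population.dtype, population.shape)
print(population[1])               # NumPy indexes by position
print(population_series[3173957])  # the Series indexes by label
```

In the full data set the labels run from 0 to 3173957, so labels and positions coincide and population[ny] returns the expected city.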

Let's find out how many cities actually have a population count. To do this, we'll select all elements of the population array whose value is not NaN. We can use the NumPy function isnan:

In [22]:

isnan(population)

Out[22]:

array([ True,  True,  True, ...,  True,  True, False], dtype=bool)

In [23]:

x = population[~_]
len(x), len(x) / float(len(population))

Out[23]:

(47980, 0.015116772181610469)

About 1.5% of all cities in this data set have a population count.
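In the cells above, the underscore is IPython's variable holding the last output, here the result of isnan(population). A standalone sketch of the same filtering, with the mask named explicitly, could look like this. It assumes Python 3 with NumPy installed and uses the five populations of the tail output.

```python
import numpy as np

# Populations from the tail output: four missing values and one count.
population = np.array([np.nan, np.nan, np.nan, np.nan, 79876.0])

missing = np.isnan(population)  # boolean mask, True where the value is NaN
x = population[~missing]        # ~ inverts the mask, keeping the known values

print(len(x), len(x) / len(population))

# NaN never compares equal to anything, itself included, so a test such as
# `population != np.nan` is True everywhere and cannot serve as a filter.
print((population != np.nan).all())
```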

Let's now explore some statistics on the city populations.

In [24]:

x.mean()

Out[24]:

47719.57063359733

In [25]:

x.sum() / 1e9

Out[25]:

2.2895849990000001

In [26]:

len(x)/float(len(population))

Out[26]:

0.015116772181610469

The total population of those cities is about 2.3 billion people, about a third of the current world population. Hence, according to this data set, roughly 30% of the world population lives in about 1.5% of the cities in the world!
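The arithmetic behind these figures can be checked from the numbers printed above. In the following sketch (Python 3), the world population of 7 billion is an assumption, a rough figure that does not come from the data set; the other values are copied from the outputs.

```python
n_cities = 3173958                   # rows in the data set
n_with_population = 47980            # rows with a population count
mean_population = 47719.57063359733  # output of x.mean()

# The sum equals the mean times the number of values.
total = n_with_population * mean_population
fraction_of_cities = n_with_population / n_cities

WORLD_POPULATION = 7e9  # assumption: rough figure, not from the data set

print(round(total / 1e9, 2))                  # total, in billions
print(round(100 * fraction_of_cities, 2))     # percentage of cities
print(round(100 * total / WORLD_POPULATION))  # percentage of world population
```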

In [27]:

data.Population.describe()

Out[27]:

count       47980.000000
mean        47719.570634
std        302888.715626
min             7.000000
25%          3732.000000
50%         10779.000000
75%         27990.500000
max      31480498.000000

Now, let's find the city closest to some given geographical coordinates.

In [28]:

locations = data[['Latitude','Longitude']].values

In [29]:

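def locate(x, y):
    # Offsets between every city and the target (latitude, longitude).
    d = locations - array([x, y])
    # Squared Euclidean distances in degrees (no square root needed for argmin).
    distances = d[:,0] ** 2 + d[:,1] ** 2
    # Position of the closest city in the table.
    closest = distances.argmin()
    return data.AccentCity[closest]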

In [30]:

print(locate(48.861, 2.3358))

Paris
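The locate function treats latitude and longitude as flat coordinates, which is a reasonable shortcut for nearby points but ignores that a degree of longitude shrinks toward the poles. As a possible variant, not the method used in this example, the sketch below ranks cities by great-circle distance using the haversine formula. It assumes Python 3 with NumPy installed; haversine_km, EARTH_RADIUS_KM and the three-row table are illustrative, with coordinates copied from outputs shown earlier.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lat, lon, lats, lons):
    """Great-circle distances in km from one point to arrays of points (degrees)."""
    lat, lon, lats, lons = map(np.radians, (lat, lon, lats, lons))
    a = (np.sin((lats - lat) / 2) ** 2
         + np.cos(lat) * np.cos(lats) * np.sin((lons - lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def locate(lat, lon, locations, names):
    """Return the name whose (latitude, longitude) row is closest to the query."""
    distances = haversine_km(lat, lon, locations[:, 0], locations[:, 1])
    return names[int(distances.argmin())]


# Coordinates copied from the outputs above.
locations = np.array([
    [40.714167, -74.006389],  # New York (us, NY)
    [-20.333333, 30.033333],  # Zvishavane (zw)
    [53.083333, -0.150000],   # New York (gb)
])
names = ['New York (us, NY)', 'Zvishavane (zw)', 'New York (gb)']

nearest = locate(48.861, 2.3358, locations, names)
print(nearest)
```

Unlike the notebook version, this locate takes the coordinate table and the names as arguments, so it does not depend on global variables.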