In this example, we will download and analyze some data about a large number of cities around the world and their population. This data has been created by MaxMind and is available for free at http://www.maxmind.com.
We first download the Zip file and uncompress it in a folder. The Zip file is about 40MB, so downloading it may take a while.
import urllib2, zipfile
url = 'http://ipython.rossant.net/'
filename = 'cities.zip'
downloaded = urllib2.urlopen(url + filename)
folder = 'data'
mkdir data
with open(filename, 'wb') as f:
    f.write(downloaded.read())
with zipfile.ZipFile(filename) as zip:
    zip.extractall(folder)
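The listing above targets Python 2 (urllib2) and relies on IPython's mkdir shell command. A minimal Python 3 sketch of the same steps might look as follows; the helper name download_and_extract is hypothetical and not part of the original text.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, archive_path, folder):
    """Download a Zip archive from `url` and extract it into `folder`.

    Hypothetical helper, not part of the original text.
    Returns the list of file names contained in the archive.
    """
    archive_path = Path(archive_path)
    folder = Path(folder)
    # Create the target folder (replaces the `mkdir data` shell command).
    folder.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding ~40 MB in memory.
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as f:
        shutil.copyfileobj(response, f)
    # Extract every member of the archive into the folder.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(folder)
        return archive.namelist()


# Example call (commented out because it needs network access):
# download_and_extract("http://ipython.rossant.net/cities.zip", "cities.zip", "data")
```

Streaming with shutil.copyfileobj is a design choice of this sketch; reading the whole response at once, as in the original listing, works too.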
Now, we're going to use Pandas to load the CSV file that has been extracted. The read_csv function of Pandas can open any CSV file.
import pandas as pd
filename = 'data/worldcitiespop.txt'
data = pd.read_csv(filename)
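To experiment with read_csv without downloading the full file, here is a small sketch that parses an in-memory sample. The two rows are adapted from outputs shown in this section, the region code "06" is made up for illustration, and the variable names are assumptions. The real file may also need an explicit encoding argument; verify that against the file itself.

```python
import io

import pandas as pd

# A tiny illustrative sample that mimics the layout of worldcitiespop.txt.
sample = io.StringIO(
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "us,new york,New York,NY,8107916,40.714167,-74.006389\n"
    "zw,zuzumba,Zuzumba,06,,-20.033333,27.933333\n"
)

# dtype keeps the Region codes as text, so a code like "06" is not turned into 6.
cities = pd.read_csv(sample, dtype={"Region": str})

print(cities.shape)        # (rows, columns)
print(cities.Population)   # the empty field is read as NaN
```

Because one population is missing, the whole Population column is stored as floating-point numbers, which is why populations appear as values like 8107916.0 later on.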
Now, let's explore the newly created data object.
type(data)
pandas.core.frame.DataFrame
The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like an Excel spreadsheet). Like a NumPy array, it has a shape attribute that returns the shape of the table. But unlike a NumPy array, the DataFrame object has a richer structure; in particular, the keys method returns the names of the different columns.
data.shape, data.keys()
((3173958, 7), Index([Country, City, AccentCity, Region, Population, Latitude, Longitude], dtype=object))
We can see that data has more than 3 million rows and seven columns, including the country, city, population, and GPS coordinates of each city. The head and tail methods let us take a quick look at the beginning and the end of the table, respectively.
data.tail()
| | Country | City | AccentCity | Region | Population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 3173953 | zw | zimre park | Zimre Park | 4 | NaN | -17.866111 | 31.213611 |
| 3173954 | zw | ziyakamanas | Ziyakamanas | 0 | NaN | -18.216667 | 27.950000 |
| 3173955 | zw | zizalisari | Zizalisari | 4 | NaN | -17.758889 | 31.010556 |
| 3173956 | zw | zuzumba | Zuzumba | 6 | NaN | -20.033333 | 27.933333 |
| 3173957 | zw | zvishavane | Zvishavane | 7 | 79876 | -20.333333 | 30.033333 |
We can see that most of these cities have NaN values as populations. The reason is that the population is not available for all cities in the data set, and Pandas handles those missing values transparently.
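As a rough illustration of how Pandas treats missing values, the sketch below rebuilds the Population column of the five rows above as a small Series. The variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Populations from the tail of the table; NaN marks a missing value.
population = pd.Series(
    [np.nan, np.nan, np.nan, np.nan, 79876.0],
    index=["Zimre Park", "Ziyakamanas", "Zizalisari", "Zuzumba", "Zvishavane"],
)

missing = population.isnull()   # boolean mask of the missing entries
print(missing.sum())            # number of missing values
print(population.count())       # number of non-missing values
print(population.dropna())      # only the cities with a known population
print(population.mean())        # NaN entries are skipped by default
```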
We'll see in the next sections what we can actually do with these data.
Each column of the DataFrame object can be accessed by its name. In IPython, tab completion offers the column names, among others, as attributes of the object. Here we get the Series containing the names of all cities (AccentCity is the full name of the city, with uppercase characters and accents).
data.AccentCity
0          Aixàs
1          Aixirivali
2          Aixirivall
3          Aixirvall
4          Aixovall
5          Andorra
6          Andorra la Vella
7          Andorra-Vieille
8          Andorre
9          Andorre-la-Vieille
10         Andorre-Vieille
11         Ansalonga
12         Anyós
13         Arans
14         Arinsal
...
3173943    Zandi
3173944    Zanyika
3173945    Zemalapala
3173946    Zemandana
3173947    Zemanda
3173948    Zibalonkwe
3173949    Zibunkululu
3173950    Ziga
3173951    Zikamanas Village
3173952    Zimbabwe
3173953    Zimre Park
3173954    Ziyakamanas
3173955    Zizalisari
3173956    Zuzumba
3173957    Zvishavane
Name: AccentCity, Length: 3173958
This column is an instance of the Series class. We can access certain rows using indexing. In the following example, we get the name of the city at index 30000, that is, the 30001st city since indexing is 0-based:
data.AccentCity[30000]
'Howasiyan'
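A Series can be indexed either by label or by position. In the full table the two coincide, because the default index labels are simply 0, 1, 2, and so on. The sketch below uses a small stand-in table with non-consecutive labels to show the difference; the names cities and names are assumptions for this example.

```python
import pandas as pd

# Small stand-in for the full table; values come from outputs in this section.
cities = pd.DataFrame(
    {"AccentCity": ["Zuzumba", "Zvishavane", "New York"]},
    index=[3173956, 3173957, 2990572],
)

# Attribute access and bracket access return the same Series.
assert cities.AccentCity.equals(cities["AccentCity"])

names = cities["AccentCity"]
print(names.loc[2990572])   # label-based: the row whose index label is 2990572
print(names.iloc[0])        # position-based: the first row, whatever its label
```

Bracket access is the safer choice when a column name contains spaces or clashes with a DataFrame method name.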
So we can access an element if we know its index. But how can we obtain a city from its name? For example, we'd like to obtain the population and GPS coordinates of New York. One possibility would be to loop through all cities and check their names, but that would be extremely slow because Python loops over millions of elements are not optimized at all. Pandas and NumPy offer a much more elegant and efficient way called boolean indexing. There are two steps that typically occur on the same line of code. First, we create an array of boolean values indicating, for each element, whether it satisfies a condition or not (here, whether the city name is New York). Then, we pass this array of booleans as an index to our original array: the result is a subpart of the full array containing only the elements corresponding to True. For example:
data[data.AccentCity=='New York'],
(         Country      City  AccentCity  Region  Population   Latitude    Longitude
  998166       gb  new york    New York      H7         NaN  53.083333    -0.150000
 1087431       hn  new york    New York      16         NaN  14.800000   -88.366667
 1525856       jm  new york    New York       9         NaN  18.250000   -77.183333
 1525857       jm  new york    New York      10         NaN  18.116667   -77.133333
 1893972       mx  new york    New York       5         NaN  16.266667   -93.233333
 2929399       us  new york    New York      FL         NaN  30.838333   -87.200833
 2946036       us  new york    New York      IA         NaN  40.851667   -93.259722
 2951120       us  new york    New York      KY         NaN  36.988889   -88.952500
 2977571       us  new york    New York      MO         NaN  39.685278   -93.926667
 2986561       us  new york    New York      NM         NaN  35.058611  -107.526667
 2990572       us  new york    New York      NY     8107916  40.714167   -74.006389
 3029084       us  new york    New York      TX         NaN  32.167778   -95.668889,)
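The two steps of boolean indexing can be made explicit on a pair of small NumPy arrays. This is a sketch on stand-in data; the array names are assumptions for this example.

```python
import numpy as np

# City names and region codes adapted from the outputs in this section.
names = np.array(["New York", "New York", "Zvishavane", "New York"])
regions = np.array(["FL", "NY", "7", "TX"])

# Step 1: build the boolean mask, one value per element.
mask = names == "New York"
print(mask)            # [ True  True False  True]

# Step 2: use the mask as an index to keep only the True positions.
print(regions[mask])   # ['FL' 'NY' 'TX']

# Both steps usually appear on a single line.
print(regions[names == "New York"])
```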
The same syntax works in NumPy and Pandas. Here, we find a dozen cities named New York, but only one of them is in New York State. To access a single row with Pandas, we can use the .ix attribute (for index):
ny = 2990572
data.ix[ny]
Country       us
City          new york
AccentCity    New York
Region        NY
Population    8107916
Latitude      40.71417
Longitude     -74.00639
Name: 2990572
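Rather than copying the index by hand, the row can be found by combining two conditions. The sketch below also uses .loc for the label-based lookup, since .ix was deprecated and then removed in later Pandas releases. It runs on a three-row stand-in table built from the rows shown above; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Three of the "New York" rows shown above, with their original index labels.
cities = pd.DataFrame(
    {
        "Country": ["gb", "us", "us"],
        "AccentCity": ["New York", "New York", "New York"],
        "Region": ["H7", "NY", "TX"],
        "Population": [np.nan, 8107916.0, np.nan],
    },
    index=[998166, 2990572, 3029084],
)

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
mask = (cities.AccentCity == "New York") & (cities.Region == "NY")
matches = cities[mask]

ny = matches.index[0]   # index label of the matching row
row = cities.loc[ny]    # label-based lookup
print(ny, row.Population)
```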
Now, let's turn the Population column, which is a Series object, into a pure NumPy array. We go from the Pandas world to NumPy (keeping in mind that Pandas is built on top of NumPy). We'll mostly work with the population count of all cities.
population = array(data.Population)
population.shape
(3173958,)
The population array is a one-dimensional vector with the populations of all cities (or NaN if the population is not available). The population of New York can be accessed in NumPy with basic indexing:
population[ny]
8107916.0
Let's find out how many cities actually have a population count. To do this, we'll select all elements in the population array that are not NaN. We can use the NumPy function isnan:
isnan(population)
array([ True, True, True, ..., True, True, False], dtype=bool)
x = population[~_]
len(x), len(x) / float(len(population))
(47980, 0.015116772181610469)
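The listing above relies on IPython's `_` variable, which holds the last output. Outside an interactive session, the mask has to be named explicitly. A minimal sketch on a stand-in array follows; the name has_count is an assumption.

```python
import numpy as np

# Stand-in for the population array: NaN marks a missing count.
population = np.array([np.nan, 79876.0, np.nan, 8107916.0, np.nan])

# Name the mask explicitly instead of relying on IPython's `_`.
has_count = ~np.isnan(population)
x = population[has_count]

# NaN is not equal to anything, including itself, so == cannot find it.
assert not (population == np.nan).any()

print(len(x), len(x) / len(population))
```

This is why isnan is needed: a comparison such as population == np.nan is False everywhere.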
Only about 1.5% of the cities in this data set have a population count.
Now let's explore some statistics on the population of these cities.
x.mean()
47719.57063359733
x.sum() / 1e9
2.2895849990000001
len(x)/float(len(population))
0.015116772181610469
The total population of those cities is about 2.3 billion people, about a third of the current world population. Hence, according to this data set, roughly 30% of the world's population lives in about 1.5% of the cities in the world!
data.Population.describe()
count       47980.000000
mean        47719.570634
std        302888.715626
min             7.000000
25%          3732.000000
50%         10779.000000
75%         27990.500000
max      31480498.000000
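The count reported by describe matches len(x) because Pandas statistics skip NaN values by default, whereas plain NumPy reductions do not. The following sketch illustrates the difference on a few stand-in values; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

values = [np.nan, 7.0, 79876.0, np.nan, 8107916.0]
series = pd.Series(values)
arr = np.array(values)

print(series.mean())     # Pandas skips NaN by default
print(arr.mean())        # plain NumPy propagates NaN, so the result is nan
print(np.nanmean(arr))   # NaN-aware NumPy variant, same value as Pandas
print(series.count())    # number of non-missing entries, as in describe()
```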
Now, let's find the city closest to a given pair of geographical coordinates.
locations = data[['Latitude','Longitude']].as_matrix()
def locate(x, y):
    d = locations - array([x, y])
    distances = d[:,0] ** 2 + d[:,1] ** 2
    closest = distances.argmin()
    return data.AccentCity[closest]
print(locate(48.861, 2.3358))
Paris
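The locate function above compares squared differences in degrees, which is a reasonable approximation for nearby points but distorts distances at high latitudes and across the 180° meridian. As a variation, not the method used in this section, here is a sketch that uses the haversine (great-circle) formula instead. It runs on a three-city stand-in table with approximate coordinates, and the names locate_haversine, names and coords are assumptions.

```python
import numpy as np

# Small stand-in table; approximate (latitude, longitude) pairs in degrees.
names = np.array(["Paris", "London", "New York"])
coords = np.array([
    [48.8566, 2.3522],
    [51.5074, -0.1278],
    [40.714167, -74.006389],
])


def locate_haversine(lat, lon):
    """Return the name of the city closest to (lat, lon) on a sphere."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(coords[:, 0]), np.radians(coords[:, 1])
    # Haversine formula, vectorized over all cities at once.
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    distances = 2 * np.arcsin(np.sqrt(a))   # angular distance in radians
    return names[distances.argmin()]


print(locate_haversine(48.861, 2.3358))
```

Like the original, this version computes all distances in one vectorized expression rather than looping over cities in Python.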