Contributors to this notebook

In [47]:
import numpy as np  # 1.6 or higher
import pandas as pd # 0.10 or higher
from matplotlib import pyplot as plt

print "numpy version: ", np.__version__
print "pandas version: ", pd.__version__
numpy version:  1.6.2
pandas version:  0.10.0

Brief introduction to the Python stack for scientific computing

In this notebook, we will give a very brief introduction to the main elements of the Python stack for scientific computing, with an eye to data analysis. Keep in mind this is very introductory and only scratches the surface of what is out there. Hopefully, it is enough to get you interested and to dig deeper.

The five main libraries presented here are IPython, Numpy, Scipy, pandas and matplotlib.

There are many different resources to learn about tools in Python for science, so any attempt to list them will unavoidably be incomplete. However, one reference worth making is Wes McKinney's book Python for Data Analysis, published by O'Reilly Media. It is a good first step into the world of scientific computing in Python.

In [4]:
from IPython.display import Image
Image(url="http://akamaicovers.oreilly.com/images/0636920023784/cat.gif")
Out[4]:
(Image: cover of Python for Data Analysis)

IPython

IPython is an enhanced interactive shell for scientific computing in Python. It includes several improvements and extensions with respect to the vanilla Python interpreter and it is particularly well suited for interactive data analysis. The project website (http://ipython.org) contains many resources so make sure to take a look for help.
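
For instance, two features you will likely use constantly are "magic" commands and inline introspection (the lines below are illustrative; type them in your own IPython session):

%timeit sum(range(1000))  # magic command to quickly benchmark a statement
range?                    # append ? to any object to pull up its documentation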

IPython Notebook

In particular, this document is written in the IPython Notebook, a web app that extends many of IPython's capabilities and allows you to create self-contained files (called notebooks) that combine Python code, its output, and text written in the Markdown markup language.

The notebook is structured around individual cells that may contain either code or text; if a cell contains code, it can be run as if that chunk of code were pasted into the interpreter.

Here are the main tricks to get you started with the notebook:

  • Make sure all dependencies and the Notebook itself are installed. Then fire up a terminal and run ipython notebook. This will open your default browser and start the web app in the directory your terminal was pointed at. From the home screen, you can create a new notebook or open any .ipynb file in the folder.
  • Double click on a cell to edit it. Once you are ready to run it (code) or render it (text), press shift+enter.
  • A code cell can contain any chunk of Python code that you could execute in the interpreter. For instance:
In [5]:
print "Hello world"
Hello world

When you press shift+enter, the code is evaluated dynamically.

  • A text cell is written in the Markdown markup language, which supports, among other things:

    • Links to websites.
    • Italics and bold.
    • And even the fantastic $\LaTeX$ language for mathematical typesetting.
  • Tab completion and inline help: append ? to any object to pull up its documentation. For example:
In [6]:
pd?
  • The Notebook includes some keyboard shortcuts for agile development. All of them are triggered by ctrl+m. To get the full list, call up the Help by pressing ctrl+m and then h. This is the full list:

Shift-Enter : run cell
Ctrl-Enter : run cell in-place
Ctrl-m x : cut cell
Ctrl-m c : copy cell
Ctrl-m v : paste cell
Ctrl-m d : delete cell
Ctrl-m a : insert cell above
Ctrl-m b : insert cell below
Ctrl-m o : toggle output
Ctrl-m O : toggle output scroll
Ctrl-m l : toggle line numbers
Ctrl-m s : save notebook
Ctrl-m j : move cell down
Ctrl-m k : move cell up
Ctrl-m y : code cell
Ctrl-m m : markdown cell
Ctrl-m t : raw cell
Ctrl-m 1-6 : heading 1-6 cell
Ctrl-m p : select previous
Ctrl-m n : select next
Ctrl-m i : interrupt kernel
Ctrl-m . : restart kernel
Ctrl-m h : show keyboard shortcuts

  • Once you are done with the session, save the notebook and exit the app. A .ipynb file will have been saved in the folder where you started the session. You can come back at any time, pick up where you left off, re-run all or some cells, and save again. Because a .ipynb file is basically a text file containing a JSON object, it is multi-platform and easy to exchange. It is also great for sharing with friends and collaborators.
  • Finally, notebooks are also great for sharing on the internet to disseminate analyses or collaborate online. You can host the file somewhere on a server and point to it from one of the renderers for notebooks, for example the nbviewer.

Here are a couple of resources to show you how cool the notebook can get:

Numpy and Scipy

Numpy and Scipy are the foundational libraries for any kind of numeric computing in Python. Numpy offers an efficient matrix structure called the array (or ndarray, for N-dimensional array), as well as basic statistical functions that can be applied to arrays.

To whet your appetite, let's first create a simple array. You can do this from a pre-existing Python list, for example:

In [7]:
l = [1, 2, 3, 4]
a = np.array(l)
a
Out[7]:
array([1, 2, 3, 4])

At first sight, a is not very different from l; under the hood, however, it is a much more efficient structure for data manipulation (backed by C-optimized routines and other performance enhancements). Arrays contain only one data type and may have several dimensions, opening the door to very fancy matrix manipulation.
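
To get a rough sense of the efficiency claim, here is a minimal sketch comparing a pure Python sum over a list with the equivalent array method (absolute timings are machine-dependent and purely illustrative):

import time

big_l = range(1000000)       # a plain Python list
big_a = np.arange(1000000)   # the equivalent Numpy array

t0 = time.time()
sum(big_l)
print 'list sum: ', time.time() - t0, 'seconds'

t0 = time.time()
big_a.sum()
print 'array sum:', time.time() - t0, 'seconds'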

In [8]:
print type(a[0])
l += 'a'
a = np.array(l)
print type(a[0])
<type 'numpy.int64'>
<type 'numpy.string_'>

In [9]:
l = [[1, 2], [3, 4], [5, 6]]
a = np.array(l)
print 'Array a has a dimension of: ', a.shape
print a
Array a has a dimension of:  (3, 2)
[[1 2]
 [3 4]
 [5 6]]

Arrays come with several basic utilities built in as methods:

In [10]:
a.mean()
Out[10]:
3.5
In [11]:
a.min(axis=0)
Out[11]:
array([1, 2])
In [12]:
a.max(axis=1)
Out[12]:
array([2, 4, 6])

Numpy also includes random number generation:

In [13]:
r = np.random.random(4)
r
Out[13]:
array([ 0.85141035,  0.48934467,  0.82694686,  0.18799519])
In [14]:
rr = np.random.random((3, 2))
rr
Out[14]:
array([[ 0.31710468,  0.93124485],
       [ 0.39381576,  0.43668221],
       [ 0.11553762,  0.51282671]])

Numpy even supports operations between arrays, such as summation, difference, multiplication and division:

In [15]:
a = np.random.random((3, 2))
b = np.random.random((2, 3))
a, b
Out[15]:
(array([[ 0.1053234 ,  0.27330955],
       [ 0.18669486,  0.95238838],
       [ 0.3806249 ,  0.78371884]]),
 array([[ 0.88481153,  0.99623253,  0.88403853],
       [ 0.17156613,  0.56839029,  0.62555925]]))
In [16]:
# Sum (note the transpose for dimensionality alignment)
a + b.T
Out[16]:
array([[ 0.99013492,  0.44487568],
       [ 1.18292739,  1.52077867],
       [ 1.26466343,  1.40927809]])
In [17]:
# Difference (note the transpose for dimensionality alignment)
a - b.T
Out[17]:
array([[-0.77948813,  0.10174342],
       [-0.80953768,  0.38399809],
       [-0.50341363,  0.15815959]])
In [18]:
# Matrix product
c = np.dot(a, b)
c
Out[18]:
array([[ 0.14008201,  0.26027309,  0.26408126],
       [ 0.32858735,  0.7273198 ,  0.76082081],
       [ 0.4712409 ,  0.82464909,  0.82674964]])
In [19]:
# Matrix division by a scalar
c = a / 2.
c
Out[19]:
array([[ 0.0526617 ,  0.13665477],
       [ 0.09334743,  0.47619419],
       [ 0.19031245,  0.39185942]])

SciPy is the sister library of Numpy and offers a wide range of statistical functions that operate on Numpy arrays. This provides much of the functionality you find in the core packages of other statistical languages, like R (the r-base package) or Matlab.

As an example, SciPy makes it straightforward to draw random samples from standard probability distributions, such as the normal, as the sketch below shows.
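
This is a minimal sketch (the mean, standard deviation and sample size are arbitrary choices for illustration) of drawing from a normal distribution with scipy.stats:

from scipy import stats

# Draw 1,000 samples from a normal with mean 5 and standard deviation 2
sample = stats.norm.rvs(loc=5, scale=2, size=1000)
print 'sample mean:', sample.mean()
print 'sample std: ', sample.std()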

Besides the core of scipy, the project also includes additional packages called scikits that expand the main functionality in some particular way. Check out the scikits website to get a sense of what is covered.

Pandas

Pandas is a fairly new library undergoing intense and exciting development. It relies on Numpy to provide efficient data structures that ease dealing with (potentially messy) data. Its core objects are the Series and the DataFrame, which are very similar to the array except for the following main differences:

  • They allow different data types into one object.
  • They are labeled, opening up the floor for all kinds of efficient manipulation.
  • They support missing data.

Essentially, you can think of pandas as Numpy "on steroids", with a focus on real-world data. It encapsulates and wraps around much of the low-level functionality of numpy, scipy and matplotlib, exposing it to the end user in a much friendlier way. For that reason, throughout the tutorial, we will call most of the functions from pandas.
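
To make those differences concrete, here is a minimal sketch (labels and values are made up for illustration) of a labeled Series holding a missing value:

# A labeled Series with a missing value (np.nan)
s = pd.Series([1.5, 2.0, np.nan, 4.2], index=['a', 'b', 'c', 'd'])
print s
print s.isnull()  # True where data is missing
print s.mean()    # missing values are skipped automatically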

Matplotlib

Matplotlib is the main tool for static graphical display in Python, providing 2D and 3D functionality to plot data. The library may not seem very intuitive at first, but once you get over the learning curve it is extremely flexible and allows you to tweak every aspect of a figure. Because of that focus on flexibility, the defaults may not be the prettiest; with some work on them, however, Matplotlib can create beautiful figures that rival in quality those of any other static plotting library (such as R's ggplot2).
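
As a small taste of that flexibility, here is a minimal sketch (the data and styling choices are arbitrary) that builds a figure and tweaks a few of its elements by hand:

x = np.linspace(0, 2 * np.pi, 100)
fig = plt.figure(figsize=(6, 4))  # control the figure size explicitly
ax = fig.add_subplot(111)
ax.plot(x, np.sin(x), color='k', linewidth=2, label='sin(x)')
ax.set_title('A hand-tweaked figure')
ax.set_xlabel('x')
ax.legend(loc='upper right')
plt.show()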

Hands-on example

In this section, we are going to use real-world data to get a sense of the basic capabilities of the libraries just presented. We will be using a sample of Foursquare checkins collected from Twitter in Amsterdam during 2010; the data are derived from the database originally published by Chen et al. (2011); see this link for the original source and information regarding the open license. The data are stored in a comma-delimited (.csv) text file.

Read in and checks

Let's start by reading the data into memory; note that, although the file is stored in the cloud, pandas can read it without problems:

In [20]:
data_link = 'http://ubuntuone.com/4oIpVJDCpdREhzNdSRwMx8'
db = pd.read_csv(data_link)
db
Out[20]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77427 entries, 0 to 77426
Data columns:
user_id     77427  non-null values
tweet_id    77427  non-null values
lat         77427  non-null values
lon         77427  non-null values
time        77427  non-null values
text        77427  non-null values
place_id    75273  non-null values
BU_CODE     77427  non-null values
dtypes: float64(2), int64(2), object(4)

This gives us a basic summary of the properties of the dataset we just loaded. We can also take a look at the real data:

In [21]:
db.head()
Out[21]:
user_id tweet_id lat lon time text place_id BU_CODE
0 82837637 13416482426 52.360800 4.867410 2010-05-05 09:21:15 I'm at Marqt Overtoom (Overtoom, Amsterdam). h... 99cdab25eddd6bce BU03630322
1 82837637 13419413928 52.366300 4.886320 2010-05-05 10:52:24 I'm at Walem (Keizersgracht 449, Amsterdam). h... 99cdab25eddd6bce BU03630003
2 82837637 14847348078 52.333500 4.889770 2010-05-27 17:06:19 Mooi feest hier. Wie is er eigenlijk niet vand... 99cdab25eddd6bce BU03631491
3 82837637 17401139685 52.340512 4.873362 2010-06-30 08:54:35 @nienkehofkamp ik ben er! (@ World Trade Cente... 99cdab25eddd6bce BU03631459
4 82837637 17554964312 52.349976 4.918581 2010-07-02 07:26:45 Klein leger van 200 groen gele Braziliaantjes ... 99cdab25eddd6bce BU03631255
In [22]:
db.tail(7)
Out[22]:
user_id tweet_id lat lon time text place_id BU_CODE
77420 188743276 6108935840862208 52.307327 4.944605 2010-11-20 22:17:38 Nu aansluiten in de file vanaf parkeerplaats. ... 536dd7e62e0b35c3 BU03631192
77421 25689874 20442608845193216 52.330856 4.879264 2010-12-30 11:34:32 Lunch! :) @ Werelds http://fst.je/fgZklm 355790f421ee5262 BU03631490
77422 23592731 14773261764 52.372175 4.884715 2010-05-26 16:15:24 Net met Een (electrische) Tesla Roadster door ... 99cdab25eddd6bce BU03630002
77423 48758572 4256444949995521 52.359700 4.881170 2010-11-15 19:36:30 I'm at Bilderberg Hotel Jan Luyken (Jan Luijke... cbe7d5bad97ca45c BU03631347
77424 48758572 4506146060632064 52.347838 4.856291 2010-11-16 12:08:43 I'm at Mindshare Amsterdam (Karperstraat 8, Am... 1736a08fa21720bb BU03631348
77425 48758572 4506221956567040 52.347838 4.856291 2010-11-16 12:09:01 I'm at GroupM Netherlands (Karperstraat 8, Ams... 1736a08fa21720bb BU03631348
77426 85143057 9291006105096193 52.361092 4.890365 2010-11-29 17:02:03 I'm at Dierenkliniek de Wetering. http://4sq.c... 6ec13a20504b1599 BU03630007
In [23]:
db.describe()
Out[23]:
user_id tweet_id lat lon
count 7.742700e+04 7.742700e+04 77427.000000 77427.000000
mean 4.180966e+07 5.787601e+13 52.360964 4.893621
std 6.408738e+06 1.069835e+07 0.021956 0.036132
min 1.013000e+03 9.666602e+09 52.280120 4.733114
25% 1.421860e+07 1.818519e+10 52.351255 4.875392
50% 2.130657e+07 2.523498e+10 52.364679 4.892890
75% 5.049944e+07 1.359153e+15 52.375917 4.911386
max 2.277718e+08 2.805568e+16 52.429774 5.025750

Indexing and slicing

DataFrame objects support various ways of indexing and slicing. By default, rows are indexed with a range of integers:

In [24]:
db.index
Out[24]:
Int64Index([0, 1, 2, ..., 77424, 77425, 77426], dtype=int64)
In [25]:
db.ix[5: 10, ['time', 'text', 'BU_CODE']]
Out[25]:
time text BU_CODE
5 2010-07-31 13:30:58 I'm at Café Bloemers (Hemonystraat 70HS, Amste... BU03631325
6 2010-08-02 11:56:43 Altijd fijn lunchen hier! (@ Hotel Okura Amste... BU03631452
7 2010-08-02 14:37:30 Ik ga voor het mayorship! :-) (@ Café Bloemers... BU03631325
8 2010-08-16 10:25:56 @lisaportengen. Racen! :-) Ik zie je wel komen... BU03631324
9 2010-08-21 10:43:53 VicMB1, 1 x winst, 1 x gelijk. So far so good.... BU03630984
10 2010-08-27 07:01:18 Aan de slag! (@ Hotel Casa 400) http://4sq.com... BU03631255
In [26]:
# Clip checkins on a bounding box for (52.30, 52.40, 4.89, 4.9)
clipped = db[(db['lat'] > 52.30) & (db.lat < 52.40) & (db.lon > 4.89) & (db['lon'] < 4.9)]
clipped
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15113 entries, 6 to 77426
Data columns:
user_id     15113  non-null values
tweet_id    15113  non-null values
lat         15113  non-null values
lon         15113  non-null values
time        15113  non-null values
text        15113  non-null values
place_id    14708  non-null values
BU_CODE     15113  non-null values
dtypes: float64(2), int64(2), object(4)

Grouping operations

There is also very efficient functionality to group the data based on certain characteristics. Once grouped, you can apply different functions to the grouped data and collect the output in a new DataFrame. This is akin to some of the "map-reduce" operations that have become very popular in distributed systems, with the difference that everything here happens in memory (which makes it very fast). As an example, let's group the checkins by neighborhood ('BU_CODE') and count them:

In [27]:
neigh = db.groupby('BU_CODE').size()
neigh
Out[27]:
BU_CODE
BU03630000    2750
BU03630001    6620
BU03630002    2521
BU03630003    3468
BU03630004    2076
BU03630005    1434
BU03630006    2385
BU03630007    4209
BU03630008    1211
BU03630009     988
BU03630110     554
BU03630111    2019
BU03630212     241
BU03630213    1228
BU03630214    1330
BU03630215     424
BU03630216     222
BU03630317     315
BU03630318     134
BU03630319     672
BU03630320     289
BU03630321     498
BU03630322     539
BU03630431     420
BU03630432     218
BU03630433    2157
BU03630434     144
BU03630435     739
BU03630451     208
BU03630536      55
BU03630537     660
BU03630538     201
BU03630539     455
BU03630640     293
BU03630641     100
BU03630642     299
BU03630643     148
BU03630760     108
BU03630761     163
BU03630762      10
BU03630763       5
BU03630764      28
BU03630765      44
BU03630766      62
BU03630767      62
BU03630768      36
BU03630769     163
BU03630770      64
BU03630771    1199
BU03630772     136
BU03630773      41
BU03630875      54
BU03630876      76
BU03630877     151
BU03630878     125
BU03630879      23
BU03630980      38
BU03630981     501
BU03630982     369
BU03630983      72
BU03630984     355
BU03631085    1149
BU03631086     369
BU03631087     665
BU03631088     872
BU03631192    4008
BU03631193    1231
BU03631194     237
BU03631195     203
BU03631196     914
BU03631197      46
BU03631198       7
BU03631227    1191
BU03631228     535
BU03631229     895
BU03631230     380
BU03631255    1874
BU03631256    1477
BU03631257      88
BU03631258     971
BU03631324    1690
BU03631325     914
BU03631326     262
BU03631344     400
BU03631345     287
BU03631346     509
BU03631347    2811
BU03631348     897
BU03631349     428
BU03631350     190
BU03631452    1367
BU03631453     431
BU03631454     483
BU03631459     543
BU03631490    2179
BU03631491     815
Length: 96

The output is a Series because we applied a function that returns a single number per group of rows, but you can apply functions that operate by column, in which case the output will be a DataFrame (in this example, mean is applied only to those columns where it makes sense, not to those containing text):

In [28]:
neigh_several = db.groupby('BU_CODE').mean()
neigh_several
Out[28]:
user_id tweet_id lat lon
BU_CODE
BU03630000 4.089537e+07 2.896880e+15 52.373986 4.897222
BU03630001 4.001633e+07 2.494666e+15 52.375894 4.896031
BU03630002 2.975765e+07 2.635145e+15 52.374089 4.886963
BU03630003 3.360024e+07 3.197510e+15 52.366105 4.894889
BU03630004 3.696746e+07 2.597457e+15 52.371965 4.903914
BU03630005 3.213309e+07 3.016933e+15 52.384250 4.889986
BU03630006 3.500533e+07 3.837788e+15 52.375541 4.881686
BU03630007 3.076569e+07 2.026354e+15 52.362500 4.890146
BU03630008 3.885122e+07 1.834704e+15 52.364856 4.909952
BU03630009 3.275136e+07 1.761756e+15 52.372387 4.922353
BU03630110 4.583103e+07 3.024124e+15 52.403604 4.840568
BU03630111 4.553697e+07 4.043264e+15 52.390713 4.837820
BU03630212 4.653763e+07 5.552885e+15 52.393468 4.885883
BU03630213 4.207794e+07 1.911825e+15 52.388810 4.875225
BU03630214 2.604093e+07 1.475080e+15 52.383136 4.872976
BU03630215 2.596567e+07 1.881479e+15 52.384311 4.867759
BU03630216 3.409945e+07 2.481599e+15 52.377359 4.874483
BU03630317 3.083976e+07 9.677685e+14 52.371689 4.872083
BU03630318 3.270001e+07 4.452065e+15 52.369788 4.866933
BU03630319 1.256780e+07 9.603981e+15 52.366653 4.871118
BU03630320 3.580618e+07 3.725604e+15 52.364255 4.872827
BU03630321 4.327673e+07 2.162391e+15 52.360449 4.860336
BU03630322 3.306616e+07 2.397359e+15 52.362916 4.876829
BU03630431 3.621605e+07 3.166838e+15 52.364937 4.935627
BU03630432 3.705649e+07 3.989395e+15 52.365043 4.946886
BU03630433 4.343960e+07 1.383628e+15 52.375096 4.932047
BU03630434 5.536995e+07 3.429937e+15 52.366242 4.968782
BU03630435 3.080973e+07 2.029308e+15 52.357460 4.991925
BU03630451 5.232057e+07 2.948715e+15 52.351359 5.004841
BU03630536 3.945977e+07 3.452410e+15 52.388011 4.850064
BU03630537 3.519838e+07 2.477468e+15 52.382839 4.852951
BU03630538 3.493221e+07 2.787626e+15 52.378635 4.850743
BU03630539 4.352352e+07 1.309941e+15 52.378174 4.842480
BU03630640 3.914126e+07 3.803748e+15 52.371784 4.860462
BU03630641 4.282359e+07 2.833880e+15 52.372286 4.848435
BU03630642 4.807170e+07 2.098702e+15 52.369191 4.852857
BU03630643 2.546463e+07 1.109082e+15 52.362064 4.854477
BU03630760 3.908344e+07 1.564955e+15 52.392179 4.912117
BU03630761 3.431164e+07 2.454694e+15 52.386317 4.919903
BU03630762 1.010210e+08 8.829694e+15 52.391132 4.947516
BU03630763 3.912022e+07 8.352349e+15 52.394585 4.928672
BU03630764 1.165250e+08 9.981837e+13 52.389851 4.941517
BU03630765 2.503894e+07 2.076387e+15 52.407464 4.895148
BU03630766 6.563517e+07 5.165142e+15 52.423225 4.879855
BU03630767 4.283430e+07 9.493364e+14 52.415386 4.908213
BU03630768 5.704161e+07 2.083465e+15 52.395175 4.953750
BU03630769 4.664555e+07 3.435446e+15 52.401304 4.937816
BU03630770 7.281672e+07 1.315143e+15 52.407017 4.917439
BU03630771 3.789073e+07 1.218216e+15 52.396542 4.898526
BU03630772 5.448457e+07 1.724291e+15 52.384922 4.926660
BU03630773 5.508481e+07 3.657312e+15 52.394324 4.973050
BU03630875 3.594713e+07 7.794782e+15 52.387976 4.808582
BU03630876 3.349138e+07 2.399725e+15 52.381993 4.829340
BU03630877 3.273494e+07 1.659150e+15 52.375512 4.821954
BU03630878 1.904827e+07 2.327387e+15 52.378257 4.801976
BU03630879 2.440559e+07 5.435435e+15 52.381651 4.775801
BU03630980 3.414051e+07 4.780221e+15 52.367047 4.785599
BU03630981 3.648536e+07 2.713174e+15 52.360207 4.806123
BU03630982 3.127312e+07 4.149097e+15 52.353184 4.796367
BU03630983 4.603978e+07 1.870076e+15 52.357826 4.787147
BU03630984 3.940835e+07 4.014687e+14 52.351995 4.781948
BU03631085 4.430177e+07 2.849526e+15 52.359106 4.830747
BU03631086 4.779580e+07 2.986380e+15 52.366085 4.840374
BU03631087 4.184698e+07 1.974442e+15 52.350586 4.839453
BU03631088 3.869939e+07 3.006081e+15 52.342867 4.820212
BU03631192 5.403510e+07 4.860566e+15 52.307826 4.946894
BU03631193 7.972719e+07 2.138244e+15 52.317417 4.948888
BU03631194 3.217954e+07 5.126992e+15 52.320579 4.977246
BU03631195 5.940351e+07 5.772487e+15 52.310161 4.988906
BU03631196 1.181518e+08 1.638116e+16 52.301250 4.965400
BU03631197 1.696124e+07 7.058894e+14 52.294902 4.988288
BU03631198 3.380559e+07 1.352041e+15 52.311467 5.015063
BU03631227 4.789822e+07 2.182101e+15 52.356135 4.910376
BU03631228 4.173428e+07 2.744241e+15 52.360448 4.918493
BU03631229 2.320473e+07 1.259262e+15 52.363710 4.926495
BU03631230 1.030016e+08 1.930944e+15 52.354213 4.919484
BU03631255 5.286842e+07 6.314643e+15 52.348853 4.919705
BU03631256 4.094719e+07 2.910426e+15 52.350648 4.945512
BU03631257 2.836188e+07 1.531634e+15 52.343198 4.939110
BU03631258 3.463629e+07 2.752124e+15 52.336681 4.916669
BU03631324 3.998340e+07 1.891869e+15 52.356410 4.893144
BU03631325 3.055078e+07 2.104374e+15 52.353144 4.895829
BU03631326 3.681281e+07 2.426239e+15 52.352225 4.902491
BU03631344 3.373319e+07 1.908824e+15 52.351826 4.850016
BU03631345 3.490560e+07 3.600734e+15 52.352346 4.855679
BU03631346 3.347443e+07 2.801139e+15 52.354236 4.863389
BU03631347 3.838088e+07 2.181407e+15 52.358917 4.879794
BU03631348 3.336986e+07 2.238330e+15 52.345672 4.857665
BU03631349 3.858499e+07 4.708061e+15 52.349222 4.875158
BU03631350 3.337788e+07 2.267384e+15 52.354644 4.885674
BU03631452 4.030466e+07 3.022871e+15 52.343024 4.892552
BU03631453 1.084910e+08 2.076940e+16 52.348890 4.908481
BU03631454 3.481568e+07 2.950243e+15 52.341972 4.907108
BU03631459 4.564034e+07 2.574770e+15 52.340828 4.872763
BU03631490 5.062685e+07 3.270327e+15 52.335458 4.866732
BU03631491 1.002436e+08 7.937146e+15 52.330808 4.886211

This is also useful if you want to obtain simple descriptive statistics by group. For example, you might want to know a bit more about checkins per neighborhood:

In [29]:
# Proportion by neighborhood
pct = neigh * 100. / neigh.sum()

# Sort them
rk = pct.order(ascending=False)
print '--- Top 5 most popular neighborhoods ---'
print np.round(rk.head(), decimals=2)
print '--- Top 5 least popular neighborhoods ---'
print np.round(rk.tail()[::-1], decimals=2)
--- Top 5 most popular neighborhoods ---
BU_CODE
BU03630001    8.55
BU03630007    5.44
BU03631192    5.18
BU03630003    4.48
BU03631347    3.63
--- Top 5 least popular neighborhoods ---
BU_CODE
BU03630763    0.01
BU03631198    0.01
BU03630762    0.01
BU03630879    0.03
BU03630764    0.04

Date functionality

pandas includes very nice support for dates and date-based operations. To illustrate, we will use the time column. As read in, it is treated as plain text:

In [30]:
db['time'][0]
Out[30]:
'2010-05-05 09:21:15'
In [31]:
type(db['time'][0])
Out[31]:
str

However, you can convert it to a Timestamp, the pandas class for dealing with dates:

In [32]:
# This might take some time
db['t'] = db['time'].apply(pd.Timestamp)
In [33]:
ts = db['t'][0]
ts
Out[33]:
<Timestamp: 2010-05-05 09:21:15>

This gives you many handy attributes:

In [34]:
ts.hour
Out[34]:
9
In [35]:
ts.date()
Out[35]:
datetime.date(2010, 5, 5)

See all of them:

In [36]:
[i for i in dir(ts) if '__' not in i]
Out[36]:
['_get_field',
 '_repr_base',
 'asm8',
 'astimezone',
 'combine',
 'ctime',
 'date',
 'day',
 'dayofweek',
 'dayofyear',
 'dst',
 'freq',
 'freqstr',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'isocalendar',
 'isoformat',
 'isoweekday',
 'max',
 'microsecond',
 'min',
 'minute',
 'month',
 'nanosecond',
 'now',
 'offset',
 'quarter',
 'replace',
 'resolution',
 'second',
 'strftime',
 'strptime',
 'time',
 'timetuple',
 'timetz',
 'to_datetime',
 'to_period',
 'to_pydatetime',
 'today',
 'toordinal',
 'tz',
 'tz_convert',
 'tz_localize',
 'tzinfo',
 'tzname',
 'utcfromtimestamp',
 'utcnow',
 'utcoffset',
 'utctimetuple',
 'value',
 'week',
 'weekday',
 'weekofyear',
 'year']

With this, we can assign t as the index of the DataFrame, which will be useful later:

In [37]:
db = db.set_index('t')
db.head()
Out[37]:
user_id tweet_id lat lon time text place_id BU_CODE
t
2010-05-05 09:21:15 82837637 13416482426 52.360800 4.867410 2010-05-05 09:21:15 I'm at Marqt Overtoom (Overtoom, Amsterdam). h... 99cdab25eddd6bce BU03630322
2010-05-05 10:52:24 82837637 13419413928 52.366300 4.886320 2010-05-05 10:52:24 I'm at Walem (Keizersgracht 449, Amsterdam). h... 99cdab25eddd6bce BU03630003
2010-05-27 17:06:19 82837637 14847348078 52.333500 4.889770 2010-05-27 17:06:19 Mooi feest hier. Wie is er eigenlijk niet vand... 99cdab25eddd6bce BU03631491
2010-06-30 08:54:35 82837637 17401139685 52.340512 4.873362 2010-06-30 08:54:35 @nienkehofkamp ik ben er! (@ World Trade Cente... 99cdab25eddd6bce BU03631459
2010-07-02 07:26:45 82837637 17554964312 52.349976 4.918581 2010-07-02 07:26:45 Klein leger van 200 groen gele Braziliaantjes ... 99cdab25eddd6bce BU03631255

Once we have it, it is easier to group the rows by day and get the average location, for example:

In [38]:
days = db.groupby(lambda i: i.date())[['lat', 'lon']].mean()
days.head()
Out[38]:
lat lon
2010-02-26 52.362463 4.882973
2010-02-27 52.367031 4.879173
2010-02-28 52.360148 4.857665
2010-03-01 52.365379 4.877875
2010-03-02 52.364598 4.891863

Another one-liner is to get the volume of checkins by hour of the day:

In [39]:
by_hour = db.groupby(lambda i: i.hour).size()
by_hour
Out[39]:
0      778
1      463
2      225
3      159
4      375
5     1168
6     3049
7     4601
8     4463
9     3921
10    4164
11    4683
12    4284
13    4247
14    4426
15    4748
16    5349
17    5686
18    5613
19    4583
20    3743
21    3107
22    2215
23    1377

And we can do the same by hour of the day within each neighborhood:

In [40]:
by_neigh_by_hour = db.groupby(['BU_CODE', lambda i: i.hour]).size()
by_neigh_by_hour
Out[40]:
BU_CODE       
BU03630000  0      30
            1      12
            2      13
            3       9
            4       3
            5      11
            6      48
            7      90
            8     117
            9     111
            10    136
            11    155
            12    163
            13    149
            14    195
...
BU03631491  9     39
            10    43
            11    39
            12    41
            13    25
            14    46
            15    45
            16    65
            17    45
            18    37
            19    27
            20    37
            21    48
            22    58
            23    24
Length: 1986

This creates a Series object but with a hierarchical index that contains two levels: one for the neighborhood and one for the hour of the day. We can quickly reshape this into a DataFrame:

In [41]:
bnbh_df = by_neigh_by_hour.unstack()
# Prefix every column with 'h' to make it clear it is the hour of the day
bnbh_df = bnbh_df.rename(columns=lambda x: 'h'+str(x))
bnbh_df
Out[41]:
<class 'pandas.core.frame.DataFrame'>
Index: 96 entries, BU03630000 to BU03631491
Data columns:
h0     72  non-null values
h1     59  non-null values
h2     53  non-null values
h3     43  non-null values
h4     59  non-null values
h5     80  non-null values
h6     90  non-null values
h7     93  non-null values
h8     93  non-null values
h9     91  non-null values
h10    93  non-null values
h11    94  non-null values
h12    93  non-null values
h13    92  non-null values
h14    94  non-null values
h15    90  non-null values
h16    91  non-null values
h17    90  non-null values
h18    89  non-null values
h19    89  non-null values
h20    86  non-null values
h21    85  non-null values
h22    88  non-null values
h23    79  non-null values
dtypes: float64(24)

Plotting

Another nice aspect of pandas is that it leverages the power of matplotlib while considerably lowering the barrier to entry. It includes a few canned graphics that are one-liners and very helpful when interacting with your data. Let's walk through the main four: bar plots, line plots, histograms and density plots.

To see the bar plots, let's explore the distribution of checkins by neighborhood:

In [42]:
neigh.order(ascending=False).plot(kind='bar')
# Note the notebook allows you to manually resize the figure if it is too small
Out[42]:
<matplotlib.axes.AxesSubplot at 0x51088d0>

Another useful simple plot is using lines instead of bars. In this case, let's examine the evolution of checkins by hour of the day:

In [43]:
by_hour.plot()
plt.title('Checkins by hour of the day')
Out[43]:
<matplotlib.text.Text at 0x6fb6910>

We can also look at the distribution of checkins over the whole period. To do that, we can pull the value of each Timestamp and plot a histogram:

In [44]:
db['t'] = db.index
deltas = db['t'].apply(lambda x: pd.Timestamp(x).value)
deltas.hist(bins=50)
Out[44]:
<matplotlib.axes.AxesSubplot at 0x5f1ae50>

Or a kernel density estimate, which is a more elegant, continuous version of the histogram:

In [45]:
(deltas * 1.).plot(kind='kde')
Out[45]:
<matplotlib.axes.AxesSubplot at 0x72b7350>

Finally, a useful graph that is not included in pandas but is very simple to produce with matplotlib is the scatter plot. In this case, let's plot the spatial distribution of the checkins:

In [46]:
plt.scatter(db['lon'].values, db['lat'].values)
Out[46]:
<matplotlib.collections.PathCollection at 0x82be890>