Original blog post at: http://kldavenport.com
import pandas as pd
import numpy as np
import scipy as sp
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsaplots
import matplotlib.pyplot as plt  # plt.rcParams and plt.plot are used below
from scipy import stats
import seaborn as sns
from pylab import rcParams
Let's inspect the composition of our source data.
%%bash
head -n 3 data/historic_euro_hrs_worked.csv
,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Belgium,1896,1737,1733,1757,1763,1769,1751,1709,1725,1738,1752,1719,1765,1768,1767,1765
Bulgaria,N.a.,N.a.,N.a.,2005,2006,1964,2007,1997,2000,1989,1978,1913,1976,1972,1955,1954
The original document specifies missing or NA values as 'N.a.'. Rather than replacing these values with df.replace() after loading the dataframe, read_csv provides the na_values parameter for converting specific strings to true NaN values on load.
df = pd.read_csv('data/historic_euro_hrs_worked.csv', na_values='N.a.', index_col=0)
df.head(5)
1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Belgium | 1896 | 1737 | 1733 | 1757 | 1763 | 1769 | 1751 | 1709 | 1725 | 1738 | 1752 | 1719 | 1765 | 1768 | 1767 | 1765 |
Bulgaria | NaN | NaN | NaN | 2005 | 2006 | 1964 | 2007 | 1997 | 2000 | 1989 | 1978 | 1913 | 1976 | 1972 | 1955 | 1954 |
Czech Republic | 2124 | 2157 | 2128 | 2029 | 1984 | 1971 | 2029 | 1959 | 1961 | 1954 | 1967 | 1926 | 1936 | 1912 | 1898 | 1881 |
Denmark | 1646 | 1783 | 1666 | 1709 | 1776 | 1713 | 1710 | 1719 | 1702 | 1708 | 1711 | 1692 | 1730 | 1750 | 1743 | 1744 |
Germany | 2026 | 2028 | 2017 | 2023 | 2026 | 2003 | 2003 | 1953 | 1974 | 1959 | 1945 | 1879 | 1904 | 1886 | 1863 | 1847 |
Since version 0.15, Pandas has included memory usage in df.info(). We can specify verbose=False if we don't need the per-column details.
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, Belgium to UK
Columns: 16 entries, 1998 to 2013
dtypes: float64(4), int64(11), object(1)
memory usage: 3.7+ KB
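Newer Pandas versions also expose this accounting directly through DataFrame.memory_usage() (not available in every older release), which gives a per-column byte count:
# Per-column memory in bytes; summing it reproduces the total reported by df.info()
df.memory_usage(index=True)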
Now let's take a look at some summary statistics with the describe() method.
df.describe()
1998 | 1999 | 2000 | 2001 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 24.000000 | 24.000000 | 26.000000 | 27.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 |
mean | 1969.583333 | 1979.375000 | 1962.461538 | 1938.370370 | 1926.178571 | 1934.285714 | 1898.857143 | 1892.321429 | 1891.357143 | 1882.035714 | 1854.857143 | 1874.107143 | 1874.892857 | 1868.107143 | 1862.571429 |
std | 131.329627 | 124.765802 | 136.295042 | 133.832889 | 139.376669 | 136.029370 | 136.960312 | 131.068841 | 127.115641 | 122.631783 | 117.888433 | 109.010478 | 107.642185 | 106.880459 | 105.851198 |
min | 1646.000000 | 1716.000000 | 1666.000000 | 1657.000000 | 1643.000000 | 1685.000000 | 1661.000000 | 1660.000000 | 1659.000000 | 1672.000000 | 1632.000000 | 1670.000000 | 1667.000000 | 1658.000000 | 1648.000000 |
25% | 1911.000000 | 1939.250000 | 1922.750000 | 1871.500000 | 1857.750000 | 1860.500000 | 1797.500000 | 1809.250000 | 1801.500000 | 1797.000000 | 1784.250000 | 1797.750000 | 1806.000000 | 1801.750000 | 1809.500000 |
50% | 1978.500000 | 1989.500000 | 1973.000000 | 1965.000000 | 1952.500000 | 1953.500000 | 1898.500000 | 1898.500000 | 1893.000000 | 1897.500000 | 1869.500000 | 1890.500000 | 1879.000000 | 1878.000000 | 1864.000000 |
75% | 2046.000000 | 2056.000000 | 2032.250000 | 2017.500000 | 2016.000000 | 2041.000000 | 2009.000000 | 1997.000000 | 1989.250000 | 1980.500000 | 1931.250000 | 1952.250000 | 1962.750000 | 1955.750000 | 1937.000000 |
max | 2195.000000 | 2199.000000 | 2217.000000 | 2198.000000 | 2189.000000 | 2154.000000 | 2144.000000 | 2114.000000 | 2121.000000 | 2112.000000 | 2091.000000 | 2095.000000 | 2103.000000 | 2103.000000 | 2099.000000 |
At first glance it looks like the average time worked per year is decreasing over time. Let's examine how the aggregate mean for the EU has changed over time.
df.describe().ix[1].plot(figsize=(10, 4), linewidth=3, alpha =.8)
<matplotlib.axes._subplots.AxesSubplot at 0x11546ed90>
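The same series can be computed without going through describe(); df.mean() averages over the country rows for each year, so something like this should produce an equivalent plot:
# Per-year mean across countries (NaNs skipped), equivalent to df.describe().ix[1]
df.mean().plot(figsize=(10, 4), linewidth=3, alpha=.8)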
Since we're working with time series data we might prefer to transform the data from wide format to long. We'll want the dates to be the index for the given observations, so we can simply transpose the dataframe with df.T. In line with Pandas' excellent syntax, this instance method is similar to numpy.ndarray.T, which is the same as self.transpose().
df = df.T
df = df.convert_objects(convert_numeric=True)
df.head(5)
Belgium | Bulgaria | Czech Republic | Denmark | Germany | Estonia | Ireland | Greece | Spain | France | ... | The Netherlands | Austria | Poland | Portugal | Romania | Slovania | Slovak Republic | Finland | Sweden | UK | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1998 | 1896 | NaN | 2124 | 1646 | 2026 | 2055 | 1981 | 2021 | 1935 | 1915 | ... | 1875 | 1952 | 2177 | 1976 | 2104 | 1899 | 2059 | 1772 | 1721 | 1965 |
1999 | 1737 | NaN | 2157 | 1783 | 2028 | 2026 | 1979 | 2056 | 2001 | 1950 | ... | 1863 | 1986 | 2152 | 2012 | 2099 | 1954 | 2056 | 1716 | 1842 | 1960 |
2000 | 1733 | NaN | 2128 | 1666 | 2017 | 1974 | 1972 | 2065 | 1940 | 1864 | ... | 1846 | 1980 | 2033 | 1954 | 2143 | 1918 | 2053 | 1675 | 1839 | 1940 |
2001 | 1757 | 2005 | 2029 | 1709 | 2023 | 2005 | 1965 | 2043 | 1934 | 1870 | ... | 1861 | 1991 | 2012 | 1960 | 2148 | 1873 | 2052 | 1657 | 1674 | 1944 |
2002 | 1763 | 2006 | 1984 | 1776 | 2026 | 2058 | 1938 | 2059 | 1952 | 1842 | ... | 1756 | 1982 | 2040 | 1936 | 2167 | 1906 | 2036 | 1686 | 1693 | 1939 |
5 rows × 28 columns
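A quick aside: convert_objects() was later deprecated and removed from Pandas, so if you're following along on a newer version, something like the following should achieve the same coercion:
# Coerce every column to numeric, turning unparseable strings into NaN
df = df.apply(pd.to_numeric, errors='coerce')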
Let's take a look at Germany, UK, and France vs the PIIGS. Seaborn makes it easy to set palettes from ColorBrewer. Note that I append a standard matplotlib method to Pandas' plot() helper function to set the legend size below.
sns.set_palette("Set2", 10, 1)
sns.palplot(sns.color_palette("Set2", 10,1))
guf_vs_piigs = ['Germany', 'UK', 'France','Portugal', 'Italy', 'Ireland', 'Greece', 'Spain']
df.ix[:, guf_vs_piigs].plot(figsize=(14, 10),
linewidth=3,
alpha =.8,
title = "Germany, UK, and France vs the PIIGS Annual Average Hours Worked per Employee").legend(loc=3,prop={'size':15})
<matplotlib.legend.Legend at 0x1156b8050>
Note that the decline in hours worked decelerates around 2005 for all countries except Germany.
Now what if we want to examine all the time series plots at once to quickly spot the countries with the most drastic change? Note the assignment to suppress in order to suppress the matplotlib inline rendering messages (e.g. <matplotlib.axes._subplots.AxesSubplot at 0x10bb579d0>).
suppress = df.plot(subplots=True, layout=(8, 4), figsize=(16, 16), sharex=False, alpha=.7)
It might be interesting to look at a correlation matrix to reveal relationships between the countries.
plt.rcParams['figure.figsize'] = 15, 15
suppress = sns.corrplot(df, annot=False, sig_stars=False,
                        diag_names=False, cmap="coolwarm")
The correlation matrix indicates that Denmark is going against the trend, as it rarely correlates with other countries (lots of blue), just as the line plot indicated above. Another interesting summary statistic might be the sum of year-over-year percentage changes per country, which indicates which country had the most drastic change to date. Luckily Pandas has an instance method for that.
pct_change_sum = df.pct_change().sum()
pct_change_sum.sort() # inplace method default
print 'Absolute Mean Percentage Change: ' + str(pct_change_sum.abs().mean())
print ''
print pct_change_sum
Absolute Mean Percentage Change: 0.0603916775238

France             -0.137442
Latvia             -0.124114
Czech Republic     -0.118457
Poland             -0.111631
Germany            -0.091152
Luxembourg         -0.087165
Ireland            -0.080509
Italy              -0.074149
Portugal           -0.072338
Malta              -0.071814
Austria            -0.070783
Finland            -0.070174
Belgium            -0.066778
Spain              -0.065156
Slovak Republic    -0.061676
Lithuania          -0.060430
Estonia            -0.057099
Hungary            -0.035527
UK                 -0.033165
The Netherlands    -0.029818
Slovania           -0.029289
Croatia            -0.025596
Bulgaria           -0.024140
Sweden             -0.012458
Cyprus             -0.007883
Greece             -0.004757
Romania            -0.001922
Denmark             0.065546
dtype: float64
Compared to the mean absolute percentage change of 0.06, France is clearly leading the EU in reducing hours worked from 1998 to 2013. Conversely, Denmark stands alone in increasing hours worked.
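A rough cross-check is the overall relative change from the first to the last year, which doesn't depend on summing the year-over-year steps (countries missing 1998 data come out as NaN):
# Relative change from 1998 to 2013 per country
overall_change = df.iloc[-1] / df.iloc[0] - 1
overall_change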
Let's explore the relationship between time and hours worked using France's data and linear regression. If you're unfamiliar with the concepts behind linear regression check out this link. This is meant to be a short intro, so we won't explore how to tune a fit or make better approximations of the underlying distribution of the data.
df.France
1998    1915
1999    1950
2000    1864
2001    1870
2002    1842
2003    1712
2004    1708
2005    1661
2006    1660
2007    1659
2008    1672
2009    1640
2010    1679
2011    1682
2012    1681
2013    1661
Name: France, dtype: float64
known_x = np.arange(len(df.France)) # Create the index
known_y = df.France # Return a pandas series from a df object
# Find the polynomial coefficients
linear_coef = np.polyfit(known_x, known_y, 1)
# Pass polynomial coefficients to poly1d convenience function (highest power 1st)
linear_poly = np.poly1d(linear_coef)
NumPy's poly1d function allows you to pass a future unknown time index, or whatever type of x value you fit on, and returns the corresponding y value. For example we fit on known_x = np.arange(len(df.France)), which contained 16 elements (1998-2013, i.e. indices 0 through 15). Below we use our linear_poly to see what the annual average hours worked per employee might look like in 2021 (index 23):
linear_poly(23)
1449.4632352941185
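poly1d objects also accept arrays, so we can sketch a whole run of future points at once; indices 16 through 23 correspond to 2014 through 2021:
# Evaluate the fitted line over the next eight indices and label them with their years
future_idx = np.arange(16, 24)
pd.Series(linear_poly(future_idx), index=future_idx + 1998)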
We can also construct a quadratic fit with the same methods, except we pass 2 to denote the degree.
a,b = linear_poly.coeffs # slope, intercept
quadratic_coef = np.polyfit(known_x, known_y, 2)
quadratic_poly = np.poly1d(quadratic_coef)
Below we plot our simple linear regression and quadratic models:
known_x_labels = np.array(list(df.index)) # For x-axis labeling
plt.rcParams['figure.figsize'] = 9, 5
plt.scatter(known_x_labels, known_y, label = 'Original Data')
plt.plot(known_x_labels, known_y, alpha=.3)  # connect the original points; legend label comes from the scatter above
plt.plot(known_x_labels, linear_poly(known_x), 'r', alpha=.8, label='Linear Regression') # y = a + bx or y = mx + b
plt.plot(known_x_labels, quadratic_poly(known_x), '-', label='Quadratic Fit') # y = ax^2 + bx + c
plt.title('France Average Annual Hours Worked per Employee')
plt.legend()
print 'Linear Regression'
print 'slope: %s' % a, 'intercept: %s' % b
Linear Regression
slope: -18.8088235294 intercept: 1882.06617647
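Since we imported scipy.stats at the top, a rough equivalent that also reports goodness-of-fit diagnostics is linregress:
# Ordinary least squares fit of hours worked against the time index
slope, intercept, r_value, p_value, std_err = stats.linregress(known_x, known_y)
slope, intercept, r_value ** 2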
When we have abundant data we might not think twice about discarding incomplete or missing observations for any given index. Depending on the work at hand and the size of the dataset, the practice of discarding NaNs may introduce some unforeseen bias. We might decouple some trend or miss an element of cyclicality or seasonality. In essence, imputation allows us to preserve all cases by replacing missing data with a probable value based on other available information. The practice of imputation goes much deeper when you start using simulations and probabilistic models.
We've been working with a complete and vetted dataset so far. What should we do when the time-series data we're working with is incomplete? Let's say the governing body of this study was unable to collect data for some set of years. Here is the current complete set:
df.France
1998    1915
1999    1950
2000    1864
2001    1870
2002    1842
2003    1712
2004    1708
2005    1661
2006    1660
2007    1659
2008    1672
2009    1640
2010    1679
2011    1682
2012    1681
2013    1661
Name: France, dtype: float64
And here is the data with the 2004, 2010, and 2011 values missing (NaN):
France_incomplete = df.France.copy()
France_incomplete[['2004','2010','2011']] = np.NaN
France_incomplete
1998    1915
1999    1950
2000    1864
2001    1870
2002    1842
2003    1712
2004     NaN
2005    1661
2006    1660
2007    1659
2008    1672
2009    1640
2010     NaN
2011     NaN
2012    1681
2013    1661
Name: France, dtype: float64
With imputation, we have a large array of methods to employ depending on the requirements of the analysis we intend to do with the data. Are we cleaning up the data for plotting purposes to demonstrate a simple point, or are we planning to use the data to describe the directionality of an equity?
Since this is one-dimensional data we could naively fill these values with the mean of the entire column. This might work okay for a larger set of time-series data with no clear trend and intermittent dispersed NaNs.
For example:
France_incomplete.fillna(value = France_incomplete.mean())
1998    1915.000000
1999    1950.000000
2000    1864.000000
2001    1870.000000
2002    1842.000000
2003    1712.000000
2004    1752.846154
2005    1661.000000
2006    1660.000000
2007    1659.000000
2008    1672.000000
2009    1640.000000
2010    1752.846154
2011    1752.846154
2012    1681.000000
2013    1661.000000
Name: France, dtype: float64
This might be a poor method here given that two of the three missing values are adjacent to each other, so we could flatten the momentum that the series had. What if we carry forward the last known value using the forward-fill method in Pandas?
plt.rcParams['figure.figsize'] = 7, 3.5
# Original data
df.France.plot(color = 'black')
# Using the mean
France_incomplete.fillna(value = France_incomplete.mean()).plot(color = 'orange',alpha = .5)
# Filling forward
France_incomplete.fillna(method= 'ffill').plot(color = 'dodgerblue',alpha = .5)
<matplotlib.axes._subplots.AxesSubplot at 0x10e864c90>
In this simple case we see that forward-filling more closely resembles the original complete data. fillna also provides {'backfill', 'bfill', 'pad', 'ffill'} as method options. Check out the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
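Interpolation is another option that sits between a flat mean fill and a forward fill; a quick sketch on the same incomplete series:
# Linearly interpolate the missing 2004, 2010, and 2011 values from their neighbours
France_incomplete.interpolate(method='linear')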
Trend extrapolation is a forecasting technique which uses methods such as exponential smoothing or moving averages to project the future pattern of a time series. Forecasting is very hard, and at times you have to walk a fine line between overfitting some set of data with a higher-order polynomial and producing non-practical (or nonsensical) results with ARIMA.
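As a small illustration of the moving-average idea, a three-year centered rolling mean smooths the France series (pd.rolling_mean was the API at the time of writing; newer Pandas versions use df.France.rolling(window=3, center=True).mean()):
# Three-year centered moving average of France's annual hours worked
pd.rolling_mean(df.France, window=3, center=True)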
Let's go over some practical and easy-to-implement forecasting methods below, using the entire EU dataset and extrapolating it out to 2020 with basic linear regression.
colnames = [colname for colname in df.columns]
# 1998 - 2013
years_into_future = 7
known_x = np.arange(df.shape[0]) # Current time interval
new_x = np.arange(len(known_x), len(known_x) + years_into_future, 1) # 7 years into future
concat_x = np.concatenate((known_x, new_x),axis=0)
out = np.empty(shape=(concat_x.shape[0], len(colnames)), dtype=float) # Empty faster than zero
# enumerate() yields an (index, column name) tuple for each column
for i, colname in enumerate(colnames):
    known_y = df.iloc[:, i]
    known_y = known_y.fillna(value=known_y.mean())
    linear_coef = np.polyfit(known_x, known_y, 1)
    linear_poly = np.poly1d(linear_coef)
    new_y = linear_poly(new_x)
    # new_y = new_y[::-1] # Reverse projections
    concat_y = np.concatenate((known_y, new_y), axis=0)
    out[:, i] = concat_y
extended_df = pd.DataFrame(out)
extended_df.columns = colnames
extended_df.index = pd.date_range(start='1998', end='2021', freq='A')
suppress = extended_df.plot(subplots=True, layout=(7, 4), figsize=(16, 16), alpha=.6, sharex=False)
Above we see the simple straight-line projection to 2020. Again, we can change the polynomial degree passed to polyfit above or replace the new-value generator with something more robust.
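For example, reusing new_x from above, France's quadratic fit from earlier can be pushed over the same seven future indices; swapping the 1 for a 2 inside the loop would do this for every country:
# Degree-2 projection for France over the 2014-2020 indices
quadratic_poly(new_x)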
What if we have some domain knowledge that won't be reflected in the current data or trend? For example, what if I have insider knowledge of a bizarre policy that will be enacted by the EU to increase annual hours worked by employees X% by 2020? What might that look like?
I can anchor a point out in the future (2020) based on some simple descriptive statistic, like the max value of the current data set plus one standard deviation, then interpolate all the missing values between 2013 and 2020. An economist would use a more sophisticated growth model of course. I can use one of SciPy's many interpolation methods, which are conveniently called from Pandas, to fill the empty range in a smoother and more "natural" way, as opposed to a straight linear regression line.
colnames = [colname for colname in df.columns]
# 1998 - 2013
years_into_future = 7
known_x = np.arange(df.shape[0]) # Current time interval
new_x = np.arange(len(known_x), len(known_x) + years_into_future, 1) # 7 years into future
concat_x = np.concatenate((known_x, new_x),axis=0)
out = np.empty(shape=(concat_x.shape[0], len(colnames)), dtype=float) # Empty faster than zero
# enumerate() yields an (index, column name) tuple for each column
for i, colname in enumerate(colnames):
    known_y = df.iloc[:, i]
    known_y = known_y.fillna(value=known_y.mean())
    new_y = np.zeros(shape=(len(new_x),))
    new_y[-1] = np.max(known_y) + (np.std(known_y) * 1)  # anchor point: max + 1 std
    concat_y = np.concatenate((known_y, new_y), axis=0)
    out[:, i] = concat_y
extended_df = pd.DataFrame(out)
extended_df.columns = colnames
extended_df.index = pd.date_range(start='1998', end='2021', freq='A')
# All the options are described here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
# {'linear', 'time', 'index', 'values', 'nearest', 'zero',
#  'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'pchip'}
extended_df.replace(0, np.nan, inplace=True)
# Note: the 'polynomial' and 'spline' methods also require an order argument
extended_df.interpolate(method='pchip', axis=0, limit=None, inplace=True, downcast=None)
suppress = extended_df.plot(subplots=True, layout=(7, 4), figsize=(16, 16), alpha=.6, sharex=False)
Here's another fun view:
extended_df.plot(figsize=(16, 9), alpha =.6)
<matplotlib.axes._subplots.AxesSubplot at 0x11e4bc150>