Prediction Using Different Machine Learning Methods¶

Support Vector Regression and Cross Validation¶

In this notebook, we'll train a Support Vector Regression(SVR) model for predicting building energy consumption based on historical energy data, several weather variables, hour of the day, day of the week, weekends and holidays.

To do this, we'll fit the model to daily and hourly energy and weather data from 2012-01-01 to 2014-10-31 and compute the average squared residuals from predictions.

During the design time, we've used cross-validation to fine tune the SVR parameters. And since SVR take too much time to compute, in this final notebook we'll set the parameters to their optimal value that were found with cross validation. We'll still show the range of parameters that was provided as input to cross validation.

In [1]:

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation
from sklearn import grid_search

pd.options.display.mpl_style = 'default'

In [2]:

dailyElectricity = pd.read_excel('Data/dailyElectricityWithFeatures.xlsx')
dailyElectricity = dailyElectricity.drop('startDay', 1).drop('endDay', 1)

dailyChilledWater = pd.read_excel('Data/dailyChilledWaterWithFeatures.xlsx')
dailyChilledWater = dailyChilledWater.drop('startDay', 1).drop('endDay', 1)

dailySteam = pd.read_excel('Data/dailySteamWithFeatures.xlsx')
dailySteam = dailySteam.drop('startDay', 1).drop('endDay', 1)

hourlyElectricity = pd.read_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricity = hourlyElectricity.drop('startTime', 1).drop('endTime', 1)

hourlyChilledWater = pd.read_excel('Data/hourlyChilledWaterWithFeatures.xlsx')
hourlyChilledWater = hourlyChilledWater.drop('startTime', 1).drop('endTime', 1)

hourlySteam = pd.read_excel('Data/hourlySteamWithFeatures.xlsx')
hourlySteam = hourlySteam.drop('startTime', 1).drop('endTime', 1)

#display one dataframe
dailyElectricity.head()

Out[2]:

	electricity-kWh	RH-%	T-C	Tdew-C	pressure-mbar	solarRadiation-W/m2	windDirection	windSpeed-m/s	humidityRatio-kg/kg	coolingDegrees	heatingDegrees	occupancy
2012-01-01	2800.244977	76.652174	7.173913	3.073913	1004.956522	95.260870	236.086957	4.118361	0.004796	0.086957	7.826087	0.0
2012-01-02	3168.974047	55.958333	5.833333	-2.937500	994.625000	87.333333	253.750000	5.914357	0.003415	0.000000	9.166667	0.3
2012-01-03	5194.533376	42.500000	-3.208333	-12.975000	1002.125000	95.708333	302.916667	6.250005	0.001327	0.000000	18.208333	0.3
2012-01-04	5354.861935	41.541667	-7.083333	-16.958333	1008.250000	98.750000	286.666667	5.127319	0.000890	0.000000	22.083333	0.3
2012-01-05	5496.223993	46.916667	-0.583333	-9.866667	1002.041667	90.750000	258.333333	5.162041	0.001746	0.000000	15.583333	0.3

Daily Prediction¶

Adding new features to the dataframe: weekday, day of the year and week of the year.

In [37]:

def addDailyTimeFeatures(df):
    df['weekday'] = df.index.weekday
    df['day'] = df.index.dayofyear
    df['week'] = df.index.weekofyear
    return df
    
dailyElectricity = addDailyTimeFeatures(dailyElectricity)
dailyChilledWater = addDailyTimeFeatures(dailyChilledWater)
dailySteam = addDailyTimeFeatures(dailySteam)

Daily Electricity Prediction¶

In [38]:

df_elect = dailyElectricity[['weekday', 'day', 'week', 'occupancy', 'electricity-kWh']]

elect_train = pd.DataFrame(data=df_elect, index=np.arange('2012-01', '2013-07', dtype='datetime64[D]')).dropna()
elect_test = pd.DataFrame(data=df_elect, index=np.arange('2013-07', '2014-11', dtype='datetime64[D]')).dropna()

XX_elect_train = elect_train.drop('electricity-kWh', axis = 1).reset_index().drop('index', axis = 1)
XX_elect_test = elect_test.drop('electricity-kWh', axis = 1).reset_index().drop('index', axis = 1)

YY_elect_train = elect_train['electricity-kWh']
YY_elect_test = elect_test['electricity-kWh']

In [45]:

# Input parameters ranges to cross validation 
gamma_range = [0.01, 0.001, 0.0001]
epsilon_range = [x * 0.1 for x in range(0, 2)]
C_range = range(1, 2500, 500)

# To speed up, we first find the optimal paratemers for C and gamma and then set them directly in the method call
tuned_parameters = [{
    'kernel': ['rbf', 'linear'],
#    'C': C_range,
#    'gamma': gamma_range,
    'epsilon': epsilon_range}]

# search for the best parameters with crossvalidation.
svr_elect = GridSearchCV(SVR(C=2000, gamma=0.01), param_grid = tuned_parameters, verbose = 0)

# Fit regression model
y_elect = svr_elect.fit(XX_elect_train, YY_elect_train).predict(XX_elect_test)

print 'Optimum parameters C=2000 and gamma=0.01 for SVR'
print 'Optimum parameters epsilon and kernel for SVR: ', svr_elect.best_params_

print "The test score R2 for SVR: ", svr_elect.score(XX_elect_test, YY_elect_test)

print("SVR mean squared error: %.2f"
      % np.mean((YY_elect_test - svr_elect.predict(XX_elect_test)) ** 2))

Optimum parameters C=2000 and gamma=0.01 for SVR
Optimum parameters epsilon and kernel for SVR:  {'epsilon': 0.1, 'kernel': 'rbf'}
The test score R2 for RBF:  0.691753544427
RBF mean squared error: 391582.21

In [46]:

#Plot time series of observed and predicted electricity demand over the testing period.
fig = plt.figure(figsize=(15,7))
plt.scatter(XX_elect_test.index, YY_elect_test, c='k', label='Observed')
plt.plot(XX_elect_test.index, y_elect, c='r', label='Predicted')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

In [47]:

#Plot Observed vs. predicted usage.
fig = plt.figure(figsize=(6,6))
plt.scatter(YY_elect_test, YY_elect_test, c='k')
plt.scatter(YY_elect_test, y_elect, c='r')
plt.xlabel('Observed Elec. Usage (kWh)')
plt.ylabel("Predicted Elec. Usage (kWh): $\hat{Y}_i$")
plt.title("Energy vs Predicted Energy: $Y_i$ vs $\hat{Y}_i$")

Out[47]:

<matplotlib.text.Text at 0x1b5679e8>

Daily Chilled Water Prediction¶

In [48]:

chilledw_train = pd.DataFrame(data=dailyChilledWater, index=np.arange('2012-01', '2013-07', dtype='datetime64[D]')).dropna()
chilledw_test = pd.DataFrame(data=dailyChilledWater, index=np.arange('2013-07', '2014-11', dtype='datetime64[D]')).dropna()

XX_chilledw_train = chilledw_train.drop('chilledWater-TonDays', axis = 1).reset_index().drop('index', axis = 1)
XX_chilledw_test = chilledw_test.drop('chilledWater-TonDays', axis = 1).reset_index().drop('index', axis = 1)

YY_chilledw_train = chilledw_train['chilledWater-TonDays']
YY_chilledw_test = chilledw_test['chilledWater-TonDays']

In [49]:

# Optimal parameters for the SVR regressor
gamma_range = [0.1, 0.01, 0.001, 0.0001]
epsilon_range = [x * 0.1 for x in range(0, 3)]
C_range = range(1, 5, 2)

# To speed up, we first find the optimal paratemers for C and gamma and then set them directly in the method call
tuned_parameters = [{
    'kernel': ['rbf', 'linear'],
#    'C': C_range,
#    'gamma': gamma_range,
    'epsilon': epsilon_range}]

# search for the best parameters with crossvalidation.
svr_chilledw = GridSearchCV(SVR(C=3, gamma=0.0001), param_grid = tuned_parameters, verbose = 0)

# Fit regression model
y_chilledw = svr_chilledw.fit(XX_chilledw_train, YY_chilledw_train).predict(XX_chilledw_test)

print 'Optimum parameters C=3 and gamma=0.1 for SVR'
print 'Optimum epsilon and kernel: ', svr_chilledw.best_params_

print "The test score R2 for SVR: ", svr_chilledw.score(XX_chilledw_test, YY_chilledw_test)

print("SVR mean squared error: %.2f"
      % np.mean((YY_chilledw_test - svr_chilledw.predict(XX_chilledw_test)) ** 2))

Optimum parameters C=3 and gamma=0.1 for SVR
Optimum epsilon and kernel:  {'epsilon': 0.1, 'kernel': 'linear'}
The test score R2 for SVR:  0.764904375769
SVR mean squared error: 443.92

In [50]:

#Plot observed and predicted Chilled Water values with SVR
fig = plt.figure(figsize=(15,10))
plt.scatter(XX_chilledw_test.index, YY_chilledw_test, c='k', label='Observed')
plt.plot(XX_chilledw_test.index, y_chilledw, c='g', label='Predicted')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

In [51]:

#Plot observed vs predicted energy usage
fig = plt.figure(figsize=(6,6))
plt.scatter(YY_chilledw_test, YY_chilledw_test, c='k')
plt.scatter(YY_chilledw_test, y_chilledw, c='g')
plt.xlabel('Observed Chilled Water Usage (TonDays)')
plt.ylabel("Predicted Chilled Water Usage (TonDays): $\hat{Y}_i$")
plt.title("Observed vs Predicted Energy: $Y_i$ vs $\hat{Y}_i$")

Out[51]:

<matplotlib.text.Text at 0x1b9e7588>

Daily Steam Prediction¶

In [52]:

steam_train = pd.DataFrame(data=dailySteam, index=np.arange('2012-01', '2013-07', dtype='datetime64[D]')).dropna()
steam_test = pd.DataFrame(data=dailySteam, index=np.arange('2013-07', '2014-11', dtype='datetime64[D]')).dropna()

XX_steam_train = steam_train.drop('steam-LBS', axis = 1).reset_index().drop('index', axis = 1)
XX_steam_test = steam_test.drop('steam-LBS', axis = 1).reset_index().drop('index', axis = 1)

YY_steam_train = steam_train['steam-LBS']
YY_steam_test = steam_test['steam-LBS']

In [59]:

# Optimal parameters for the SVR regressor
gamma_range = [0.1, 0.01, 0.001, 0.0001]
epsilon_range = [x * 0.1 for x in range(0, 3)]
C_range = range(1, 500, 50)

tuned_parameters = [{
    'kernel': ['rbf', 'linear'],
#    'C': C_range,
    'gamma': gamma_range,
    'epsilon': epsilon_range}]

# search for the best parameters with crossvalidation.
svr_steam = GridSearchCV(SVR(C=50), param_grid = tuned_parameters, verbose = 0)

# Fit regression model
y_steam = svr_steam.fit(XX_steam_train, YY_steam_train).predict(XX_steam_test)

print 'Optimum parameters C=50 for SVR'
print 'Optimum epsilon, gamma and kernel: ', svr_steam.best_params_

print "The test score R2 for SVR: ", svr_steam.score(XX_steam_test, YY_steam_test)

print("SVR mean squared error: %.2f"
      % np.mean((YY_steam_test - svr_steam.predict(XX_steam_test)) ** 2))

Optimum parameters C=50 for SVR
Optimum epsilon, gamma and kernel:  {'epsilon': 0.2, 'gamma': 0.1, 'kernel': 'linear'}
The test score R2 for SVR:  0.938467924325
SVR mean squared error: 20632451.77

In [60]:

#Plot observed and predicted Steam values 
fig,ax = plt.subplots(1, 1,figsize=(20,10))
plt.scatter(XX_steam_test.index, YY_steam_test, c='k', label='Observed')
plt.plot(XX_steam_test.index, y_steam, c='r', label='Predicted')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

In [61]:

#Plot Observed vs Predicted 
fig = plt.figure(figsize=(6,6))
plt.scatter(YY_steam_test, YY_steam_test, c='k')
plt.scatter(YY_steam_test, y_steam, c='r')
plt.xlabel('Observed Steam Usage (LBS)')
plt.ylabel("Predicted Steam Usage (LBS): $\hat{Y}_i$")
plt.title("Observed vs Predicted Energy: $Y_i$ vs $\hat{Y}_i$")

Out[61]:

<matplotlib.text.Text at 0x1b37b1d0>

Hourly Prediction¶

Hourly Electricity Prediction¶

Adding new features to the dataframe: hour, weekday, day of the year and week of the year.

In [89]:

def addHourlyTimeFeatures(df):
    df['hour'] = df.index.hour
    df['weekday'] = df.index.weekday
    df['day'] = df.index.dayofyear
    df['week'] = df.index.weekofyear    
    return df

hourlyElectricity = addHourlyTimeFeatures(hourlyElectricity)

In [96]:

df_hourlyelect = hourlyElectricity[['hour', 'weekday', 'day', 'week', 'cosHour', 
                                                      'occupancy', 'electricity-kWh']]

hourlyelect_train = pd.DataFrame(data=df_hourlyelect, index=np.arange('2014-01-01 00:00:00', '2014-10-01 00:00:00', dtype='datetime64[h]')).dropna()
hourlyelect_test = pd.DataFrame(data=df_hourlyelect, index=np.arange('2014-10-01 00:00:00', '2014-11-01 00:00:00', dtype='datetime64[h]')).dropna()

XX_hourlyelect_train = hourlyelect_train.drop('electricity-kWh', axis = 1).reset_index().drop('index', axis = 1)
XX_hourlyelect_test = hourlyelect_test.drop('electricity-kWh', axis = 1).reset_index().drop('index', axis = 1)

YY_hourlyelect_train = hourlyelect_train['electricity-kWh']
YY_hourlyelect_test = hourlyelect_test['electricity-kWh']

In [99]:

# Optimal parameters for the SVR regressor
gamma_range = [0.01, 0.001, 0.0001]
epsilon_range = [x * 0.1 for x in range(0, 2)]
C_range = range(1, 5, 1)

tuned_parameters = [{
    'kernel': ['rbf', 'linear'],
#    'C': C_range,
#    'gamma': gamma_range,
    'epsilon': epsilon_range}]

# search for the best parameters with crossvalidation.
svr_hourlyelect = GridSearchCV(SVR(C=1, gamma=0.01), param_grid = tuned_parameters, verbose = 0)

# Fit regression model
y_hourlyelect = svr_hourlyelect.fit(XX_hourlyelect_train, YY_hourlyelect_train).predict(XX_hourlyelect_test)

print 'Optimum parameters C=1 and gamma=0.01 for SVR'
print 'Optimum epsilon and kernel for SVR: ', svr_hourlyelect.best_params_

print "The test score R2: ", svr_hourlyelect.score(XX_hourlyelect_test, YY_hourlyelect_test)

print("SVR mean squared error: %.2f"
      % np.mean((YY_hourlyelect_test - svr_hourlyelect.predict(XX_hourlyelect_test)) ** 2))

Optimum parameters C=1 and gamma=0.01 for SVR
Optimum epsilon and kernel for SVR:  {'epsilon': 0.1, 'kernel': 'linear'}
The test score R2:  0.747282383852
SVR mean squared error: 1561.24

In [100]:

#Plot observed and predicted electricity value 
fig = plt.figure(figsize=(20,10))
plt.scatter(XX_hourlyelect_test.index, YY_hourlyelect_test, label='Observed', color='k')
plt.plot(XX_hourlyelect_test.index, y_hourlyelect, label='Predicted', color='g')
plt.legend(loc='upper right')

Out[100]:

<matplotlib.legend.Legend at 0x1f4a6ef0>

In [101]:

#Plot Observed vs. Predicted usage.
fig = plt.figure(figsize=(6,6))
plt.plot(YY_hourlyelect_test, YY_hourlyelect_test, c='k')
plt.scatter(YY_hourlyelect_test, y_hourlyelect, c='g')
plt.xlabel('Observed Elec. Usage (kWh)')
plt.ylabel("Predicted Elec. Usage (kWh): $\hat{Y}_i$")
plt.title("Observed vs Predicted Elec.: $Y_i$ vs $\hat{Y}_i$")

Out[101]:

<matplotlib.text.Text at 0x210e82b0>

In [237]: