Data Wrangling¶

For some large scale buildings in Harvard, there are three types energy consumption, electricity, chilled water and steam. Chilled water is for cooling and steam is for heating. The following figure shows the buildings with chilled water and steam supply from Harvard plants.

Fig. Harvard chilled water and steam supply. (Left: chilled water, highlighted in blue. Right: Steam, highlighted in yellow.)

Image source: Harvard University Campus Service

We picked one building and got energy consumption data from 2011/07/01 to 2014/10/31. There are several months data missing due to meter malfunction. The data resolution is hourly. In original data, the hourly data is meter readings. So in order to get hourly consumption, we need to offset the data and do the minus. We have hourly data from January 2012 to October 2014 for both weather and energy (2.75 years). The weather data is from weather stations in Cambridge.

In this section, we are going to finish the following tasks.

Get hourly electricity, chilled water and steam from the original data downloaded from Harvard Energy Witness website manually.
Clean weather data, add more features including cooling degrees, heating degress and humidity ratio.
Estimate daily occupancy based on holidays, school year and weekends.
Create features reletate to hour, which is cos(hourOfDay * 2 * pi / 24).
Merge electricity, chilled water and steam dataframe with weather and time and occupancy features.

In [1]:

%matplotlib inline 

import requests 
from StringIO import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # module for performing linear algebra operations
from __future__ import division
from math import log10,exp

pd.options.display.mpl_style = 'default'

Original Energy Consumption Data¶

Original data are downloaded from Harvard Energy Witness Website <img src="Pics/energyWitness1.png", style="width:80%"> <img src="Pics/energyWitness3.png", style="width:80%">

Then we use Pandas to put them together into one dataframe.

In [2]:

file = 'Data/Org/0701-0930-2011.xls' 
df = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))

files = ['Data/Org/1101-1130-2011.xls', 'Data/Org/1201-2011-0131-2012.xls',
         'Data/Org/0201-0331-2012.xls','Data/Org/0401-0531-2012.xls','Data/Org/0601-0630-2012.xls',
         'Data/Org/0701-0831-2012.xls','Data/Org/0901-1031-2012.xls','Data/Org/1101-1231-2012.xls',
         'Data/Org/0101-0228-2013.xls',
         'Data/Org/0301-0430-2013.xls','Data/Org/0501-0630-2013.xls','Data/Org/0701-0831-2013.xls',
         'Data/Org/0901-1031-2013.xls','Data/Org/1101-1231-2013.xls','Data/Org/0101-0228-2014.xls',
         'Data/Org/0301-0430-2014.xls', 'Data/Org/0501-0630-2014.xls', 'Data/Org/0701-0831-2014.xls',
         'Data/Org/0901-1031-2014.xls']

for file in files:
    data = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
    df = df.append(data)

df.head()

WARNING *** file size (2481102) not 512 + multiple of sector size (512)
WARNING *** file size (848833) not 512 + multiple of sector size (512)
WARNING *** file size (1694257) not 512 + multiple of sector size (512)
WARNING *** file size (1640459) not 512 + multiple of sector size (512)
WARNING *** file size (1667907) not 512 + multiple of sector size (512)
WARNING *** file size (847258) not 512 + multiple of sector size (512)
WARNING *** file size (1691449) not 512 + multiple of sector size (512)
WARNING *** file size (1666647) not 512 + multiple of sector size (512)
WARNING *** file size (1665736) not 512 + multiple of sector size (512)
WARNING *** file size (1614814) not 512 + multiple of sector size (512)
WARNING *** file size (1665980) not 512 + multiple of sector size (512)
WARNING *** file size (1667276) not 512 + multiple of sector size (512)
WARNING *** file size (1691736) not 512 + multiple of sector size (512)
WARNING *** file size (1666704) not 512 + multiple of sector size (512)
WARNING *** file size (1665920) not 512 + multiple of sector size (512)
WARNING *** file size (1614900) not 512 + multiple of sector size (512)
WARNING *** file size (1666228) not 512 + multiple of sector size (512)
WARNING *** file size (1666191) not 512 + multiple of sector size (512)
WARNING *** file size (1691845) not 512 + multiple of sector size (512)
WARNING *** file size (1663846) not 512 + multiple of sector size (512)

Out[2]:

	Unnamed: 0	Unnamed: 1	Gund Bus-A 15 Min Block Demand - kW	Gund Bus-A CurrentA - Amps	Unnamed: 4	Unnamed: 5	Gund Bus-A CurrentB - Amps	Unnamed: 7	Gund Bus-A CurrentC - Amps	Unnamed: 9	...	Gund Main Demand - Tons	Gund Main Energy - Ton-Days	Gund Main FlowRate - gpm	Gund Main FlowTotal - kgal(1000)	Gund Main SignalAeration - Count	Gund Main SignalStrength - Count	Gund Main SonicVelocity - Ft/Sec	Gund Main TempDelta - Deg F	Gund Main TempReturn - Deg F	Gund Main TempSupply - Deg F
0	2011-07-01 01:00:00	White	48.458733	65.977882	NaN	NaN	52.631417	NaN	55.603840	NaN	...	4.677294	17912.537804	6.916454	48168.083414	0.693405	57.208127	1437.640543	16.238684	59.757447	43.516103
1	2011-07-01 02:00:00	White	40.472697	57.230223	NaN	NaN	42.483092	NaN	50.243230	NaN	...	4.586403	17912.853518	6.739337	48168.645429	0.567355	57.082909	1438.030719	16.263573	59.710199	43.495128
2	2011-07-01 03:00:00	#d2e4b0	39.472809	55.487443	NaN	NaN	41.911784	NaN	48.482163	NaN	...	4.462877	17913.169232	6.725142	48169.207444	0.441304	57.001646	1439.111130	15.797043	59.248158	43.457344
3	2011-07-01 04:00:00	White	39.198879	55.849806	NaN	NaN	41.525529	NaN	48.987457	NaN	...	4.696993	17913.484946	7.041330	48169.769458	0.315254	57.000000	1440.768604	15.947392	59.207097	43.267682
4	2011-07-01 05:00:00	White	39.297522	55.736219	NaN	NaN	41.299381	NaN	48.710408	NaN	...	4.550372	17913.800660	6.863004	48170.331473	0.189204	57.000000	1442.426077	15.903679	59.282707	43.372615

5 rows × 55 columns

Above is the original hourly data.

As you can see, it is quite messy. The first thing to remove meaningless columns.

In [3]:

df.rename(columns={'Unnamed: 0':'Datetime'}, inplace=True)
nonBlankColumns = ['Unnamed' not in s for s in df.columns]
columns = df.columns[nonBlankColumns]
df = df[columns]
df = df.set_index(['Datetime'])
df.index.name = None
df.head()

Out[3]:

	Gund Bus-A 15 Min Block Demand - kW	Gund Bus-A CurrentA - Amps	Gund Bus-A CurrentB - Amps	Gund Bus-A CurrentC - Amps	Gund Bus-A CurrentN - Amps	Gund Bus-A EnergyReal - kWhr	Gund Bus-A Freq - Hertz	Gund Bus-A Max Monthly Demand - kW	Gund Bus-A PowerApp - kVA	Gund Bus-A PowerReac - kVAR	...	Gund Main Demand - Tons	Gund Main Energy - Ton-Days	Gund Main FlowRate - gpm	Gund Main FlowTotal - kgal(1000)	Gund Main SignalAeration - Count	Gund Main SignalStrength - Count	Gund Main SonicVelocity - Ft/Sec	Gund Main TempDelta - Deg F	Gund Main TempReturn - Deg F	Gund Main TempSupply - Deg F
2011-07-01 01:00:00	48.458733	65.977882	52.631417	55.603840	15.982278	1796757.502803	59.837524	96.117915	48.757073	12.344712	...	4.677294	17912.537804	6.916454	48168.083414	0.693405	57.208127	1437.640543	16.238684	59.757447	43.516103
2011-07-01 02:00:00	40.472697	57.230223	42.483092	50.243230	13.423762	1796800.145991	60.005569	96.117915	42.238685	12.967984	...	4.586403	17912.853518	6.739337	48168.645429	0.567355	57.082909	1438.030719	16.263573	59.710199	43.495128
2011-07-01 03:00:00	39.472809	55.487443	41.911784	48.482163	13.478933	1796840.146023	59.833880	96.117915	41.278573	12.732046	...	4.462877	17913.169232	6.725142	48169.207444	0.441304	57.001646	1439.111130	15.797043	59.248158	43.457344
2011-07-01 04:00:00	39.198879	55.849806	41.525529	48.987457	13.603309	1796879.023607	59.673044	96.117915	41.345776	12.687845	...	4.696993	17913.484946	7.041330	48169.769458	0.315254	57.000000	1440.768604	15.947392	59.207097	43.267682
2011-07-01 05:00:00	39.297522	55.736219	41.299381	48.710408	13.797331	1796918.273558	59.986672	96.117915	41.166736	12.437842	...	4.550372	17913.800660	6.863004	48170.331473	0.189204	57.000000	1442.426077	15.903679	59.282707	43.372615

5 rows × 48 columns

Then we print out all the column names. Only a few columns are useful to get the hourly electricity, chilled water and steam.

In [4]:

for item in df.columns:
    print item

Gund Bus-A 15 Min Block Demand - kW
Gund Bus-A CurrentA - Amps
Gund Bus-A CurrentB - Amps
Gund Bus-A CurrentC - Amps
Gund Bus-A CurrentN - Amps
Gund Bus-A EnergyReal - kWhr
Gund Bus-A Freq - Hertz
Gund Bus-A Max Monthly Demand - kW
Gund Bus-A PowerApp - kVA
Gund Bus-A PowerReac - kVAR
Gund Bus-A PowerReal - kW
Gund Bus-A TruePF - PF
Gund Bus-A VoltageAB - Volts
Gund Bus-A VoltageAN - Volts
Gund Bus-A VoltageBC - Volts
Gund Bus-A VoltageBN - Volts
Gund Bus-A VoltageCA - Volts
Gund Bus-A VoltageCN - Volts
Gund Bus-B 15 Min Block Demand - kW
Gund Bus-B CurrentA - Amps
Gund Bus-B CurrentB - Amps
Gund Bus-B CurrentC - Amps
Gund Bus-B CurrentN - Amps
Gund Bus-B EnergyReal - kWhr
Gund Bus-B Freq - Hertz
Gund Bus-B Max Monthly Demand - kW
Gund Bus-B PowerApp - kVA
Gund Bus-B PowerReac - kVAR
Gund Bus-B PowerReal - kW
Gund Bus-B TruePF - PF
Gund Bus-B VoltageAB - Volts
Gund Bus-B VoltageAN - Volts
Gund Bus-B VoltageBC - Volts
Gund Bus-B VoltageBN - Volts
Gund Bus-B VoltageCA - Volts
Gund Bus-B VoltageCN - Volts
Gund Condensate Counter - count
Gund Condensate FlowTotal - LBS
Gund Main Demand - Tons
Gund Main Energy - Ton-Days
Gund Main FlowRate - gpm
Gund Main FlowTotal - kgal(1000)
Gund Main SignalAeration - Count
Gund Main SignalStrength - Count
Gund Main SonicVelocity - Ft/Sec
Gund Main TempDelta - Deg F
Gund Main TempReturn - Deg F
Gund Main TempSupply - Deg F

Electricity¶

Use electricity as an example, there are two meters, "Gund Bus A" and "Gund Bus B". The "EnergyReal - kWhr" records accumulative consumption. We are not sure exactly what is "PowerReal". Just in case, we put it into the electricity dateframe as well.

In [5]:

electricity=df[['Gund Bus-A EnergyReal - kWhr','Gund Bus-B EnergyReal - kWhr',
                'Gund Bus-A PowerReal - kW','Gund Bus-B PowerReal - kW',]]
electricity.head()

Out[5]:

	Gund Bus-A EnergyReal - kWhr	Gund Bus-B EnergyReal - kWhr	Gund Bus-A PowerReal - kW	Gund Bus-B PowerReal - kW
2011-07-01 01:00:00	1796757.502803	3657811.582122	47.184015	63.486186
2011-07-01 02:00:00	1796800.145991	3657873.464938	40.208796	61.270542
2011-07-01 03:00:00	1796840.146023	3657934.837505	39.209866	61.464394
2011-07-01 04:00:00	1796879.023607	3657995.470348	39.378507	59.396581
2011-07-01 05:00:00	1796918.273558	3658054.470285	39.240837	58.911729

Validate our data processing method by checking monthly energy consumption¶

In order to check whether our understanding of the data is correct, we want to calculate monthly electricity consumption from the hourly data, and then compare the results with the monthly data provided by facalities, which is available on Energy Witness as well.

Here is the monthly data provided by facalities, "Bus A & B" are called "CE603B kWh" and "CE604B kWh" in ths monthly form. They are simply two meters. Please note, the meter reading cycle is NOT calendar month.

In [6]:

file = 'Data/monthly electricity.csv' 
monthlyElectricityFromFacility = pd.read_csv(file, header=0)
monthlyElectricityFromFacility
monthlyElectricityFromFacility = monthlyElectricityFromFacility.set_index(['month'])
monthlyElectricityFromFacility.head()

Out[6]:

	startDate	endDate	CE603B kWh	CE604B kWh
month
Jul 11	6/16/11	7/18/11	43968.1	106307.1
Aug 11	7/18/11	8/17/11	41270.1	83121.1
Sep 11	8/17/11	9/16/11	51514.1	107083.1
Oct 11	9/16/11	10/18/11	65338.1	114350.1
Nov 11	10/18/11	11/17/11	65453.1	115318.1

We use the column "EnergyReal - kWhr" for two meters. We found the numbers of the begin and end date of the meter reading cycle, and use the number of the end date minus the number of the begin date, then we get our monthly electiricty consumption.

In [7]:

monthlyElectricityFromFacility['startDate'] = pd.to_datetime(monthlyElectricityFromFacility['startDate'], format="%m/%d/%y")
values = monthlyElectricityFromFacility.index.values

keys = np.array(monthlyElectricityFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlyElectricityOrg = np.zeros((len(sortedDates) - 1, 2))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlyElectricityOrg[i, 0] = (np.round(electricity.loc[end,'Gund Bus-A EnergyReal - kWhr'] 
                           -  electricity.loc[begin,'Gund Bus-A EnergyReal - kWhr'], 1))
    monthlyElectricityOrg[i, 1] = (np.round(electricity.loc[end,'Gund Bus-B EnergyReal - kWhr'] 
                           -  electricity.loc[begin,'Gund Bus-B EnergyReal - kWhr'], 1))

monthlyElectricity = pd.DataFrame(data = monthlyElectricityOrg, index = months, columns = ['CE603B kWh', 'CE604B kWh'])


plt.figure()
fig, ax = plt.subplots()
fig = monthlyElectricity.plot(marker = 'o', figsize=(15,6), rot = 40, fontsize = 13, ax = ax, linestyle='')
fig.set_axis_bgcolor('w')
plt.xlabel('Billing month', fontsize = 15)
plt.ylabel('kWh', fontsize = 15)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
plt.xticks(np.arange(0,len(months)),months)
plt.title('Original monthly consumption from hourly data',fontsize = 17)

text = 'Meter malfunction'
ax.annotate(text, xy = (9, 4500000), 
            xytext = (5, 2), fontsize = 15,
            textcoords = 'offset points', ha = 'center', va = 'top')

ax.annotate(text, xy = (8, -4500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (14, -2500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (15, 2500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'top')

plt.show()

<matplotlib.figure.Figure at 0x1044639d0>

Above is a plot of monthly electricity consumption using our data processing method. Obviously, the two meters malfunctioned for several months. There are two sets of dots "CE603B" and "CE604B", which come from two meters. There are two electricity meters. The sum of them is the total electricity consumption of the building.

In [53]:

monthlyElectricity.loc['Aug 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Sep 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Feb 13','CE603B kWh'] = np.nan
monthlyElectricity.loc['Mar 13','CE603B kWh'] = np.nan


fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5,monthlyElectricity['CE603B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE603B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE603B',fontsize=20)

text = 'Meter malfunction \n estimated by facility'
ax.annotate(text, xy = (14, monthlyElectricityFromFacility.loc['Feb 13','CE603B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (15, monthlyElectricityFromFacility.loc['Mar 13','CE603B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5, monthlyElectricity['CE604B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE604B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE604B',fontsize=20)


ax.annotate(text, xy = (9, monthlyElectricityFromFacility.loc['Sep 12','CE604B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (8, monthlyElectricityFromFacility.loc['Aug 12','CE604B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()

The dots in the picture are monthly data provided by facility. They are the numbers on a bill. When meters were malfunctioning, the facility just estimated the energy use for billin purpose. The data points we excluded are not in the plot. They are simply way higher than normal (30 times more than other normal consumption). "Meter malfunction, estimated by facility" means in this month, the meter was malfunctioning and this dot is what facilities estimated.

Meter "CE603B" malfunctioned on February and March 2013. Meter "CE604B" malfunctioned on August and September 2012. We set the nubmers of these months to np.nan and simply exclude them from regression. Plot them againa and compare with the facility monthly data. They match! Of course, excep when the meters were malfunctioning. In thoses months, the facilities estimated the energy consumption for billing purpose. For us, we just excluded hourly and daily data points during thoese months in regression.

The total building electricity consumption is the sum of these two meters. Just in case, we keep the "power" and compare them with "energy"

In [54]:

electricity['energy'] = electricity['Gund Bus-A EnergyReal - kWhr'] + electricity['Gund Bus-B EnergyReal - kWhr']
electricity['power'] = electricity['Gund Bus-A PowerReal - kW'] + electricity['Gund Bus-B PowerReal - kW']
electricity.head()

Out[54]:

	Gund Bus-A EnergyReal - kWhr	Gund Bus-B EnergyReal - kWhr	Gund Bus-A PowerReal - kW	Gund Bus-B PowerReal - kW	energy	power	startTime	endTime
2011-07-01 00:00:00	NaN	NaN	NaN	NaN	NaN	NaN	2011-07-01 00:00:00	2011-07-01 01:00:00
2011-07-01 01:00:00	1796757.502803	3657811.582122	47.184015	63.486186	5454569.084925	110.670201	2011-07-01 01:00:00	2011-07-01 02:00:00
2011-07-01 02:00:00	1796800.145991	3657873.464938	40.208796	61.270542	5454673.610929	101.479338	2011-07-01 02:00:00	2011-07-01 03:00:00
2011-07-01 03:00:00	1796840.146023	3657934.837505	39.209866	61.464394	5454774.983528	100.674260	2011-07-01 03:00:00	2011-07-01 04:00:00
2011-07-01 04:00:00	1796879.023607	3657995.470348	39.378507	59.396581	5454874.493955	98.775088	2011-07-01 04:00:00	2011-07-01 05:00:00

Derive hourly consumption¶

The hourly energy consumption between this hour and next hour is the meter kWh reading ("energy" in the dataframe) of the next hour minus this hour. We assume the meter reading is recorded at the end of hour. In order to avoid confusion, we also mark the startTime and endTime of the meter reading in the dataframe. We compared hourly power of next hour with the derived hourly electricity, most of the time, power is close to hourly electricity. Sometimes, there is quite a difference.

The meters record accumulative energy consumption. Say at the beginning of today, the number is 10. After an hour, the energy use is 1, then the meter number adds one and become 11, etc. So the meter number should keep increasing. However, we found that sometimes, the meter reading suddenly drops to 0, and after quite a while, get a high positive number again. This will create negative hourly consumption, and then an absurdly high postive number. We get hourly/daily energy consumption by "meter reading at time t + 1 - at time t". If the result is negative, or absurdly high, we consider the meter was malfunctioning. We set these data points to np.nan now. We figured out the exact dates when the meters were malfunctioning by looking at the raw data in excel files and from monthly plots.

In addition to that, occasionaly, there are more non-sense data points, which are thousand times higher than normal values. The normal hourly consumption range is between 100 and 400. We created a filter: "index = abs(hourlyEnergy) < 200000", which means only values lower than 200000 are kept.

In [10]:

# In case there are any missing hours, reindex to get the entire time span. Fill in nan data.
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')

# Somehow, reindex does not work well. October 2011 and several other hours are missing. 
# Basically it is just the length of original length.
#electricity.reindex(hourlyTimestamp, inplace = True, fill_value = np.nan)

startTime = hourlyTimestamp
endTime = hourlyTimestamp + np.timedelta64(1,'h')
hourlyTime = pd.DataFrame(data = np.transpose([startTime, endTime]), index = hourlyTimestamp, columns = ['startTime', 'endTime'])

electricity = electricity.join(hourlyTime, how = 'outer')

# Just in case, in order to use diff method, timestamp has to be in asending order.
electricity.sort_index(inplace = True)
hourlyEnergy = electricity.diff(periods=1)['energy']

hourlyElectricity = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['electricity-kWh'])
hourlyElectricity = hourlyElectricity.join(hourlyTime, how = 'inner')

print "Data length: ", len(hourlyElectricity)/24, " days"
hourlyElectricity.head()

Data length:  1218.04166667  days

Out[10]:

	electricity-kWh	startTime	endTime
2011-07-01 00:00:00	NaN	2011-07-01 00:00:00	2011-07-01 01:00:00
2011-07-01 01:00:00	NaN	2011-07-01 01:00:00	2011-07-01 02:00:00
2011-07-01 02:00:00	104.526004	2011-07-01 02:00:00	2011-07-01 03:00:00
2011-07-01 03:00:00	101.372599	2011-07-01 03:00:00	2011-07-01 04:00:00
2011-07-01 04:00:00	99.510428	2011-07-01 04:00:00	2011-07-01 05:00:00

Above is the hourly electricity consumption. The data is exported to an excel file. We assume the meter reading is recorded at the end of hour.

In [11]:

# Filter the data, keep the NaN and generate two excels, with and without Nan

hourlyElectricity.loc[abs(hourlyElectricity['electricity-kWh']) > 100000,'electricity-kWh'] = np.nan

time = hourlyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) \
        | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))

hourlyElectricity.loc[index,'electricity-kWh'] = np.nan
hourlyElectricityWithoutNaN = hourlyElectricity.dropna(axis=0, how='any')

hourlyElectricity.to_excel('Data/hourlyElectricity.xlsx')
hourlyElectricityWithoutNaN.to_excel('Data/hourlyElectricityWithoutNaN.xlsx')

In [12]:

plt.figure()
fig = hourlyElectricity.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()

plt.figure()
fig = hourlyElectricity.iloc[26200:27400,:].plot(marker = 'o',label='hourly electricity', fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('Hourly electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.legend()
plt.show()

<matplotlib.figure.Figure at 0x1096bca90>

<matplotlib.figure.Figure at 0x1096a1750>

Above are the hourly data plots. In the first graph, the blank part is due to missing data when the meters were malfunctioning. In the second graph, you can see difference between day and night, weekday and weekends.

Derive daily consumption¶

We manage to get the daily electricity consumption through "reindex".

In [13]:

dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
electricityReindexed = electricity.reindex(dailyTimestamp, inplace = False)

# Just in case, in order to use diff method, timestamp has to be in asending order.
electricityReindexed.sort_index(inplace = True)
dailyEnergy = electricityReindexed.diff(periods=1)['energy']

dailyElectricity = pd.DataFrame(data = dailyEnergy.values, index = electricityReindexed.index - np.timedelta64(1,'D'), columns = ['electricity-kWh'])
dailyElectricity['startDay'] = dailyElectricity.index
dailyElectricity['endDay'] = dailyElectricity.index + np.timedelta64(1,'D')

# Filter the data, keep the NaN and generate two excels, with and without Nan

dailyElectricity.loc[abs(dailyElectricity['electricity-kWh']) > 2000000,'electricity-kWh'] = np.nan

time = dailyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))

dailyElectricity.loc[index,'electricity-kWh'] = np.nan
dailyElectricityWithoutNaN = dailyElectricity.dropna(axis=0, how='any')

dailyElectricity.to_excel('Data/dailyElectricity.xlsx')
dailyElectricityWithoutNaN.to_excel('Data/dailyElectricityWithoutNaN.xlsx')

dailyElectricity.head()

Out[13]:

	electricity-kWh	startDay	endDay
2011-06-30	NaN	2011-06-30	2011-07-01
2011-07-01	NaN	2011-07-01	2011-07-02
2011-07-02	3630.398480	2011-07-02	2011-07-03
2011-07-03	3750.648885	2011-07-03	2011-07-04
2011-07-04	4568.724044	2011-07-04	2011-07-05

Above is the daily electricity consumption.

In [14]:

plt.figure()
fig = dailyElectricity.plot(figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All the daily electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()

plt.figure()
fig = dailyElectricity.iloc[1000:1130,:].plot(marker = 'o', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('Daily electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.show()

<matplotlib.figure.Figure at 0x10a556a90>

<matplotlib.figure.Figure at 0x109b48150>

Above is the daily electricity plot.

The super low electricity consumption happens after semesters end, including Christmas vacations. In addition, as you can see in the second figure, the summer energy consumption is lower when school starts.

Chilled Water¶

We clean the chilled water data in the same way as electricity.

In [15]:

chilledWater = df[['Gund Main Energy - Ton-Days']]
chilledWater.head()

Out[15]:

	Gund Main Energy - Ton-Days
2011-07-01 01:00:00	17912.537804
2011-07-01 02:00:00	17912.853518
2011-07-01 03:00:00	17913.169232
2011-07-01 04:00:00	17913.484946
2011-07-01 05:00:00	17913.800660

In [16]:

file = 'Data/monthly chilled water.csv' 
monthlyChilledWaterFromFacility = pd.read_csv(file, header=0)
monthlyChilledWaterFromFacility.set_index(['month'], inplace = True)
monthlyChilledWaterFromFacility.head()

Out[16]:

	startDate	endDate	chilledWater
month
11-Jul	6/12/11	7/12/11	2258
11-Aug	7/12/11	8/12/11	2095
11-Sep	8/12/11	9/12/11	2200
11-Oct	9/12/11	10/12/11	1664
11-Nov	10/12/11	11/12/11	447

In [17]:

monthlyChilledWaterFromFacility['startDate'] = pd.to_datetime(monthlyChilledWaterFromFacility['startDate'], format="%m/%d/%y")
values = monthlyChilledWaterFromFacility.index.values

keys = np.array(monthlyChilledWaterFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlyChilledWaterOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlyChilledWaterOrg[i] = (np.round(chilledWater.loc[end,:] -  chilledWater.loc[begin,:], 1))
    

monthlyChilledWater = pd.DataFrame(data = monthlyChilledWaterOrg, index = months, columns = ['chilledWater-TonDays'])

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlyChilledWater, label='Our data processing from hourly data', marker = 'x', markersize = 15, linestyle = '')
plt.bar(np.arange(len(monthlyChilledWater))-0.5, monthlyChilledWater.values, label='Our data processing from hourly data')
plt.plot(monthlyChilledWaterFromFacility[5:-1]['chilledWater'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Data Validation: comparison between our data processing and facilities',fontsize=20)

text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (15, 2000), 
            xytext = (5, 50), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

plt.show()

In [55]:

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
chilledWater.reindex(hourlyTimestamp, inplace = True)

# Just in case, in order to use diff method, timestamp has to be in asending order.
chilledWater.sort_index(inplace = True)
hourlyEnergy = chilledWater.diff(periods=1)

hourlyChilledWater = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['chilledWater-TonDays'])
hourlyChilledWater['startTime'] = hourlyChilledWater.index
hourlyChilledWater['endTime'] = hourlyChilledWater.index + np.timedelta64(1,'h')

hourlyChilledWater.loc[abs(hourlyChilledWater['chilledWater-TonDays']) > 50,'chilledWater-TonDays'] = np.nan

hourlyChilledWaterWithoutNaN = hourlyChilledWater.dropna(axis=0, how='any')

hourlyChilledWater.to_excel('Data/hourlyChilledWater.xlsx')
hourlyChilledWaterWithoutNaN.to_excel('Data/hourlyChilledWaterWithoutNaN.xlsx')

plt.figure()
fig = hourlyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()

hourlyChilledWater.head()

<matplotlib.figure.Figure at 0x10f0df950>

Out[55]:

	chilledWater-TonDays	startTime	endTime
2011-07-01 01:00:00	NaN	2011-07-01 01:00:00	2011-07-01 02:00:00
2011-07-01 02:00:00	0.315714	2011-07-01 02:00:00	2011-07-01 03:00:00
2011-07-01 03:00:00	0.315714	2011-07-01 03:00:00	2011-07-01 04:00:00
2011-07-01 04:00:00	0.315714	2011-07-01 04:00:00	2011-07-01 05:00:00
2011-07-01 05:00:00	0.315714	2011-07-01 05:00:00	2011-07-01 06:00:00

In [19]:

dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
chilledWaterReindexed = chilledWater.reindex(dailyTimestamp, inplace = False)

chilledWaterReindexed.sort_index(inplace = True)
dailyEnergy = chilledWaterReindexed.diff(periods=1)['Gund Main Energy - Ton-Days']

dailyChilledWater = pd.DataFrame(data = dailyEnergy.values, index = chilledWaterReindexed.index - np.timedelta64(1,'D'), columns = ['chilledWater-TonDays'])
dailyChilledWater['startDay'] = dailyChilledWater.index
dailyChilledWater['endDay'] = dailyChilledWater.index + np.timedelta64(1,'D')

dailyChilledWaterWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')

dailyChilledWater.to_excel('Data/dailyChilledWater.xlsx')
dailyChilledWaterWithoutNaN.to_excel('Data/dailyChilledWaterWithoutNaN.xlsx')

plt.figure()
fig = dailyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()


dailyChilledWater.head()

<matplotlib.figure.Figure at 0x10a022050>

Out[19]:

	chilledWater-TonDays	startDay	endDay
2011-06-30	NaN	2011-06-30	2011-07-01
2011-07-01	NaN	2011-07-01	2011-07-02
2011-07-02	54.741028	2011-07-02	2011-07-03
2011-07-03	55.649728	2011-07-03	2011-07-04
2011-07-04	109.049077	2011-07-04	2011-07-05

Steam¶

In [20]:

steam = df[['Gund Condensate FlowTotal - LBS']]
steam.head()

Out[20]:

	Gund Condensate FlowTotal - LBS
2011-07-01 01:00:00	15443350.388455
2011-07-01 02:00:00	15443459.322917
2011-07-01 03:00:00	15443574.687500
2011-07-01 04:00:00	15443701.953125
2011-07-01 05:00:00	15443818.359375

In [21]:

file = 'Data/monthly steam.csv' 
monthlySteamFromFacility = pd.read_csv(file, header=0)
monthlySteamFromFacility.set_index(['month'], inplace = True)
monthlySteamFromFacility.head()

Out[21]:

	startDate	endDate	steam
month
Jul 11	6/17/2011	7/21/2011	0.0
Aug 11	7/21/2011	8/20/2011	0.0
Sep 11	8/20/2011	9/17/2011	0.0
Oct 11	9/17/2011	10/20/2011	246.5
Nov 11	10/20/2011	11/20/2011	786.1

In [22]:

monthlySteamFromFacility['startDate'] = pd.to_datetime(monthlySteamFromFacility['startDate'], format="%m/%d/%Y")
values = monthlySteamFromFacility.index.values

keys = np.array(monthlySteamFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlySteamOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlySteamOrg[i] = (np.round(steam.loc[end,:] -  steam.loc[begin,:], 1))
    

monthlySteam = pd.DataFrame(data = monthlySteamOrg, index = months, columns = ['steam-LBS'])

# 867 LBS ~= 1MMBTU steam

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlySteam/867, label='Our data processing from hourly data')
plt.bar(np.arange(len(monthlySteam))-0.5, monthlySteam.values/867, label='Our data processing from hourly data')
plt.plot(monthlySteamFromFacility.loc[months,'steam'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('Steam (MMBTU)',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities - Steam',fontsize=20)

text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (9, 1500), 
            xytext = (5, 50), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

plt.show()

In [23]:

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
steam.reindex(hourlyTimestamp, inplace = True)

# Just in case, in order to use diff method, timestamp has to be in asending order.
steam.sort_index(inplace = True)
hourlyEnergy = steam.diff(periods=1)

hourlySteam = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['steam-LBS'])
hourlySteam['startTime'] = hourlySteam.index
hourlySteam['endTime'] = hourlySteam.index + np.timedelta64(1,'h')

hourlySteam.loc[abs(hourlySteam['steam-LBS']) > 100000,'steam-LBS'] = np.nan

plt.figure()
fig = hourlySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 17)
fig.set_axis_bgcolor('w')
plt.title('All the hourly steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()

hourlySteamWithoutNaN = hourlySteam.dropna(axis=0, how='any')

hourlySteam.to_excel('Data/hourlySteam.xlsx')
hourlySteamWithoutNaN.to_excel('Data/hourlySteamWithoutNaN.xlsx')

hourlySteam.head()

<matplotlib.figure.Figure at 0x10a53ecd0>

Out[23]:

	steam-LBS	startTime	endTime
2011-07-01 01:00:00	NaN	2011-07-01 01:00:00	2011-07-01 02:00:00
2011-07-01 02:00:00	108.934462	2011-07-01 02:00:00	2011-07-01 03:00:00
2011-07-01 03:00:00	115.364583	2011-07-01 03:00:00	2011-07-01 04:00:00
2011-07-01 04:00:00	127.265625	2011-07-01 04:00:00	2011-07-01 05:00:00
2011-07-01 05:00:00	116.406250	2011-07-01 05:00:00	2011-07-01 06:00:00

In [24]:

dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
steamReindexed = steam.reindex(dailyTimestamp, inplace = False)

steamReindexed.sort_index(inplace = True)
dailyEnergy = steamReindexed.diff(periods=1)['Gund Condensate FlowTotal - LBS']

dailySteam = pd.DataFrame(data = dailyEnergy.values, index = steamReindexed.index - np.timedelta64(1,'D'), columns = ['steam-LBS'])
dailySteam['startDay'] = dailySteam.index
dailySteam['endDay'] = dailySteam.index + np.timedelta64(1,'D')

plt.figure()
fig = dailySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()

dailySteamWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')

dailySteam.to_excel('Data/dailySteam.xlsx')
dailySteamWithoutNaN.to_excel('Data/dailySteamWithoutNaN.xlsx')

dailySteam.head()

<matplotlib.figure.Figure at 0x107e2d150>

Out[24]:

	steam-LBS	startDay	endDay
2011-06-30	NaN	2011-06-30	2011-07-01
2011-07-01	NaN	2011-07-01	2011-07-02
2011-07-02	3250.651042	2011-07-02	2011-07-03
2011-07-03	3271.276042	2011-07-03	2011-07-04
2011-07-04	3236.718750	2011-07-04	2011-07-05

Weather data¶

Original data¶

There are two sources for weather data.

Year 2014: From local weather on the roof of GSD (Graduate School of Design) building Gund.
Year 2012 & 2013: purchased weather data from weather stations located in Cambridge, MA. Please note that year 2012 and 2013 data were from different weather stations.

Here is the orginal weather data after a little cleaning including unit conversion. AS you can see, the interval is 5-minute.

In [25]:

weather2014 = pd.read_excel('Data/weather-2014.xlsx')
weather2014.head()

Out[25]:

	Datetime	T-C	RH-%	Tdew-C	windDirection	windSpeed-m/s	pressure-mbar	solarRadiation-W/m2
0	2014-01-01 00:00:00	-5.02	53.2	-13.07	256	4.5	1020.8	1
1	2014-01-01 00:05:00	-5.14	54.3	-12.93	257	4.0	1020.7	1
2	2014-01-01 00:10:00	-5.08	53.4	-13.08	258	3.5	1021.1	1
3	2014-01-01 00:15:00	-5.17	52.6	-13.35	257	2.8	1021.0	1
4	2014-01-01 00:20:00	-5.23	52.9	-13.33	248	2.5	1021.2	1

Convert to hourly by resampling method.

Here is the clean hourly weather data of year 2014.

In [26]:

weather2014 = weather2014.set_index('Datetime')
weather2014 = weather2014.resample('H')
weather2014.head()

Out[26]:

	T-C	RH-%	Tdew-C	windDirection	windSpeed-m/s	pressure-mbar	solarRadiation-W/m2
Datetime
2014-01-01 00:00:00	-5.281667	52.858333	-13.393333	253.500000	2.775000	1021.158333	1
2014-01-01 01:00:00	-5.725000	51.650000	-14.090833	253.000000	2.350000	1021.700000	1
2014-01-01 02:00:00	-6.002500	50.766667	-14.555833	239.250000	1.133333	1022.708333	1
2014-01-01 03:00:00	-6.320833	49.675000	-15.114167	234.333333	1.191667	1023.233333	1
2014-01-01 04:00:00	-6.535833	49.708333	-15.305833	227.333333	1.633333	1023.925000	1

Here is the orginal weather data of the year 2013 and 2014 after a little cleaning including unit conversion. As you can see, the Datetime format is not correct.

In [27]:

weather2012and2013 = pd.read_excel('Data/weather-2012-2013.xlsx')
weather2012and2013.head()

Out[27]:

	Datetime	RH-%	windDirection	T-C	Tdew-C	pressure-mbar	windSpeed-m/s
0	2012-01-01-01	87	310	4	1.9	1004	4.166670
1	2012-01-01-02	87	280	4	1.9	1005	4.166670
2	2012-01-01-03	81	270	5	1.9	1006	4.166670
3	2012-01-01-04	76	290	6	1.9	1007	4.722226
4	2012-01-01-05	87	280	4	1.9	1007	3.055558

Correct the timestep.

Here is the hourly data after cleaning.

In [28]:

weather2012and2013['Datetime'] = pd.to_datetime(weather2012and2013['Datetime'], format='%Y-%m-%d-%H')
weather2012and2013 = weather2012and2013.set_index('Datetime')
weather2012and2013.head()

Out[28]:

	RH-%	windDirection	solarRadiation-W/m2	T-C	Tdew-C	pressure-mbar	windSpeed-m/s
Datetime
2012-01-01 01:00:00	87	310	0	4	1.9	1004	4.166670
2012-01-01 02:00:00	87	280	0	4	1.9	1005	4.166670
2012-01-01 03:00:00	81	270	0	5	1.9	1006	4.166670
2012-01-01 04:00:00	76	290	0	6	1.9	1007	4.722226
2012-01-01 05:00:00	87	280	0	4	1.9	1007	3.055558

Hourly weather data¶

Combine two files and add more features including cooling degrees, heating degrees, humidity ratio and dehumidification.

Here is all the hourly weather data.

In [29]:

# Combine two weather files
hourlyWeather = weather2014.append(weather2012and2013)
hourlyWeather.index.name = None
hourlyWeather.sort_index(inplace = True)

# Add more features

# Convert relative humidity to specific humidity

Mw=18.0160 # molecular weight of water
Md=28.9660 # molecular weight of dry air
R =  8.31432E3 # gas constant
Rd = R/Md # specific gas constant for dry air
Rv = R/Mw # specific gas constant for vapour
Lv = 2.5e6 # heat release for condensation of water vapour [J kg-1]
eps = Mw/Md

#saturation pressure
def esat(T):
    ''' get sateration pressure (units [Pa]) for a given air temperature (units [K])'''
    from numpy import log10
    TK = 273.15
    e1 = 101325.0
    logTTK = log10(T/TK)
    esat =  e1*10**(10.79586*(1-TK/T)-5.02808*logTTK+ 1.50474*1e-4*(1.-10**(-8.29692*(T/TK-1)))+ 0.42873*1e-3*(10**(4.76955*(1-TK/T))-1)-2.2195983) 
    return esat

def rh2sh(RH,p,T):
    '''purpose: conversion relative humidity (unitless) to specific humidity (humidity ratio) [kg/kg]'''
    es = esat(T)
    W = Mw/Md*RH*es/(p-RH*es)

    return W/(1.+W)


p = hourlyWeather['pressure-mbar'] * 100
RH = hourlyWeather['RH-%'] / 100
T = hourlyWeather['T-C'] + 273.15
w = rh2sh(RH,p,T)

hourlyWeather['humidityRatio-kg/kg'] = w
hourlyWeather['coolingDegrees'] = hourlyWeather['T-C'] - 12
hourlyWeather.loc[hourlyWeather['coolingDegrees'] < 0, 'coolingDegrees'] = 0

hourlyWeather['heatingDegrees'] = 15 - hourlyWeather['T-C']
hourlyWeather.loc[hourlyWeather['heatingDegrees'] < 0, 'heatingDegrees'] = 0

hourlyWeather['dehumidification'] = hourlyWeather['humidityRatio-kg/kg'] - 0.00886
hourlyWeather.loc[hourlyWeather['dehumidification'] < 0, 'dehumidification'] = 0

#hourlyWeather.to_excel('Data/hourlyWeather.xlsx')
hourlyWeather.head()

Out[29]:

	RH-%	T-C	Tdew-C	pressure-mbar	windDirection	windSpeed-m/s	humidityRatio-kg/kg	heatingDegrees
2012-01-01 01:00:00	87	4	1.9	1004	310	4.166670	0.004396	11
2012-01-01 02:00:00	87	4	1.9	1005	280	4.166670	0.004391	11
2012-01-01 03:00:00	81	5	1.9	1006	270	4.166670	0.004380	10
2012-01-01 04:00:00	76	6	1.9	1007	290	4.722226	0.004401	9
2012-01-01 05:00:00	87	4	1.9	1007	280	3.055558	0.004382	11

In [30]:

plt.figure()
fig = hourlyWeather.plot(y = 'T-C', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)')
plt.show()

plt.figure()
fig = hourlyWeather.plot(y = 'solarRadiation-W/m2', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly temperture',fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)',fontsize = 13)
plt.show()

plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

<matplotlib.figure.Figure at 0x10c8d2c10>

<matplotlib.figure.Figure at 0x107d9a690>

<matplotlib.figure.Figure at 0x10cab2f10>

<matplotlib.figure.Figure at 0x1123f1bd0>

Daily weather data¶

In [31]:

dailyWeather = hourlyWeather.resample('D')
#dailyWeather.to_excel('Data/dailyWeather.xlsx')
dailyWeather.head()

Out[31]:

	RH-%	T-C	Tdew-C	pressure-mbar	solarRadiation-W/m2	windDirection	windSpeed-m/s	humidityRatio-kg/kg	heatingDegrees
2012-01-01	76.652174	7.173913	3.073913	1004.956522	95.260870	236.086957	4.118361	0.004796	7.826087
2012-01-02	55.958333	5.833333	-2.937500	994.625000	87.333333	253.750000	5.914357	0.003415	9.166667
2012-01-03	42.500000	-3.208333	-12.975000	1002.125000	95.708333	302.916667	6.250005	0.001327	18.208333
2012-01-04	41.541667	-7.083333	-16.958333	1008.250000	98.750000	286.666667	5.127319	0.000890	22.083333
2012-01-05	46.916667	-0.583333	-9.866667	1002.041667	90.750000	258.333333	5.162041	0.001746	15.583333

In [32]:

plt.figure()
fig = dailyWeather.plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('All daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()

plt.figure()
fig = dailyWeather['2014'].plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()

plt.figure()
fig = dailyWeather['2014'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 14)
plt.show()

<matplotlib.figure.Figure at 0x109fd1bd0>

<matplotlib.figure.Figure at 0x105eb1950>

<matplotlib.figure.Figure at 0x10b8d4b50>

This is a number between 0 and 1. 0 indicated no occupants, 1 indicates normal occupancy. This is an estimate based on holidays, weekends and school academic calendar.

In [33]:

holidays = pd.read_excel('Data/holidays.xlsx')
holidays.head()

Out[33]:

	startDate	endDate	value
0	2011-07-01	2011-09-06	0.5
1	2011-10-10	2011-10-11	0.6
2	2011-11-24	2011-11-28	0.2
3	2011-12-22	2011-12-24	0.1
4	2011-12-24	2012-01-02	0.0

In [34]:

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
occupancy = np.ones(len(hourlyTimestamp)) 

hourlyOccupancy = pd.DataFrame(data = occupancy, index = hourlyTimestamp, columns = ['occupancy'])


Saturdays = hourlyOccupancy.index.weekday == 5
Sundays = hourlyOccupancy.index.weekday == 6
hourlyOccupancy.loc[Saturdays, 'occupancy'] = 0.5 
hourlyOccupancy.loc[Sundays, 'occupancy'] = 0.5 


for i in range(len(holidays)):
    timestamp = pd.date_range(start = holidays.loc[i, 'startDate'], end = holidays.loc[i, 'endDate'], freq = 'H')
    hourlyOccupancy.loc[timestamp, 'occupancy'] = holidays.loc[i, 'value']

#hourlyHolidays['Datetime'] = pd.to_datetime(hourlyHolidays['Datetime'], format="%Y-%m-%d %H:%M:%S")
hourlyOccupancy['cosHour'] = np.cos((hourlyOccupancy.index.hour - 3) * 2 * np.pi / 24)

dailyOccupancy = hourlyOccupancy.resample('D')
dailyOccupancy.drop('cosHour', axis = 1, inplace = True)

Merge energy consumption data with weather and occupancy features¶

In [35]:

hourlyElectricityWithFeatures = hourlyElectricity.join(hourlyWeather, how = 'inner')
hourlyElectricityWithFeatures = hourlyElectricityWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyElectricityWithFeatures.to_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricityWithFeatures.head()

Out[35]:

	electricity-kWh	startTime	endTime	RH-%	T-C	Tdew-C	pressure-mbar	windDirection	windSpeed-m/s	humidityRatio-kg/kg	heatingDegrees	cosHour
2012-01-01 01:00:00	111.479277	2012-01-01 01:00:00	2012-01-01 02:00:00	87	4	1.9	1004	310	4.166670	0.004396	11	0.866025
2012-01-01 02:00:00	117.989395	2012-01-01 02:00:00	2012-01-01 03:00:00	87	4	1.9	1005	280	4.166670	0.004391	11	0.965926
2012-01-01 03:00:00	119.010131	2012-01-01 03:00:00	2012-01-01 04:00:00	81	5	1.9	1006	270	4.166670	0.004380	10	1.000000
2012-01-01 04:00:00	116.005587	2012-01-01 04:00:00	2012-01-01 05:00:00	76	6	1.9	1007	290	4.722226	0.004401	9	0.965926
2012-01-01 05:00:00	111.132977	2012-01-01 05:00:00	2012-01-01 06:00:00	87	4	1.9	1007	280	3.055558	0.004382	11	0.866025

In [36]:

hourlyChilledWaterWithFeatures = hourlyChilledWater.join(hourlyWeather, how = 'inner')
hourlyChilledWaterWithFeatures = hourlyChilledWaterWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyChilledWaterWithFeatures.to_excel('Data/hourlyChilledWaterWithFeatures.xlsx')

hourlySteamWithFeatures = hourlySteam.join(hourlyWeather, how = 'inner')
hourlySteamWithFeatures = hourlySteamWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlySteamWithFeatures.to_excel('Data/hourlySteamWithFeatures.xlsx')

dailyElectricityWithFeatures = dailyElectricity.join(dailyWeather, how = 'inner')
dailyElectricityWithFeatures = dailyElectricityWithFeatures.join(dailyOccupancy, how = 'inner')
dailyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyElectricityWithFeatures.to_excel('Data/dailyElectricityWithFeatures.xlsx')

dailyChilledWaterWithFeatures = dailyChilledWater.join(dailyWeather, how = 'inner')
dailyChilledWaterWithFeatures = dailyChilledWaterWithFeatures.join(dailyOccupancy, how = 'inner')
dailyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyChilledWaterWithFeatures.to_excel('Data/dailyChilledWaterWithFeatures.xlsx')

dailySteamWithFeatures = dailySteam.join(dailyWeather, how = 'inner')
dailySteamWithFeatures = dailySteamWithFeatures.join(dailyOccupancy, how = 'inner')
dailySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
dailySteamWithFeatures.to_excel('Data/dailySteamWithFeatures.xlsx')

A note for features¶

Nomenclature (Alphabetically)¶

coolingDegrees: if T-C - 12 > 0, then = T-C - 12, else = 0. Assume that when outdoor temperature is below 12C, no cooling is needed, which is true for many buildings. This will be useful for daily prediction, because the average of hourly cooling degrees is better than average of hourly temperature.

cosHour: $\text{cos}(\text{hourOfDay} \cdot \frac{2\pi}{24})$

dehumidification: if humidityRatio - 0.00886 > 0, then = humidityRatio - 0.00886, else = 0. This will be useful for chilled water prediction, especially daily chilled water prediction.

heatingDegrees: if 15 - T-C > 0, then = 15 - T-C, else = 0. Assume that when outdoor temperature is above 15C, no heating is needed. This will be useful for daily prediction, because the average of hourly heating degrees is better than average of hourly temperature.

occupancy: A number between 0 and 1. 0 indicated no occupants, 1 indicates normal occupancy. This is an estimate based on holidays, weekends and school academic calendar.

pressure-mbar: atmospheric pressure

RH-% : Relative humidity

T-C : Dry-bulb temperature

Tdew-C : Dew-point temperature

Humidity¶

Humidity ratio is important for chilled water prediction as chilled water is also used to dry the air discharged to rooms. Using humidity ratio will be more efficient and effective than using RH and dew point temperature.