For some large scale buildings in Harvard, there are three types energy consumption, electricity, chilled water and steam. Chilled water is for cooling and steam is for heating. The following figure shows the buildings with chilled water and steam supply from Harvard plants.
<img src="Pics/cw and steam supply.png", style="width:70%">
Fig. Harvard chilled water and steam supply. (Left: chilled water, highlighted in blue. Right: Steam, highlighted in yellow.)
Image source: Harvard University Campus ServiceWe picked one building and got energy consumption data from 2011/07/01 to 2014/10/31. There are several months data missing due to meter malfunction. The data resolution is hourly. In original data, the hourly data is meter readings. So in order to get hourly consumption, we need to offset the data and do the minus. We have hourly data from January 2012 to October 2014 for both weather and energy (2.75 years). The weather data is from weather stations in Cambridge.
In this section, we are going to finish the following tasks.
Get hourly electricity, chilled water and steam from the original data downloaded from Harvard Energy Witness website manually.
Clean weather data, add more features including cooling degrees, heating degress and humidity ratio.
Estimate daily occupancy based on holidays, school year and weekends.
Create features reletate to hour, which is cos(hourOfDay * 2 * pi / 24).
Merge electricity, chilled water and steam dataframe with weather and time and occupancy features.
%matplotlib inline
import requests
from StringIO import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # module for performing linear algebra operations
from __future__ import division
from math import log10,exp
pd.options.display.mpl_style = 'default'
Original data are downloaded from Harvard Energy Witness Website <img src="Pics/energyWitness1.png", style="width:80%"> <img src="Pics/energyWitness3.png", style="width:80%">
Then we use Pandas to put them together into one dataframe.
file = 'Data/Org/0701-0930-2011.xls'
df = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
files = ['Data/Org/1101-1130-2011.xls', 'Data/Org/1201-2011-0131-2012.xls',
'Data/Org/0201-0331-2012.xls','Data/Org/0401-0531-2012.xls','Data/Org/0601-0630-2012.xls',
'Data/Org/0701-0831-2012.xls','Data/Org/0901-1031-2012.xls','Data/Org/1101-1231-2012.xls',
'Data/Org/0101-0228-2013.xls',
'Data/Org/0301-0430-2013.xls','Data/Org/0501-0630-2013.xls','Data/Org/0701-0831-2013.xls',
'Data/Org/0901-1031-2013.xls','Data/Org/1101-1231-2013.xls','Data/Org/0101-0228-2014.xls',
'Data/Org/0301-0430-2014.xls', 'Data/Org/0501-0630-2014.xls', 'Data/Org/0701-0831-2014.xls',
'Data/Org/0901-1031-2014.xls']
for file in files:
data = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
df = df.append(data)
df.head()
WARNING *** file size (2481102) not 512 + multiple of sector size (512) WARNING *** file size (848833) not 512 + multiple of sector size (512) WARNING *** file size (1694257) not 512 + multiple of sector size (512) WARNING *** file size (1640459) not 512 + multiple of sector size (512) WARNING *** file size (1667907) not 512 + multiple of sector size (512) WARNING *** file size (847258) not 512 + multiple of sector size (512) WARNING *** file size (1691449) not 512 + multiple of sector size (512) WARNING *** file size (1666647) not 512 + multiple of sector size (512) WARNING *** file size (1665736) not 512 + multiple of sector size (512) WARNING *** file size (1614814) not 512 + multiple of sector size (512) WARNING *** file size (1665980) not 512 + multiple of sector size (512) WARNING *** file size (1667276) not 512 + multiple of sector size (512) WARNING *** file size (1691736) not 512 + multiple of sector size (512) WARNING *** file size (1666704) not 512 + multiple of sector size (512) WARNING *** file size (1665920) not 512 + multiple of sector size (512) WARNING *** file size (1614900) not 512 + multiple of sector size (512) WARNING *** file size (1666228) not 512 + multiple of sector size (512) WARNING *** file size (1666191) not 512 + multiple of sector size (512) WARNING *** file size (1691845) not 512 + multiple of sector size (512) WARNING *** file size (1663846) not 512 + multiple of sector size (512)
Unnamed: 0 | Unnamed: 1 | Gund Bus-A 15 Min Block Demand - kW | Gund Bus-A CurrentA - Amps | Unnamed: 4 | Unnamed: 5 | Gund Bus-A CurrentB - Amps | Unnamed: 7 | Gund Bus-A CurrentC - Amps | Unnamed: 9 | ... | Gund Main Demand - Tons | Gund Main Energy - Ton-Days | Gund Main FlowRate - gpm | Gund Main FlowTotal - kgal(1000) | Gund Main SignalAeration - Count | Gund Main SignalStrength - Count | Gund Main SonicVelocity - Ft/Sec | Gund Main TempDelta - Deg F | Gund Main TempReturn - Deg F | Gund Main TempSupply - Deg F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-07-01 01:00:00 | White | 48.458733 | 65.977882 | NaN | NaN | 52.631417 | NaN | 55.603840 | NaN | ... | 4.677294 | 17912.537804 | 6.916454 | 48168.083414 | 0.693405 | 57.208127 | 1437.640543 | 16.238684 | 59.757447 | 43.516103 |
1 | 2011-07-01 02:00:00 | White | 40.472697 | 57.230223 | NaN | NaN | 42.483092 | NaN | 50.243230 | NaN | ... | 4.586403 | 17912.853518 | 6.739337 | 48168.645429 | 0.567355 | 57.082909 | 1438.030719 | 16.263573 | 59.710199 | 43.495128 |
2 | 2011-07-01 03:00:00 | #d2e4b0 | 39.472809 | 55.487443 | NaN | NaN | 41.911784 | NaN | 48.482163 | NaN | ... | 4.462877 | 17913.169232 | 6.725142 | 48169.207444 | 0.441304 | 57.001646 | 1439.111130 | 15.797043 | 59.248158 | 43.457344 |
3 | 2011-07-01 04:00:00 | White | 39.198879 | 55.849806 | NaN | NaN | 41.525529 | NaN | 48.987457 | NaN | ... | 4.696993 | 17913.484946 | 7.041330 | 48169.769458 | 0.315254 | 57.000000 | 1440.768604 | 15.947392 | 59.207097 | 43.267682 |
4 | 2011-07-01 05:00:00 | White | 39.297522 | 55.736219 | NaN | NaN | 41.299381 | NaN | 48.710408 | NaN | ... | 4.550372 | 17913.800660 | 6.863004 | 48170.331473 | 0.189204 | 57.000000 | 1442.426077 | 15.903679 | 59.282707 | 43.372615 |
5 rows × 55 columns
Above is the original hourly data.
As you can see, it is quite messy. The first thing to remove meaningless columns.
df.rename(columns={'Unnamed: 0':'Datetime'}, inplace=True)
nonBlankColumns = ['Unnamed' not in s for s in df.columns]
columns = df.columns[nonBlankColumns]
df = df[columns]
df = df.set_index(['Datetime'])
df.index.name = None
df.head()
Gund Bus-A 15 Min Block Demand - kW | Gund Bus-A CurrentA - Amps | Gund Bus-A CurrentB - Amps | Gund Bus-A CurrentC - Amps | Gund Bus-A CurrentN - Amps | Gund Bus-A EnergyReal - kWhr | Gund Bus-A Freq - Hertz | Gund Bus-A Max Monthly Demand - kW | Gund Bus-A PowerApp - kVA | Gund Bus-A PowerReac - kVAR | ... | Gund Main Demand - Tons | Gund Main Energy - Ton-Days | Gund Main FlowRate - gpm | Gund Main FlowTotal - kgal(1000) | Gund Main SignalAeration - Count | Gund Main SignalStrength - Count | Gund Main SonicVelocity - Ft/Sec | Gund Main TempDelta - Deg F | Gund Main TempReturn - Deg F | Gund Main TempSupply - Deg F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-07-01 01:00:00 | 48.458733 | 65.977882 | 52.631417 | 55.603840 | 15.982278 | 1796757.502803 | 59.837524 | 96.117915 | 48.757073 | 12.344712 | ... | 4.677294 | 17912.537804 | 6.916454 | 48168.083414 | 0.693405 | 57.208127 | 1437.640543 | 16.238684 | 59.757447 | 43.516103 |
2011-07-01 02:00:00 | 40.472697 | 57.230223 | 42.483092 | 50.243230 | 13.423762 | 1796800.145991 | 60.005569 | 96.117915 | 42.238685 | 12.967984 | ... | 4.586403 | 17912.853518 | 6.739337 | 48168.645429 | 0.567355 | 57.082909 | 1438.030719 | 16.263573 | 59.710199 | 43.495128 |
2011-07-01 03:00:00 | 39.472809 | 55.487443 | 41.911784 | 48.482163 | 13.478933 | 1796840.146023 | 59.833880 | 96.117915 | 41.278573 | 12.732046 | ... | 4.462877 | 17913.169232 | 6.725142 | 48169.207444 | 0.441304 | 57.001646 | 1439.111130 | 15.797043 | 59.248158 | 43.457344 |
2011-07-01 04:00:00 | 39.198879 | 55.849806 | 41.525529 | 48.987457 | 13.603309 | 1796879.023607 | 59.673044 | 96.117915 | 41.345776 | 12.687845 | ... | 4.696993 | 17913.484946 | 7.041330 | 48169.769458 | 0.315254 | 57.000000 | 1440.768604 | 15.947392 | 59.207097 | 43.267682 |
2011-07-01 05:00:00 | 39.297522 | 55.736219 | 41.299381 | 48.710408 | 13.797331 | 1796918.273558 | 59.986672 | 96.117915 | 41.166736 | 12.437842 | ... | 4.550372 | 17913.800660 | 6.863004 | 48170.331473 | 0.189204 | 57.000000 | 1442.426077 | 15.903679 | 59.282707 | 43.372615 |
5 rows × 48 columns
Then we print out all the column names. Only a few columns are useful to get the hourly electricity, chilled water and steam.
for item in df.columns:
print item
Gund Bus-A 15 Min Block Demand - kW Gund Bus-A CurrentA - Amps Gund Bus-A CurrentB - Amps Gund Bus-A CurrentC - Amps Gund Bus-A CurrentN - Amps Gund Bus-A EnergyReal - kWhr Gund Bus-A Freq - Hertz Gund Bus-A Max Monthly Demand - kW Gund Bus-A PowerApp - kVA Gund Bus-A PowerReac - kVAR Gund Bus-A PowerReal - kW Gund Bus-A TruePF - PF Gund Bus-A VoltageAB - Volts Gund Bus-A VoltageAN - Volts Gund Bus-A VoltageBC - Volts Gund Bus-A VoltageBN - Volts Gund Bus-A VoltageCA - Volts Gund Bus-A VoltageCN - Volts Gund Bus-B 15 Min Block Demand - kW Gund Bus-B CurrentA - Amps Gund Bus-B CurrentB - Amps Gund Bus-B CurrentC - Amps Gund Bus-B CurrentN - Amps Gund Bus-B EnergyReal - kWhr Gund Bus-B Freq - Hertz Gund Bus-B Max Monthly Demand - kW Gund Bus-B PowerApp - kVA Gund Bus-B PowerReac - kVAR Gund Bus-B PowerReal - kW Gund Bus-B TruePF - PF Gund Bus-B VoltageAB - Volts Gund Bus-B VoltageAN - Volts Gund Bus-B VoltageBC - Volts Gund Bus-B VoltageBN - Volts Gund Bus-B VoltageCA - Volts Gund Bus-B VoltageCN - Volts Gund Condensate Counter - count Gund Condensate FlowTotal - LBS Gund Main Demand - Tons Gund Main Energy - Ton-Days Gund Main FlowRate - gpm Gund Main FlowTotal - kgal(1000) Gund Main SignalAeration - Count Gund Main SignalStrength - Count Gund Main SonicVelocity - Ft/Sec Gund Main TempDelta - Deg F Gund Main TempReturn - Deg F Gund Main TempSupply - Deg F
Use electricity as an example, there are two meters, "Gund Bus A" and "Gund Bus B". The "EnergyReal - kWhr" records accumulative consumption. We are not sure exactly what is "PowerReal". Just in case, we put it into the electricity dateframe as well.
electricity=df[['Gund Bus-A EnergyReal - kWhr','Gund Bus-B EnergyReal - kWhr',
'Gund Bus-A PowerReal - kW','Gund Bus-B PowerReal - kW',]]
electricity.head()
Gund Bus-A EnergyReal - kWhr | Gund Bus-B EnergyReal - kWhr | Gund Bus-A PowerReal - kW | Gund Bus-B PowerReal - kW | |
---|---|---|---|---|
2011-07-01 01:00:00 | 1796757.502803 | 3657811.582122 | 47.184015 | 63.486186 |
2011-07-01 02:00:00 | 1796800.145991 | 3657873.464938 | 40.208796 | 61.270542 |
2011-07-01 03:00:00 | 1796840.146023 | 3657934.837505 | 39.209866 | 61.464394 |
2011-07-01 04:00:00 | 1796879.023607 | 3657995.470348 | 39.378507 | 59.396581 |
2011-07-01 05:00:00 | 1796918.273558 | 3658054.470285 | 39.240837 | 58.911729 |
In order to check whether our understanding of the data is correct, we want to calculate monthly electricity consumption from the hourly data, and then compare the results with the monthly data provided by facalities, which is available on Energy Witness as well.
Here is the monthly data provided by facalities, "Bus A & B" are called "CE603B kWh" and "CE604B kWh" in ths monthly form. They are simply two meters. Please note, the meter reading cycle is NOT calendar month.
file = 'Data/monthly electricity.csv'
monthlyElectricityFromFacility = pd.read_csv(file, header=0)
monthlyElectricityFromFacility
monthlyElectricityFromFacility = monthlyElectricityFromFacility.set_index(['month'])
monthlyElectricityFromFacility.head()
startDate | endDate | CE603B kWh | CE604B kWh | |
---|---|---|---|---|
month | ||||
Jul 11 | 6/16/11 | 7/18/11 | 43968.1 | 106307.1 |
Aug 11 | 7/18/11 | 8/17/11 | 41270.1 | 83121.1 |
Sep 11 | 8/17/11 | 9/16/11 | 51514.1 | 107083.1 |
Oct 11 | 9/16/11 | 10/18/11 | 65338.1 | 114350.1 |
Nov 11 | 10/18/11 | 11/17/11 | 65453.1 | 115318.1 |
We use the column "EnergyReal - kWhr" for two meters. We found the numbers of the begin and end date of the meter reading cycle, and use the number of the end date minus the number of the begin date, then we get our monthly electiricty consumption.
monthlyElectricityFromFacility['startDate'] = pd.to_datetime(monthlyElectricityFromFacility['startDate'], format="%m/%d/%y")
values = monthlyElectricityFromFacility.index.values
keys = np.array(monthlyElectricityFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlyElectricityOrg = np.zeros((len(sortedDates) - 1, 2))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlyElectricityOrg[i, 0] = (np.round(electricity.loc[end,'Gund Bus-A EnergyReal - kWhr']
- electricity.loc[begin,'Gund Bus-A EnergyReal - kWhr'], 1))
monthlyElectricityOrg[i, 1] = (np.round(electricity.loc[end,'Gund Bus-B EnergyReal - kWhr']
- electricity.loc[begin,'Gund Bus-B EnergyReal - kWhr'], 1))
monthlyElectricity = pd.DataFrame(data = monthlyElectricityOrg, index = months, columns = ['CE603B kWh', 'CE604B kWh'])
plt.figure()
fig, ax = plt.subplots()
fig = monthlyElectricity.plot(marker = 'o', figsize=(15,6), rot = 40, fontsize = 13, ax = ax, linestyle='')
fig.set_axis_bgcolor('w')
plt.xlabel('Billing month', fontsize = 15)
plt.ylabel('kWh', fontsize = 15)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
plt.xticks(np.arange(0,len(months)),months)
plt.title('Original monthly consumption from hourly data',fontsize = 17)
text = 'Meter malfunction'
ax.annotate(text, xy = (9, 4500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'top')
ax.annotate(text, xy = (8, -4500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (14, -2500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (15, 2500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'top')
plt.show()
<matplotlib.figure.Figure at 0x1044639d0>
Above is a plot of monthly electricity consumption using our data processing method. Obviously, the two meters malfunctioned for several months. There are two sets of dots "CE603B" and "CE604B", which come from two meters. There are two electricity meters. The sum of them is the total electricity consumption of the building.
monthlyElectricity.loc['Aug 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Sep 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Feb 13','CE603B kWh'] = np.nan
monthlyElectricity.loc['Mar 13','CE603B kWh'] = np.nan
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5,monthlyElectricity['CE603B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE603B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE603B',fontsize=20)
text = 'Meter malfunction \n estimated by facility'
ax.annotate(text, xy = (14, monthlyElectricityFromFacility.loc['Feb 13','CE603B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (15, monthlyElectricityFromFacility.loc['Mar 13','CE603B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5, monthlyElectricity['CE604B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE604B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE604B',fontsize=20)
ax.annotate(text, xy = (9, monthlyElectricityFromFacility.loc['Sep 12','CE604B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (8, monthlyElectricityFromFacility.loc['Aug 12','CE604B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
The dots in the picture are monthly data provided by facility. They are the numbers on a bill. When meters were malfunctioning, the facility just estimated the energy use for billin purpose. The data points we excluded are not in the plot. They are simply way higher than normal (30 times more than other normal consumption). "Meter malfunction, estimated by facility" means in this month, the meter was malfunctioning and this dot is what facilities estimated.
Meter "CE603B" malfunctioned on February and March 2013. Meter "CE604B" malfunctioned on August and September 2012. We set the nubmers of these months to np.nan and simply exclude them from regression. Plot them againa and compare with the facility monthly data. They match! Of course, excep when the meters were malfunctioning. In thoses months, the facilities estimated the energy consumption for billing purpose. For us, we just excluded hourly and daily data points during thoese months in regression.
The total building electricity consumption is the sum of these two meters. Just in case, we keep the "power" and compare them with "energy"
electricity['energy'] = electricity['Gund Bus-A EnergyReal - kWhr'] + electricity['Gund Bus-B EnergyReal - kWhr']
electricity['power'] = electricity['Gund Bus-A PowerReal - kW'] + electricity['Gund Bus-B PowerReal - kW']
electricity.head()
Gund Bus-A EnergyReal - kWhr | Gund Bus-B EnergyReal - kWhr | Gund Bus-A PowerReal - kW | Gund Bus-B PowerReal - kW | energy | power | startTime | endTime | |
---|---|---|---|---|---|---|---|---|
2011-07-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | 2011-07-01 00:00:00 | 2011-07-01 01:00:00 |
2011-07-01 01:00:00 | 1796757.502803 | 3657811.582122 | 47.184015 | 63.486186 | 5454569.084925 | 110.670201 | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 1796800.145991 | 3657873.464938 | 40.208796 | 61.270542 | 5454673.610929 | 101.479338 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 1796840.146023 | 3657934.837505 | 39.209866 | 61.464394 | 5454774.983528 | 100.674260 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 1796879.023607 | 3657995.470348 | 39.378507 | 59.396581 | 5454874.493955 | 98.775088 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
The hourly energy consumption between this hour and next hour is the meter kWh reading ("energy" in the dataframe) of the next hour minus this hour. We assume the meter reading is recorded at the end of hour. In order to avoid confusion, we also mark the startTime and endTime of the meter reading in the dataframe. We compared hourly power of next hour with the derived hourly electricity, most of the time, power is close to hourly electricity. Sometimes, there is quite a difference.
The meters record accumulative energy consumption. Say at the beginning of today, the number is 10. After an hour, the energy use is 1, then the meter number adds one and become 11, etc. So the meter number should keep increasing. However, we found that sometimes, the meter reading suddenly drops to 0, and after quite a while, get a high positive number again. This will create negative hourly consumption, and then an absurdly high postive number. We get hourly/daily energy consumption by "meter reading at time t + 1 - at time t". If the result is negative, or absurdly high, we consider the meter was malfunctioning. We set these data points to np.nan now. We figured out the exact dates when the meters were malfunctioning by looking at the raw data in excel files and from monthly plots.
In addition to that, occasionaly, there are more non-sense data points, which are thousand times higher than normal values. The normal hourly consumption range is between 100 and 400. We created a filter: "index = abs(hourlyEnergy) < 200000", which means only values lower than 200000 are kept.
# In case there are any missing hours, reindex to get the entire time span. Fill in nan data.
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
# Somehow, reindex does not work well. October 2011 and several other hours are missing.
# Basically it is just the length of original length.
#electricity.reindex(hourlyTimestamp, inplace = True, fill_value = np.nan)
startTime = hourlyTimestamp
endTime = hourlyTimestamp + np.timedelta64(1,'h')
hourlyTime = pd.DataFrame(data = np.transpose([startTime, endTime]), index = hourlyTimestamp, columns = ['startTime', 'endTime'])
electricity = electricity.join(hourlyTime, how = 'outer')
# Just in case, in order to use diff method, timestamp has to be in asending order.
electricity.sort_index(inplace = True)
hourlyEnergy = electricity.diff(periods=1)['energy']
hourlyElectricity = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['electricity-kWh'])
hourlyElectricity = hourlyElectricity.join(hourlyTime, how = 'inner')
print "Data length: ", len(hourlyElectricity)/24, " days"
hourlyElectricity.head()
Data length: 1218.04166667 days
electricity-kWh | startTime | endTime | |
---|---|---|---|
2011-07-01 00:00:00 | NaN | 2011-07-01 00:00:00 | 2011-07-01 01:00:00 |
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 104.526004 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 101.372599 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 99.510428 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
Above is the hourly electricity consumption. The data is exported to an excel file. We assume the meter reading is recorded at the end of hour.
# Filter the data, keep the NaN and generate two excels, with and without Nan
hourlyElectricity.loc[abs(hourlyElectricity['electricity-kWh']) > 100000,'electricity-kWh'] = np.nan
time = hourlyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) \
| ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))
hourlyElectricity.loc[index,'electricity-kWh'] = np.nan
hourlyElectricityWithoutNaN = hourlyElectricity.dropna(axis=0, how='any')
hourlyElectricity.to_excel('Data/hourlyElectricity.xlsx')
hourlyElectricityWithoutNaN.to_excel('Data/hourlyElectricityWithoutNaN.xlsx')
plt.figure()
fig = hourlyElectricity.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()
plt.figure()
fig = hourlyElectricity.iloc[26200:27400,:].plot(marker = 'o',label='hourly electricity', fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('Hourly electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.legend()
plt.show()
<matplotlib.figure.Figure at 0x1096bca90>
<matplotlib.figure.Figure at 0x1096a1750>
Above are the hourly data plots. In the first graph, the blank part is due to missing data when the meters were malfunctioning. In the second graph, you can see difference between day and night, weekday and weekends.
We manage to get the daily electricity consumption through "reindex".
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
electricityReindexed = electricity.reindex(dailyTimestamp, inplace = False)
# Just in case, in order to use diff method, timestamp has to be in asending order.
electricityReindexed.sort_index(inplace = True)
dailyEnergy = electricityReindexed.diff(periods=1)['energy']
dailyElectricity = pd.DataFrame(data = dailyEnergy.values, index = electricityReindexed.index - np.timedelta64(1,'D'), columns = ['electricity-kWh'])
dailyElectricity['startDay'] = dailyElectricity.index
dailyElectricity['endDay'] = dailyElectricity.index + np.timedelta64(1,'D')
# Filter the data, keep the NaN and generate two excels, with and without Nan
dailyElectricity.loc[abs(dailyElectricity['electricity-kWh']) > 2000000,'electricity-kWh'] = np.nan
time = dailyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))
dailyElectricity.loc[index,'electricity-kWh'] = np.nan
dailyElectricityWithoutNaN = dailyElectricity.dropna(axis=0, how='any')
dailyElectricity.to_excel('Data/dailyElectricity.xlsx')
dailyElectricityWithoutNaN.to_excel('Data/dailyElectricityWithoutNaN.xlsx')
dailyElectricity.head()
electricity-kWh | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 3630.398480 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 3750.648885 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 4568.724044 | 2011-07-04 | 2011-07-05 |
Above is the daily electricity consumption.
plt.figure()
fig = dailyElectricity.plot(figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All the daily electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()
plt.figure()
fig = dailyElectricity.iloc[1000:1130,:].plot(marker = 'o', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('Daily electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.show()
<matplotlib.figure.Figure at 0x10a556a90>
<matplotlib.figure.Figure at 0x109b48150>
Above is the daily electricity plot.
The super low electricity consumption happens after semesters end, including Christmas vacations. In addition, as you can see in the second figure, the summer energy consumption is lower when school starts.
We clean the chilled water data in the same way as electricity.
chilledWater = df[['Gund Main Energy - Ton-Days']]
chilledWater.head()
Gund Main Energy - Ton-Days | |
---|---|
2011-07-01 01:00:00 | 17912.537804 |
2011-07-01 02:00:00 | 17912.853518 |
2011-07-01 03:00:00 | 17913.169232 |
2011-07-01 04:00:00 | 17913.484946 |
2011-07-01 05:00:00 | 17913.800660 |
file = 'Data/monthly chilled water.csv'
monthlyChilledWaterFromFacility = pd.read_csv(file, header=0)
monthlyChilledWaterFromFacility.set_index(['month'], inplace = True)
monthlyChilledWaterFromFacility.head()
startDate | endDate | chilledWater | |
---|---|---|---|
month | |||
11-Jul | 6/12/11 | 7/12/11 | 2258 |
11-Aug | 7/12/11 | 8/12/11 | 2095 |
11-Sep | 8/12/11 | 9/12/11 | 2200 |
11-Oct | 9/12/11 | 10/12/11 | 1664 |
11-Nov | 10/12/11 | 11/12/11 | 447 |
monthlyChilledWaterFromFacility['startDate'] = pd.to_datetime(monthlyChilledWaterFromFacility['startDate'], format="%m/%d/%y")
values = monthlyChilledWaterFromFacility.index.values
keys = np.array(monthlyChilledWaterFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlyChilledWaterOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlyChilledWaterOrg[i] = (np.round(chilledWater.loc[end,:] - chilledWater.loc[begin,:], 1))
monthlyChilledWater = pd.DataFrame(data = monthlyChilledWaterOrg, index = months, columns = ['chilledWater-TonDays'])
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlyChilledWater, label='Our data processing from hourly data', marker = 'x', markersize = 15, linestyle = '')
plt.bar(np.arange(len(monthlyChilledWater))-0.5, monthlyChilledWater.values, label='Our data processing from hourly data')
plt.plot(monthlyChilledWaterFromFacility[5:-1]['chilledWater'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Data Validation: comparison between our data processing and facilities',fontsize=20)
text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (15, 2000),
xytext = (5, 50), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
chilledWater.reindex(hourlyTimestamp, inplace = True)
# Just in case, in order to use diff method, timestamp has to be in asending order.
chilledWater.sort_index(inplace = True)
hourlyEnergy = chilledWater.diff(periods=1)
hourlyChilledWater = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['chilledWater-TonDays'])
hourlyChilledWater['startTime'] = hourlyChilledWater.index
hourlyChilledWater['endTime'] = hourlyChilledWater.index + np.timedelta64(1,'h')
hourlyChilledWater.loc[abs(hourlyChilledWater['chilledWater-TonDays']) > 50,'chilledWater-TonDays'] = np.nan
hourlyChilledWaterWithoutNaN = hourlyChilledWater.dropna(axis=0, how='any')
hourlyChilledWater.to_excel('Data/hourlyChilledWater.xlsx')
hourlyChilledWaterWithoutNaN.to_excel('Data/hourlyChilledWaterWithoutNaN.xlsx')
plt.figure()
fig = hourlyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()
hourlyChilledWater.head()
<matplotlib.figure.Figure at 0x10f0df950>
chilledWater-TonDays | startTime | endTime | |
---|---|---|---|
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 0.315714 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 0.315714 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 0.315714 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
2011-07-01 05:00:00 | 0.315714 | 2011-07-01 05:00:00 | 2011-07-01 06:00:00 |
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
chilledWaterReindexed = chilledWater.reindex(dailyTimestamp, inplace = False)
chilledWaterReindexed.sort_index(inplace = True)
dailyEnergy = chilledWaterReindexed.diff(periods=1)['Gund Main Energy - Ton-Days']
dailyChilledWater = pd.DataFrame(data = dailyEnergy.values, index = chilledWaterReindexed.index - np.timedelta64(1,'D'), columns = ['chilledWater-TonDays'])
dailyChilledWater['startDay'] = dailyChilledWater.index
dailyChilledWater['endDay'] = dailyChilledWater.index + np.timedelta64(1,'D')
dailyChilledWaterWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')
dailyChilledWater.to_excel('Data/dailyChilledWater.xlsx')
dailyChilledWaterWithoutNaN.to_excel('Data/dailyChilledWaterWithoutNaN.xlsx')
plt.figure()
fig = dailyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()
dailyChilledWater.head()
<matplotlib.figure.Figure at 0x10a022050>
chilledWater-TonDays | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 54.741028 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 55.649728 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 109.049077 | 2011-07-04 | 2011-07-05 |
steam = df[['Gund Condensate FlowTotal - LBS']]
steam.head()
Gund Condensate FlowTotal - LBS | |
---|---|
2011-07-01 01:00:00 | 15443350.388455 |
2011-07-01 02:00:00 | 15443459.322917 |
2011-07-01 03:00:00 | 15443574.687500 |
2011-07-01 04:00:00 | 15443701.953125 |
2011-07-01 05:00:00 | 15443818.359375 |
file = 'Data/monthly steam.csv'
monthlySteamFromFacility = pd.read_csv(file, header=0)
monthlySteamFromFacility.set_index(['month'], inplace = True)
monthlySteamFromFacility.head()
startDate | endDate | steam | |
---|---|---|---|
month | |||
Jul 11 | 6/17/2011 | 7/21/2011 | 0.0 |
Aug 11 | 7/21/2011 | 8/20/2011 | 0.0 |
Sep 11 | 8/20/2011 | 9/17/2011 | 0.0 |
Oct 11 | 9/17/2011 | 10/20/2011 | 246.5 |
Nov 11 | 10/20/2011 | 11/20/2011 | 786.1 |
monthlySteamFromFacility['startDate'] = pd.to_datetime(monthlySteamFromFacility['startDate'], format="%m/%d/%Y")
values = monthlySteamFromFacility.index.values
keys = np.array(monthlySteamFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlySteamOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlySteamOrg[i] = (np.round(steam.loc[end,:] - steam.loc[begin,:], 1))
monthlySteam = pd.DataFrame(data = monthlySteamOrg, index = months, columns = ['steam-LBS'])
# 867 LBS ~= 1MMBTU steam
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlySteam/867, label='Our data processing from hourly data')
plt.bar(np.arange(len(monthlySteam))-0.5, monthlySteam.values/867, label='Our data processing from hourly data')
plt.plot(monthlySteamFromFacility.loc[months,'steam'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('Steam (MMBTU)',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities - Steam',fontsize=20)
text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (9, 1500),
xytext = (5, 50), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
steam.reindex(hourlyTimestamp, inplace = True)
# Just in case, in order to use diff method, timestamp has to be in asending order.
steam.sort_index(inplace = True)
hourlyEnergy = steam.diff(periods=1)
hourlySteam = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['steam-LBS'])
hourlySteam['startTime'] = hourlySteam.index
hourlySteam['endTime'] = hourlySteam.index + np.timedelta64(1,'h')
hourlySteam.loc[abs(hourlySteam['steam-LBS']) > 100000,'steam-LBS'] = np.nan
plt.figure()
fig = hourlySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 17)
fig.set_axis_bgcolor('w')
plt.title('All the hourly steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()
hourlySteamWithoutNaN = hourlySteam.dropna(axis=0, how='any')
hourlySteam.to_excel('Data/hourlySteam.xlsx')
hourlySteamWithoutNaN.to_excel('Data/hourlySteamWithoutNaN.xlsx')
hourlySteam.head()
<matplotlib.figure.Figure at 0x10a53ecd0>
steam-LBS | startTime | endTime | |
---|---|---|---|
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 108.934462 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 115.364583 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 127.265625 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
2011-07-01 05:00:00 | 116.406250 | 2011-07-01 05:00:00 | 2011-07-01 06:00:00 |
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
steamReindexed = steam.reindex(dailyTimestamp, inplace = False)
steamReindexed.sort_index(inplace = True)
dailyEnergy = steamReindexed.diff(periods=1)['Gund Condensate FlowTotal - LBS']
dailySteam = pd.DataFrame(data = dailyEnergy.values, index = steamReindexed.index - np.timedelta64(1,'D'), columns = ['steam-LBS'])
dailySteam['startDay'] = dailySteam.index
dailySteam['endDay'] = dailySteam.index + np.timedelta64(1,'D')
plt.figure()
fig = dailySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()
dailySteamWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')
dailySteam.to_excel('Data/dailySteam.xlsx')
dailySteamWithoutNaN.to_excel('Data/dailySteamWithoutNaN.xlsx')
dailySteam.head()
<matplotlib.figure.Figure at 0x107e2d150>
steam-LBS | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 3250.651042 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 3271.276042 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 3236.718750 | 2011-07-04 | 2011-07-05 |
There are two sources for weather data.
Here is the orginal weather data after a little cleaning including unit conversion. AS you can see, the interval is 5-minute.
weather2014 = pd.read_excel('Data/weather-2014.xlsx')
weather2014.head()
Datetime | T-C | RH-% | Tdew-C | windDirection | windSpeed-m/s | pressure-mbar | solarRadiation-W/m2 | |
---|---|---|---|---|---|---|---|---|
0 | 2014-01-01 00:00:00 | -5.02 | 53.2 | -13.07 | 256 | 4.5 | 1020.8 | 1 |
1 | 2014-01-01 00:05:00 | -5.14 | 54.3 | -12.93 | 257 | 4.0 | 1020.7 | 1 |
2 | 2014-01-01 00:10:00 | -5.08 | 53.4 | -13.08 | 258 | 3.5 | 1021.1 | 1 |
3 | 2014-01-01 00:15:00 | -5.17 | 52.6 | -13.35 | 257 | 2.8 | 1021.0 | 1 |
4 | 2014-01-01 00:20:00 | -5.23 | 52.9 | -13.33 | 248 | 2.5 | 1021.2 | 1 |
Convert to hourly by resampling method.
Here is the clean hourly weather data of year 2014.
weather2014 = weather2014.set_index('Datetime')
weather2014 = weather2014.resample('H')
weather2014.head()
T-C | RH-% | Tdew-C | windDirection | windSpeed-m/s | pressure-mbar | solarRadiation-W/m2 | |
---|---|---|---|---|---|---|---|
Datetime | |||||||
2014-01-01 00:00:00 | -5.281667 | 52.858333 | -13.393333 | 253.500000 | 2.775000 | 1021.158333 | 1 |
2014-01-01 01:00:00 | -5.725000 | 51.650000 | -14.090833 | 253.000000 | 2.350000 | 1021.700000 | 1 |
2014-01-01 02:00:00 | -6.002500 | 50.766667 | -14.555833 | 239.250000 | 1.133333 | 1022.708333 | 1 |
2014-01-01 03:00:00 | -6.320833 | 49.675000 | -15.114167 | 234.333333 | 1.191667 | 1023.233333 | 1 |
2014-01-01 04:00:00 | -6.535833 | 49.708333 | -15.305833 | 227.333333 | 1.633333 | 1023.925000 | 1 |
Here is the orginal weather data of the year 2013 and 2014 after a little cleaning including unit conversion. As you can see, the Datetime format is not correct.
weather2012and2013 = pd.read_excel('Data/weather-2012-2013.xlsx')
weather2012and2013.head()
Datetime | RH-% | windDirection | solarRadiation-W/m2 | T-C | Tdew-C | pressure-mbar | windSpeed-m/s | |
---|---|---|---|---|---|---|---|---|
0 | 2012-01-01-01 | 87 | 310 | 0 | 4 | 1.9 | 1004 | 4.166670 |
1 | 2012-01-01-02 | 87 | 280 | 0 | 4 | 1.9 | 1005 | 4.166670 |
2 | 2012-01-01-03 | 81 | 270 | 0 | 5 | 1.9 | 1006 | 4.166670 |
3 | 2012-01-01-04 | 76 | 290 | 0 | 6 | 1.9 | 1007 | 4.722226 |
4 | 2012-01-01-05 | 87 | 280 | 0 | 4 | 1.9 | 1007 | 3.055558 |
Correct the timestep.
Here is the hourly data after cleaning.
weather2012and2013['Datetime'] = pd.to_datetime(weather2012and2013['Datetime'], format='%Y-%m-%d-%H')
weather2012and2013 = weather2012and2013.set_index('Datetime')
weather2012and2013.head()
RH-% | windDirection | solarRadiation-W/m2 | T-C | Tdew-C | pressure-mbar | windSpeed-m/s | |
---|---|---|---|---|---|---|---|
Datetime | |||||||
2012-01-01 01:00:00 | 87 | 310 | 0 | 4 | 1.9 | 1004 | 4.166670 |
2012-01-01 02:00:00 | 87 | 280 | 0 | 4 | 1.9 | 1005 | 4.166670 |
2012-01-01 03:00:00 | 81 | 270 | 0 | 5 | 1.9 | 1006 | 4.166670 |
2012-01-01 04:00:00 | 76 | 290 | 0 | 6 | 1.9 | 1007 | 4.722226 |
2012-01-01 05:00:00 | 87 | 280 | 0 | 4 | 1.9 | 1007 | 3.055558 |
Combine two files and add more features including cooling degrees, heating degrees, humidity ratio and dehumidification.
Here is all the hourly weather data.
# Combine two weather files
hourlyWeather = weather2014.append(weather2012and2013)
hourlyWeather.index.name = None
hourlyWeather.sort_index(inplace = True)
# Add more features
# Convert relative humidity to specific humidity
Mw=18.0160 # molecular weight of water
Md=28.9660 # molecular weight of dry air
R = 8.31432E3 # gas constant
Rd = R/Md # specific gas constant for dry air
Rv = R/Mw # specific gas constant for vapour
Lv = 2.5e6 # heat release for condensation of water vapour [J kg-1]
eps = Mw/Md
#saturation pressure
def esat(T):
''' get sateration pressure (units [Pa]) for a given air temperature (units [K])'''
from numpy import log10
TK = 273.15
e1 = 101325.0
logTTK = log10(T/TK)
esat = e1*10**(10.79586*(1-TK/T)-5.02808*logTTK+ 1.50474*1e-4*(1.-10**(-8.29692*(T/TK-1)))+ 0.42873*1e-3*(10**(4.76955*(1-TK/T))-1)-2.2195983)
return esat
def rh2sh(RH,p,T):
'''purpose: conversion relative humidity (unitless) to specific humidity (humidity ratio) [kg/kg]'''
es = esat(T)
W = Mw/Md*RH*es/(p-RH*es)
return W/(1.+W)
p = hourlyWeather['pressure-mbar'] * 100
RH = hourlyWeather['RH-%'] / 100
T = hourlyWeather['T-C'] + 273.15
w = rh2sh(RH,p,T)
hourlyWeather['humidityRatio-kg/kg'] = w
hourlyWeather['coolingDegrees'] = hourlyWeather['T-C'] - 12
hourlyWeather.loc[hourlyWeather['coolingDegrees'] < 0, 'coolingDegrees'] = 0
hourlyWeather['heatingDegrees'] = 15 - hourlyWeather['T-C']
hourlyWeather.loc[hourlyWeather['heatingDegrees'] < 0, 'heatingDegrees'] = 0
hourlyWeather['dehumidification'] = hourlyWeather['humidityRatio-kg/kg'] - 0.00886
hourlyWeather.loc[hourlyWeather['dehumidification'] < 0, 'dehumidification'] = 0
#hourlyWeather.to_excel('Data/hourlyWeather.xlsx')
hourlyWeather.head()
RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 01:00:00 | 87 | 4 | 1.9 | 1004 | 0 | 310 | 4.166670 | 0.004396 | 0 | 11 | 0 |
2012-01-01 02:00:00 | 87 | 4 | 1.9 | 1005 | 0 | 280 | 4.166670 | 0.004391 | 0 | 11 | 0 |
2012-01-01 03:00:00 | 81 | 5 | 1.9 | 1006 | 0 | 270 | 4.166670 | 0.004380 | 0 | 10 | 0 |
2012-01-01 04:00:00 | 76 | 6 | 1.9 | 1007 | 0 | 290 | 4.722226 | 0.004401 | 0 | 9 | 0 |
2012-01-01 05:00:00 | 87 | 4 | 1.9 | 1007 | 0 | 280 | 3.055558 | 0.004382 | 0 | 11 | 0 |
plt.figure()
fig = hourlyWeather.plot(y = 'T-C', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)')
plt.show()
plt.figure()
fig = hourlyWeather.plot(y = 'solarRadiation-W/m2', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()
plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly temperture',fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)',fontsize = 13)
plt.show()
plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()
<matplotlib.figure.Figure at 0x10c8d2c10>
<matplotlib.figure.Figure at 0x107d9a690>
<matplotlib.figure.Figure at 0x10cab2f10>
<matplotlib.figure.Figure at 0x1123f1bd0>
dailyWeather = hourlyWeather.resample('D')
#dailyWeather.to_excel('Data/dailyWeather.xlsx')
dailyWeather.head()
RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 | 76.652174 | 7.173913 | 3.073913 | 1004.956522 | 95.260870 | 236.086957 | 4.118361 | 0.004796 | 0 | 7.826087 | 0 |
2012-01-02 | 55.958333 | 5.833333 | -2.937500 | 994.625000 | 87.333333 | 253.750000 | 5.914357 | 0.003415 | 0 | 9.166667 | 0 |
2012-01-03 | 42.500000 | -3.208333 | -12.975000 | 1002.125000 | 95.708333 | 302.916667 | 6.250005 | 0.001327 | 0 | 18.208333 | 0 |
2012-01-04 | 41.541667 | -7.083333 | -16.958333 | 1008.250000 | 98.750000 | 286.666667 | 5.127319 | 0.000890 | 0 | 22.083333 | 0 |
2012-01-05 | 46.916667 | -0.583333 | -9.866667 | 1002.041667 | 90.750000 | 258.333333 | 5.162041 | 0.001746 | 0 | 15.583333 | 0 |
plt.figure()
fig = dailyWeather.plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('All daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()
plt.figure()
fig = dailyWeather['2014'].plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()
plt.figure()
fig = dailyWeather['2014'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 14)
plt.show()
<matplotlib.figure.Figure at 0x109fd1bd0>
<matplotlib.figure.Figure at 0x105eb1950>
<matplotlib.figure.Figure at 0x10b8d4b50>
This is a number between 0 and 1. 0 indicated no occupants, 1 indicates normal occupancy. This is an estimate based on holidays, weekends and school academic calendar.
holidays = pd.read_excel('Data/holidays.xlsx')
holidays.head()
startDate | endDate | value | |
---|---|---|---|
0 | 2011-07-01 | 2011-09-06 | 0.5 |
1 | 2011-10-10 | 2011-10-11 | 0.6 |
2 | 2011-11-24 | 2011-11-28 | 0.2 |
3 | 2011-12-22 | 2011-12-24 | 0.1 |
4 | 2011-12-24 | 2012-01-02 | 0.0 |
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
occupancy = np.ones(len(hourlyTimestamp))
hourlyOccupancy = pd.DataFrame(data = occupancy, index = hourlyTimestamp, columns = ['occupancy'])
Saturdays = hourlyOccupancy.index.weekday == 5
Sundays = hourlyOccupancy.index.weekday == 6
hourlyOccupancy.loc[Saturdays, 'occupancy'] = 0.5
hourlyOccupancy.loc[Sundays, 'occupancy'] = 0.5
for i in range(len(holidays)):
timestamp = pd.date_range(start = holidays.loc[i, 'startDate'], end = holidays.loc[i, 'endDate'], freq = 'H')
hourlyOccupancy.loc[timestamp, 'occupancy'] = holidays.loc[i, 'value']
#hourlyHolidays['Datetime'] = pd.to_datetime(hourlyHolidays['Datetime'], format="%Y-%m-%d %H:%M:%S")
hourlyOccupancy['cosHour'] = np.cos((hourlyOccupancy.index.hour - 3) * 2 * np.pi / 24)
dailyOccupancy = hourlyOccupancy.resample('D')
dailyOccupancy.drop('cosHour', axis = 1, inplace = True)
hourlyElectricityWithFeatures = hourlyElectricity.join(hourlyWeather, how = 'inner')
hourlyElectricityWithFeatures = hourlyElectricityWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyElectricityWithFeatures.to_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricityWithFeatures.head()
electricity-kWh | startTime | endTime | RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | occupancy | cosHour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 01:00:00 | 111.479277 | 2012-01-01 01:00:00 | 2012-01-01 02:00:00 | 87 | 4 | 1.9 | 1004 | 0 | 310 | 4.166670 | 0.004396 | 0 | 11 | 0 | 0 | 0.866025 |
2012-01-01 02:00:00 | 117.989395 | 2012-01-01 02:00:00 | 2012-01-01 03:00:00 | 87 | 4 | 1.9 | 1005 | 0 | 280 | 4.166670 | 0.004391 | 0 | 11 | 0 | 0 | 0.965926 |
2012-01-01 03:00:00 | 119.010131 | 2012-01-01 03:00:00 | 2012-01-01 04:00:00 | 81 | 5 | 1.9 | 1006 | 0 | 270 | 4.166670 | 0.004380 | 0 | 10 | 0 | 0 | 1.000000 |
2012-01-01 04:00:00 | 116.005587 | 2012-01-01 04:00:00 | 2012-01-01 05:00:00 | 76 | 6 | 1.9 | 1007 | 0 | 290 | 4.722226 | 0.004401 | 0 | 9 | 0 | 0 | 0.965926 |
2012-01-01 05:00:00 | 111.132977 | 2012-01-01 05:00:00 | 2012-01-01 06:00:00 | 87 | 4 | 1.9 | 1007 | 0 | 280 | 3.055558 | 0.004382 | 0 | 11 | 0 | 0 | 0.866025 |
hourlyChilledWaterWithFeatures = hourlyChilledWater.join(hourlyWeather, how = 'inner')
hourlyChilledWaterWithFeatures = hourlyChilledWaterWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyChilledWaterWithFeatures.to_excel('Data/hourlyChilledWaterWithFeatures.xlsx')
hourlySteamWithFeatures = hourlySteam.join(hourlyWeather, how = 'inner')
hourlySteamWithFeatures = hourlySteamWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlySteamWithFeatures.to_excel('Data/hourlySteamWithFeatures.xlsx')
dailyElectricityWithFeatures = dailyElectricity.join(dailyWeather, how = 'inner')
dailyElectricityWithFeatures = dailyElectricityWithFeatures.join(dailyOccupancy, how = 'inner')
dailyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyElectricityWithFeatures.to_excel('Data/dailyElectricityWithFeatures.xlsx')
dailyChilledWaterWithFeatures = dailyChilledWater.join(dailyWeather, how = 'inner')
dailyChilledWaterWithFeatures = dailyChilledWaterWithFeatures.join(dailyOccupancy, how = 'inner')
dailyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyChilledWaterWithFeatures.to_excel('Data/dailyChilledWaterWithFeatures.xlsx')
dailySteamWithFeatures = dailySteam.join(dailyWeather, how = 'inner')
dailySteamWithFeatures = dailySteamWithFeatures.join(dailyOccupancy, how = 'inner')
dailySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
dailySteamWithFeatures.to_excel('Data/dailySteamWithFeatures.xlsx')
coolingDegrees: if T-C - 12 > 0, then = T-C - 12, else = 0. Assume that when outdoor temperature is below 12C, no cooling is needed, which is true for many buildings. This will be useful for daily prediction, because the average of hourly cooling degrees is better than average of hourly temperature.
cosHour: $\text{cos}(\text{hourOfDay} \cdot \frac{2\pi}{24})$
dehumidification: if humidityRatio - 0.00886 > 0, then = humidityRatio - 0.00886, else = 0. This will be useful for chilled water prediction, especially daily chilled water prediction.
heatingDegrees: if 15 - T-C > 0, then = 15 - T-C, else = 0. Assume that when outdoor temperature is above 15C, no heating is needed. This will be useful for daily prediction, because the average of hourly heating degrees is better than average of hourly temperature.
occupancy: A number between 0 and 1. 0 indicated no occupants, 1 indicates normal occupancy. This is an estimate based on holidays, weekends and school academic calendar.
pressure-mbar: atmospheric pressure
RH-% : Relative humidity
T-C : Dry-bulb temperature
Tdew-C : Dew-point temperature
Humidity ratio is important for chilled water prediction as chilled water is also used to dry the air discharged to rooms. Using humidity ratio will be more efficient and effective than using RH and dew point temperature.