import pandas as pd
import json
import numpy as np
from IPython.display import display, HTML
First I load some data that I downloaded from pressureNet (using this script). The data, which is all readings from the Toronto area from June 1 to September 11, 2013, was downloaded in smaller batches, so I have to load the JSON files as separate dataframes and concatenate them.
df0 = pd.read_json('file://localhost/home/natalie/pressurenet/data/June-01-2013:00:00:00_June-30-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df1 = pd.read_json('file://localhost/home/natalie/pressurenet/data/July-01-2013:00:00:00_July-31-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df2 = pd.read_json('file://localhost/home/natalie/pressurenet/data/August-01-2013:00:00:00_August-31-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df3 = pd.read_json('file://localhost/home/natalie/pressurenet/data/September-01-2013:00:00:00_September-07-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df4 = pd.read_json('file://localhost/home/natalie/pressurenet/data/September-07-2013:00:00:00_September-11-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
pn = pd.concat([df0, df1, df2, df3, df4])
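(Since the batch files share a naming pattern, the same load-and-concatenate step could be written as a loop over a glob; a sketch, assuming the files all live in the same data directory:)

import glob
# Hypothetical one-step version of the loading above: read every batch
# file in the data directory and concatenate the resulting dataframes.
files = sorted(glob.glob('/home/natalie/pressurenet/data/*.json'))
pn = pd.concat([pd.read_json(f, convert_dates=['daterecorded']) for f in files])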
pn[:5]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
| 1 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:54.448000 | 43.694750 | 20.000 | -79.291521 | | | | 987.429993 | 0 | Us, Researchers and Forecasters | -14400000 | 334bcef2adeb9b7c39ac26d495167f |
| 2 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:04.736000 | 43.736296 | 43.867 | -79.417653 | | | | 989.026428 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
| 3 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:18.016000 | 43.231558 | 42.000 | -79.853251 | | | | 988.316772 | 0 | Us, Researchers and Forecasters | -14400000 | 859545c4ccb81384621a7b4b1baf3538 |
| 4 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:02:04.814000 | 43.736296 | 43.867 | -79.417653 | | | | 989.396240 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
5 rows × 14 columns
To get an initial look at the data, I plot the atmospheric pressure (the "reading" field, in millibars) against the date ("daterecorded").
pn.plot(x='daterecorded', y='reading', figsize=(15, 6))
<matplotlib.axes.AxesSubplot at 0x314a0f90>
It looks like there are some outlying data points where the pressure is close to zero. I don't know a lot about the atmosphere, but I don't think those were accurate readings. There also seem to be a few periods of time with far fewer recordings. There isn't much I can do about the gaps, but I can try to remove the statistical outliers.
pn = pn[(pn['reading'] > 900) & (pn['reading'] < 1100)]
I settled on filtering out rows with a reading of less than 900 or more than 1100. The resulting plot looks better:
pn.plot(x='daterecorded', y='reading', figsize=(15, 6))
<matplotlib.axes.AxesSubplot at 0x70290250>
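As an aside, a data-driven cutoff could have replaced the hand-picked 900/1100 bounds; a sketch, dropping readings more than three standard deviations from the mean (though the near-zero outliers inflate the standard deviation, so the fixed bounds are arguably safer here):

# Hypothetical alternative filter: keep only readings within three
# standard deviations of the mean (computed before filtering).
mean, std = pn['reading'].mean(), pn['reading'].std()
pn = pn[(pn['reading'] - mean).abs() <= 3 * std]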
heat = pd.read_json('file://localhost/home/natalie/torontodata/heat_alerts_list.json', convert_dates=['date'])
print heat.code.unique()
print heat.text.unique()
heat[:5]
[u'HAE' u'HA' u'EHAE' u'HAU' u'EHAD' u'EHA']
[u"Toronto's Medical Officer of Health has extended the Heat Alert"
 u"Toronto's Medical Officer of Health has issued a Heat Alert"
 u"Toronto's Medical Officer of Health has extended the Extreme Heat Alert"
 u"Toronto's Medical Officer of Health has upgraded the Heat Alert to an Extreme Heat Alert"
 u"Toronto's Medical Officer of Health has downgraded the Extreme Heat Alert to a Heat Alert"
 u"Toronto's Medical Officer of Health has issued an Extreme Heat Alert"]
| | code | date | id | text |
|---|---|---|---|---|
| 0 | HAE | 2013-09-11 | 185 | Toronto's Medical Officer of Health has extend... |
| 1 | HA | 2013-09-10 | 184 | Toronto's Medical Officer of Health has issued... |
| 2 | EHAE | 2013-07-19 | 182 | Toronto's Medical Officer of Health has extend... |
| 3 | EHAE | 2013-07-18 | 181 | Toronto's Medical Officer of Health has extend... |
| 4 | EHAE | 2013-07-17 | 180 | Toronto's Medical Officer of Health has extend... |
5 rows × 4 columns
I load the list of heat warnings, which I downloaded from Toronto's open data portal here, and take a look at the data.
Next I want to join the two dataframes on date, but the pressureNet data has full datetimes while the heat data has just dates. I decide to add a column to the pressureNet dataframe with just the date, in the same format as the heat dataframe's date column.
heat.dtypes
code              object
date      datetime64[ns]
id                 int64
text              object
dtype: object
import arrow
pn["date"] = pn.apply(lambda x: str(arrow.get("T".join(str(x['daterecorded']).split(" "))).date()), axis=1).astype(datetime64)
To do this I use the arrow library. I build an arrow object out of the datetime string from each row from the pressureNet dataframe, get just the date part of that arrow object, cast it to a string, and then to a datetime64 dtype. I don't think this was the nicest way to accomplish my goal, but it worked:
pn[:1]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 |
1 rows × 15 columns
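In hindsight, pandas can do this without arrow; a sketch, assuming Timestamp.normalize() is available (it truncates a timestamp to midnight while keeping the datetime64 dtype):

# Hypothetical simpler version: truncate each timestamp to midnight so
# it matches the format of heat['date'].
pn["date"] = pn['daterecorded'].apply(lambda ts: ts.normalize())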
tor = pd.merge(pn, heat, on='date', suffixes=('_p','_h'), how='left')
tor[:5]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id | date | code | id | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
| 1 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:54.448000 | 43.694750 | 20.000 | -79.291521 | | | | 987.429993 | 0 | Us, Researchers and Forecasters | -14400000 | 334bcef2adeb9b7c39ac26d495167f | 2013-06-01 | NaN | NaN | NaN |
| 2 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:04.736000 | 43.736296 | 43.867 | -79.417653 | | | | 989.026428 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
| 3 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:18.016000 | 43.231558 | 42.000 | -79.853251 | | | | 988.316772 | 0 | Us, Researchers and Forecasters | -14400000 | 859545c4ccb81384621a7b4b1baf3538 | 2013-06-01 | NaN | NaN | NaN |
| 4 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:02:04.814000 | 43.736296 | 43.867 | -79.417653 | | | | 989.396240 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
5 rows × 18 columns
I perform the merge. None of the first five rows has heat warning data, but since there were far more rows in the pressureNet dataframe than in the heat one, that's not surprising. I check whether the 'text' field is always NaN:
tor.text.unique()
array([nan, u"Toronto's Medical Officer of Health has issued a Heat Alert", u"Toronto's Medical Officer of Health has upgraded the Heat Alert to an Extreme Heat Alert", u"Toronto's Medical Officer of Health has extended the Extreme Heat Alert", u"Toronto's Medical Officer of Health has downgraded the Extreme Heat Alert to a Heat Alert", u"Toronto's Medical Officer of Health has extended the Heat Alert"], dtype=object)
And luckily it isn't!
Next I want to be able to measure the correlation between the text/code (there is a one-to-one mapping here, so I just go with the code field) from the heat warning data and the readings from the pressureNet data. Pandas can only calculate correlation on numeric fields, and my codes are strings, so I decide to add a column of numeric codes with a one-to-one mapping with the existing string codes.
import numpy as np
unique_vals, tor['code_no'] = np.unique(tor.code, return_inverse=True)
tor['code_no'].unique()
array([0, 3, 5, 2, 1, 4])
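pandas also offers factorize for exactly this kind of string-to-integer mapping, which might have been a more direct route; a sketch:

# Equivalent mapping: integer labels plus the array of unique codes.
# (Note that factorize labels NaN as -1, unlike np.unique.)
labels, unique_codes = pd.factorize(tor['code'])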
tor[tor['code_no'] == 1][:1]['text']
144016    Toronto's Medical Officer of Health has downgr...
Name: text, dtype: object
tor[tor['code_no'] == 2][:1]['text']
138462    Toronto's Medical Officer of Health has extend...
Name: text, dtype: object
tor[tor['code_no'] == 3][:1]['text']
126790    Toronto's Medical Officer of Health has issued...
Name: text, dtype: object
tor[tor['code_no'] == 4][:1]['text']
573641    Toronto's Medical Officer of Health has extend...
Name: text, dtype: object
tor[tor['code_no'] == 5][:1]['text']
132377    Toronto's Medical Officer of Health has upgrad...
Name: text, dtype: object
tor.corr()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
altitude | 1.000000 | 0.007098 | -0.021511 | 0.018626 | 0.016783 | NaN | 0.151741 | NaN | -0.012730 |
latitude | 0.007098 | 1.000000 | -0.187085 | 0.560309 | -0.244301 | NaN | 0.037490 | 0.008536 | 0.004088 |
location_accuracy | -0.021511 | -0.187085 | 1.000000 | -0.168497 | 0.043525 | NaN | -0.034505 | -0.006546 | -0.003210 |
longitude | 0.018626 | 0.560309 | -0.168497 | 1.000000 | 0.003294 | NaN | 0.131755 | -0.011353 | 0.001112 |
reading | 0.016783 | -0.244301 | 0.043525 | 0.003294 | 1.000000 | NaN | -0.009200 | 0.031602 | 0.092197 |
reading_accuracy | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tzoffset | 0.151741 | 0.037490 | -0.034505 | 0.131755 | -0.009200 | NaN | 1.000000 | -0.123591 | 0.000231 |
id | NaN | 0.008536 | -0.006546 | -0.011353 | 0.031602 | NaN | -0.123591 | 1.000000 | -0.162111 |
code_no | -0.012730 | 0.004088 | -0.003210 | 0.001112 | 0.092197 | NaN | 0.000231 | -0.162111 | 1.000000 |
9 rows × 9 columns
The built-in correlation method shows only a weak correlation between the code_no field and the reading field, which is not what I had hoped to find. (To be fair, the numeric codes are just labels assigned in sort order, so a linear correlation against them is a fairly crude measure.)
warnings = tor[tor['code_no'].isin([1, 2, 3, 4, 5])]
warnings1 = tor[tor['code_no'] == 1]
warnings2 = tor[tor['code_no'] == 2]
warnings3 = tor[tor['code_no'] == 3]
warnings4 = tor[tor['code_no'] == 4]
warnings5 = tor[tor['code_no'] == 5]
nonwarnings = tor[tor['code_no'] == 0]
I isolate the days with any kind of heat warning as well as with each specific type of warning/warning-related message.
tor.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 574590.000000 | 574590.000000 | 574590.000000 | 574590.000000 | 574590.000000 | 574590 | 574590.000000 | 59764.000000 | 574590.000000 |
mean | 0.352973 | 43.640132 | 298.515366 | -79.477884 | 995.170459 | 0 | -13416276.301363 | 178.327036 | 0.294419 |
std | 8.555421 | 0.180799 | 514.602824 | 0.184457 | 7.652223 | 0 | 3647785.368813 | 3.488297 | 0.954214 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | 173.000000 | 0.000000 |
25% | 0.000000 | 43.646621 | 29.407000 | -79.569317 | 989.965881 | 0 | -14400000.000000 | 175.000000 | 0.000000 |
50% | 0.000000 | 43.685568 | 40.500000 | -79.417533 | 995.190002 | 0 | -14400000.000000 | 179.000000 | 0.000000 |
75% | 0.000000 | 43.736305 | 143.839500 | -79.379235 | 1000.540100 | 0 | -14400000.000000 | 181.000000 | 0.000000 |
max | 1000.199951 | 43.999995 | 5000.000000 | -79.000225 | 1344.660034 | 0 | 7200000.000000 | 185.000000 | 5.000000 |
8 rows × 9 columns
nonwarnings.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 514826.000000 | 514826.000000 | 514826.000000 | 514826.000000 | 514826.000000 | 514826 | 514826.000000 | 0 | 514826 |
mean | 0.393948 | 43.639945 | 299.194298 | -79.477633 | 995.040635 | 0 | -13408116.528691 | NaN | 0 |
std | 9.037479 | 0.181041 | 514.424446 | 0.184219 | 7.655763 | 0 | 3662129.542021 | NaN | 0 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | NaN | 0 |
25% | 0.000000 | 43.646545 | 29.502250 | -79.569317 | 989.890015 | 0 | -14400000.000000 | NaN | 0 |
50% | 0.000000 | 43.684683 | 40.689000 | -79.417533 | 995.053101 | 0 | -14400000.000000 | NaN | 0 |
75% | 0.000000 | 43.736305 | 147.578000 | -79.379184 | 1000.350220 | 0 | -14400000.000000 | NaN | 0 |
max | 1000.199951 | 43.999995 | 5000.000000 | -79.000415 | 1344.660034 | 0 | 7200000.000000 | NaN | 0 |
8 rows × 9 columns
warnings.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 29521 | 29521.000000 | 29521.000000 | 29521.000000 | 29521.000000 | 29521 | 29521.000000 | 29521.000000 | 29521.000000 |
mean | 0 | 43.643843 | 291.787895 | -79.485729 | 993.677119 | 0 | -13734656.685072 | 178.916229 | 1.807120 |
std | 0 | 0.179121 | 514.789195 | 0.188699 | 7.315454 | 0 | 3023152.265227 | 2.769632 | 0.394566 |
min | 0 | 43.015414 | 4.737000 | -79.995023 | 958.433228 | 0 | -18000000.000000 | 175.000000 | 1.000000 |
25% | 0 | 43.646741 | 27.095000 | -79.570170 | 987.667175 | 0 | -14400000.000000 | 176.000000 | 2.000000 |
50% | 0 | 43.695414 | 37.416000 | -79.417536 | 994.070007 | 0 | -14400000.000000 | 180.000000 | 2.000000 |
75% | 0 | 43.736307 | 111.656000 | -79.380830 | 998.940979 | 0 | -14400000.000000 | 181.000000 | 2.000000 |
max | 0 | 43.998816 | 3443.000000 | -79.000505 | 1012.913269 | 0 | 0.000000 | 182.000000 | 2.000000 |
8 rows × 9 columns
print len(warnings1)
warnings1.describe()
5694
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 5694 | 5694.000000 | 5694.000000 | 5694.000000 | 5694.000000 | 5694 | 5694.000000 | 5694 | 5694 |
mean | 0 | 43.632695 | 264.226575 | -79.497286 | 988.579502 | 0 | -13047629.083246 | 176 | 1 |
std | 0 | 0.180082 | 491.716751 | 0.185150 | 5.397092 | 0 | 4202346.483297 | 0 | 0 |
min | 0 | 43.028590 | 4.737000 | -79.979364 | 967.937988 | 0 | -18000000.000000 | 176 | 1 |
25% | 0 | 43.646365 | 31.245250 | -79.624901 | 985.508362 | 0 | -14400000.000000 | 176 | 1 |
50% | 0 | 43.668456 | 43.500000 | -79.417529 | 986.789978 | 0 | -14400000.000000 | 176 | 1 |
75% | 0 | 43.736316 | 86.079500 | -79.379235 | 993.769958 | 0 | -14400000.000000 | 176 | 1 |
max | 0 | 43.997440 | 2615.000000 | -79.003154 | 1000.289978 | 0 | 0.000000 | 176 | 1 |
8 rows × 9 columns
print len(warnings2)
warnings2.describe()
23827
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 23827 | 23827.000000 | 23827.000000 | 23827.000000 | 23827.000000 | 23827 | 23827.000000 | 23827.000000 | 23827 |
mean | 0 | 43.646507 | 298.374295 | -79.482968 | 994.895309 | 0 | -13898837.453309 | 179.613128 | 2 |
std | 0 | 0.178791 | 519.945122 | 0.189437 | 7.186866 | 0 | 2638984.140779 | 2.643099 | 0 |
min | 0 | 43.015414 | 4.737000 | -79.995023 | 958.433228 | 0 | -14400000.000000 | 175.000000 | 2 |
25% | 0 | 43.646784 | 27.000000 | -79.569345 | 989.320007 | 0 | -14400000.000000 | 180.000000 | 2 |
50% | 0 | 43.699368 | 36.026000 | -79.417536 | 995.592773 | 0 | -14400000.000000 | 181.000000 | 2 |
75% | 0 | 43.736303 | 128.397000 | -79.381984 | 999.920197 | 0 | -14400000.000000 | 182.000000 | 2 |
max | 0 | 43.998816 | 3443.000000 | -79.000505 | 1012.913269 | 0 | 0.000000 | 182.000000 | 2 |
8 rows × 9 columns
print len(warnings3)
warnings3.describe()
17222
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 17222 | 17222.000000 | 17222.000000 | 17222.000000 | 17222.000000 | 17222 | 17222.000000 | 17222.000000 | 17222 |
mean | 0 | 43.635359 | 299.504357 | -79.476144 | 998.575333 | 0 | -13268284.752061 | 178.244281 | 3 |
std | 0 | 0.180330 | 520.672903 | 0.183222 | 7.017629 | 0 | 3901858.699926 | 4.387450 | 0 |
min | 0 | 43.003593 | 1.250000 | -79.992424 | 974.284546 | 0 | -21600000.000000 | 173.000000 | 3 |
25% | 0 | 43.646312 | 29.740000 | -79.557855 | 993.064209 | 0 | -14400000.000000 | 173.000000 | 3 |
50% | 0 | 43.675566 | 37.500000 | -79.417526 | 999.139984 | 0 | -14400000.000000 | 178.000000 | 3 |
75% | 0 | 43.736284 | 121.500000 | -79.382363 | 1003.039978 | 0 | -14400000.000000 | 184.000000 | 3 |
max | 0 | 43.999589 | 3521.000000 | -79.001623 | 1157.923584 | 0 | 0.000000 | 184.000000 | 3 |
8 rows × 9 columns
print len(warnings4)
warnings4.describe()
949
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 949 | 949.000000 | 949.000000 | 949.000000 | 949.000000 | 949 | 949.000000 | 949 | 949 |
mean | 0 | 43.591688 | 209.036393 | -79.493094 | 995.035589 | 0 | -13906849.315068 | 185 | 4 |
std | 0 | 0.214829 | 395.285907 | 0.206161 | 5.332409 | 0 | 2830858.713231 | 0 | 0 |
min | 0 | 43.127080 | 20.000000 | -79.990436 | 974.619995 | 0 | -21600000.000000 | 185 | 4 |
25% | 0 | 43.558485 | 29.740000 | -79.635275 | 990.559998 | 0 | -14400000.000000 | 185 | 4 |
50% | 0 | 43.656342 | 39.817000 | -79.413306 | 995.729980 | 0 | -14400000.000000 | 185 | 4 |
75% | 0 | 43.739011 | 76.227000 | -79.379711 | 999.800476 | 0 | -14400000.000000 | 185 | 4 |
max | 0 | 43.855108 | 2055.000000 | -79.026341 | 1003.940002 | 0 | 0.000000 | 185 | 4 |
8 rows × 9 columns
print len(warnings5)
warnings5.describe()
12072
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 12072 | 12072.000000 | 12072.000000 | 12072.000000 | 12072.000000 | 12072 | 12072.000000 | 12072.000000 | 12072 |
mean | 0 | 43.649647 | 291.636069 | -79.470711 | 999.512005 | 0 | -13158250.497018 | 176.479705 | 5 |
std | 0 | 0.171120 | 520.591584 | 0.183464 | 6.532511 | 0 | 4042351.371450 | 2.500021 | 0 |
min | 0 | 43.060856 | 20.000000 | -79.995097 | 970.759949 | 0 | -14400000.000000 | 174.000000 | 5 |
25% | 0 | 43.650078 | 31.740750 | -79.548248 | 993.863770 | 0 | -14400000.000000 | 174.000000 | 5 |
50% | 0 | 43.694753 | 39.698500 | -79.417518 | 1000.555023 | 0 | -14400000.000000 | 174.000000 | 5 |
75% | 0 | 43.736302 | 107.008500 | -79.379235 | 1003.679993 | 0 | -14400000.000000 | 179.000000 | 5 |
max | 0 | 43.991877 | 3188.000000 | -79.000225 | 1014.539978 | 0 | 0.000000 | 179.000000 | 5 |
8 rows × 9 columns
It seems that the mean air pressure for heat warning days is 993.677119, the overall mean for the whole June 1 to September 11 window I'm looking at is 995.170459, and the mean for the days without heat warnings is 995.040635. The mean air pressure for code 1 warnings (Extreme Heat Alerts downgraded to Heat Alerts) is as low as 988.579502.
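The same per-code comparison can be condensed with a groupby rather than a stack of describe() calls; a sketch:

# Mean pressure reading for each numeric warning code, in one line.
tor.groupby('code_no')['reading'].mean()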
import scipy.stats
scipy.stats.ttest_ind(warnings['reading'], nonwarnings['reading'])
(array(-29.830170950049535), 2.2845887961730417e-195)
It was suggested to me that I use a t-test on the warnings and non-warnings data sets instead of just looking at their means. As I understand it, a t-test asks whether the difference in means is large relative to the spread and size of each sample. After all, it's possible for two very different distributions to result in the same mean value.
The results of this ttest_ind method are the t-statistic and the p-value. A quick Google search makes me believe that a t-statistic greater than 2.0 or less than -2.0 indicates a significant difference of means, as does a p-value smaller than 0.05. My data seems to pass that test!
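Given how different the two sample sizes are (29,521 warning readings versus 514,826 non-warning ones), Welch's variant of the t-test, which does not assume equal variances, might be a safer choice; a sketch, assuming a scipy version with the equal_var keyword (0.11+):

# Welch's t-test: drops the equal-variances assumption of the standard test.
scipy.stats.ttest_ind(warnings['reading'], nonwarnings['reading'], equal_var=False)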
warnings.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0xc68719d0>]], dtype=object)
nonwarnings.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0xbb3ceed0>]], dtype=object)
tor.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x71534bd0>]], dtype=object)
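Because the samples differ in size by a factor of roughly 17, overlaying normalized histograms might make the two distributions easier to compare than the separate plots above; a sketch (normed was the matplotlib keyword of this era, later renamed density):

# Overlay normalized histograms of the two samples on shared axes.
ax = warnings['reading'].hist(normed=True, bins=50, alpha=0.5, label='warnings')
nonwarnings['reading'].hist(normed=True, bins=50, alpha=0.5, ax=ax, label='non-warnings')
ax.legend()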
warnings["time"] = warnings.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
nonwarnings["time"] = nonwarnings.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
tor["time"] = tor.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
import datetime
starttime = datetime.time(13, 0)
endtime = datetime.time(23, 59)
warnings_day = warnings[(warnings['time'] > starttime) & (warnings['time'] < endtime)]
nonwarnings_day = nonwarnings[(nonwarnings['time'] > starttime) & (nonwarnings['time'] < endtime)]
days = tor[(tor['time'] > starttime) & (tor['time'] < endtime)]
scipy.stats.ttest_ind(warnings_day['reading'], nonwarnings_day['reading'])
(array(-42.752879538465386), 0.0)
With a bit of experimentation I found that limiting my data to readings from 1:00pm onward maximized the difference as reported by a t-test. I had expected the hours 11:00am - 5:00pm to be more statistically telling, since that is generally the hottest time of the day. But perhaps air pressure and temperature are not that closely correlated.
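That experimentation could be systematized by sweeping the start hour and reporting the t-statistic for each window; a sketch, keeping the 23:59 end time fixed:

import datetime
import scipy.stats
# Hypothetical sweep: for each candidate start hour, compare warning and
# non-warning readings taken after that hour and print the t-test result.
for hour in range(24):
    start = datetime.time(hour, 0)
    w = warnings[warnings['time'] > start]['reading']
    nw = nonwarnings[nonwarnings['time'] > start]['reading']
    t, p = scipy.stats.ttest_ind(w, nw)
    print hour, t, p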
warnings_day.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 14423 | 14423.000000 | 14423.000000 | 14423.000000 | 14423.000000 | 14423 | 14423.000000 | 14423.000000 | 14423.000000 |
mean | 0 | 43.643207 | 327.061149 | -79.481421 | 992.680759 | 0 | -13492699.161062 | 178.903349 | 1.802954 |
std | 0 | 0.175201 | 532.516588 | 0.183891 | 7.252837 | 0 | 3499611.298638 | 2.781350 | 0.397781 |
min | 0 | 43.075259 | 4.737000 | -79.981781 | 958.433228 | 0 | -18000000.000000 | 175.000000 | 1.000000 |
25% | 0 | 43.646658 | 28.000000 | -79.569886 | 986.769958 | 0 | -14400000.000000 | 176.000000 | 2.000000 |
50% | 0 | 43.687811 | 40.500000 | -79.417532 | 992.849976 | 0 | -14400000.000000 | 180.000000 | 2.000000 |
75% | 0 | 43.736302 | 201.616000 | -79.379235 | 997.729980 | 0 | -14400000.000000 | 181.000000 | 2.000000 |
max | 0 | 43.997440 | 3443.000000 | -79.001904 | 1012.199951 | 0 | 0.000000 | 182.000000 | 2.000000 |
8 rows × 9 columns
nonwarnings_day.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 323236.000000 | 323236.000000 | 323236.000000 | 323236.000000 | 323236.000000 | 323236 | 323236.000000 | 0 | 323236 |
mean | 0.362758 | 43.639612 | 309.844850 | -79.477183 | 995.380575 | 0 | -13442954.373894 | NaN | 0 |
std | 9.457962 | 0.179994 | 522.518671 | 0.184118 | 7.747306 | 0 | 3600175.892965 | NaN | 0 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | NaN | 0 |
25% | 0.000000 | 43.646387 | 30.000000 | -79.568931 | 990.165771 | 0 | -14400000.000000 | NaN | 0 |
50% | 0.000000 | 43.682701 | 41.907000 | -79.417531 | 995.459961 | 0 | -14400000.000000 | NaN | 0 |
75% | 0.000000 | 43.736308 | 159.068000 | -79.379235 | 1000.859924 | 0 | -14400000.000000 | NaN | 0 |
max | 1000.199951 | 43.999977 | 4908.000000 | -79.000415 | 1145.450439 | 0 | 7200000.000000 | NaN | 0 |
8 rows × 9 columns
The mean reading for warning days during the hours of 1pm to midnight is just 992.680759, compared to 995.380575 for the nonwarning days.
warnings_day.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4a559ad0>]], dtype=object)
nonwarnings_day.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4b016810>]], dtype=object)
days.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4b20bc90>]], dtype=object)
My conclusion is that there was some truth to my hypothesis: the distribution of atmospheric pressure readings on days with heat warnings is different from the distribution on days without heat warnings, although I still don't understand the exact relationship between air pressure and temperature.
If I were to return to this problem in the future, I would look further into trends for each of the different warning codes, recalling that code 1 (Extreme Heat Alert downgraded to Heat Alert) days had the lowest mean reading of all the data sets.