import pandas as pd
import json
import numpy as np
from IPython.display import display, HTML
First I load some data that I downloaded from pressureNet (using this script). The data, which is all readings from the Toronto area from June 1 to September 11, 2013, was downloaded in smaller batches, so I have to load the JSON files as separate dataframes and concatenate them.
df0 = pd.read_json('file://localhost/home/natalie/pressurenet/data/June-01-2013:00:00:00_June-30-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df1 = pd.read_json('file://localhost/home/natalie/pressurenet/data/July-01-2013:00:00:00_July-31-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df2 = pd.read_json('file://localhost/home/natalie/pressurenet/data/August-01-2013:00:00:00_August-31-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df3 = pd.read_json('file://localhost/home/natalie/pressurenet/data/September-01-2013:00:00:00_September-07-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
df4 = pd.read_json('file://localhost/home/natalie/pressurenet/data/September-07-2013:00:00:00_September-11-2013:00:00:00_43-44_-80--79.json', convert_dates=['daterecorded'])
pn = pd.concat([df0, df1, df2, df3, df4])
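(Since the batch files share a naming pattern, the same load-and-concatenate step could be written as a loop over a glob; a sketch, assuming the files all live in the same data directory:)

import glob
# Hypothetical one-step version of the loading above: read every batch
# file in the data directory and concatenate the resulting dataframes.
files = sorted(glob.glob('/home/natalie/pressurenet/data/*.json'))
pn = pd.concat([pd.read_json(f, convert_dates=['daterecorded']) for f in files])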
pn[:5]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
| 1 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:54.448000 | 43.694750 | 20.000 | -79.291521 | | | | 987.429993 | 0 | Us, Researchers and Forecasters | -14400000 | 334bcef2adeb9b7c39ac26d495167f |
| 2 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:04.736000 | 43.736296 | 43.867 | -79.417653 | | | | 989.026428 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
| 3 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:18.016000 | 43.231558 | 42.000 | -79.853251 | | | | 988.316772 | 0 | Us, Researchers and Forecasters | -14400000 | 859545c4ccb81384621a7b4b1baf3538 |
| 4 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:02:04.814000 | 43.736296 | 43.867 | -79.417653 | | | | 989.396240 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac |
5 rows × 14 columns
To get an initial look at the data, I plot the atmospheric pressure (the "reading" field, in millibars) against the date ("daterecorded").
pn.plot(x='daterecorded', y='reading', figsize=(15, 6))
<matplotlib.axes.AxesSubplot at 0x314a0f90>
It looks like there are some outlying data points where the pressure is close to zero. I don't know a lot about the atmosphere, but I don't think those were accurate readings. There also seem to be a few periods of time with far fewer recordings. There isn't much I can do about the gaps, but I can try to remove the statistical outliers.
pn = pn[(pn['reading'] > 900) & (pn['reading'] < 1100)]
I settled on filtering out rows with a reading of less than 900 or more than 1100. The resulting plot looks better:
pn.plot(x='daterecorded', y='reading', figsize=(15, 6))
<matplotlib.axes.AxesSubplot at 0x70290250>
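As an aside, a data-driven cutoff could have replaced the hand-picked 900/1100 bounds; a sketch, dropping readings more than three standard deviations from the mean (though the near-zero outliers inflate the standard deviation, so the fixed bounds are arguably safer here):

# Hypothetical alternative filter: keep only readings within three
# standard deviations of the mean (computed before filtering).
mean, std = pn['reading'].mean(), pn['reading'].std()
pn = pn[(pn['reading'] - mean).abs() <= 3 * std]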
heat = pd.read_json('file://localhost/home/natalie/torontodata/heat_alerts_list.json', convert_dates=['date'])
print heat.code.unique()
print heat.text.unique()
heat[:5]
[u'HAE' u'HA' u'EHAE' u'HAU' u'EHAD' u'EHA']
[u"Toronto's Medical Officer of Health has extended the Heat Alert"
 u"Toronto's Medical Officer of Health has issued a Heat Alert"
 u"Toronto's Medical Officer of Health has extended the Extreme Heat Alert"
 u"Toronto's Medical Officer of Health has upgraded the Heat Alert to an Extreme Heat Alert"
 u"Toronto's Medical Officer of Health has downgraded the Extreme Heat Alert to a Heat Alert"
 u"Toronto's Medical Officer of Health has issued an Extreme Heat Alert"]
| | code | date | id | text |
|---|---|---|---|---|
| 0 | HAE | 2013-09-11 | 185 | Toronto's Medical Officer of Health has extend... |
| 1 | HA | 2013-09-10 | 184 | Toronto's Medical Officer of Health has issued... |
| 2 | EHAE | 2013-07-19 | 182 | Toronto's Medical Officer of Health has extend... |
| 3 | EHAE | 2013-07-18 | 181 | Toronto's Medical Officer of Health has extend... |
| 4 | EHAE | 2013-07-17 | 180 | Toronto's Medical Officer of Health has extend... |
5 rows × 4 columns
I load the list of heat warnings, which I downloaded from Toronto's open data portal here, and take a look at the data.
Next I want to join the two dataframes on date, but the pressureNet data has full datetimes while the heat data has just dates. I decide to add a column to the pressureNet dataframe with just the date, in the same format as the heat dataframe's date column.
heat.dtypes
code              object
date      datetime64[ns]
id                 int64
text              object
dtype: object
import arrow
pn["date"] = pn.apply(lambda x: str(arrow.get("T".join(str(x['daterecorded']).split(" "))).date()), axis=1).astype(datetime64)
To do this I use the arrow library. I build an arrow object out of the datetime string from each row from the pressureNet dataframe, get just the date part of that arrow object, cast it to a string, and then to a datetime64 dtype. I don't think this was the nicest way to accomplish my goal, but it worked:
pn[:1]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 |
1 rows × 15 columns
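In hindsight, pandas can do this without arrow; a sketch, assuming Timestamp.normalize() is available (it truncates a timestamp to midnight while keeping the datetime64 dtype):

# Hypothetical simpler version: truncate each timestamp to midnight so
# it matches the format of heat['date'].
pn["date"] = pn['daterecorded'].apply(lambda ts: ts.normalize())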
tor = pd.merge(pn, heat, on='date', suffixes=('_p','_h'), how='left')
tor[:5]
| | altitude | client_key | daterecorded | latitude | location_accuracy | longitude | observation_type | observation_unit | provider | reading | reading_accuracy | sharing | tzoffset | user_id | date | code | id | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:04.680000 | 43.736296 | 43.867 | -79.417653 | | | | 989.306274 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
| 1 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:00:54.448000 | 43.694750 | 20.000 | -79.291521 | | | | 987.429993 | 0 | Us, Researchers and Forecasters | -14400000 | 334bcef2adeb9b7c39ac26d495167f | 2013-06-01 | NaN | NaN | NaN |
| 2 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:04.736000 | 43.736296 | 43.867 | -79.417653 | | | | 989.026428 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
| 3 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:01:18.016000 | 43.231558 | 42.000 | -79.853251 | | | | 988.316772 | 0 | Us, Researchers and Forecasters | -14400000 | 859545c4ccb81384621a7b4b1baf3538 | 2013-06-01 | NaN | NaN | NaN |
| 4 | 0 | ca.cumulonimbus.barometernetwork | 2013-06-01 05:02:04.814000 | 43.736296 | 43.867 | -79.417653 | | | | 989.396240 | 0 | Public | -14400000 | e1551a39d3b6ec399f36acbb17beac | 2013-06-01 | NaN | NaN | NaN |
5 rows × 18 columns
I perform the merge. None of the first five rows has heat warning data, but since there were far more rows in the pressureNet dataframe than in the heat one, that's not surprising. I check whether the 'text' field is always NaN:
tor.text.unique()
array([nan, u"Toronto's Medical Officer of Health has issued a Heat Alert", u"Toronto's Medical Officer of Health has upgraded the Heat Alert to an Extreme Heat Alert", u"Toronto's Medical Officer of Health has extended the Extreme Heat Alert", u"Toronto's Medical Officer of Health has downgraded the Extreme Heat Alert to a Heat Alert", u"Toronto's Medical Officer of Health has extended the Heat Alert"], dtype=object)
And luckily it isn't!
Next I want to be able to measure the correlation between the text/code (there is a one-to-one mapping here, so I just go with the code field) from the heat warning data and the readings from the pressureNet data. Pandas can only calculate correlation on numeric fields, and my codes are strings, so I decide to add a column of numeric codes with a one-to-one mapping with the existing string codes.
import numpy as np
unique_vals, tor['code_no'] = np.unique(tor.code, return_inverse=True)
tor['code_no'].unique()
array([0, 3, 5, 2, 1, 4])
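pandas also offers factorize for exactly this kind of string-to-integer mapping, which might have been a more direct route; a sketch:

# Equivalent mapping: integer labels plus the array of unique codes.
# (Note that factorize labels NaN as -1, unlike np.unique.)
labels, unique_codes = pd.factorize(tor['code'])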
tor[tor['code_no'] == 1][:1]['text']
144016    Toronto's Medical Officer of Health has downgr...
Name: text, dtype: object
tor[tor['code_no'] == 2][:1]['text']
138462    Toronto's Medical Officer of Health has extend...
Name: text, dtype: object
tor[tor['code_no'] == 3][:1]['text']
126790    Toronto's Medical Officer of Health has issued...
Name: text, dtype: object
tor[tor['code_no'] == 4][:1]['text']
573641    Toronto's Medical Officer of Health has extend...
Name: text, dtype: object
tor[tor['code_no'] == 5][:1]['text']
132377    Toronto's Medical Officer of Health has upgrad...
Name: text, dtype: object
tor.corr()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
altitude | 1.000000 | 0.007098 | -0.021511 | 0.018626 | 0.016783 | NaN | 0.151741 | NaN | -0.012730 |
latitude | 0.007098 | 1.000000 | -0.187085 | 0.560309 | -0.244301 | NaN | 0.037490 | 0.008536 | 0.004088 |
location_accuracy | -0.021511 | -0.187085 | 1.000000 | -0.168497 | 0.043525 | NaN | -0.034505 | -0.006546 | -0.003210 |
longitude | 0.018626 | 0.560309 | -0.168497 | 1.000000 | 0.003294 | NaN | 0.131755 | -0.011353 | 0.001112 |
reading | 0.016783 | -0.244301 | 0.043525 | 0.003294 | 1.000000 | NaN | -0.009200 | 0.031602 | 0.092197 |
reading_accuracy | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tzoffset | 0.151741 | 0.037490 | -0.034505 | 0.131755 | -0.009200 | NaN | 1.000000 | -0.123591 | 0.000231 |
id | NaN | 0.008536 | -0.006546 | -0.011353 | 0.031602 | NaN | -0.123591 | 1.000000 | -0.162111 |
code_no | -0.012730 | 0.004088 | -0.003210 | 0.001112 | 0.092197 | NaN | 0.000231 | -0.162111 | 1.000000 |
9 rows × 9 columns
The built-in correlation method shows only a weak correlation between the code_no field and the reading field, which is not what I had hoped to find. (To be fair, the numeric codes are just labels assigned in sort order, so a linear correlation against them is a fairly crude measure.)
warnings = tor[tor['code_no'].isin([1, 2, 3, 4, 5])]
warnings1 = tor[tor['code_no'] == 1]
warnings2 = tor[tor['code_no'] == 2]
warnings3 = tor[tor['code_no'] == 3]
warnings4 = tor[tor['code_no'] == 4]
warnings5 = tor[tor['code_no'] == 5]
nonwarnings = tor[tor['code_no'] == 0]
I isolate the days with any kind of heat warning as well as with each specific type of warning/warning-related message.
tor.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 574590.000000 | 574590.000000 | 574590.000000 | 574590.000000 | 574590.000000 | 574590 | 574590.000000 | 59764.000000 | 574590.000000 |
mean | 0.352973 | 43.640132 | 298.515366 | -79.477884 | 995.170459 | 0 | -13416276.301363 | 178.327036 | 0.294419 |
std | 8.555421 | 0.180799 | 514.602824 | 0.184457 | 7.652223 | 0 | 3647785.368813 | 3.488297 | 0.954214 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | 173.000000 | 0.000000 |
25% | 0.000000 | 43.646621 | 29.407000 | -79.569317 | 989.965881 | 0 | -14400000.000000 | 175.000000 | 0.000000 |
50% | 0.000000 | 43.685568 | 40.500000 | -79.417533 | 995.190002 | 0 | -14400000.000000 | 179.000000 | 0.000000 |
75% | 0.000000 | 43.736305 | 143.839500 | -79.379235 | 1000.540100 | 0 | -14400000.000000 | 181.000000 | 0.000000 |
max | 1000.199951 | 43.999995 | 5000.000000 | -79.000225 | 1344.660034 | 0 | 7200000.000000 | 185.000000 | 5.000000 |
8 rows × 9 columns
nonwarnings.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 514826.000000 | 514826.000000 | 514826.000000 | 514826.000000 | 514826.000000 | 514826 | 514826.000000 | 0 | 514826 |
mean | 0.393948 | 43.639945 | 299.194298 | -79.477633 | 995.040635 | 0 | -13408116.528691 | NaN | 0 |
std | 9.037479 | 0.181041 | 514.424446 | 0.184219 | 7.655763 | 0 | 3662129.542021 | NaN | 0 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | NaN | 0 |
25% | 0.000000 | 43.646545 | 29.502250 | -79.569317 | 989.890015 | 0 | -14400000.000000 | NaN | 0 |
50% | 0.000000 | 43.684683 | 40.689000 | -79.417533 | 995.053101 | 0 | -14400000.000000 | NaN | 0 |
75% | 0.000000 | 43.736305 | 147.578000 | -79.379184 | 1000.350220 | 0 | -14400000.000000 | NaN | 0 |
max | 1000.199951 | 43.999995 | 5000.000000 | -79.000415 | 1344.660034 | 0 | 7200000.000000 | NaN | 0 |
8 rows × 9 columns
warnings.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 29521 | 29521.000000 | 29521.000000 | 29521.000000 | 29521.000000 | 29521 | 29521.000000 | 29521.000000 | 29521.000000 |
mean | 0 | 43.643843 | 291.787895 | -79.485729 | 993.677119 | 0 | -13734656.685072 | 178.916229 | 1.807120 |
std | 0 | 0.179121 | 514.789195 | 0.188699 | 7.315454 | 0 | 3023152.265227 | 2.769632 | 0.394566 |
min | 0 | 43.015414 | 4.737000 | -79.995023 | 958.433228 | 0 | -18000000.000000 | 175.000000 | 1.000000 |
25% | 0 | 43.646741 | 27.095000 | -79.570170 | 987.667175 | 0 | -14400000.000000 | 176.000000 | 2.000000 |
50% | 0 | 43.695414 | 37.416000 | -79.417536 | 994.070007 | 0 | -14400000.000000 | 180.000000 | 2.000000 |
75% | 0 | 43.736307 | 111.656000 | -79.380830 | 998.940979 | 0 | -14400000.000000 | 181.000000 | 2.000000 |
max | 0 | 43.998816 | 3443.000000 | -79.000505 | 1012.913269 | 0 | 0.000000 | 182.000000 | 2.000000 |
8 rows × 9 columns
print len(warnings1)
warnings1.describe()
5694
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 5694 | 5694.000000 | 5694.000000 | 5694.000000 | 5694.000000 | 5694 | 5694.000000 | 5694 | 5694 |
mean | 0 | 43.632695 | 264.226575 | -79.497286 | 988.579502 | 0 | -13047629.083246 | 176 | 1 |
std | 0 | 0.180082 | 491.716751 | 0.185150 | 5.397092 | 0 | 4202346.483297 | 0 | 0 |
min | 0 | 43.028590 | 4.737000 | -79.979364 | 967.937988 | 0 | -18000000.000000 | 176 | 1 |
25% | 0 | 43.646365 | 31.245250 | -79.624901 | 985.508362 | 0 | -14400000.000000 | 176 | 1 |
50% | 0 | 43.668456 | 43.500000 | -79.417529 | 986.789978 | 0 | -14400000.000000 | 176 | 1 |
75% | 0 | 43.736316 | 86.079500 | -79.379235 | 993.769958 | 0 | -14400000.000000 | 176 | 1 |
max | 0 | 43.997440 | 2615.000000 | -79.003154 | 1000.289978 | 0 | 0.000000 | 176 | 1 |
8 rows × 9 columns
print len(warnings2)
warnings2.describe()
23827
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 23827 | 23827.000000 | 23827.000000 | 23827.000000 | 23827.000000 | 23827 | 23827.000000 | 23827.000000 | 23827 |
mean | 0 | 43.646507 | 298.374295 | -79.482968 | 994.895309 | 0 | -13898837.453309 | 179.613128 | 2 |
std | 0 | 0.178791 | 519.945122 | 0.189437 | 7.186866 | 0 | 2638984.140779 | 2.643099 | 0 |
min | 0 | 43.015414 | 4.737000 | -79.995023 | 958.433228 | 0 | -14400000.000000 | 175.000000 | 2 |
25% | 0 | 43.646784 | 27.000000 | -79.569345 | 989.320007 | 0 | -14400000.000000 | 180.000000 | 2 |
50% | 0 | 43.699368 | 36.026000 | -79.417536 | 995.592773 | 0 | -14400000.000000 | 181.000000 | 2 |
75% | 0 | 43.736303 | 128.397000 | -79.381984 | 999.920197 | 0 | -14400000.000000 | 182.000000 | 2 |
max | 0 | 43.998816 | 3443.000000 | -79.000505 | 1012.913269 | 0 | 0.000000 | 182.000000 | 2 |
8 rows × 9 columns
print len(warnings3)
warnings3.describe()
17222
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 17222 | 17222.000000 | 17222.000000 | 17222.000000 | 17222.000000 | 17222 | 17222.000000 | 17222.000000 | 17222 |
mean | 0 | 43.635359 | 299.504357 | -79.476144 | 998.575333 | 0 | -13268284.752061 | 178.244281 | 3 |
std | 0 | 0.180330 | 520.672903 | 0.183222 | 7.017629 | 0 | 3901858.699926 | 4.387450 | 0 |
min | 0 | 43.003593 | 1.250000 | -79.992424 | 974.284546 | 0 | -21600000.000000 | 173.000000 | 3 |
25% | 0 | 43.646312 | 29.740000 | -79.557855 | 993.064209 | 0 | -14400000.000000 | 173.000000 | 3 |
50% | 0 | 43.675566 | 37.500000 | -79.417526 | 999.139984 | 0 | -14400000.000000 | 178.000000 | 3 |
75% | 0 | 43.736284 | 121.500000 | -79.382363 | 1003.039978 | 0 | -14400000.000000 | 184.000000 | 3 |
max | 0 | 43.999589 | 3521.000000 | -79.001623 | 1157.923584 | 0 | 0.000000 | 184.000000 | 3 |
8 rows × 9 columns
print len(warnings4)
warnings4.describe()
949
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 949 | 949.000000 | 949.000000 | 949.000000 | 949.000000 | 949 | 949.000000 | 949 | 949 |
mean | 0 | 43.591688 | 209.036393 | -79.493094 | 995.035589 | 0 | -13906849.315068 | 185 | 4 |
std | 0 | 0.214829 | 395.285907 | 0.206161 | 5.332409 | 0 | 2830858.713231 | 0 | 0 |
min | 0 | 43.127080 | 20.000000 | -79.990436 | 974.619995 | 0 | -21600000.000000 | 185 | 4 |
25% | 0 | 43.558485 | 29.740000 | -79.635275 | 990.559998 | 0 | -14400000.000000 | 185 | 4 |
50% | 0 | 43.656342 | 39.817000 | -79.413306 | 995.729980 | 0 | -14400000.000000 | 185 | 4 |
75% | 0 | 43.739011 | 76.227000 | -79.379711 | 999.800476 | 0 | -14400000.000000 | 185 | 4 |
max | 0 | 43.855108 | 2055.000000 | -79.026341 | 1003.940002 | 0 | 0.000000 | 185 | 4 |
8 rows × 9 columns
print len(warnings5)
warnings5.describe()
12072
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 12072 | 12072.000000 | 12072.000000 | 12072.000000 | 12072.000000 | 12072 | 12072.000000 | 12072.000000 | 12072 |
mean | 0 | 43.649647 | 291.636069 | -79.470711 | 999.512005 | 0 | -13158250.497018 | 176.479705 | 5 |
std | 0 | 0.171120 | 520.591584 | 0.183464 | 6.532511 | 0 | 4042351.371450 | 2.500021 | 0 |
min | 0 | 43.060856 | 20.000000 | -79.995097 | 970.759949 | 0 | -14400000.000000 | 174.000000 | 5 |
25% | 0 | 43.650078 | 31.740750 | -79.548248 | 993.863770 | 0 | -14400000.000000 | 174.000000 | 5 |
50% | 0 | 43.694753 | 39.698500 | -79.417518 | 1000.555023 | 0 | -14400000.000000 | 174.000000 | 5 |
75% | 0 | 43.736302 | 107.008500 | -79.379235 | 1003.679993 | 0 | -14400000.000000 | 179.000000 | 5 |
max | 0 | 43.991877 | 3188.000000 | -79.000225 | 1014.539978 | 0 | 0.000000 | 179.000000 | 5 |
8 rows × 9 columns
It seems that the mean air pressure for heat warning days is 993.677119, the overall mean for the whole June 1 to September 11 window I'm looking at is 995.170459, and the mean for the days without heat warnings is 995.040635. The mean air pressure for code 1 warnings (Extreme Heat Alerts downgraded to Heat Alerts) is as low as 988.579502.
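The same per-code comparison can be condensed with a groupby rather than a stack of describe() calls; a sketch:

# Mean pressure reading for each numeric warning code, in one line.
tor.groupby('code_no')['reading'].mean()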
import scipy.stats
scipy.stats.ttest_ind(warnings['reading'], nonwarnings['reading'])
(array(-29.830170950049535), 2.2845887961730417e-195)
It was suggested to me that I use a t-test on the warnings and non-warnings data sets instead of just looking at their means. As I understand it, a t-test asks whether the difference in means is large relative to the spread and size of each sample. After all, it's possible for two very different distributions to result in the same mean value.
The results of this ttest_ind method are the t-statistic and the p-value. A quick Google search makes me believe that a t-statistic greater than 2.0 or less than -2.0 indicates a significant difference of means, as does a p-value smaller than 0.05. My data seems to pass that test!
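Given how different the two sample sizes are (29,521 warning readings versus 514,826 non-warning ones), Welch's variant of the t-test, which does not assume equal variances, might be a safer choice; a sketch, assuming a scipy version with the equal_var keyword (0.11+):

# Welch's t-test: drops the equal-variances assumption of the standard test.
scipy.stats.ttest_ind(warnings['reading'], nonwarnings['reading'], equal_var=False)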
warnings.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0xc68719d0>]], dtype=object)
nonwarnings.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0xbb3ceed0>]], dtype=object)
tor.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x71534bd0>]], dtype=object)
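Because the samples differ in size by a factor of roughly 17, overlaying normalized histograms might make the two distributions easier to compare than the separate plots above; a sketch (normed was the matplotlib keyword of this era, later renamed density):

# Overlay normalized histograms of the two samples on shared axes.
ax = warnings['reading'].hist(normed=True, bins=50, alpha=0.5, label='warnings')
nonwarnings['reading'].hist(normed=True, bins=50, alpha=0.5, ax=ax, label='non-warnings')
ax.legend()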
warnings["time"] = warnings.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
nonwarnings["time"] = nonwarnings.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
tor["time"] = tor.apply(lambda x: arrow.get("T".join(str(x['daterecorded']).split(" "))).time(), axis=1)
import datetime
starttime = datetime.time(13, 0)
endtime = datetime.time(23, 59)
warnings_day = warnings[(warnings['time'] > starttime) & (warnings['time'] < endtime)]
nonwarnings_day = nonwarnings[(nonwarnings['time'] > starttime) & (nonwarnings['time'] < endtime)]
days = tor[(tor['time'] > starttime) & (tor['time'] < endtime)]
scipy.stats.ttest_ind(warnings_day['reading'], nonwarnings_day['reading'])
(array(-42.752879538465386), 0.0)
With a bit of experimentation I found that limiting my data to readings from 1:00pm onward maximized the difference as reported by a t-test. I had expected the hours 11:00am - 5:00pm to be more statistically telling, since that is generally the hottest time of the day. But perhaps air pressure and temperature are not that closely correlated.
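That experimentation could be systematized by sweeping the start hour and reporting the t-statistic for each window; a sketch, keeping the 23:59 end time fixed:

import datetime
import scipy.stats
# Hypothetical sweep: for each candidate start hour, compare warning and
# non-warning readings taken after that hour and print the t-test result.
for hour in range(24):
    start = datetime.time(hour, 0)
    w = warnings[warnings['time'] > start]['reading']
    nw = nonwarnings[nonwarnings['time'] > start]['reading']
    t, p = scipy.stats.ttest_ind(w, nw)
    print hour, t, p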
warnings_day.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 14423 | 14423.000000 | 14423.000000 | 14423.000000 | 14423.000000 | 14423 | 14423.000000 | 14423.000000 | 14423.000000 |
mean | 0 | 43.643207 | 327.061149 | -79.481421 | 992.680759 | 0 | -13492699.161062 | 178.903349 | 1.802954 |
std | 0 | 0.175201 | 532.516588 | 0.183891 | 7.252837 | 0 | 3499611.298638 | 2.781350 | 0.397781 |
min | 0 | 43.075259 | 4.737000 | -79.981781 | 958.433228 | 0 | -18000000.000000 | 175.000000 | 1.000000 |
25% | 0 | 43.646658 | 28.000000 | -79.569886 | 986.769958 | 0 | -14400000.000000 | 176.000000 | 2.000000 |
50% | 0 | 43.687811 | 40.500000 | -79.417532 | 992.849976 | 0 | -14400000.000000 | 180.000000 | 2.000000 |
75% | 0 | 43.736302 | 201.616000 | -79.379235 | 997.729980 | 0 | -14400000.000000 | 181.000000 | 2.000000 |
max | 0 | 43.997440 | 3443.000000 | -79.001904 | 1012.199951 | 0 | 0.000000 | 182.000000 | 2.000000 |
8 rows × 9 columns
nonwarnings_day.describe()
| | altitude | latitude | location_accuracy | longitude | reading | reading_accuracy | tzoffset | id | code_no |
|---|---|---|---|---|---|---|---|---|---|
count | 323236.000000 | 323236.000000 | 323236.000000 | 323236.000000 | 323236.000000 | 323236 | 323236.000000 | 0 | 323236 |
mean | 0.362758 | 43.639612 | 309.844850 | -79.477183 | 995.380575 | 0 | -13442954.373894 | NaN | 0 |
std | 9.457962 | 0.179994 | 522.518671 | 0.184118 | 7.747306 | 0 | 3600175.892965 | NaN | 0 |
min | -59.100006 | 43.000120 | 1.000000 | -79.999870 | 910.229736 | 0 | -28800000.000000 | NaN | 0 |
25% | 0.000000 | 43.646387 | 30.000000 | -79.568931 | 990.165771 | 0 | -14400000.000000 | NaN | 0 |
50% | 0.000000 | 43.682701 | 41.907000 | -79.417531 | 995.459961 | 0 | -14400000.000000 | NaN | 0 |
75% | 0.000000 | 43.736308 | 159.068000 | -79.379235 | 1000.859924 | 0 | -14400000.000000 | NaN | 0 |
max | 1000.199951 | 43.999977 | 4908.000000 | -79.000415 | 1145.450439 | 0 | 7200000.000000 | NaN | 0 |
8 rows × 9 columns
The mean reading for warning days during the hours of 1pm to midnight is just 992.680759, compared to 995.380575 for the nonwarning days.
warnings_day.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4a559ad0>]], dtype=object)
nonwarnings_day.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4b016810>]], dtype=object)
days.hist('reading')
array([[<matplotlib.axes.AxesSubplot object at 0x4b20bc90>]], dtype=object)
My conclusion is that there was some truth to my hypothesis: the distribution of atmospheric pressure readings on days with heat warnings is different from the distribution on days without heat warnings, although I still don't understand the exact relationship between air pressure and temperature.
If I were to return to this problem in the future, I would look further into trends for each of the different warning codes, recalling that code 1 (Extreme Heat Alert downgraded to Heat Alert) days had the lowest mean reading of all the data sets.