This notebook shows some example queries against the weather history database set up as described in the companion notebook weather_mysqlconfig. The MySQLdb library is used to manage the interaction with the database. See this blog post for more context.
import pandas as pd
import numpy as np
import MySQLdb as mdb
# Read credentials from local key=value files, one entry per line
AWSAccessKeyId, AWSSecretKey = ( line.strip().split('=')[1] for line in open('/Users/brian/rootkey.csv','r') )
sshKeyName, instancePass, mysqlPass, myIP = ( line.strip().split('=')[1] for line in open('/Users/brian/passwords.csv','r') )
def sqlCall(cmd):
    """Run a SQL command against the weather database and return the result as a DataFrame."""
    con = mdb.connect(host='ec2-54-91-115-227.compute-1.amazonaws.com',
                      user='weatherdbuser',
                      passwd=mysqlPass,
                      db='weather')
    cur = con.cursor()
    cur.execute(cmd)
    field_names = [i[0] for i in cur.description]
    output = list(cur.fetchall())
    con.close()
    df = pd.DataFrame(output)
    # Statements that return no rows produce an empty frame with no columns,
    # in which case the column labels cannot be applied
    if len(df.columns) == len(field_names):
        df.columns = field_names
    return df
cmd = "SELECT * FROM country LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | FIPS_ID | country_name |
|---|---|---|
| 0 | AA | ARUBA |
| 1 | AC | ANTIGUA AND BARBUDA |
| 2 | AF | AFGHANISTAN |
| 3 | AG | ALGERIA |
| 4 | AI | ASCENSION ISLAND |
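The sqlCall helper interpolates values directly into the SQL string. DB-API drivers also support parameter substitution, which handles quoting and escaping for you. A minimal sketch with the standard-library sqlite3 driver (MySQLdb follows the same `cur.execute(sql, params)` pattern, but its placeholder is `%s` rather than sqlite3's `?`; the tiny table here is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE country (FIPS_ID TEXT, country_name TEXT)")
cur.executemany("INSERT INTO country VALUES (?, ?)",
                [("AA", "ARUBA"), ("AF", "AFGHANISTAN")])
# Parameter substitution: the driver quotes the value, not the caller
cur.execute("SELECT country_name FROM country WHERE FIPS_ID = ?", ("AF",))
row = cur.fetchone()
print(row[0])
con.close()
```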
cmd = "SELECT * FROM station LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | USAF | WBAN | station_name | CTRY | FIPS_ID | ST | callid | lat | lon | elev |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000000 | 99999 | NYGGBUKTA GREENLAND- STA | GL | GL | | | 73483 | 21567 | 30 |
| 1 | 000010 | 99999 | JAN HAYEN | NO | NO | | | 70983 | -7700 | 229 |
| 2 | 000020 | 99999 | ISFJORD RADIO SPITZBERGEN | NO | NO | | | 78067 | 13633 | 79 |
| 3 | 000030 | 99999 | BJORNOYA BARENTS SEA | NO | NO | | | 74467 | 19283 | 290 |
| 4 | 000040 | 99999 | VAROO | NO | NO | | | 70367 | 31100 | 119 |
cmd = "SELECT * FROM observation LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | STNYMD | STN | WBAN | year | month | day | temp | dewp | slp | stp | visib | wdsp | mxspd | gust | maxtemp | mintemp | prcp | sndp | FRSHTT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00820920090309 | 008209 | 99999 | 2009 | 3 | 9 | 782 | 710 | 99999 | 99999 | 9999 | 13 | 19 | 9999 | 806 | 716 | 0 | 9999 | 000000 |
| 1 | 00820920090310 | 008209 | 99999 | 2009 | 3 | 10 | 678 | 656 | 99999 | 99999 | 9999 | 12 | 41 | 9999 | 806 | 554 | 0 | 9999 | 000000 |
| 2 | 00820920090311 | 008209 | 99999 | 2009 | 3 | 11 | 681 | 646 | 99999 | 99999 | 9999 | 13 | 29 | 9999 | 824 | 554 | 0 | 9999 | 000000 |
| 3 | 00820920090312 | 008209 | 99999 | 2009 | 3 | 12 | 666 | 639 | 99999 | 99999 | 9999 | 12 | 41 | 9999 | 806 | 572 | 0 | 9999 | 000000 |
| 4 | 00820920090313 | 008209 | 99999 | 2009 | 3 | 13 | 669 | 622 | 99999 | 99999 | 9999 | 13 | 41 | 9999 | 806 | 554 | 0 | 9999 | 000000 |
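The repeated 9999 and 99999 values above look like missing-data sentinels (the GSOD convention) rather than real readings, so they should be masked before any averaging. A hypothetical sketch — the column values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented rows in the observation table's integer encoding
obs = pd.DataFrame({"temp": [782, 678, 666],
                    "slp":  [99999, 10132, 99999]})
# Treat the sentinel values as missing so they don't poison aggregates
clean = obs.replace({9999: np.nan, 99999: np.nan})
print(clean["slp"].mean())
```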
The database contains weather data for more than 9,000 sites. For initial exploration and visualization, we will select a subset of these stations. In particular, the following cells ([7] through [11]) select the longest-running record (i.e., the one spanning the most years of observations) from each country, breaking ties randomly.
cmd = "SELECT USAF, FIPS_ID, obs.numYears "
cmd += "FROM station "
cmd += "JOIN (SELECT STN, COUNT(DISTINCT year) as numYears "
cmd += "FROM observation "
cmd += "GROUP BY STN) as obs "
cmd += "ON station.USAF = obs.STN "
cmd += ";"
r = sqlCall(cmd)
r.head()
| | USAF | FIPS_ID | numYears |
|---|---|---|---|
| 0 | 010010 | NO | 41 |
| 1 | 010013 | NO | 3 |
| 2 | 010014 | NO | 24 |
| 3 | 010015 | NO | 23 |
| 4 | 010016 | NO | 5 |
# Keep the rows whose numYears equals the per-country maximum
idx = r.groupby(['FIPS_ID'])['numYears'].transform(max) == r['numYears']
longestPerCountry = r[idx].sort_values('FIPS_ID')
longestPerCountry.tail()
| | USAF | FIPS_ID | numYears |
|---|---|---|---|
| 11161 | 679750 | ZI | 44 |
| 11128 | 677750 | ZI | 44 |
| 11167 | 679910 | ZI | 44 |
| 11124 | 677650 | ZI | 44 |
| 11143 | 678670 | ZI | 44 |
# Pick one row at random from each group, breaking ties among equally long records
fn = lambda obj: obj.loc[np.random.choice(obj.index, 1, True), :]
selectedSites = longestPerCountry.groupby('FIPS_ID', as_index=False).apply(fn)
selectedSites.tail()
| | | USAF | FIPS_ID | numYears |
|---|---|---|---|---|
| 255 | 7819 | 414363 | YM | 22 |
| 256 | 3391 | 131500 | YU | 31 |
| 257 | 10227 | 601400 | YY | 3 |
| 258 | 11106 | 676650 | ZA | 37 |
| 259 | 11167 | 679910 | ZI | 44 |
selectedSitesList = selectedSites['USAF'].tolist()
selectedSitesList[:10]
['035963', '789820', '788620', '411960', '409480', '603900', '375750', '136150', '377890', '664220']
The following collects the ID, station name, country name, latitude, longitude, and elevation for each station in the selected site list.
# The list repr conveniently yields quoted, comma-separated IDs for a SQL IN clause
slist = str(selectedSitesList)[1:-1]
slist[:100]
"'035963', '789820', '788620', '411960', '409480', '603900', '375750', '136150', '377890', '664220', "
cmd = "SELECT s.USAF, s.station_name, c.country_name, "
cmd += "(s.lat / 1000) as lat, (s.lon / 1000) as lon, (s.elev / 10) as elev "
cmd += "FROM station s "
cmd += "JOIN country c ON s.FIPS_ID = c.FIPS_ID "
cmd += "WHERE s.USAF IN (" + slist + ");"
sites = sqlCall(cmd)
sites[['lat','lon','elev']] = sites[['lat','lon','elev']].astype(float)
sites.head()
| | USAF | station_name | country_name | lat | lon | elev |
|---|---|---|---|---|---|---|
| 0 | 010620 | HOPEN | SVALBARD | 76.500 | 25.067 | 10 |
| 1 | 013840 | OSLO/GARDERMOEN | NORWAY | 60.200 | 11.083 | 204 |
| 2 | 026800 | HOBURG | SWEDEN | 56.917 | 18.150 | 39 |
| 3 | 029740 | HELSINKI-VANTAA | FINLAND | 60.317 | 24.967 | 56 |
| 4 | 031350 | PRESTWICK(CIV/NAVY) | UNITED KINGDOM | 55.500 | -4.583 | 20 |
The weather observations in the database are recorded daily. For our purposes, we are only interested in the monthly average, minimum, and maximum temperatures, and only for the site list described above.
siteTemps = []
for site in selectedSitesList:
cmd = "SELECT STN, year, month, AVG(temp), MAX(maxtemp), MIN(mintemp) "
cmd += "FROM observation "
cmd += "WHERE STN = '" + site + "' "
cmd += "GROUP BY year, month "
cmd += ";"
siteTemps.append(sqlCall(cmd))
siteTemps[0].head(10)
| | STN | year | month | AVG(temp) | MAX(maxtemp) | MIN(mintemp) |
|---|---|---|---|---|---|---|
| 0 | 035963 | 1945 | 1 | 330.0323 | 473 | 174 |
| 1 | 035963 | 1945 | 2 | 453.7500 | 603 | 313 |
| 2 | 035963 | 1945 | 3 | 466.0000 | 644 | 313 |
| 3 | 035963 | 1945 | 4 | 489.9333 | 734 | 324 |
| 4 | 035963 | 1945 | 5 | 545.1667 | 793 | 324 |
| 5 | 035963 | 1945 | 6 | 592.4333 | 783 | 453 |
| 6 | 035963 | 1945 | 7 | 590.0000 | 703 | 484 |
| 7 | 035963 | 1951 | 6 | 590.1200 | 763 | 441 |
| 8 | 035963 | 1951 | 7 | 624.9032 | 784 | 453 |
| 9 | 035963 | 1951 | 8 | 610.3548 | 734 | 473 |
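The temperatures above appear to be stored as integers scaled by 10 and, judging by the magnitudes, in degrees Fahrenheit (the GSOD convention); both assumptions are worth verifying against the source data. A sketch of recovering degrees and optionally converting to Celsius, using two values copied from the AVG(temp) column:

```python
import pandas as pd

# Two sample values from the AVG(temp) column above
df = pd.DataFrame({"AVG(temp)": [330.0323, 453.7500]})
temp_f = df["AVG(temp)"].astype(float) / 10   # undo the x10 scaling
temp_c = (temp_f - 32) * 5 / 9                # Fahrenheit -> Celsius
print(temp_c.round(2).tolist())
```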
The following graph shows the distribution of selected sites by latitude and longitude (note that the rough shape of Africa is visible in the center of the graph).
sites.plot(x='lon',y='lat',kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10863fdd0>
The following graph shows an example scatter plot of monthly average temperatures (over many years) for a single site.
grapher = siteTemps[0]
grapher[['year']] = grapher[['year']].astype(int)
grapher[['month']] = grapher[['month']].astype(int)
# Temperatures are stored as integers scaled by 10; recover degrees
grapher[['t']] = grapher[['AVG(temp)']].astype(float) / 10
grapher.plot(x='month', y='t', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10b8b2210>
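To summarize the scatter above, one might average the monthly means across all years, giving one climatological value per calendar month. A sketch with invented values in the same shape as grapher's year/month/t columns:

```python
import pandas as pd

# Invented values shaped like grapher's year/month/t columns
g = pd.DataFrame({"year":  [1945, 1951, 1945, 1951],
                  "month": [6, 6, 7, 7],
                  "t":     [59.2, 59.0, 59.0, 62.4]})
# Mean of the monthly averages across all years, one value per calendar month
climatology = g.groupby("month")["t"].mean()
print([round(x, 1) for x in climatology.tolist()])
```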
The sites and temperature data are exported to tab-separated text files as follows. They are then used in the D3 visualization shown in the blog post here
sites.to_csv('sites.csv',sep='\t',index=False)
for i in range(len(selectedSitesList)):
    siteTemps[i].to_csv('sitesTemps' + str(i) + '.csv', sep='\t', index=False)