This notebook shows some example queries against the weather history database set up as described in the companion notebook weather_mysqlconfig. The MySQLdb library is used to manage the interaction with the database. See this blog post for more context.
import pandas as pd
import numpy as np
import MySQLdb as mdb
# Read credentials from local key=value files, one entry per line
AWSAccessKeyId, AWSSecretKey = ( line.strip().split('=')[1] for line in open('/Users/brian/rootkey.csv','r') )
sshKeyName, instancePass, mysqlPass, myIP = ( line.strip().split('=')[1] for line in open('/Users/brian/passwords.csv','r') )
def sqlCall(cmd):
    """Run a SQL command against the weather database and return the result as a DataFrame."""
    con = mdb.connect(host='ec2-54-91-115-227.compute-1.amazonaws.com',
                      user='weatherdbuser',
                      passwd=mysqlPass,
                      db='weather')
    cur = con.cursor()
    cur.execute(cmd)
    field_names = [i[0] for i in cur.description]
    output = list(cur.fetchall())
    con.close()
    df = pd.DataFrame(output)
    # Statements that return no rows produce an empty frame with no columns,
    # in which case the column labels cannot be applied
    if len(df.columns) == len(field_names):
        df.columns = field_names
    return df
cmd = "SELECT * FROM country LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | FIPS_ID | country_name |
|---|---|---|
| 0 | AA | ARUBA |
| 1 | AC | ANTIGUA AND BARBUDA |
| 2 | AF | AFGHANISTAN |
| 3 | AG | ALGERIA |
| 4 | AI | ASCENSION ISLAND |
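The sqlCall helper interpolates values directly into the SQL string. DB-API drivers also support parameter substitution, which handles quoting and escaping for you. A minimal sketch with the standard-library sqlite3 driver (MySQLdb follows the same `cur.execute(sql, params)` pattern, but its placeholder is `%s` rather than sqlite3's `?`; the tiny table here is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE country (FIPS_ID TEXT, country_name TEXT)")
cur.executemany("INSERT INTO country VALUES (?, ?)",
                [("AA", "ARUBA"), ("AF", "AFGHANISTAN")])
# Parameter substitution: the driver quotes the value, not the caller
cur.execute("SELECT country_name FROM country WHERE FIPS_ID = ?", ("AF",))
row = cur.fetchone()
print(row[0])
con.close()
```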
cmd = "SELECT * FROM station LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | USAF | WBAN | station_name | CTRY | FIPS_ID | ST | callid | lat | lon | elev |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000000 | 99999 | NYGGBUKTA GREENLAND- STA | GL | GL | | | 73483 | 21567 | 30 |
| 1 | 000010 | 99999 | JAN HAYEN | NO | NO | | | 70983 | -7700 | 229 |
| 2 | 000020 | 99999 | ISFJORD RADIO SPITZBERGEN | NO | NO | | | 78067 | 13633 | 79 |
| 3 | 000030 | 99999 | BJORNOYA BARENTS SEA | NO | NO | | | 74467 | 19283 | 290 |
| 4 | 000040 | 99999 | VAROO | NO | NO | | | 70367 | 31100 | 119 |
cmd = "SELECT * FROM observation LIMIT 5"
cmd += ";"
sqlCall(cmd)
| | STNYMD | STN | WBAN | year | month | day | temp | dewp | slp | stp | visib | wdsp | mxspd | gust | maxtemp | mintemp | prcp | sndp | FRSHTT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00820920090309 | 008209 | 99999 | 2009 | 3 | 9 | 782 | 710 | 99999 | 99999 | 9999 | 13 | 19 | 9999 | 806 | 716 | 0 | 9999 | 000000 |
| 1 | 00820920090310 | 008209 | 99999 | 2009 | 3 | 10 | 678 | 656 | 99999 | 99999 | 9999 | 12 | 41 | 9999 | 806 | 554 | 0 | 9999 | 000000 |
| 2 | 00820920090311 | 008209 | 99999 | 2009 | 3 | 11 | 681 | 646 | 99999 | 99999 | 9999 | 13 | 29 | 9999 | 824 | 554 | 0 | 9999 | 000000 |
| 3 | 00820920090312 | 008209 | 99999 | 2009 | 3 | 12 | 666 | 639 | 99999 | 99999 | 9999 | 12 | 41 | 9999 | 806 | 572 | 0 | 9999 | 000000 |
| 4 | 00820920090313 | 008209 | 99999 | 2009 | 3 | 13 | 669 | 622 | 99999 | 99999 | 9999 | 13 | 41 | 9999 | 806 | 554 | 0 | 9999 | 000000 |
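The repeated 9999 and 99999 values above look like missing-data sentinels (the GSOD convention) rather than real readings, so they should be masked before any averaging. A hypothetical sketch — the column values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented rows in the observation table's integer encoding
obs = pd.DataFrame({"temp": [782, 678, 666],
                    "slp":  [99999, 10132, 99999]})
# Treat the sentinel values as missing so they don't poison aggregates
clean = obs.replace({9999: np.nan, 99999: np.nan})
print(clean["slp"].mean())
```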
The database contains weather data for more than 9,000 sites. For initial exploration and visualization, we will select a subset of these stations. In particular, the following cells ([7] through [11]) select the longest-running record (i.e., the one spanning the most years of observations) from each country, breaking ties randomly.
cmd = "SELECT USAF, FIPS_ID, obs.numYears "
cmd += "FROM station "
cmd += "JOIN (SELECT STN, COUNT(DISTINCT year) as numYears "
cmd += "FROM observation "
cmd += "GROUP BY STN) as obs "
cmd += "ON station.USAF = obs.STN "
cmd += ";"
r = sqlCall(cmd)
r.head()
| | USAF | FIPS_ID | numYears |
|---|---|---|---|
| 0 | 010010 | NO | 41 |
| 1 | 010013 | NO | 3 |
| 2 | 010014 | NO | 24 |
| 3 | 010015 | NO | 23 |
| 4 | 010016 | NO | 5 |
# Keep the rows whose numYears equals the per-country maximum
idx = r.groupby(['FIPS_ID'])['numYears'].transform(max) == r['numYears']
longestPerCountry = r[idx].sort_values('FIPS_ID')
longestPerCountry.tail()
| | USAF | FIPS_ID | numYears |
|---|---|---|---|
| 11161 | 679750 | ZI | 44 |
| 11128 | 677750 | ZI | 44 |
| 11167 | 679910 | ZI | 44 |
| 11124 | 677650 | ZI | 44 |
| 11143 | 678670 | ZI | 44 |
# Pick one row at random from each group, breaking ties among equally long records
fn = lambda obj: obj.loc[np.random.choice(obj.index, 1, True), :]
selectedSites = longestPerCountry.groupby('FIPS_ID', as_index=False).apply(fn)
selectedSites.tail()
| | | USAF | FIPS_ID | numYears |
|---|---|---|---|---|
| 255 | 7819 | 414363 | YM | 22 |
| 256 | 3391 | 131500 | YU | 31 |
| 257 | 10227 | 601400 | YY | 3 |
| 258 | 11106 | 676650 | ZA | 37 |
| 259 | 11167 | 679910 | ZI | 44 |
selectedSitesList = selectedSites['USAF'].tolist()
selectedSitesList[:10]
['035963', '789820', '788620', '411960', '409480', '603900', '375750', '136150', '377890', '664220']
The following collects the ID, station name, country name, latitude, longitude, and elevation for each station in the selected site list.
# The list repr conveniently yields quoted, comma-separated IDs for a SQL IN clause
slist = str(selectedSitesList)[1:-1]
slist[:100]
"'035963', '789820', '788620', '411960', '409480', '603900', '375750', '136150', '377890', '664220', "
cmd = "SELECT s.USAF, s.station_name, c.country_name, "
cmd += "(s.lat / 1000) as lat, (s.lon / 1000) as lon, (s.elev / 10) as elev "
cmd += "FROM station s "
cmd += "JOIN country c ON s.FIPS_ID = c.FIPS_ID "
cmd += "WHERE s.USAF IN (" + slist + ");"
sites = sqlCall(cmd)
sites[['lat','lon','elev']] = sites[['lat','lon','elev']].astype(float)
sites.head()
| | USAF | station_name | country_name | lat | lon | elev |
|---|---|---|---|---|---|---|
| 0 | 010620 | HOPEN | SVALBARD | 76.500 | 25.067 | 10 |
| 1 | 013840 | OSLO/GARDERMOEN | NORWAY | 60.200 | 11.083 | 204 |
| 2 | 026800 | HOBURG | SWEDEN | 56.917 | 18.150 | 39 |
| 3 | 029740 | HELSINKI-VANTAA | FINLAND | 60.317 | 24.967 | 56 |
| 4 | 031350 | PRESTWICK(CIV/NAVY) | UNITED KINGDOM | 55.500 | -4.583 | 20 |
The weather observations in the database are recorded daily. For our purposes, we are only interested in the monthly average, minimum, and maximum temperatures, and only for the site list described above.
siteTemps = []
for site in selectedSitesList:
cmd = "SELECT STN, year, month, AVG(temp), MAX(maxtemp), MIN(mintemp) "
cmd += "FROM observation "
cmd += "WHERE STN = '" + site + "' "
cmd += "GROUP BY year, month "
cmd += ";"
siteTemps.append(sqlCall(cmd))
siteTemps[0].head(10)
| | STN | year | month | AVG(temp) | MAX(maxtemp) | MIN(mintemp) |
|---|---|---|---|---|---|---|
| 0 | 035963 | 1945 | 1 | 330.0323 | 473 | 174 |
| 1 | 035963 | 1945 | 2 | 453.7500 | 603 | 313 |
| 2 | 035963 | 1945 | 3 | 466.0000 | 644 | 313 |
| 3 | 035963 | 1945 | 4 | 489.9333 | 734 | 324 |
| 4 | 035963 | 1945 | 5 | 545.1667 | 793 | 324 |
| 5 | 035963 | 1945 | 6 | 592.4333 | 783 | 453 |
| 6 | 035963 | 1945 | 7 | 590.0000 | 703 | 484 |
| 7 | 035963 | 1951 | 6 | 590.1200 | 763 | 441 |
| 8 | 035963 | 1951 | 7 | 624.9032 | 784 | 453 |
| 9 | 035963 | 1951 | 8 | 610.3548 | 734 | 473 |
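The temperatures above appear to be stored as integers scaled by 10 and, judging by the magnitudes, in degrees Fahrenheit (the GSOD convention); both assumptions are worth verifying against the source data. A sketch of recovering degrees and optionally converting to Celsius, using two values copied from the AVG(temp) column:

```python
import pandas as pd

# Two sample values from the AVG(temp) column above
df = pd.DataFrame({"AVG(temp)": [330.0323, 453.7500]})
temp_f = df["AVG(temp)"].astype(float) / 10   # undo the x10 scaling
temp_c = (temp_f - 32) * 5 / 9                # Fahrenheit -> Celsius
print(temp_c.round(2).tolist())
```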
The following graph shows the distribution of selected sites by latitude and longitude (note that the rough shape of Africa is visible in the center of the graph).
sites.plot(x='lon',y='lat',kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10863fdd0>
The following graph shows an example scatter plot of monthly average temperatures (over many years) for a single site.
grapher = siteTemps[0]
grapher[['year']] = grapher[['year']].astype(int)
grapher[['month']] = grapher[['month']].astype(int)
# Temperatures are stored as integers scaled by 10; recover degrees
grapher[['t']] = grapher[['AVG(temp)']].astype(float) / 10
grapher.plot(x='month', y='t', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10b8b2210>
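To summarize the scatter above, one might average the monthly means across all years, giving one climatological value per calendar month. A sketch with invented values in the same shape as grapher's year/month/t columns:

```python
import pandas as pd

# Invented values shaped like grapher's year/month/t columns
g = pd.DataFrame({"year":  [1945, 1951, 1945, 1951],
                  "month": [6, 6, 7, 7],
                  "t":     [59.2, 59.0, 59.0, 62.4]})
# Mean of the monthly averages across all years, one value per calendar month
climatology = g.groupby("month")["t"].mean()
print([round(x, 1) for x in climatology.tolist()])
```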
The sites and temperature data are exported to tab-separated text files as follows. They are then used in the D3 visualization shown in the blog post here
sites.to_csv('sites.csv',sep='\t',index=False)
for i in range(len(selectedSitesList)):
    siteTemps[i].to_csv('sitesTemps' + str(i) + '.csv', sep='\t', index=False)