This captures all the SQL material that I wanted to collate and have in one place.
import sys
import os
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.io import sql
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Recall that SQL is declarative!
Just specify what you want in the SELECT statement.
Recall the lexical ordering for writing queries: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.
But the order of execution is different:
FROM - generates the data set (with joins).
WHERE - reduces the generated data set.
GROUP BY - aggregates the reduced data set.
HAVING - reduces the aggregated data set.
SELECT - transforms the reduced aggregated data set.
ORDER BY - sorts the transformed data set.
LIMIT - OFFSET frames the sorted data set.
To connect normally, I use psql.
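The execution order is easy to see on a toy query. The sketch below uses Python's stdlib sqlite3 as a portable stand-in for Postgres; the table name and values are made up for illustration.

```python
import sqlite3

# Toy table, using SQLite as a portable stand-in for Postgres.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 10), ("b", 20), ("c", 5)])

# WHERE prunes rows first, GROUP BY aggregates, HAVING prunes groups,
# SELECT renames, ORDER BY sorts, LIMIT frames the result.
rows = con.execute(
    "SELECT grp, SUM(val) AS total "
    "FROM t "
    "WHERE val > 1 "          # drops ('a', 1) before aggregation
    "GROUP BY grp "
    "HAVING SUM(val) >= 5 "   # drops group 'a' (its surviving sum is only 2)
    "ORDER BY total DESC "
    "LIMIT 2").fetchall()
print(rows)  # [('b', 30), ('c', 5)]
```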
This is simply:
$ psql test
You then enter the database and can very easily commit changes:
test=# DROP TABLE posts;
sqlalchemy is nice if we want to talk to the database from Python directly.
Create a db connection using sqlalchemy.
The return value of create_engine() is an instance of Engine, and it represents the core interface to the database.
import sqlalchemy
from sqlalchemy import create_engine
engine = create_engine('postgresql://lmartin@localhost:5432/test')
So, I have a test postgres
database in which one column is timestamp
.
I can select dates from this column easily (below).
Note that there are some critical rules about working with timestamp objects:
-07:00 means UTC-07:00 (e.g. PDT; PST is -08:00).
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
all_tests=sql.read_sql("select specimen as s,tested_infection as t "
"from Clinical_Table "
"WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection IN (%s) "%(u,l,infs),engine)
all_tests.head(3)
| | s | t |
|---|---|---|
| 0 | Serum | CMV |
| 1 | Serum | CMV |
| 2 | Serum | CMV |
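The %-interpolation above works, but driver-level parameter binding is safer (no quoting or injection worries). A minimal sketch with stdlib sqlite3 and a made-up miniature of the clinical table; with psycopg2 or SQLAlchemy the placeholder style differs (%(name)s or :name), but the idea is the same.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clinical (specimen TEXT, tested_infection TEXT, tx_date TEXT)")
con.executemany("INSERT INTO clinical VALUES (?, ?, ?)", [
    ("Serum", "CMV", "2013-06-01 12:00:00-07:00"),   # inside the date window
    ("Serum", "GPR", "2014-02-01 12:00:00-07:00"),   # outside the date window
])

l = "2013-01-01 10:00:00-07:00"
u = "2013-12-22 17:00:00-07:00"
# The driver substitutes each ? safely; ISO-8601 strings with the same
# fixed offset compare correctly even as text.
rows = con.execute(
    "SELECT specimen, tested_infection FROM clinical "
    "WHERE tx_date < ? AND tx_date > ? AND tested_infection IN (?, ?, ?)",
    (u, l, "CMV", "ASPERGILLUS", "GPR")).fetchall()
print(rows)  # [('Serum', 'CMV')]
```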
An inner join returns only the entries common to both tables.
# Pull data for patient
i_join=sql.read_sql("select stat_table.patient_id as i "
"from stat_table "
"INNER JOIN sample_table "
"ON stat_table.patient_id=sample_table.patient_id ",engine)
i_join.head(3)
| | i |
|---|---|
| 0 | I10 |
| 1 | I10 |
| 2 | I10 |
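A self-contained sketch of the same inner join on made-up patient ids (stdlib sqlite3 standing in for Postgres): only ids present in both tables survive.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stat_table (patient_id TEXT);
    CREATE TABLE sample_table (patient_id TEXT);
    INSERT INTO stat_table VALUES ('I10'), ('I11'), ('I12');
    INSERT INTO sample_table VALUES ('I10'), ('I12'), ('I99');
""")
rows = con.execute(
    "SELECT stat_table.patient_id "
    "FROM stat_table "
    "INNER JOIN sample_table "
    "ON stat_table.patient_id = sample_table.patient_id").fetchall()
# 'I11' (stat only) and 'I99' (sample only) are gone.
print(sorted(r[0] for r in rows))  # ['I10', 'I12']
```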
Semi-join filters records based upon data in another table.
Note that we are including a sub-query
in the WHERE
clause.
s_join=sql.read_sql("select stat_table.patient_id as i "
"from stat_table "
"WHERE patient_id IN (SELECT sample_table.patient_id FROM sample_table) ",engine)
s_join.head(3)
| | i |
|---|---|
| 0 | I10 |
| 1 | I10 |
| 2 | I10 |
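The practical difference from the inner join shows up when the filtering table holds duplicates: the IN sub-query keeps each stat_table row at most once, while the join multiplies it per match. A toy sketch (stdlib sqlite3, made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stat_table (patient_id TEXT);
    CREATE TABLE sample_table (patient_id TEXT);
    INSERT INTO stat_table VALUES ('I10');
    INSERT INTO sample_table VALUES ('I10'), ('I10'), ('I10');
""")
# Inner join: one output row per matching pair.
inner = con.execute(
    "SELECT stat_table.patient_id FROM stat_table "
    "INNER JOIN sample_table "
    "ON stat_table.patient_id = sample_table.patient_id").fetchall()
# Semi-join via IN: each stat_table row appears at most once.
semi = con.execute(
    "SELECT patient_id FROM stat_table "
    "WHERE patient_id IN (SELECT patient_id FROM sample_table)").fetchall()
print(len(inner), len(semi))  # 3 1
```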
The columns that we define in the GROUP BY are the keys that we can use.
If you have a GROUP BY clause, the SELECT may only reference the grouped columns and aggregates of the remaining columns.
Common aggregate functions that we can apply following GROUP BY:
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
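A quick tour of the common aggregates on a toy table (stdlib sqlite3; names and values invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1.0), ("a", 3.0), ("b", 10.0)])
# One row per group, with the usual aggregates applied to each group.
rows = con.execute(
    "SELECT grp, COUNT(*), SUM(val), AVG(val), MIN(val), MAX(val) "
    "FROM t GROUP BY grp ORDER BY grp").fetchall()
print(rows)  # [('a', 2, 4.0, 2.0, 1.0, 3.0), ('b', 1, 10.0, 10.0, 10.0, 10.0)]
```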
For example, we compute the total number of records within each group!
cd=sql.read_sql("select specimen random_name, count(*) AS c "
"from Clinical_Table "
"GROUP BY random_name ",engine)
cd.head(3)
| | random_name | c |
|---|---|---|
| 0 | Lumen | 460 |
| 1 | Skin | 46 |
| 2 | Nasopharyngeal | 1762 |
We can use the HAVING clause to filter grouped categories.
ORDER BY lets us order the results by a specified column, and LIMIT simply restricts the output count:
out=sql.read_sql("select specimen random_name, count(*) AS c "
"from Clinical_Table "
"GROUP BY specimen "
"HAVING sum(test_result) > 1 "
"order by 2 desc "
"limit 3 ",engine)
out
| | random_name | c |
|---|---|---|
| 0 | Serum | 12502 |
| 1 | BAL | 9969 |
| 2 | Urine | 5823 |
We count the unique number of tested infections per specimen group.
cd=sql.read_sql("select specimen s, count(distinct tested_infection) AS counts "
"from Clinical_Table "
"GROUP BY s ",engine)
print cd.shape
cd.head(3)
(21, 2)
| | s | counts |
|---|---|---|
| 0 | BAL | 75 |
| 1 | Bone Marrow | 3 |
| 2 | CSF | 53 |
A nice thing to recall is that we can partition timestamp objects into weeks.
There are many useful functions we can apply to datetime objects:
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select EXTRACT(WEEK FROM tx_date) as week,tested_infection,specimen,COUNT(*)"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "
" GROUP BY EXTRACT(WEEK FROM tx_date),tested_infection,specimen"%(u,l,infs),engine)
t.head(3)
| | week | tested_infection | specimen | count |
|---|---|---|---|---|
| 0 | 43 | CMV | Tissue | 1 |
| 1 | 51 | ASPERGILLUS | Stool | 3 |
| 2 | 35 | ASPERGILLUS | Whole Blood | 18 |
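EXTRACT(WEEK FROM ...) is Postgres syntax (ISO week numbers). A portable sketch of the same week-bucketing idea in stdlib sqlite3 uses strftime('%W', ...), whose numbering differs slightly (weeks start on Monday, counted from 00); the dates below are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (tx_date TEXT, infection TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2013-01-07", "CMV"),   # same week as the next row
    ("2013-01-08", "CMV"),
    ("2013-02-04", "CMV"),   # a later week
])
# Group by the week bucket, mirroring GROUP BY EXTRACT(WEEK FROM tx_date).
rows = con.execute(
    "SELECT strftime('%W', tx_date) AS week, infection, COUNT(*) "
    "FROM t GROUP BY week, infection ORDER BY week").fetchall()
print(rows)
```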
A window function:
http://postgresguide.com/tips/window.html
Ok, so let's assume we want to rank the specimens sampled per week per tested infection!
First, recall a basic query below.
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select tx_date d,tested_infection t,specimen s"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "%(u,l,infs),engine)
t.head(3)
| | d | t | s |
|---|---|---|---|
| 0 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
| 1 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
| 2 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
We can use window functions
to rank over a certain grouping, as needed.
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select tested_infection as a,specimen as s,rank() OVER (PARTITION BY tested_infection ORDER BY specimen DESC)"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "%(u,l,infs),engine)
t.head(5)
| | a | s | rank |
|---|---|---|---|
| 0 | ASPERGILLUS | Whole Blood | 1 |
| 1 | ASPERGILLUS | Whole Blood | 1 |
| 2 | ASPERGILLUS | Whole Blood | 1 |
| 3 | ASPERGILLUS | Whole Blood | 1 |
| 4 | ASPERGILLUS | Whole Blood | 1 |
Within each tested_infection partition the rows are ranked by specimen name (descending), and tied rows share the same rank.
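A self-contained version of this rank (stdlib sqlite3, which supports window functions from SQLite 3.25 on; toy data). Note how the tie on 'Serum' shares rank 1 and the next specimen jumps to rank 3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (infection TEXT, specimen TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "BAL"),
    ("GPR", "Urine"),
])
# rank() restarts in each infection partition, ordered by specimen DESC.
rows = con.execute(
    "SELECT infection, specimen, "
    "rank() OVER (PARTITION BY infection ORDER BY specimen DESC) AS r "
    "FROM t ORDER BY infection, r").fetchall()
print(rows)
# [('CMV', 'Serum', 1), ('CMV', 'Serum', 1), ('CMV', 'BAL', 3), ('GPR', 'Urine', 1)]
```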
The ORM presents a high-level, abstracted usage pattern associated with sqlalchemy.
When using the ORM, the configuration process starts by describing the database tables we'll be dealing with.
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
Now that we have a “base”, we can define any number of mapped classes in terms of it:
Here's a new table users
.
from sqlalchemy import Column, Integer, String
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String)
fullname = Column(String)
password = Column(String)
With our User class constructed via the Declarative system, we have defined information about our table, known as table metadata.
User.__table__
Table('users', MetaData(bind=None), Column('id', Integer(), table=<users>, primary_key=True, nullable=False), Column('name', String(), table=<users>), Column('fullname', String(), table=<users>), Column('password', String(), table=<users>), schema=None)
As our database does not actually have a users table present, we can use MetaData to issue CREATE TABLE.
Base.metadata.create_all(engine)
Now the table exists.
We’re now ready to start talking to the database.
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
Whenever you need to have a conversation with the database, you instantiate a Session.
session = Session()
ed_user = User(name='ed', fullname='Ed Jones', password='edspassword')
session.add(ed_user)
session.add_all([
User(name='wendy', fullname='Wendy Williams', password='foobar'),
User(name='mary', fullname='Mary Contrary', password='xxg527'),
User(name='fred', fullname='Fred Flinstone', password='blah')])
We tell the Session that we’d like to issue all remaining changes to the database and commit the transaction.
session.commit()
Now, let's make sure we can read it back out.
t=sql.read_sql(" select *"
" from Users ",engine)
t.head(3)
| | id | name | fullname | password |
|---|---|---|---|---|
| 0 | 1 | ed | Ed Jones | edspassword |
| 1 | 2 | wendy | Wendy Williams | foobar |
| 2 | 3 | mary | Mary Contrary | xxg527 |
Let's define a new table.
Base = declarative_base()
Note that we can re-define an existing mapped class by setting extend_existing:
Post.__table_args__ = {'extend_existing': True}
Comments.__table_args__ = {'extend_existing': True}
from sqlalchemy.dialects.postgresql import CHAR, TIMESTAMP
class Post(Base):
__tablename__ = 'posts'
id = Column(Integer, primary_key=True)
body = Column(String)
created_at = Column(TIMESTAMP)
class Comments(Base):
__tablename__ = 'comments'
id = Column(Integer, primary_key=True)
post_id = Column(Integer)
body = Column(String)
created_at = Column(TIMESTAMP)
Create the tables.
Base.metadata.create_all(engine)
Add data to the new tables.
session = Session()
session.add_all([
Post(id=1, body='foo'),
Post(id=2, body='bar'),])
session.add_all([
Comments(id=1, post_id=1, body='foo old'),
Comments(id=2, post_id=1, body='foo new'),
Comments(id=3, post_id=1, body='foo newer'),
Comments(id=4, post_id=1, body='foo newest'),])
session.add_all([
Comments(id=5, post_id=2, body='bar old'),
Comments(id=6, post_id=2, body='bar new'),
Comments(id=7, post_id=2, body='bar newer'),
Comments(id=8, post_id=2, body='bar newest'),])
session.commit()
p=sql.read_sql(" select *"
" from posts ",engine)
p
| | id | body | created_at |
|---|---|---|---|
| 0 | 1 | foo | None |
| 1 | 2 | bar | None |
c=sql.read_sql(" select *"
" from Comments ",engine)
c
| | id | post_id | body | created_at |
|---|---|---|---|---|
| 0 | 1 | 1 | foo old | None |
| 1 | 2 | 1 | foo new | None |
| 2 | 3 | 1 | foo newer | None |
| 3 | 4 | 1 | foo newest | None |
| 4 | 5 | 2 | bar old | None |
| 5 | 6 | 2 | bar new | None |
| 6 | 7 | 2 | bar newer | None |
| 7 | 8 | 2 | bar newest | None |
c=sql.read_sql(" select posts.id AS post_id, comments.id AS comment_ids, comments.body AS body "
" from posts LEFT OUTER JOIN comments on posts.id=comments.post_id ",engine)
c
| | post_id | comment_ids | body |
|---|---|---|---|
| 0 | 1 | 1 | foo old |
| 1 | 1 | 2 | foo new |
| 2 | 1 | 3 | foo newer |
| 3 | 1 | 4 | foo newest |
| 4 | 2 | 5 | bar old |
| 5 | 2 | 6 | bar new |
| 6 | 2 | 7 | bar newer |
| 7 | 2 | 8 | bar newest |
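Every post here happens to have comments, so the LEFT OUTER JOIN looks identical to an inner join. With an uncommented post the difference appears: the left join keeps it and pads the right side with NULL. A sketch on made-up rows (stdlib sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'foo'), (2, 'bar'), (3, 'baz');
    INSERT INTO comments VALUES (1, 1, 'foo old'), (2, 2, 'bar old');
""")
# Post 3 has no comments: the left join keeps it, with NULL for the body.
rows = con.execute(
    "SELECT posts.id, comments.body FROM posts "
    "LEFT OUTER JOIN comments ON posts.id = comments.post_id "
    "ORDER BY posts.id").fetchall()
print(rows)  # [(1, 'foo old'), (2, 'bar old'), (3, None)]
```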
Let's say we want something more complex, like the fraction of tests for a given infection that are from a given specimen.
cd=sql.read_sql("select * from Clinical_Table ",engine)
print cd.shape
a=cd.groupby(['tested_infection']).size()
a.sort(ascending=False)
b=cd[cd['tested_infection'] == 'CMV']
ser=b[b['specimen'] == 'Serum']
ser.shape[0]/float(b.shape[0])
(43251, 10)
0.835023664638269
Group by
using Pandas
is nice for this!
all_tests=sql.read_sql("select * from Clinical_Table",engine)
a=all_tests[['tested_infection','specimen']]
b=a.groupby('tested_infection').agg(lambda x:float(x[x.specimen=='Serum'].shape[0])/x.shape[0])
b.sort('specimen',ascending=False,inplace=True)
b.head(4)
| tested_infection | specimen |
|---|---|
| Hepatitis A AB | 1.000000 |
| HEPATITIS B | 1.000000 |
| EBV | 0.881657 |
| CMV | 0.835024 |
But we can also use derived tables to get this same result!
A derived table is a sub-query in the FROM clause that behaves like a temporary, named table.
tmp=sql.read_sql(" select tested_infection,COUNT(*)/CAST(total AS float) as pct "
" from Clinical_Table c "
" inner join "
" ( "
" SELECT tested_infection as inf, COUNT(*) as total "
" from Clinical_Table "
" GROUP BY tested_infection "
" ) d "
" on c.tested_infection = d.inf "
" WHERE c.specimen = 'Serum' "
" GROUP BY tested_infection,total ",engine)
print tmp.shape
tmp.sort('pct',inplace=True,ascending=False)
tmp.head(4)
(57, 2)
| | tested_infection | pct |
|---|---|---|
| 2 | HEPATITIS B | 1.000000 |
| 6 | Hepatitis A AB | 1.000000 |
| 17 | EBV | 0.881657 |
| 9 | CMV | 0.835024 |
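The same derived-table pattern on a self-contained toy table (stdlib sqlite3; invented counts): the sub-query d supplies each infection's total, and the outer query divides the Serum count by it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clinical (tested_infection TEXT, specimen TEXT)")
con.executemany("INSERT INTO clinical VALUES (?, ?)", [
    ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "BAL"),
    ("EBV", "Serum"),
])
# d is the derived table: per-infection totals computed once, then joined.
rows = con.execute(
    "SELECT c.tested_infection, COUNT(*) / CAST(d.total AS REAL) AS pct "
    "FROM clinical c "
    "INNER JOIN ("
    "  SELECT tested_infection AS inf, COUNT(*) AS total "
    "  FROM clinical GROUP BY tested_infection"
    ") d ON c.tested_infection = d.inf "
    "WHERE c.specimen = 'Serum' "
    "GROUP BY c.tested_infection, d.total "
    "ORDER BY pct DESC").fetchall()
print(rows)  # [('EBV', 1.0), ('CMV', 0.75)]
```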