This captures all the SQL material that I wanted to collate and have in one place.
import sys
import os
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.io import sql
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Recall that SQL is declarative!
Just specify what you want in the SELECT statement.
Recall the lexical ordering for writing queries: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.
But the order of execution is different:
FROM - generates the data set (with joins).
WHERE - reduces the generated data set.
GROUP BY - aggregates the reduced data set.
HAVING - reduces the aggregated data set.
SELECT - transforms the reduced aggregated data set.
ORDER BY - sorts the transformed data set.
LIMIT - OFFSET frames the sorted data set.
To connect normally, I use psql.
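The execution order is easy to see on a toy query. The sketch below uses Python's stdlib sqlite3 as a portable stand-in for Postgres; the table name and values are made up for illustration.

```python
import sqlite3

# Toy table, using SQLite as a portable stand-in for Postgres.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 10), ("b", 20), ("c", 5)])

# WHERE prunes rows first, GROUP BY aggregates, HAVING prunes groups,
# SELECT renames, ORDER BY sorts, LIMIT frames the result.
rows = con.execute(
    "SELECT grp, SUM(val) AS total "
    "FROM t "
    "WHERE val > 1 "          # drops ('a', 1) before aggregation
    "GROUP BY grp "
    "HAVING SUM(val) >= 5 "   # drops group 'a' (its surviving sum is only 2)
    "ORDER BY total DESC "
    "LIMIT 2").fetchall()
print(rows)  # [('b', 30), ('c', 5)]
```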
This is simply:
$ psql test
You then enter the database and can very easily commit changes:
test=# DROP TABLE posts;
sqlalchemy is nice if we want to talk to the database from Python directly.
Create a db connection using sqlalchemy.
The return value of create_engine() is an instance of Engine, and it represents the core interface to the database.
import sqlalchemy
from sqlalchemy import create_engine
engine = create_engine('postgresql://lmartin@localhost:5432/test')
So, I have a test postgres
database in which one column is timestamp
.
I can select dates from this column easily (below).
Note that there are some critical rules about working with timestamp objects:
-07:00 means UTC-07:00 (e.g. PDT; PST is -08:00).
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
all_tests=sql.read_sql("select specimen as s,tested_infection as t "
"from Clinical_Table "
"WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection IN (%s) "%(u,l,infs),engine)
all_tests.head(3)
| | s | t |
|---|---|---|
| 0 | Serum | CMV |
| 1 | Serum | CMV |
| 2 | Serum | CMV |
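The %-interpolation above works, but driver-level parameter binding is safer (no quoting or injection worries). A minimal sketch with stdlib sqlite3 and a made-up miniature of the clinical table; with psycopg2 or SQLAlchemy the placeholder style differs (%(name)s or :name), but the idea is the same.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clinical (specimen TEXT, tested_infection TEXT, tx_date TEXT)")
con.executemany("INSERT INTO clinical VALUES (?, ?, ?)", [
    ("Serum", "CMV", "2013-06-01 12:00:00-07:00"),   # inside the date window
    ("Serum", "GPR", "2014-02-01 12:00:00-07:00"),   # outside the date window
])

l = "2013-01-01 10:00:00-07:00"
u = "2013-12-22 17:00:00-07:00"
# The driver substitutes each ? safely; ISO-8601 strings with the same
# fixed offset compare correctly even as text.
rows = con.execute(
    "SELECT specimen, tested_infection FROM clinical "
    "WHERE tx_date < ? AND tx_date > ? AND tested_infection IN (?, ?, ?)",
    (u, l, "CMV", "ASPERGILLUS", "GPR")).fetchall()
print(rows)  # [('Serum', 'CMV')]
```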
An inner join returns only the entries common to both tables.
# Pull data for patient
i_join=sql.read_sql("select stat_table.patient_id as i "
"from stat_table "
"INNER JOIN sample_table "
"ON stat_table.patient_id=sample_table.patient_id ",engine)
i_join.head(3)
| | i |
|---|---|
| 0 | I10 |
| 1 | I10 |
| 2 | I10 |
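A self-contained sketch of the same inner join on made-up patient ids (stdlib sqlite3 standing in for Postgres): only ids present in both tables survive.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stat_table (patient_id TEXT);
    CREATE TABLE sample_table (patient_id TEXT);
    INSERT INTO stat_table VALUES ('I10'), ('I11'), ('I12');
    INSERT INTO sample_table VALUES ('I10'), ('I12'), ('I99');
""")
rows = con.execute(
    "SELECT stat_table.patient_id "
    "FROM stat_table "
    "INNER JOIN sample_table "
    "ON stat_table.patient_id = sample_table.patient_id").fetchall()
# 'I11' (stat only) and 'I99' (sample only) are gone.
print(sorted(r[0] for r in rows))  # ['I10', 'I12']
```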
Semi-join filters records based upon data in another table.
Note that we are including a sub-query
in the WHERE
clause.
s_join=sql.read_sql("select stat_table.patient_id as i "
"from stat_table "
"WHERE patient_id IN (SELECT sample_table.patient_id FROM sample_table) ",engine)
s_join.head(3)
| | i |
|---|---|
| 0 | I10 |
| 1 | I10 |
| 2 | I10 |
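The practical difference from the inner join shows up when the filtering table holds duplicates: the IN sub-query keeps each stat_table row at most once, while the join multiplies it per match. A toy sketch (stdlib sqlite3, made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stat_table (patient_id TEXT);
    CREATE TABLE sample_table (patient_id TEXT);
    INSERT INTO stat_table VALUES ('I10');
    INSERT INTO sample_table VALUES ('I10'), ('I10'), ('I10');
""")
# Inner join: one output row per matching pair.
inner = con.execute(
    "SELECT stat_table.patient_id FROM stat_table "
    "INNER JOIN sample_table "
    "ON stat_table.patient_id = sample_table.patient_id").fetchall()
# Semi-join via IN: each stat_table row appears at most once.
semi = con.execute(
    "SELECT patient_id FROM stat_table "
    "WHERE patient_id IN (SELECT patient_id FROM sample_table)").fetchall()
print(len(inner), len(semi))  # 3 1
```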
The columns that we define in the GROUP BY are the keys that we can use.
If you have a GROUP BY clause, the SELECT may only reference the grouped columns and aggregates of the remaining columns.
Common aggregate functions that we can apply following GROUP BY:
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
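A quick tour of the common aggregates on a toy table (stdlib sqlite3; names and values invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1.0), ("a", 3.0), ("b", 10.0)])
# One row per group, with the usual aggregates applied to each group.
rows = con.execute(
    "SELECT grp, COUNT(*), SUM(val), AVG(val), MIN(val), MAX(val) "
    "FROM t GROUP BY grp ORDER BY grp").fetchall()
print(rows)  # [('a', 2, 4.0, 2.0, 1.0, 3.0), ('b', 1, 10.0, 10.0, 10.0, 10.0)]
```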
For example, we compute the total number of records within each group!
cd=sql.read_sql("select specimen random_name, count(*) AS c "
"from Clinical_Table "
"GROUP BY random_name ",engine)
cd.head(3)
| | random_name | c |
|---|---|---|
| 0 | Lumen | 460 |
| 1 | Skin | 46 |
| 2 | Nasopharyngeal | 1762 |
We can use the HAVING clause to filter grouped categories.
ORDER BY lets us order the results by a specified column, and LIMIT simply restricts the output count:
out=sql.read_sql("select specimen random_name, count(*) AS c "
"from Clinical_Table "
"GROUP BY specimen "
"HAVING sum(test_result) > 1 "
"order by 2 desc "
"limit 3 ",engine)
out
| | random_name | c |
|---|---|---|
| 0 | Serum | 12502 |
| 1 | BAL | 9969 |
| 2 | Urine | 5823 |
We count the unique number of tested infections per specimen group.
cd=sql.read_sql("select specimen s, count(distinct tested_infection) AS counts "
"from Clinical_Table "
"GROUP BY s ",engine)
print cd.shape
cd.head(3)
(21, 2)
| | s | counts |
|---|---|---|
| 0 | BAL | 75 |
| 1 | Bone Marrow | 3 |
| 2 | CSF | 53 |
A nice thing to recall is that we can partition timestamp objects into weeks.
There are many useful functions we can apply to datetime objects:
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select EXTRACT(WEEK FROM tx_date) as week,tested_infection,specimen,COUNT(*)"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "
" GROUP BY EXTRACT(WEEK FROM tx_date),tested_infection,specimen"%(u,l,infs),engine)
t.head(3)
| | week | tested_infection | specimen | count |
|---|---|---|---|---|
| 0 | 43 | CMV | Tissue | 1 |
| 1 | 51 | ASPERGILLUS | Stool | 3 |
| 2 | 35 | ASPERGILLUS | Whole Blood | 18 |
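EXTRACT(WEEK FROM ...) is Postgres syntax (ISO week numbers). A portable sketch of the same week-bucketing idea in stdlib sqlite3 uses strftime('%W', ...), whose numbering differs slightly (weeks start on Monday, counted from 00); the dates below are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (tx_date TEXT, infection TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2013-01-07", "CMV"),   # same week as the next row
    ("2013-01-08", "CMV"),
    ("2013-02-04", "CMV"),   # a later week
])
# Group by the week bucket, mirroring GROUP BY EXTRACT(WEEK FROM tx_date).
rows = con.execute(
    "SELECT strftime('%W', tx_date) AS week, infection, COUNT(*) "
    "FROM t GROUP BY week, infection ORDER BY week").fetchall()
print(rows)
```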
A window function:
http://postgresguide.com/tips/window.html
Ok, so let's assume we want to rank the specimens sampled per week per tested infection!
First, recall a basic query below.
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select tx_date d,tested_infection t,specimen s"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "%(u,l,infs),engine)
t.head(3)
| | d | t | s |
|---|---|---|---|
| 0 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
| 1 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
| 2 | 2013-08-30 00:00:00-07:00 | CMV | Serum |
We can use window functions
to rank over a certain grouping, as needed.
l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query!
u='2013-12-22 17:00:00-07:00'
infs=" 'CMV','ASPERGILLUS','GPR' "
t=sql.read_sql(" select tested_infection as a,specimen as s,rank() OVER (PARTITION BY tested_infection ORDER BY specimen DESC)"
" from Clinical_Table "
" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) "%(u,l,infs),engine)
t.head(5)
| | a | s | rank |
|---|---|---|---|
| 0 | ASPERGILLUS | Whole Blood | 1 |
| 1 | ASPERGILLUS | Whole Blood | 1 |
| 2 | ASPERGILLUS | Whole Blood | 1 |
| 3 | ASPERGILLUS | Whole Blood | 1 |
| 4 | ASPERGILLUS | Whole Blood | 1 |
Within each tested_infection partition the rows are ranked by specimen name (descending), and tied rows share the same rank.
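A self-contained version of this rank (stdlib sqlite3, which supports window functions from SQLite 3.25 on; toy data). Note how the tie on 'Serum' shares rank 1 and the next specimen jumps to rank 3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (infection TEXT, specimen TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "BAL"),
    ("GPR", "Urine"),
])
# rank() restarts in each infection partition, ordered by specimen DESC.
rows = con.execute(
    "SELECT infection, specimen, "
    "rank() OVER (PARTITION BY infection ORDER BY specimen DESC) AS r "
    "FROM t ORDER BY infection, r").fetchall()
print(rows)
# [('CMV', 'Serum', 1), ('CMV', 'Serum', 1), ('CMV', 'BAL', 3), ('GPR', 'Urine', 1)]
```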
The ORM presents a high-level, abstracted usage pattern associated with sqlalchemy.
When using the ORM, the configuration process starts by describing the database tables we'll be dealing with.
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
Now that we have a “base”, we can define any number of mapped classes in terms of it:
Here's a new table users
.
from sqlalchemy import Column, Integer, String
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String)
fullname = Column(String)
password = Column(String)
With our User class constructed via the Declarative system, we have defined information about our table, known as table metadata.
User.__table__
Table('users', MetaData(bind=None), Column('id', Integer(), table=<users>, primary_key=True, nullable=False), Column('name', String(), table=<users>), Column('fullname', String(), table=<users>), Column('password', String(), table=<users>), schema=None)
As our database does not actually have a users table present, we can use MetaData to issue CREATE TABLE.
Base.metadata.create_all(engine)
Now the table exists.
We’re now ready to start talking to the database.
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
Whenever you need to have a conversation with the database, you instantiate a Session.
session = Session()
ed_user = User(name='ed', fullname='Ed Jones', password='edspassword')
session.add(ed_user)
session.add_all([
User(name='wendy', fullname='Wendy Williams', password='foobar'),
User(name='mary', fullname='Mary Contrary', password='xxg527'),
User(name='fred', fullname='Fred Flinstone', password='blah')])
We tell the Session that we’d like to issue all remaining changes to the database and commit the transaction.
session.commit()
Now, let's make sure we can read it back out.
t=sql.read_sql(" select *"
" from Users ",engine)
t.head(3)
| | id | name | fullname | password |
|---|---|---|---|---|
| 0 | 1 | ed | Ed Jones | edspassword |
| 1 | 2 | wendy | Wendy Williams | foobar |
| 2 | 3 | mary | Mary Contrary | xxg527 |
Let's define a new table.
Base = declarative_base()
Note that we can re-define an existing mapped class by setting extend_existing:
Post.__table_args__ = {'extend_existing': True}
Comments.__table_args__ = {'extend_existing': True}
from sqlalchemy.dialects.postgresql import CHAR, TIMESTAMP
class Post(Base):
__tablename__ = 'posts'
id = Column(Integer, primary_key=True)
body = Column(String)
created_at = Column(TIMESTAMP)
class Comments(Base):
__tablename__ = 'comments'
id = Column(Integer, primary_key=True)
post_id = Column(Integer)
body = Column(String)
created_at = Column(TIMESTAMP)
Create the tables.
Base.metadata.create_all(engine)
Add data to the new tables.
session = Session()
session.add_all([
Post(id=1, body='foo'),
Post(id=2, body='bar'),])
session.add_all([
Comments(id=1, post_id=1, body='foo old'),
Comments(id=2, post_id=1, body='foo new'),
Comments(id=3, post_id=1, body='foo newer'),
Comments(id=4, post_id=1, body='foo newest'),])
session.add_all([
Comments(id=5, post_id=2, body='bar old'),
Comments(id=6, post_id=2, body='bar new'),
Comments(id=7, post_id=2, body='bar newer'),
Comments(id=8, post_id=2, body='bar newest'),])
session.commit()
p=sql.read_sql(" select *"
" from posts ",engine)
p
| | id | body | created_at |
|---|---|---|---|
| 0 | 1 | foo | None |
| 1 | 2 | bar | None |
c=sql.read_sql(" select *"
" from Comments ",engine)
c
| | id | post_id | body | created_at |
|---|---|---|---|---|
| 0 | 1 | 1 | foo old | None |
| 1 | 2 | 1 | foo new | None |
| 2 | 3 | 1 | foo newer | None |
| 3 | 4 | 1 | foo newest | None |
| 4 | 5 | 2 | bar old | None |
| 5 | 6 | 2 | bar new | None |
| 6 | 7 | 2 | bar newer | None |
| 7 | 8 | 2 | bar newest | None |
c=sql.read_sql(" select posts.id AS post_id, comments.id AS comment_ids, comments.body AS body "
" from posts LEFT OUTER JOIN comments on posts.id=comments.post_id ",engine)
c
| | post_id | comment_ids | body |
|---|---|---|---|
| 0 | 1 | 1 | foo old |
| 1 | 1 | 2 | foo new |
| 2 | 1 | 3 | foo newer |
| 3 | 1 | 4 | foo newest |
| 4 | 2 | 5 | bar old |
| 5 | 2 | 6 | bar new |
| 6 | 2 | 7 | bar newer |
| 7 | 2 | 8 | bar newest |
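Every post here happens to have comments, so the LEFT OUTER JOIN looks identical to an inner join. With an uncommented post the difference appears: the left join keeps it and pads the right side with NULL. A sketch on made-up rows (stdlib sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'foo'), (2, 'bar'), (3, 'baz');
    INSERT INTO comments VALUES (1, 1, 'foo old'), (2, 2, 'bar old');
""")
# Post 3 has no comments: the left join keeps it, with NULL for the body.
rows = con.execute(
    "SELECT posts.id, comments.body FROM posts "
    "LEFT OUTER JOIN comments ON posts.id = comments.post_id "
    "ORDER BY posts.id").fetchall()
print(rows)  # [(1, 'foo old'), (2, 'bar old'), (3, None)]
```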
Let's say we want something more complex, like the fraction of tests for a given infection that are from a given specimen.
cd=sql.read_sql("select * from Clinical_Table ",engine)
print cd.shape
a=cd.groupby(['tested_infection']).size()
a.sort(ascending=False)
b=cd[cd['tested_infection'] == 'CMV']
ser=b[b['specimen'] == 'Serum']
ser.shape[0]/float(b.shape[0])
(43251, 10)
0.835023664638269
Group by
using Pandas
is nice for this!
all_tests=sql.read_sql("select * from Clinical_Table",engine)
a=all_tests[['tested_infection','specimen']]
b=a.groupby('tested_infection').agg(lambda x:float(x[x.specimen=='Serum'].shape[0])/x.shape[0])
b.sort('specimen',ascending=False,inplace=True)
b.head(4)
| tested_infection | specimen |
|---|---|
| Hepatitis A AB | 1.000000 |
| HEPATITIS B | 1.000000 |
| EBV | 0.881657 |
| CMV | 0.835024 |
But we can also use derived tables to get this same result!
A derived table is a sub-query in the FROM clause that behaves like a temporary, named table.
tmp=sql.read_sql(" select tested_infection,COUNT(*)/CAST(total AS float) as pct "
" from Clinical_Table c "
" inner join "
" ( "
" SELECT tested_infection as inf, COUNT(*) as total "
" from Clinical_Table "
" GROUP BY tested_infection "
" ) d "
" on c.tested_infection = d.inf "
" WHERE c.specimen = 'Serum' "
" GROUP BY tested_infection,total ",engine)
print tmp.shape
tmp.sort('pct',inplace=True,ascending=False)
tmp.head(4)
(57, 2)
| | tested_infection | pct |
|---|---|---|
| 2 | HEPATITIS B | 1.000000 |
| 6 | Hepatitis A AB | 1.000000 |
| 17 | EBV | 0.881657 |
| 9 | CMV | 0.835024 |
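The same derived-table pattern on a self-contained toy table (stdlib sqlite3; invented counts): the sub-query d supplies each infection's total, and the outer query divides the Serum count by it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clinical (tested_infection TEXT, specimen TEXT)")
con.executemany("INSERT INTO clinical VALUES (?, ?)", [
    ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "Serum"), ("CMV", "BAL"),
    ("EBV", "Serum"),
])
# d is the derived table: per-infection totals computed once, then joined.
rows = con.execute(
    "SELECT c.tested_infection, COUNT(*) / CAST(d.total AS REAL) AS pct "
    "FROM clinical c "
    "INNER JOIN ("
    "  SELECT tested_infection AS inf, COUNT(*) AS total "
    "  FROM clinical GROUP BY tested_infection"
    ") d ON c.tested_infection = d.inf "
    "WHERE c.specimen = 'Serum' "
    "GROUP BY c.tested_infection, d.total "
    "ORDER BY pct DESC").fetchall()
print(rows)  # [('EBV', 1.0), ('CMV', 0.75)]
```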