You're sold on SQL as a powerful language for querying data, and on PostgreSQL as a powerful engine for storing and sharing that data. But how do you get data in there? That's a fine question. In this tutorial, I'm going to show you the steps necessary to import a CSV into PostgreSQL.
We're going to use the MovieLens review data that you worked with in a previous tutorial. If you don't have the data handy anymore, download it here. For convenience, unzip the file in the same directory as this notebook.
PostgreSQL has a built-in command---`\copy`---that can import data from a CSV file. However, that command can only import data into an existing table. In order to make the command work, we first need to create tables to contain the data.
Our eventual goal is to create tables for the users, reviews, and movies in the MovieLens data set, then use the `\copy` command to import the data into those tables. First, though, let's do a smaller example, just to acclimatize ourselves to the basics.
The task of creating a table consists of figuring out which columns we need, and then figuring out the correct data type for each column. The kinds of types supported by SQL (and PostgreSQL in particular) are different from the data types that we're used to working with in Python, though there are some obvious analogues between the types supported by both. Here's a list of all the types supported in PostgreSQL.
The data types you're most likely to encounter in SQL are the following:
type | description |
---|---|
`int` | an integer value (exact range varies depending on database) |
`numeric(digits, fraction)` | a number with user-specified precision (see below) |
`varchar(n)` | a string of characters with a given maximum length n |
`text` | a string with unlimited length |
`timestamp` | holds a date and time (e.g., 2015-08-01T12:34:56) |
`date` | holds a date only (e.g., 2015-08-01) |
`boolean` | holds true or false |
Let's say that we're creating a table to store information about the widgets from a previous homework assignment. Looking at the data, we can make the following determinations:
That's enough information to create the necessary table. The command to create a table is, unsurprisingly enough, `CREATE TABLE`, and the syntax looks like this:
CREATE TABLE table_name (
field1 datatype1,
field2 datatype2,
field3 datatype3...
)
... replacing `table_name` with the desired name of the table, replacing `field1`, `field2`, etc. with the names of the desired fields, and replacing `datatype1`, `datatype2`, etc. with the names of the desired data types for the corresponding fields.
Okay, now we know enough to create a table for our widgets. Let's create a database to hold the table, first of all. Open `psql` and enter the following commands:
CREATE DATABASE widgetdb;
\c widgetdb
Once you've created the database and connected to it, you can run the create table command for our widget data. Here's what it looks like:
CREATE TABLE widget (
part_no varchar(20),
name text,
price numeric(10, 2),
quantity int
);
Copy that text, paste it into `psql`, and hit Enter. You should see a confirmation message (`CREATE TABLE`); if you don't, check again and make sure that you got the syntax right.
The `\d` command should now display the table that you just created:
widgetdb=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+-------+---------
public | widget | table | allison
(1 row)
widgetdb=# \d widget
Table "public.widget"
Column | Type | Modifiers
----------+-----------------------+-----------
part_no | character varying(20) |
name | text |
price | numeric(10,2) |
quantity | integer |
Computers are notoriously imprecise when handling floating-point numbers (see here for more details). This is problematic in many situations, especially when you're using a floating point number to represent money. (Recall the rounding error subplots from Superman III and Office Space.) The `numeric` type side-steps this problem with a different internal representation of numbers that isn't subject to these errors. When specifying a numeric field, you need to give the type two parameters: the total number of digits that the number should be able to contain, and the number of those digits that can go after the decimal point. (E.g., `numeric(10, 2)` specifies a numeric field that can store a number as high as 99999999.99---definitely sufficient for our widget database.)
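The contrast is easy to see in Python, whose standard-library `decimal` module uses the same kind of exact decimal representation that `numeric` does (a sketch, not PostgreSQL's actual internals):

```python
from decimal import Decimal

# Binary floating point can't represent 0.1 or 0.2 exactly,
# so small errors creep into ordinary arithmetic:
print(0.1 + 0.2)                        # 0.30000000000000004

# Decimal values, like PostgreSQL's numeric type, are exact:
print(Decimal("0.1") + Decimal("0.2"))  # 0.3

# Exactness matters for money: 512 Skinner Widgets at $2.70 each.
print(Decimal("2.70") * 512)            # 1382.40
```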
Ah, a fresh new table with no mistakes in it yet. Or data, for that matter. Now that we have a table, how do we get data into it? With the `INSERT` command!
The `INSERT` command allows you to add data one row at a time to the table. The syntax looks like this:
INSERT INTO table_name (col1, col2, col3, ...) VALUES (val1, val2, val3, ...)
... replacing `table_name` with the name of the table you want to insert values into, `col1` (etc.) with the columns you want to insert data into, and `val1` (etc.) with the corresponding data. For example, here's the first widget listed on the widget page:
Part No. | Name | Price | Quantity in warehouse |
---|---|---|---|
C1-9476 | Skinner Widget | $2.70 | 512 |
To insert the data from this row, issue the following command:
INSERT INTO widget (part_no, name, price, quantity) VALUES ('C1-9476', 'Skinner Widget', 2.70, 512);
You'll get a confirmation message (i.e., `INSERT 0 1`) that tells you that the insert was successful. (If you didn't get that message, check the syntax and try again.)
Confirm that the insert was successful by issuing a `SELECT` statement:
widgetdb=# SELECT * FROM widget;
part_no | name | price | quantity
---------+----------------+-------+----------
C1-9476 | Skinner Widget | 2.70 | 512
(1 row)
Hey perfect! It worked.
You could continue on in this way, issuing `INSERT` statements for each widget that you want to add to the table. This is kind of tedious, obviously. In real life, you'd probably get the data into your database by importing it from another source. One method of doing this would be to programmatically issue `INSERT` statements in Python (say, looping over a list of dictionaries from an API).
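Here's a minimal sketch of that approach. Python's built-in `sqlite3` module stands in for PostgreSQL so the example runs without a server; against a real PostgreSQL database you'd use a driver such as psycopg2 (with `%s` placeholders instead of `?`). The second widget row is invented for illustration.

```python
import sqlite3

# Rows as they might arrive from an API or a file. The first is the
# Skinner Widget from above; the second is a made-up example.
widgets = [
    {"part_no": "C1-9476", "name": "Skinner Widget", "price": 2.70, "quantity": 512},
    {"part_no": "X0-0000", "name": "Example Widget", "price": 1.05, "quantity": 18},
]

# sqlite3 stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widget (part_no varchar(20), name text, price numeric, quantity int)"
)
for w in widgets:
    # Placeholders let the driver quote each value safely.
    conn.execute(
        "INSERT INTO widget (part_no, name, price, quantity) VALUES (?, ?, ?, ?)",
        (w["part_no"], w["name"], w["price"], w["quantity"]),
    )
conn.commit()
print(conn.execute("SELECT count(*) FROM widget").fetchone()[0])  # 2
```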
Another method is to use PostgreSQL's `\copy` command, which takes a CSV file and inserts its contents, row by row, into an existing table.
But there has to be an existing table. So before we can use the `\copy` command, we need to create the table first. Let's try it out with the `u.user` file from the MovieLens data set.
Here are the first several lines from that file:
!head -10 ml-100k/u.user
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
Referring back to the `README` file, the meanings of these fields are:

user id | age | gender | occupation | zip code
Here are the data types I would pick for these fields:

- user id: `int`
- age: `int`
- gender: `varchar(1)`
- occupation: `varchar(80)`
- zip code: `varchar(10)` (in case there are any extended ZIP codes?)

And the `CREATE TABLE` statement:
CREATE TABLE uuser (
user_id int,
age int,
gender varchar(1),
occupation varchar(80),
zip_code varchar(10)
);
(Note that I called the table `uuser`---you can't just use the name `user`, as it's a reserved word in SQL.)
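Why `varchar` for the ZIP code instead of `int`? ZIP codes are labels, not quantities: user 8's ZIP code above is 05201, and an integer column would silently drop the leading zero. A quick Python illustration of the problem:

```python
zip_code = "05201"       # user 8's ZIP code from u.user

# Treating it as a number loses information:
as_int = int(zip_code)
print(as_int)                    # 5201 -- the leading zero is gone
print(str(as_int) == zip_code)   # False
```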
Okay, cool. Now that the table exists, we can use the `\copy` command to import the actual data. Here's what the command looks like:
\copy table_name from path delimiter ',' csv
The words `from`, `delimiter`, and `csv` are part of the command. The parts you should change are `table_name` (replace this with the name of the table you want to import into), `path` (replace this with the path to the CSV file on your hard drive), and `','` (replace the comma with whatever character separates the items on each line of the file).
So, for example, to import the data from `u.user`:
\copy uuser from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.user delimiter '|' csv
Make sure to replace the path with the path to your actual file. If everything's worked out correctly, you should see a confirmation (`COPY 943`, indicating that 943 rows were copied). If not, check the syntax and try again.
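Conceptually, `\copy` is doing something like the sketch below: split each line on the delimiter, then insert the fields as a row. Again, `sqlite3` stands in for PostgreSQL and an in-memory string stands in for the file on disk, so this is an illustration of the idea rather than the real mechanism:

```python
import csv
import io
import sqlite3

# Two lines in the same format as u.user, in place of the real file.
data = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uuser (user_id int, age int, gender varchar(1),"
    " occupation varchar(80), zip_code varchar(10))"
)

# csv.reader handles the delimiter (and any quoting), much like the
# delimiter '|' csv options in the \copy command.
rows = csv.reader(data, delimiter="|")
conn.executemany("INSERT INTO uuser VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT count(*) FROM uuser").fetchone()[0])  # 2
```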
You can confirm that the copy worked by issuing a `SELECT` statement, like:
widgetdb=# SELECT * FROM uuser WHERE gender = 'F' AND occupation = 'engineer';
user_id | age | gender | occupation | zip_code
---------+-----+--------+------------+----------
786 | 36 | F | engineer | 01754
827 | 23 | F | engineer | 80228
(2 rows)
Let's repeat the process with the `u.data` file. Here's what the data looks like:
!head -10 ml-100k/u.data
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
According to the README, here's what the columns mean:
user id | item id | rating | timestamp
My best guess at a table schema:
CREATE TABLE udata (
user_id int,
item_id int,
rating int,
timestamp int
);
And the `\copy` command:
\copy udata from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.data delimiter E'\t' csv
(Why the `E` in front of the `'\t'`? I had to google it: the `E` prefix marks an "escape" string literal, which tells PostgreSQL to interpret backslash sequences like `\t` as a tab character rather than as a literal backslash and t. This is computers we're talking about here; there's always one damn thing or another getting in the way of your doing what you want to do.)
If everything goes according to plan, you'll see a confirmation (`COPY 100000`). You can test the data with a `SELECT` statement like so:
widgetdb=# SELECT user_id, avg(rating), count(rating)
FROM udata GROUP BY user_id ORDER BY avg(rating) DESC LIMIT 10;
user_id | avg | count
---------+--------------------+-------
849 | 4.8695652173913043 | 23
688 | 4.8333333333333333 | 24
507 | 4.7241379310344828 | 58
628 | 4.7037037037037037 | 27
928 | 4.6875000000000000 | 32
118 | 4.6619718309859155 | 71
907 | 4.5714285714285714 | 147
686 | 4.5633802816901408 | 71
427 | 4.5483870967741935 | 31
565 | 4.5428571428571429 | 35
(10 rows)
This query shows the users with the highest average ratings (along with the number of ratings). Let's join on the `uuser` table to get each user's age and occupation as well:
widgetdb=# SELECT udata.user_id, uuser.occupation, uuser.age,
widgetdb-# avg(udata.rating), count(udata.rating)
widgetdb-# FROM udata JOIN uuser ON udata.user_id = uuser.user_id
widgetdb-# GROUP BY udata.user_id, uuser.occupation, uuser.age
widgetdb-# ORDER BY avg(udata.rating) DESC
widgetdb-# LIMIT 10;
user_id | occupation | age | avg | count
---------+---------------+-----+--------------------+-------
849 | student | 15 | 4.8695652173913043 | 23
688 | administrator | 37 | 4.8333333333333333 | 24
507 | writer | 18 | 4.7241379310344828 | 58
628 | none | 13 | 4.7037037037037037 | 27
928 | student | 21 | 4.6875000000000000 | 32
118 | administrator | 21 | 4.6619718309859155 | 71
907 | other | 25 | 4.5714285714285714 | 147
686 | educator | 32 | 4.5633802816901408 | 71
427 | doctor | 51 | 4.5483870967741935 | 31
565 | student | 40 | 4.5428571428571429 | 35
(10 rows)
Another example: average ratings, grouped by occupation:
widgetdb=# select uuser.occupation, avg(udata.rating) from uuser join udata on uuser.user_id = udata.user_id group by uuser.occupation order by avg(udata.rating) desc;
occupation | avg
---------------+--------------------
none | 3.7791342952275250
lawyer | 3.7353159851301115
doctor | 3.6888888888888889
educator | 3.6706206312221987
artist | 3.6533795493934142
administrator | 3.6356464768017115
scientist | 3.6112730806608358
salesman | 3.5829439252336449
programmer | 3.5682604794257147
librarian | 3.5607813388962640
other | 3.5523773797242802
engineer | 3.5414067278287462
technician | 3.5322304620650314
student | 3.5151432345038029
marketing | 3.4856410256410256
retired | 3.4667495338719702
entertainment | 3.4410501193317422
writer | 3.3757225433526012
executive | 3.3491037320011754
homemaker | 3.3010033444816054
healthcare | 2.8962196861626248
(21 rows)
NOTE: Importing the `u.item` table is left as an exercise for the reader. But think of all the fun stuff you could do!
Here are some situations you might encounter and some strategies for fixing them.
If you messed up your table schema, or if you made your table in the wrong database, use the `DROP TABLE` command to drop the table:
DROP TABLE your_table;
... and then create the table again.
If you messed up the data import (e.g., you used the wrong file, or the wrong delimiter), you can delete all of the rows from a table using:
DELETE FROM your_table;
After you've deleted the rows, you can run the `\copy` command again.
And if you only need to fix the values in a few individual rows, you can do that with the `UPDATE` statement.