You're sold on SQL as a powerful language for querying data, and on PostgreSQL as a powerful engine for storing and sharing that data. But how do you get data in there? That's a fine question. In this tutorial, I'm going to show you the steps necessary to import a CSV into PostgreSQL.
We're going to use the MovieLens review data that you worked with in a previous tutorial. If you don't have the data handy anymore, download it here. For convenience, unzip the file in the same directory as this notebook.
PostgreSQL has a built-in command---`\copy`---that can import data from a CSV file. However, that command can only import data into an existing table. In order to make the command work, we first need to create tables to contain the data.
Our eventual goal is to create tables for the users, reviews, and movies in the MovieLens data set, then use the `\copy` command to import the data into those tables. First, though, let's do a smaller example, just to acclimatize ourselves to the basics.
The task of creating a table consists of figuring out which columns we need, and then figuring out the correct data type for each column. The kinds of types supported by SQL (and PostgreSQL in particular) are different from the data types that we're used to working with in Python, though there are some obvious analogues between the types supported by both. Here's a list of all the types supported in PostgreSQL.
The data types you're most likely to encounter in SQL are the following:
type | description |
---|---|
`int` | an integer value (exact range varies depending on database) |
`numeric(digits, fraction)` | a number with user-specified precision (see below) |
`varchar(n)` | a string of characters with a given maximum length n |
`text` | a string with unlimited length |
`timestamp` | holds a date and time (e.g., 2015-08-01T12:34:56) |
`date` | holds a date only (e.g., 2015-08-01) |
`boolean` | holds true or false |
Let's say that we're creating a table to store information about the widgets from a previous homework assignment. Looking at the data, we can make the following determinations:
That's enough information to create the necessary table. The command to create a table is, unsurprisingly enough, `CREATE TABLE`, and the syntax looks like this:
CREATE TABLE table_name (
field1 datatype1,
field2 datatype2,
field3 datatype3...
)
... replacing `table_name` with the desired name of the table, replacing `field1`, `field2`, etc. with the names of the desired fields, and replacing `datatype1`, `datatype2`, etc. with the names of the desired data types for the corresponding fields.
Okay, now we know enough to create a table for our widgets. Let's create a database to hold the table, first of all. Open `psql` and enter the following commands:
CREATE DATABASE widgetdb;
\c widgetdb
Once you've created the database and connected to it, you can run the create table command for our widget data. Here's what it looks like:
CREATE TABLE widget (
part_no varchar(20),
name text,
price numeric(10, 2),
quantity int
);
Copy that text, paste it into `psql`, and hit Enter. You should see a confirmation message (`CREATE TABLE`); if you don't, check again and make sure that you got the syntax right.
The `\d` command should now display the table that you just created:
widgetdb=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+-------+---------
public | widget | table | allison
(1 row)
widgetdb=# \d widget
Table "public.widget"
Column | Type | Modifiers
----------+-----------------------+-----------
part_no | character varying(20) |
name | text |
price | numeric(10,2) |
quantity | integer |
Computers are notoriously imprecise when handling floating-point numbers (see here for more details). This is problematic in many situations, especially when you're using a floating point number to represent money. (Recall the rounding error subplots from Superman III and Office Space.) The `numeric` type side-steps this problem with a different internal representation of numbers that isn't subject to these errors. When specifying a numeric field, you need to give the type two parameters: the total number of digits that the number should be able to contain, and the number of those digits that can go after the decimal point. (E.g., `numeric(10, 2)` specifies a numeric field that can store a number as high as 99999999.99---definitely sufficient for our widget database.)
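The contrast is easy to see in Python, whose standard-library `decimal` module uses the same kind of exact decimal representation that `numeric` does (a sketch, not PostgreSQL's actual internals):

```python
from decimal import Decimal

# Binary floating point can't represent 0.1 or 0.2 exactly,
# so small errors creep into ordinary arithmetic:
print(0.1 + 0.2)                        # 0.30000000000000004

# Decimal values, like PostgreSQL's numeric type, are exact:
print(Decimal("0.1") + Decimal("0.2"))  # 0.3

# Exactness matters for money: 512 Skinner Widgets at $2.70 each.
print(Decimal("2.70") * 512)            # 1382.40
```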
Ah, a fresh new table with no mistakes in it yet. Or data, for that matter. Now that we have a table, how do we get data into it? With the `INSERT` command!
The `INSERT` command allows you to add data one row at a time to the table. The syntax looks like this:
INSERT INTO table_name (col1, col2, col3, ...) VALUES (val1, val2, val3, ...)
... replacing `table_name` with the name of the table you want to insert values into, `col1` (etc.) with the columns you want to insert data into, and `val1` (etc.) with the corresponding data. For example, here's the first widget listed on the widget page:
Part No. | Name | Price | Quantity in warehouse |
---|---|---|---|
C1-9476 | Skinner Widget | $2.70 | 512 |
To insert the data from this row, issue the following command:
INSERT INTO widget (part_no, name, price, quantity) VALUES ('C1-9476', 'Skinner Widget', 2.70, 512);
You'll get a confirmation message (i.e., `INSERT 0 1`) that tells you that the insert was successful. (If you didn't get that message, check the syntax and try again.)
Confirm that the insert was successful by issuing a `SELECT` statement:
widgetdb=# SELECT * FROM widget;
part_no | name | price | quantity
---------+----------------+-------+----------
C1-9476 | Skinner Widget | 2.70 | 512
(1 row)
Hey perfect! It worked.
You could continue on in this way, issuing `INSERT` statements for each widget that you want to add to the table. This is kind of tedious, obviously. In real life, you'd probably get the data into your database by importing it from another source. One method of doing this would be to programmatically issue `INSERT` statements in Python (say, looping over a list of dictionaries from an API).
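Here's a minimal sketch of that approach. Python's built-in `sqlite3` module stands in for PostgreSQL so the example runs without a server; against a real PostgreSQL database you'd use a driver such as psycopg2 (with `%s` placeholders instead of `?`). The second widget row is invented for illustration.

```python
import sqlite3

# Rows as they might arrive from an API or a file. The first is the
# Skinner Widget from above; the second is a made-up example.
widgets = [
    {"part_no": "C1-9476", "name": "Skinner Widget", "price": 2.70, "quantity": 512},
    {"part_no": "X0-0000", "name": "Example Widget", "price": 1.05, "quantity": 18},
]

# sqlite3 stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widget (part_no varchar(20), name text, price numeric, quantity int)"
)
for w in widgets:
    # Placeholders let the driver quote each value safely.
    conn.execute(
        "INSERT INTO widget (part_no, name, price, quantity) VALUES (?, ?, ?, ?)",
        (w["part_no"], w["name"], w["price"], w["quantity"]),
    )
conn.commit()
print(conn.execute("SELECT count(*) FROM widget").fetchone()[0])  # 2
```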
Another method is to use PostgreSQL's `\copy` command, which takes a CSV file and inserts its contents, row by row, into an existing table.
But there has to be an existing table. So before we can use the `\copy` command, we need to create the table first. Let's try it out with the `u.user` file from the MovieLens data set.
Here are the first several lines from that file:
!head -10 ml-100k/u.user
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
Referring back to the `README` file, the meanings of these fields are:

user id | age | gender | occupation | zip code
Here are the data types I would pick for these fields:

- user id: `int`
- age: `int`
- gender: `varchar(1)`
- occupation: `varchar(80)`
- zip code: `varchar(10)` (in case there are any extended ZIP codes?)

And the `CREATE TABLE` statement:
CREATE TABLE uuser (
user_id int,
age int,
gender varchar(1),
occupation varchar(80),
zip_code varchar(10)
);
(Note that I called the table `uuser`---you can't just use the name `user`, as it's a reserved word in SQL.)
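Why `varchar` for the ZIP code instead of `int`? ZIP codes are labels, not quantities: user 8's ZIP code above is 05201, and an integer column would silently drop the leading zero. A quick Python illustration of the problem:

```python
zip_code = "05201"       # user 8's ZIP code from u.user

# Treating it as a number loses information:
as_int = int(zip_code)
print(as_int)                    # 5201 -- the leading zero is gone
print(str(as_int) == zip_code)   # False
```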
Okay, cool. Now that the table exists, we can use the `\copy` command to import the actual data. Here's what the command looks like:
\copy table_name from path delimiter ',' csv
The words `from`, `delimiter`, and `csv` are part of the command. The parts you should change are `table_name` (replace this with the name of the table you want to import into), `path` (replace this with the path to the CSV file on your hard drive), and `','` (replace the comma with whatever character separates the items on each line of the file).
So, for example, to import the data from `u.user`:
\copy uuser from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.user delimiter '|' csv
Make sure to replace the path with the path to your actual file. If everything's worked out correctly, you should see a confirmation (`COPY 943`, indicating that 943 rows were copied). If not, check the syntax and try again.
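Conceptually, `\copy` is doing something like the sketch below: split each line on the delimiter, then insert the fields as a row. Again, `sqlite3` stands in for PostgreSQL and an in-memory string stands in for the file on disk, so this is an illustration of the idea rather than the real mechanism:

```python
import csv
import io
import sqlite3

# Two lines in the same format as u.user, in place of the real file.
data = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uuser (user_id int, age int, gender varchar(1),"
    " occupation varchar(80), zip_code varchar(10))"
)

# csv.reader handles the delimiter (and any quoting), much like the
# delimiter '|' csv options in the \copy command.
rows = csv.reader(data, delimiter="|")
conn.executemany("INSERT INTO uuser VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT count(*) FROM uuser").fetchone()[0])  # 2
```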
You can confirm that the copy worked by issuing a `SELECT` statement, like:
widgetdb=# SELECT * FROM uuser WHERE gender = 'F' AND occupation = 'engineer';
user_id | age | gender | occupation | zip_code
---------+-----+--------+------------+----------
786 | 36 | F | engineer | 01754
827 | 23 | F | engineer | 80228
(2 rows)
Let's repeat the process with the `u.data` file. Here's what the data looks like:
!head -10 ml-100k/u.data
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
According to the README, here's what the columns mean:
user id | item id | rating | timestamp
My best guess at a table schema:
CREATE TABLE udata (
user_id int,
item_id int,
rating int,
timestamp int
);
And the `\copy` command:
\copy udata from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.data delimiter E'\t' csv
(Why the `E` in front of the `'\t'`? I had to google it: the `E` prefix marks an "escape" string literal, which tells PostgreSQL to interpret backslash sequences like `\t` as a tab character rather than as a literal backslash and t. This is computers we're talking about here; there's always one damn thing or another getting in the way of your doing what you want to do.)
If everything goes according to plan, you'll see a confirmation (`COPY 100000`). You can test the data with a `SELECT` statement like so:
widgetdb=# SELECT user_id, avg(rating), count(rating)
FROM udata GROUP BY user_id ORDER BY avg(rating) DESC LIMIT 10;
user_id | avg | count
---------+--------------------+-------
849 | 4.8695652173913043 | 23
688 | 4.8333333333333333 | 24
507 | 4.7241379310344828 | 58
628 | 4.7037037037037037 | 27
928 | 4.6875000000000000 | 32
118 | 4.6619718309859155 | 71
907 | 4.5714285714285714 | 147
686 | 4.5633802816901408 | 71
427 | 4.5483870967741935 | 31
565 | 4.5428571428571429 | 35
(10 rows)
This query shows the users with the highest average ratings (along with the number of ratings). Let's join on the `uuser` table to get each user's age and occupation as well:
widgetdb=# SELECT udata.user_id, uuser.occupation, uuser.age,
widgetdb-# avg(udata.rating), count(udata.rating)
widgetdb-# FROM udata JOIN uuser ON udata.user_id = uuser.user_id
widgetdb-# GROUP BY udata.user_id, uuser.occupation, uuser.age
widgetdb-# ORDER BY avg(udata.rating) DESC
widgetdb-# LIMIT 10;
user_id | occupation | age | avg | count
---------+---------------+-----+--------------------+-------
849 | student | 15 | 4.8695652173913043 | 23
688 | administrator | 37 | 4.8333333333333333 | 24
507 | writer | 18 | 4.7241379310344828 | 58
628 | none | 13 | 4.7037037037037037 | 27
928 | student | 21 | 4.6875000000000000 | 32
118 | administrator | 21 | 4.6619718309859155 | 71
907 | other | 25 | 4.5714285714285714 | 147
686 | educator | 32 | 4.5633802816901408 | 71
427 | doctor | 51 | 4.5483870967741935 | 31
565 | student | 40 | 4.5428571428571429 | 35
(10 rows)
Another example: average ratings, grouped by occupation:
widgetdb=# select uuser.occupation, avg(udata.rating) from uuser join udata on uuser.user_id = udata.user_id group by uuser.occupation order by avg(udata.rating) desc;
occupation | avg
---------------+--------------------
none | 3.7791342952275250
lawyer | 3.7353159851301115
doctor | 3.6888888888888889
educator | 3.6706206312221987
artist | 3.6533795493934142
administrator | 3.6356464768017115
scientist | 3.6112730806608358
salesman | 3.5829439252336449
programmer | 3.5682604794257147
librarian | 3.5607813388962640
other | 3.5523773797242802
engineer | 3.5414067278287462
technician | 3.5322304620650314
student | 3.5151432345038029
marketing | 3.4856410256410256
retired | 3.4667495338719702
entertainment | 3.4410501193317422
writer | 3.3757225433526012
executive | 3.3491037320011754
homemaker | 3.3010033444816054
healthcare | 2.8962196861626248
(21 rows)
NOTE: Importing the `u.item` table is left as an exercise for the reader. But think of all the fun stuff you could do!
Here are some situations you might encounter and some strategies for fixing them.
If you messed up your table schema, or if you made your table in the wrong database, use the `DROP TABLE` command to drop the table:
DROP TABLE your_table;
... and then create the table again.
If you messed up the data import (e.g., you used the wrong file, or the wrong delimiter), you can delete all of the rows from a table using:
DELETE FROM your_table;
After you've deleted the rows, you can run the `\copy` command again.
And if you only need to fix the values in a few individual rows, you can do that with the `UPDATE` statement.