# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'
Previously discussed:
Download this notebook from Github
This data set contains the prices and other attributes of almost 54,000 diamonds. This dataset is available on Github in the 2014_data repository and is called diamonds.csv
.
This is a .csv
file, so we will use the function read_csv()
that will read in a CSV file into a pandas DataFrame.
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/diamonds.csv'
diamonds = pd.read_csv(url, sep = ',', index_col=0)
diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
3 | 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
4 | 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
5 | 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
Here is a table containing a description of all the column names.
Column name | Description |
---|---|
carat | weight of the diamond (0.2–5.01) |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
colour | diamond colour, from J (worst) to D (best) |
clarity | a measurement of how clear the diamond is (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best)) |
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | width of top of diamond relative to widest point (43–95) |
price | price in US dollars ($326–$18,823) |
x | length in mm (0–10.74) |
y | width in mm (0–58.9) |
z | depth in mm (0–31.8) |
The variables carat
and price
are both continuous variables, while color
and clarity
are discrete variables. First, let's look at some summary statistics of the diamonds data set.
diamonds.describe()
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
mean | 0.797940 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
min | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
75% | 1.040000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
max | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
Let's look at the distribution of carats and price using a histogram.
diamonds['price'].hist(bins=50, color = 'black')
plt.title('Distribution of Price')
plt.xlabel('Price')
<matplotlib.text.Text at 0x107793b50>
# Try changing bin size from 20 to 500. What do you notice?
diamonds['carat'].hist(bins=20, color = 'black', figsize=(6, 4))
plt.title('Distribution of weights in carats')
plt.xlim(0, 3)
plt.xlabel('Weight in Carats')
<matplotlib.text.Text at 0x10810bf90>
Plot the density of the price of the diamonds
diamonds['price'].plot(kind='kde', color = 'black')
plt.title('Distribution of Price')
<matplotlib.text.Text at 0x10831fdd0>
Now, let's look at the relationship between the price of a diamond and its weight in carats. Try changing alpha (ranges from 0 to 1) to control over plotting.
diamonds.plot(x='carat', y='price', kind = 'scatter', color = 'black', alpha = 1)
<matplotlib.axes.AxesSubplot at 0x10a1cb1d0>
We can also create a scatter plot using matplotlib.pyplot instead of pandas directly.
plt.scatter(diamonds['carat'], diamonds['price'], color = 'black', alpha = 0.05)
plt.xlabel('Carat')
plt.ylabel('Price')
<matplotlib.text.Text at 0x10ace49d0>
Let's look at the scatter plots of price
and carat
but grouped by color.
diamonds.groupby('color').plot(x='carat', y='price', kind = 'scatter', color = 'black', alpha = 1)
color D Axes(0.125,0.125;0.775x0.775) E Axes(0.125,0.125;0.775x0.775) F Axes(0.125,0.125;0.775x0.775) G Axes(0.125,0.125;0.775x0.775) H Axes(0.125,0.125;0.775x0.775) I Axes(0.125,0.125;0.775x0.775) J Axes(0.125,0.125;0.775x0.775) dtype: object
What happens if you look at the scatter plots of price
and carat
but grouped by clarity.
# try here
We could also look at boxplots of the price
grouped by color
.
diamonds.boxplot('price', by = 'color')
<matplotlib.axes.AxesSubplot at 0x10b8b1ad0>
Now that we have done some exploratory data analysis by looking at histograms, scatter plots and boxplots let's look more about how to work with the pandas DataFrame itself.
We just learned about diamonds.describe()
above, what else can we do?
diamonds.mean()
carat 0.797940 depth 61.749405 table 57.457184 price 3932.799722 x 5.731157 y 5.734526 z 3.538734 dtype: float64
diamonds.corr() # correlation
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
carat | 1.000000 | 0.028224 | 0.181618 | 0.921591 | 0.975094 | 0.951722 | 0.953387 |
depth | 0.028224 | 1.000000 | -0.295779 | -0.010647 | -0.025289 | -0.029341 | 0.094924 |
table | 0.181618 | -0.295779 | 1.000000 | 0.127134 | 0.195344 | 0.183760 | 0.150929 |
price | 0.921591 | -0.010647 | 0.127134 | 1.000000 | 0.884435 | 0.865421 | 0.861249 |
x | 0.975094 | -0.025289 | 0.195344 | 0.884435 | 1.000000 | 0.974701 | 0.970772 |
y | 0.951722 | -0.029341 | 0.183760 | 0.865421 | 0.974701 | 1.000000 | 0.952006 |
z | 0.953387 | 0.094924 | 0.150929 | 0.861249 | 0.970772 | 0.952006 | 1.000000 |
diamonds.var() # variance
carat 0.224687 depth 2.052404 table 4.992948 price 15915629.424301 x 1.258347 y 1.304472 z 0.498011 dtype: float64
diamonds.sort('price', ascending = True, inplace = False).head() # sorting
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
3 | 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
4 | 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
5 | 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
subtable = diamonds.iloc[0:2, 0:2]
print "subtable"
print subtable
print ""
column = diamonds['color']
print "head of the color column"
print column.head()
print ""
row = diamonds.ix[1:2] #row 1 and 2
print "row"
print row
print ""
rows = diamonds.ix[:3] # all the rows before 3
print "rows"
print rows
print ""
color = diamonds.ix[1,'color']
print "color of diamond in row 1"
print color
print ""
# max along column
print "max price %g" % diamonds['price'].max()
print ""
# axes
print "axes"
print diamonds.axes
print ""
row = diamonds.ix[1]
print "row info"
print row.name
print row.index
print ""
subtable carat cut 1 0.23 Ideal 2 0.21 Premium head of the color column 1 E 2 E 3 E 4 I 5 J Name: color, dtype: object row carat cut color clarity depth table price x y z 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 rows carat cut color clarity depth table price x y z 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 color of diamond in row 1 E max price 18823 axes [Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, ...], dtype='int64'), Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')] row info 1 Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')
New functions can be defined using one of the 31 keywords in Python: def
.
def squared(x):
""" Return the square of a
value """
return x ** 2
squared(4)
16
The first line of the function (the header) must start with the keyword def
, the name of the function (which can contain underscores), parentheses (with any arguments inside of it) and a colon. The arguments can be specified in any order.
The rest of the function (the body) always has an indentation of four spaces. If you define a function in the interactive mode, the interpreter will print ellipses (...) to let you know the function isn't complete. To complete the function, enter an empty line (not necessary in a script).
To return a value from a function, use return
. The function will immediately terminate and not run any code written past this point.
When defining new functions, you can add a docstring
(i.e. the documentation of function) at the beginning of the function that documents what the function does. The docstring is a triple quoted (multi-line) string. We highly recommend you to document the functions you define as good python coding practice.
Lambda functions are one-line functions. To define this function using the lambda
keyword, you do not need to include the return
argument. For example, we can re-write the squared()
function above using the following syntax:
f = lambda x: x**2
f(4)
16
Defining a for
loop is similar to defining a new function. The header ends with a colon and the body is indented with four spaces. The function range(n)
takes in an integer n and creates a set of values from 0 to n - 1. for
loops are not just for counters, but they can iterate through many types of objects such as strings, lists and dictionaries.
for i in range(4):
print 'Hello world!'
Hello world! Hello world! Hello world! Hello world!
To traverse through all characters in a given string, you can use for
or while
loops. Here we create the names of the duck statues in the Public Gardens in downtown Boston: Jack, Kack, Lack, Mack, Nack, Oack, Pack, Qack.
prefixes = 'JKLMNOPQ'
suffix = 'ack'
for letter in prefixes:
print letter + suffix
Jack Kack Lack Mack Nack Oack Pack Qack
Defining a while
loop is again similar to defining a for
loop or new function. The header ends with a colon and the body is indented with four spaces.
def countdown(n):
while n > 0:
print n
n = n-1
print 'Blastoff!'
countdown(3)
3 2 1 Blastoff!
Another powerful feature of Python is list comprehension which maps one list onto another list and applying a function to each element. Here, we take each element in the list a
(temporarily assigning it the value i) and square each element in the list. This creates a new list and does not modify a
. In the second line, we can add a conditional statements of only squaring the elements if the element is not equal to 10.
a = [5, 10, 15, 20]
b = [i**2 for i in a]
c = [i**2 for i in a if i != 10]
print "a: ", a
print "b: ", b
print "c: ", c
a: [5, 10, 15, 20] b: [25, 100, 225, 400] c: [25, 225, 400]