In this lesson, we're going to do some basic data analyses on a set of diamond characteristics and prices.
import os
my_dir = os.getcwd() # get current working directory
my_dir
You should see something like `/home/user/work/DIBS_materials/python`. Make sure it ends with `/DIBS_materials/python`.
!mkdir data # make directory called "data"
target_dir = os.path.join(my_dir, 'data/')
!wget -P "$target_dir" "https://people.duke.edu/~jmp33/dibs/minerals.csv" # download csv to data folder
# if this doesn't work, manually download `minerals.csv` from https://people.duke.edu/~jmp33/dibs/
# to your local machine, and upload it to `data` folder
If you open the file in a text editor, you will see that it consists of a bunch of lines, each with a bunch of commas. This is a csv or "comma-separated value" file. Every row represents a record (like a row in a spreadsheet), with each cell separated by a comma. The first row has the same format, but the entries are the names of the columns.
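For instance, a tiny csv file with made-up values (not the actual contents of minerals.csv, just an illustration of the format) might look like this:
carat,cut,color,price
0.23,Ideal,E,326
0.31,Good,J,335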
We would like to load this data into Python. But to do so, we will need to tell Python that we want to get some tools ready. These tools are located in a library (the Python term is "module") called Pandas. So we do this:
import pandas as pd
This tells Python to load the `pandas` module and nickname it `pd`. Then, whenever we want to use functionality from `pandas`, we do so by giving the address of the function we want. In this case, we want to load data from a csv file, so we call the function `read_csv`:
# we can include comments like this
# note that for the following to work, you will need to be running the notebook from a folder
# with a subdirectory called data that has the minerals.csv file inside
data = pd.read_csv('data/minerals.csv')
Let's read from the left: `data =` tells Python that we want to do something (on the right hand side of the equation) and assign its output to a variable called `data`. We could have called it `duck` or `shotput` or `harmony`, but we want to give our variables meaningful names, and in cases where we are only going to be exploring a single dataset, it's convenient to name it the obvious thing.
On the right hand side, we're doing several things:

- going into the `pandas` module (nicknamed `pd`)
- calling the function `read_csv` that's found there (we will find that looking up functions in modules is a lot like looking up files in directories)

We will see this pattern repeatedly in Python. We use names like `read_csv` to tell Python to perform actions. In parentheses, we will supply variables or pieces of information needed as inputs by Python to perform those actions. The actions are called functions (related to the idea of functions in math, which are objects that take inputs and produce an output), and the pieces of information inside parentheses are called "arguments." Much more on all of this later.
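As a small taste of the pattern, here is a call to Python's built-in round function; the two arguments tell it what number to round and how many decimal places to keep:
round(3.14159, 2) # returns 3.14; round is the function, 3.14159 and 2 are its arguments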
So what did we accomplish?
The easiest way to see is by asking Python to print the variable to the screen. We can do this with
print(data)
You should be able to see that the data consists of a bunch of rows and columns, and that, at some point in the middle, Python puts a bunch of ...'s, indicating it's not printing all the information. That's good in this case, since the data are pretty big.
So how big are the data? We can find this out by typing
data.shape
We could have gotten the same answer by typing `print(data.shape)`, but when we just type the variable name, Python assumes we mean `print`.
So what does this answer mean? It means that our data have something like 65,000 rows and 16 columns. Here, the convention is (rows, columns).
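Since shape is just a (rows, columns) pair, we can also store the two pieces separately. This is a small illustrative snippet, not part of the lesson code:
n_rows, n_cols = data.shape # unpack the (rows, columns) pair into two variables
print("rows: " + str(n_rows) + ", columns: " + str(n_cols))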
Notice also that the way we got this piece of information was by typing the variable, followed by `.`, followed by the name of a property (called an "attribute"). Again, you can think of this variable as an object having both pieces of information (attributes) and pieces of behavior (functions or methods) tucked inside of it like a file system. The way we access those is by giving a path, except with `.` instead of `/`.
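A couple of other standard pandas attributes can be reached the same way; try each in its own cell:
data.dtypes # the type of data stored in each column
data.size # the total number of cells (rows times columns)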
But there's an even friendlier way to look at our data that's special to the notebook. To look at the first few rows of our data, we can type
data.head()
This gives 5 rows by default (note that counting starts at 0!), but we can easily ask Python for 10:
data.head(10)
Here, `head` is a method of the variable `data` (meaning it's a function stored in the data object). In the second case, we explicitly told `head` how many rows we wanted, while in the first, the number defaulted to 5.
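The idea of a default value isn't special to head; any Python function can declare one. Here is a minimal sketch with a made-up function (we'll cover defining our own functions later):
def greet(name, greeting="Hello"): # greeting has a default, like head's row count of 5
    return greeting + ", " + name

greet("Ada") # 'Hello, Ada' -- the default is used
greet("Ada", "Howdy") # 'Howdy, Ada' -- the default is overridden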
Likewise, we can ask for the last few rows of the dataset with `tail`:
data.tail(7)
If you look carefully, you might notice that the rows seem to be sorted by the last item, price. The odd characters under delivery date are a result of the process used to download these data from the internet.
Finally, as a first pass, we might just want to know some simple summary statistics of our data:
data.describe()
The Python ecosystem is huge. Nobody knows all the functions for all the libraries. This means that when you start analyzing data in earnest, you will need to learn the parts important for solving your particular problem. Initially, this will be difficult; everything will be new to you. Eventually, though, you'll develop a conceptual base that is easy to add to.
So what should you do in this case? How do we learn more about the functions we've used so far?
First off, let's figure out what type of object `data` is. Every variable in Python is an object (meaning it has both information and behavior stored inside of it), and every object has a type. We can find this by using the `type` function:
type(1)
Here, `int` means integer, a number with no decimal part.
type(1.5)
Float is anything with a decimal point. Be aware that the precision of `float` variables is limited, and there is the potential for roundoff errors in calculations if you ever start to do serious data crunching (though most of the time you'll be fine).
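Here is the classic example of floating-point roundoff, which you can try in a cell:
0.1 + 0.2 # gives 0.30000000000000004, not 0.3
0.1 + 0.2 == 0.3 # False, because of that tiny roundoff error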
type(data)
So what in the world is this?
Read it like so: the type of the object `data` is defined in the `pandas` module, in the `core` submodule, in the `frame` sub-submodule, and it is `DataFrame`. Again, using our filesystem analogy, the `data` variable has type `DataFrame`, and Python gives us the full path to its definition. As we will see, dataframes are a very convenient type of object in which to store our data, since this type of object carries with it very powerful behaviors (methods) that can be used to clean and analyze data.
So if `type` tells us the type of object, how do we find out what's in it?
We can do that with the `dir` command. `dir` is short for "directory," and tells us the names of all the attributes (information) and methods (behaviors) associated with an object:
dir(data)
Whoa!
Okay, keep in mind a few things:

- Anything whose name begins with `_` or `__` is a private variable (like files beginning with `.` in the shell). You can safely ignore these until you are much more experienced.
- The rest are attributes and methods we will want to learn more about.

How do we do that?
First, IPython has some pretty spiffy things that can help us right from the shell or notebook. For instance, if I want to learn about the `sort_values` item in the list, I can type
data.sort_values?
And IPython will pop up a handy (or cryptic, as you might feel at first) documentation window for the function. At the very least, we are told at the top of the page that the type of `data.sort_values` is a method, meaning that it's a behavior and not a piece of information. The help then goes on to tell us what inputs the function takes and what outputs it gives.
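The question mark is an IPython nicety; in a plain Python session, the built-in help function shows similar documentation:
help(data.sort_values) # prints the method's documentation to the screen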
More realistically, I would Google `DataFrame.sort_values python`. (Remember, DataFrame is the type of the `data` object, and the internet doesn't know that `data` is a variable name for us. So I ask for the type.method and throw in the keyword "python" so Google knows what I'm talking about.)
The first result that pops up should be the official pandas documentation for the method.
Good news!
As a result of this, if we look carefully, we might be able to puzzle out that if we want to sort the dataset by price per carat, we can do
data_sorted = data.sort_values(by='price per carat')
Notice that we have to save the output of the sort here, since the `sort_values` function doesn't touch the `data` variable. Instead, it returns a new data frame that we need to assign to a new variable name. In other words,
data.head(10)
looks the same as before, while
data_sorted.head(10)
data_sorted.tail(10)
Looks like it worked!
By default, the sort function sorts in ascending order (lowest to highest). Use Google-fu to figure out how to sort the data in descending order by carat.
What happens if we sort by shape instead? What sort order is used?
So let's get down to some real data analysis.
If we want to know what the columns in our data frame are (if, for instance, there are too many to fit onscreen, or we need the list of them to manipulate), we can do
data.columns
So we see columns is an attribute, not a method (the giveaway in this case is that we did not have to use parentheses afterward, as we would with a function/method).
What type of object is this?
type(data.columns)
And what can we do with an Index?
dir(data.columns)
Oh, a lot.
What will often be very useful to us is to get a single column out of the data frame. We can do this like so:
data.price
or
data['price']
The second method (but not the first) also works when the column name has spaces:
data['length to width ratio']
Note that the result of this operation doesn't return a 1-column data frame, but a Series:
type(data['length to width ratio'])
What you might be able to guess here is that a DataFrame is an object that (more or less) contains a bunch of Series objects named in an Index.
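To make that concrete, here is a tiny data frame built by hand with made-up values; each key of the dictionary becomes a column name, and each list of values becomes a Series:
small = pd.DataFrame({'price': [100, 250, 300], 'carat': [0.3, 0.7, 0.9]})
type(small['price']) # pandas.core.series.Series, just like above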
Finally, we can get multiple columns at once:
data[['price', 'price per carat']]
This actually does return a data frame. Note that the expression we put between the brackets in this case was a comma-separated list of strings (things in quotes), also enclosed in brackets. This is a type of object called a list in Python:
mylist = [1, 'a string', 'another string', 6.7, True]
type(mylist)
Clearly, lists are pretty useful for holding, well, lists. And the items in the list don't have to be of the same type. You can learn about Python lists in any basic Python intro, and we'll do more with them, but the key idea is that Python has several data types like this that we can use as collections of other objects. As we meet the other types, including tuples and dictionaries, we will discuss when it's best to use one or another.
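A few basic things you can do with a list like this (all built-in Python behavior):
mylist[0] # 1 -- indexing starts at 0, as with rows
mylist[-1] # True -- negative indices count from the end
len(mylist) # 5 -- the number of items
mylist.append(42) # lists can grow; mylist now has 6 items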
Handily, `pandas` has some functions we can use to calculate basic statistics of our data:
# here, let's save these outputs to variable names to make the code cleaner
ppc_mean = data['price per carat'].mean()
ppc_std = data['price per carat'].std()
# we can concatenate strings by using +, but first we have to use the str function to convert the numbers to strings
print("The average price per carat is " + str(ppc_mean))
print("The standard deviation of price per carat is " + str(ppc_std))
print("The coefficient of variation (std / mean) is thus " + str(ppc_mean / ppc_std))
In fact, these functions will work column-by-column where appropriate:
data.max()
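The same goes for mean and its relatives; note that recent versions of pandas may require you to tell these functions to skip the text columns:
data.mean(numeric_only=True) # column-wise means for just the numeric columns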
Finally, we might eventually want to take select subsets of our data, for which there are lots of methods described in the pandas documentation on indexing and selecting data.
For example, say we wanted to look only at diamonds between 1 and 2 carats. One of the nicest methods to select these data is to use the `query` method:
subset = data.query('(carat >= 1) & (carat <= 2)')
print("The mean for the whole dataset is " + str(data['price'].mean()))
print("The mean for the subset is " + str(subset['price'].mean()))
Extract the subset of data with less than 1 carat and cut equal to Very Good (VG). (Hint: the double equals `==` operator tests for equality. The normal equals sign is only for assigning values to variables.)
Extract the subset of data with color other than J. (Hint: if you have a query that would return all the rows you don't want, you can negate that query by putting `~` in front of it.)
Plotting is just fun. It is also, far and away, the best method for exploring your data.
# this magic makes sure our plots appear in the browser
%matplotlib inline
# let's look at some distributions of variables
data['price'].plot(kind='hist')
Hmm. Let's fix two things: the prices are so skewed that most of the plot is empty (taking a log will help), and the cell prints an ugly line of text above the plot (a trailing semicolon on the last line suppresses it):
#first, import numpy, which has the logarithm function. We'll also give it a nickname.
import numpy as np
# the apply method applies a function to every element of a data frame (or series in this case)
# we will use this to create a new column in the data frame called log_price
data['log_price'] = data['price'].apply(np.log10)
data['log_price'].plot(kind='hist');
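As an aside, NumPy functions like np.log10 also operate directly on a whole Series, so this one-liner produces the same new column without apply:
data['log_price'] = np.log10(data['price']) # equivalent to the apply version above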
But we can do so much better!
#let's pull out the big guns
import matplotlib.pyplot as plt
data['log_price'].plot(kind='hist', bins=100)
plt.xlabel('Log_10 price (dollars)')
plt.ylabel('Count');
What about other types of plots?
# the value_counts() method counts the number of times each value occurs in the data['color'] series
data['color'].value_counts().plot(kind='bar');
That's a bit ugly. Let's change plot styles.
plt.style.use('ggplot')
# scatter plot the relationship between carat and price
data.plot(kind='scatter', x='carat', y='price');
# do the same thing, but plot y on a log scale
data.plot(kind='scatter', x='carat', y='price', logy=True);
data.boxplot(column='log_price', by='color');
data.boxplot(column='log_price', by='cut');
You can see from the above that color and cut don't seem to matter much to price. Can you think of a reason why this might be?
data.boxplot(column='carat', by='color');
plt.ylim(0, 3);
data.boxplot(column='carat', by='cut');
plt.ylim(0, 3);
from pandas.plotting import scatter_matrix
column_list = ['carat', 'price per carat', 'log_price']
scatter_matrix(data[column_list]);
There are lots and lots and lots of plotting libraries out there.
Plotting is lots of fun to play around with, but almost no plot is going to be of publication quality without some tweaking. Once you pick a package, you will want to spend time learning how to get labels, spacing, tick marks, etc. right. All of the packages above are very powerful, but inevitably, you will want to do something that seems simple and turns out to be hard.
Why not just take the plot that's easy to make and pretty it up in Adobe Illustrator? Because any plot that winds up in a paper will be redone many times over the course of revision and peer review. Learn to let the program do the hard work. You want code that will get you 90-95% of the way to publication quality.
Thankfully, it's very, very easy to learn how to do this. Because plotting routines present such nice visual feedback, there are lots and lots of examples online with code that will show you how to make gorgeous plots. Here again, documentation and StackOverflow are your friends!
Now let's practice some of what we learned by analyzing data from a very simple survey.
I asked members of CCN and Neurobiology what percentage of their time they allocate to each aspect of a typical research project. Let's download their responses:
!wget -P "$target_dir" "https://people.duke.edu/~jmp33/dibs/time_alloc.csv" # download csv to data folder
# if this doesn't work, manually download `time_alloc.csv` from https://people.duke.edu/~jmp33/dibs/
# to your local machine, and upload it to `data` folder
dat = pd.read_csv('data/time_alloc.csv')
dat.head()
dat.columns
dat.shape
Let's just pull out the data we care about, the columns that start with `Q1_`:
cols_to_extract = [c for c in dat.columns if 'Q1_' in c]
print(cols_to_extract)
Now we want to shorten the column names to just the descriptive part:
col_names = [n.split(' - ')[-1] for n in cols_to_extract]
print(col_names)
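To see what split is doing, here it is applied to a single made-up column name of the same general shape (the real names come from the csv):
'Q1_1 - Reading the literature'.split(' - ') # ['Q1_1', 'Reading the literature']
'Q1_1 - Reading the literature'.split(' - ')[-1] # 'Reading the literature' -- the last piece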
Finally, make a reduced dataset in which we drop the first row and get the columns we want. Set the column name to the description we extracted:
# drop the first row, which holds the question text rather than responses
dat_red = dat[cols_to_extract].iloc[1:]
dat_red.columns = col_names
# that text row made pandas read these columns as strings, so convert them back to numbers
dat_red = dat_red.apply(pd.to_numeric)
dat_red.head()
Now, we can figure out the average percent time allocated to each aspect of a project:
dat_red.mean()
And we can visualize this with a box plot:
import seaborn as sns
plt.figure(figsize=(10, 5))
sns.boxplot(data=dat_red)
plt.figure(figsize=(10, 5))
sns.violinplot(data=dat_red)
plt.ylim([0, 100])
And we might want to know how these covary (bearing in mind that the values have to sum to 100):
dat_red.corr()