Lab 2 Exploratory Data Analysis¶

For this lab, we'll explore some data from a very useful source, the UC Irvine machine learning data repository.

First, create a directory ~/labs/lab2 on your VM.

Open a browser on your VM and point it at this ipynb file. Click on the download link (top right-hand corner of the page) and download this file into your lab2 directory.

Now, from inside (or outside) your VM, open a browser, go to https://archive.ics.uci.edu/ml/datasets/Heart+Disease and read the dataset description. This may be a good time to enable clipboard sharing on your VM; it's under the "Devices" menu.

Click on the "Data Folder" link near the top of the page. If you're not inside your VM, copy the URL for this page and then paste it into a browser on your VM.

Now click on "processed.cleveland.data" and save it into ~/labs/lab2.

Now read the data into Python and create a variable "cleveland_raw_data", which is a list of rows from this dataset. Each row should be a list of string values returned by the csv file reader.

In [ ]:

labdir = "/home/datascience/labs/lab2/"
import csv

with open(labdir+"processed.cleveland.data") as csvfile:
    cleveland_raw_data = list(csv.reader(csvfile))    
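
To see what the csv reader hands back without touching the real file, here is a minimal sketch that parses two made-up lines from an in-memory buffer. The sample values are hypothetical, not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the same comma-separated layout (hypothetical values)
sample = io.StringIO("63.0,1.0,1.0,145.0\n67.0,1.0,4.0,?\n")

# csv.reader yields one list of strings per line; nothing is converted to numbers
rows = list(csv.reader(sample))
nrows = len(rows)
print(rows)
print(nrows)
```

Every field, including the '?' marker, arrives as a string, which is why a cleaning step is needed before any arithmetic.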
        

TODO: How many rows are there in the dataset?

Data Cleaning¶

First we have to clean and sanitize the data. This data is pretty clean and is mostly numeric but contains some '?' in some fields. To make it easier to handle, we convert those fields to 'None'. For convenience, you should define a function "safefloat" that takes a string argument, and returns None if the argument is '?', otherwise the float value of the string.

In [ ]:

def safefloat(x):
    # TODO: Implement safefloat()
    raise NotImplementedError("safefloat() is left as an exercise")

cleveland_data = [[safefloat(x) for x in y] for y in cleveland_raw_data]
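
If you get stuck, the following is one possible sketch of safefloat, tried on a made-up row. Stripping whitespace first is an extra assumption on our part and is not required by the description above.

```python
def safefloat(x):
    """Return None for the missing-value marker '?', otherwise float(x)."""
    x = x.strip()  # assumption: tolerate stray whitespace around a field
    return None if x == '?' else float(x)

# Hypothetical row in the format the csv reader returns
sample_row = ['63.0', '1.0', '?', '145.0']
cleaned = [safefloat(v) for v in sample_row]
print(cleaned)
```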

As discussed in the dataset summary, the following are the column names.

In [ ]:

headers = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

Now we construct a dictionary mapping these header names to the column numbers 0...13:

In [ ]:

headernum = dict(zip(headers, range(len(headers))))    

Define a function "getcol" that takes a column name and returns the data in that column as a list of numbers.

In [ ]:

def getcol(name):
    # TODO: write getcol
    raise NotImplementedError("getcol() is left as an exercise")
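
As a hint, here is a minimal sketch of the idea on toy data. The shortened header list, the made-up rows and the extra `data` parameter are all assumptions for illustration; in the lab, getcol takes only the column name and reads from cleveland_data.

```python
# Shortened, hypothetical header list and made-up rows
headers = ['age', 'sex', 'num']
headernum = dict(zip(headers, range(len(headers))))
toy_data = [[63.0, 1.0, 0.0], [67.0, 1.0, 2.0], [41.0, 0.0, 0.0]]

def getcol(name, data=toy_data):
    """Return the values in the named column, one per row."""
    icol = headernum[name]               # look up the column number
    return [row[icol] for row in data]   # pick that field out of every row

ages = getcol('age')
print(ages)
```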

Basic Statistics¶

What are the minimum, maximum, mean and standard deviation of the age of this set of subjects? Use the numpy package, which contains the mean() and std() functions.

In [ ]:

import numpy as np
age = getcol('age')
[min(age), max(age), np.mean(age), np.std(age)]
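
Note that np.std divides by N by default (the population standard deviation). The sketch below, on made-up numbers, contrasts this with the N-1 version selected by ddof=1; which one you report is your choice, and the lab does not say which is expected.

```python
import numpy as np

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers

pop_std = np.std(values)             # default ddof=0: divide by N
sample_std = np.std(values, ddof=1)  # ddof=1: divide by N - 1

print(pop_std, sample_std)
```

For a few hundred rows the two values are close, but they are not identical.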

Next we define a function "select" which, given a column name and a predicate, returns the values of that column at the rows for which the predicate is true.

In [ ]:

def select(colname, predicate):
    icol = headernum[colname]
    return [i[icol] for i in cleveland_data if predicate(i)]

Now run these expressions to get the mean age of male and female subjects.

In [ ]:

def fieldis(colname, cval):
    icol = headernum[colname]
    return lambda x: x[icol] == cval

[np.mean(select('age', fieldis('sex',1))), np.mean(select('age', fieldis('sex',0)))]
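
To make the predicate mechanism concrete, here is a self-contained sketch on four made-up rows. The toy table and the extra `data` parameter are assumptions for illustration only. A predicate is just a function of a whole row, so it does not have to come from fieldis.

```python
headers = ['age', 'sex']
headernum = dict(zip(headers, range(len(headers))))
toy_data = [[63.0, 1.0], [41.0, 0.0], [57.0, 1.0], [50.0, 0.0]]  # made-up rows

def select(colname, predicate, data=toy_data):
    icol = headernum[colname]
    return [row[icol] for row in data if predicate(row)]

def fieldis(colname, cval):
    icol = headernum[colname]
    return lambda row: row[icol] == cval   # closure remembers icol and cval

male_ages = select('age', fieldis('sex', 1))      # sex == 1 is male in this dataset
female_ages = select('age', fieldis('sex', 0))
older = select('age', lambda row: row[headernum['age']] > 55)  # ad hoc predicate
print(male_ages, female_ages, older)
```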

TODO: What were the mean ages for females and males?

Histograms of Data Fields¶

Plot histograms of age and resting blood pressure.

In [ ]:

import pylab
%matplotlib inline

h1 = pylab.hist(age, 30)

In [ ]:

bp = getcol('trestbps')
h2 = pylab.hist(bp, 30)

TODO

Describe the rough shape of the distribution of bps. Is it skewed?

Scatter Plots¶

Make scatter plots of:

age vs bp (resting blood pressure)
age vs thalach (max heart rate)

In [ ]:

pylab.scatter(age, bp)

In [ ]:

maxhr = getcol('thalach')
pylab.scatter(age, maxhr)

We can augment the basic scatter plots with other information that might be relevant. In the plot below, we use the 'num' field to color the dots. num is an integer indicating the degree of heart disease from 0...4. We also make the dots larger with the s= argument to make the colors easier to see.

In [ ]:

pylab.scatter(age, bp, c=getcol('num'), s=50)

To figure out what color encodes what value, we can do a simple plot of the values 0...4

In [ ]:

pylab.scatter(range(5), range(5), c=range(5), s=50)

TODO: What do you notice about the distribution of num = 2 diagnoses?

These scatter plots seem to show trends. To make those clearer we can overlay regression lines. The regression line minimizes the total squared vertical distance from the line to the data points, and shows the general trend for the data.

In [ ]:

# for numpy we need arrays instead of lists of values
age = np.array(getcol('age'))
bp = np.array(getcol('trestbps'))

pylab.scatter(age, bp)
m, b = np.polyfit(age, bp, 1)
pylab.plot(age, m*age + b, '-', color='red')
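
To check what np.polyfit(x, y, 1) returns, the sketch below compares it with the textbook closed-form least-squares slope and intercept on a few made-up points. The data are hypothetical; only the agreement between the two computations matters.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # made-up points
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares: slope = cov(x, y) / var(x), line passes through the means
xm, ym = x.mean(), y.mean()
m_manual = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)
b_manual = ym - m_manual * xm

# polyfit returns coefficients from highest degree down: slope, then intercept
m_fit, b_fit = np.polyfit(x, y, 1)
print(m_manual, b_manual, m_fit, b_fit)
```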

In [ ]:

maxhr = np.array(getcol('thalach'))

pylab.scatter(age, maxhr)
m, b = np.polyfit(age, maxhr, 1)
pylab.plot(age, m*age + b, '-', color='red')

Critical Thinking with Data¶

The following scatter plot and regression line show the relationship between blood pressure (X-axis) and heart disease (Y-axis).

In [ ]:

num = np.array(getcol('num'))
factor = bp

pylab.scatter(factor, num)
m, b = np.polyfit(factor, num, 1)
pylab.plot(factor, m*factor + b, '-', color='red')

TODO: Based on this plot, do you think blood pressure influences heart disease?

Now consider this plot of age versus num:

In [ ]:

num = np.array(getcol('num'))
factor = age

pylab.scatter(factor, num)
m, b = np.polyfit(factor, num, 1)
pylab.plot(factor, m*factor + b, '-', color='red')

TODO: Based on this plot of Age vs Num and the previous plot of Age vs BPS, what would you say now about the relation between BPS and Num?

Dimension Reduction¶

Recall that dimension reduction allows you to look at the dominant factors in high-dimensional data. Matplotlib releases before 3.1 include the PCA class in matplotlib.mlab for this purpose (it was removed in 3.1). You use it like this:

In [ ]:

from matplotlib.mlab import PCA
cleveland_matrix = np.array(cleveland_data, dtype=np.float64) # First put the data in a 2D array of double-precision floats
results = PCA(cleveland_matrix[:,0:8])                      # leave out columns with None in them
yy = results.Y                                              # returns the projections of the data into the principal component directions

In [ ]:

pylab.scatter(yy[:,0],yy[:,1])
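
For intuition about what such a projection computes, here is a rough numpy-only sketch: standardize each column, take the SVD, and project onto the right singular vectors. This is our own illustration on synthetic data, not the Matplotlib implementation; the helper name pca_project is hypothetical, and the signs of the components are arbitrary.

```python
import numpy as np

def pca_project(data):
    """Project standardized rows of `data` onto its principal directions."""
    centered = data - data.mean(axis=0)
    scaled = centered / data.std(axis=0)   # assumes no column is constant
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return np.dot(scaled, vt.T)

# Synthetic data: two strongly correlated columns plus one independent column
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 1))
toy = np.hstack([base + 0.1 * rng.normal(size=(100, 1)),
                 2 * base + 0.1 * rng.normal(size=(100, 1)),
                 rng.normal(size=(100, 1))])

yy_toy = pca_project(toy)
print(yy_toy.var(axis=0))   # variance along each component, largest first
```

The projected columns are uncorrelated with one another by construction, which is worth keeping in mind when interpreting a scatter plot of the first two components.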

TODO: Do you see a relationship between the two main variables (X and Y axes of this plot)?

Text Data¶

Download the NIPS dataset from here https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz and save it to your lab2 directory. Unzip the file, producing docword.nips.txt.

This file has 3 header lines: the number of documents, the number of distinct words, and the number of nonzero (document, word) counts. The following lines represent the documents with three fields:

docid wordid wordcount

We can read the file with a csv reader:

In [ ]:

with open(labdir+"docword.nips.txt") as csvfile:
    ndocs = int(csvfile.readline())
    nwords = int(csvfile.readline())
    nnz = int(csvfile.readline())
    nips_raw_data = list(csv.reader(csvfile, delimiter=' '))
    
nips_data = [[int(x) for x in y] for y in nips_raw_data] # convert from string to numeric data

In [ ]:

[ndocs, nwords, nnz]

Now we're going to create an array 'counts' containing the counts for each word over all documents. Note that we use 'row[1]-1' as the index. The docword files use 1-based array indexing, but Python uses zero-based indexing.

In [ ]:

counts = [0] * nwords
for row in nips_data:
    counts[row[1]-1] += row[2] # increment the count for this word by the value in the third column
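
The index shift is easy to get wrong, so here is the same accumulation on a tiny made-up table of (docid, wordid, wordcount) rows. The four rows and the vocabulary size are hypothetical.

```python
nwords_toy = 4
# Made-up rows: docid, wordid (1-based), wordcount
toy_rows = [[1, 1, 2], [1, 3, 1], [2, 1, 5], [2, 4, 3]]

toy_counts = [0] * nwords_toy
for docid, wordid, wordcount in toy_rows:
    toy_counts[wordid - 1] += wordcount   # word id 1 lands in slot 0

print(toy_counts)
```

Word id 1 appears in both documents, so its total is 2 + 5; word id 2 never appears and stays at zero.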

Next we zip the word index as the first column, and sort this table by word count in descending order.

In [ ]:

# sorted() accepts the zip result directly in both Python 2 and Python 3
wordtab = sorted(zip(range(nwords), counts), key=lambda x: x[1], reverse=True)

The top (first) values in this list are the most frequent word ids (first column), and their counts (second column):

In [ ]:

wordtab[0:8]

Now grab the vocabulary file for nips: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt and save it to lab2. Run the following to load it and create a dictionary (word -> wordid) and inverse dictionary (wordid -> word) from it.

In [ ]:

mydict = {}            # word dictionary
words = [''] * nwords  # inverse dictionary - just an array of strings
i = 0
with open(labdir+"vocab.nips.txt") as txtfile:
    for line in txtfile:
        word = line.rstrip('\n')
        mydict[word] = i
        words[i] = word
        i += 1
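
The same pair of structures can be built more compactly with enumerate. The sketch below uses an in-memory stand-in for the vocabulary file with three made-up words, and the names toy_words and toy_dict are ours.

```python
import io

vocab_file = io.StringIO("alpha\nbeta\ngamma\n")  # stand-in for vocab.nips.txt

toy_words = [line.rstrip('\n') for line in vocab_file]      # wordid -> word
toy_dict = {word: i for i, word in enumerate(toy_words)}    # word -> wordid

print(toy_dict['beta'], toy_words[toy_dict['gamma']])
```

Because the list position is the zero-based word id, it lines up with the 'row[1]-1' indexing used for the counts.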

Now we can find the top words using the inverse dictionary:

In [ ]:

topwords = [words[x] for x,y in wordtab[0:10]]
topwords

TODO: What do you think is the topic of the NIPS dataset?

Finally, we can plot the word counts in rank order (decreasing order of frequency).

In [ ]:

scounts = [y for x,y in wordtab]
pylab.plot(scounts)

What form does this curve have? To make it clearer, let's do a log-log plot.

In [ ]:

pylab.loglog(scounts)
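
A straight line on a log-log plot means a power law, and its slope can be estimated by fitting a line to the logarithms. The sketch below does this on a synthetic curve with a known exponent of -1; the data and the fitting window are assumptions chosen for illustration, not the NIPS counts.

```python
import numpy as np

ranks = np.arange(1, 1001)            # ranks start at 1 so the log is defined
synthetic_counts = 1000.0 / ranks     # made-up power law with exponent -1

# Fit only inside a chosen window of ranks (here 10 to 1000)
mask = (ranks >= 10) & (ranks <= 1000)
slope, intercept = np.polyfit(np.log10(ranks[mask]),
                              np.log10(synthetic_counts[mask]), 1)
print(slope)
```

On real counts the curve bends at both ends, so the estimate depends on the window you choose.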

TODO: What is the approximate slope (in log-log space) of this curve over the frequency range 10^1 to 10^3 ?

Lab 2 Responses¶

The lab 2 responses should be entered here: https://bcourses.berkeley.edu/courses/1377158/quizzes/2045090