For this lab, we'll explore some data from a very useful source, the UC Irvine machine learning data repository.
First, create a directory ~/labs/lab2 on your VM.
Open a browser on your VM and point it at this ipynb file. Click on the download link (top right hand corner of the page) and download this file into your lab2 directory.
Now, from inside (or outside) your VM, open a browser and go to this URL: https://archive.ics.uci.edu/ml/datasets/Heart+Disease and read the dataset description. This may be a good time to enable clipboard sharing on your VM; it's under the "Devices" menu.
Click on the "Data Folder" link near the top of the page. If you're not inside your VM, copy the URL for this page and then paste it into a browser on your VM.
Now click on "processed.cleveland.data" and save it into ~/labs/lab2.
Now read the data into Python and create a variable "cleveland_raw_data", which is a list of rows from this dataset. Each row should be a list of string values returned by the csv file reader.
labdir = "/home/datascience/labs/lab2/"
import csv
with open(labdir+"processed.cleveland.data") as csvfile:
    cleveland_raw_data = list(csv.reader(csvfile))
TODO: How many rows are there in the dataset?
First we have to clean and sanitize the data. This data is fairly clean and mostly numeric, but some fields contain '?'. To make these easier to handle, we convert them to None. For convenience, you should define a function "safefloat" that takes a string argument and returns None if the argument is '?', and otherwise the float value of the string.
import string
def safefloat(x): # TODO: Implement safefloat()
cleveland_data = [[safefloat(x) for x in y] for y in cleveland_raw_data]
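As a hedged illustration of the behaviour described above, here is one possible sketch of safefloat. The whitespace stripping is our own assumption and not something the lab requires, and the sample rows are made-up values rather than real dataset entries.

```python
def safefloat(x):
    """Return None for the missing-value marker '?', otherwise float(x)."""
    x = x.strip()  # assumption: tolerate stray whitespace around a field
    if x == '?':
        return None
    return float(x)

# Toy rows standing in for the csv reader output (hypothetical values)
sample_rows = [['63.0', '1.0', '?'], ['41.0', '?', '2.0']]
sample_data = [[safefloat(x) for x in row] for row in sample_rows]
```

Other designs are possible, for example catching ValueError so that any non-numeric field becomes None; the version above only treats '?' as missing, which matches the description in this lab.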
As discussed in the dataset summary, the following are the column names.
headers = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
Now we construct a dictionary mapping these header names to the column numbers 0...13:
headernum = dict(zip(headers, range(len(headers))))
Define a function "getcol" that takes a column name and returns the data in that column as a list of numbers.
def getcol(name): # TODO write getcol
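A minimal sketch of the column lookup, shown on a toy table. The lab's getcol takes only the column name and reads the globals cleveland_data and headernum; the extra `data` and `index` parameters here are our assumption, added so the example is self-contained and does not overwrite the real data. All toy_ names and values are hypothetical.

```python
# Toy stand-ins for the lab's headers and data (hypothetical values)
toy_headers = ['age', 'sex', 'num']
toy_headernum = dict(zip(toy_headers, range(len(toy_headers))))
toy_data = [[63.0, 1.0, 0.0], [67.0, 1.0, 2.0], [41.0, 0.0, 0.0]]

def getcol(name, data, index):
    """Return the column called `name` as a list, one value per row."""
    icol = index[name]  # translate the column name into a column number
    return [row[icol] for row in data]

toy_ages = getcol('age', toy_data, toy_headernum)
toy_nums = getcol('num', toy_data, toy_headernum)
```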
What are the minimum, maximum, mean and standard deviation of the age of this set of subjects? Use the numpy package, which contains the mean() and std() functions.
import numpy as np
age = getcol('age')
[min(age), max(age), np.mean(age), np.std(age)]
Next we define a function select which, given a column name and a predicate, returns the values of that column at the rows for which the predicate is true.
def select(colname, predicate):
    icol = headernum[colname]
    return [i[icol] for i in cleveland_data if predicate(i)]
Now run these expressions to get the mean age of male and female subjects.
def fieldis(colname, cval):
    icol = headernum[colname]
    return lambda x: x[icol] == cval  # no parentheses around x, so this also runs under Python 3
[np.mean(select('age', fieldis('sex',1))), np.mean(select('age', fieldis('sex',0)))]
TODO: What were the mean ages for females and males?
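To see how the select-with-predicate pattern behaves, it can be tried on a small made-up table. This is only a sketch: the toy_ names and values are hypothetical, and the `data` and `index` default arguments are our addition so that the example does not depend on the lab's globals.

```python
# Toy table (hypothetical values, not taken from the Cleveland data)
toy_headers = ['age', 'sex', 'trestbps']
toy_headernum = dict(zip(toy_headers, range(len(toy_headers))))
toy_data = [[63.0, 1.0, 145.0], [41.0, 0.0, 130.0], [57.0, 0.0, None]]

def select(colname, predicate, data=toy_data, index=toy_headernum):
    """Values of column `colname` at the rows where predicate(row) is true."""
    icol = index[colname]
    return [row[icol] for row in data if predicate(row)]

# Ages of rows whose blood pressure reading is present
known_bp_ages = select('age', lambda row: row[toy_headernum['trestbps']] is not None)

# Ages of rows where the sex field equals 0
sex0_ages = select('age', lambda row: row[toy_headernum['sex']] == 0)
```

Because the predicate receives the whole row, it can test any column, not just the one being returned.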
Plot histograms of age and resting blood pressure:
import pylab
%matplotlib inline
h1 = pylab.hist(age, 30)
bp = getcol('trestbps')
h2 = pylab.hist(bp, 30)
TODO: Describe the rough shape of the distribution of bps. Is it skewed?
Make scatter plots of age versus resting blood pressure and of age versus maximum heart rate:
pylab.scatter(age, bp)
maxhr = getcol('thalach')
pylab.scatter(age, maxhr)
We can augment the basic scatter plots with other information that might be relevant. In the plot below, we used the 'num' field to color the dots. num is an integer indicating the degree of heart disease from 0...4. We also make the dots larger with the s= argument to make the colors easier to see.
pylab.scatter(age, bp, c=getcol('num'), s=50)
To figure out what color encodes what value, we can do a simple plot of the values 0...4
pylab.scatter(range(5), range(5), c=range(5), s=50)
TODO: What do you notice about the distribution of num = 2 diagnoses?
These scatter plots seem to show trends. To make those clearer we can overlay regression lines. The regression line minimizes the total squared vertical distance from the line to the data points, and shows the general trend for the data.
# for numpy we need arrays instead of lists of values
age = np.array(getcol('age'))
bp = np.array(getcol('trestbps'))
pylab.scatter(age, bp)
m, b = np.polyfit(age, bp, 1)
pylab.plot(age, m*age + b, '-', color='red')
maxhr = np.array(getcol('thalach'))
pylab.scatter(age, maxhr)
m, b = np.polyfit(age, maxhr, 1)
pylab.plot(age, m*age + b, '-', color='red')
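To make the "minimizes the total squared vertical distance" statement concrete, the sketch below computes the slope and intercept from the closed-form least-squares formulas and compares them with np.polyfit. The points are synthetic, not taken from the Cleveland data, and the variable names are ours.

```python
import numpy as np

# Synthetic points (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares slope and intercept
xm, ym = x.mean(), y.mean()
m_manual = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)
b_manual = ym - m_manual * xm

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
m_fit, b_fit = np.polyfit(x, y, 1)

def sse(m, b):
    """Total squared vertical distance from the line y = m*x + b to the points."""
    return np.sum((y - (m * x + b)) ** 2)

best = sse(m_fit, b_fit)
perturbed = sse(m_fit + 0.1, b_fit)  # any other line should do no better
```

Changing the slope or intercept away from the fitted values can only increase the total squared distance, which is what makes the fitted line the regression line.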
The following scatter plot and regression line show the relationship between blood pressure (X-axis) and heart disease (Y-axis).
num = np.array(getcol('num'))
factor = bp
pylab.scatter(factor, num)
m, b = np.polyfit(factor, num, 1)
pylab.plot(factor, m*factor + b, '-', color='red')
TODO: Based on this plot, do you think blood pressure influences heart disease?
Now consider this plot of age versus num:
num = np.array(getcol('num'))
factor = age
pylab.scatter(factor, num)
m, b = np.polyfit(factor, num, 1)
pylab.plot(factor, m*factor + b, '-', color='red')
TODO: Based on this plot of Age vs Num and the previous plot of Age vs BPS, what would you say now about the relation between BPS and Num?
Recall that dimension reduction allows you to look at the dominant factors in high-dimensional data. Matplotlib includes the PCA function for this purpose. You use it like this:
from matplotlib.mlab import PCA
cleveland_matrix = np.array(cleveland_data, dtype=np.float64) # First put the data in a 2D array of double-precision floats
results = PCA(cleveland_matrix[:,0:8]) # leave out columns with None in them
yy = results.Y # returns the projections of the data into the principal component directions
pylab.scatter(yy[:,0],yy[:,1])
TODO: Do you see a relationship between the two main variables (X and Y axes of this plot)?
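To our knowledge, matplotlib.mlab.PCA was deprecated and later removed from Matplotlib (around version 3.1), so the import above may fail on a recent installation. As a hedged alternative, the sketch below projects data onto its principal components using only numpy's SVD. Standardizing each column first is our assumption about a reasonable default, and the function name pca_project and the synthetic data are hypothetical.

```python
import numpy as np

def pca_project(X):
    """Project the rows of X onto the principal component directions.

    Each column is standardized (zero mean, unit standard deviation) first.
    Returns the projections and the fraction of variance per component.
    """
    X = np.asarray(X, dtype=np.float64)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Y = Z.dot(Vt.T)
    fracs = s ** 2 / np.sum(s ** 2)
    return Y, fracs

# Synthetic data: two strongly related columns plus one independent column
rng = np.random.RandomState(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])
Y, fracs = pca_project(X)
```

With the lab's data, something like pca_project(cleveland_matrix[:,0:8]) would play the role of results.Y, and the first two columns of the result could be passed to pylab.scatter as before.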
Download the NIPS dataset from here https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz and save it to your lab2 directory. Unzip the file, producing docword.nips.txt.
This file has 3 header lines: the number of documents, the number of distinct words, and the number of nonzero counts (that is, the number of entries that follow). Each of the following lines describes one entry with three fields:
docid wordid wordcount
We can read the file with a csv reader:
with open(labdir+"docword.nips.txt") as csvfile:
    ndocs = int(csvfile.readline())
    nwords = int(csvfile.readline())
    nnz = int(csvfile.readline())
    nips_raw_data = list(csv.reader(csvfile, delimiter=' '))
nips_data = [[int(x) for x in y] for y in nips_raw_data] # convert from string to numeric data
[ndocs, nwords, nnz]
Now we're going to create an array 'counts' containing the counts for each word over all documents. Note that we use 'row[1]-1' as the index: the docword files use 1-based array indexing, but Python uses zero-based indexing.
counts = [0] * nwords
for row in nips_data:
    counts[row[1]-1] += row[2] # increment the count for this word by the value in the third column
Next we zip the word index as the first column, and sort this table by word count in descending order.
import operator
wordtab = list(zip(range(nwords), counts)) # list() so the table can be sorted in place under Python 3 as well
wordtab.sort(key=lambda x: x[1], reverse=True)
The top (first) values in this list are the most frequent word ids (first column), and their counts (second column):
wordtab[0:8]
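The counting and ranking steps above can be checked by hand on a tiny made-up example. This is a sketch only: the triples are hypothetical, and sorted(enumerate(...)) is our substitute for the zip-and-sort used in the lab code.

```python
# Toy (docid, wordid, count) triples using 1-based ids, as in the docword format
toy_nwords = 4
toy_rows = [[1, 1, 3], [1, 3, 1], [2, 3, 5], [2, 4, 2], [3, 1, 1]]

# Accumulate the total count per word, shifting the word id to a 0-based index
toy_counts = [0] * toy_nwords
for docid, wordid, count in toy_rows:
    toy_counts[wordid - 1] += count

# Pair each 0-based word index with its count, most frequent first
toy_wordtab = sorted(enumerate(toy_counts), key=lambda pair: pair[1], reverse=True)
```

Here word index 2 (word id 3 in the file) comes first because its counts in documents 1 and 2 add up to the largest total.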
Now grab the vocabulary file for nips: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt and save it to lab2. Run the following to load it and create a dictionary (word -> wordid) and inverse dictionary (wordid -> word) from it.
mydict = {} # word dictionary
words = [''] * nwords # inverse dictionary - just an array of strings
i = 0
with open(labdir+"vocab.nips.txt") as txtfile:
    for line in txtfile:
        word = line.rstrip('\n')
        mydict[word] = i
        words[i] = word
        i += 1
Now we can find the top words using the inverse dictionary:
topwords = [words[x] for x,y in wordtab[0:10]]
topwords
TODO: What do you think is the topic of the NIPS dataset?
Finally, we can plot the word counts in rank order (decreasing order of frequency).
scounts = [y for x,y in wordtab]
pylab.plot(scounts)
What form does this curve have? To make it clearer, let's do a log-log plot.
pylab.loglog(scounts)
TODO: What is the approximate slope (in log-log space) of this curve over the frequency range 10^1 to 10^3 ?
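One way to estimate a slope in log-log space is to fit a straight line to the logarithms of both axes. The sketch below does this on synthetic power-law data with a known exponent, so the result can be checked. The exponent, the scale factor and the choice to select points by rank are all our assumptions; the real scounts will not follow an exact power law.

```python
import numpy as np

# Synthetic rank/count data following count = 5000 * rank^(-0.8) (assumed values)
ranks = np.arange(1, 1001, dtype=np.float64)
synthetic_counts = 5000.0 * ranks ** -0.8

# Keep only the part of the curve of interest (here: ranks 10 to 1000)
mask = (ranks >= 10) & (ranks <= 1000)

# A power law is a straight line in log-log space; its slope is the exponent
slope, intercept = np.polyfit(np.log10(ranks[mask]),
                              np.log10(synthetic_counts[mask]), 1)
```

For the lab data, ranks would be np.arange(1, len(scounts) + 1) and synthetic_counts would be replaced by np.array(scounts), with any zero counts excluded before taking logarithms.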
The lab 2 responses should be entered here: https://bcourses.berkeley.edu/courses/1377158/quizzes/2045090