#!/usr/bin/env python
# coding: utf-8

# # Exploratory Data Analysis
#
# Author: Andrew Andrade ([andrew@andrewandrade.ca](mailto:andrew@andrewandrade.ca))
#
# This is a complementary tutorial for [datascienceguide.github.io](http://datascienceguide.github.io/) outlining the basics of [exploratory data analysis](http://datascienceguide.github.io/)
#
# In this tutorial, we will learn to open a comma-separated value (CSV) data file, compute summary statistics, and make basic visualizations of the variables in the Anscombe dataset (to see the importance of visualization). Next we will investigate [Fisher's Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) using more powerful visualizations.
#
# These tutorials assume a basic understanding of Python, so for those new to Python, understanding basic syntax will be very helpful. I recommend writing Python code in a Jupyter notebook, as it allows you to rapidly prototype and annotate your code.
#
# Python is a very easy language to get started with and there are many guides.
#
# Full list:
#
# http://docs.python-guide.org/en/latest/intro/learning/
#
# My favourite resources:
#
# https://docs.python.org/2/tutorial/introduction.html
# https://docs.python.org/2/tutorial/
# http://learnpythonthehardway.org/book/
# https://www.udacity.com/wiki/cs101/%3A-python-reference
# http://rosettacode.org/wiki/Category:Python
#
# Once you are familiar with Python, the first part of this guide is useful for learning some of the libraries we will be using:
#
# http://cs231n.github.io/python-numpy-tutorial
#
# In addition, the following posts help teach the basics of data analysis in Python:
#
# http://www.analyticsvidhya.com/blog/2014/07/baby-steps-libraries-data-structure/
# http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/

# ## Downloading CSVs
#
# We should store the data in a known location on our local computer or server.
# The simplest way is to download and save each file in the same folder you launch Jupyter notebook from, but I prefer to save my datasets in a datasets folder one directory up from my tutorial code (../datasets/).
#
# You should download the following CSVs:
#
# http://datascienceguide.github.io/datasets/anscombe_i.csv
# http://datascienceguide.github.io/datasets/anscombe_ii.csv
# http://datascienceguide.github.io/datasets/anscombe_iii.csv
# http://datascienceguide.github.io/datasets/anscombe_iv.csv
# http://datascienceguide.github.io/datasets/iris.csv
#
# If using a server, you can download a file with the following command:
#
# ```bash
# wget http://datascienceguide.github.io/datasets/iris.csv
# ```

# Now we can run the following code to open the CSVs.

# In[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

get_ipython().run_line_magic('matplotlib', 'inline')

anscombe_i = pd.read_csv('../datasets/anscombe_i.csv')
anscombe_ii = pd.read_csv('../datasets/anscombe_ii.csv')
anscombe_iii = pd.read_csv('../datasets/anscombe_iii.csv')
anscombe_iv = pd.read_csv('../datasets/anscombe_iv.csv')

# The first three lines of code import the libraries we are using and rename them to shorter names.
#
# [Matplotlib](http://matplotlib.org/) is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. We will use it for basic graphics.
#
# [NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python. It contains, among other things:
#
# - a powerful N-dimensional array object
# - sophisticated (broadcasting) functions
# - tools for integrating C/C++ and Fortran code
# - useful linear algebra, Fourier transform, and random number capabilities
#
# [Pandas](http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
# It extends the NumPy array to allow for columns of different variable types.
#
# Since we are using Jupyter notebook, we use the line `%matplotlib inline` to tell Python to put the figures inline with the notebook (instead of in a popup).
#
# `pd.read_csv` opens a .csv file and stores it in a DataFrame object, which we call anscombe_i, anscombe_ii, etc.
#
# Next, let us see the structure of the data by printing the first 5 rows (using `[:5]`) of the data set:

# In[2]:

print(anscombe_i[0:5])

# Now let us use the describe function to see the 3 most basic summary statistics:

# In[3]:

print("Data Set I")
print(anscombe_i.describe()[:3])
print("Data Set II")
print(anscombe_ii.describe()[:3])
print("Data Set III")
print(anscombe_iii.describe()[:3])
print("Data Set IV")
print(anscombe_iv.describe()[:3])

# Looking only at the mean and the standard deviation, the datasets appear almost identical. Instead, let us make a scatter plot for each of the data sets.
#
# Since the data is stored in a DataFrame (similar to an Excel sheet), we can see the column names on top, and we can access the columns using the following syntax:
#
# anscombe_i.x
#
# anscombe_i.y
#
# or
#
# anscombe_i['x']
#
# anscombe_i['y']

# In[4]:

plt.figure(1)
plt.scatter(anscombe_i.x, anscombe_i.y, color='black')
plt.title("anscombe_i")
plt.xlabel("x")
plt.ylabel("y")

plt.figure(2)
plt.scatter(anscombe_ii.x, anscombe_ii.y, color='black')
plt.title("anscombe_ii")
plt.xlabel("x")
plt.ylabel("y")

plt.figure(3)
plt.scatter(anscombe_iii.x, anscombe_iii.y, color='black')
plt.title("anscombe_iii")
plt.xlabel("x")
plt.ylabel("y")

plt.figure(4)
plt.scatter(anscombe_iv.x, anscombe_iv.y, color='black')
plt.title("anscombe_iv")
plt.xlabel("x")
plt.ylabel("y")

# Shockingly, we can clearly see that the datasets are quite different! The first data set has pure irreducible error, the second data set is not linear, the third dataset has an outlier, and in the fourth dataset all of the x values are the same except for an outlier.
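# To make the point concrete, here is a quick sketch showing that even the correlation between x and y is nearly identical across the quartet (roughly 0.816 for all four datasets). The values for dataset I are embedded inline so the snippet runs even without the CSVs; with the DataFrames loaded you would simply call `anscombe_i.corr()`.

# In[ ]:

# ```python
# import numpy as np
#
# # Anscombe dataset I (the same values as ../datasets/anscombe_i.csv)
# x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
# y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
#               7.24, 4.26, 10.84, 4.82, 5.68])
#
# # Pearson correlation coefficient between x and y
# r = np.corrcoef(x, y)[0, 1]
# print(round(r, 3))  # roughly 0.816, like the other three datasets
# ```
import numpy as np

# Anscombe dataset I (the same values as ../datasets/anscombe_i.csv)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # roughly 0.816, like the other three datasets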
# If you do not believe me, I uploaded an Excel worksheet with the full datasets and summary statistics [here](http://datascienceguide.github.io/datasets/ansombe.xlsx)
#
# Now let us learn how to make a box plot. Before writing this tutorial I didn't know how to make a box plot in matplotlib (I usually use seaborn, which we will learn soon). I did a quick Google search for "box plot matplotlib" and found an example [here](http://matplotlib.org/examples/pylab_examples/boxplot_demo.html) which outlines a couple of styling options.

# In[5]:

# basic box plot
plt.figure(1)
plt.boxplot(anscombe_i.y)
plt.title("anscombe_i y box plot")

# Try reading the documentation for the box plot above and make your own visualizations.
#
# Next we are going to learn how to use Seaborn, which is a very powerful visualization library. Matplotlib is a great library and has [many examples of different plots](http://matplotlib.org/gallery.html), but seaborn is built on top of matplotlib and offers better plots for statistical analysis. If you do not have seaborn installed, you can follow the instructions here: http://stanford.edu/~mwaskom/software/seaborn/installing.html#installing . Seaborn also has many [examples](http://stanford.edu/~mwaskom/software/seaborn/examples/index.html) and a [tutorial](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html).
#
# To show the power of the library, we are going to plot the Anscombe datasets in one figure following this example: http://stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html . Do not worry too much about what the code does (it loads the same dataset and changes settings to make the visualization clearer); we will get more experience with seaborn soon.
# In[6]:

import seaborn as sns
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
# (in seaborn versions before 0.9, the `height` argument was named `size`)
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})

# Seaborn does linear regression automatically (which we will learn about soon). We can also see that the fitted regression line is the same for each dataset even though the datasets are quite different.
#
# The big takeaway here is that summary statistics can be deceptive! Always make visualizations of your data before making any models.

# # Iris Dataset
#
# Next we are going to visualize the Iris dataset. Let us first read the .csv and print the first elements of the dataframe. We also get the basic summary statistics.

# In[7]:

iris = pd.read_csv('../datasets/iris.csv')
print(iris[0:5])
print(iris.describe())

# As we can see, it is difficult to interpret the results. We can see that sepal length, sepal width, petal length and petal width are all numeric features, and the iris variable is the specific type of iris (a categorical variable). To better understand the data, we can split the data based on each type of iris, make a histogram for each numeric feature, make scatter plots between features, and make many other visualizations.
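# One quick way to split the summary statistics by class is pandas' groupby. Here is a minimal sketch using a tiny inline sample (the values below are made-up illustration data; only the column layout, with an "iris" class column, matches the tutorial's CSV):

# In[ ]:

import pandas as pd

# Tiny illustrative sample (made-up values) with the same column
# layout as iris.csv: numeric features plus an "iris" class column.
iris_sample = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "iris": ["Iris-setosa", "Iris-setosa",
             "Iris-versicolor", "Iris-versicolor",
             "Iris-virginica", "Iris-virginica"],
})

# Per-class means: one row per iris type
per_class = iris_sample.groupby("iris").mean()
print(per_class)

# With the real dataset loaded, `iris.groupby("iris").describe()` gives the full per-class summary table in one call.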
# I will demonstrate the process by generating a histogram of sepal length for Iris-setosa and a scatter plot of sepal length vs. sepal width for Iris-setosa.

# In[8]:

# select all Iris-setosa
iris_setosa = iris[iris.iris == "Iris-setosa"]

plt.figure(1)
# make histogram of sepal length
plt.hist(iris_setosa["sepal length"])
plt.xlabel("sepal length")

plt.figure(2)
plt.scatter(iris_setosa["sepal width"], iris_setosa["sepal length"])
plt.xlabel("sepal width")
plt.ylabel("sepal length")

# This helps us better understand the data and is necessary for good analysis, but doing this for all the features and iris types (classes) would take a significant amount of time. Seaborn has a function called pairplot which will do all of that for us!

# In[9]:

sns.pairplot(iris, hue="iris")

# Now we have a much better understanding of the data. For example, we can see linear correlations between some of the numeric features. We can also see which numeric features separate the types of iris well and which would not.
#
# Exploratory data analysis is not done! We could spend a whole course on exploratory data analysis (I took one when I was on exchange at the National University of Singapore). For this reason, EDA will be a recurring theme in these tutorials, and we will continue to visualize data throughout. Data will come in different forms; it is our role as data scientists to quickly and effectively understand it.
#
# In the next tutorial we will use the Anscombe dataset for regression, and in future tutorials we will revisit the iris dataset to do classification.

# # Next Actions:
#
# Exploratory data analysis is always an ongoing process, and as we have learnt in this tutorial, it is a necessary step before we start modeling. The way to get better at plotting data is to get started plotting! Pick an interesting dataset you can find and start exploring!
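# As a starting recipe for any new dataset, here is a minimal "first look" sketch. A tiny stand-in CSV (made-up values) is embedded so the snippet runs end to end; in practice you would replace the StringIO with your own file path, e.g. `pd.read_csv("../datasets/iris.csv")`.

# In[ ]:

import io
import pandas as pd

# Stand-in CSV so the sketch is self-contained; swap in your own file.
csv_text = """a,b,label
1,2.5,x
2,,y
3,4.0,x
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # rows and columns
print(df.dtypes)          # which columns are numeric vs. categorical
print(df.head())          # first rows
print(df.isnull().sum())  # missing values per column (here, one in b)
print(df.describe())      # summary statistics for numeric columns

# Running these five lines before any plotting tells you what kinds of plots (histograms, scatter plots, pairplots) the data can support.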
# Here are some datasets to get you started:
#
# http://www.kdnuggets.com/datasets/index.html
# https://github.com/caesar0301/awesome-public-datasets
# https://github.com/datasciencemasters/data
# https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
# http://opendata.city-of-waterloo.opendata.arcgis.com/
# https://github.com/uWaterloo/Datasets
#
# You can also look online for examples and sample code from others using matplotlib, seaborn, and ggplot2 (for those using R) for inspiration.
#
# Have fun!