# Lesson 2

Create Data - We begin by creating our own data set for analysis. This prevents the end user reading this tutorial from having to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file containing the baby names. The data consist of baby names and the number of births recorded in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will take a look inside the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user what is the most popular name in a specific year.

NOTE:
Make sure you have looked through all previous lessons as the knowledge learned in previous lessons will be needed for this exercise.

Numpy will be used to help generate the sample data set. Importing the libraries is the first step we will take in the lesson.

In [1]:
# Import all libraries needed for the tutorial

# General syntax to import specific functions in a library:
##from (library) import (specific library function)
from pandas import DataFrame, read_csv
from numpy import random

# General syntax to import a library but no functions:
##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
print 'Pandas version ' + pd.__version__

Pandas version 0.11.0



## Create Data

The data set will consist of 1,000 baby names and the number of births recorded for that year (1880). We will also add plenty of duplicates so you will see the same baby name more than once. You can think of the multiple entries per name simply being different hospitals around the country reporting the number of births per baby name. So if two hospitals reported the baby name "Bob", the data will have two values for the name Bob. We will start by creating the random set of baby names.

In [3]:
# The initial set of baby names
names = ['Bob','Jessica','Mary','John','Mel']


To make a random list of 1,000 baby names using the five above we will do the following:

• Generate a random number between 0 and 4

To do this we will be using the functions seed, randint, len, range, and zip.

In [4]:
'''
This will ensure the random samples below can be reproduced.
This means the random samples will always be identical.
'''

random.seed?

In [5]:
'''
randint(low, high=None, size=None)
Return random integers from low (inclusive) to high (exclusive).
'''

random.randint?

In [6]:
'''
len(object) -> integer
Return the number of items of a sequence or mapping.
'''

len?

In [7]:
'''
range([start,] stop[, step]) -> list of integers
Return a list containing an arithmetic progression of integers.
'''

range?

In [8]:
'''
zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element
from each of the argument sequences.  The returned list is truncated
in length to the length of the shortest argument sequence.
'''

zip?


seed(500) - Create seed

randint(low=0,high=len(names)) - Generate a random integer from zero up to, but not including, the length of the list "names" (so 0 through 4).

names[n] - Select the name where its index is equal to n.

for i in range(n) - Loop n times, with i taking the values 0,1,2,....n-1.

random_names = Select a random name from the name list and do this n times.

In [9]:
# seed and randint live in the numpy random module imported above
random.seed(500)
random_names = [names[random.randint(low=0,high=len(names))] for i in range(1000)]

# Print first 10 records
print random_names[:10]

['Mary', 'Jessica', 'Jessica', 'Bob', 'Jessica', 'Jessica', 'Jessica', 'Mary', 'Mary', 'Mary']
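To see why the seed matters, here is a minimal sketch of the same idea. It is written for Python 3 and a recent NumPy, which is an assumption (the cells in this lesson use Python 2), and the helper name make_random_names is hypothetical. It shows that re-seeding with the same value reproduces the same list of names.

```python
import numpy as np

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']

def make_random_names(seed, count):
    """Hypothetical helper: pick `count` names at random, reproducibly."""
    np.random.seed(seed)  # same seed -> same sequence of draws
    # high is exclusive, so the drawn indices are 0 .. len(names) - 1
    return [names[np.random.randint(low=0, high=len(names))] for _ in range(count)]

first = make_random_names(500, 1000)
second = make_random_names(500, 1000)

print(len(first))        # 1000
print(first == second)   # True
```

Because the upper bound of randint is exclusive, every index it returns is a valid position in the list.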


• Generate a random number between 0 and 999 for each name
In [10]:
# The number of births per name for the year 1880
births = [random.randint(low=0,high=1000) for i in range(1000)]
print births[:10]

[968, 155, 77, 578, 973, 124, 155, 403, 199, 191]


• Merge the names and the births data set using the zip function.
In [11]:
BabyDataSet = zip(random_names,births)
print BabyDataSet[:10]

[('Mary', 968), ('Jessica', 155), ('Jessica', 77), ('Bob', 578), ('Jessica', 973), ('Jessica', 124), ('Jessica', 155), ('Mary', 403), ('Mary', 199), ('Mary', 191)]
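The pairing and truncation behavior of zip can be sketched on a tiny input. This example assumes Python 3, where zip returns an iterator rather than a list, so it is wrapped in list() before printing or slicing. The short lists are made up for illustration.

```python
random_names = ['Mary', 'Jessica', 'Jessica', 'Bob']
births = [968, 155, 77, 578, 973]  # one extra value on purpose

# zip pairs items by position and stops at the shortest input
pairs = list(zip(random_names, births))

print(pairs)
# [('Mary', 968), ('Jessica', 155), ('Jessica', 77), ('Bob', 578)]
```

The fifth birth count is dropped because there is no fifth name to pair it with.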



We are basically done creating the data set. We now will use the pandas library to export this data set into a csv file.

df will be a DataFrame object. You can think of this object as holding the contents of the BabyDataSet in a format similar to a SQL table or an Excel spreadsheet. Let's take a look below at the contents inside df.

In [12]:
df = DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df[:10]

Out[12]:
Names Births
0 Mary 968
1 Jessica 155
2 Jessica 77
3 Bob 578
4 Jessica 973
5 Jessica 124
6 Jessica 155
7 Mary 403
8 Mary 199
9 Mary 191
• Export the dataframe to a text file. We can name the file births1880.txt. The function to_csv will be used to export. The file will be saved in the same location as the notebook unless specified otherwise.
In [13]:
'''
df.to_csv(self, path_or_buf, sep=',', na_rep='', float_format=None, cols=None, header=True, index=True, index_label=None, mode='w', nanRep=None, encoding=None, quoting=None, line_terminator='\n')
Write DataFrame to a comma-separated values (csv) file
'''

df.to_csv?


The only parameters we will use are index and header. Setting these parameters to False will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.

In [14]:
df.to_csv('births1880.txt',index=False,header=False)
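A quick way to compare the two settings without touching the disk is to call to_csv with no path, which returns the text instead of writing a file. This is a sketch that assumes Python 3 and a recent pandas (the cells in this lesson use Python 2 and pandas 0.11.0), and the two-row frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([('Mary', 968), ('Jessica', 155)], columns=['Names', 'Births'])

# Defaults (index=True, header=True): row labels and column names are written
with_labels = df.to_csv().splitlines()

# index=False, header=False: only the data values are written
data_only = df.to_csv(index=False, header=False).splitlines()

print(with_labels)  # [',Names,Births', '0,Mary,968', '1,Jessica,155']
print(data_only)    # ['Mary,968', 'Jessica,155']
```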


## Get Data

To pull in the text file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.

read_csv(filepath_or_buffer, sep=',', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine='c', delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False)

In [15]:
read_csv?


Even though this function has many parameters, we will simply pass it the location of the text file.

Location = C:\Users\USERNAME\.xy\startups\births1880.txt

Note: Depending on where you save your notebooks, you may need to modify the location above.

In [16]:
Location = r'C:\Users\hdrojas\.xy\startups\births1880.txt'
df = read_csv(Location)


Notice the r before the string. Since backslashes are escape characters in Python strings, prefixing the string with an r makes it a raw string, so every backslash is taken literally.
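The difference is easy to see on a small example. The path below is hypothetical and chosen because it contains \t and \n, which Python would otherwise read as a tab and a newline.

```python
# Hypothetical Windows path used only for illustration
raw = r'C:\temp\new_file.txt'        # raw string: backslashes kept as-is
escaped = 'C:\\temp\\new_file.txt'   # same path with each backslash doubled
broken = 'C:\temp\new_file.txt'      # \t and \n become a tab and a newline

print(raw == escaped)   # True
print('\t' in broken)   # True
print(raw == broken)    # False
```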

In [17]:
df

Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 0 to 998
Data columns (total 2 columns):
Mary    999  non-null values
968     999  non-null values
dtypes: int64(1), object(1)


When the dataframe is large, pandas will print out a summary of the data.

Summary says:
* There are 999 records in the data set
* There is a column named Mary with 999 values
* There is a column named 968 with 999 values
* Out of the two columns, one is numeric, the other is non numeric

To actually see the contents of the dataframe we can use the head() function which by default will return the first five records. You can also pass in a number n to return the top n records of the dataframe.

In [18]:
df.head()

Out[18]:
Mary 968
0 Jessica 155
1 Jessica 77
2 Bob 578
3 Jessica 973
4 Jessica 124

This brings us to our first problem of the exercise. The read_csv function treated the first record in the text file as the header names. This is obviously not correct since the text file did not provide us with header names.

To correct this we will pass the header parameter to the read_csv function and set it to None (means null in python).

In [19]:
df = read_csv(Location, header=None)
df

Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
0    1000  non-null values
1    1000  non-null values
dtypes: int64(1), object(1)


Summary now says:
* There are 1000 records in the data set
* There is a column named 0 with 1000 values
* There is a column named 1 with 1000 values
* Out of the two columns, one is numeric, the other is non numeric

Now let's take a look at the last five records of the dataframe.

In [20]:
df.tail()

Out[20]:
0 1
995 John 151
996 Jessica 511
997 John 756
998 Jessica 294
999 John 152

If we wanted to give the columns specific names, we would have to pass another parameter called names. When names is passed, we can omit the header parameter.

In [21]:
df = read_csv(Location, names=['Names','Births'])
df.head(5)

Out[21]:
Names Births
0 Mary 968
1 Jessica 155
2 Jessica 77
3 Bob 578
4 Jessica 973
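The three ways of calling read_csv seen so far can be compared side by side on an in-memory file. This is a sketch that assumes Python 3 and a recent pandas (the cells in this lesson use Python 2 and pandas 0.11.0). io.StringIO stands in for the text file, and the three records are taken from the top of our data set.

```python
import io
import pandas as pd

text = "Mary,968\nJessica,155\nJessica,77\n"

# Default: the first record is consumed as the header
inferred = pd.read_csv(io.StringIO(text))

# header=None: every record is data, columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(text), header=None)

# names=...: every record is data, columns get the names we supply
named = pd.read_csv(io.StringIO(text), names=['Names', 'Births'])

print(list(inferred.columns), len(inferred))    # ['Mary', '968'] 2
print(list(no_header.columns), len(no_header))  # [0, 1] 3
print(list(named.columns), len(named))          # ['Names', 'Births'] 3
```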

You can think of the numbers [0,1,2,3,4,...] as the row numbers in an Excel file. In pandas these are part of the index of the dataframe. You can think of the index as the primary key of a SQL table with the exception that an index is allowed to have duplicates.

[Names, Births] can be thought of as column headers similar to the ones found in an Excel spreadsheet or SQL database.
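To illustrate the point about duplicates, here is a small sketch, assuming Python 3 and a recent pandas. The index labels are made up so that the label 0 appears twice.

```python
import pandas as pd

# The label 0 is used for two different rows
df = pd.DataFrame({'Births': [968, 155, 77]}, index=[0, 0, 1])

print(df.index.is_unique)  # False

# Selecting by a duplicated label returns every matching row
rows = df.loc[0]
print(len(rows))           # 2
```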

Delete the csv file now that we are done using it.

In [22]:
import os
os.remove(Location)


## Prepare Data

The data we have consists of baby names and the number of births in the year 1880. We already know that we have 1,000 records and none of the records are missing (non-null values). We can verify the "Names" column still only has five unique names.

We can use the unique method of the "Names" column to find all of its unique values.

In [23]:
# Method 1:
df['Names'].unique()

Out[23]:
array([Mary, Jessica, Bob, John, Mel], dtype=object)

In [24]:
# If you actually want to print the unique values:
for x in df['Names'].unique():
    print x

Mary
Jessica
Bob
John
Mel


In [25]:
# Method 2:
print df['Names'].describe()

count     1000
unique       5
top        Bob
freq       206
dtype: object
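Both methods can be tried on a handful of rows. This sketch assumes Python 3 and a recent pandas, and the six names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Mary', 'Jessica', 'Jessica', 'Bob', 'Mary', 'Jessica']})

# Method 1: unique values, in order of first appearance
unique_names = list(df['Names'].unique())

# Method 2: summary with the count, number of unique values, most common value and its frequency
summary = df['Names'].describe()

print(unique_names)                                        # ['Mary', 'Jessica', 'Bob']
print(summary['unique'], summary['top'], summary['freq'])  # 3 Jessica 3
```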



Since we have multiple values per baby name, we need to aggregate this data so we only have a baby name appear once. This means the 1,000 rows will need to become 5. We can accomplish this by using the groupby function.

In [26]:
'''
df.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True)
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns
'''

df.groupby?

In [27]:
# Create a groupby object
Name = df.groupby(df['Names'])

# Apply the sum function to the groupby object
df = Name.sum()
df

Out[27]:
Births
Names
Bob 106817
Jessica 97826
John 90705
Mary 99438
Mel 102319
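The aggregation step can be followed by hand on a tiny data set. This is a sketch assuming Python 3 and a recent pandas, and the five records are made up so the sums are easy to check.

```python
import pandas as pd

df = pd.DataFrame(
    [('Bob', 10), ('Mel', 5), ('Bob', 7), ('Mel', 1), ('Mary', 3)],
    columns=['Names', 'Births'],
)

# One row per name; the names become the index of the result
totals = df.groupby('Names').sum()

print(list(totals.index))           # ['Bob', 'Mary', 'Mel']
print(totals.loc['Bob', 'Births'])  # 17
print(totals.loc['Mel', 'Births'])  # 6
```

Note that after grouping, Names is no longer a regular column. It is the index of the aggregated dataframe.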

## Analyze Data

To find the most popular name, that is, the baby name with the highest number of births, we can do one of the following.

• Sort the dataframe and select the top row
• Use the max() method to find the maximum value
In [28]:
# Method 1:
Sorted = df.sort(['Births'], ascending=[0])
Sorted.head(1)

Out[28]:
Births
Names
Bob 106817
In [29]:
# Method 2:
df['Births'].max()

Out[29]:
106817
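If you are following along with a newer pandas, note that DataFrame.sort was removed in later releases and sort_values is its replacement. The sketch below assumes Python 3 and a recent pandas, and rebuilds the aggregated totals shown in Out[27] by hand.

```python
import pandas as pd

# Totals copied from the aggregated output above
totals = pd.DataFrame(
    {'Births': [106817, 97826, 90705, 99438, 102319]},
    index=pd.Index(['Bob', 'Jessica', 'John', 'Mary', 'Mel'], name='Names'),
)

# Method 1: sort in descending order and keep the top row
top_row = totals.sort_values('Births', ascending=False).head(1)

# Method 2: ask for the maximum value directly
max_births = totals['Births'].max()

print(top_row.index[0])  # Bob
print(max_births)        # 106817
```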


## Present Data

Here we can plot the Births column and label the graph to show the end user the highest point on the graph. In conjunction with the table, the end user has a clear picture that Bob is the most popular baby name in the data set.

plot() is a convenient method that lets you painlessly plot the data in your dataframe. We learned how to find the maximum value of the Births column in the previous section. Finding the actual baby name that goes with the maximum value of 106817 looks a bit tricky, so let's go over it.

Explain the pieces:
df['Births'] - This is the entire list of births in the year 1880, the entire Births column
df['Births'].max() - This is the maximum value found in the Births column
df.index - After the groupby step, the baby names are the index of df rather than a column

[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is equal to 106817]
df[df['Births'] == df['Births'].max()].index[0] IS EQUAL TO Select the name (the index label) of the first record WHERE [The Births column is equal to 106817]

An alternative way could have been to use the Sorted dataframe:
Sorted.head(1).index[0]

The str() function simply converts an object into a string.

In [30]:
# Create graph
df['Births'].plot()

# Maximum value in the data set
MaxValue = df['Births'].max()

# Name associated with the maximum value
MaxName = df[df['Births'] == df['Births'].max()].index[0]

# Text to display on graph
Text = str(MaxValue) + " - " + MaxName

# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
xycoords=('axes fraction', 'data'), textcoords='offset points')

print "The most popular name"
df[df['Births'] == df['Births'].max()]
#Sorted.head(1) can also be used

The most popular name


Out[30]:
Births
Names
Bob 106817
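The lookup of the name and the label text can be checked without drawing the graph. This is a sketch assuming Python 3 and a recent pandas, using the totals from the table above. idxmax is shown as an alternative that is not used in the lesson itself.

```python
import pandas as pd

# Totals copied from the aggregated output above
totals = pd.DataFrame(
    {'Births': [106817, 97826, 90705, 99438, 102319]},
    index=pd.Index(['Bob', 'Jessica', 'John', 'Mary', 'Mel'], name='Names'),
)

max_value = totals['Births'].max()

# Boolean mask: True only for the rows whose Births equal the maximum
mask = totals['Births'] == max_value
max_name = totals[mask].index[0]

# Text that would be placed on the graph
text = str(max_value) + " - " + max_name
print(text)  # 106817 - Bob

# Alternative: idxmax returns the index label of the maximum directly
print(totals['Births'].idxmax())  # Bob
```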