Getting Warmed Up

This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more: http://rmdk.ca

Email me: ryan@rmdk.ca


Learning NumPy Basics

In [66]:

import numpy as np
np.version.full_version

Out[66]:

'1.8.0'

In [48]:

a = np.array([0,1,2,3,4, 5])
a

Out[48]:

array([0, 1, 2, 3, 4, 5])

Get a look at the dimensions and shape of the data

In [49]:

print(a.ndim)
print(a.shape)

1
(6,)

Transform array into 2D matrix.

In [50]:

b = a.reshape((3,2)) # rows, columns
b

Out[50]:

array([[0, 1],
       [2, 3],
       [4, 5]])
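As a small sketch of how reshape behaves (the variable names here are illustrative, not part of the notebook above): passing -1 for one dimension asks NumPy to infer it from the array's size, and a shape that does not match the number of elements raises a ValueError.

```python
import numpy as np

a = np.arange(6)               # array([0, 1, 2, 3, 4, 5])

b = a.reshape((-1, 2))         # -1 lets NumPy infer the number of rows
print(b.shape)                 # (3, 2)
print(b.ndim)                  # 2

# The new shape must account for every element: 6 values cannot fill 4 x 2.
try:
    a.reshape((4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)          # True
```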

However, reshape returns a view of a rather than a copy whenever it can, so a and b share the same data and a change to either one is reflected in both.

In [51]:

b[0][0] = 100
b

Out[51]:

array([[100,   1],
       [  2,   3],
       [  4,   5]])

In [52]:

a

Out[52]:

array([100,   1,   2,   3,   4,   5])

Whenever you need a true copy, use .copy().

In [53]:

c = a.reshape((3,2)).copy()
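One way to check whether two arrays share data is np.shares_memory. This is a sketch that assumes NumPy 1.11 or later, which is newer than the 1.8.0 shown above; the names view and copy are illustrative.

```python
import numpy as np

a = np.arange(6)                   # array([0, 1, 2, 3, 4, 5])
view = a.reshape((3, 2))           # a view: no data is copied
copy = a.reshape((3, 2)).copy()    # an independent copy with its own memory

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False

copy[0, 0] = -1                    # only the copy changes
view[0, 0] = 100                   # this write is visible through a as well
print(a[0])                        # 100
```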

Another feature of NumPy arrays is that operations are propagated to the individual elements, in contrast to normal Python lists.

In [84]:

print(a * 2)
print([1,2,3,4] * 2)

[200   2   4   6   8  10]
[1, 2, 3, 4, 1, 2, 3, 4]
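To make the contrast concrete, here is a minimal sketch (with illustrative names) of what a list needs in order to match the array's element-wise behaviour: an explicit loop or comprehension.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

doubled_list = [v * 2 for v in values]   # a list needs an explicit loop
doubled_arr = arr * 2                    # the array applies * 2 to each element

print(doubled_list)    # [2, 4, 6, 8]
print(doubled_arr)     # [2 4 6 8]

# Other arithmetic operators are element-wise too.
summed = arr + arr
squared = arr ** 2
print(summed)          # [2 4 6 8]
print(squared)         # [ 1  4  9 16]
```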

Indexing

Arrays can be accessed in several ways.

In addition to normal list indexing, we can use arrays themselves as indices.

In [55]:

# Index a with a vector 2,3,4
a[np.array([2,3,4])]

Out[55]:

array([2, 3, 4])
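Index arrays do not have to be sorted or unique. The sketch below (with made-up data) shows that indices may repeat and appear in any order, and that the result is a new array rather than a view of the original.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])
idx = np.array([4, 0, 0, 2])   # indices may repeat and come in any order

picked = data[idx]             # indexing with an array returns a new array
print(picked)                  # [50 10 10 30]

picked[0] = -1                 # changing the result leaves data untouched
print(data)                    # [10 20 30 40 50]
```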

Since conditions are propagated to the individual elements, we can access our data in interesting ways.

In [56]:

# Return boolean mask
print(a > 4)
mask = a > 4

print(a[a>4] == a[mask])

# Return the masked data or everything but the mask
print(a[mask])
print(a[~mask])

[ True False False False False  True]
[ True  True]
[100   5]
[1 2 3 4]
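Masks can also be combined. A short sketch, using the same values as a above under the illustrative name data: & and | act element-wise, ~ inverts a mask, and each comparison needs its own parentheses because these operators bind more tightly than comparisons.

```python
import numpy as np

data = np.array([100, 1, 2, 3, 4, 5])

in_range = (data >= 2) & (data <= 4)   # element-wise AND
extreme = (data < 2) | (data > 50)     # element-wise OR

print(data[in_range])    # [2 3 4]
print(data[extreme])     # [100   1]
print(data[~in_range])   # [100   1   5]
print(in_range.sum())    # 3, since True counts as 1
```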

You could also do things like trim outliers.

In [57]:

a[a>3] = 3
a

Out[57]:

array([3, 1, 2, 3, 3, 3])

It turns out that this is pretty popular, so there is a predefined function for it.

.clip() takes a lower and an upper bound and clips the values at both ends of that interval.

In [58]:

a.clip(1,3)

Out[58]:

array([3, 1, 2, 3, 3, 3])
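Because a had already been trimmed at 3, the call above leaves it looking unchanged. A sketch on fresh, made-up data shows both bounds taking effect, and that .clip() returns a new array instead of modifying the original.

```python
import numpy as np

data = np.array([100, 1, 2, 3, 4, 5, -7])

clipped = data.clip(2, 4)   # values below 2 become 2, values above 4 become 4
print(clipped)              # [4 2 2 3 4 4 2]
print(data[0])              # 100, the original array is not modified
```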

Dealing with missing values

One of the most common things we run into as data scientists is missing data. How we deal with that missing data is integral to the outcome and robustness of the analysis. NumPy denotes missing values with the special floating-point value NaN ("not a number"), available as np.nan.

In [29]:

# Pretend this was read from a text file
c = np.array([1, 2, np.nan, 3, 5])
print(c)
print(np.isnan(c))

[  1.   2.  nan   3.   5.]
[False False  True False False]

In [32]:

c[~np.isnan(c)] # Non-missing data

Out[32]:

array([ 1.,  2.,  3.,  5.])

This becomes a required tool even in the first stages of exploratory analysis, because a single NaN turns aggregates such as the mean into NaN.

In [36]:

print(np.mean(c))
print(np.mean(c[~np.isnan(c)]))

nan
2.75
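NumPy also ships NaN-aware aggregates that skip missing values without an explicit mask. A minimal sketch, assuming NumPy 1.8 or later for np.nanmean:

```python
import numpy as np

c = np.array([1, 2, np.nan, 3, 5])

n_missing = np.isnan(c).sum()   # count the missing entries
mean_valid = np.nanmean(c)      # mean over the non-NaN values only
sum_valid = np.nansum(c)        # NaN is treated as zero in the sum

print(n_missing)    # 1
print(mean_valid)   # 2.75
print(sum_valid)    # 11.0
```

This gives the same 2.75 as masking with ~np.isnan(c) before calling np.mean.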

Let's compare runtime between NumPy and regular Python lists

We are using NumPy for a reason, right? Here we will simply calculate the sum of squares of the integers from 0 to 999 and report how long it takes over 10 000 iterations.

In [74]:

from timeit import timeit

normal_python = timeit('sum(x*x for x in range(1000))',
                       number=10000)

# The dot product of x with itself is equivalent to the sum of squares
numpy_python = timeit('x.dot(x)',
                      setup='import numpy as np; x = np.arange(1000)',
                      number=10000)

print("Normal Python: {} seconds".format(normal_python))
print("Numpy: {} seconds".format(numpy_python))

Normal Python: 0.676285982132 seconds
Numpy: 0.0202009677887 seconds

However, we have to be careful, because simply using NumPy does not guarantee efficiency.

If we don't take advantage of the optimized NumPy code, we are not going to get anywhere. We should always look for the optimized, or vectorized, versions, which allow us to operate on the entire matrix or array at once rather than looping.

In [76]:

dumb_numpy = timeit('sum(x*x)',
                    setup='import numpy as np; x = np.arange(1000)',
                    number=10000)
print("Dumb numpy: {} seconds".format(dumb_numpy))

Dumb numpy: 3.61384701729 seconds
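The slow case uses Python's built-in sum, which walks the array one element at a time. The sketch below, written for Python 3 (it relies on the globals argument of timeit), first checks that all the approaches agree and then times the NumPy variants, with np.sum as a vectorized replacement for the built-in. Timings vary by machine, so none are quoted here.

```python
import numpy as np
from timeit import timeit

x = np.arange(1000)

loop_result = sum(v * v for v in range(1000))   # pure Python generator
dot_result = x.dot(x)                           # vectorized dot product
np_sum_result = np.sum(x * x)                   # vectorized sum
builtin_sum_result = sum(x * x)                 # built-in sum, element by element

# All four approaches compute the same sum of squares.
print(loop_result == dot_result == np_sum_result == builtin_sum_result)

for stmt in ('x.dot(x)', 'np.sum(x * x)', 'sum(x * x)'):
    seconds = timeit(stmt, globals={'np': np, 'x': x}, number=1000)
    print('{}: {:.4f} seconds'.format(stmt, seconds))
```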

In exchange for speed, we sacrifice some of the flexibility of Python lists. A NumPy array can only store a single data type, whereas a list can hold pretty much anything.

Keep in mind that sometimes a list could be better suited to your problem than a NumPy array.

In [77]:

a.dtype

Out[77]:

dtype('int64')

When we try to use different data types in the same array, NumPy will try to coerce them into a common type. For example, if we combine strings and integers, NumPy will convert the numeric values into strings.

In [78]:

np.array([1,'string'])

Out[78]:

array(['1', 'string'], 
      dtype='|S6')
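A few more coercion cases, as a sketch with illustrative names: integers mixed with floats become floats, and passing dtype=object keeps the original Python objects, at the cost of the fast vectorized code paths.

```python
import numpy as np

ints_and_floats = np.array([1, 2.5])
print(ints_and_floats.dtype)                  # float64

as_text = np.array([1, 'string'])             # the integer is stored as a string
print(as_text[0] == '1')                      # True

mixed = np.array([1, 'string'], dtype=object) # elements keep their Python types
kinds = [type(v).__name__ for v in mixed]
print(kinds)                                  # ['int', 'str']
```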
