This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.
Visit my webpage for more: http://rmdk.ca
Email me: ryan@rmdk.ca
import numpy as np
np.version.full_version
'1.8.0'
a = np.array([0,1,2,3,4, 5])
a
array([0, 1, 2, 3, 4, 5])
Take a look at the dimensions and shape of the data.
print a.ndim
print a.shape
1 (6,)
Transform array into 2D matrix.
b = a.reshape((3,2)) # rows, columns
b
array([[0, 1], [2, 3], [4, 5]])
However, reshape did not copy anything here: b is a view onto the same data as a, so any change to either one is reflected in both.
b[0][0] = 100
b
array([[100, 1], [ 2, 3], [ 4, 5]])
a
array([100, 1, 2, 3, 4, 5])
Whenever you need a true copy, use .copy().
c = a.reshape((3,2)).copy()
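To check whether an array is a view or a true copy, one option is to look at its .base attribute. The snippet below is a minimal sketch; the names base_arr, view and copy_arr are my own for illustration, and the print() calls are written to run under Python 2 or 3.

```python
import numpy as np

base_arr = np.arange(6)            # owns its data
view = base_arr.reshape((3, 2))    # a view: shares base_arr's data
copy_arr = view.copy()             # a true copy: owns its own data

print(view.base is base_arr)       # True
print(copy_arr.base is None)       # True

view[0, 0] = 100                   # writes through to base_arr
print(base_arr[0])                 # 100
print(copy_arr[0, 0])              # 0, the copy is unaffected
```

If .base is None the array owns its data, so writing to it cannot change any other array.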
Another feature of NumPy arrays is that operations are propagated to the individual elements, in contrast to normal Python lists, where multiplying by 2 repeats the list.
print a * 2
print [1,2,3,4] * 2
[200 2 4 6 8 10] [1, 2, 3, 4, 1, 2, 3, 4]
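To get the same elementwise result from a plain list you need an explicit loop or comprehension. Here is a small side-by-side sketch; the names values and numbers are made up for the example.

```python
import numpy as np

values = np.array([1, 2, 3, 4])    # hypothetical sample data
numbers = [1, 2, 3, 4]

doubled_array = values * 2                 # elementwise: 2, 4, 6, 8
doubled_list = [n * 2 for n in numbers]    # a list needs a comprehension
squared = values ** 2                      # elementwise: 1, 4, 9, 16
summed = values + values                   # elementwise add, not concatenation

print(doubled_array.tolist())      # [2, 4, 6, 8]
print(doubled_list)                # [2, 4, 6, 8]
print(squared.tolist())            # [1, 4, 9, 16]
print(summed.tolist())             # [2, 4, 6, 8]
```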
Arrays can be accessed in several ways.
# Index a with a vector 2,3,4
a[np.array([2,3,4])]
array([2, 3, 4])
Since conditions are propagated to the individual elements, we can access our data in interesting ways.
# Return boolean mask
print a > 4
mask = a > 4
print a[a>4] == a[mask]
# Return the masked data or everything but the mask
print a[mask]
print a[~mask]
[ True False False False False True] [ True True] [100 5] [1 2 3 4]
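Masks can also be combined. The sketch below uses & (and), | (or) and ~ (not) on boolean arrays; the parentheses matter because these operators bind more tightly than comparisons. The array data and the names in_range and extremes are illustrative.

```python
import numpy as np

data = np.array([100, 1, 2, 3, 4, 5])

# Keep values between 2 and 4 inclusive
in_range = (data >= 2) & (data <= 4)
print(data[in_range].tolist())     # [2, 3, 4]

# Everything outside that range, via negation
print(data[~in_range].tolist())    # [100, 1, 5]

# The same selection written with "or"
extremes = data[(data < 2) | (data > 4)]
print(extremes.tolist())           # [100, 1, 5]
```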
You could also do things like trim outliers.
a[a>3] = 3
a
array([3, 1, 2, 3, 3, 3])
It turns out that this is pretty popular, so there is a predefined function for it.
.clip() takes two arguments, the lower and upper bounds, and clips the values at both ends of the interval.
a.clip(1,3)
array([3, 1, 2, 3, 3, 3])
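Because a was already capped at 3 and has nothing below 1, the call above leaves it unchanged. A fresh array makes both bounds visible; the name readings is just for this sketch.

```python
import numpy as np

readings = np.array([0, 1, 2, 3, 4, 5])

# 0 is raised to the lower bound 1; 4 and 5 are lowered to the upper bound 3
clipped = readings.clip(1, 3)

print(clipped.tolist())            # [1, 1, 2, 3, 3, 3]
print(readings.tolist())           # [0, 1, 2, 3, 4, 5], clip returns a new array
```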
One of the most common things we run into as data scientists is missing data. How we deal with that missing data is integral to the outcome and robustness of the analysis. NumPy can use the special floating-point value NaN (available as np.nan) to denote missing values.
# Pretend to be read from text file
c = np.array([1,2, np.nan, 3, 5])
print c
print np.isnan(c)
[ 1. 2. nan 3. 5.] [False False True False False]
c[~np.isnan(c)] # Non-missing data
array([ 1., 2., 3., 5.])
Filtering like this becomes a required tool even for the first stages of exploratory analysis, because a single NaN propagates through aggregates such as the mean.
print np.mean(c)
print np.mean(c[~np.isnan(c)])
nan 2.75
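NumPy also ships NaN-aware aggregates that skip missing values for you. A minimal sketch using np.nanmean, which I believe arrived in NumPy 1.8; the name c_demo is mine, chosen so it does not overwrite c.

```python
import numpy as np

c_demo = np.array([1, 2, np.nan, 3, 5])

manual = np.mean(c_demo[~np.isnan(c_demo)])   # filter first, then average
builtin = np.nanmean(c_demo)                  # ignores NaN entries directly

print(manual)                      # 2.75
print(builtin)                     # 2.75
```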
We are using NumPy for a reason, right? Here we will simply calculate the sum of squares of the integers from 0 to 999 and report how long 10 000 runs take.
from timeit import timeit
normal_python = timeit('sum(x*x for x in xrange(1000))',
number = 10000)
# The dot product of a vector with itself is equivalent to the sum of its squares
Numpy_python = timeit('x.dot(x)', setup='import numpy as np;\
x=np.arange(1000)' ,number=10000)
print("Normal Python: {} seconds".format(normal_python))
print("Numpy: {} seconds".format(Numpy_python))
Normal Python: 0.676285982132 seconds Numpy: 0.0202009677887 seconds
However, we have to be careful, because simply using NumPy does not guarantee efficiency.
If we don't take advantage of the optimized NumPy code, we are not going to get anywhere. We should always look for the optimized, or vectorized, versions, which allow us to operate on the entire matrix or array at once, rather than looping.
dumb_numpy = timeit ('sum(x*x)', setup='import numpy as np;\
x=np.arange(1000)', number=10000)
print("Dumb numpy: {} seconds".format(dumb_numpy))
Dumb numpy: 3.61384701729 seconds
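The slow version is slow because Python's built-in sum walks the array one element at a time, while the array's own .sum() method and .dot() do the loop in compiled code. The sketch below only checks that all the variants agree on the answer; it does not time them, and the variable names are my own.

```python
import numpy as np

x = np.arange(1000)

pure_python = sum(i * i for i in range(1000))  # plain Python loop
dot_product = x.dot(x)                         # vectorized dot product
vectorized_sum = (x * x).sum()                 # vectorized multiply, then NumPy's sum
slow_sum = sum(x * x)                          # built-in sum iterating over the array

print(pure_python)                 # 332833500
print(int(dot_product) == pure_python)      # True
print(int(vectorized_sum) == pure_python)   # True
print(int(slow_sum) == pure_python)         # True
```

Swapping sum(x*x) for (x*x).sum() or x.dot(x) keeps the result and moves the loop out of the interpreter.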
For the sake of speed, we sacrifice some of the flexibility of Python lists. In NumPy we can only store a single datatype in an array, whereas a list can hold pretty much anything.
Keep in mind that sometimes a list could be better suited to your problem than a NumPy array.
a.dtype
dtype('int64')
When we try to use different data types in the same array, NumPy will try to coerce them into a common format. For example, if we combine strings and integers, NumPy will convert the numeric values into strings.
np.array([1,'string'])
array(['1', 'string'], dtype='|S6')
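The same coercion happens between numeric types, and it can be sidestepped by asking for dtype=object, at the cost of the speed advantages discussed above. A small sketch; mixed_numbers and as_objects are illustrative names.

```python
import numpy as np

# An int and a float are coerced to a common float type
mixed_numbers = np.array([1, 2.5])
print(mixed_numbers.dtype)                 # float64

# dtype=object stores the original Python objects, so nothing is converted
as_objects = np.array([1, 'string'], dtype=object)
print(type(as_objects[0]).__name__)        # int
print(type(as_objects[1]).__name__)        # str
```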