We will use Python's BeautifulSoup library to scrape the descriptions of PyCon AU's 2014 videos from pyvideo.org.
The website doesn't have any listed Terms and Conditions or a robots.txt, so we should be good to go.
We won't use the site's RSS feed as that would be too easy!
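If we wanted to verify the robots.txt situation programmatically, the standard library can do it; here's a minimal sketch using urllib.robotparser, which treats a missing robots.txt as allowing everything:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://pyvideo.org/robots.txt')
rp.read()  # a 404 here is treated as "allow all"
print(rp.can_fetch('*', 'http://pyvideo.org/category/56/pycon-australia-2014'))  # True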
This tutorial is based on the blog post: http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python
From the website http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.
From the website http://docs.python-requests.org/en/latest/
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
from bs4 import BeautifulSoup
import requests
Read the pyvideo HTML into Beautiful Soup
pycon_au_2014_url = r'http://pyvideo.org/category/56/pycon-australia-2014'
pycon_au_2014_response = requests.get(pycon_au_2014_url)
pycon_au_2014_text = pycon_au_2014_response.text
soup = BeautifulSoup(pycon_au_2014_text, 'html.parser')  # name the parser explicitly to avoid a warning
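It's worth confirming the request actually succeeded before parsing; requests offers a one-line guard:
# Raises requests.HTTPError if the server returned a 4xx/5xx status.
pycon_au_2014_response.raise_for_status()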
Eyeballing the HTML, we see that the link to each video's description page is contained in a div with the CSS class video-summary.
Let's populate a list with these URLs.
pycon_au_2014_video_url_list = ['http://pyvideo.org'+video.select('div strong a')[0].get('href') for video in soup.select('.video-summary')]
# Notes:
# + is defined to concatenate strings in Python
# BeautifulSoup's select returns a list so we use [0] to access the first element
# get returns the attribute's value
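To see select() and get() in isolation, here's a toy example on a made-up HTML snippet that mirrors the structure above:
toy_soup = BeautifulSoup(
    '<div class="video-summary"><div><strong>'
    '<a href="/video/123/example-talk">Example talk</a>'
    '</strong></div></div>',
    'html.parser')
links = toy_soup.select('.video-summary div strong a')  # select() returns a list
print(links[0].get('href'))  # -> '/video/123/example-talk'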
Open each video description page, read the description and metadata into a Python dictionary, and combine all of these dictionaries into a list.
video_descriptions = []
for video_url in pycon_au_2014_video_url_list:
    video_response = requests.get(video_url)
    video_page_text = video_response.text
    video_soup = BeautifulSoup(video_page_text, 'html.parser')
    video_data = {}
    try:
        # The second .section div holds the talk's description paragraphs
        video_data['Description'] = '\n'.join([x.text for x in video_soup.select('.section')[1].select('p')])
    except IndexError:
        video_data['Description'] = None
    # The sidebar lists metadata as <dt> (category) / <dd> (value) pairs
    video_metadata_categories = [x.text for x in video_soup.select('#sidebar')[0].select('dt')]
    video_metadata_values = [x.text.strip() for x in video_soup.select('#sidebar')[0].select('dd')]
    for category, value in zip(video_metadata_categories, video_metadata_values):
        video_data[category] = value
    video_descriptions.append(video_data)
Now we check whether all the videos are hosted on YouTube.
all() returns True if every element in the collection is truthy.
all(['http://www.youtube.com/watch?v=' in x['Video origin'] for x in video_descriptions])
True
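A quick illustration of the idiom:
print(all(x > 0 for x in [1, 2, 3]))   # True
print(all(x > 0 for x in [1, -2, 3]))  # False -- short-circuits at -2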
We continue our scraping on YouTube, where the website's Terms of Service state:
You agree not to use or launch any automated system, including without limitation, "robots," "spiders," or "offline readers," that accesses the Service in a manner that sends more request messages to the YouTube servers in a given period of time than a human can reasonably produce in the same period by using a conventional on-line web browser. Notwithstanding the foregoing, YouTube grants the operators of public search engines permission to use spiders to copy materials from the site for the sole purpose of and solely to the extent necessary for creating publicly available searchable indices of the materials, but not caches or archives of such materials. YouTube reserves the right to revoke these exceptions either generally or in specific cases. You agree not to collect or harvest any personally identifiable information, including account names, from the Service, nor to use the communication systems provided by the Service (e.g., comments, email) for any commercial solicitation purposes. You agree not to solicit, for commercial purposes, any users of the Service with respect to their Content.
We add a reasonable 5-second sleep after each HTTP request.
import time
def get_youtube_data(video_data):
    youtube_response = requests.get(video_data['Video origin'])
    time.sleep(5)  # sleep straight after the request, so every request is rate-limited
    youtube_page_text = youtube_response.text
    youtube_soup = BeautifulSoup(youtube_page_text, 'html.parser')
    try:
        video_data['Youtube views'] = int(youtube_soup.select('.watch-view-count')[0].text.split(' ')[0].replace(',',''))
    except IndexError:
        if 'This video is unavailable.' in youtube_soup.select('.message')[0].text:
            video_data['Youtube views'] = None
            video_data['Youtube likes'] = None
            video_data['Youtube dislikes'] = None
        return video_data
    video_data['Youtube likes'] = int(youtube_soup.select('#watch-like .yt-uix-button-content')[0].text.replace(',',''))
    video_data['Youtube dislikes'] = int(youtube_soup.select('#watch-dislike .yt-uix-button-content')[0].text.replace(',',''))
    return video_data
video_descriptions = list(map(get_youtube_data, video_descriptions))
The above code demonstrates why not using available APIs can be a bad idea: the error handling is brittle and tied to the exact structure of YouTube's markup, which can change at any time.
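For contrast, here is a minimal sketch of what the API route could look like, using the YouTube Data API v3. The get_youtube_stats helper and YOUR_API_KEY placeholder are illustrative, not part of this tutorial's code:
def get_youtube_stats(video_id, api_key):
    # Hypothetical helper: one documented endpoint instead of brittle CSS selectors.
    response = requests.get(
        'https://www.googleapis.com/youtube/v3/videos',
        params={'part': 'statistics', 'id': video_id, 'key': api_key})
    items = response.json().get('items', [])
    if not items:  # unavailable videos return an empty items list
        return None
    return items[0]['statistics']  # e.g. {'viewCount': '117', 'likeCount': '1', ...}

# stats = get_youtube_stats('_VO8QxIkjqY', 'YOUR_API_KEY')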
We now save the file as JSON (commented out as it has already been saved!)
#import json
#with open(r'./Assets/pycon_au_2014_video_data.json', 'w') as outfile:
# json.dump(video_descriptions, outfile)
#Code to test that loading works
#with open(r'./Assets/pycon_au_2014_video_data.json', 'r') as infile:
# video_descriptions = json.load(infile)
We will now use pandas to properly clean the data
From the website http://pandas.pydata.org/
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
From the website http://www.numpy.org/
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Generally you use pandas as much as possible, since working directly with NumPy is harder for this kind of labelled, messy data (the short example after the imports below illustrates the difference).
import pandas as pd
import numpy as np
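A quick illustration of that difference:
# pandas skips missing values by default; raw NumPy propagates them.
s = pd.Series([1.0, 2.0, np.nan])
print(s.mean())                             # 1.5
print(np.array([1.0, 2.0, np.nan]).mean())  # nan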
Reading data into pandas is extremely easy
df = pd.read_json(r'./Assets/pycon_au_2014_video_data.json')
print(len(df))
df.head(n=3)
51
| | Category | Copyright/License Information | Description | Download | Language | Last updated | Metadata | Recorded | Speakers | Tags | Video origin | Youtube dislikes | Youtube likes | Youtube views |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PyCon AU 2014 | http://www.youtube.com/t/terms | "On the internet, fraudulent and abusive behav... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | Rhys Elsmore | No tags | http://www.youtube.com/watch?v=_VO8QxIkjqY | 0 | 1 | 117 |
| 1 | PyCon AU 2014 | http://www.youtube.com/t/terms | Web APIs are how much of the modern web speaks... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | HawkOwl | No tags | http://www.youtube.com/watch?v=pXa4SV3E5JY | 0 | 1 | 229 |
| 2 | PyCon AU 2014 | http://creativecommons.org/licenses/by/3.0/ | The question: How do I make my website fast?\n... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | Tom Eastman | No tags | http://www.youtube.com/watch?v=bIWnQ3F1eLA | 0 | 2 | 198 |
df.columns
Index(['Category', 'Copyright/License Information', 'Description', 'Download', 'Language', 'Last updated', 'Metadata', 'Recorded', 'Speakers', 'Tags', 'Video origin', 'Youtube dislikes', 'Youtube likes', 'Youtube views'], dtype='object')
We now clean our columns
df.groupby('Category').size()
Category
PyCon AU 2014    51
dtype: int64
df.groupby('Copyright/License Information').size()
Copyright/License Information
http://creativecommons.org/licenses/by/3.0/    17
http://www.youtube.com/t/terms                 34
dtype: int64
Blank descriptions use None instead of np.NaN; let's make missing values consistently NaN across file types.
print(df['Description'][0:5])
df['Description'].fillna(np.NaN,inplace=True)
print(df['Description'][0:5])
0 "On the internet, fraudulent and abusive behav... 1 Web APIs are how much of the modern web speaks... 2 The question: How do I make my website fast?\n... 3 None 4 This is a talk about how OpenStack does databa... Name: Description, dtype: object 0 "On the internet, fraudulent and abusive behav... 1 Web APIs are how much of the modern web speaks... 2 The question: How do I make my website fast?\n... 3 NaN 4 This is a talk about how OpenStack does databa... Name: Description, dtype: object
df.groupby('Download').size()
Download
No downloadable files.     1
ogg                       50
dtype: int64
df.replace('No downloadable files.',np.NaN,inplace=True)
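Note that replace on the whole DataFrame rewrites that string wherever it appears; a more cautious version would scope it to the one column we inspected:
# Column-scoped alternative to the whole-frame replace above.
df['Download'] = df['Download'].replace('No downloadable files.', np.NaN)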
df.groupby('Language').size()
Language
English    51
dtype: int64
The 'Last updated' column holds strings instead of dates. Rather than writing a format string, we simply use dateutil's parse function.
print(type(df['Last updated'][0]))
print(df.groupby('Last updated').size())
from dateutil.parser import parse
df['Last updated'] = list(map(parse,df['Last updated']))
print(df.groupby('Last updated').size())
<class 'str'>
Last updated
Aug. 14, 2014    51
dtype: int64
Last updated
2014-08-14    51
dtype: int64
df.groupby('Metadata').size()
Metadata
JSON    51
dtype: int64
print(df.groupby('Recorded').size())
df['Recorded'] = list(map(parse,df['Recorded']))
print(df.groupby('Recorded').size())
Recorded
Aug. 11, 2014    25
Aug. 5, 2014      2
Aug. 7, 2014      9
Aug. 9, 2014     15
dtype: int64
Recorded
2014-08-05     2
2014-08-07     9
2014-08-09    15
2014-08-11    25
dtype: int64
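An equivalent, more pandas-idiomatic approach is pd.to_datetime, which falls back to dateutil for formats like these:
# Alternative to mapping dateutil.parse over each column by hand.
df['Last updated'] = pd.to_datetime(df['Last updated'])
df['Recorded'] = pd.to_datetime(df['Recorded'])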
There was one talk with two speakers, separated by newlines (probably list tags in the original HTML). Let's change this to 'and' instead.
df['Speakers'].replace('Alexey Kotlyarov\n\nDanielle Madeley', 'Alexey Kotlyarov and Danielle Madeley', inplace=True)
Believe it or not, none of the talks have tags! Let's replace 'No tags' with np.NaN.
df['Tags'].replace('No tags', np.NaN, inplace=True)
The Video origin column and all the Youtube columns are fine for our purposes, even though one YouTube video was not available; pandas changed all the None values to np.NaN for us.
Let's change all the columns to snake_case. Someone wrote a robust-enough function for this on Stack Overflow:
import re
def snake_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
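A quick check shows what the two substitutions handle, and what they don't:
print(snake_case('YoutubeViews'))  # -> 'youtube_views'
print(snake_case('Video origin'))  # -> 'video origin' (spaces survive, hence the fix below)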
df.columns = list(map(snake_case,df.columns))
print(df.columns)
Index(['category', 'copyright/_license _information', 'description', 'download', 'language', 'last updated', 'metadata', 'recorded', 'speakers', 'tags', 'video origin', 'youtube dislikes', 'youtube likes', 'youtube views'], dtype='object')
It didn't fully work for some columns (the spaces are still there), so we use a list comprehension to fix the rest.
df.columns = [x.replace(' ','_') for x in df.columns]
print(df.columns)
Index(['category', 'copyright/_license__information', 'description', 'download', 'language', 'last_updated', 'metadata', 'recorded', 'speakers', 'tags', 'video_origin', 'youtube_dislikes', 'youtube_likes', 'youtube_views'], dtype='object')
And finally we fix the copyright column.
df.rename(columns={'copyright/_license__information': 'copyright_or_license__information'},inplace=True)
print(df.columns)
Index(['category', 'copyright_or_license__information', 'description', 'download', 'language', 'last_updated', 'metadata', 'recorded', 'speakers', 'tags', 'video_origin', 'youtube_dislikes', 'youtube_likes', 'youtube_views'], dtype='object')
Let's create a new column with net likes (likes minus dislikes).
df['youtube_net_likes'] = df['youtube_likes'] - df['youtube_dislikes']
Finally we save the file
df.to_csv('./Assets/cleaned_pycon_au_2014_video_data.csv')
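To sanity-check the round trip, we could read the file straight back; CSV stores everything as text, so parse_dates rebuilds the date columns (a quick check, assuming the file was just written above):
df_check = pd.read_csv('./Assets/cleaned_pycon_au_2014_video_data.csv',
                       index_col=0, parse_dates=['last_updated', 'recorded'])
print(len(df_check))  # should print 51 again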
Creating some plots
The first line makes plots display inline.
The second line changes the plot style to a ggplot-esque style.
%matplotlib inline
pd.options.display.mpl_style = 'default'
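An equivalent in newer matplotlib versions, for readers following along today, is matplotlib's own ggplot style sheet:
# Same ggplot-esque look via matplotlib's built-in style sheets.
import matplotlib.pyplot as plt
plt.style.use('ggplot')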
df.hist('youtube_net_likes',bins=10);
df.hist('youtube_views', bins=10);
matplotlib is the standard plotting library in Python. There are many other data visualisation libraries, like seaborn and vincent, which I won't cover today.
import matplotlib.pyplot as plt
df['youtube_views'].fillna(0).plot(kind='kde');
df.plot(kind='scatter', x='youtube_views', y='youtube_net_likes');
Interactive* plotting with Bokeh
from bokeh.plotting import scatter, figure, show, output_notebook
output_notebook()
scatter(x=df['youtube_views'], y=df['youtube_net_likes'], tools="pan, wheel_zoom, resize, select, save")
show()