We will use Python's BeautifulSoup library to scrape the descriptions of PyCon AU's 2014 videos from pyvideo.org.
The website doesn't have any listed Terms and Conditions or a robots.txt, so we should be good to go.
We won't use the site's RSS feed as that would be too easy!
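If we wanted to verify the robots.txt situation programmatically, the standard library can do it; here's a minimal sketch using urllib.robotparser, which treats a missing robots.txt as allowing everything:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://pyvideo.org/robots.txt')
rp.read()  # a 404 here is treated as "allow all"
print(rp.can_fetch('*', 'http://pyvideo.org/category/56/pycon-australia-2014'))  # True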
This tutorial is based on the blog post: http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python
From the website http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.
From the website http://docs.python-requests.org/en/latest/
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
from bs4 import BeautifulSoup
import requests
Read the pyvideo HTML into Beautiful Soup
pycon_au_2014_url = r'http://pyvideo.org/category/56/pycon-australia-2014'
pycon_au_2014_response = requests.get(pycon_au_2014_url)
pycon_au_2014_text = pycon_au_2014_response.text
soup = BeautifulSoup(pycon_au_2014_text, 'html.parser')  # name the parser explicitly to avoid a warning
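It's worth confirming the request actually succeeded before parsing; requests offers a one-line guard:
# Raises requests.HTTPError if the server returned a 4xx/5xx status.
pycon_au_2014_response.raise_for_status()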
Eyeballing the HTML, we see that the link to each video's description page is contained in a div with the CSS class video-summary.
Let's populate a list with these URLs.
pycon_au_2014_video_url_list = ['http://pyvideo.org'+video.select('div strong a')[0].get('href') for video in soup.select('.video-summary')]
# Notes:
# + is defined to concatenate strings in Python
# BeautifulSoup's select returns a list so we use [0] to access the first element
# get returns the attribute's value
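To see select() and get() in isolation, here's a toy example on a made-up HTML snippet that mirrors the structure above:
toy_soup = BeautifulSoup(
    '<div class="video-summary"><div><strong>'
    '<a href="/video/123/example-talk">Example talk</a>'
    '</strong></div></div>',
    'html.parser')
links = toy_soup.select('.video-summary div strong a')  # select() returns a list
print(links[0].get('href'))  # -> '/video/123/example-talk'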
Open each video description page, read the description and metadata into a Python dictionary, and combine all of these dictionaries into a list.
video_descriptions = []
for video_url in pycon_au_2014_video_url_list:
    video_response = requests.get(video_url)
    video_page_text = video_response.text
    video_soup = BeautifulSoup(video_page_text, 'html.parser')
    video_data = {}
    try:
        # The second .section div holds the talk's description paragraphs
        video_data['Description'] = '\n'.join([x.text for x in video_soup.select('.section')[1].select('p')])
    except IndexError:
        video_data['Description'] = None
    # The sidebar lists metadata as <dt> (category) / <dd> (value) pairs
    video_metadata_categories = [x.text for x in video_soup.select('#sidebar')[0].select('dt')]
    video_metadata_values = [x.text.strip() for x in video_soup.select('#sidebar')[0].select('dd')]
    for category, value in zip(video_metadata_categories, video_metadata_values):
        video_data[category] = value
    video_descriptions.append(video_data)
Now we check whether all the videos are hosted on YouTube.
all() returns True if every element in the collection is truthy.
all(['http://www.youtube.com/watch?v=' in x['Video origin'] for x in video_descriptions])
True
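A quick illustration of the idiom:
print(all(x > 0 for x in [1, 2, 3]))   # True
print(all(x > 0 for x in [1, -2, 3]))  # False -- short-circuits at -2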
We continue our scraping on YouTube, where the website's Terms of Service state:
You agree not to use or launch any automated system, including without limitation, "robots," "spiders," or "offline readers," that accesses the Service in a manner that sends more request messages to the YouTube servers in a given period of time than a human can reasonably produce in the same period by using a conventional on-line web browser. Notwithstanding the foregoing, YouTube grants the operators of public search engines permission to use spiders to copy materials from the site for the sole purpose of and solely to the extent necessary for creating publicly available searchable indices of the materials, but not caches or archives of such materials. YouTube reserves the right to revoke these exceptions either generally or in specific cases. You agree not to collect or harvest any personally identifiable information, including account names, from the Service, nor to use the communication systems provided by the Service (e.g., comments, email) for any commercial solicitation purposes. You agree not to solicit, for commercial purposes, any users of the Service with respect to their Content.
We add a reasonable 5-second sleep after each HTTP request.
import time
def get_youtube_data(video_data):
    youtube_response = requests.get(video_data['Video origin'])
    time.sleep(5)  # sleep straight after the request, so every request is rate-limited
    youtube_page_text = youtube_response.text
    youtube_soup = BeautifulSoup(youtube_page_text, 'html.parser')
    try:
        video_data['Youtube views'] = int(youtube_soup.select('.watch-view-count')[0].text.split(' ')[0].replace(',',''))
    except IndexError:
        if 'This video is unavailable.' in youtube_soup.select('.message')[0].text:
            video_data['Youtube views'] = None
            video_data['Youtube likes'] = None
            video_data['Youtube dislikes'] = None
        return video_data
    video_data['Youtube likes'] = int(youtube_soup.select('#watch-like .yt-uix-button-content')[0].text.replace(',',''))
    video_data['Youtube dislikes'] = int(youtube_soup.select('#watch-dislike .yt-uix-button-content')[0].text.replace(',',''))
    return video_data
video_descriptions = list(map(get_youtube_data, video_descriptions))
The above code demonstrates why not using available APIs can be a bad idea: the error handling is brittle and tied to the exact structure of YouTube's markup, which can change at any time.
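For contrast, here is a minimal sketch of what the API route could look like, using the YouTube Data API v3. The get_youtube_stats helper and YOUR_API_KEY placeholder are illustrative, not part of this tutorial's code:
def get_youtube_stats(video_id, api_key):
    # Hypothetical helper: one documented endpoint instead of brittle CSS selectors.
    response = requests.get(
        'https://www.googleapis.com/youtube/v3/videos',
        params={'part': 'statistics', 'id': video_id, 'key': api_key})
    items = response.json().get('items', [])
    if not items:  # unavailable videos return an empty items list
        return None
    return items[0]['statistics']  # e.g. {'viewCount': '117', 'likeCount': '1', ...}

# stats = get_youtube_stats('_VO8QxIkjqY', 'YOUR_API_KEY')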
We now save the file as JSON (commented out as it has already been saved!)
#import json
#with open(r'./Assets/pycon_au_2014_video_data.json', 'w') as outfile:
# json.dump(video_descriptions, outfile)
#Code to test that loading works
#with open(r'./Assets/pycon_au_2014_video_data.json', 'r') as infile:
# video_descriptions = json.load(infile)
We will now use pandas to properly clean the data
From the website http://pandas.pydata.org/
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
From the website http://www.numpy.org/
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Generally you use pandas as much as possible, since working directly with NumPy is harder for this kind of labelled, messy data (the short example after the imports below illustrates the difference).
import pandas as pd
import numpy as np
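A quick illustration of that difference:
# pandas skips missing values by default; raw NumPy propagates them.
s = pd.Series([1.0, 2.0, np.nan])
print(s.mean())                             # 1.5
print(np.array([1.0, 2.0, np.nan]).mean())  # nan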
Reading data into pandas is extremely easy
df = pd.read_json(r'./Assets/pycon_au_2014_video_data.json')
print(len(df))
df.head(n=3)
51
| | Category | Copyright/License Information | Description | Download | Language | Last updated | Metadata | Recorded | Speakers | Tags | Video origin | Youtube dislikes | Youtube likes | Youtube views |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PyCon AU 2014 | http://www.youtube.com/t/terms | "On the internet, fraudulent and abusive behav... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | Rhys Elsmore | No tags | http://www.youtube.com/watch?v=_VO8QxIkjqY | 0 | 1 | 117 |
| 1 | PyCon AU 2014 | http://www.youtube.com/t/terms | Web APIs are how much of the modern web speaks... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | HawkOwl | No tags | http://www.youtube.com/watch?v=pXa4SV3E5JY | 0 | 1 | 229 |
| 2 | PyCon AU 2014 | http://creativecommons.org/licenses/by/3.0/ | The question: How do I make my website fast?\n... | ogg | English | Aug. 14, 2014 | JSON | Aug. 11, 2014 | Tom Eastman | No tags | http://www.youtube.com/watch?v=bIWnQ3F1eLA | 0 | 2 | 198 |
df.columns
Index(['Category', 'Copyright/License Information', 'Description', 'Download', 'Language', 'Last updated', 'Metadata', 'Recorded', 'Speakers', 'Tags', 'Video origin', 'Youtube dislikes', 'Youtube likes', 'Youtube views'], dtype='object')
We now clean our columns
df.groupby('Category').size()
Category
PyCon AU 2014    51
dtype: int64
df.groupby('Copyright/License Information').size()
Copyright/License Information
http://creativecommons.org/licenses/by/3.0/    17
http://www.youtube.com/t/terms                 34
dtype: int64
Blank descriptions use None instead of np.NaN; let's make missing values consistently NaN across file types.
print(df['Description'][0:5])
df['Description'].fillna(np.NaN,inplace=True)
print(df['Description'][0:5])
0 "On the internet, fraudulent and abusive behav... 1 Web APIs are how much of the modern web speaks... 2 The question: How do I make my website fast?\n... 3 None 4 This is a talk about how OpenStack does databa... Name: Description, dtype: object 0 "On the internet, fraudulent and abusive behav... 1 Web APIs are how much of the modern web speaks... 2 The question: How do I make my website fast?\n... 3 NaN 4 This is a talk about how OpenStack does databa... Name: Description, dtype: object
df.groupby('Download').size()
Download
No downloadable files.     1
ogg                       50
dtype: int64
df.replace('No downloadable files.',np.NaN,inplace=True)
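Note that replace on the whole DataFrame rewrites that string wherever it appears; a more cautious version would scope it to the one column we inspected:
# Column-scoped alternative to the whole-frame replace above.
df['Download'] = df['Download'].replace('No downloadable files.', np.NaN)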
df.groupby('Language').size()
Language
English    51
dtype: int64
The 'Last updated' column holds strings instead of dates. Rather than writing a format string, we simply use dateutil's parse function.
print(type(df['Last updated'][0]))
print(df.groupby('Last updated').size())
from dateutil.parser import parse
df['Last updated'] = list(map(parse,df['Last updated']))
print(df.groupby('Last updated').size())
<class 'str'>
Last updated
Aug. 14, 2014    51
dtype: int64
Last updated
2014-08-14    51
dtype: int64
df.groupby('Metadata').size()
Metadata
JSON    51
dtype: int64
print(df.groupby('Recorded').size())
df['Recorded'] = list(map(parse,df['Recorded']))
print(df.groupby('Recorded').size())
Recorded
Aug. 11, 2014    25
Aug. 5, 2014      2
Aug. 7, 2014      9
Aug. 9, 2014     15
dtype: int64
Recorded
2014-08-05     2
2014-08-07     9
2014-08-09    15
2014-08-11    25
dtype: int64
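An equivalent, more pandas-idiomatic approach is pd.to_datetime, which falls back to dateutil for formats like these:
# Alternative to mapping dateutil.parse over each column by hand.
df['Last updated'] = pd.to_datetime(df['Last updated'])
df['Recorded'] = pd.to_datetime(df['Recorded'])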
There was one talk with two speakers, separated by newlines (probably list tags in the original HTML). Let's change this to 'and' instead.
df['Speakers'].replace('Alexey Kotlyarov\n\nDanielle Madeley', 'Alexey Kotlyarov and Danielle Madeley', inplace=True)
Believe it or not, none of the talks have tags! Let's replace 'No tags' with np.NaN.
df['Tags'].replace('No tags', np.NaN, inplace=True)
The Video origin column and all the Youtube columns are fine for our purposes, even though one YouTube video was not available; pandas changed all the None values to np.NaN for us.
Let's change all the columns to snake_case. Someone wrote a robust-enough function for this on Stack Overflow:
import re
def snake_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
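A quick check shows what the two substitutions handle, and what they don't:
print(snake_case('YoutubeViews'))  # -> 'youtube_views'
print(snake_case('Video origin'))  # -> 'video origin' (spaces survive, hence the fix below)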
df.columns = list(map(snake_case,df.columns))
print(df.columns)
Index(['category', 'copyright/_license _information', 'description', 'download', 'language', 'last updated', 'metadata', 'recorded', 'speakers', 'tags', 'video origin', 'youtube dislikes', 'youtube likes', 'youtube views'], dtype='object')
It didn't fully work for some columns (the spaces are still there), so we use a list comprehension to fix the rest.
df.columns = [x.replace(' ','_') for x in df.columns]
print(df.columns)
Index(['category', 'copyright/_license__information', 'description', 'download', 'language', 'last_updated', 'metadata', 'recorded', 'speakers', 'tags', 'video_origin', 'youtube_dislikes', 'youtube_likes', 'youtube_views'], dtype='object')
And finally we fix the copyright column.
df.rename(columns={'copyright/_license__information': 'copyright_or_license__information'},inplace=True)
print(df.columns)
Index(['category', 'copyright_or_license__information', 'description', 'download', 'language', 'last_updated', 'metadata', 'recorded', 'speakers', 'tags', 'video_origin', 'youtube_dislikes', 'youtube_likes', 'youtube_views'], dtype='object')
Let's create a new column with net likes (likes minus dislikes).
df['youtube_net_likes'] = df['youtube_likes'] - df['youtube_dislikes']
Finally we save the file
df.to_csv('./Assets/cleaned_pycon_au_2014_video_data.csv')
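To sanity-check the round trip, we could read the file straight back; CSV stores everything as text, so parse_dates rebuilds the date columns (a quick check, assuming the file was just written above):
df_check = pd.read_csv('./Assets/cleaned_pycon_au_2014_video_data.csv',
                       index_col=0, parse_dates=['last_updated', 'recorded'])
print(len(df_check))  # should print 51 again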
Creating some plots
The first line makes plots display inline.
The second line changes the plot style to a ggplot-esque style.
%matplotlib inline
pd.options.display.mpl_style = 'default'
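An equivalent in newer matplotlib versions, for readers following along today, is matplotlib's own ggplot style sheet:
# Same ggplot-esque look via matplotlib's built-in style sheets.
import matplotlib.pyplot as plt
plt.style.use('ggplot')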
df.hist('youtube_net_likes',bins=10);
df.hist('youtube_views', bins=10);
matplotlib is the standard plotting library in Python. There are many other data visualisation libraries, like seaborn and vincent, which I won't cover today.
import matplotlib.pyplot as plt
df['youtube_views'].fillna(0).plot(kind='kde');
df.plot(kind='scatter', x='youtube_views', y='youtube_net_likes');
Interactive* plotting with Bokeh
from bokeh.plotting import scatter, figure, show, output_notebook
output_notebook()
scatter(x=df['youtube_views'], y=df['youtube_net_likes'], tools="pan, wheel_zoom, resize, select, save")
show()