TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, and Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.
Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing their spread with that of TED Talk videos.
The questions we are trying to answer:
Since TEDx does not provide an API for retrieving video data, we write our own scraper to crawl various attributes of the TEDx videos. And since all TEDx videos are hosted on YouTube, we also use the YouTube API to retrieve more interesting information about the videos.
TEDx Website
YouTube API
First, we try to get all the type portal links from the TEDx home page.
We use Beautiful Soup to find all the links on the TEDx home page that begin with certain prefixes; the sample below looks for the talks-by-language prefix.
import requests
from bs4 import BeautifulSoup
TEDX_HOME_URL = "http://tedxtalks.ted.com"
LANG_URL = "/browse/talks-by-language/"
s = requests.get(TEDX_HOME_URL)
soup = BeautifulSoup(s.content)
total = 0
link_tags = soup.find_all('a', href=True)
for link_tag in link_tags:
    link = link_tag['href']
    lang = link_tag.next_element.next_element.next_element
    if link.startswith(LANG_URL):
        print("Language %s: %s" % (lang, link))
        total += 1
print("Total: %d" % (total))
Then we go to each type portal link to collect video links; the portal type becomes an attribute of the video.
We go through page 1, page 2, and so on, until there are no more pages for that type. For example, we go to 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1', then 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2', and stop at 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3' to get all 28 videos in Icelandic.
On 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1' we can find the video link '/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson' by using Beautiful Soup and a regular expression. For example, the second video on 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1' looks like this:
<a id="mvp_grid_panel_img_1" href="/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson" class="mvp_thumbnail_magnified" style="position: relative; width: 293px; height: 220px; background-image: url('http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg'); filter: progid:DXImageTransform.Microsoft.AlphaImageLoader( src='http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg', sizingMethod='scale');" title="TEDxReykjavik - Eythor Edvardsson -"></a>
import re
VIDEO_LINK_PREFIX = "mvp_grid_panel_img_"
MSG_CLASS = "mvp_padded_message"
EMPTY_PAGE_MSG = "This page is empty."
portal_url = "http://tedxtalks.ted.com/browse/talks-by-language/icelandic"
page = 1
while True:
    # EX: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1
    url = portal_url + "?page=" + str(page)
    print("Reading URL: " + url)
    s = requests.get(url)
    soup = BeautifulSoup(s.content)
    # if there is no next page:
    # <div class="mvp_padded_message">This page is empty.</div>
    msg_tag = soup.find('div', {'class': MSG_CLASS})
    if msg_tag and msg_tag.get_text() == EMPTY_PAGE_MSG:
        print("empty page.")
        break
    link_tags = soup.find_all('a', id=re.compile(VIDEO_LINK_PREFIX), href=True)
    for link_tag in link_tags:
        link = link_tag['href']
        # EX: from /list/search%3Atag%3A%22chinese%22/video/The-tragedy-of-Hong-Kong-Archiv
        #     to   /video/The-tragedy-of-Hong-Kong-Archiv
        pos = link.find("/video")
        link = link[pos:]
        print("video link: %s (attr Language: Icelandic)" % (link))
    page += 1
Since all TEDx videos are hosted on YouTube, and we also use the YouTube API to get other interesting information about the videos, we use the YouTube ID as the key that represents a video.
Because Beautiful Soup does not parse through the <embed> tag, we use a regular expression to extract the YouTube ID instead.
For example, the video page 'http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson' contains the following <embed> tag, from which we extract its YouTube ID 'bzF4GPguPL8':
<embed type="application/x-shockwave-flash" src="http://www.youtube.com/v/bzF4GPguPL8&rel=0&fs=1&showsearch=0&enablejsapi=1&modestbranding=1&autoplay=1&playerapiid=mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" width="634" height="382" style="undefined" id="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" name="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" quality="high" allowfullscreen="true" allowscriptaccess="always" wmode="opaque" loop="false">
VIDEO_ID_RE = b"""
<embed.*\ src=\\\\\".*/v/(.*?)\\\\\".*>.*</embed>
"""
url = "http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson"
s = requests.get(url)
html = s.content
video_ids = re.findall(VIDEO_ID_RE, html, re.IGNORECASE|re.VERBOSE)
for video_id in video_ids:
    # NOTE: byte string => need decode
    print("YouTube ID: %s (%s)" % (video_id.decode('utf-8'), url))
#https://developers.google.com/youtube/articles/view_youtube_jsonc_responses
#https://developers.google.com/youtube/2.0/developers_guide_jsonc
import requests
import json
import time
user_id = "tedxtalks"
page = 1
maxcount = 25
count = 0
start_index = 0
# Obtaining Total page number
s = requests.get("https://gdata.youtube.com/feeds/api/users/"+user_id+"/uploads?v=2&alt=jsonc&start-index=1&max-result=1")
data = [json.loads(row) for row in s.content.split("\n") if row]
totalcount = data[0]['data']['totalItems']
pagenumber = totalcount/maxcount +1
key = ['id', 'uploaded', 'category', 'title', 'tags', 'thumbnail', 'duration', 'likeCount', 'rating', 'ratingCount', 'viewCount', 'favoriteCount', 'commentCount']
tedx ={'id':'',
'data':{ 'uploaded':'','title':'','tags':'','thumbnail':'','duration':'','likeCount':'','rating':'','ratingCount':'','viewCount':'','favoriteCount':'','commentCount':''}
}
# Obtaining data from each page (sample: first page only)
for index in range(1, 2):  # use range(1, pagenumber + 1) for the full crawl
    # start index of this page (the API uses 1-based indexing)
    start_index = (index - 1) * maxcount + 1
    s = requests.get("https://gdata.youtube.com/feeds/api/users/" + user_id +
                     "/uploads?v=2&alt=jsonc&start-index=" + str(start_index) +
                     "&max-results=" + str(maxcount))
    data = [json.loads(row) for row in s.content.split("\n") if row]
    metadata = data[0]['data']['items']
    # obtaining each item in a page (25 items; only 5 here as a sample)
    for i in range(5):  # use range(len(metadata)) for the full crawl
        count += 1
        u = metadata[i]
        for j in key:
            if j == 'id':
                tedx['id'] = u['id']
            elif j == 'thumbnail':
                tedx['data'][j] = u[j]['hqDefault']
            elif j == 'title':
                tedx['data'][j] = u[j].encode('utf-8')
            else:
                # use '-' for missing key-value pairs
                tedx['data'][j] = u.get(j, '-')
        the_dump = json.dumps(tedx)
        print(the_dump)
    # delay between page requests
    time.sleep(1)
Now that we've got the video attributes from both the TEDx website and YouTube, we can merge these attributes.
import json
SITE_JSON = "tedx_video.json"
YOUTUBE_JSON = "tedx_v7.txt"
SITE_ATTR_LIST = ['lang', 'event', 'country', 'topic']
# import JSON from TEDx website and make video_dict
site_json_file = open(SITE_JSON)
site_json = json.load(site_json_file)
site_json_file.close()
video_dict = {}
for video in site_json:
    vid = site_json[video]['id']
    video_dict[vid] = {}
    for attr in SITE_ATTR_LIST:
        if attr in site_json[video]:
            video_dict[vid][attr] = site_json[video][attr]
# get JSON from YouTube and print the merged result
merged_cnt = 0
with open(YOUTUBE_JSON, "r") as youtube_json_file:
    for line in youtube_json_file:
        # show only the first 10 merged records here as a sample
        if merged_cnt >= 10:
            break
        youtube_json = json.loads(line)
        vid = youtube_json['id']
        merged_video = youtube_json['data']
        merged_video['id'] = vid
        if vid in video_dict:
            attr_cnt = 0
            for attr in SITE_ATTR_LIST:
                if attr in video_dict[vid]:
                    merged_video[attr] = video_dict[vid][attr]
                    attr_cnt += 1
            # only keep videos that have all four site attributes
            if attr_cnt == 4:
                print(json.dumps(merged_video))
                merged_cnt += 1
Now that we've merged all the information, we can explore some basic statistics of these videos.
import pandas as pd
from pandas import Series, DataFrame
import json
TEDX_JSON_FILE = "final_tedx.json"
tedx_video_list = []
with open(TEDX_JSON_FILE, "r") as tedx_json_file:
    for line in tedx_json_file:
        tedx_video_list.append(json.loads(line))
tedx_df = DataFrame(tedx_video_list)
tedx_df.set_index('id', inplace=True, drop=True)
tedx_df
Conclusion?
tedx_df['lang'].value_counts()[:10]
tedx_df[tedx_df.lang!='English']['lang'].value_counts().plot(kind="bar")
Conclusion?
tedx_df['viewCount'] = tedx_df['viewCount'].fillna(0)
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort('viewCount', ascending=0)[:10]
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df[tmp_tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort('viewCount', ascending=0).plot(kind="bar")
Conclusion?
tedx_df.groupby('lang').event.nunique().order(ascending=False)[:10]
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df.groupby('lang').event.nunique().order(ascending=False).plot(kind="bar")
tedx_df['country'].value_counts()[:10]
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'country']].groupby('country').sum().sort('viewCount', ascending=0)[:10]
tedx_df.groupby('country').event.nunique().order(ascending=False)[:10]
# more information for visualization, including how to prepare data for D3.js
# http://nbviewer.ipython.org/5501063
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_popularity" width="1000" height="800"></iframe>')
# you can navigate years by sliding 'year' digits
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_comparison" width="1100" height="750"></iframe>')
from pandas import read_csv
from pandas import Series, DataFrame
# read_csv can open the local CSV file directly
df = read_csv("list of languages by number of native speaker.csv")
df.set_index('Language',inplace=True,drop=True)
df[:30].plot(kind="bar")
Data inconsistency: the video 'TEDxCausewayBay - Terence Wong - 04/15/10' appears under both the Chinese and Korean languages, and under both the TEDxCAU and TEDxCausewayBay events.