TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, and Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.
Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing that spread with TED Talk videos.
The questions we are trying to answer:
Since TEDx does not provide an API for retrieving video data, we wrote our own scraper to crawl various attributes of the TEDx videos. And since all TEDx videos are on YouTube, we also use the YouTube API to retrieve more information about the videos.
TEDx Website
YouTube API
First, we get all the type portal links from the TEDx home URL.
We use Beautiful Soup to find all the links on the TEDx home page that begin with the following strings.
import requests
from bs4 import BeautifulSoup
TEDX_HOME_URL = "http://tedxtalks.ted.com"
LANG_URL = "/browse/talks-by-language/"
s = requests.get(TEDX_HOME_URL)
soup = BeautifulSoup(s.content)
total = 0
link_tags = soup.find_all('a', href=True)
for link_tag in link_tags:
    link = link_tag['href']
    lang = link_tag.next_element.next_element.next_element
    if link.startswith(LANG_URL):
        print("Language %s: %s" % (lang, link))
        total += 1
print("Total: %d" % (total))
Language American Sign Language: /browse/talks-by-language/asl
Language Azerbaijani: /browse/talks-by-language/azerbaijani
Language Galician: /browse/talks-by-language/galician
Language Arabic: /browse/talks-by-language/arabic
Language Bulgarian: /browse/talks-by-language/bulgarian
Language Catalan: /browse/talks-by-language/catalan
Language Chinese: /browse/talks-by-language/chinese
Language Croatian: /browse/talks-by-language/croatian
Language Czech: /browse/talks-by-language/czech
Language Dutch: /browse/talks-by-language/dutch
Language English: /browse/talks-by-language/english
Language Estonian: /browse/talks-by-language/estonian
Language Finnish: /browse/talks-by-language/finnish
Language French: /browse/talks-by-language/french
Language German: /browse/talks-by-language/german
Language Greek: /browse/talks-by-language/greek
Language Hebrew: /browse/talks-by-language/hebrew
Language Hindi: /browse/talks-by-language/hindi
Language Hungarian: /browse/talks-by-language/hungarian
Language Icelandic: /browse/talks-by-language/icelandic
Language Indonesian: /browse/talks-by-language/indonesian
Language Italian: /browse/talks-by-language/italian
Language Japanese: /browse/talks-by-language/japanese
Language Korean: /browse/talks-by-language/korean
Language Lithuanian: /browse/talks-by-language/lithuanian
Language Malay: /browse/talks-by-language/malay
Language Polish: /browse/talks-by-language/polish
Language Portuguese: /browse/talks-by-language/portuguese
Language Rajasthani: /browse/talks-by-language/rajasthani
Language Romanian: /browse/talks-by-language/romanian
Language Russian: /browse/talks-by-language/russian
Language Slovak: /browse/talks-by-language/slovak
Language Slovene: /browse/talks-by-language/slovene
Language Spanish: /browse/talks-by-language/spanish
Language Swedish: /browse/talks-by-language/swedish
Language Tamil: /browse/talks-by-language/tamil
Language Thai: /browse/talks-by-language/thai
Language Turkish: /browse/talks-by-language/turkish
Language Ukrainian: /browse/talks-by-language/ukrainian
Language Urdu: /browse/talks-by-language/urdu
Total: 40
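The prefix filter above can also be tried offline, without the network or Beautiful Soup. The following is only a sketch using the standard library's html.parser in place of Beautiful Soup; the class name PrefixLinkCollector and the sample markup (including the talks-by-topic link) are made up for illustration and are not taken from the TEDx site.

```python
from html.parser import HTMLParser

LANG_URL = "/browse/talks-by-language/"


class PrefixLinkCollector(HTMLParser):
    """Collect (href, link text) pairs for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the matching <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup, invented for this example only
sample_html = """
<ul>
  <li><a href="/browse/talks-by-language/icelandic"><span>Icelandic</span></a></li>
  <li><a href="/browse/talks-by-topic/education"><span>Education</span></a></li>
  <li><a href="/browse/talks-by-language/urdu"><span>Urdu</span></a></li>
</ul>
"""

collector = PrefixLinkCollector(LANG_URL)
collector.feed(sample_html)
language_links = collector.links
for href, name in language_links:
    print("Language %s: %s" % (name, href))
```

Collecting the text inside the anchor avoids depending on how many next_element steps separate the tag from its label, at the cost of a little more code.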
Then we visit each type portal link to get its video links; the type becomes an attribute of each video.
We go through page 1, page 2, and so on, until there are no more pages for that type. For example, we go to 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1', then 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2', and stop at 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3', which is empty, to get all 28 videos in Icelandic.
In 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1' we can find the video link '/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson' by using Beautiful Soup and a regular expression. For example, the second video in 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1':
<a id="mvp_grid_panel_img_1" href="/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson" class="mvp_thumbnail_magnified" style="position: relative; width: 293px; height: 220px; background-image: url('http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg'); filter: progid:DXImageTransform.Microsoft.AlphaImageLoader( src='http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg', sizingMethod='scale');" title="TEDxReykjavik - Eythor Edvardsson -"></a>
import re
VIDEO_LINK_PREFIX = "mvp_grid_panel_img_"
MSG_CLASS = "mvp_padded_message"
EMPTY_PAGE_MSG = "This page is empty."
portal_url = "http://tedxtalks.ted.com/browse/talks-by-language/icelandic"
page = 1
while True:
    # EX: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1
    url = portal_url + "?page=" + str(page)
    print("Reading URL: " + url)
    s = requests.get(url)
    soup = BeautifulSoup(s.content)
    # if there is no Next page
    # <div class="mvp_padded_message">This page is empty.</div>
    msg_tag = soup.find('div', {'class': MSG_CLASS})
    if msg_tag and msg_tag.get_text() == EMPTY_PAGE_MSG:
        print("empty page.")
        break
    link_tags = soup.find_all('a', id=re.compile(VIDEO_LINK_PREFIX), href=True)
    for link_tag in link_tags:
        link = link_tag['href']
        # EX: from /list/search%3Atag%3A%22chinese%22/video/The-tragedy-of-Hong-Kong-Archiv to /video/The-tragedy-of-Hong-Kong-Archiv
        pos = link.find("/video")
        link = link[pos:]
        print("video link: %s (attr Language: Icelandic)" % (link))
    page += 1
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1
video link: /video/TEDxReykjavik-Berghildur-Bergrs (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Eythor-Edvardsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ari-Kristinn-Jons (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Danielle-Morrill (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Deepa-Iyengar-Jed (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Rakel-Solvadottir (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Alex-MacNeil-Exit (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Daddi-Gudbergsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Peter-Anderson-Mo (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Iceland-Dance-Com (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Hrund-Gunnsteinsd (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Gudrun-Petursdott (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ingibjorg-Greta-G (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ragnheidur-Harald (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Smri-McCarthy-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Skli-Mogensen-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Edda-Bjrgvinsdtti (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Margrt-Dra-Ragnar (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Jnas-Antonsson-96 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Guni-Gunnarsson-9 (attr Language: Icelandic)
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2
video link: /video/TEDxReykjavik-Gumundur-Oddur-96 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Andri-Heiar-Krist (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Kristin-Drfjr-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Gurn-Lilja-Gunnla (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Teitur-orkelsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-orvaldur-orsteins (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Mary-Frances-Davi (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Torfi-G-Yngvason (attr Language: Icelandic)
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3
empty page.
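The page-by-page crawl above can be wrapped into a small generator so that the stopping rule is testable without hitting the site. This is a sketch under assumptions: crawl_portal, strip_list_prefix and the fetch_page callback are hypothetical names, and the fake two-page portal stands in for the real requests and Beautiful Soup calls.

```python
def strip_list_prefix(link):
    """Keep only the part of the link from "/video" onward, if present."""
    pos = link.find("/video")
    return link[pos:] if pos >= 0 else link


def crawl_portal(portal_url, fetch_page, max_pages=1000):
    """Yield (page_number, video_links) until fetch_page reports an empty page.

    fetch_page(url) is expected to return a list of raw video links,
    or None when the page is empty.
    """
    for page in range(1, max_pages + 1):
        url = "%s?page=%d" % (portal_url, page)
        links = fetch_page(url)
        if links is None:
            return
        yield page, [strip_list_prefix(link) for link in links]


# Stand-in for the network: a fake portal with two pages (hypothetical data)
fake_site = {
    "http://example.com/portal?page=1": ["/list/x/video/A", "/list/x/video/B"],
    "http://example.com/portal?page=2": ["/list/x/video/C"],
}

# dict.get returns None for page 3, which ends the crawl
pages = list(crawl_portal("http://example.com/portal", fake_site.get))
print(pages)
```

The max_pages cap is a safety net added here so that a site that never returns an empty page cannot cause an endless loop.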
Since all TEDx videos are on YouTube, and we also use the YouTube API to get other information about the videos, we use the YouTube ID as the key that identifies each video.
Because Beautiful Soup does not parse through the <embed> tag, we use a regular expression to get the YouTube ID.
For example, in the video 'http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson' we can find its YouTube ID: 'bzF4GPguPL8'
<embed type="application/x-shockwave-flash" src="http://www.youtube.com/v/bzF4GPguPL8&rel=0&fs=1&showsearch=0&enablejsapi=1&modestbranding=1&autoplay=1&playerapiid=mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" width="634" height="382" style="undefined" id="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" name="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" quality="high" allowfullscreen="true" allowscriptaccess="always" wmode="opaque" loop="false">
VIDEO_ID_RE = b"""
<embed.*\ src=\\\\\".*/v/(.*?)\\\\\".*>.*</embed>
"""
url = "http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson"
s = requests.get(url)
html = s.content
video_ids = re.findall(VIDEO_ID_RE, html, re.IGNORECASE|re.VERBOSE)
for video_id in video_ids:
    # NOTE: byte string => need decode
    print("YouTube ID: %s (%s)" % (url, video_id.decode('utf-8')))
YouTube ID: http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson (bzF4GPguPL8)
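If the src attribute of the embed tag is already at hand as a string, the ID can be pulled out of it directly. The sketch below assumes that the ID is the run of 11 URL-safe characters right after "/v/", which holds for the example above but is an assumption, not something the site guarantees; extract_youtube_id is a hypothetical helper name.

```python
import re

# Assumption: a YouTube ID is 11 characters from [A-Za-z0-9_-] following "/v/"
YOUTUBE_ID_RE = re.compile(r"/v/([A-Za-z0-9_-]{11})")


def extract_youtube_id(src):
    """Return the YouTube ID found in an embed src URL, or None if there is none."""
    match = YOUTUBE_ID_RE.search(src)
    return match.group(1) if match else None


# src value shortened from the <embed> example above
embed_src = "http://www.youtube.com/v/bzF4GPguPL8&rel=0&fs=1&showsearch=0"
video_id = extract_youtube_id(embed_src)
print(video_id)  # bzF4GPguPL8
```

Matching a fixed-length ID stops the capture before the "&rel=0..." query parameters, which a lazy ".*?" up to the closing quote would include.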
#https://developers.google.com/youtube/articles/view_youtube_jsonc_responses
#https://developers.google.com/youtube/2.0/developers_guide_jsonc
import requests
import json
import time
user_id = "tedxtalks"
page = 1
maxcount = 25
count = 0
start_index = 0
# Obtain the total number of videos, then the number of pages
s = requests.get("https://gdata.youtube.com/feeds/api/users/"+user_id+"/uploads?v=2&alt=jsonc&start-index=1&max-results=1")
data = [json.loads(row) for row in s.text.split("\n") if row]
totalcount = data[0]['data']['totalItems']
pagenumber = (totalcount + maxcount - 1) // maxcount  # round up
key = ['id', 'uploaded', 'category', 'title', 'tags', 'thumbnail', 'duration', 'likeCount', 'rating', 'ratingCount', 'viewCount', 'favoriteCount', 'commentCount']
tedx = {'id': '',
        'data': {'uploaded': '', 'title': '', 'tags': '', 'thumbnail': '', 'duration': '', 'likeCount': '', 'rating': '', 'ratingCount': '', 'viewCount': '', 'favoriteCount': '', 'commentCount': ''}
        }
# Obtain data from each page (sample: first page only)
for index in range(1, 2):  # full run: range(1, pagenumber + 1)
    # start-index is 1-based: page 1 starts at 1, page 2 at 26, and so on
    start_index = (index - 1) * maxcount + 1
    s = requests.get("https://gdata.youtube.com/feeds/api/users/"+user_id+"/uploads?v=2&alt=jsonc&start-index="+str(start_index)+"&max-results="+str(maxcount))
    data = [json.loads(row) for row in s.text.split("\n") if row]
    metadata = data[0]['data']['items']
    # Read each item in the page (sample: first 5 of up to 25 items)
    for i in range(5):  # full run: range(len(metadata))
        count += 1
        u = metadata[i]
        for j in key:
            if j == 'id':
                tedx['id'] = u['id']
            elif j == 'thumbnail':
                tedx['data'][j] = u[j]['hqDefault']
            elif j == 'title':
                tedx['data'][j] = u[j]
            else:
                # fill a missing key-value pair with '-'
                tedx['data'][j] = u.get(j, '-')
        the_dump = json.dumps(tedx)
        print(the_dump)
    # delay between page requests
    time.sleep(1)
# https://developers.google.com/youtube/2.0/developers_guide_jsonc
{"data": {"uploaded": "2013-05-02T08:46:29.000Z", "rating": "-", "tags": "-", "likeCount": "-", "commentCount": 0, "ratingCount": "-", "duration": 960, "category": "People", "viewCount": 2, "title": "Corporate rebels: Peter Vander Auwera at TEDxBrusselsChange", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/GgdwNiOwajg/hqdefault.jpg"}, "id": "GgdwNiOwajg"}
{"data": {"uploaded": "2013-05-02T06:43:09.000Z", "rating": 5.0, "tags": "-", "likeCount": "3", "commentCount": 0, "ratingCount": 3, "duration": 1134, "category": "Tech", "viewCount": 39, "title": "One Chance at Life - What Would You Do: Chuck Berry at TEDxQueenstown", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/5LMMIu813zQ/hqdefault.jpg"}, "id": "5LMMIu813zQ"}
{"data": {"uploaded": "2013-05-02T05:47:59.000Z", "rating": 4.75, "tags": "-", "likeCount": "15", "commentCount": 2, "ratingCount": 16, "duration": 866, "category": "People", "viewCount": 130, "title": "The Habbits of Highly Boring People: Chris Sauve at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/3rbVQNTzCh8/hqdefault.jpg"}, "id": "3rbVQNTzCh8"}
{"data": {"uploaded": "2013-05-02T05:05:34.000Z", "rating": 5.0, "tags": "-", "likeCount": "4", "commentCount": 0, "ratingCount": 4, "duration": 924, "category": "People", "viewCount": 21, "title": "A Selfless Good Deed: Trevor Deley at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/OvZlGIT1tOA/hqdefault.jpg"}, "id": "OvZlGIT1tOA"}
{"data": {"uploaded": "2013-05-02T04:24:49.000Z", "rating": 5.0, "tags": "-", "likeCount": "2", "commentCount": 0, "ratingCount": 2, "duration": 858, "category": "Sports", "viewCount": 14, "title": "Fishing for the Future: Dr. Steven Cooke at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/Wsz8Wn76h-4/hqdefault.jpg"}, "id": "Wsz8Wn76h-4"}
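The paging arithmetic is easy to get wrong by one, so it may help to work it out separately. The sketch below assumes, as the first request above suggests with start-index=1, that the feed counts items from 1; page_start_indices is a hypothetical helper and the totals are example numbers, not figures from the API.

```python
def page_start_indices(total_items, page_size):
    """Return the 1-based start index of every page needed to cover total_items."""
    return list(range(1, total_items + 1, page_size))


# 60 items in pages of 25 need three requests
starts = page_start_indices(60, 25)
print(starts)  # [1, 26, 51]
```

Each start index is one more than a multiple of the page size, so no item is skipped between pages and the last, partly filled page is still requested.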
Now that we have the video attributes from both the TEDx website and YouTube, we can merge them using the YouTube ID as the key.
import json
SITE_JSON = "tedx_video.json"
YOUTUBE_JSON = "tedx_v7.txt"
SITE_ATTR_LIST = ['lang', 'event', 'country', 'topic']
# import JSON from TEDx website and make video_dict
site_json_file = open(SITE_JSON)
site_json = json.load(site_json_file)
site_json_file.close()
video_dict = {}
for video in site_json:
    vid = site_json[video]['id']
    video_dict[vid] = {}
    for attr in SITE_ATTR_LIST:
        if attr in site_json[video]:
            video_dict[vid][attr] = site_json[video][attr]
# get JSON from YouTube and print to merged result file
merged_cnt = 0
with open(YOUTUBE_JSON, "r") as youtube_json_file:
    for line in youtube_json_file:
        if merged_cnt >= 10:
            break
        youtube_json = json.loads(line)
        vid = youtube_json['id']
        merged_video = youtube_json['data']
        merged_video['id'] = vid
        if vid in video_dict:
            attr_cnt = 0
            for attr in SITE_ATTR_LIST:
                if attr in video_dict[vid]:
                    merged_video[attr] = video_dict[vid][attr]
                    attr_cnt += 1
            if attr_cnt == 4:
                print(json.dumps(merged_video))
                merged_cnt += 1
{"uploaded": "2013-04-24T08:42:45.000Z", "rating": 4.6, "lang": "English", "tags": "-", "country": "Spain", "id": "JcqXD5JgVXw", "title": "Wonder and beauty in education: Catherine L'Ecuyer at TEDxManresa", "event": "TEDxManresa", "likeCount": "9", "commentCount": 0, "topic": "Education", "ratingCount": 10, "duration": 1087, "category": "Nonprofit", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/JcqXD5JgVXw/hqdefault.jpg", "viewCount": 900}
{"uploaded": "2013-04-23T10:09:17.000Z", "rating": 4.8974357, "lang": "English", "tags": "-", "country": "Greece", "id": "s6KM9MxY5ZM", "title": "Learning is a Game: Ed Cooke at TEDxThessaloniki", "event": "TEDxThessaloniki", "likeCount": "38", "commentCount": 5, "topic": "Education", "ratingCount": 39, "duration": 1187, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/s6KM9MxY5ZM/hqdefault.jpg", "viewCount": 2152}
{"uploaded": "2013-04-23T07:26:23.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "pZsORC8sgl4", "title": "Architektura jako starost o m\u00edsto kde \u017eijeme: Roman Brychta at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "1", "commentCount": 0, "topic": "Education", "ratingCount": 1, "duration": 981, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/pZsORC8sgl4/hqdefault.jpg", "viewCount": 89}
{"uploaded": "2013-04-23T07:25:08.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "HeSH7cKTs0s", "title": "Kdy\u017e se chce, tak to jde... V\u011b\u0159te n\u00e1m, testovali jsme to na lidech: Dan P\u0159ib\u00e1\u0148 at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "82", "commentCount": 6, "topic": "Education", "ratingCount": 82, "duration": 1031, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/HeSH7cKTs0s/hqdefault.jpg", "viewCount": 2799}
{"uploaded": "2013-04-23T07:24:51.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "Bkrku3_sv88", "title": "Geometrie trojrozm\u011brn\u00e9ho \u017eivota: Jan Han\u00e1k at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "1", "commentCount": 0, "topic": "Education", "ratingCount": 1, "duration": 883, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/Bkrku3_sv88/hqdefault.jpg", "viewCount": 92}
{"uploaded": "2013-04-23T07:24:17.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "zNS7kdSMVac", "title": "Psan\u00edm k sebepozn\u00e1n\u00ed: Ji\u0159\u00ed Van\u011bk at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "2", "commentCount": 0, "topic": "Education", "ratingCount": 2, "duration": 874, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/zNS7kdSMVac/hqdefault.jpg", "viewCount": 243}
{"uploaded": "2013-04-22T21:18:14.000Z", "rating": 4.6363635, "lang": "English", "tags": "-", "country": "United States", "id": "hktzJ7QNcMU", "title": "Empowering Women and Girls: Halima Hima at TEDxChange", "event": "TEDxChange", "likeCount": "10", "commentCount": 0, "topic": "Entertainment", "ratingCount": 11, "duration": 1419, "category": "Entertainment", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/hktzJ7QNcMU/hqdefault.jpg", "viewCount": 173}
{"uploaded": "2013-04-20T00:51:52.000Z", "rating": 4.6363635, "lang": "English", "tags": "-", "country": "United States", "id": "845UrCAFTsQ", "title": "Iconic toilets: Mathew Lippincott at TEDxConcordiaUPortland", "event": "TEDxConcordiaUPortland", "likeCount": "20", "commentCount": 6, "topic": "Education", "ratingCount": 22, "duration": 648, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/845UrCAFTsQ/hqdefault.jpg", "viewCount": 1104}
{"uploaded": "2013-04-18T09:38:39.000Z", "rating": 5.0, "lang": "English", "tags": "-", "country": "India", "id": "MiwjplU6kAc", "title": "Three laws of user experience: Apala Lahiri Chavan at TEDxGolfLinksPark", "event": "TEDxGolflinkspark", "likeCount": "10", "commentCount": 2, "topic": "Education", "ratingCount": 10, "duration": 1393, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/MiwjplU6kAc/hqdefault.jpg", "viewCount": 1069}
{"uploaded": "2013-04-18T02:18:44.000Z", "rating": 4.2941175, "lang": "English", "tags": "-", "country": "United States", "id": "5-YIxJEyBBs", "title": "Crowd sourcing the feminine intelligence of the planet: Jensine Larsen at TEDxConcordiaUPortland", "event": "TEDxConcordiaUPortland", "likeCount": "14", "commentCount": 6, "topic": "Education", "ratingCount": 17, "duration": 1149, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/5-YIxJEyBBs/hqdefault.jpg", "viewCount": 432}
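The merge rule used above (keep a video only when all four site attributes are present) can be isolated in a small function and checked on in-memory data. This is a sketch, not the code that produced the output above; merge_video and the sample records are hypothetical, shaped like the JSON lines shown.

```python
SITE_ATTR_LIST = ['lang', 'event', 'country', 'topic']


def merge_video(youtube_record, site_attrs, required=SITE_ATTR_LIST):
    """Return the merged record, or None if any required site attribute is missing."""
    if site_attrs is None or any(attr not in site_attrs for attr in required):
        return None
    merged = dict(youtube_record['data'])  # copy, so the input is not modified
    merged['id'] = youtube_record['id']
    for attr in required:
        merged[attr] = site_attrs[attr]
    return merged


# Hypothetical records for illustration
youtube_record = {"id": "abc123", "data": {"title": "Example talk", "viewCount": 10}}
video_dict = {"abc123": {"lang": "English", "event": "TEDxExample",
                         "country": "Spain", "topic": "Education"}}

merged = merge_video(youtube_record, video_dict.get(youtube_record["id"]))
print(merged)

# A video with an incomplete set of site attributes is skipped
incomplete = merge_video(youtube_record, {"lang": "English"})
print(incomplete)  # None
```

Passing video_dict.get(...) means a video that is missing from the site data arrives as None and is skipped by the same check.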
Now that we've merged all the information we got, we can try to discover some basic statistics of these videos.
import pandas as pd
from pandas import Series, DataFrame
import json
TEDX_JSON_FILE = "final_tedx.json"
tedx_video_list = []
with open(TEDX_JSON_FILE, "r") as tedx_json_file:
    for line in tedx_json_file:
        tedx_video_list.append(json.loads(line))
tedx_df = DataFrame(tedx_video_list)
tedx_df.set_index('id', inplace=True, drop=True)
tedx_df
<class 'pandas.core.frame.DataFrame'>
Index: 29982 entries, ew0ovccWuQg to QZkUPZr1Zbc
Data columns:
category         27081  non-null values
commentCount     27081  non-null values
country          19813  non-null values
duration         27081  non-null values
event            23177  non-null values
favoriteCount    27081  non-null values
lang             24061  non-null values
likeCount        27081  non-null values
rating           27081  non-null values
ratingCount      27081  non-null values
tags             27081  non-null values
thumbnail        27081  non-null values
title            27081  non-null values
topic            13332  non-null values
uploaded         27081  non-null values
viewCount        27081  non-null values
dtypes: float64(1), object(15)
Conclusion?
tedx_df['lang'].value_counts()[:10]
English       17479
Spanish        1388
Portuguese     1002
Korean          654
French          559
Arabic          471
Russian         397
Japanese        289
Italian         253
Polish          158
tedx_df[tedx_df.lang!='English']['lang'].value_counts().plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x18f04610>
Conclusion?
tedx_df['viewCount'] = tedx_df['viewCount'].fillna(0)
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort_values('viewCount', ascending=False)[:10]
lang | viewCount |
---|---|
English | 58477968 |
Arabic | 4213183 |
Spanish | 4173229 |
French | 3571510 |
Portuguese | 1516388 |
Japanese | 909865 |
Korean | 892479 |
Polish | 827042 |
Greek | 490969 |
Indonesian | 478289 |
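The count fields in the merged records are not uniformly typed: missing values were written as '-', and likeCount arrives as a string. Before summing, it can help to normalize them. The sketch below does this in plain Python rather than pandas; to_count and views_by_language are hypothetical helpers, and treating '-' as 0 is an assumption made here, not a choice documented in the analysis above.

```python
from collections import defaultdict


def to_count(value):
    """Convert a count field to int; '-', None and other non-numeric values become 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def views_by_language(videos):
    """Sum viewCount per language and return (lang, total) pairs, largest first."""
    totals = defaultdict(int)
    for video in videos:
        lang = video.get("lang")
        if lang is not None:
            totals[lang] += to_count(video.get("viewCount"))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Small made-up sample; the last record has no language and is ignored
sample = [
    {"lang": "Czech", "viewCount": 89},
    {"lang": "Czech", "viewCount": "243"},
    {"lang": "English", "viewCount": 900},
    {"lang": "English", "viewCount": "-"},
    {"viewCount": 5},
]
totals = views_by_language(sample)
print(totals)  # [('English', 900), ('Czech', 332)]
```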
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df[tmp_tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort_values('viewCount', ascending=False).plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x1976f0f0>
Conclusion?
tedx_df.groupby('lang').event.nunique().sort_values(ascending=False)[:10]
lang
English       1035
Spanish        101
Portuguese      63
French          54
Korean          48
Arabic          35
Russian         26
Italian         19
Japanese        15
Chinese         14
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df.groupby('lang').event.nunique().sort_values(ascending=False).plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x17887670>
tedx_df['country'].value_counts()[:10]
United States     5634
Canada            1249
India              815
Brazil             726
Netherlands        710
South Korea        685
Spain              674
Australia          605
United Kingdom     591
Japan              474
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'country']].groupby('country').sum().sort_values('viewCount', ascending=False)[:10]
country | viewCount |
---|---|
United States | 18189465 |
Canada | 4572056 |
France | 3065666 |
United Kingdom | 2226657 |
Argentina | 2171808 |
Netherlands | 1947574 |
Japan | 1734785 |
India | 1510407 |
Yemen | 1308153 |
Spain | 1196308 |
tedx_df.groupby('country').event.nunique().sort_values(ascending=False)[:10]
country
United States     359
India              89
Canada             80
United Kingdom     51
South Korea        47
Brazil             44
Spain              41
Netherlands        32
France             30
Australia          28
# more information for visualization, including how to prepare data for D3.js
# http://nbviewer.ipython.org/5501063
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_popularity" width="1000" height="800"></iframe>')
# you can navigate years by sliding 'year' digits
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_comparison" width="1100" height="750"></iframe>')
from pandas import read_csv
from pandas import Series, DataFrame
# read_csv accepts a local file path directly, so urlopen is not needed
df = read_csv("list of languages by number of native speaker.csv")
df.set_index('Language',inplace=True,drop=True)
df[:30].plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x6aacb90>
Data inconsistency: the video 'TEDxCausewayBay - Terence Wong - 04/15/10' appears under both Chinese and Korean for language, and under both TEDxCAU and TEDxCausewayBay for event.
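Cases like this one can be listed automatically by grouping the crawled attributes by video ID and flagging any ID that carries more than one value. The sketch below is one possible check, not part of the original crawl; find_conflicts, the record layout and the IDs vid1 and vid2 are hypothetical.

```python
from collections import defaultdict


def find_conflicts(records, attr):
    """Map video id -> sorted values, for ids seen with more than one value of attr."""
    seen = defaultdict(set)
    for video_id, attrs in records:
        if attr in attrs:
            seen[video_id].add(attrs[attr])
    return {vid: sorted(values) for vid, values in seen.items() if len(values) > 1}


# Hypothetical crawl results: (video id, attributes found on one portal page)
records = [
    ("vid1", {"lang": "Chinese", "event": "TEDxCausewayBay"}),
    ("vid1", {"lang": "Korean", "event": "TEDxCAU"}),
    ("vid2", {"lang": "English", "event": "TEDxExample"}),
]

lang_conflicts = find_conflicts(records, "lang")
event_conflicts = find_conflicts(records, "event")
print(lang_conflicts)   # {'vid1': ['Chinese', 'Korean']}
print(event_conflicts)  # {'vid1': ['TEDxCAU', 'TEDxCausewayBay']}
```

Running such a check before the merge step would show how many videos are affected, which matters because a dictionary keyed by video ID keeps only the last value written for each attribute.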