pandas and matplotlib are the only dependencies for this tutorial.
import pandas as pd
import pylab
The data we'll use comes from StatCounter. Rather than beating their servers to get the data, I've downloaded search engine market share data for a few countries.
Notice that the only difference between the URLs of the country-level CSVs is the country's abbreviation. We'll exploit this when building a DataFrame from the CSV data at each URL.
The CSV data is comma-delimited, with column names in the first row and double quotes around the header values. For example, here is a look at the US data:
"Date","Google","Yahoo!","bing","AOL","Ask Jeeves","Other"
2008-07,79.29,12.88,0,1.86,0.12,5.85
2008-08,79.59,12.62,0,0.85,1.04,5.9
2008-09,79.36,12.57,0,0.78,0.96,6.33
2008-10,80.14,12.34,0,0.79,0.93,5.8
...
First make a list of the country abbreviations so we can iterate over it later, then create a function that reads in the URL of a CSV file and returns a DataFrame. The first column of the CSV file is the date (monthly aggregation) of each observation, which makes it a good candidate for the index of the DataFrame.
countries = ['US', 'CA', 'GB', 'FR', 'CN', 'RU', 'DE']
def makeDataFrame(country):
    baseurl = 'http://s3.amazonaws.com/econpy/hhi'
    url = '%s/search_engine-%s-monthly-200807-201304.csv' % (baseurl, country)
    dframe = pd.read_csv(url, index_col=0)
    dframe.index = pd.DatetimeIndex(dframe.index)
    return dframe
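Before fetching the real files, it's worth confirming that pd.read_csv handles the quoted header and monthly dates shown above. A minimal sketch using an inline copy of the US excerpt (no network access needed):

```python
import io
import pandas as pd

# A small inline sample mimicking the StatCounter CSV layout
# (these rows are copied from the US excerpt above).
csv_text = '''"Date","Google","Yahoo!","bing","AOL","Ask Jeeves","Other"
2008-07,79.29,12.88,0,1.86,0.12,5.85
2008-08,79.59,12.62,0,0.85,1.04,5.9
'''

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
df.index = pd.DatetimeIndex(df.index)
print(df.index[0])           # first month parsed as a timestamp
print(df['Google'].iloc[0])  # 79.29
```

The monthly strings like '2008-07' parse cleanly into a DatetimeIndex, which is what makes the time-series plots below work.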
Now use the makeDataFrame function to create a DataFrame for each country.
us_df = makeDataFrame('US')
us_df.plot(title='All companies in the US')
pylab.show()
us_df['Google'].plot(color='g', title='Google')
pylab.show()
us_df['bing'].plot(color='b')
us_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
us_df['AOL'].plot(color='k')
us_df['Ask Jeeves'].plot(color='r', title='AOL (Black) and Ask Jeeves (Red)')
pylab.show()
The story in the US is pretty straightforward - Google has maintained a consistent market share for many years, while Bing and Yahoo battle it out for the 2nd and 3rd largest shares, hovering around 10% each. AOL and Ask Jeeves bring up the rear, each holding less than 1% of the market.
china_df = makeDataFrame('CN')
china_df.plot(title='All companies in China')
pylab.show()
china_df['Google'].plot(color='g')
china_df['Baidu'].plot(color='b')
china_df['360 Search'].plot(color='r', title='Google (Green), Baidu (Blue) and 360 Search (Red)')
pylab.show()
china_df['bing'].plot(color='b')
china_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
China is rather unique as it is the only country in this sample where Google isn't the market leader - in fact, Google is #3 today. Much of this has to do with regulations passed in China over the last few years that have been aimed at censoring Google in an attempt to push them out of the market. Baidu has benefited largely from these regulations, as has 360 Search in recent months.
uk_df = makeDataFrame('GB')
uk_df.plot(title='All companies in the UK')
pylab.show()
uk_df['Google'].plot(color='g', title='Google')
pylab.show()
uk_df['bing'].plot(color='b')
uk_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
uk_df['Ask Jeeves'].plot(color='k')
uk_df['AOL'].plot(color='r', title='Ask Jeeves (Black) and AOL (Red)')
pylab.show()
The UK is somewhat similar to the US, except Google maintains an even larger share of the market (~90%). However, Bing has been steadily increasing its market share in the UK, sitting at nearly 6% today. Yahoo has been hanging around in the #3 position, with Ask Jeeves and AOL bringing up the rear with a combined market share of less than 1%.
germany_df = makeDataFrame('DE')
germany_df.plot(title='All companies in Germany')
pylab.show()
germany_df['Google'].plot(color='g', title='Google')
pylab.show()
germany_df['bing'].plot(color='b')
germany_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
germany_df['WEB.DE'].plot(color='k')
germany_df['Ask Jeeves'].plot(color='r', title='web.de (Black) and Ask Jeeves (Red)')
pylab.show()
Germany is definitely where Google has flexed its muscles the most, holding onto over 95% of the market. Similar to the UK, Bing has gained on Yahoo in recent years, but both struggle to keep up with Google's market domination.
canada_df = makeDataFrame('CA')
canada_df.plot(title='All companies in Canada')
pylab.show()
canada_df['Google'].plot(color='g', title='Google')
pylab.show()
canada_df['bing'].plot(color='b')
canada_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
The story in Canada is almost identical to the UK - Google holds onto a market share of around 90% with Bing edging out Yahoo in recent years for the #2 spot.
russia_df = makeDataFrame('RU')
russia_df.plot(title='All companies in Russia')
pylab.show()
russia_df['YANDEX RU'].plot(color='r', title='Yandex')
pylab.show()
russia_df['Google'].plot(color='g', title='Google')
pylab.show()
russia_df['bing'].plot(color='b')
russia_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
Russia tells an interesting story as it is one of the only countries where Google has had consistent competition from another search engine, Yandex. Although Google is still the market leader, Yandex holds on to a respectable ~40% of the market. Bing and Yahoo have a smaller market share in Russia than other countries with a combined market share below 2%.
france_df = makeDataFrame('FR')
france_df.plot(title='All companies in France')
pylab.show()
france_df['Google'].plot(color='g', title='Google')
pylab.show()
france_df['bing'].plot(color='b')
france_df['Yahoo!'].plot(color='y', title='Bing (Blue) and Yahoo! (Yellow)')
pylab.show()
france_df['Voila'].plot(color='k')
france_df['Ask Jeeves'].plot(color='r', title='Voila (Black) and Ask Jeeves (Red)')
pylab.show()
France is very similar to other EU countries as Google dominates while Bing and Yahoo fight it out, with Bing gaining slowly over the last few years.
Here we'll build a panel of the data so we can create some cross-country comparisons within a search engine.
panel = pd.Panel({country: makeDataFrame(country) for country in countries})
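Note that pd.Panel was removed in pandas 1.0, so on a modern install the line above will fail. A minimal sketch of the same cross-country lookup using pd.concat with a column MultiIndex instead (the toy frames here stand in for the real country DataFrames):

```python
import pandas as pd

# Concatenating a dict of DataFrames along axis=1 produces columns
# keyed by (country, engine), mirroring panel[country][engine].
frames = {
    'US': pd.DataFrame({'Google': [79.3, 79.6]}),
    'GB': pd.DataFrame({'Google': [91.2, 91.4]}),
}
wide = pd.concat(frames, axis=1)
print(wide[('US', 'Google')].tolist())  # [79.3, 79.6]
```

With this layout, `wide[('US', 'Google')]` plays the role of `panel['US']['Google']` in the plotting calls below.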
panel['US']['Google'].plot(color='b')
panel['GB']['Google'].plot(color='g')
panel['FR']['Google'].plot(color='y')
panel['DE']['Google'].plot(color='r')
panel['CN']['bing'].plot(color='r')
panel['US']['bing'].plot(color='b')
panel['GB']['bing'].plot(color='g')
panel['FR']['bing'].plot(color='y')
panel['DE']['bing'].plot(color='k')
panel['CN']['Yahoo!'].plot(color='r')
panel['US']['Yahoo!'].plot(color='b')
panel['GB']['Yahoo!'].plot(color='g')
panel['FR']['Yahoo!'].plot(color='y')
panel['DE']['Yahoo!'].plot(color='k')
The Herfindahl index (HHI) for a country in a given month is defined as

HHI = s_1^2 + s_2^2 + ... + s_n^2

where s_i is the market share of search engine i in that country. We need to compute the HHI for each month in each country. To do so, create a function to compute the HHI from the data:
def get_hhi(dframe, drop_other=True):
    # Dropping the 'Other' group is a good idea when calculating the
    # HHI, since it lumps many small firms into a single share. Work
    # on a copy so the caller's DataFrame is left untouched.
    if drop_other:
        dframe = dframe.drop('Other', axis=1)
    hhi_vals = []
    for month, row in dframe.iterrows():
        shares = [s for s in row if s > 0]
        hhi_vals.append({'month': month, 'hhi': sum(s * s for s in shares)})
    dframeHHI = pd.DataFrame(hhi_vals)
    dframeHHI.index = pd.DatetimeIndex(dframeHHI.pop('month'))
    return dframeHHI['hhi']
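As a sanity check on get_hhi, the same row-wise sum of squared shares can be written as a vectorized one-liner (the toy data below is made up): a 50/50 duopoly gives an HHI of 5000, while an 80/20 split gives 6800.

```python
import pandas as pd

# Toy market: two firms split 50/50 in month 1 and 80/20 in month 2.
# The HHI is the row-wise sum of squared shares.
toy = pd.DataFrame({'A': [50.0, 80.0], 'B': [50.0, 20.0]},
                   index=pd.to_datetime(['2008-07-01', '2008-08-01']))
hhi = (toy ** 2).sum(axis=1)
print(hhi.tolist())  # [5000.0, 6800.0]
```

Note the maximum possible HHI is 10000 (a pure monopoly), which is a useful reference point for the plots below.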
See how much the HHI calculation is thrown off by the 'Other' category.
us_df = makeDataFrame('US')
us_hhi_t = get_hhi(us_df, drop_other=False)
us_hhi_f = get_hhi(us_df, drop_other=True)
us_hhi_f.plot(color='b')
us_hhi_t.plot(color='r')
Print some basic summary statistics.
for searchengine in ['Google', 'bing', 'Yahoo!']:
    print '%s mean var \n%s' % (searchengine, '='*18)
    for country in ['US', 'GB', 'CA', 'DE', 'FR', 'CN', 'RU']:
        tmpdf = panel[country][searchengine]
        print '%s: %.3f %.3f' % (country, tmpdf.mean(), tmpdf.var())
    print '\n'
Google mean var
==================
US: 79.788 1.445
GB: 91.371 1.106
CA: 91.504 0.983
DE: 96.225 1.232
FR: 94.865 0.886
CN: 41.062 598.092
RU: 55.675 12.137

bing mean var
==================
US: 7.314 13.196
GB: 3.281 3.095
CA: 3.934 4.371
DE: 1.255 0.631
FR: 1.986 1.136
CN: 0.863 0.804
RU: 0.755 0.152

Yahoo! mean var
==================
US: 9.798 2.320
GB: 2.932 0.744
CA: 3.041 0.681
DE: 1.153 0.029
FR: 1.467 0.204
CN: 1.889 1.620
RU: 0.849 0.177
Now make a DataFrame with the HHI of every country. This one-liner creates a pandas DataFrame from a dictionary whose keys are country abbreviations (strings) and whose values are each country's monthly HHI series, computed with the get_hhi function defined earlier.
hhi_df = pd.DataFrame({country: get_hhi(makeDataFrame(country)) for country in countries})
hhi_df.plot()
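The country with the lowest average HHI has the most competitive search market. A quick way to pick it out of hhi_df (made-up numbers stand in for the real frame here):

```python
import pandas as pd

# Toy stand-in for hhi_df: two countries, two months of HHI values.
# idxmin() on the column means returns the most competitive market.
hhi_toy = pd.DataFrame({'US': [6600.0, 6700.0], 'RU': [4800.0, 4700.0]})
print(hhi_toy.mean().idxmin())  # RU
```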
And there you have it. Want competition in the search engine market? Go to Russia!