import pandas as pd
import pandas.io.ga as ga
import numpy as np
%matplotlib inline
import sys
print ("PYTHON ", sys.version)
print ("PANDAS ", pd.__version__)
PYTHON 3.4.3 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:27:51) [GCC 4.2.1 (Apple Inc. build 5577)] PANDAS 0.17.0
Now that our tools are loaded, let us fetch the data from the two Google Analytics properties.
The traffic data without anonymization since September 1st, 2015.
sources = {
'full' : {
'property_id': "UA-424540-4",
'profile_id': "5334921",
},
'anon' : {
'property_id': "UA-424540-11",
'profile_id': "107030134",
},
}
source_data = {}
for key in sources:
source_data[key] = ga.read_ga(
property_id = sources[key]['property_id'],
profile_id = sources[key]['profile_id'],
metrics = "sessions",
dimensions = ['country','city'],
start_date = "2015-09-01",
index_col = ['country','city'],
)
print(source_data[key]['sessions'].sum())
49376 49161
Less than 1% difference in volumes. Since the tracking is not forcefully simultaneous, that was expected. Let's just have a look at one of them.
source_data['full'].head()
sessions | ||
---|---|---|
country | city | |
(not set) | (not set) | 126 |
Afghanistan | (not set) | 2 |
Albania | (not set) | 1 |
Tirana | 12 | |
Algeria | (not set) | 15 |
Let us now join those two dataframes based on their country/city index:
data = pd.concat(source_data, axis=1, join='outer')
# rename homonymous columns
data.columns=['full_ip_sessions', 'anon_ip_sessions']
data.head()
full_ip_sessions | anon_ip_sessions | ||
---|---|---|---|
country | city | ||
(not set) | (not set) | 130 | 126 |
Afghanistan | (not set) | 2 | 2 |
Albania | (not set) | 1 | 1 |
Tirana | 12 | 12 | |
Algeria | (not set) | 14 | 15 |
Let's list the countries where the biggest proportional losses & wins happen.
# group by level 0 of the index (i.e. countries) and sum columns for groups
country_data = data.groupby(level=0).sum()
# compute delta and its proportion
country_data['delta'] = country_data.anon_ip_sessions - country_data.full_ip_sessions
country_data['dprop'] = country_data.delta / country_data.full_ip_sessions
# sort by prop. delta, ascending
country_data.sort_values(by='dprop', inplace=True)
Countries with proportionally large losses:
country_data.query('full_ip_sessions > 200').head(10)
full_ip_sessions | anon_ip_sessions | delta | dprop | |
---|---|---|---|---|
country | ||||
Austria | 387 | 356 | -31 | -0.080103 |
China | 341 | 326 | -15 | -0.043988 |
Japan | 443 | 429 | -14 | -0.031603 |
United States | 7705 | 7465 | -240 | -0.031149 |
Denmark | 242 | 238 | -4 | -0.016529 |
Bulgaria | 207 | 206 | -1 | -0.004831 |
Finland | 275 | 274 | -1 | -0.003636 |
Mexico | 332 | 332 | 0 | 0.000000 |
Taiwan | 278 | 278 | 0 | 0.000000 |
Czech Republic | 420 | 420 | 0 | 0.000000 |
Countries with proportionally large gains:
country_data.query('full_ip_sessions > 200').tail(10)
full_ip_sessions | anon_ip_sessions | delta | dprop | |
---|---|---|---|---|
country | ||||
Philippines | 266 | 272 | 6 | 0.022556 |
Lithuania | 217 | 222 | 5 | 0.023041 |
India | 4142 | 4243 | 101 | 0.024384 |
Indonesia | 430 | 441 | 11 | 0.025581 |
Portugal | 230 | 236 | 6 | 0.026087 |
Malaysia | 233 | 242 | 9 | 0.038627 |
Ukraine | 1100 | 1144 | 44 | 0.040000 |
Ireland | 202 | 211 | 9 | 0.044554 |
Sweden | 478 | 506 | 28 | 0.058577 |
Singapore | 311 | 331 | 20 | 0.064309 |
What's the proportion of the fluctuation?
country_data.delta.map(abs).sum()/country_data.full_ip_sessions.sum()
0.018429242692378105
The deltas are below 10%, either positive or negative. And there's globally less than 2% of country attribution mismatch.
One can then say that Country attribution is largely insensitive to IP anonymization.
Let us dive one level deeper: at city level. We will focus on Switzerland since we have enough traffic from it.
country_data.query('country == "Switzerland"')
full_ip_sessions | anon_ip_sessions | delta | dprop | |
---|---|---|---|---|
country | ||||
Switzerland | 4141 | 4156 | 15 | 0.003622 |
Less than 1 percent loss at country level for Switzerland, rather stable. But what's happening at city level?
# create a clean subset
swiss_data = data.query('country == "Switzerland"').copy()
swiss_data.sum()
full_ip_sessions 4141 anon_ip_sessions 4156 dtype: float64
swiss_data['delta'] = swiss_data.anon_ip_sessions - swiss_data.full_ip_sessions
swiss_data['dprop'] = swiss_data.delta / swiss_data.full_ip_sessions
swiss_data.sort_values(by='dprop', inplace=True)
swiss_data.query('full_ip_sessions > 50')
full_ip_sessions | anon_ip_sessions | delta | dprop | ||
---|---|---|---|---|---|
country | city | ||||
Switzerland | Porrentruy | 138 | 1 | -137 | -0.992754 |
Ebikon | 81 | 1 | -80 | -0.987654 | |
Basel | 162 | 111 | -51 | -0.314815 | |
Lugano | 54 | 42 | -12 | -0.222222 | |
Lucerne | 71 | 58 | -13 | -0.183099 | |
Lausanne | 341 | 345 | 4 | 0.011730 | |
Zurich | 1476 | 1504 | 28 | 0.018970 | |
Bern | 206 | 232 | 26 | 0.126214 | |
Saint Gallen | 86 | 104 | 18 | 0.209302 | |
Winterthur | 77 | 97 | 20 | 0.259740 | |
Geneva | 118 | 189 | 71 | 0.601695 | |
Fribourg | 105 | 300 | 195 | 1.857143 |
# absolute sum of delta
swiss_data.delta.map(abs).sum()/swiss_data.full_ip_sessions.sum()
0.2711905336875151
Quite some turmoil at city level! For example, Fribourg gains 185% of attributions while Basel loses 30%, Something wild going on in Porrentruy and Ebikon, ...
Overall, we see more than 25% mismatch in city attribution for Switzerland.