import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py
from moztelemetry import get_pings, get_pings_properties, get_one_ping_per_client
%pylab inline
Populating the interactive namespace from numpy and matplotlib
How many parallel workers do we have?
sc.defaultParallelism
16
Fetch all submissions from nightly builds made between 2015-07-15 and 2015-07-25:
pings = get_pings(sc, app="Firefox", channel="nightly", build_id=("20150715000000", "20150725999999"), fraction=1.0)
Extract the layout.css.devPixelsPerPx preference (device pixels per CSS pixel) from each ping, keep one ping per client, and filter out the pings where the preference is unset:
subset = get_pings_properties(pings, ["clientID", "environment/settings/userPrefs/layout.css.devPixelsPerPx"])
subset = get_one_ping_per_client(subset)
valid_subset = subset.filter(lambda x: x['environment/settings/userPrefs/layout.css.devPixelsPerPx'] is not None)
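The property names passed to get_pings_properties read as slash-separated paths into each ping's nested JSON. As a rough illustration of that idea (not the library's actual implementation), the hypothetical helper `get_path` below walks a nested dict and returns None when a key is missing, which is the case the filter above removes. The sample pings are invented.

```python
def get_path(ping, path):
    """Follow a slash-separated path through nested dicts; None if absent.

    Hypothetical helper for illustration, not part of moztelemetry.
    """
    node = ping
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

PREF = "environment/settings/userPrefs/layout.css.devPixelsPerPx"

# Invented sample pings: client "a" has the preference set, client "b" does not
pings = [
    {"clientID": "a",
     "environment": {"settings": {"userPrefs": {"layout.css.devPixelsPerPx": "1.5"}}}},
    {"clientID": "b",
     "environment": {"settings": {"userPrefs": {}}}},
]

# Local analogue of get_pings_properties followed by the filter
subset = [{"clientID": p["clientID"], PREF: get_path(p, PREF)} for p in pings]
valid_subset = [p for p in subset if p[PREF] is not None]
```

Splitting only on "/" matters here because the preference name itself contains dots.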
What percentage of users have set this preference?
valid_subset.first()
{'clientID': u'facef8c1-418b-44a2-b8bb-e6383c508b61', 'environment/settings/userPrefs/layout.css.devPixelsPerPx': u'1'}
valid_count, total_count = valid_subset.count(), subset.count()
"{}% ({} of {})".format(100.0 * valid_count / total_count, valid_count, total_count)
'0.261086171757% (196 of 75071)'
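The percentage is only meaningful because each client is counted once. A minimal local sketch of that deduplication, assuming it is enough to keep an arbitrary ping per clientID (the hypothetical `one_per_client` keeps the first one seen; the sample records are invented), followed by the same percentage arithmetic using the counts reported above:

```python
def one_per_client(pings):
    """Keep the first ping seen for each clientID (hypothetical helper)."""
    seen = {}
    for ping in pings:
        seen.setdefault(ping["clientID"], ping)
    return list(seen.values())

# Invented sample: client "a" submitted two pings, client "b" one
sample = [
    {"clientID": "a", "pref": "1.5"},
    {"clientID": "a", "pref": "1.5"},
    {"clientID": "b", "pref": None},
]
unique = one_per_client(sample)

# Percentage arithmetic with the counts from the notebook output
valid_count, total_count = 196, 75071
percentage = 100.0 * valid_count / total_count
summary = "{:.3f}% ({} of {})".format(percentage, valid_count, total_count)
```

Rounding to three decimals gives '0.261% (196 of 75071)', so roughly one nightly client in 400 has changed the preference.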
Cache the filtered RDD so that later actions reuse it instead of recomputing it from the raw pings, which keeps iterative exploration fast:
cached = valid_subset.cache()
Aggregate the settings by their value:
settings = cached.map(lambda x: (x['environment/settings/userPrefs/layout.css.devPixelsPerPx'], 1)).reduceByKey(lambda a, b: a + b).collectAsMap()
settings
{u'-1': 11, u'-1.0': 1, u'0': 2, u'0.8': 1, u'0.85': 2, u'0.9': 1, u'0.92': 1, u'1': 55, u'1,8': 1, u'1.': 1, u'1.0': 28, u'1.00': 1, u'1.05': 10, u'1.1': 9, u'1.15': 2, u'1.2': 10, u'1.25': 10, u'1.3': 3, u'1.333333': 1, u'1.4': 6, u'1.5': 15, u'1.8': 5, u'1.88': 1, u'1.9': 1, u'2': 14, u'2.0': 4}
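The map/reduceByKey pair above is a distributed count-by-key: each record becomes a (value, 1) pair and pairs with equal keys are summed. On a small local list the same result can be sketched with a plain fold or with collections.Counter. The sample values are invented, and `add_pair` is a name introduced here for illustration. The last lines check that the counts in the output above add up to the 196 valid clients reported earlier.

```python
from collections import Counter
from functools import reduce

values = ["1", "1.0", "1", "2", "1.5", "1"]  # invented sample

def add_pair(acc, pair):
    """Fold one (key, count) pair into the running totals."""
    key, count = pair
    acc[key] = acc.get(key, 0) + count
    return acc

# Local analogue of map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
by_reduce = reduce(add_pair, [(v, 1) for v in values], {})
# The same count expressed with the standard library
by_counter = dict(Counter(values))

# Counts copied from the notebook output
settings = {
    "-1": 11, "-1.0": 1, "0": 2, "0.8": 1, "0.85": 2, "0.9": 1, "0.92": 1,
    "1": 55, "1,8": 1, "1.": 1, "1.0": 28, "1.00": 1, "1.05": 10, "1.1": 9,
    "1.15": 2, "1.2": 10, "1.25": 10, "1.3": 3, "1.333333": 1, "1.4": 6,
    "1.5": 15, "1.8": 5, "1.88": 1, "1.9": 1, "2": 14, "2.0": 4,
}
total = sum(settings.values())
```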
Finally, plot the data:
plt.figure(figsize=(15, 7))
pairs = sorted(settings.items(), key=lambda x: x[0])
width = 0.8
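# align='edge' keeps each bar starting at its x position, so the ticks below land at the bar centers
plt.bar(range(len(pairs)), [count for _, count in pairs], width=width, align='edge')
ax = plt.gca()
ax.set_xticks(np.arange(len(pairs)) + width/2)
ax.set_xticklabels([value for value, _ in pairs], rotation=90)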
plt.xlabel("layout.css.devPixelsPerPx Value")
plt.ylabel("Number of client IDs")
plt.show()
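Several labels denote the same number (for example '1', '1.', '1.0', and '1.00'). To group the settings by numerical value, parse each label as a float; labels that do not parse, such as '1,8', are dropped: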
from collections import defaultdict
numerical_settings = defaultdict(int)
for setting, count in settings.items():
    try:
        numerical_settings[float(setting)] += count
    except ValueError:
        # Skip labels that are not valid floats
        pass
plt.figure(figsize=(15, 7))
pairs = sorted(numerical_settings.items(), key=lambda x: x[0])
width = 0.8
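# align='edge' keeps each bar starting at its x position, so the ticks below land at the bar centers
plt.bar(range(len(pairs)), [count for _, count in pairs], width=width, align='edge')
ax = plt.gca()
ax.set_xticks(np.arange(len(pairs)) + width/2)
ax.set_xticklabels([value for value, _ in pairs], rotation=90)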
plt.xlabel("layout.css.devPixelsPerPx Value")
plt.ylabel("Number of client IDs")
plt.show()
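Parsing with float() silently discards the one client whose value is '1,8'. A possible variant, sketched under the assumption that a comma is a locale-style decimal separator (which is a guess about what that user meant), normalizes the label before parsing and counts whatever still fails. `parse_setting` is a hypothetical helper, and the counts are copied from the output above.

```python
from collections import defaultdict

# Counts copied from the notebook output
settings = {
    "-1": 11, "-1.0": 1, "0": 2, "0.8": 1, "0.85": 2, "0.9": 1, "0.92": 1,
    "1": 55, "1,8": 1, "1.": 1, "1.0": 28, "1.00": 1, "1.05": 10, "1.1": 9,
    "1.15": 2, "1.2": 10, "1.25": 10, "1.3": 3, "1.333333": 1, "1.4": 6,
    "1.5": 15, "1.8": 5, "1.88": 1, "1.9": 1, "2": 14, "2.0": 4,
}

def parse_setting(label):
    """Parse a preference label as a float, or return None (hypothetical helper).

    Assumption: a comma is a decimal separator, so '1,8' means 1.8.
    """
    try:
        return float(label.strip().replace(",", "."))
    except ValueError:
        return None

numerical_settings = defaultdict(int)
skipped = 0  # clients whose label still could not be parsed
for label, count in settings.items():
    value = parse_setting(label)
    if value is None:
        skipped += count
    else:
        numerical_settings[value] += count

pairs = sorted(numerical_settings.items())
```

With this variant all 196 clients are retained, and the 26 raw labels collapse into 20 distinct numerical values.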