Extra 3.1 - Historical Provenance - Application 2: CollabMap Data Quality

Assessing the quality of crowdsourced data in CollabMap from their provenance.

In this notebook, we explore the performance of classification using the provenance of a data entity instead of its dependencies (as shown here and in the paper). To distinguish between the two, we call the former historical provenance and the latter forward provenance. Apart from using the historical provenance, all other steps are the same as in the original experiments.

  • Goal: To determine if the provenance network analytics method can identify trustworthy data (i.e. buildings, routes, and route sets) contributed by crowd workers in CollabMap.
  • Classification labels: $\mathcal{L} = \left\{ \textit{trusted}, \textit{uncertain} \right\} $.
  • Training data:
    • Buildings: 5175
    • Routes: 4710
    • Route sets: 4997

Reading data [Changed]

The CollabMap dataset based on historical provenance is provided in the collabmap/ancestor-graphs.csv file; each row corresponds to a building, route, or route set created in the application:

  • id: the identifier of the data entity (i.e. building/route/route set).
  • trust_value: the beta trust value calculated from the votes for the data entity.
  • The remaining columns provide the provenance network metrics calculated from the historical provenance graph of the entity.
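The exact vote aggregation used by CollabMap is not shown in this notebook, but a beta trust value is conventionally the expected value of a Beta distribution over the positive and negative votes. A minimal sketch of that standard formula (an illustration, not CollabMap's actual code):

```python
def beta_trust(positive: int, negative: int) -> float:
    """Expected value of Beta(positive + 1, negative + 1), the beta trust
    score commonly used in reputation systems: (r + 1) / (r + s + 2)."""
    return (positive + 1) / (positive + negative + 2)

# e.g. 5 supporting votes and 1 opposing vote
print(beta_trust(5, 1))  # 0.75
```

With no votes at all, the score is 0.5, i.e. maximal uncertainty; it approaches 1 only as positive votes accumulate.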
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("collabmap/ancestor-graphs.csv", index_col='id')
df.head()
Out[2]:
trust_value entities agents activities nodes edges diameter assortativity acc acc_e ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha
id
Route41053.0 0.833333 2 2 2 6 6 4 0.500000 0.555556 0.666667 ... 2 3 1 2 3 0 0 0 1 -1.0
RouteSet9042.1 0.600000 3 2 2 7 9 4 0.363774 0.750000 0.833333 ... 2 3 1 2 3 0 0 0 1 -1.0
Building19305.0 0.428571 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 1 2 0 0 1 0 0 0 -1 -1.0
Building1136.0 0.428571 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 1 2 0 0 1 0 0 0 -1 -1.0
Building24156.0 0.833333 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 1 2 0 0 1 0 0 0 -1 -1.0

5 rows × 23 columns

In [3]:
df.describe()
Out[3]:
trust_value entities agents activities nodes edges diameter assortativity acc acc_e ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha
count 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 ... 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 14882.0 14882.0 14882.0 14882.000000 14882.000000
mean 0.766706 2.786050 1.808090 2.109461 6.703602 9.253326 3.175581 -0.146027 0.405963 0.461089 ... 2.007996 2.745935 1.000739 1.646486 2.204542 0.0 0.0 0.0 0.670945 -0.991271
std 0.115301 2.300939 1.220964 1.387607 4.825255 11.594616 1.341400 0.631886 0.305110 0.345399 ... 1.017158 0.997692 1.014742 1.381358 1.403419 0.0 0.0 0.0 1.426788 0.230387
min 0.153846 1.000000 1.000000 1.000000 3.000000 2.000000 2.000000 -1.000000 0.000000 0.000000 ... 1.000000 2.000000 0.000000 0.000000 1.000000 0.0 0.0 0.0 -1.000000 -1.000000
25% 0.750000 1.000000 1.000000 1.000000 3.000000 2.000000 2.000000 -1.000000 0.000000 0.000000 ... 1.000000 2.000000 0.000000 0.000000 1.000000 0.0 0.0 0.0 -1.000000 -1.000000
50% 0.800000 2.000000 1.000000 2.000000 6.000000 6.000000 2.000000 0.185369 0.555556 0.661111 ... 2.000000 2.000000 1.000000 2.000000 1.000000 0.0 0.0 0.0 1.000000 -1.000000
75% 0.833333 3.000000 2.000000 2.000000 7.000000 9.000000 4.000000 0.363774 0.622222 0.725000 ... 2.000000 3.000000 1.000000 2.000000 3.000000 0.0 0.0 0.0 1.000000 -1.000000
max 0.965517 28.000000 19.000000 24.000000 71.000000 254.000000 8.000000 0.515307 0.750000 0.833333 ... 12.000000 11.000000 10.000000 11.000000 10.000000 0.0 0.0 0.0 11.000000 7.124572

8 rows × 23 columns

Labelling data

Based on its trust value, we categorise the data entity into two sets: trusted and uncertain. Here, the threshold for the trust value, whose range is [0, 1], is chosen to be 0.75.

In [4]:
trust_threshold = 0.75
df['label'] = df.apply(lambda row: 'Trusted' if row.trust_value >= trust_threshold else 'Uncertain', axis=1)
df.head()  # The new label column is the last column below
Out[4]:
trust_value entities agents activities nodes edges diameter assortativity acc acc_e ... mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha label
id
Route41053.0 0.833333 2 2 2 6 6 4 0.500000 0.555556 0.666667 ... 3 1 2 3 0 0 0 1 -1.0 Trusted
RouteSet9042.1 0.600000 3 2 2 7 9 4 0.363774 0.750000 0.833333 ... 3 1 2 3 0 0 0 1 -1.0 Uncertain
Building19305.0 0.428571 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 2 0 0 1 0 0 0 -1 -1.0 Uncertain
Building1136.0 0.428571 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 2 0 0 1 0 0 0 -1 -1.0 Uncertain
Building24156.0 0.833333 1 1 1 3 2 2 -1.000000 0.000000 0.000000 ... 2 0 0 1 0 0 0 -1 -1.0 Trusted

5 rows × 24 columns

Having used the trust value to label all the data entities, we remove the trust_value column from the data frame.

In [5]:
# We will not use trust value from now on
df.drop('trust_value', axis=1, inplace=True)
df.shape  # the dataframe now has 23 columns (22 metrics + label)
Out[5]:
(14882, 23)

Filtering data

We split the dataset into three: buildings, routes, and route sets.

In [6]:
df_buildings = df.filter(like="Building", axis=0)
df_routes = df.filter(regex=r"^Route\d", axis=0)
df_routesets = df.filter(like="RouteSet", axis=0)
df_buildings.shape, df_routes.shape, df_routesets.shape  # The number of data points in each dataset
Out[6]:
((5175, 23), (4997, 23), (4710, 23))

Balancing Data

This section explores the balance of each of the three datasets and balances them using the SMOTE oversampling method.

In [7]:
from analytics import balance_smote
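balance_smote is a project-specific helper defined in the accompanying analytics module. The core idea of SMOTE is to generate synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours; a minimal numpy sketch of that idea (an illustration only, not the helper's actual implementation):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from the minority-class matrix X_min
    by interpolating towards one of each sample's k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbours = np.argsort(d)[:k]     # indices of the k nearest
        j = rng.choice(neighbours)
        gap = rng.random()                 # interpolation factor in [0, 1)
        new_samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_samples)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=4)
print(X_new.shape)  # (4, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, the oversampled data stays within the region the minority class already occupies.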

Buildings

In [8]:
df_buildings.label.value_counts()
Out[8]:
Trusted      4491
Uncertain     684
Name: label, dtype: int64

Balancing the building dataset:

In [9]:
df_buildings = balance_smote(df_buildings)
Original data shapes: (5175, 22) (5175,)
Balanced data shapes: (8982, 22) (8982,)

Routes

In [10]:
df_routes.label.value_counts()
Out[10]:
Trusted      3908
Uncertain    1089
Name: label, dtype: int64

Balancing the route dataset:

In [11]:
df_routes = balance_smote(df_routes)
Original data shapes: (4997, 22) (4997,)
Balanced data shapes: (7816, 22) (7816,)

Route Sets

In [12]:
df_routesets.label.value_counts()
Out[12]:
Trusted      3019
Uncertain    1691
Name: label, dtype: int64

Balancing the route set dataset:

In [13]:
df_routesets = balance_smote(df_routesets)
Original data shapes: (4710, 22) (4710,)
Balanced data shapes: (6038, 22) (6038,)

Cross Validation

We now run the cross validation tests on the three balanced datasets (df_buildings, df_routes, and df_routesets) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for a detailed description of the cross validation code.

In [14]:
from analytics import test_classification
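test_classification is defined in the accompanying analytics module; as described above, it repeatedly cross-validates a decision tree classifier over the three feature subsets. A hedged sketch of a single such run using scikit-learn, on toy data standing in for the 22 network metrics (the helper's actual code may differ):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X, y, folds=10, seed=0):
    """Mean and std of k-fold cross-validated accuracy of a decision tree."""
    clf = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(clf, X, y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()

# Toy data: 200 samples, 22 features, label separable on the first feature
rng = np.random.default_rng(0)
X = rng.random((200, 22))
y = (X[:, 0] > 0.5).astype(int)
mean_acc, std_acc = cv_accuracy(X, y)
print(round(mean_acc, 2))
```

On cleanly separable toy data the tree scores near 100%; on the real datasets above, the accuracy reflects how much signal the historical provenance metrics actually carry.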

Building Classification

We test the classification of buildings, collecting the individual accuracy scores in results and the feature importances from each test in importances (both are pandas DataFrames). These two tables will also be used to collect data from testing the classification of routes and route sets later.

In [15]:
# Cross validation test on building classification
res, imps = test_classification(df_buildings)

# adding the Data Type column
res['Data Type'] = 'Building'
imps['Data Type'] = 'Building'

# storing the results and importance of features
results = res
importances = imps

# showing a few newest rows
results.tail()
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in multiply
  lower_bound = self.a * scale + loc
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1908: RuntimeWarning: invalid value encountered in multiply
  upper_bound = self.b * scale + loc
Accuracy: 50.00% ±nan <-- combined
Accuracy: 50.00% ±nan <-- generic
Accuracy: 50.00% ±nan <-- provenance
Out[15]:
Accuracy Metrics Data Type
2995 0.5 provenance Building
2996 0.5 provenance Building
2997 0.5 provenance Building
2998 0.5 provenance Building
2999 0.5 provenance Building

Route Classification

In [16]:
# Cross validation test on route classification
res, imps = test_classification(df_routes)

# adding the Data Type column
res['Data Type'] = 'Route'
imps['Data Type'] = 'Route'

# storing the results and importance of features
results = pd.concat([results, res], ignore_index=True)
importances = pd.concat([importances, imps], ignore_index=True)

# showing a few newest rows
results.tail()
Accuracy: 61.20% ±0.0983 <-- combined
Accuracy: 61.19% ±0.1001 <-- generic
Accuracy: 61.23% ±0.0972 <-- provenance
Out[16]:
Accuracy Metrics Data Type
5995 0.593350 provenance Route
5996 0.606138 provenance Route
5997 0.631714 provenance Route
5998 0.607692 provenance Route
5999 0.612821 provenance Route

Route Set Classification

In [17]:
# Cross validation test on route set classification
res, imps = test_classification(df_routesets)

# adding the Data Type column
res['Data Type'] = 'Route Set'
imps['Data Type'] = 'Route Set'

# storing the results and importance of features
results = pd.concat([results, res], ignore_index=True)
importances = pd.concat([importances, imps], ignore_index=True)

# showing a few newest rows
results.tail()
Accuracy: 63.38% ±0.0974 <-- combined
Accuracy: 63.31% ±0.0972 <-- generic
Accuracy: 63.42% ±0.0970 <-- provenance
Out[17]:
Accuracy Metrics Data Type
8995 0.678808 provenance Route Set
8996 0.637417 provenance Route Set
8997 0.645695 provenance Route Set
8998 0.639073 provenance Route Set
8999 0.661130 provenance Route Set

Charting the accuracy scores

In [20]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper", font_scale=1.4)

Converting the accuracy score from [0, 1] to percentage, i.e [0, 100]:

In [21]:
results.Accuracy = results.Accuracy * 100
results.head()
Out[21]:
Accuracy Metrics Data Type
0 50.0 combined Building
1 50.0 combined Building
2 50.0 combined Building
3 50.0 combined Building
4 50.0 combined Building
In [22]:
from matplotlib.font_manager import FontProperties
fontP = FontProperties()
fontP.set_size(12)
In [34]:
pal = sns.light_palette("seagreen", n_colors=3, reverse=True)
plot = sns.barplot(x="Data Type", y="Accuracy", hue='Metrics', palette=pal, errwidth=1, capsize=0.02, data=results)
plot.set_ylim(40, 90)
plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.0), ncol=3)
plot.set_ylabel('Accuracy (%)')
Out[34]:
<matplotlib.text.Text at 0x1167f3668>

Results: We cannot use the provenance of buildings in our approach to assess their quality because their historical provenance graphs all share the same topology. There are small correlations between the provenance of routes/route sets and their quality. However, as shown above, the decision tree classifier's accuracy in predicting their quality is very low, 61% and 63%, compared with 97% and 96% when using the dependency graphs (i.e. the forward provenance). Note that the baseline accuracy for random selection in this application is 50%.