Application 1: ProvStore Documents

Identifying owners of provenance documents from their provenance network metrics

• Goal: To determine if the provenance network analytics method can identify the owner of a provenance document from its provenance network metrics.
• Training data: To ensure that there are sufficient samples to represent each user's provenance documents in the training phase, we limit our experiment to users who have at least 20 documents. There are fourteen such users (the authors were excluded to avoid bias), whom we name $u_{1}, u_{2}, \ldots, u_{14}$. Their numbers of documents range from 21 to 6,745, and the total number of documents in the dataset is 13,870.
• Classification labels: $\mathcal{L} = \left\{ u_1, u_2, \ldots, u_{14} \right\}$, where $l_{x} = u_i$ if the provenance document $x$ belongs to user $u_i$. Hence, there are 14 labels in total.

Reading data

For each provenance document, we calculate the 22 provenance network metrics. The dataset provided contains those metric values for 13,870 provenance documents, along with the owner identifier of each document (i.e. $u_{1}, u_{2}, \ldots, u_{14}$).

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("provstore/data.csv")
df.head()
Out[2]:
label entities agents activities nodes edges diameter assortativity acc acc_e ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha
0 u_3 17 5 9 31 49 6 -0.196362 0.444709 0.466667 ... 5 8 4 2 5 0 0 0 3 -1.0
1 u_2 7 0 2 9 0 -1 -1.000000 0.000000 0.000000 ... 0 0 0 0 0 0 0 0 -1 -1.0
2 u_2 7 0 2 9 0 -1 -1.000000 0.000000 0.000000 ... 0 0 0 0 0 0 0 0 -1 -1.0
3 u_2 7 0 2 9 0 -1 -1.000000 0.000000 0.000000 ... 0 0 0 0 0 0 0 0 -1 -1.0
4 u_2 7 0 2 9 0 -1 -1.000000 0.000000 0.000000 ... 0 0 0 0 0 0 0 0 -1 -1.0

5 rows × 23 columns

In [3]:
df.describe()
Out[3]:
entities agents activities nodes edges diameter assortativity acc acc_e acc_a ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha
count 13870.000000 13870.00000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 ... 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000
mean 9.913338 2.08695 1.836193 13.836482 19.212689 0.868926 -0.628690 0.347835 0.341142 0.323606 ... 1.312761 1.754939 1.073540 0.709229 0.752127 0.017448 0.014924 0.030353 2.185436 -0.916534
std 28.931915 2.27716 18.570823 43.352894 134.640366 1.943905 0.376718 0.394531 0.409577 0.395727 ... 1.769329 1.314874 1.622606 1.343363 1.077628 0.200902 0.152351 0.209759 5.211118 0.612437
min 0.000000 0.00000 0.000000 1.000000 0.000000 -1.000000 -1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 -1.000000
25% 2.000000 1.00000 0.000000 5.000000 5.000000 -1.000000 -1.000000 0.000000 0.000000 0.000000 ... 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 -1.000000
50% 4.000000 1.00000 1.000000 7.000000 9.000000 1.000000 -0.592949 0.000000 0.000000 0.000000 ... 1.000000 2.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 2.000000 -1.000000
75% 5.000000 3.00000 2.000000 10.000000 13.000000 2.000000 -0.350000 0.674147 0.750000 0.666667 ... 2.000000 2.000000 2.000000 1.000000 1.000000 0.000000 0.000000 0.000000 2.000000 -1.000000
max 1188.000000 51.00000 1580.000000 2776.000000 6853.000000 10.000000 1.000000 1.000000 1.000000 1.000000 ... 52.000000 44.000000 51.000000 52.000000 43.000000 4.000000 5.000000 6.000000 303.000000 8.184413

8 rows × 22 columns

In [4]:
# Number of documents per label in the dataset
df.label.value_counts()
Out[4]:
u_3     6745
u_8     4449
u_5     1327
u_2      487
u_12     312
u_14     150
u_9      141
u_6       71
u_7       66
u_4       34
u_1       25
u_13      21
u_10      21
u_11      21
Name: label, dtype: int64
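
To put the imbalance in numbers, the short sketch below recomputes, from the counts printed above, the share of the largest class and the ratio between the largest and smallest classes. The variable names are illustrative and are not part of the original notebook.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "u_3": 6745, "u_8": 4449, "u_5": 1327, "u_2": 487, "u_12": 312,
    "u_14": 150, "u_9": 141, "u_6": 71, "u_7": 66, "u_4": 34,
    "u_1": 25, "u_13": 21, "u_10": 21, "u_11": 21,
}

total = sum(label_counts.values())

# A trivial classifier that always predicts the most frequent label
# would reach an accuracy equal to that label's share of the data.
majority_label = max(label_counts, key=label_counts.get)
majority_share = label_counts[majority_label] / total

# Ratio between the largest and the smallest class.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

print(f"total documents: {total}")
print(f"majority label: {majority_label} ({majority_share:.1%} of all documents)")
print(f"largest/smallest class ratio: {imbalance_ratio:.0f}:1")
```

Always guessing the majority label would be correct for a little under half of the documents, which is a useful baseline to keep in mind when reading accuracy figures on the unbalanced data.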

Experiment

In [5]:
from analytics import balance_smote, test_classification

Balancing the data

With an unbalanced dataset like the one above, the resulting trained classifier will typically be skewed towards the majority labels. To mitigate this, we balance the dataset using the SMOTE (Synthetic Minority Over-sampling Technique) oversampling method.

In [6]:
df = balance_smote(df)
Original data shapes: (13870, 22) (13870,)
Balanced data shapes: (94430, 22) (94430,)
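
The balance_smote function comes from the accompanying analytics module, whose source is not shown in this section. As a rough illustration of the idea behind SMOTE, the dependency-free sketch below creates synthetic minority samples by interpolating between a sample and one of its nearest same-class neighbours. The function smote_oversample and the toy data are hypothetical; this is not the implementation used in the notebook.

```python
import math
import random


def smote_oversample(samples, n_target, k=2, seed=0):
    """Grow `samples` (feature tuples of ONE minority class) to `n_target`
    points by SMOTE-style interpolation. Hypothetical helper for illustration."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate between")
    rng = random.Random(seed)
    synthetic = []
    while len(samples) + len(synthetic) < n_target:
        base = rng.choice(samples)
        # k nearest neighbours of `base` among the original samples (excluding itself)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return samples + synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]
balanced = smote_oversample(minority, n_target=8)
print(len(balanced))
```

The balanced shape printed above is consistent with every class being oversampled to the size of the largest one: 14 × 6,745 = 94,430.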

Cross Validation tests: We now run the cross validation tests on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.

In [7]:
results, importances = test_classification(df)
Accuracy: 98.13% ±0.0080 <-- combined
Accuracy: 92.32% ±0.0157 <-- generic
Accuracy: 98.11% ±0.0079 <-- provenance

Result: The output above shows the accuracy of the classifier in identifying the owner of a provenance document from ProvStore using all provenance network metrics (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance).

The individual accuracy scores are stored in results and the importance of every feature in each test in importances (both are pandas DataFrame objects).
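
The test_classification function is defined in the analytics module and documented in Cross Validation Code.ipynb, so its internals are not reproduced here. The sketch below only illustrates the general shape of a k-fold cross-validation test: split the data into folds, train on all but one, score on the held-out fold, and summarise the per-fold accuracies. A simple nearest-centroid classifier is swapped in to keep the example dependency-free, all function names and the toy data are hypothetical, and reporting the sample standard deviation after the ± sign is an assumption about what that figure denotes.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(train_X, train_y):
    """Compute the mean feature vector (centroid) of each label."""
    rows_by_label = {}
    for features, label in zip(train_X, train_y):
        rows_by_label.setdefault(label, []).append(features)
    return {
        label: [statistics.fmean(column) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def cross_validate(X, y, k=5, seed=0):
    """Return one accuracy score per held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        centroids = fit_centroids(train_X, train_y)
        correct = sum(predict(centroids, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Toy data: two well-separated classes in two dimensions (made-up values).
rng = random.Random(42)
X = (
    [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(50)]
)
y = ["u_a"] * 50 + ["u_b"] * 50

scores = cross_validate(X, y, k=5)
print(f"Accuracy: {statistics.fmean(scores):.2%} ±{statistics.stdev(scores):.4f}")
```

Running the same procedure on different column subsets of the feature matrix is one way to obtain separate scores for the combined, generic, and provenance feature sets.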

Saving experiments' results (optional)

Optionally, we can save the test results so that they do not need to be recomputed the next time we want to explore them:

In [8]:
results.to_pickle("provstore/results.pkl")
importances.to_pickle("provstore/importances.pkl")

Next time, we can reload the results as follows:

In [9]:
import pandas as pd
results = pd.read_pickle("provstore/results.pkl")
importances = pd.read_pickle("provstore/importances.pkl")
results.shape, importances.shape  # showing the shape of the data (for checking)
Out[9]:
((3000, 2), (1000, 22))

The importance of features

We plot the importance of each feature (from the combined test, which used all 22 features).

In [10]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("talk")
In [11]:
# Rename the columns with Math notation for consistency with the metrics symbols in the paper
feature_name_maths_mapping = {
    "entities": "$n_e$", "agents": "$n_{ag}$", "activities": "$n_a$", "nodes": "$n$", "edges": "$e$",
    "diameter": "$d$", "assortativity": "$r$", "acc": "$\\mathsf{ACC}$",
    "acc_e": "$\\mathsf{ACC}_e$", "acc_a": "$\\mathsf{ACC}_a$", "acc_ag": "$\\mathsf{ACC}_{ag}$",
    "mfd_e_e": "$\\mathrm{mfd}_{e \\rightarrow e}$", "mfd_e_a": "$\\mathrm{mfd}_{e \\rightarrow a}$",
    "mfd_e_ag": "$\\mathrm{mfd}_{e \\rightarrow ag}$", "mfd_a_e": "$\\mathrm{mfd}_{a \\rightarrow e}$",
    "mfd_a_a": "$\\mathrm{mfd}_{a \\rightarrow a}$", "mfd_a_ag": "$\\mathrm{mfd}_{a \\rightarrow ag}$",
    "mfd_ag_e": "$\\mathrm{mfd}_{ag \\rightarrow e}$", "mfd_ag_a": "$\\mathrm{mfd}_{ag \\rightarrow a}$",
    "mfd_ag_ag": "$\\mathrm{mfd}_{ag \\rightarrow ag}$", "mfd_der": "$\\mathrm{mfd}_\\mathit{der}$",
    "powerlaw_alpha": "$\\alpha$",
}
importances.rename(columns=feature_name_maths_mapping, inplace=True)
In [12]:
plot = sns.barplot(data=importances)
# Rotate the x-axis tick labels so that the feature names do not overlap
for i in plot.get_xticklabels():
    i.set_rotation(90)

From the above chart, the three most important features for this application are: $n_e$, $\mathrm{mfd}_{e \rightarrow ag}$, and $\mathsf{ACC}$.
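
The ranking read off the chart can also be obtained numerically by averaging each feature's importance over all test runs and sorting. The sketch below shows the idea on a small dictionary of made-up scores; the values and the name toy_importances are purely illustrative and are not taken from the experiment.

```python
import statistics

# Hypothetical importance scores: one list per feature, one value per test run.
toy_importances = {
    "entities": [0.30, 0.28, 0.32],
    "mfd_e_ag": [0.20, 0.22, 0.21],
    "acc": [0.15, 0.14, 0.16],
    "diameter": [0.05, 0.06, 0.04],
}

# Average each feature's importance over the runs, then sort in descending order.
mean_importance = {
    name: statistics.fmean(values) for name, values in toy_importances.items()
}
top_three = sorted(mean_importance, key=mean_importance.get, reverse=True)[:3]
print(top_three)
```

On the importances DataFrame itself, a column-wise mean followed by a descending sort, for example importances.mean().sort_values(ascending=False), should give the corresponding ranking.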