The goal here is to determine if what Snpeff's label means (generally) the same thing as VEP's label means the same thing as what ANNOVAR's label. Unfortunately each classifier uses slightly differt terms so we need to normalize the terms over all three tools. There is some hope that all tools will adopt the terms as defined by The Sequence Ontology Project, but as of May 2014 we still need create this mapping.
First, let's take a look at the terms each tool uses.
SNPeff classifies variants by region and effect.
VEP uses the sequence ontology terms as the SO project defines them.
ANNOVAR does something similar to SNPeff in that it breaks it's classification into a region and an effect
I can't decide what resolution the buckets should be (ie feature type, or just variant category-LOF, Missense, Syn, etc). Note that SNPeff catagorizes variants that chage the start/stop codon to another start/stop codon as both synonymous and non-synonomous. Also note that VEP doesn't differentiate between SNPs and mod 3 indels. Lastly, the annotation of splice site variants is tricky. ANNOVAR annotates anything within 2bp of a splice site (on either side). This is similar to the splice_donor_variant and splice_acceptor_variant, except what with this classification variants on the exonic side of the splice site are not counted. The splice_region_variant has no analogous match in ANNOVAR and will be ignored in this analysis.
Normalized SO Name | ANNOVAR | VEP | SNPeff |
---|---|---|---|
frameshift_variant | frameshift_deletion, frameshift_insertion, frameshift_block_substitution | frameshift_variant | FRAME_SHIFT |
stop_gained | stopgain | stop_gained | STOP_GAINED |
stop_lost | stoploss | stop_lost | STOP_LOST |
splicing_variant | splicing | splice_donor_variant, splice_acceptor_variant | SPLICE_SITE_DONOR, SPLICE_SITE_ACCEPTOR |
inframe_variant | nonframeshift_deletion, nonframeshift_insertion | inframe_insertion, inframe_deletion | CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION |
nonsynonymous_variant | nonsynonymous_SNV, nonframeshift_block_substitution | initiator_codon_variant, missense_variant, stop_retained_variant, incomplete_terminal_codon_variant | CODON_CHANGE, NON_SYNONYMOUS_CODING, NON_SYNONYMOUS_START, NON_SYNONYMOUS_STOP, START_LOST |
synonymous_variant | synonymous_SNV | synonymous_variant | SYNONYMOUS_CODING, SYNONYMOUS_START, SYNONYMOUS_STOP |
3_prime_UTR_variant | UTR3 | 3_prime_UTR_variant | UTR_3_PRIME, UTR_3_DELETED |
5_prime_UTR_variant | UTR5 | 5_prime_UTR_variant | UTR_5_PRIME, UTR_5_DELETED, START_GAINED |
upstream_gene_variant | upstream | upstream_gene_variant | UPSTREAM |
downstream_gene_variant | downstream | downstream_gene_variant | DOWNSTREAM |
regulatory_region_variant | N/A | regulatory_region_variant, regulatory_region_ablation, regulatory_region_amplification | REGULATION |
intron_variant | intronic | intron_variant | INTRON, INTRON_CONSERVED |
intergenic_variant | intergenic | intergenic_variant | INTERGENIC, INTERGENIC_CONSERVED |
Ignored | unknown, exonic, ncRNA | transcript_ablation , coding_sequence_variant, splice_region_variant, feature_truncation, feature_elongation, TF_binding_site_variant, TFBS_amplification, TFBS_ablation, NMD_transcript_variant, nc_transcript_variant, non_coding_exon_variant, mature_miRNA_variant | EXON, GENE, EXON_DELETED, CDS, CHROMOSOME, SPLICE_SITE_REGION, SPLICE_SITE_BRANCH, SPLICE_SITE_BRANCH_U12, MICRO_RNA, INTRAGENIC |
from IPython.display import HTML
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn
%pylab inline
Populating the interactive namespace from numpy and matplotlib
snpeff_map = {"SNPeff":"Normalized SO Name",
"FRAME_SHIFT":"frameshift_variant",
"STOP_GAINED":"stop_gained",
"STOP_LOST":"stop_lost",
"SPLICE_SITE_DONOR": "splicing_variant",
"SPLICE_SITE_ACCEPTOR": "splicing_variant",
"CODON_INSERTION": "inframe_variant",
"CODON_CHANGE_PLUS_CODON_INSERTION": "inframe_variant",
"CODON_DELETION": "inframe_variant",
"CODON_CHANGE_PLUS_CODON_DELETION": "inframe_variant",
"NON_SYNONYMOUS_CODING": "nonsynonymous_variant",
"NON_SYNONYMOUS_START": "nonsynonymous_variant",
"NON_SYNONYMOUS_STOP": "nonsynonymous_variant",
"START_LOST": "nonsynonymous_variant",
"CODON_CHANGE": "nonsynonymous_variant",
"SYNONYMOUS_CODING": "synonymous_variant",
"SYNONYMOUS_START": "synonymous_variant",
"SYNONYMOUS_STOP": "synonymous_variant",
"UTR_3_PRIME": "3_prime_UTR_variant",
"UTR_3_DELETED": "3_prime_UTR_variant",
"UTR_5_PRIME": "5_prime_UTR_variant",
"UTR_5_DELETED": "5_prime_UTR_variant",
"START_GAINED": "5_prime_UTR_variant",
"UPSTREAM":"upstream_gene_variant",
"DOWNSTREAM":"downstream_gene_variant",
"REGULATION":"regulatory_region_variant",
"INTRON": "intron_variant",
"INTRON_CONSERVED": "intron_variant",
"INTERGENIC":"intergenic_variant",
"INTERGENIC_CONSERVED":"intergenic_variant",
"EXON": "ignored",
"GENE": "ignored",
"EXON_DELETED": "ignored",
"CDS": "ignored",
"CHROMOSOME ": "ignored",
"INTRAGENIC":"ignored",
"MICRO_RNA":"ignored",
"SPLICE_SITE_REGION": "ignored",
"SPLICE_SITE_BRANCH": "ignored",
"SPLICE_SITE_BRANCH_U12": "ignored"}
#annovar is not consistent with their naming of functional consequences, hence the multiple representations of frameshift_deletion, etc.
annovar_map = {
"frameshift deletion": "frameshift_variant",
"frameshift_deletion": "frameshift_variant",
"frameshift insertion": "frameshift_variant",
"frameshift_insertion": "frameshift_variant",
"frameshift_block_substitution": "frameshift_variant",
"stopgain": "stop_gained",
"stopgain SNV": "stop_gained",
"stoploss": "stop_lost",
"stoploss SNV": "stop_lost",
"splicing": "splicing_variant",
"nonframeshift_deletion": "inframe_variant",
"nonframeshift deletion": "inframe_variant",
"nonframeshift_insertion": "inframe_variant",
"nonframeshift insertion": "inframe_variant",
"nonframeshift_block_substitution": "nonsynonymous_variant",
"nonsynonymous_SNV": "nonsynonymous_variant",
"nonsynonymous SNV": "nonsynonymous_variant",
"synonymous_SNV": "synonymous_variant",
"synonymous SNV": "synonymous_variant",
"UTR3": "3_prime_UTR_variant",
"UTR5": "5_prime_UTR_variant",
"ncRNA": "ignored",
"ncRNA_exonic": "ignored",
"ncRNA_UTR3": "ignored",
"ncRNA_UTR5": "ignored",
"ncRNA_intronic": "ignored",
"upstream": "upstream_gene_variant",
"downstream": "downstream_gene_variant",
"intronic": "intron_variant",
"intergenic": "intergenic_variant",
"unknown": "ignored",
"exonic": "ignored"}
vep_map= {"frameshift_variant": "frameshift_variant",
"stop_gained": "stop_gained",
"stop_lost": "stop_lost",
"splice_donor_variant": "splicing_variant",
"splice_acceptor_variant": "splicing_variant",
"inframe_insertion": "inframe_variant",
"inframe_deletion": "inframe_variant",
"initiator_codon_variant": "nonsynonymous_variant",
"missense_variant": "nonsynonymous_variant",
"stop_retained_variant": "nonsynonymous_variant",
"incomplete_terminal_codon_variant": "nonsynonymous_variant",
"synonymous_variant": "synonymous_variant",
"3_prime_UTR_variant": "3_prime_UTR_variant",
"5_prime_UTR_variant": "5_prime_UTR_variant",
"upstream_gene_variant": "upstream_gene_variant",
"downstream_gene_variant": "downstream_gene_variant",
"regulatory_region_variant": "regulatory_region_variant",
"regulatory_region_ablation": "regulatory_region_variant",
"regulatory_region_amplification": "regulatory_region_variant",
"intron_variant": "intron_variant",
"intergenic_variant": "intergenic_variant",
"transcript_ablation": "ignored",
"coding_sequence_variant": "ignored",
"splice_region_variant": "ignored",
"feature_truncation": "ignored",
"feature_elongation": "ignored",
"TF_binding_site_variant": "ignored",
"TFBS_amplification": "ignored",
"TFBS_ablation": "ignored",
"NMD_transcript_variant": "ignored",
"nc_transcript_variant": "ignored",
"non_coding_exon_variant": "ignored",
"mature_miRNA_variant": "ignored"}
with pd.get_store('classified_variant_store.h5') as store:
print store.keys()
['/cftr_annovar_ensembl', '/cftr_annovar_ensembl_subset', '/cftr_annovar_refseq', '/cftr_annovar_refseq_subset', '/cftr_snpeff_ensembl', '/cftr_snpeff_ensembl_subset', '/cftr_snpeff_refseq', '/cftr_snpeff_refseq_subset', '/cftr_vep_ensembl', '/cftr_vep_ensembl_subset', '/cftr_vep_refseq', '/cftr_vep_refseq_subset', '/vep_subset_ensembl']
for snpeff_store in ["cftr_snpeff_ensembl", "cftr_snpeff_refseq"]:
with pd.get_store('classified_variant_store.h5') as store:
snpeff_df = store.get(snpeff_store)
snpeff_df["normalized_so"] = snpeff_df.apply(lambda row: snpeff_map[row["Effect"]], axis=1)
subset_snpeff = snpeff_df[["POS", "ID", "REF", "ALT", "Effect", "Gene_Name", "Transcript_ID", "Amino_Acid_Change"]]
subset_snpeff.rename(columns={"Amino_Acid_Change": "hgvs_snpeff"}, inplace=True)
subset_snpeff["normalized_so_snpeff"] = subset_snpeff.apply(lambda row: snpeff_map[row["Effect"]], axis=1)
store[snpeff_store + "_subset"] = subset_snpeff
print(str.title(" ".join(snpeff_store.split("_")) + " Metrics"))
print subset_snpeff["Effect"].value_counts()
print("\n")
print subset_snpeff.count()
print("\n\n")
del snpeff_df
del subset_snpeff
Cftr Snpeff Ensembl Metrics INTRON 439230 FRAME_SHIFT 113758 DOWNSTREAM 99185 UPSTREAM 83809 NON_SYNONYMOUS_CODING 30988 EXON 30736 SPLICE_SITE_REGION 23430 CODON_CHANGE_PLUS_CODON_INSERTION 15905 CODON_INSERTION 12429 UTR_5_PRIME 10280 SYNONYMOUS_CODING 9371 STOP_GAINED 8127 CODON_CHANGE_PLUS_CODON_DELETION 7606 UTR_3_PRIME 6664 CODON_DELETION 6403 SPLICE_SITE_ACCEPTOR 2976 SPLICE_SITE_DONOR 2976 INTERGENIC 1403 START_GAINED 286 START_LOST 125 STOP_LOST 115 NON_SYNONYMOUS_START 8 UTR_5_DELETED 6 SYNONYMOUS_STOP 5 INTRAGENIC 3 dtype: int64 POS 905824 ID 905824 REF 905824 ALT 905824 Effect 905824 Gene_Name 905824 Transcript_ID 905824 hgvs_snpeff 905824 normalized_so_snpeff 905824 dtype: int64 Cftr Snpeff Refseq Metrics INTRON 88950 FRAME_SHIFT 35675 INTERGENIC 32500 UPSTREAM 14129 NON_SYNONYMOUS_CODING 9721 DOWNSTREAM 9173 SPLICE_SITE_REGION 6690 CODON_CHANGE_PLUS_CODON_INSERTION 4996 CODON_INSERTION 3890 UTR_3_PRIME 2982 SYNONYMOUS_CODING 2940 STOP_GAINED 2557 CODON_CHANGE_PLUS_CODON_DELETION 2377 CODON_DELETION 2016 UTR_5_PRIME 1851 SPLICE_SITE_ACCEPTOR 806 SPLICE_SITE_DONOR 806 START_GAINED 44 START_LOST 31 STOP_LOST 30 INTRAGENIC 6 NON_SYNONYMOUS_START 2 SYNONYMOUS_STOP 1 dtype: int64 POS 222173 ID 222173 REF 222173 ALT 222173 Effect 222173 Gene_Name 222173 Transcript_ID 222173 hgvs_snpeff 222173 normalized_so_snpeff 222173 dtype: int64
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/frame.py:2175: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame **kwargs)
for vep_store in ["cftr_vep_ensembl", "cftr_vep_refseq"]:
with pd.get_store('classified_variant_store.h5') as store:
vep_df = store.get(vep_store)
vep_df["normalized_so"] = vep_df.apply(lambda row: vep_map[row["Consequence"]], axis=1)
subset_vep = vep_df[["POS", "ID", "REF", "ALT", "Gene", "Consequence", "Feature", "HGVSc", "HGVSp" ]]
subset_vep["hgvs_vep"] = subset_vep["HGVSp"].map(str) + subset_vep["HGVSc"]
del subset_vep["HGVSp"]
del subset_vep["HGVSc"]
subset_vep["normalized_so_vep"] = subset_vep.apply(lambda row: vep_map[row["Consequence"]], axis=1)
store[vep_store + "_subset"] = subset_vep
print(str.title(" ".join(vep_store.split("_")) + " Metrics"))
print subset_vep["Consequence"].value_counts()
print("\n")
print subset_vep.count()
print("\n\n")
del vep_df
del subset_vep
Cftr Vep Ensembl Metrics intron_variant 434622 feature_elongation 362882 feature_truncation 132807 nc_transcript_variant 128229 frameshift_variant 113706 downstream_gene_variant 99177 upstream_gene_variant 83853 missense_variant 30988 non_coding_exon_variant 30608 inframe_insertion 28494 splice_region_variant 24192 NMD_transcript_variant 23212 regulatory_region_variant 16236 inframe_deletion 13995 5_prime_UTR_variant 10257 synonymous_variant 9359 3_prime_UTR_variant 6654 stop_gained 4877 splice_donor_variant 2208 splice_acceptor_variant 2208 coding_sequence_variant 590 stop_lost 127 incomplete_terminal_codon_variant 56 initiator_codon_variant 55 stop_retained_variant 9 dtype: int64 POS 1559401 ID 1559401 REF 1559401 ALT 1559401 Gene 1559401 Consequence 1559401 Feature 1559401 hgvs_vep 1559401 normalized_so_vep 1559401 dtype: int64 Cftr Vep Refseq Metrics intron_variant 283392 feature_elongation 254590 frameshift_variant 105713 downstream_gene_variant 97429 feature_truncation 92432 upstream_gene_variant 87006 missense_variant 28780 inframe_insertion 26482 splice_region_variant 19656 regulatory_region_variant 16236 inframe_deletion 13006 5_prime_UTR_variant 11277 3_prime_UTR_variant 9279 intergenic_variant 9198 synonymous_variant 8687 stop_gained 4582 splice_acceptor_variant 1794 splice_donor_variant 1794 coding_sequence_variant 510 initiator_codon_variant 141 stop_lost 130 stop_retained_variant 6 dtype: int64 POS 1072120 ID 1072120 REF 1072120 ALT 1072120 Gene 1072120 Consequence 1072120 Feature 1072120 hgvs_vep 1072120 normalized_so_vep 1072120 dtype: int64
for annovar_store in ["cftr_annovar_ensembl", "cftr_annovar_refseq"]:
with pd.get_store('classified_variant_store.h5') as store:
annovar_df = store.get(annovar_store)
annovar_df["normalized_so"] = annovar_df.apply(lambda row: annovar_map[row["combined_effect"]], axis=1)
subset_annovar = annovar_df[["POS", "REF", "ALT", "Gene", "combined_effect", "hgvs" ]]
subset_annovar["normalized_so_annovar"] = subset_annovar.apply(lambda row: annovar_map[row["combined_effect"]], axis=1)
store[annovar_store + "_subset"] = subset_annovar
print(str.title(" ".join(annovar_store.split("_")) + " Metrics"))
print subset_annovar["combined_effect"].value_counts()
print("\n")
print subset_annovar.count()
print("\n\n")
del annovar_df
del subset_annovar
Cftr Annovar Ensembl Metrics intronic 86193 frameshift insertion 26252 ncRNA_intronic 15085 nonsynonymous SNV 9730 frameshift deletion 8811 nonframeshift insertion 8111 intergenic 7924 UTR5 5704 nonframeshift deletion 4387 UTR3 3918 synonymous SNV 2941 stopgain SNV 2031 unknown 1875 splicing 1590 ncRNA_exonic 1515 upstream 1408 ncRNA_UTR3 969 stoploss SNV 20 ncRNA_UTR5 8 dtype: int64 POS 188472 REF 188472 ALT 188472 Gene 180548 combined_effect 188472 hgvs 65454 normalized_so_annovar 188472 dtype: int64 Cftr Annovar Refseq Metrics intronic 87338 frameshift insertion 26252 intergenic 13650 nonsynonymous SNV 9730 upstream 9682 downstream 9178 frameshift deletion 8808 nonframeshift insertion 8111 nonframeshift deletion 4390 UTR3 2963 synonymous SNV 2941 stopgain SNV 2031 UTR5 1848 splicing 1456 stoploss SNV 20 dtype: int64 POS 188398 REF 188398 ALT 188398 Gene 174748 combined_effect 188398 hgvs 63531 normalized_so_annovar 188398 dtype: int64
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block1_values] [items->['REF', 'ALT', 'Gene', 'combined_effect', 'hgvs', 'normalized_so_annovar']] warnings.warn(ws, PerformanceWarning)
for tx_set in ["ensembl", "refseq"]:
with pd.get_store('classified_variant_store.h5') as store:
snpeff_df = store.get("cftr_snpeff_" + tx_set + "_subset")
vep_df = store.get("cftr_vep_" + tx_set + "_subset")
annovar_df = store.get("cftr_annovar_" + tx_set + "_subset")
vc_snpeff = snpeff_df.groupby(["normalized_so_snpeff"]).size()
vc_snpeff.name = "SNPeff"
vc_vep = vep_df.groupby(["normalized_so_vep"]).size()
vc_vep.name = "VEP"
vc_annovar = annovar_df.groupby(["normalized_so_annovar"]).size()
vc_annovar.name = "Annovar"
vc_df = pd.DataFrame([vc_snpeff, vc_vep, vc_annovar])
vc_df.transpose().plot(kind="barh", fontsize=13, figsize=(16,8))
print vc_df.transpose()
SNPeff VEP Annovar 3_prime_UTR_variant 6664 6654 3918 5_prime_UTR_variant 10572 10257 5704 downstream_gene_variant 99185 99177 NaN frameshift_variant 113758 113706 35063 ignored 54169 702520 19452 inframe_variant 42343 42489 12498 intergenic_variant 1403 NaN 7924 intron_variant 439230 434622 86193 nonsynonymous_variant 31121 31108 9730 regulatory_region_variant NaN 16236 NaN splicing_variant 5952 4416 1590 stop_gained 8127 4877 2031 stop_lost 115 127 20 synonymous_variant 9376 9359 2941 upstream_gene_variant 83809 83853 1408 [15 rows x 3 columns] SNPeff VEP Annovar 3_prime_UTR_variant 2982 9279 2963 5_prime_UTR_variant 1895 11277 1848 downstream_gene_variant 9173 97429 9178 frameshift_variant 35675 105713 35060 ignored 6696 367188 NaN inframe_variant 13279 39488 12501 intergenic_variant 32500 9198 13650 intron_variant 88950 283392 87338 nonsynonymous_variant 9754 28927 9730 regulatory_region_variant NaN 16236 NaN splicing_variant 1612 3588 1456 stop_gained 2557 4582 2031 stop_lost 30 130 20 synonymous_variant 2941 8687 2941 upstream_gene_variant 14129 87006 9682 [15 rows x 3 columns]