Here we analyze the TCGA HNSCC molecular validation cohort. For these data we depart from the data-versioned Firehose run and use a new MAF file and copy number matrix.
import NotebookImport
from Imports import *
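f = '../Data/MAFs/PR_TCGA_HNSC_PAIR_Capture_All_Pairs_QCPASS_v4.aggregated.capture.tcga.uuid.automated.somatic.maf.txt'
# Skip the first four lines of the file before the column header
mut_new = pd.read_table(f, skiprows=4, low_memory=False)
# Drop silent and non-coding variant classes
keep = ~mut_new.Variant_Classification.isin(['Silent', 'Intron', "3'UTR", "5'UTR"])
mut_new = mut_new[keep]
# Truncate sample barcodes to the 12-character patient barcode
mut_new['barcode'] = mut_new.Tumor_Sample_Barcode.map(lambda s: s[:12])
# Gene-by-patient matrix of mutation counts
mut_new = mut_new.groupby(['barcode','Hugo_Symbol']).size().unstack().fillna(0).T
# Matching genes and patients from the versioned Firehose mutation calls
mut_old = mut.df.ix[mut_new.index, mut_new.columns].dropna([0,1], how='all')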
del_3p = cn.features.ix['Deletion'].ix['3p14.2']
del_3p.name = '3p_deletion'
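As a rough illustration of the MAF processing above, here is a minimal sketch on a toy table, written against current pandas rather than the older API used in this notebook. The sample barcodes and mutation rows are made up; only the column names and the list of filtered variant classes come from the code above.

```python
import pandas as pd

# Toy MAF-like table (hypothetical rows; column names follow the MAF fields used above)
maf = pd.DataFrame({
    'Hugo_Symbol': ['TP53', 'TP53', 'CASP8', 'TP53', 'NOTCH1'],
    'Variant_Classification': ['Missense_Mutation', 'Silent',
                               'Nonsense_Mutation', 'Missense_Mutation',
                               'Intron'],
    'Tumor_Sample_Barcode': ['TCGA-AA-0001-01A-11D', 'TCGA-AA-0001-01A-11D',
                             'TCGA-AA-0002-01A-11D', 'TCGA-AA-0002-01A-11D',
                             'TCGA-AA-0003-01A-11D'],
})

# Drop silent and non-coding variant classes
non_coding = ['Silent', 'Intron', "3'UTR", "5'UTR"]
kept = maf[~maf['Variant_Classification'].isin(non_coding)].copy()

# The first 12 characters of a TCGA sample barcode identify the patient
kept['barcode'] = kept['Tumor_Sample_Barcode'].str[:12]

# Gene-by-patient matrix of mutation counts, zero where no call was made
counts = (kept.groupby(['barcode', 'Hugo_Symbol']).size()
              .unstack(fill_value=0).T)
print(counts)
```

One thing this toy case makes visible: a patient whose only calls are filtered out (the third barcode here) does not appear as a column at all, so a missing column is not the same as a patient with zero mutations.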
I downloaded an updated version of the GISTIC gene-by-patient matrix from the April 16, 2014 Firehose run. This has calls for 511 patients, as opposed to 452 in the January 15th run.
f = '../Extra_Data/FH_HNSC__4_16_all_data_thresholded_by_genes.txt'
gistic = pd.read_table(f, index_col=[2, 1, 0], low_memory=False)
gistic = FH.fix_barcode_columns(gistic, tissue_code='01')
del_3p = gistic.ix['3p14.2'].median(0)
del_3p.name = '3p_deletion'
mut_all = mut.df.combine_first(mut_new)
clinical_cohort = mut.df.columns
molecular_cohort = mut_new.columns.diff(mut.features.columns)
hpv_neg_cohort = mut_all.columns.intersection(true_index(hpv == 0))
molecular_cohort_n = molecular_cohort.intersection(hpv_neg_cohort)
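The 3p deletion call above takes the median thresholded GISTIC value across all genes in the 3p14.2 cytoband for each patient, and later treats a negative median as a deletion. A minimal sketch of that step on toy data follows. The matrix values, patient labels, and the `GENE_B`/`GENE_C` placeholders are hypothetical, and the two-level (cytoband, gene) row index is a simplification of the three-level index read from the file.

```python
import pandas as pd

# Hypothetical thresholded GISTIC calls (-2..2), rows indexed by (cytoband, gene)
idx = pd.MultiIndex.from_tuples(
    [('3p14.2', 'FHIT'), ('3p14.2', 'GENE_B'), ('3p14.2', 'GENE_C'),
     ('3q26.33', 'SOX2')],
    names=['Cytoband', 'Gene'])
gistic_toy = pd.DataFrame(
    {'P1': [-1, -1, 0, 1],
     'P2': [0, 0, -1, 0],
     'P3': [-2, -1, -1, 2]},
    index=idx)

# Select every gene in the cytoband, then summarize each patient by the median call
band = gistic_toy.loc['3p14.2']
del_call = band.median(axis=0)
del_call.name = '3p_deletion'

# A negative median is read as a deletion of the region
is_deleted = del_call < 0
print(is_deleted)
```

Using the median means a single discordant gene within the band (as for `P2` here) does not flip the call for the whole region.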
The association between HPV status and TP53 mutation is highly significant but weaker in the validation data. This could be due to less accurate mutation calls or less accurate HPV status assignment, as most of these patients' HPV statuses were inferred from the expression data.
cohorts = {'Discovery': clinical_cohort, 'Validation': molecular_cohort, 'All': mut_all.columns}
hpv.name = 'HPV'
ct = pd.concat({c: combine(hpv, mut_all.ix['TP53']>0).ix[s].value_counts()
                for c,s in cohorts.iteritems()}, axis=1)
ct.ix[['neither','HPV','TP53','both'],['Discovery','Validation','All']]
| | Discovery | Validation | All |
|---|---|---|---|
| neither | 52 | 32 | 84 |
| HPV | 41 | 31 | 72 |
| TP53 | 211 | 139 | 350 |
| both | 2 | 3 | 5 |
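The `combine` helper used above comes from the notebook's own imports and its implementation is not shown here. Judging only from the labels in the output, it appears to classify each patient by which of two events they carry. The following is a hypothetical stand-in (`combine_sketch`), not the original function, run on made-up data.

```python
import pandas as pd

def combine_sketch(a, b):
    """Label each shared sample as 'neither', a.name, b.name, or 'both'.

    Hypothetical re-implementation for illustration only.
    """
    # Keep only samples present in both series
    a, b = a.align(b, join='inner')
    a, b = a.astype(bool), b.astype(bool)
    labels = pd.Series('neither', index=a.index)
    labels[a & ~b] = a.name
    labels[~a & b] = b.name
    labels[a & b] = 'both'
    return labels

# Toy event calls for five hypothetical patients
patients = ['P1', 'P2', 'P3', 'P4', 'P5']
hpv_toy = pd.Series([True, False, False, True, False],
                    index=patients, name='HPV')
tp53_toy = pd.Series([False, True, False, True, True],
                     index=patients, name='TP53')

counts = combine_sketch(hpv_toy, tp53_toy).value_counts()
print(counts)
```

The four counts form the cells of the 2x2 contingency table that the Fisher's exact tests in this notebook operate on.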
stats = pd.concat({c: fisher_exact_test(hpv.ix[s], mut_all.ix['TP53'].ix[s]>0)
                   for c,s in cohorts.iteritems()}, axis=1)
stats[['Discovery','Validation','All']]
| | Discovery | Validation | All |
|---|---|---|---|
| odds_ratio | 1.20e-02 | 2.23e-02 | 1.67e-02 |
| p | 1.70e-22 | 5.90e-16 | 2.93e-37 |
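The odds ratios above can be checked directly from the contingency counts. As a sketch, the Discovery column of the HPV/TP53 count table gives the 2x2 table below, tested here with `scipy.stats.fisher_exact`. Whether the notebook's `fisher_exact_test` helper wraps this function, and whether it uses a two-sided alternative, are assumptions on my part.

```python
from scipy.stats import fisher_exact

# Discovery counts from the HPV / TP53 table:
# rows are TP53 wild type / mutated, columns are HPV- / HPV+
table = [[52, 41],    # neither, HPV only
         [211, 2]]    # TP53 only, both
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print('odds ratio = %.2e, p = %.2e' % (odds_ratio, p_value))
```

The sample odds ratio is (52 × 2) / (41 × 211) ≈ 0.012, which agrees with the Discovery value reported above. An odds ratio this far below 1 indicates that TP53 mutation and HPV infection rarely co-occur.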
Note that the HPV- cohort contains a couple of patients from the original TCGA analysis set who were filtered out due to missing data or old age.
cohorts = {'Discovery': keepers_o, 'Validation': molecular_cohort_n, 'HPV-': hpv_neg_cohort}
ct = pd.concat({c: combine(mut_all.ix['TP53'].ix[s].dropna()>0, del_3p<0).value_counts()
                for c,s in cohorts.iteritems()}, axis=1)
ct.ix[['neither','3p_deletion','TP53','both'],['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| neither | 22 | 18 | 42 |
| 3p_deletion | 26 | 7 | 33 |
| TP53 | 23 | 20 | 45 |
| both | 179 | 81 | 265 |
ct.sum()
Discovery     250
HPV-          385
Validation    126
dtype: int64
stats = pd.concat({c: fisher_exact_test(mut_all.ix['TP53'].ix[s]>0, del_3p<0)
                   for c,s in cohorts.iteritems()}, axis=1)
stats[['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| odds_ratio | 6.59e+00 | 1.04e+01 | 7.49e+00 |
| p | 3.56e-07 | 1.44e-06 | 8.11e-13 |
Here we show that 3p has the strongest association with TP53 mutation of any chromosomal segment in both the discovery and validation cohorts.
cn.features.index = cn.features.index.droplevel(2)
r1 = screen_feature(mut_all.ix['TP53'].ix[molecular_cohort_n] > 0, fisher_exact_test,
                    cn.features.ix['Deletion'] < 0)
r2 = screen_feature(mut_all.ix['TP53'].ix[molecular_cohort_n] > 0, fisher_exact_test,
                    cn.features.ix['Amplification'] > 0)
r3 = screen_feature(mut_all.ix['TP53'].ix[keepers_o] > 0, fisher_exact_test,
                    cn.features.ix['Deletion'] < 0)
r4 = screen_feature(mut_all.ix['TP53'].ix[keepers_o] > 0, fisher_exact_test,
                    cn.features.ix['Amplification'] > 0)
v1 = pd.concat([r3, r1], keys=['Discovery','Validation'], axis=1).sort([('Discovery','p')])
v2 = pd.concat([r4, r2], keys=['Discovery','Validation'], axis=1).sort([('Discovery','p')])
v3 = pd.concat([v1.head(6), v2.head(6)], keys=['Deletion','Amplification'])
v3.columns = v3.columns.swaplevel(0,1)
v3 = v3.sort_index(axis=1)
del v3['q']
v3[('q','bonf')] = pd.concat([v3.p.Discovery['Deletion'] * len(r3),
                              v3.p.Discovery['Amplification'] * len(r4)],
                             keys=['Deletion','Amplification'])
v3
| | | odds_ratio (Discovery) | odds_ratio (Validation) | p (Discovery) | p (Validation) | q (bonf) |
|---|---|---|---|---|---|---|
| Deletion | 3p14.3 | 6.59 | 10.41 | 3.56e-07 | 1.44e-06 | 1.71e-05 |
| | 3p14.2 | 6.59 | 10.41 | 3.56e-07 | 1.44e-06 | 1.71e-05 |
| | 3p25.3 | 5.49 | 9.23 | 1.81e-06 | 4.17e-06 | 8.68e-05 |
| | 3p12.2 | 5.07 | 8.72 | 6.28e-06 | 6.87e-06 | 3.02e-04 |
| | 10p15.3 | 4.00 | 6.10 | 6.74e-04 | 7.33e-03 | 3.24e-02 |
| | 11q23.1 | 3.23 | 3.04 | 1.12e-03 | 5.79e-02 | 5.36e-02 |
| Amplification | 3q26.33 | 6.35 | 7.82 | 9.00e-08 | 2.43e-05 | 2.34e-06 |
| | 8q24.21 | 4.24 | 2.70 | 3.98e-05 | 7.84e-02 | 1.04e-03 |
| | 12p13.33 | 3.21 | 1.46 | 2.63e-03 | 6.12e-01 | 6.85e-02 |
| | 9p24.1 | 0.40 | 0.76 | 8.88e-03 | 6.05e-01 | 2.31e-01 |
| | 18p11.31 | 2.61 | 4.41 | 1.69e-02 | 3.91e-02 | 4.40e-01 |
| | 8q11.21 | 2.13 | 2.41 | 2.15e-02 | 5.88e-02 | 5.60e-01 |
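The screen above tests TP53 mutation against every copy number feature and then applies a Bonferroni correction by multiplying each discovery p-value by the number of features tested. A minimal sketch of that pattern is below. `screen_sketch` is a hypothetical stand-in for the notebook's `screen_feature` helper, the data are made up, and capping the corrected value at 1 is my own addition (the code above does not cap).

```python
import pandas as pd
from scipy.stats import fisher_exact

def screen_sketch(target, features):
    """Fisher's exact test of a boolean target against each boolean feature row."""
    rows = {}
    for name, feat in features.iterrows():
        # Force a full 2x2 table even if one category is empty
        table = pd.crosstab(target, feat.astype(bool)).reindex(
            index=[False, True], columns=[False, True], fill_value=0)
        odds, p = fisher_exact(table.values)
        rows[name] = {'odds_ratio': odds, 'p': p}
    res = pd.DataFrame(rows).T.sort_values('p')
    # Bonferroni: multiply by the number of tests, capped at 1
    res['q_bonf'] = (res['p'] * len(res)).clip(upper=1.0)
    return res

# Toy data: 12 hypothetical patients, half with the target event
patients = ['P%d' % i for i in range(12)]
target = pd.Series([True] * 6 + [False] * 6, index=patients, name='TP53')
features = pd.DataFrame(
    [[1] * 6 + [0] * 6,          # band_a: tracks the target exactly
     [1, 0] * 6,                 # band_b: unrelated to the target
     [0] * 6 + [1] * 5 + [0]],   # band_c: nearly exclusive with the target
    index=['band_a', 'band_b', 'band_c'], columns=patients).astype(bool)

res = screen_sketch(target, features)
print(res)
```

Bonferroni correction is conservative, so features that survive it (such as the 3p bands and 3q26.33 above) are robust to the number of segments tested.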
combo_all = combine(mut_all.ix['TP53']>0, del_3p<0)
two_hit = combo_all == 'both'
two_hit.name = 'TP53-3p'
ct = pd.concat({c: combine(two_hit, mut_all.ix['CASP8']>0).ix[s].value_counts()
                for c,s in cohorts.iteritems()}, axis=1)
ct.ix[['neither','CASP8','TP53-3p','both'],['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| neither | 45 | 26 | 71 |
| CASP8 | 11 | 18 | 31 |
| TP53-3p | 184 | 116 | 304 |
| both | 10 | 9 | 22 |
stats = pd.concat({c: fisher_exact_test(two_hit.ix[s], mut_all.ix['CASP8']>0)
                   for c,s in cohorts.iteritems()}, axis=1)
stats[['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| odds_ratio | 0.22 | 1.12e-01 | 1.66e-01 |
| p | 0.00 | 1.21e-06 | 5.77e-09 |
combo_all = combine(mut_all.ix['TP53']>0, del_3p<0)
two_hit = combo_all == 'both'
two_hit.name = 'TP53-3p'
gs = run.gene_sets['REACTOME_SOS_MEDIATED_SIGNALLING']
sos1_pathway = mut_all.ix[gs].sum()>0
sos1_pathway.name = 'SOS1 Pathway'
ct = pd.concat({c: combine(two_hit, sos1_pathway>0).ix[s].value_counts()
                for c,s in cohorts.iteritems()}, axis=1)
ct.ix[['neither','SOS1 Pathway','TP53-3p','both'],['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| neither | 41 | 26 | 67 |
| SOS1 Pathway | 15 | 18 | 35 |
| TP53-3p | 186 | 117 | 308 |
| both | 8 | 8 | 18 |
stats = pd.concat({c: fisher_exact_test(two_hit.ix[s], sos1_pathway)
                   for c,s in cohorts.iteritems()}, axis=1)
stats[['Discovery','Validation','HPV-']]
| | Discovery | Validation | HPV- |
|---|---|---|---|
| odds_ratio | 1.18e-01 | 9.88e-02 | 1.12e-01 |
| p | 4.04e-06 | 4.85e-07 | 2.01e-12 |
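The pathway-level call used above marks a patient as altered if any gene in the gene set carries a mutation. A minimal sketch of that collapse, written against current pandas, is below. The mutation matrix is made up, and the three-gene list is purely illustrative rather than the actual membership of `REACTOME_SOS_MEDIATED_SIGNALLING`.

```python
import pandas as pd

# Toy gene-by-patient mutation count matrix (hypothetical values)
mut_toy = pd.DataFrame(
    {'P1': [1, 0, 0], 'P2': [0, 0, 2], 'P3': [0, 1, 0], 'P4': [0, 0, 0]},
    index=['SOS1', 'TP53', 'KRAS'])

# Illustrative gene set; GRB2 is deliberately absent from the matrix
gene_set = ['SOS1', 'KRAS', 'GRB2']

# Recent pandas raises a KeyError for missing labels in .loc,
# so restrict the set to genes that are actually in the matrix
present = [g for g in gene_set if g in mut_toy.index]

# A patient is pathway-mutated if any member gene has at least one mutation
pathway_hit = mut_toy.loc[present].sum(axis=0) > 0
pathway_hit.name = 'SOS1 Pathway'
print(pathway_hit)
```

Collapsing to the pathway level pools rare events across genes, which can give an exclusivity test more power than testing each member gene on its own.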