HNSCC HPV- Cohort Predicting Clinical Variables

Setup

Initialization
Read in Pre-processed Data

Clinical variables
Support Vector Inference

Smoking
Drinking
Perineural Invasion
Extra-Capsular Spread
Save processed clinical variables

Setup

Initialization

In [1]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [2]:

cd ../src

/cellar/users/agross/TCGA_Code/TCGA/src

In [3]:

from Processing.Imports import *

Read in Pre-processed Data

In [4]:

params = pd.read_table('../global_params.txt', header=None, squeeze=True, 
                       index_col=0)

In [5]:

run_path  = '{}/Firehose__{}/'.format(params.ix['OUT_PATH'], params.ix['RUN_DATE'])
run = get_run(run_path, 'Run_' + params.ix['VERSION'])
cancer = run.load_cancer(params.ix['CANCER'])
clinical = cancer.load_clinical()

mut = cancer.load_data('Mutation')
mut.uncompress()
cn = cancer.load_data('CN_broad')
cn.uncompress()

In [6]:

rna = pickle.load(open(cancer.path + '/mRNASeq/store/no_hpv.p', 'rb'))
#meth = pickle.load(open(cancer.path + '/Methylation/store/no_hpv.p', 'rb'))
mirna = pickle.load(open(cancer.path + '/miRNASeq/store/no_hpv.p', 'rb'))

In [7]:

surv = clinical.survival.survival_5y
age = clinical.clinical.age
hpv_inferred = clinical.hpv_inferred

In [8]:

keepers_o = true_index(hpv_inferred==0)
keepers_o = keepers_o.intersection(mut.features.columns)
keepers_o = keepers_o.intersection(cn.features.columns)
keepers_o = keepers_o.intersection(surv.unstack().index)
keepers_o = keepers_o.intersection(rna.features.columns)
keepers_o = keepers_o.intersection(mirna.features.columns)
keepers_o = keepers_o.intersection(true_index(age < 85))

Clinical variables

The clinical variables need some scrubbing to make them interpretable
Tumor subdivisions are merged into larger sub-groups to make comparative analysis more feasible

In [17]:

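stage = clinical.stage.pathologicstage.ix[keepers_o].fillna('nx')
# Stripping every 'a' and 'b' drops the sub-stage suffixes, but it also
# turns 'stage' into 'stge'; the last line restores it as 'Stage'
stage = stage.dropna().map(lambda s: s.replace('a','').replace('b',''))
stage = stage.map(lambda s: s.replace('stge','Stage'))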

lymph_stage = clinical.stage.pathologicn.ix[keepers_o]
lymph_stage = lymph_stage.dropna().map(lambda s: s[:2])

old_age = (age >= 75).map({True: 'Age > 75', False: 'Age < 75'})
pack_years = py = clinical.clinical.numberpackyearssmoked.dropna().astype(float)

group = [['oral tongue','oral cavity','floor of mouth','buccal mucosa','alveolar ridge','hard palate','lip'],
         ['oropharynx','tonsil','base of tongue'],
         #['hypopharynx'],
         ['larynx']]
groups = ['oral cavity','oropharynx','larynx']
tumor_subdivision = pd.Series({idx: groups[i] for i,g in enumerate(group) for idx,j in 
                               clinical.clinical.anatomicneoplasmsubdivision.iteritems() 
                               if j.lower() in g})

invasion = clinical.clinical.perineuralinvasionpresent.replace('nan', nan)
invasion = invasion.str.lower()

spread = clinical.clinical.presenceofpathologicalnodalextracapsularspread
spread = spread.map(str.lower, na_action='ignore')
spread = spread.map({'no extranodal extension': 'no', 'microscopic extension':'yes', 
                     'gross extension':'yes'}).dropna()

year = clinical.clinical.yearofinitialpathologicdiagnosis
year = year.replace('[Discrepancy]', nan).astype(float)

lymph = lymph_stage.ix[keepers_o] != 'n0'
lymph_status = combine(lymph, spread.ix[keepers_o]=='yes')
lymph_status = lymph_status.map({'neither': 'n0', lymph.name: 'lymph_node', 'both': 'extra_capsular_spread'})
#lymph_status = (lymph_status == 'extra_capsular_spread').astype(float)
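
The site-to-subdivision grouping in the cell above can also be written as a flat lookup table. This is a minimal sketch on made-up patient IDs (P1 to P5 are hypothetical), not the notebook's own code. It assumes sites are stored as free-text strings and that sites outside the three groups (such as hypopharynx, which is commented out above) should be dropped.

```python
import pandas as pd

# Same three groups as in the notebook cell above.
SUBDIVISIONS = {
    'oral cavity': ['oral tongue', 'oral cavity', 'floor of mouth', 'buccal mucosa',
                    'alveolar ridge', 'hard palate', 'lip'],
    'oropharynx': ['oropharynx', 'tonsil', 'base of tongue'],
    'larynx': ['larynx'],
}
# Invert to a flat {site: subdivision} lookup.
site_to_subdivision = {site: name
                       for name, sites in SUBDIVISIONS.items()
                       for site in sites}

# Hypothetical input, standing in for clinical.clinical.anatomicneoplasmsubdivision.
sites = pd.Series({'P1': 'Oral Tongue', 'P2': 'Tonsil', 'P3': 'Larynx',
                   'P4': 'Hypopharynx', 'P5': None})

# Lower-case, map through the lookup, and drop unmapped or missing sites.
tumor_subdivision_demo = sites.str.lower().map(site_to_subdivision).dropna()
print(tumor_subdivision_demo)
```

Unlike calling `j.lower()` inside a comprehension, `.str.lower()` passes missing values through instead of raising on them.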

Support Vector Inference

Uses the SVC classifier from the scikit-learn package
Features are the binarized differential expression vectors
- Genes with a high change in expression from tumor to normal
- Thresholded at 1 standard deviation over the mean to reduce overfitting
Parameters are fit using cross-validation, optimizing for AUC score
- I try linear, RBF, and polynomial kernels under a variety of parameters
- The best model in cross-validation is refit on the entire dataset
Missing values are filled in based on the model prediction

In [18]:

from Stats.Classification import SVC_fill
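
SVC_fill lives in the project's Stats.Classification module and its source is not shown in this notebook. The procedure described above can be approximated with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the author's implementation: the function name svc_fill_sketch, the parameter grid, the fold count, and the synthetic data are all illustrative, and the returned keys simply mirror the ones used later in the notebook (auc, decision_function, filled_feature). The binarization step is left out; features are assumed to be prepared already.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def svc_fill_sketch(labels, features, n_folds=3, seed=0):
    """Fit an SVC on labeled patients and predict a label for every patient.

    labels:   boolean Series indexed by patient (labeled patients only)
    features: DataFrame, genes x patients (assumed layout)
    """
    X = features.T                                  # patients x genes
    labels = labels.dropna()
    labeled = X.index.intersection(labels.index)
    y = labels.loc[labeled].astype(int)

    # Illustrative grid over the three kernel families named in the text.
    param_grid = [
        {'kernel': ['linear'], 'C': [0.1, 1, 10]},
        {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]},
        {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3]},
    ]
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # scoring='roc_auc' selects by AUC; refit=True refits the best
    # configuration on all labeled patients.
    search = GridSearchCV(SVC(), param_grid, scoring='roc_auc', cv=cv, refit=True)
    search.fit(X.loc[labeled], y)

    return {
        'auc': search.best_score_,  # mean cross-validated AUC of the best model
        'decision_function': pd.Series(search.decision_function(X), index=X.index),
        'filled_feature': pd.Series(search.predict(X), index=X.index).astype(bool),
    }


# Synthetic demo: 40 hypothetical patients, 5 genes, labels known for 30.
rng = np.random.RandomState(0)
patients = ['P%02d' % i for i in range(40)]
truth = np.arange(40) % 2 == 0
features = pd.DataFrame(rng.normal(size=(5, 40)),
                        index=['g%d' % i for i in range(5)], columns=patients)
features.loc['g0'] = np.where(truth, 2.0, -2.0) + rng.normal(scale=0.5, size=40)
labels = pd.Series(truth[:30], index=patients[:30])

ret_demo = svc_fill_sketch(labels, features)
inferred_demo = labels.combine_first(ret_demo['filled_feature']).astype(float)
print(ret_demo['auc'])
```

The AUC returned here is a cross-validated mean, so it is not necessarily computed the same way as the `ret['auc']` values reported in the cells below.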

Smoking

I'm fitting the model on lifelong non-smokers versus current smokers and filling in the rest
I use the background genes here because this should not be a cancer-specific event
Much of the variation in the background is tissue-specific, so I drop the genes whose expression changes across tissues
I use a small test set because we are fitting extreme cases and want to overfit a little

In [19]:

smoker = clinical.clinical.tobaccosmokinghistory.str.lower()
smoker_binary = smoker[smoker.isin(['current smoker','lifelong non-smoker'])] == 'current smoker'

In [20]:

smoker.value_counts()

Out[20]:

current smoker                                 128
current reformed smoker for < or = 15 years    101
lifelong non-smoker                             80
current reformed smoker for > 15 years          59
dtype: int64

In [21]:

ret = SVC_fill(smoker_binary, rna.features.ix['real'])
ret['auc']

Out[21]:

0.90839694656488545

In [22]:

figsize(6,4)
fun = ret['decision_function']
o = ['current smoker','current reformed smoker for < or = 15 years',
     'current reformed smoker for > 15 years', 'lifelong non-smoker']
violin_plot_pandas(smoker, fun, order = o)
ax = plt.gca()
t = ax.set_xticklabels(o, rotation=20)
prettify_ax(ax)

In [23]:

get_surv_fit_lr(surv, smoker_binary.ix[keepers_o].fillna('Missing'))

Out[23]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          7.66  0.0217
Missing  123         50        4       2.3    NaN    0.487  0.39   0.609
True     84          38        1.6     1.35   NaN    0.284  0.146  0.553
False    44          14        4.71    2.96   NaN    0.393  0.19   0.813

4 rows × 10 columns

In [24]:

smoker_inferred = 1.*smoker_binary.combine_first(ret['filled_feature'])
smoker_inferred.name = 'smoker_inferred'
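
The fill step above keeps every observed label and only falls back to the model where a label is missing, which is what `Series.combine_first` does. A toy illustration with made-up patient IDs and predictions (none of these values come from the cohort):

```python
import pandas as pd

# Observed labels exist for only two hypothetical patients.
observed = pd.Series({'P1': True, 'P2': False})
# Model predictions exist for everyone, and disagree with the observation for P1.
predicted = pd.Series({'P1': False, 'P2': False, 'P3': True, 'P4': False})

# Observed values win; predictions only fill the gaps (P3 and P4).
inferred_toy = observed.combine_first(predicted).astype(float)
print(inferred_toy)
```

This is why the cross-tabulation of inferred against recorded status shows no disagreement for current smokers and lifelong non-smokers: those patients never receive a predicted label.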

In [25]:

pd.crosstab(smoker_inferred, smoker.ix[smoker_inferred.index].fillna('M')).T

Out[25]:

smoker_inferred                              0.0  1.0
tobaccosmokinghistory
M                                            4    6
current reformed smoker for < or = 15 years  22   52
current reformed smoker for > 15 years       27   17
current smoker                               0    128
lifelong non-smoker                          80   0

5 rows × 2 columns

In [26]:

get_surv_fit_lr(clinical.survival.survival, smoker_inferred)

Out[26]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          12.8  0.000342
1        203         91        2.21    1.6    3.29   0.323  0.237  0.439
0        132         41        5.48    4.71   NaN    0.553  0.428  0.713

3 rows × 10 columns
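
The Log-Rank columns report a chi-squared statistic and p-value for the difference between the groups' survival curves. get_surv_fit_lr is the project's own helper and its internals are not shown here, so the following is only a rough illustration of the standard two-group log-rank test on right-censored data. The function name and the toy data are hypothetical.

```python
import math
import numpy as np


def logrank_two_groups(time, event, group):
    """Two-group log-rank test; returns (chi2 statistic, p-value with 1 df).

    time:  follow-up time per patient
    event: True if the event (death) was observed, False if censored
    group: True for group 1, False for group 0
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)

    obs_minus_exp = 0.0
    variance = 0.0
    for t in np.unique(time[event]):          # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                     # patients at risk
        n1 = (at_risk & group).sum()          # of which in group 1
        died = event & (time == t)
        d = died.sum()                        # events at this time
        d1 = (died & group).sum()             # of which in group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:                             # hypergeometric variance term
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    stat = obs_minus_exp ** 2 / variance
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data: group 1 has the five shortest survival times, no censoring.
toy_time = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_event = [True] * 10
toy_group = [True] * 5 + [False] * 5
toy_stat, toy_p = logrank_two_groups(toy_time, toy_event, toy_group)
print(toy_stat, toy_p)
```

Comparisons across more than two groups, such as the Missing/True/False table, need the multi-group form of the test with more degrees of freedom, which this sketch does not cover.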

In [27]:

survival_and_stats(smoker_binary.ix[smoker_inferred.index], surv)

In [28]:

si = smoker_inferred.map({1:'smoker_inf', 0:'non-smoker_inf'})
s = smoker_binary.map({True:'smoker', False:'non-smoker'})
survival_and_stats(s.combine_first(si), surv)

Drinking

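This is a tough one to split. The idea is to fit on patients at the extremes:
- Fewer than 7 drinks per week
- More than 14 drinks per week
For now the two-sided filter is commented out in the code, and patients are split at a single cutoff of 10 drinks per week
I use a small test set because we are fitting extreme cases and want to overfit a little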

In [29]:

figsize(6,4)
clinical.clinical.amountofalcoholconsumptionperday.astype(float).hist()

Out[29]:

<matplotlib.axes.AxesSubplot at 0xb8ff990>

In [30]:

clinical.clinical.alcoholhistorydocumented.value_counts()

Out[30]:

yes    254
no     118
dtype: int64

In [31]:

freq = clinical.clinical.frequencyofalcoholconsumption.astype(float)
count = clinical.clinical.amountofalcoholconsumptionperday.astype(float)
drinker = (freq * count).dropna()
#drinker = drinker[(drinker < 8) + (drinker > 14)]
drinker = drinker.ix[keepers_o].dropna() > 10
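
The two ways of labeling drinkers differ in who gets a label at all. The sketch below uses made-up values and assumes that frequency is recorded in days per week and amount in drinks per day, so that their product is drinks per week; the 7 and 14 cutoffs follow the text above and the cutoff of 10 follows the code.

```python
import numpy as np
import pandas as pd

# Hypothetical patients; P4 has no recorded frequency.
freq_toy = pd.Series({'P1': 7, 'P2': 2, 'P3': 5, 'P4': np.nan}, dtype=float)
count_toy = pd.Series({'P1': 3, 'P2': 2, 'P3': 2, 'P4': 4}, dtype=float)

# Drinks per week; patients with a missing component are dropped.
weekly = (freq_toy * count_toy).dropna()

# Single cutoff: every patient with data gets a label.
single_cutoff = weekly > 10

# Two-sided filter: only the extremes are labeled, and the middle
# (here P3, at exactly 10 drinks per week) is left for the model to fill in.
extremes = weekly[(weekly < 7) | (weekly > 14)] > 14

print(single_cutoff)
print(extremes)
```

With the single cutoff, borderline patients are used for training; with the two-sided filter they are treated as unlabeled, which matches the "fit on extreme cases" strategy used for smoking.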

In [32]:

drinker.value_counts()

Out[32]:

True     57
False    38
dtype: int64

In [34]:

ret = SVC_fill(drinker, rna.features.ix['real'])
ret['auc']

Out[34]:

0.88421052631578945

In [35]:

fun = ret['decision_function']
violin_plot_pandas(drinker, fun)

In [36]:

series_scatter((freq * count).dropna(), fun)
xlim(-1,20)

Out[36]:

(-1, 20)

In [37]:

get_surv_fit_lr(surv, drinker*1.)

Out[37]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          3.1   0.0783
1        57          24        2.5     1.79   NaN    0.319  0.176  0.577
0        38          8         NaN     NaN    NaN    0.732  0.583  0.919

3 rows × 10 columns

In [38]:

drinker_inferred = 1.*drinker.combine_first(ret['filled_feature'])

In [39]:

get_surv_fit_lr(surv, drinker_inferred)

Out[39]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          6.03  0.0141
1        179         84        2.2     1.6    3.53   0.351  0.264  0.467
0        80          25        NaN     4.49   NaN    0.562  0.431  0.734

3 rows × 10 columns

In [40]:

si = drinker_inferred.map({1:'drinker_inf', 0:'non-drinker_inf'})
s = drinker.map({True:'drinker', False:'non-drinker'})
survival_and_stats(s.combine_first(si), surv)

In [41]:

survival_and_stats(drinker_inferred, surv)

Perineural Invasion

This is part of the pathological evaluation; we are missing it for about 100 patients
We expect this to affect tumor-specific signatures, so we use differentially expressed genes

In [42]:

invasion.value_counts()

Out[42]:

no     137
yes    126
dtype: int64

In [43]:

ret = SVC_fill(invasion[invasion.isin(['yes','no'])]=='yes', 
               rna.features.ix['real'])
ret['auc']

Out[43]:

0.84375

In [44]:

fun = ret['decision_function']
violin_plot_pandas(invasion, fun)

In [45]:

invasion_inferred = 1.*(invasion.dropna()=='yes').combine_first(ret['filled_feature'])

In [46]:

get_surv_fit_lr(surv, invasion)

Out[46]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          6.88  0.032
no       137         34        NaN     4      NaN    0.506  0.356  0.72
yes      126         54        2.58    1.49   NaN    0.388  0.285  0.529

3 rows × 10 columns

In [47]:

get_surv_fit_lr(surv, invasion_inferred)

Out[47]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          4.87  0.0273
0        179         55        4.71    2.84   NaN    0.439  0.325  0.594
1        151         68        2.54    1.54   4.49   0.369  0.277  0.493

3 rows × 10 columns

In [48]:

survival_and_stats(invasion_inferred, surv)

In [49]:

survival_and_stats(invasion.ix[invasion_inferred.index].combine_first(invasion_inferred), surv)

Extra-Capsular Spread

Similar to perineural invasion

In [50]:

ret = SVC_fill(spread=='yes', rna.features.ix['real'])
ret['auc']

Out[50]:

0.88636363636363635

In [51]:

fun = ret['decision_function']
violin_plot_pandas(spread, fun)

In [52]:

spread_inferred = 1.*(spread.dropna()=='yes').combine_first(ret['filled_feature'])

In [53]:

get_surv_fit_lr(surv, spread)

Out[53]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          12.5  0.000413
no       168         50        NaN     2.99   NaN    0.507  0.4    0.641
yes      71          34        1.42    1.25   2.08   0.282  0.17   0.469

3 rows × 10 columns

In [54]:

get_surv_fit_lr(surv, spread_inferred)

Out[54]:

         Stats                 Median Survival       5y Survival          Log-Rank
         # Patients  # Events  Median  Lower  Upper  Surv   Lower  Upper  chi2  p
                                                                          19.9  7.97e-06
0        227         70        4.71    3.29   NaN    0.493  0.399  0.608
1        95          49        1.42    1.25   1.71   0.246  0.153  0.397

3 rows × 10 columns

In [55]:

survival_and_stats(spread, surv)

In [56]:

survival_and_stats(spread_inferred, surv)

Save processed clinical variables

In [57]:

clinical_processed = pd.concat({'spread': spread, 
                                'spread_inferred': spread_inferred,
                                'invasion': invasion, 
                                'invasion_inferred': invasion_inferred,
                                'hpv': clinical.hpv, 
                                'hpv_inferred': hpv_inferred,
                                'smoker': smoker, 
                                'smoker_inferred': smoker_inferred,
                                'drinker': drinker, 
                                'drinker_inferred': drinker_inferred,
                                'stage': stage,
                                'lymph_stage': lymph_stage,
                                'age': age, 
                                'old_age': old_age,
                                'pack_years': pack_years,
                                'year': (year < 2000).map({True: 'pre_2000', False: 'post_2000'}),
                                'lymph_status': lymph_status,
                                'tumor_subdivision': tumor_subdivision}, axis=1)

In [58]:

clinical.processed = clinical_processed
clinical.save()