Here we are reading in variant calls for our molecular validation cohort. This script is still a little dirty... aka hard paths and the like. The rest of the analysis can be picked up from the MAF file that we generate here, but if you are actually trying to run this part of the analysis please contact me.
cd ../src/
/cellar/users/agross/TCGA_Code/TCGA/src
%pylab inline
Populating the interactive namespace from numpy and matplotlib
from Processing.Imports import *
PATH = '/cellar/data/TCGA/protected/exome_andy/hnsc_targeted/'
Read in all of the VCF files and pull out the SNVs.
pts_snv = [f[:12] for f in os.listdir(PATH) if 'call_stats' in f and '~' not in f]
vcf = pd.concat({p: pd.read_table(PATH + '{}_call_stats.txt'.format(p), skiprows=[0])
for p in pts_snv})
vcf = vcf.reset_index()
vcf = vcf.rename(columns={'level_0':'barcode'})
vcf['barcode'] = vcf['barcode'].map(lambda s: s.replace('_','-'))
pts_snv = map(lambda p: p.replace('_','-'), pts_snv)
MuTect sometimes likes to fail ungracefully. We can tell this when we see no variants in the last few genes. Here I check for this and filter out those patients.
tab = vcf.groupby(['tumor_name','contig']).size().unstack()
drop = tab[[19,20,22]].isnull().sum(1).order() == 3
drop = [p[:12] for p,v in drop.iteritems() if v == True]
pts_snv = [p for p in pts_snv if p not in drop]
vcf = vcf[vcf.judgement == 'KEEP']
Read in all of the indel calls. I have this funky try catch because some of the files have the VCF header but no actual calls, Pandas does not like this. For this reason I keep track of the patients with the pts_indels dict and the indel calls with the indels dict, and then put it all together.
indels = {}
pts_indels = {}
for f in os.listdir(PATH):
if not f.endswith('indels.txt'):
continue
pts_indels[f] = '-'.join(f.split('_')[:3])
try:
indels[pts_indels[f]] = pd.read_table(PATH + f, skiprows=102, header=None)
except:
pass
pts_indels = pd.Series(pts_indels)
indels = pd.concat(indels)
indels.index.names = ['barcode','num']
indels.columns = ['chromosome','pos','id','ref','alt','qual','filter','info',
'format','tumor','normal']
indels = indels.reset_index(0)
indels = indels[indels['info'] == 'SOMATIC']
Not all variant calling runs were sucessfull. Here we are only using patients with a sucessfull SomaticIndelDetector run and a sucessfull MuTect run.
pts = list(set(pts_indels).intersection(set(pts_snv)))
len(pts), len(pts_indels), len(pts_snv)
(462, 463, 462)
def format_indels_for_oncotator(s):
r = {}
r['chr'] = s['chromosome']
r['start'] = s['pos']
r['end'] = int(s['pos']) + len(s['ref']) - 1
r['reference_allele'] = s['ref']
r['observed_allele'] = s['alt']
return pd.Series(r)
def format_snvs_for_oncotator(s):
r = {}
r['chr'] = s['contig']
r['start'] = s['position']
r['end'] = s['position']
r['reference_allele'] = s['ref_allele']
r['observed_allele'] = s['alt_allele']
r = pd.Series(r)
return r
onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
onc = onc.reset_index('barcode')
onc = onc[['chr', 'start', 'end', 'reference_allele', 'observed_allele']]
onc.to_csv('tmp.txt', sep=' ', index=False, header=False)
Yes this is a hard path. If you are trying to replicate this, replace this with the downloaded oncotator file.
ONCOTATOR_FILE = '/cellar/users/agross/Downloads/oncotator_output (5).txt'
Oncotator only outputs annotations for unique variants. So I have to create a lookup for each mutation and then map it across the whole dataset.
tab = pd.read_table(ONCOTATOR_FILE, skiprows=[0])
tt = tab[['Chromosome','Start_position','End_position','Reference_Allele', 'Tumor_Seq_Allele1']]
tt = pd.Series({i: tuple(v) for i,v in tt.iterrows()})
onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
oo = onc[['chr','start','end','reference_allele','observed_allele']]
oo = [tuple(v) for i,v in oo.iterrows()]
maf = pd.DataFrame(tab.ix[tt[tt == s].index[0]] for s in oo)
maf.index = onc.index
maf = maf.set_index('Hugo_Symbol', append=True)
Here I am turning the MAF file into a gene by patient mutation matrix.
keep = (maf.Variant_Classification.isin(['Silent', 'Intron', "3'UTR", "5'UTR"])==False)
mat = keep.groupby(level=[0,1]).sum()
mat = mat.unstack().ix[pts].fillna(0).T
(mat > 0).sum(1).order()
Hugo_Symbol GRB2 0 MAPK3 0 SNAPC5 0 KRAS 1 NRAS 1 YWHAB 2 RAF1 3 IRS1 4 MAP2K1 6 MAP2K2 6 MAPK1 7 SOS1 8 IRS2 9 EGFR 14 HRAS 28 ASPM 29 CASP8 47 MUC5B 48 TP53 307 dtype: int64
Save the MAF file and the gene by patient for later analysis. This is not a proper MAF file as I append the patients with variant calls, but wildtype TP53 to the end of the file.
wildtype = set(pts) - set(maf.index.get_level_values('barcode'))
wildtype = pd.Series({(p, 'wildtype'):nan for p in wildtype})
wildtype.index = pd.MultiIndex.from_tuples(wildtype.index)
wildtype = pd.DataFrame({'Entrez_Gene_Id': wildtype})
maf_wt = maf.append(wildtype)[maf.columns]
maf_wt = maf_wt.dropna(axis=1, how='all')
maf_wt.to_csv('../Extra_Data/HNSCC_molecular_validation.maf', sep='\t')
mat.to_csv('../Extra_Data/HNSCC_molecular_validation_matrix.csv')