Process HNSCC Targeted Mutation Calls¶

Here we are reading in variant calls for our molecular validation cohort. This script is still a little dirty... aka hard paths and the like. The rest of the analysis can be picked up from the MAF file that we generate here, but if you are actually trying to run this part of the analysis please contact me.

In [1]:

cd ../src/

/cellar/users/agross/TCGA_Code/TCGA/src

In [2]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [3]:

from Processing.Imports import *

In [4]:

PATH = '/cellar/data/TCGA/protected/exome_andy/hnsc_targeted/'

Read in Variant Calls¶

Read in all of the VCF files and pull out the SNVs.

In [5]:

pts_snv = [f[:12] for f in os.listdir(PATH) if 'call_stats' in f and '~' not in f]
vcf = pd.concat({p: pd.read_table(PATH + '{}_call_stats.txt'.format(p), skiprows=[0]) 
                 for p in pts_snv})
vcf = vcf.reset_index()
vcf = vcf.rename(columns={'level_0':'barcode'})
vcf['barcode'] = vcf['barcode'].map(lambda s: s.replace('_','-'))
pts_snv = map(lambda p: p.replace('_','-'), pts_snv)

MuTect sometimes likes to fail ungracefully. We can tell this when we see no variants in the last few genes. Here I check for this and filter out those patients.

In [6]:

tab = vcf.groupby(['tumor_name','contig']).size().unstack()
drop = tab[[19,20,22]].isnull().sum(1).order() == 3
drop = [p[:12] for p,v in drop.iteritems() if v == True]
pts_snv = [p for p in pts_snv if p not in drop]
vcf = vcf[vcf.judgement == 'KEEP']

Read in all of the indel calls. I have this funky try catch because some of the files have the VCF header but no actual calls, Pandas does not like this. For this reason I keep track of the patients with the pts_indels dict and the indel calls with the indels dict, and then put it all together.

In [7]:

indels = {}
pts_indels = {}
for f in os.listdir(PATH):
    if not f.endswith('indels.txt'):
        continue
    pts_indels[f] = '-'.join(f.split('_')[:3])
    try:
        indels[pts_indels[f]] = pd.read_table(PATH + f, skiprows=102, header=None)
    except:
        pass
pts_indels = pd.Series(pts_indels)
indels = pd.concat(indels)
indels.index.names = ['barcode','num']
indels.columns = ['chromosome','pos','id','ref','alt','qual','filter','info',
                  'format','tumor','normal']
indels = indels.reset_index(0)
indels = indels[indels['info'] == 'SOMATIC']

Not all variant calling runs were sucessfull. Here we are only using patients with a sucessfull SomaticIndelDetector run and a sucessfull MuTect run.

In [8]:

pts = list(set(pts_indels).intersection(set(pts_snv)))
len(pts), len(pts_indels), len(pts_snv)

Out[8]:

(462, 463, 462)

Annotate the Variants¶

I have not found a good automated way to do this, and the best solution is to upload the variants to the Oncotator web site
They do have a REST API but it annotates about 2 variants per second, which takes a while with this many samples
The solution that I have emerged on it to save the variants to a temporar file in the format that Oncotator likes, upload that file to the website, and read in the oncotator output back into the program.
Yes, this is far from elegant, but works for now.

In [9]:

def format_indels_for_oncotator(s):
    r = {}
    r['chr'] = s['chromosome']
    r['start'] = s['pos']
    r['end'] = int(s['pos']) + len(s['ref']) - 1
    r['reference_allele'] = s['ref']
    r['observed_allele'] = s['alt']
    return pd.Series(r)

def format_snvs_for_oncotator(s):
    r = {}
    r['chr'] = s['contig']
    r['start'] = s['position']
    r['end'] = s['position']
    r['reference_allele'] = s['ref_allele']
    r['observed_allele'] = s['alt_allele']
    r = pd.Series(r)
    return r

In [10]:

onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
onc = onc.reset_index('barcode')
onc = onc[['chr', 'start', 'end', 'reference_allele', 'observed_allele']]
onc.to_csv('tmp.txt', sep=' ', index=False, header=False)

Yes this is a hard path. If you are trying to replicate this, replace this with the downloaded oncotator file.

In [11]:

ONCOTATOR_FILE = '/cellar/users/agross/Downloads/oncotator_output (5).txt'

Oncotator only outputs annotations for unique variants. So I have to create a lookup for each mutation and then map it across the whole dataset.

In [12]:

tab = pd.read_table(ONCOTATOR_FILE, skiprows=[0])
tt = tab[['Chromosome','Start_position','End_position','Reference_Allele', 'Tumor_Seq_Allele1']]
tt = pd.Series({i: tuple(v) for i,v in tt.iterrows()})

In [13]:

onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
oo = onc[['chr','start','end','reference_allele','observed_allele']]
oo = [tuple(v) for i,v in oo.iterrows()]
maf = pd.DataFrame(tab.ix[tt[tt == s].index[0]] for s in oo)
maf.index = onc.index
maf = maf.set_index('Hugo_Symbol', append=True)

Here I am turning the MAF file into a gene by patient mutation matrix.

In [14]:

keep = (maf.Variant_Classification.isin(['Silent', 'Intron', "3'UTR", "5'UTR"])==False)
mat = keep.groupby(level=[0,1]).sum()
mat = mat.unstack().ix[pts].fillna(0).T
(mat > 0).sum(1).order()

Out[14]:

Hugo_Symbol
GRB2             0
MAPK3            0
SNAPC5           0
KRAS             1
NRAS             1
YWHAB            2
RAF1             3
IRS1             4
MAP2K1           6
MAP2K2           6
MAPK1            7
SOS1             8
IRS2             9
EGFR            14
HRAS            28
ASPM            29
CASP8           47
MUC5B           48
TP53           307
dtype: int64

Save the MAF file and the gene by patient for later analysis. This is not a proper MAF file as I append the patients with variant calls, but wildtype TP53 to the end of the file.

In [15]:

wildtype = set(pts) - set(maf.index.get_level_values('barcode'))
wildtype = pd.Series({(p, 'wildtype'):nan for p in wildtype})
wildtype.index = pd.MultiIndex.from_tuples(wildtype.index)
wildtype = pd.DataFrame({'Entrez_Gene_Id': wildtype})
maf_wt = maf.append(wildtype)[maf.columns]
maf_wt = maf_wt.dropna(axis=1, how='all')

In [16]:

maf_wt.to_csv('../Extra_Data/HNSCC_molecular_validation.maf', sep='\t')
mat.to_csv('../Extra_Data/HNSCC_molecular_validation_matrix.csv')