Author: Florian Wagner
Email: florian.wagner@duke.edu
This notebook demonstrates the use of the scripts extact_protein_coding_genes.py
and extract_entrez2gene.py
from the GenomeTools package.
# get genometools version
from pkg_resources import require
print 'Package versions'
print '----------------'
print require('genometools')[0]
Package versions ---------------- genometools 1.1.0
extract_protein_coding_genes.py
¶gene_annotation_file = 'Homo_sapiens.GRCh38.79.gtf.gz'
protein_coding_gene_file = 'protein_coding_genes_human.tsv.gz'
!curl -o "$gene_annotation_file" \
"ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz"
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 42.3M 100 42.3M 0 0 4800k 0 0:00:09 0:00:09 --:--:-- 9342k
# reading annotation file from stdin
!gunzip -c "$gene_annotation_file" | extract_protein_coding_genes.py -q -a - -o - | gzip > "$protein_coding_gene_file"
!gunzip -c "$protein_coding_gene_file" | head -n 10
A1BG 19 ENSG00000121410 A1CF 10 ENSG00000148584 A2M 12 ENSG00000175899 A2ML1 12 ENSG00000166535 A3GALT2 1 ENSG00000184389 A4GALT 22 ENSG00000128274 A4GNT 3 ENSG00000118017 AAAS 12 ENSG00000094914 AACS 12 ENSG00000081760 AADAC 3 ENSG00000114771 gzip: stdout: Broken pipe
# alternatively: reading the annotation file directly
!extract_protein_coding_genes.py -a "$gene_annotation_file" -o - | gzip > "$protein_coding_gene_file"
[2015-11-20 16:58:41] INFO: Regular expression used for filtering chromosome names: "(?:\d\d?|MT|X|Y)$" [2015-11-20 16:58:41] INFO: Parsing data... [2015-11-20 16:59:05] INFO: done (parsed 2720535 lines). [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Gene chromosomes (25): [2015-11-20 16:59:05] INFO: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Excluded chromosomes (225): [2015-11-20 16:59:05] INFO: CHR_HG126_PATCH, CHR_HG1362_PATCH, CHR_HG142_HG150_NOVEL_TEST, CHR_HG151_NOVEL_TEST, CHR_HG1832_PATCH, CHR_HG2021_PATCH, CHR_HG2030_PATCH, CHR_HG2058_PATCH, CHR_HG2066_PATCH, CHR_HG2095_PATCH, CHR_HG2104_PATCH, CHR_HG2191_PATCH, CHR_HG2217_PATCH, CHR_HG2232_PATCH, CHR_HG2233_PATCH, CHR_HG2247_PATCH, CHR_HG2288_HG2289_PATCH, CHR_HG2291_PATCH, CHR_HG986_PATCH, CHR_HSCHR10_1_CTG1, CHR_HSCHR10_1_CTG2, CHR_HSCHR10_1_CTG4, CHR_HSCHR11_1_CTG1_2, CHR_HSCHR11_1_CTG5, CHR_HSCHR11_1_CTG6, CHR_HSCHR11_1_CTG7, CHR_HSCHR11_1_CTG8, CHR_HSCHR11_2_CTG1, CHR_HSCHR11_2_CTG1_1, CHR_HSCHR11_3_CTG1, CHR_HSCHR12_1_CTG1, CHR_HSCHR12_1_CTG2_1, CHR_HSCHR12_2_CTG2, CHR_HSCHR12_3_CTG2, CHR_HSCHR12_3_CTG2_1, CHR_HSCHR12_4_CTG2, CHR_HSCHR12_5_CTG2, CHR_HSCHR12_5_CTG2_1, CHR_HSCHR12_6_CTG2_1, CHR_HSCHR13_1_CTG1, CHR_HSCHR13_1_CTG3, CHR_HSCHR14_1_CTG1, CHR_HSCHR14_2_CTG1, CHR_HSCHR14_3_CTG1, CHR_HSCHR14_7_CTG1, CHR_HSCHR15_1_CTG1, CHR_HSCHR15_1_CTG3, CHR_HSCHR15_1_CTG8, CHR_HSCHR15_2_CTG3, CHR_HSCHR15_2_CTG8, CHR_HSCHR15_3_CTG3, CHR_HSCHR15_3_CTG8, CHR_HSCHR15_4_CTG8, CHR_HSCHR15_5_CTG8, CHR_HSCHR16_1_CTG1, CHR_HSCHR16_1_CTG3_1, CHR_HSCHR16_2_CTG3_1, CHR_HSCHR16_3_CTG1, CHR_HSCHR16_4_CTG1, CHR_HSCHR16_CTG2, CHR_HSCHR17_10_CTG4, CHR_HSCHR17_1_CTG1, CHR_HSCHR17_1_CTG2, CHR_HSCHR17_1_CTG4, CHR_HSCHR17_1_CTG5, CHR_HSCHR17_1_CTG9, CHR_HSCHR17_2_CTG1, CHR_HSCHR17_2_CTG2, CHR_HSCHR17_2_CTG5, CHR_HSCHR17_3_CTG2, CHR_HSCHR17_4_CTG4, CHR_HSCHR17_5_CTG4, CHR_HSCHR17_6_CTG4, CHR_HSCHR17_7_CTG4, CHR_HSCHR17_8_CTG4, CHR_HSCHR18_1_CTG1_1, CHR_HSCHR18_2_CTG2, CHR_HSCHR18_2_CTG2_1, CHR_HSCHR18_ALT2_CTG2_1, CHR_HSCHR19KIR_ABC08_A1_HAP_CTG3_1, CHR_HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1, CHR_HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1, CHR_HSCHR19KIR_FH05_A_HAP_CTG3_1, CHR_HSCHR19KIR_FH05_B_HAP_CTG3_1, CHR_HSCHR19KIR_FH06_A_HAP_CTG3_1, CHR_HSCHR19KIR_FH06_BA1_HAP_CTG3_1, CHR_HSCHR19KIR_FH08_A_HAP_CTG3_1, CHR_HSCHR19KIR_FH08_BAX_HAP_CTG3_1, CHR_HSCHR19KIR_FH13_A_HAP_CTG3_1, CHR_HSCHR19KIR_FH13_BA2_HAP_CTG3_1, CHR_HSCHR19KIR_FH15_A_HAP_CTG3_1, CHR_HSCHR19KIR_FH15_B_HAP_CTG3_1, CHR_HSCHR19KIR_G085_A_HAP_CTG3_1, CHR_HSCHR19KIR_G085_BA1_HAP_CTG3_1, CHR_HSCHR19KIR_G248_A_HAP_CTG3_1, CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1, CHR_HSCHR19KIR_GRC212_AB_HAP_CTG3_1, CHR_HSCHR19KIR_GRC212_BA1_HAP_CTG3_1, CHR_HSCHR19KIR_LUCE_A_HAP_CTG3_1, CHR_HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1, CHR_HSCHR19KIR_RP5_B_HAP_CTG3_1, CHR_HSCHR19KIR_RSH_A_HAP_CTG3_1, CHR_HSCHR19KIR_RSH_BA2_HAP_CTG3_1, CHR_HSCHR19KIR_T7526_A_HAP_CTG3_1, CHR_HSCHR19KIR_T7526_BDEL_HAP_CTG3_1, CHR_HSCHR19LRC_COX1_CTG3_1, CHR_HSCHR19LRC_COX2_CTG3_1, CHR_HSCHR19LRC_LRC_I_CTG3_1, CHR_HSCHR19LRC_LRC_J_CTG3_1, CHR_HSCHR19LRC_LRC_S_CTG3_1, CHR_HSCHR19LRC_LRC_T_CTG3_1, CHR_HSCHR19LRC_PGF1_CTG3_1, CHR_HSCHR19LRC_PGF2_CTG3_1, CHR_HSCHR19_1_CTG2, CHR_HSCHR19_1_CTG3_1, CHR_HSCHR19_2_CTG2, CHR_HSCHR19_3_CTG2, CHR_HSCHR19_3_CTG3_1, CHR_HSCHR19_4_CTG2, CHR_HSCHR19_4_CTG3_1, CHR_HSCHR19_5_CTG2, CHR_HSCHR1_1_CTG3, CHR_HSCHR1_1_CTG31, CHR_HSCHR1_1_CTG32_1, CHR_HSCHR1_2_CTG3, CHR_HSCHR1_2_CTG31, CHR_HSCHR1_2_CTG32_1, CHR_HSCHR1_3_CTG31, CHR_HSCHR1_3_CTG32_1, CHR_HSCHR1_4_CTG31, CHR_HSCHR1_ALT2_1_CTG32_1, CHR_HSCHR20_1_CTG2, CHR_HSCHR20_1_CTG3, CHR_HSCHR20_1_CTG4, CHR_HSCHR21_3_CTG1_1, CHR_HSCHR21_4_CTG1_1, CHR_HSCHR21_5_CTG2, CHR_HSCHR21_6_CTG1_1, CHR_HSCHR22_1_CTG1, CHR_HSCHR22_1_CTG2, CHR_HSCHR22_1_CTG3, CHR_HSCHR22_1_CTG4, CHR_HSCHR22_1_CTG5, CHR_HSCHR22_1_CTG7, CHR_HSCHR22_2_CTG1, CHR_HSCHR22_3_CTG1, CHR_HSCHR22_4_CTG1, CHR_HSCHR22_5_CTG1, CHR_HSCHR2_1_CTG1, CHR_HSCHR2_1_CTG5, CHR_HSCHR2_1_CTG7_2, CHR_HSCHR2_2_CTG7_2, CHR_HSCHR2_3_CTG1, CHR_HSCHR2_3_CTG15, CHR_HSCHR2_4_CTG1, CHR_HSCHR3_1_CTG1, CHR_HSCHR3_1_CTG2_1, CHR_HSCHR3_1_CTG3, CHR_HSCHR3_2_CTG2_1, CHR_HSCHR3_2_CTG3, CHR_HSCHR3_3_CTG3, CHR_HSCHR3_4_CTG3, CHR_HSCHR3_5_CTG3, CHR_HSCHR3_6_CTG3, CHR_HSCHR3_7_CTG3, CHR_HSCHR3_8_CTG3, CHR_HSCHR4_1_CTG12, CHR_HSCHR4_1_CTG4, CHR_HSCHR4_1_CTG9, CHR_HSCHR4_6_CTG12, CHR_HSCHR5_1_CTG1_1, CHR_HSCHR5_2_CTG1_1, CHR_HSCHR5_2_CTG5, CHR_HSCHR5_3_CTG1, CHR_HSCHR5_3_CTG5, CHR_HSCHR5_4_CTG1, CHR_HSCHR5_5_CTG1, CHR_HSCHR5_6_CTG1, CHR_HSCHR6_1_CTG4, CHR_HSCHR6_1_CTG5, CHR_HSCHR6_1_CTG8, CHR_HSCHR6_8_CTG1, CHR_HSCHR6_MHC_APD_CTG1, CHR_HSCHR6_MHC_COX_CTG1, CHR_HSCHR6_MHC_DBB_CTG1, CHR_HSCHR6_MHC_MANN_CTG1, CHR_HSCHR6_MHC_MCF_CTG1, CHR_HSCHR6_MHC_QBL_CTG1, CHR_HSCHR6_MHC_SSTO_CTG1, CHR_HSCHR7_1_CTG1, CHR_HSCHR7_1_CTG4_4, CHR_HSCHR7_1_CTG6, CHR_HSCHR7_2_CTG4_4, CHR_HSCHR7_2_CTG6, CHR_HSCHR7_3_CTG6, CHR_HSCHR8_2_CTG7, CHR_HSCHR8_3_CTG1, CHR_HSCHR8_3_CTG7, CHR_HSCHR8_4_CTG7, CHR_HSCHR8_5_CTG1, CHR_HSCHR8_5_CTG7, CHR_HSCHR8_7_CTG1, CHR_HSCHR8_8_CTG1, CHR_HSCHR8_9_CTG1, CHR_HSCHR9_1_CTG2, CHR_HSCHR9_1_CTG3, CHR_HSCHR9_1_CTG5, CHR_HSCHRX_1_CTG3, CHR_HSCHRX_2_CTG12, CHR_HSCHRX_2_CTG3, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1 [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Gene sources: [2015-11-20 16:59:05] INFO: ensembl_havana: 18873 [2015-11-20 16:59:05] INFO: havana: 733 [2015-11-20 16:59:05] INFO: ensembl: 236 [2015-11-20 16:59:05] INFO: insdc: 13 [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Gene types: [2015-11-20 16:59:05] INFO: protein_coding: 19796 [2015-11-20 16:59:05] INFO: polymorphic_pseudogene: 59 [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: # Genes with redundant annotations: 112 [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Polymorphic pseudogenes (59): AKR7L (1), C6orf183 (1), CASP12 (1), CYP2D7 (1), FBXL21 (1), FCGR2C (1), GBA3 (1), GSTT2 (1), IFNL4 (1), KIR2DS4 (1), MROH5 (1), NAT8B (1), OR10A6 (1), OR10AC1 (1), OR10C1 (1), OR10J4 (1), OR10X1 (1), OR11H7 (1), OR12D1 (1), OR12D2 (1), OR13C7 (1), OR1B1 (1), OR1S1 (1), OR2AG1 (1), OR2F1 (1), OR2J1 (1), OR2L8 (1), OR2S2 (1), OR2T11 (1), OR4A8 (1), OR4C16 (1), OR4Q2 (1), OR4X1 (1), OR4X2 (1), OR51B2 (1), OR51F1 (1), OR51G1 (1), OR52B4 (1), OR52E1 (1), OR52R1 (1), OR52Z1 (1), OR5AC1 (1), OR5AL1 (1), OR5AR1 (1), OR5D13 (1), OR5G3 (1), OR5H6 (1), OR5H8 (1), OR5L1 (1), OR5R1 (1), OR6Q1 (1), OR8B4 (1), OR8D2 (1), OR8J2 (1), OR8K3 (1), PKD1L2 (1), PNLIPRP2 (1), SERPINA2 (1), TUBB8P7 (1) [2015-11-20 16:59:05] INFO: [2015-11-20 16:59:05] INFO: Total protein-coding genes: 19742
extract_gene2entrez.py
¶gene2accession_file = 'gene2accession_2015-05-26_human.tsv.gz'
entrez2gene_file = 'entrez2gene_human.tsv'
!curl -L -o "$gene2accession_file" \
"https://www.dropbox.com/s/ggjrvnigtrfue3x/gene2accession_human_2015-05-26.tsv.gz?dl=1"
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 506 0 506 0 0 610 0 --:--:-- --:--:-- --:--:-- 622 100 15.6M 100 15.6M 0 0 3339k 0 0:00:04 0:00:04 --:--:-- 5525k
!extract_entrez2gene.py -f "$gene2accession_file" -o "$entrez2gene_file"
!head -n 10 "$entrez2gene_file"
[2015-11-20 16:59:19] INFO: Found 53831 Entrez Gene IDs associated with 53814 gene symbols. [2015-11-20 16:59:19] INFO: Output written to file "entrez2gene_human.tsv". 1 A1BG 2 A2M 3 A2MP1 9 NAT1 10 NAT2 11 NATP 12 SERPINA3 13 AADAC 14 AAMP 15 AANAT