# Change what's inside the quotes into your MBL STAMPS server username, or something else unique
# reminder! hit shift-ENTER to execute cell & move onto next cell
username="CHANGEME"
cd /mnt
/mnt
mkdir -p $username
mkdir: cannot create directory `CHANGEME': File exists
cd $username
/mnt/CHANGEME
# get a small E. coli data set
!curl -O http://public.ged.msu.edu.s3.amazonaws.com/ecoli-500k.fq.gz
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 40.8M 100 40.8M 0 0 37.7M 0 0:00:01 0:00:01 --:--:-- 37.9M
!velveth k.31 31 -short -fastq.gz ecoli-500k.fq.gz
[0.000000] Reading FastQ file ecoli-500k.fq.gz; [4.069341] 500000 sequences found [4.069379] Done [4.268264] Reading read set file k.31/Sequences; [4.483652] 500000 sequences found [5.476888] Done [5.476934] 500000 sequences in total. [5.483106] Writing into roadmap file k.31/Roadmaps... [6.097592] Inputting sequences... [6.112510] Inputting sequence 0 / 500000 [55.859540] === Sequences loaded in 49.761966 s [55.859733] Done inputting sequences [55.859753] Destroying splay table [56.057223] Splay table destroyed
!velvetg k.31 -exp_cov auto
[0.000001] Reading roadmap file k.31/Roadmaps [1.616374] 500000 roadmaps read [1.619179] Creating insertion markers [1.915542] Ordering insertion markers [2.197360] Counting preNodes [2.356534] 963682 preNodes counted, creating them now [5.541818] Adjusting marker info... [5.698421] Connecting preNodes [6.496145] Cleaning up memory [6.503573] Done creating preGraph [6.503613] Concatenation... [7.154593] Renumbering preNodes [7.154662] Initial preNode count 963682 [7.187705] Destroyed 847030 preNodes [7.187769] Concatenation over! [7.187784] Clipping short tips off preGraph [7.284335] Concatenation... [7.424245] Renumbering preNodes [7.424489] Initial preNode count 116652 [7.500170] Destroyed 40613 preNodes [7.500420] Concatenation over! [7.500438] 20554 tips cut off [7.500451] 76039 nodes left [7.500533] Writing into pregraph file k.31/PreGraph... [8.419802] Reading read set file k.31/Sequences; [8.631457] 500000 sequences found [9.613768] Done [10.220623] Reading pre-graph file k.31/PreGraph [10.220870] Graph has 76039 nodes and 500000 sequences [10.683820] Scanning pre-graph file k.31/PreGraph for k-mers [10.925013] 5289776 kmers found [11.765411] Sorting kmer occurence table ... [15.942495] Sorting done. [15.942547] Computing acceleration table... [16.272995] Computing offsets... [16.317274] Ghost Threading through reads 0 / 500000 [26.043235] === Ghost-Threaded in 9.726299 s [26.043331] Threading through reads 0 / 500000 [36.589290] === Threaded in 10.545980 s [36.758596] Correcting graph with cutoff 0.200000 [36.764111] Determining eligible starting points [36.951321] Done listing starting nodes [36.951381] Initializing todo lists [36.980292] Done with initilization [36.980555] Activating arc lookup table [37.105083] Done activating arc lookup table [37.314470] 10000 / 76039 nodes visited [37.580561] 20000 / 76039 nodes visited [37.930400] 30000 / 76039 nodes visited [38.177996] 40000 / 76039 nodes visited [38.416655] 50000 / 76039 nodes visited [38.705646] 60000 / 76039 nodes visited [38.846620] 70000 / 76039 nodes visited [38.948188] Concatenation... [38.952225] Renumbering nodes [38.952287] Initial node count 76039 [38.954147] Removed 60123 null nodes [38.954215] Concatenation over! [38.954229] Clipping short tips off graph, drastic [38.961144] Concatenation... [39.004904] Renumbering nodes [39.004981] Initial node count 15916 [39.006151] Removed 4252 null nodes [39.006187] Concatenation over! [39.006216] 11664 nodes left [39.006314] Writing into graph file k.31/Graph2... [40.398103] Measuring median coverage depth... [40.404214] Median coverage depth = 6.684325 [40.404405] Removing contigs with coverage < 3.342162... [40.409395] Concatenation... [40.503582] Renumbering nodes [40.503727] Initial node count 11664 [40.504116] Removed 8967 null nodes [40.504141] Concatenation over! [40.504270] Concatenation... [40.504477] Renumbering nodes [40.504496] Initial node count 2697 [40.504514] Removed 0 null nodes [40.504527] Concatenation over! [40.504540] Clipping short tips off graph, drastic [40.505090] Concatenation... [40.505564] Renumbering nodes [40.505591] Initial node count 2697 [40.505717] Removed 119 null nodes [40.505736] Concatenation over! [40.505749] 2578 nodes left [40.505763] Read coherency... [40.505863] Identifying unique nodes [40.505982] Done, 1829 unique nodes counted [40.506005] Trimming read tips [40.506168] Renumbering nodes [40.506187] Initial node count 2578 [40.506205] Removed 0 null nodes [40.506219] Confronted to 0 multiple hits and 0 null over 0 [40.506232] Read coherency over! [40.508914] Starting pebble resolution... [40.513490] Computing read to node mapping array sizes [40.540875] Computing read to node mappings [40.678131] Estimating library insert lengths... [40.682042] Done [40.682100] Computing direct node to node mappings [40.703919] Scaffolding node 0 [40.704882] === Nodes Scaffolded in 0.022765 s [40.712154] Preparing to correct graph with cutoff 0.200000 [40.736251] Cleaning memory [40.736308] Deactivating local correction settings [40.736367] Pebble done. [40.736386] Concatenation... [40.739241] Renumbering nodes [40.739270] Initial node count 2578 [40.739402] Removed 251 null nodes [40.739450] Concatenation over! [40.739468] Removing reference contigs with coverage < 3.342162... [40.739618] Concatenation... [40.739737] Renumbering nodes [40.739755] Initial node count 2327 [40.739773] Removed 0 null nodes [40.739813] Concatenation over! [40.744302] Writing contigs into k.31/contigs.fa... [41.606408] Writing into stats file k.31/stats.txt... [41.683751] Writing into graph file k.31/LastGraph... [42.941745] Estimated Coverage = 6.684325 [42.941901] Estimated Coverage cutoff = 3.342162 Final graph has 2327 nodes and n50 of 4865, max 24714, total 4451317, using 490991/500000 reads
!head k.31/contigs.fa
>NODE_8_length_8205_cov_6.415600 GTCGCCTTCTACGGTGATCAGCGACTGTTCGGCTTCAACTTTGTCGCCCACTTTGACCAG GATCTCGGTGATTTCAACTTCATCAGCCCCGATGTCCGGTACTTTGATTTCGATAGCCAT TATTCTTTTACCTCTTACGCCAGACGCGGGTTAACTTTATCTGCATCGATGTTGAATTTG GCGATTGCGTCAGCAACCACTTTCTTATCGATTTCGCCACGTTTAGCCAGTTCGCCCAGC GCCGCAACCACGACATAAGAAGCATCAACTTCGAAGTGGTGACGCAGGTTCTCACGGCTG TCGGGACGGCCGAAGCCATCAGTACCCAGTACGCGGTAGTCGTCAGCCGGTACGTAAGTA CGGACCTGCTCAGCGAACAGTTTCATATAGTCGGTAGATGCCACTGCCGGAGCGTCGTTC ATCACCTGAGCGATATACGGAACGCGCGGAGTTTCCAGCGGGTGCAGCATGTTCCAGCGT TCACAATCCTGACCATCACGCGCCAGCTCGGTGAAGGAGGTCACGCTATAAACGTCAGAA
import screed
contig_lengths = [ len(record.sequence) for record in screed.open('k.31/contigs.fa') ]
hist(contig_lengths, bins=20);
How does bigger (or smaller) k affect things? Note, k must be odd.