Notebook

Basic Tutorial: paired-end ddRAD analysis¶

Topics:¶

Setup of params file for paired-end ddRAD
Visualize ideal results on simulated data
To see more options for filtering empirical paired ddRAD data see the Advanced tutorials

This tutorial assumes you have already read the general use pyRAD tutorial.

What is ddRAD?¶

The double digest library preparation method was developed and described by Peterson et al. 2012. Here I will be talking about __ paired-end ddRAD __ data and describe my recommendations for how to setup a params file to analyze them in pyRAD.

A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that are within a certain size window. These will contain overhangs from the respective cut sites on either side. One side will have a barcode+illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "_R1_" in the name, and the second reads are in a separate file/s with "_R2_". Second read files will contain the corresponding pair from the first read files in the exact same order.

In [1]:

from IPython.display import SVG, Image
SVG("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/diag_ddradpair.svg")

Out[1]:

Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software. For simplicity, I suggest using pyRAD, but for data with combinatorial barcodes you will need to use a more advanced de-multiplexing software.

A number of programs are available to check for overlap of paired end reads, and you can run your data through these programs before being input to pyRAD. However, as of pyRAD v.2.0 I now include a robust built-in overlap detection method utilizing USEARCH. This can be used to separate merged from non-merged reads to be analyzed separately in pyRAD. This method is illustrated in the Advanced tutorial.

Simulation setup¶

In [2]:

SVG("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/fig_tree4sims.svg")

Out[2]:

For this tutorial I simulated paired-end ddRAD loci on a 12 taxon species tree, shown above. You can download the data using the script below and assemble these data by following along with all of the instructions below:

In [ ]:

%%bash
wget http://goo.gl/cGUuWd 2&> /dev/null
gunzip cGUuWd
rm cGUuWd

You can simply download the data, but it is also worth describing how the data were generated. I simulated ddRAD-like data using the egglib coalescent simulator with the following options.

In [3]:

indels  = 0.005   ## low rate of indels (prob. mutation is indel)
dropout = 0       ## if 1, mutations to restriction site can cause locus dropout.
nloci   = 1000    ## Total Nloci simulated (less if locus dropout or merged reads) 
ninds   = 1       ## sampled individuals per tip taxon
shortfrag = 300   ## smallest size of digested fragments
longfrag  = 600   ## largest size of digested fragments

Because I simulate sequences at 20X coverage, and the reads are read from both ends (paired-end), the total number of reads is:

In [4]:

reads = 12*ninds*nloci*20*2
print "%s sequenced reads" % reads

480000 sequenced reads

Here I execute the simulation script (available here) to simulate the data, and write it to a file called simpairddRADs.fastq.gz. If you simply downloaded the data you do not need to do this step, I show it only for those interested.

In [5]:

## you will need the Python package egglib installed to run the simRRLs script.
! python ../simscripts/simRRLs.py 0.005 0 1000 1 300,600 pairddrad simpairddRADs

	simulating pairddrad data
	simulating 1000 loci at 20X coverage across 12 tip taxa with 1 samples per taxon
	indels arise at frequency of 0.005000 per mutation
	mutations in restriction site = False
	sequencing error rate = 0.0005
	theta=4Nu= 0.0014
	creating new barcode map
.

We simulated 1000 loci per sample. Because we simulated no sequence overlap (min frag size = 300), and no mutations to restriction sites we should recover all 1000 loci shared across our entire data set.

PyRAD Analysis - the params file¶

We first create an empty params.txt input file for the pyRAD analysis.
The following command will create a template which we will fill in with all relevant parameter settings for the analysis.

In [6]:

! ~/pyrad_v.2.0/pyRAD -n


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	new params.txt file created

Take a look at the default options. Each line designates a parameter, and contains a "##" symbol after which comments can be added, and which includes a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required. Numbers 15-35 are optional.

In [7]:

! cat params.txt

==== parameter inputs for pyRAD version 2.00  ============================  affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 18)             (s1)
usearch7.0.1090_i86linux32  ## 4. command (or path) to call usearch v.7             (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                           ## 7. N processors to use in parallel                   (all)
6                           ## 8. Mindepth: minimum dpth of coverage for a cluster  (s4,s5)
4                           ## 9. NQual: max # sites with qual < minQ (line 20)     (s2)
.90                         ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                         ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merge  (all)
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c90d6m4p3                   ## 14. prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (list or use prefix*)              (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: Loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def. 1)          (s1)
                       ## 20.opt.: MinQ: min Quality score after offset (def. 20)   (s2)
                       ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict     (s2)
                       ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s4)
                       ## 23.opt.: maxN: Ns in a consensus seq (def. 4)             (s5)
                       ## 24.opt.: maxH: hetero. sites in consensus seq (def. 4)    (s5)
                       ## 25.opt.: diploid: limit 2 alleles in a consensus (def. 1) (s5)
                       ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: add output formats: a,n,s,u (see documentation)  (s7)
                       ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)
                       ## 32.opt.: merge/remove paired overlap (def 1), 0=nocheck   (s2)
                       ## 33.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 34.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)
==== list hierachical cluster groups below this line =====================================

Edit the params file¶

I will use the script below to substitute new values, but you should simply use any text editor to make changes. For this analysis I made the following changes from the defaults:

  5. set the two restriction enzymes used to generate the ddRAD data
 10. lowered clustering threshold to .85
 11. set datatype to pairddrad
 14. changed the output name prefix
 19. mismatches for demulitiplexing set to 0, exact match.
 24. Raised maxH. Lower is better for filtering paralogs.
 30. added additional output formats (e.g., nexus and SNPs)

In [8]:

%%bash
sed -i '/## 6. /c\TGCAG,AATT                 ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85                        ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\pairddrad                  ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\c85d6m4p3                  ## 13. prefix name ' ./params.txt
sed -i '/## 19./c\0                     ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10                    ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\a,s,u,n               ## 30. outformats... ' ./params.txt

Let's take a look at the edited params.txt file

In [9]:

! cat params.txt

==== parameter inputs for pyRAD version 2.00  ============================  affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 18)             (s1)
usearch7.0.1090_i86linux32  ## 4. command (or path) to call usearch v.7             (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                 ## 6. cutsites... 
2                           ## 7. N processors to use in parallel                   (all)
6                           ## 8. Mindepth: minimum dpth of coverage for a cluster  (s4,s5)
4                           ## 9. NQual: max # sites with qual < minQ (line 20)     (s2)
.85                        ## 10. lowered clust thresh
pairddrad                  ## 11. datatype... 
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c85d6m5p3                  ## 13. prefix name 
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (list or use prefix*)              (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: Loc. of de-multiplexed data                      (s2)
0                     ## 19. errbarcode... 
                       ## 20.opt.: MinQ: min Quality score after offset (def. 20)   (s2)
                       ## 21.opt.: Filter: 0=NQual 1=NQual+adapters. 2=1+strict     (s2)
                       ## 22.opt.: a priori E,H (def. 0.001,0.01, if not estimated) (s4)
                       ## 23.opt.: maxN: Ns in a consensus seq (def. 4)             (s5)
10                    ## 24. maxH... 
                       ## 25.opt.: diploid: limit 2 alleles in a consensus (def. 1) (s5)
                       ## 26.opt.: maxSNPs: step 7. (def=100). Paired (def=100,100) (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
a,s,u,n               ## 30. outformats... 
                       ## 31.opt.: call maj. consens if dpth < stat. limit (def. 0) (s5)
                       ## 32.opt.: merge/remove paired overlap (def 1), 0=nocheck   (s2)
                       ## 33.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 34.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 35.opt.: minDerep: exclude dereps with <= N copies, def=0 (s3)
==== list hierachical cluster groups below this line =====================================

PyRAD Analysis¶

Step 1: De-multiplexing the data¶

Four examples of acceptable input file name formats for paired-end data:

   1. xxxx_R1_001.fastq      xxxx_R2_001.fastq
   2. xxxx_R1_001.fastq.gz   xxxx_R2_001.fastq.gz
   3. xxxx_R1_100.fq.gz      xxxx_R2_100.fq.gz
   4. xxxx_R1_.fq            xxxx_R2_.fq

The file ending can be .fastq, .fq, or .gz.
There should be a unique name or number shared by each pair and the characters _R1_ and _R2_.
For every file name with _R1_ there should be a corresponding _R2_ file.

If your data are already demultiplexed skip step 1 and see step 2 below.

In [10]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 1


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
 .

Now we can look at the stats output for this below:

In [11]:

! cat stats/s1.sorting.txt

file    	Nreads	cut_found	bar_matched
simpairddRADs_001.fastq.gz	240000	240000	240000


sample	true_bar	obs_bars	N_obs
1C0    	AGATAA    	AGATAA	20000   
2F0    	AGTGAT    	AGTGAT	20000   
1B0    	ATGTGA    	ATGTGA	20000   
3I0    	ATGTTG    	ATGTTG	20000   
3K0    	ATTAGG    	ATTAGG	20000   
2E0    	GAGTTA    	GAGTTA	20000   
1D0    	GGATAT    	GGATAT	20000   
3J0    	TAGGAG    	TAGGAG	20000   
2H0    	TATAAT    	TATAAT	20000   
3L0    	TGAAGG    	TGAAGG	20000   
1A0    	TGATGG    	TGATGG	20000   
2G0    	TTTTTA    	TTTTTA	20000   

nomatch  	_       	0

The de-multiplexed reads are written to a new file for each individual in a new directory created within your working directory called fastq/

The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.

If your data were previously de-multiplexed you need the following things before step 2:

your sorted file names should be formatted similar to above, but with sample names substituted for 1A0, 1A1, etc.
the files can be zipped (.gz) or not (.fq or .fastq).
the barcode should be removed (not on left side of reads)
the restriction site should not be removed, but if it is, enter a '@' symbol before the location of your sorted data.
Enter on line 18 of the params file the location of your sorted data.

Step 2: Quality filtering¶

Next we apply the quality filtering.

We left the filter setting (line 21) at its default of 0, meaning it will apply only a filter based on quality scores. Low quality sites are converted to Ns, and any locus with more than X number of Ns is discarded, where X is the number set on line 9 of the params file.
We could also check for overlap of paired-end reads and filter for the presence of Illumina adapters in this step, however we assume here that the data were properly size selected and do not contain any of these problems. To see those filters applied see the Advanced tutorials.

In [12]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 2


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 2: quality filtering 
	............

Statistics for the number of reads that passed filtering can be found in the stats/ directory.

In [13]:

! cat stats/s2.rawedit.txt

sample	Nreads	exclude	trimmed	passed
2H0	20000	0	0	20000
3I0	20000	0	0	20000
3L0	20000	0	0	20000
3J0	20000	0	0	20000
1C0	20000	0	0	20000
3K0	20000	0	0	20000
1D0	20000	0	0	20000
1B0	20000	0	0	20000
2G0	20000	0	0	20000
1A0	20000	0	0	20000
2F0	20000	0	0	20000
2E0	20000	0	0	20000

    Nreads = total number of reads for a sample
    exclude = reads that were excluded
    trimmed = reads that had adapter trimmed but were kept
    passed = total kept reads

    note: you can set minimum length of kept trimmed reads on line 35 the params file 
    kept trimmed (fragment) reads written to /home/deren/Dropbox/Public/PYRAD_TUTORIALS/tutorial_pairddRAD/mergedreads/

The filtered data files are converted to fasta format and written to a directory called edits/. I show this below:

In [14]:

! ls -l edits/

total 50064
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 1A0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 1B0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 1C0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 1D0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 2E0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 2F0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 2G0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 2H0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 3I0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 3J0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 3K0.edit
-rw-rw-r-- 1 deren deren 4268890 Mar  9 18:57 3L0.edit

Here you can see that the first and second reads have been concatenated into a single read separated by 'xxxx' in these files, and the read quality scores have now been discarded.

In [15]:

! head -n 8 edits/1A0.edit

>1A0_0_pair
TGCAGATCTAAGTCTTTTAGGTTACACGAACGAATGAGCGCCCCCCTGTGTGTCGTTCGTTCACTGTCCGTCTTGTACCCTACCTCGAACCTAxxxxGAACCCCATTACCACCTGCGGGAAATGGCCGGAGGTTACATTTCGAAAGTGATTCTTCTTGGCACATGAATGACAAAGGTTCTCCTTCGCTGCTGCAATT
>1A0_1_pair
TGCAGATCTAAGTCTTTTAGGTTACACGAACGAATGAGCGCCCCCCTGTGTGTCGTTCGTTCACTGTCCCTCTTGTACCCTACCTCGAACCTAxxxxGAACCCCATTACCACCTGCGGGAAATGGCCGGAGGTTACATTTCGAAAGTGATTCTTCTTGGCACATGAATGACAAAGGTTCTCCTTCGCTGCTGCAATT
>1A0_2_pair
TGCAGATCTAAGTCTTTTAGGTTACACGAACGAATGAGCGCCCCCCTGTGTGTCGTTCGTTCACTGTCCGTCTTGTACCCTACCTCGAACCTAxxxxGAACCCCATTACCACCTGCGGGAAATGGCCGGAGGTTACATTTCGAAAGTGATTCTTCTTGGCACATGAATGACAAAGGTTCTCCTTCGCTGCTGCAATT
>1A0_3_pair
TGCAGATCTAAGTCTTTTAGGTTACACGAACGAATGAGCGCCCCCCTGTGTGTCGTTCGTTCACTGTCCGTCTTGTACCCTACCTCGAACCTAxxxxGAACCCCATTACCACCTGCGGGAAATGGCCGGAGGTTACATTTCGAAAGTGATTCTTCTTGGCACATGAATGACAAAGGTTCTCCTTCGCTGCTGCAATT

Step 3: Within-sample clustering¶

From here you have two options for how to proceed. The first is to treat these concatenated reads like single end ddRAD sequences and cluster them as concatenated reads. To do this change the datatype option in the params folder from "pairddrad" to "ddrad", and proceed with steps 3-7. It will detect and remove the 'xxxx' seperator and cluster the long reads.

The more complex option, and the one I prefer, is to cluster the reads separately, which I call the "split clustering" method. This will do the following:

de-replicate concatenated reads
split them and cluster first reads only
match second reads back to firsts
align second read clusters separately from first-read clusters
discard loci for which second reads align poorly (incomplete digestion or paralogs)

This is illustrated graphically below:

In [16]:

SVG("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/pairclustermethod.svg")

Out[16]:

The screen captures below show split clustering performed on an empirical paired ddRAD data set:¶

The good aligned clusters are written to a .clustS.gz file¶

In [17]:

Image("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/goodpairs.png")

Out[17]:

And the poor aligned pairs which are excluded from further analysis are written to a separate file ending with .badpairs.gz.¶

In [18]:

Image("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/badpairs.png")

Out[18]:

The benefit of this method over clustering concatenated first+second reads is that second reads can be retained that are either more divergent than the clustering threshold, or which contain many low quality base calls (Ns), thus potentially yielding more SNPs. Also, for very large data sets it is much faster to cluster shorter length reads. The drawbacks are that it discards sequences for which the first and second reads do not appear to be from the same DNA fragment, whereas the concatenated clustering method might cluster them separately.

The split clustering method filters the alignment of second reads based on the number of indels in the alignment. This number can be set on line 27 of the params file. The default values are 3,6,99,99. Meaning for within-sample clusters we allow 3 indels in first read clusters and 6 in second read clusters. For across-sample clusters we allow 99 in first read clusters and 99 in second read clusters. If only two numbers are set (e.g., 3,10) this is equivalent to 3,3,10,10.

In [19]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 3


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity using up to 2 processors
	1A0.edit finished, 1000loci
	1D0.edit finished, 1000loci
	2E0.edit finished, 1000loci
	2G0.edit finished, 1000loci
	2H0.edit finished, 1000loci
	3I0.edit finished, 1000loci
	2F0.edit finished, 1000loci
	3L0.edit finished, 1000loci
	1B0.edit finished, 1000loci
	3K0.edit finished, 1000loci
	3J0.edit finished, 1000loci
	1C0.edit finished, 1000loci

In [20]:

! cat stats/s3.clusters.txt

taxa	total	dpt.me	dpt.sd	d>5.tot	d>5.me	d>5.sd	badpairs
1A0	1000	20.0	0.0	1000	20.0	0.0	0
1D0	1000	20.0	0.0	1000	20.0	0.0	0
2E0	1000	20.0	0.0	1000	20.0	0.0	0
2G0	1000	20.0	0.0	1000	20.0	0.0	0
2H0	1000	20.0	0.0	1000	20.0	0.0	0
3I0	1000	20.0	0.0	1000	20.0	0.0	0
2F0	1000	20.0	0.0	1000	20.0	0.0	0
3L0	1000	20.0	0.0	1000	20.0	0.0	0
1B0	1000	20.0	0.0	1000	20.0	0.0	0
3K0	1000	20.0	0.0	1000	20.0	0.0	0
3J0	1000	20.0	0.0	1000	20.0	0.0	0
1C0	1000	20.0	0.0	1000	20.0	0.0	0

    ## total = total number of clusters, including singletons
    ## dpt.me = mean depth of clusters
    ## dpt.sd = standard deviation of cluster depth
    ## >N.tot = number of clusters with depth greater than N
    ## >N.me = mean depth of clusters with depth greater than N
    ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
    ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)

Steps 4 & 5: Consensus base calling¶

We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

In [21]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 4


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	............

Calling consensus sequences applies a number of filters, as listed in the params file, to remove potential paralogs or highly repetitive markers from the data set. For paired-end data the maxN filter only applies to first reads, since we don't mind allowing low quality base calls in second reads. The maxH filter applies to both reads together. The diploid filter (only allow 2 haplotypes in a consensus sequence) also applies to the two reads together.

In [22]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 5


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 5: created consensus seqs for 12 samples, using E=0.00051 H=0.00137
	............

In [23]:

! cat stats/s5.consens.txt

taxon               	nloci	f1loci	f2loci	nsites	npoly	poly
3I0.clustS.gz       	1000	1000	1000	183005	246	0.0013442
2H0.clustS.gz       	1000	1000	1000	183002	260	0.0014207
1C0.clustS.gz       	1000	1000	1000	183010	265	0.001448
1D0.clustS.gz       	1000	1000	1000	183000	256	0.0013989
3L0.clustS.gz       	1000	1000	1000	183007	261	0.0014262
2G0.clustS.gz       	1000	1000	1000	183002	260	0.0014207
1B0.clustS.gz       	1000	1000	1000	183004	265	0.0014481
3J0.clustS.gz       	1000	1000	1000	183005	227	0.0012404
1A0.clustS.gz       	1000	1000	1000	183010	242	0.0013223
3K0.clustS.gz       	1000	1000	1000	183003	243	0.0013278
2E0.clustS.gz       	1000	1000	1000	183005	248	0.0013552
2F0.clustS.gz       	1000	1000	1000	183002	259	0.0014153

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites

Step 6: Across-sample clustering¶

This step clusters consensus sequences across samples. It can take a very long time for very large data sets. If run normally it will print its progress to the screen so you can judge how long it might take. Try this first. If it seems too slow (many days or weeks) you can cancel the job and try running the two-step clustering method instead (see the Advanced tutorials). Here I show the results for a data set that runs very quickly.

In [24]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 6


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 6: clustering across 12 samples at '.85' similarity 

usearch v7.0.1090_i86linux32, 4.0Gb RAM (7.9Gb total), 4 cores
(C) Copyright 2013 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: daeaton.chicago@gmail.com


                                                        
      Seqs  12000 (12.0k)
  Clusters  1000
  Max size  12
  Avg size  12.0
  Min size  12
Singletons  0, 0.0% of seqs, 0.0% of clusters
   Max mem  28Mb
      Time  3.00s
Throughput  4000.0 seqs/sec.

Step 7: Statistics and output files¶

Alignment of loci, statistics, and output of various formatted data files.

In [26]:

! ~/pyrad_v.2.0/pyRAD -p params.txt -s 7


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0
	addon 
	exclude 
	...
	final stats written to:
	 /home/deren/Dropbox/Public/PYRAD_TUTORIALS/tutorial_pairddRAD/stats/c85d6m5p3.stats
	output files written to:
	 /home/deren/Dropbox/Public/PYRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory

Stats for this run. In this case 1000 loci were shared across all 12 samples.

In [29]:

! cat stats/c85d6m4p3.stats


1000        ## loci with > minsp containing data
1000        ## loci with > minsp containing data & paralogs removed
1000        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
1A0	1000
1B0	1000
1C0	1000
1D0	1000
2E0	1000
2F0	1000
2G0	1000
2H0	1000
3I0	1000
3J0	1000
3K0	1000
3L0	1000


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	0	*	1000
5	0	*	1000
6	0	*	1000
7	0	*	1000
8	0	*	1000
9	0	*	1000
10	0	*	1000
11	0	*	1000
12	1000	*	1000


## var = number of loci containing n variable sites.
## pis = number of loci containing n parsimony informative sites.
n	var	PIS
0	1	97
1	237	210
2	325	270
3	303	197
4	269	122
5	235	63
6	197	30
7	131	8
8	87	2
9	54	1
10	30	0
11	17	0
12	13	0
13	3	0
14	0	0
15	0	0
16	1	0
total var= 8026
total pis= 2405

Examples of the available output formats can be seen in the general use tutorial.

Notes on paired ddRAD¶

...