#!/usr/bin/env python
# coding: utf-8

# # Example _de novo_ GBS analysis using _pyRAD_
# ----------
#
# Please direct questions about _pyRAD_ usage to the Google group ([link](https://groups.google.com/forum/#!forum/pyrad-users))
#
# --------------
#
#
# This tutorial focuses on the analysis of single-end GBS reads, as well as methods for salvaging short fragments containing adapter sequence, which often result from poor size selection in this type of library preparation. If you have not yet read the [full tutorial](http://nbviewer.ipython.org/gist/dereneaton/ba1e04f0a757ee9f859d), you should start there for a broader description of how _pyRAD_ works.
#
# This tutorial is written as an IPython notebook. Each cell that begins with the header `%%bash` or with `!` should be executed in a command line shell, for example by copying and pasting the text (but excluding the header). If you have IPython notebook installed on your machine you can download this notebook and edit/execute the cells yourself.
#
# -------------
#
# ### A short description of the genotyping-by-sequencing (GBS) library preparation
#
# Genomic DNAs are digested with a single restriction enzyme such that the resulting DNA fragments have the same cutsite overhang on both ends. A key difference between GBS and other library types is that the adapter+barcode can ligate to either end of these fragments, meaning that a given read may have been sequenced from _either_ end.
#
# GBS, like ddRAD, is cheaper to prepare than RAD, and easier to prepare in your own lab. Some GBS preps do not include a size selection step, in which case the data generally include _many_ short fragments that, when read from either end in different reads, will overlap considerably. For this reason _pyRAD_ uses reverse complement clustering to identify reads that come from the same fragment. Other GBS preps add a size selection step, in which case, depending on the size range selected, overlap between reads sequenced from either end of a fragment can be reduced.
#
# In this tutorial I show how to test whether overlap occurs in your GBS data, how to set up the params file to perform reverse complement clustering when it does occur, and how to skip this type of clustering when your data do not overlap (reverse complement clustering is a bit slower).
# --------------
#
# The bash script in the cell below can be used to download an example GBS data set and unarchive it into your current directory. This data set was simulated with the following parameters:
#
# + 12 taxa, 1 sampled individual per taxon
# + 2000 GBS loci (1000 DNA fragments sequenced from each end as first reads)
# + 20X sequencing depth
# + theta = 0.014
# + sequencing error rate = 0.005
# + no mutations to restriction sites
# + no indels
# + fragment size window (50-300)
#
# This size selection window was chosen to demonstrate the analysis of short fragments.

# ## Example data set

# In[1]:

get_ipython().run_cell_magic('bash', '', 'wget -q http://dereneaton.com/downloads/simGBS.zip\nunzip simGBS.zip\n')

# #### The two necessary files below should now be located in your current directory:
# + simGBS_R1.fastq.gz : Illumina fastQ formatted reads (gzip compressed)
# + simGBS.barcodes : barcode map file
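# Before moving on it is worth verifying the download and glancing at the barcode map. The cell below is a minimal check; it assumes the two files unzipped into your current directory and that the barcodes file uses the standard two-column layout (sample name, then barcode sequence).

# In[ ]:

get_ipython().run_cell_magic('bash', '', '# confirm both input files are present, then peek at the barcode map\nls -l simGBS_R1.fastq.gz simGBS.barcodes\nhead -n 3 simGBS.barcodes\n')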
# ## _pyRAD_ Analysis
#
# We begin by creating the params.txt file, which sets all of the parameters for the _pyRAD_ analysis. A default template should always be created first using the `-n` option in pyRAD. Then you can use any text editor to fill in the relevant lines with parameter values.

# In[3]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -n\n')

# #### Let's take a look at the default settings.

# In[4]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# --------------
#
# #### We want to change a few of these. You should do this by editing params.txt with any text editor. I use the script below to automate the changes here. The most important changes are the following:
#
# + line 11 sets the datatype to 'gbs', meaning we will test for reverse-complement clusters in step 3
# + line 21 sets the filter to 1, meaning we will trim reads with a detected Illumina adapter in step 2
# + line 32 sets the minimum length of kept trimmed reads to 50

# In[5]:

get_ipython().run_cell_magic('bash', '', "sed -i '/## 10. /c\\.85 ## 10. lowered clust thresh... ' params.txt\nsed -i '/## 11. /c\\gbs ## 11. changed datatype to gbs ' params.txt\nsed -i '/## 14. /c\\c85m4p3 ## 14. outprefix... ' params.txt\nsed -i '/## 21./c\\1 ## 21. set filter to 1 ' params.txt\nsed -i '/## 24./c\\8 ## 24. increased maxH ' params.txt\nsed -i '/## 30./c\\* ## 30. all output formats ' params.txt\nsed -i '/## 32./c\\50 ## 32. keep fragments longer than 50 ' params.txt\n")

# In[6]:

get_ipython().run_cell_magic('bash', '', 'cat params.txt\n')

# #### Let's take a look at the raw data.
# Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file. The file/s may be compressed with gzip, in which case they end in .gz, but they do not need to be compressed. The location of these files should be entered on line 2 of the params file, unless your data are already demultiplexed, in which case you can skip step 1 and instead list the location of your demultiplexed files on line 18. Below are the first three reads in the example data set.

# In[7]:

get_ipython().run_cell_magic('bash', '', 'less simGBS_R1.fastq.gz | head -n 12 | cut -c 1-80\n')

# Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. And the fourth line holds the quality scores for the base calls, which are arbitrarily high here since the data were simulated.
#
# These are 100 bp single-end reads prepared as GBS. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data.

# ### Step 1: de-multiplexing
# This step uses information in the barcodes file to sort the data into a separate file for each sample, saved in a new directory called fastq/

# In[8]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 1\n')

# In[9]:

get_ipython().run_cell_magic('bash', '', 'ls fastq/\n')
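# As an optional sanity check (not part of the _pyRAD_ workflow), you can count the reads sorted to each sample. Since each read takes four lines in fastq format, dividing the line count by four gives the read count; this sketch assumes the sorted files in fastq/ are gzip compressed.

# In[ ]:

get_ipython().run_cell_magic('bash', '', '# one fastq read = 4 lines, so reads = lines / 4\nfor f in fastq/*.gz; do\n  echo "$f: $(( $(zcat $f | wc -l) / 4 )) reads"\ndone\n')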
# #### The statistics for step 1
# For each raw data file, the stats show the total number of reads, the number for which the cut site was detected, and the number whose barcode matched one in the barcodes file. Then, for each true barcode, the observed barcodes that matched within the allowed number of mismatches are listed with their number of occurrences. Finally, the non-matching barcodes and their occurrences are shown at the bottom.

# In[10]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s1.sorting.txt\n')

# ### Step 2: quality filtering
# Step 2 filters reads based on quality scores, and can be used to detect and trim off Illumina adapters if present, which we expect in these simulated data. Filter option 1 looks for a fuzzy match to the reverse complement cutsite followed by the Illumina adapter. If found, the adapter is chopped off, and the trimmed read is kept if its remaining length is longer than the set minimum (here 50 bp); otherwise it is discarded. If you prefer, you could instead use external software to filter adapter sequences from your data.

# In[11]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 2\n')

# In[13]:

get_ipython().run_cell_magic('bash', '', 'ls edits/\n')

# The filtered data are written in fasta format (quality scores removed) into a new directory called edits/

# In[14]:

get_ipython().run_cell_magic('bash', '', 'head -n 10 edits/1A0.edit | cut -c 1-80\n')

# ----------------
#
# You can see here that of the 40K reads for each sample, about 39,900 pass filtering. Of these, about 33,000 passed with no adapter sequence detected, and about 6,000 had the adapter trimmed off but still passed because their remaining length was longer than 50 bp. Only about 100 reads were excluded from each sample.

# In[16]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s2.rawedit.txt\n')

# ### Step 3: clustering within samples
# Step 3 de-replicates and then clusters reads within samples at the set clustering threshold, and writes the clusters to new files in a directory called clust.xx (here clust.85, named for the threshold).
#
# Below I show the results of this analysis, which uses reverse complement clustering to detect overlap among sequences originating from either end of short GBS fragments. I then show how to check for overlap in your data, and how to turn off reverse complement clustering if overlap is not detected, since doing so will allow your analysis to run much faster.

# In[17]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 3\n')

# #### Overlapping clusters
# As you can see below, the first few clusters include two reverse complement matching clusters. The first cluster has reads from either end that overlap almost completely, meaning this fragment was only a little more than 100 bp long, given that we sequenced 100 bp reads. The third cluster is offset a little more. I found that reads which overlap by less than about 30% are difficult to detect. The minimum amount of overlap allowed is not a parameter you can change in the params file, but it can be edited in the _pyRAD_ code.

# In[18]:

get_ipython().run_cell_magic('bash', '', 'less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80\n')

# ---------------
#
# The stats output tells you how many clusters were found and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide whether you wish to increase or decrease the mindepth. It also shows histograms of coverage depth for each sample. Below I show sample 1A0, whose histogram reflects the reads that matched normally plus those that matched in the reverse complement direction, which doubled the coverage.

# In[19]:

get_ipython().run_cell_magic('bash', '', 'head -n 53 stats/s3.clusters.txt\n')

# #### Toggling reverse-complement clustering
#
# For any single-end GBS analysis I suggest that you first begin step 3 with the datatype set to gbs, which tells _pyRAD_ to perform reverse complement clustering. Let this run for about 15 minutes, and then look at the resulting data file in the clust directory that has a ".u" ending, using the unix command "less". I use the command "head" below to print only the first 20 lines.

# In[20]:

get_ipython().run_cell_magic('bash', '', 'head -n 20 clust.85/1A0.u\n')

# The ".u" file records the clustering process, showing which hits match to which seeds, their measured similarity, the direction in which they match, and the number of indels. If your data do not contain any reverse complement clusters, meaning you see only "+" in column 5 and no "-", then you do not need to use reverse complement clustering. In that case you could change the datatype option from 'gbs' to 'rad' and you would get a speed increase. If speed is not a concern, or if you see many "-" in the .u file, then you should perform the rest of your analysis using the 'gbs' option.
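# Rather than scanning the .u file by eye, you can tally the orientation column directly. The cell below is a small sketch that relies only on the layout described above (match orientation in column 5); it prints the counts of "+" and "-" hits for each sample.

# In[ ]:

get_ipython().run_cell_magic('bash', '', "# tally match orientations per sample: mostly '+' means little or no overlap\nfor f in clust.85/*.u; do\n  echo $f\n  awk '{ print $5 }' $f | sort | uniq -c\ndone\n")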
# ### Steps 4 & 5: Call consensus sequences
# Step 4 jointly infers the error rate and heterozygosity across samples.

# In[21]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 4\n')

# In[17]:

get_ipython().run_cell_magic('bash', '', 'cat stats/Pi_E_estimate.txt\n')

# Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs.

# In[22]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 5\n')

# In[23]:

get_ipython().run_cell_magic('bash', '', 'cat stats/s5.consens.txt\n')

# ### Step 6: Cluster across samples
# Step 6 clusters consensus sequences across samples. Again, this uses reverse complement clustering if the datatype is set to 'gbs'. This step can take a long time for very large data sets (>100 individuals). I suggest trying it first; if it takes too long you can implement the hierarchical clustering method (described in detail in a separate tutorial). Step 7 then aligns the clusters, applies the final filters, and writes the output files.

# In[24]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 6 \n')

# In[25]:

get_ipython().run_cell_magic('bash', '', 'pyRAD -p params.txt -s 7\n')

# ### Final stats output
# As you can see, we did not recover all 2000 loci in the data set. This is because some reads were filtered for containing Illumina adapters, and others were merged when they overlapped. Below I print the first several loci in the ".loci" output to show the variable-length loci recovered from this GBS analysis, which included fragmented and overlapping reads.

# In[26]:

get_ipython().run_cell_magic('bash', '', 'cat stats/c85m4p3.stats\n')

# In[29]:

get_ipython().run_cell_magic('bash', '', 'head -n 200 outfiles/c85m4p3.loci | cut -c 1-85\n')
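# If you want a quick count of the recovered loci without reading the stats file, you can count the locus separator lines. This sketch assumes the standard .loci format, in which each locus block ends with a separator line beginning with "//".

# In[ ]:

get_ipython().run_cell_magic('bash', '', '# count locus blocks by their "//" separator lines (assumed .loci layout)\ngrep -c "^//" outfiles/c85m4p3.loci\n')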