- The plan
- Constructing the Reference Tree
- Aligning the Illumina OTU sequences
- Constructing the ML bootstrapped tree
- Inserting the Illumina Sequences
- Creating the dataset showing where Illumina sequences placed
- Using iToL
- Identify the backbone sequences needed to generate the phylogenetic tree.
  - Use BLAST or insert into the ARB SILVA database
- Use RAxML to generate the bootstrapped ML tree from the reference sequences.
- Use papara or SILVA to align the Illumina sequences to the reference tree.
- Insert the aligned OTUs into the reference tree using RAxML-EPA.
- Convert the .jplace into a datasheet format that iToL can read.
  - Enveomics script by L.M.M. Rodriguez-R
  - Custom script
Constructing the Reference Tree
The set-up. I’m working with a dataset that includes ~800,000 16S amplicon Illumina sequences and ~400 mid- to full-length 16S clone sequences. The environment these come from is poorly characterized, and most of the sequences don’t have close isolates. The goal is to describe some of the dominant populations, and it was decided that longer sequences were needed to improve the confidence in phylogenetic placements.
The first iteration
Note: I didn’t save all the commands for this…
The first reference tree was made by BLASTing the clone sequences against refseq_rna and grabbing the closest named isolate (if possible) or the closest uncharacterized high-quality sequence.
This collection of sequences was aligned with Clustal Omega. Representative Illumina OTU sequences (~40,000) were aligned using papara (I don’t think this was ideal) and inserted using RAxML-EPA, and the resultant .jplace file was modified using the JPlace.to_iToL.rb script in enveomics.
The second iteration
This will be documented.
This tree was built by aligning the clone sequences using SILVA and inserting them into a pre-made ARB tree (SILVA v.128 Ref_NR_99). Close relatives were chosen manually, and I included a broader range of outgroups.
Aligned sequences were exported in FASTA format. There might be an issue with names that I need to fix manually (yes there was)…
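A minimal sketch of the kind of name clean-up this can involve, assuming the problem is characters that RAxML rejects in taxon names (whitespace, colons, commas, parentheses, semicolons, square brackets and single quotes). The function name `sanitize_fasta` and the underscore replacement are illustrative choices, not necessarily the manual fix used here.

```python
import re

# Characters RAxML refuses in taxon names: whitespace : , ( ) ; [ ] '
ILLEGAL = re.compile(r"[\s:,();\[\]']+")


def sanitize_fasta(lines):
    """Return FASTA lines whose headers are reduced to safe, unique names."""
    seen = {}
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Replace each run of illegal characters with one underscore.
            name = ILLEGAL.sub("_", line[1:].strip()).strip("_")
            count = seen.get(name, 0)
            seen[name] = count + 1
            if count:
                # Repeated names get a numeric suffix so they stay distinct.
                name = f"{name}_{count}"
            out.append(">" + name)
        else:
            out.append(line)
    return out
```

Keeping a table of old and new names alongside the cleaned file makes it easy to restore the full descriptions later.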
Aligning the Illumina OTU sequences
I’m using the mothur-recreated SEED database to align the Illumina OTU sequences.
Constructing the ML bootstrapped tree
I’m not sure what the most appropriate way to make the reference tree is. It is tempting to use the ARB NJ tree, since its topology is so well grounded by the 600,000 curated sequences.
However, I am also making a RAxML tree, and I’ll double-check that the topologies match.
*Note: The first time, I used RAxML’s default masking and the tree topology was not good. I’ve since gone back and applied the Lane mask to try to improve the RAxML tree. If this doesn’t work, I’ll export the ARB NJ tree and use it!
*Note2: The tree is still not correct, so I ended up using the exported ARB tree.
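For reference, applying a column mask to an alignment amounts to keeping only the columns flagged in the mask. The sketch below assumes the mask is available as a string of 0/1 characters, one per alignment column, and that the alignment has been read into a dict of name to aligned sequence. Both function names are hypothetical, and this is not the ARB export route used above.

```python
def apply_mask(aligned_seq, mask):
    """Keep only the alignment columns where the mask has a '1'."""
    if len(aligned_seq) != len(mask):
        raise ValueError("sequence and mask must have the same length")
    return "".join(base for base, keep in zip(aligned_seq, mask) if keep == "1")


def mask_alignment(records, mask):
    """Apply the same column mask to every aligned sequence in a dict."""
    return {name: apply_mask(seq, mask) for name, seq in records.items()}
```

The length check matters: a mask built for a different alignment version will silently select the wrong columns if the lengths happen to be truncated to match.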
Inserting the Illumina Sequences
Luckily, I have everything I need to do this: (1) a tree and (2) the full set of sequences to insert, including the reference sequences already in the tree.
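Before running the insertion, it is worth checking that every leaf in the tree really is present in the alignment. A rough sketch, assuming a Newick tree with unquoted leaf labels and a FASTA alignment whose headers match those labels exactly; the function names are made up for illustration.

```python
import re


def newick_leaf_names(newick):
    """Collect leaf labels from a Newick string (unquoted labels only)."""
    # A leaf label directly follows '(' or ','; internal labels follow ')'.
    found = re.findall(r"[(,]\s*([^():,;]+)", newick)
    return {name.strip() for name in found if name.strip()}


def fasta_names(lines):
    """Collect sequence names from FASTA header lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def missing_from_alignment(newick, fasta_lines):
    """Tree leaves that have no sequence of the same name in the alignment."""
    return sorted(newick_leaf_names(newick) - fasta_names(fasta_lines))
```

An empty list means every reference sequence in the tree is also in the alignment; anything else usually points back to a naming mismatch.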
Creating the dataset showing where Illumina sequences placed
I’m currently using two scripts to convert the jplace file to a format that iToL can recognize.
The first script was written by Luis M. Rodriguez-R and is part of the enveomics collection. I use this script solely to grab the tree that Miguel reformats from the jplace file to play nicely with iToL. If you did your RAxML-EPA mapping in a different way (each sequence library separately), you could use this script alone.
I, however, mapped OTU representative sequences onto the tree, and in order to correctly display the true abundances I needed to inflate these back into the reads assigned to each OTU.
I wrote a quick Python script that follows the logic of Miguel’s Ruby script (namely, I grab the most likely node placement for each OTU) and generates a list of all the OTUs that were placed at the same node. I can then use the OTU table (tab-delimited format) to sum all sequences from each library placed at each node in the tree.
I’ve currently set the default radius size to be the sum of the total reads assigned, but that can be changed easily after the script is run (it’s the 3rd column). The default pie chart location is the middle of the branch (2nd column, value 0.5). If you change the value from 0.5 to -1, external pie charts will be plotted at the leaves. Values from 0 (start of node branch) to 1 (end of node branch) can also be used.
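The logic described above can be sketched roughly as follows. This is not the actual script. It assumes "most likely" means the placement with the highest `like_weight_ratio`, that the jplace file follows the standard layout (`fields`, `placements`, with names under `n` or `nm`), and that the OTU table has a header row, OTU names in the first column and integer read counts per library in the remaining columns. All function names are hypothetical.

```python
import json
from collections import defaultdict


def load_jplace(path):
    """Read a .jplace file (plain JSON) into a dict."""
    with open(path) as handle:
        return json.load(handle)


def best_edges(jplace):
    """Map each placed sequence name to its most likely edge number."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    best = {}
    for placement in jplace["placements"]:
        # Assumption: the best placement is the one with the highest LWR.
        top = max(placement["p"], key=lambda row: row[lwr_i])
        names = placement.get("n") or [nm[0] for nm in placement.get("nm", [])]
        for name in names:
            best[name] = top[edge_i]
    return best


def reads_per_edge(best, otu_table_lines):
    """Sum the reads of every OTU placed on the same edge, per library."""
    rows = [l.rstrip("\n").split("\t") for l in otu_table_lines if l.strip()]
    libraries = rows[0][1:]
    totals = defaultdict(lambda: [0] * len(libraries))
    for otu, *counts in rows[1:]:
        if otu not in best:
            continue  # OTU was not placed on the tree
        for i, count in enumerate(counts):
            totals[best[otu]][i] += int(count)
    return libraries, dict(totals)
```

The result is one row of per-library read totals for each edge that received at least one OTU, which is the shape a pie chart dataset needs.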
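As an illustration of the file layout, here is a sketch that writes per-node counts as a tab-separated pie chart dataset. It assumes the header keywords of iToL's pie chart dataset template (`DATASET_PIECHART`, `SEPARATOR`, `DATASET_LABEL`, `COLOR`, `FIELD_COLORS`, `FIELD_LABELS`, `DATA`) and that the node IDs passed in already match the labels in the reformatted tree. The function name, the dataset label and the colours are placeholders.

```python
def itol_piechart_lines(node_counts, libraries, colors,
                        label="OTU reads", position=0.5):
    """Build the lines of an iToL pie chart dataset (tab-separated)."""
    lines = [
        "DATASET_PIECHART",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        "FIELD_COLORS\t" + "\t".join(colors),
        "FIELD_LABELS\t" + "\t".join(libraries),
        "DATA",
    ]
    for node, counts in sorted(node_counts.items()):
        # Column 2 is the position on the branch, column 3 the radius;
        # here the radius is the total number of reads at the node.
        radius = sum(counts)
        fields = [str(node), str(position), str(radius)]
        fields += [str(count) for count in counts]
        lines.append("\t".join(fields))
    return lines
```

Passing `position=-1` produces the external pie charts at the leaves mentioned above, and the radius column can still be rescaled by hand afterwards.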
From here it is pretty straightforward to upload your tree and your datafile into a project and then play with all the options.