Stephanie Chen, Maurizio Rossetto, Marlien van der Merwe, Hervé Sauquet, Patricia Lu-Irving, Jia-Yee Yap, William Studley, Greg Bourke, Jason Bragg, Richard J. Edwards
Whole-genome shotgun sequencing is becoming increasingly common in phylogenetic research as its cost falls relative to traditional methods that target subsets of the genome. However, few packages exist for assembling putatively orthologous loci from evolutionarily diverged samples and building alignments for phylogenetic analysis from these data. Additionally, although short-read Illumina sequencing data are highly accurate, at low coverage it can be difficult to draw meaningful phylogenomic inferences, especially for non-model organisms that lack a reference genome.
We have developed a scalable method for rapidly generating species trees from short-read data without the need for a reference genome. The workflow involves (1) de novo genome assembly with ABySS at a range of k values; (2) extraction of the most complete BUSCO (Benchmarking Universal Single-Copy Orthologs) genes from each set of assemblies with the BUSCO Compiler and Comparison tool (BUSCOMP); (3) generation of gene trees; and (4) construction of a species tree.
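The idea behind step (2) can be illustrated with a minimal Python sketch. It is not BUSCOMP's actual algorithm or interface: the function names, the ranking of BUSCO status categories, the assembly labels (`k41`, `k61`), and the gene identifiers (`g1` to `g3`) are all hypothetical and chosen only for illustration. For each sample, the sketch keeps the best-rated copy of every BUSCO gene across assemblies built at different k values, then lists the genes that are complete in every sample and could therefore feed into gene-tree construction.

```python
# Illustrative only: a simplified stand-in for compiling BUSCO results across
# assemblies. The ranking below (best first) is an assumption of this sketch.
STATUS_RANK = {"Complete": 0, "Duplicated": 1, "Fragmented": 2, "Missing": 3}


def best_assembly_per_gene(ratings):
    """Pick the best-rated assembly for each BUSCO gene of one sample.

    ratings: {assembly_name: {busco_id: status}}
    returns: {busco_id: (assembly_name, status)}
    """
    best = {}
    # Iterate in sorted order so that ties go to the first assembly name.
    for assembly in sorted(ratings):
        for busco_id, status in ratings[assembly].items():
            current = best.get(busco_id)
            if current is None or STATUS_RANK[status] < STATUS_RANK[current[1]]:
                best[busco_id] = (assembly, status)
    return best


def shared_complete_genes(per_sample_best):
    """Return the sorted BUSCO ids rated Complete in every sample."""
    gene_sets = [
        {gene for gene, (_, status) in best.items() if status == "Complete"}
        for best in per_sample_best.values()
    ]
    return sorted(set.intersection(*gene_sets)) if gene_sets else []


# Hypothetical ratings for two samples, each assembled at two k values.
sample_a = {
    "k41": {"g1": "Complete", "g2": "Fragmented", "g3": "Missing"},
    "k61": {"g1": "Fragmented", "g2": "Complete", "g3": "Missing"},
}
sample_b = {
    "k41": {"g1": "Complete", "g2": "Missing", "g3": "Complete"},
    "k61": {"g1": "Complete", "g2": "Complete", "g3": "Fragmented"},
}

per_sample_best = {
    "sample_a": best_assembly_per_gene(sample_a),
    "sample_b": best_assembly_per_gene(sample_b),
}
shared = shared_complete_genes(per_sample_best)
```

In this toy input, no single k value gives the best result for every gene, which is the motivation for assembling at a range of k values and compiling across them.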
The workflow has been applied to a whole-genome shotgun sequencing dataset of five waratah (Telopea spp.) species, comprising two samples from each of seven lineages; T. speciosissima (New South Wales waratah) is represented by three lineages: coastal, upland, and southern. We have also generated a reference genome for T. speciosissima and examine the robustness of the workflow by comparing it with a reference-based approach. We anticipate that the workflow will maximise the recovery of informative data from genomic datasets for reproducible phylogenomic studies and will be especially useful for non-model organisms.