Thursday, 3 June 2021

Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C

The latest genomics paper from the lab is now out on bioRxiv. This is the first paper from Stephanie Chen’s PhD project in collaboration with the Royal Botanic Gardens and Domain Trust (RBGDT), Sydney. In this paper, Stephanie reports on the chromosome-level assembly of the New South Wales waratah, the floral emblem of NSW. This is the first of the pilot reference genomes to be released from the Genomics for Australian Plants initiative.

In addition to the genome itself, this paper describes a couple of genomics tools from the lab. DepthSizer (https://github.com/slimsuite/depthsizer) uses BUSCO predictions to establish the single-copy read depth of sequencing data, from which the genome size can be estimated in a way that is hopefully quite robust to assembly quality. Diploidocus (https://github.com/slimsuite/diploidocus) has been used for our previous dog genome assemblies to help eliminate “haplotigs” (heterozygous regions of the genome that appear in the assembly twice) and low-quality sequences, in addition to flagging possible collapsed repeats or contaminants for further investigation. Here, the Diploidocus “tidy” pipeline is considerably extended for a much more nuanced classification and filtering of scaffolds, using a combination of read depths, homology, k-mer analysis and BUSCO predictions.
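To make the DepthSizer idea concrete, here is a minimal sketch of depth-based genome size estimation. It is not DepthSizer's actual code: the function names, the use of the median as the depth summary, and the toy numbers are all assumptions for illustration. The only principle relied on is that total sequenced bases divided by the single-copy read depth approximates the genome size.

```python
from statistics import median


def single_copy_depth(busco_depths):
    """Summarise read depth across single-copy BUSCO genes.

    The median is an assumption made for this sketch; it is simply a
    summary that tolerates a few outliers (e.g. collapsed duplicates).
    """
    if not busco_depths:
        raise ValueError("need at least one depth value")
    return median(busco_depths)


def estimate_genome_size(total_read_bases, busco_depths):
    """Estimate genome size (bp) as total read bases / single-copy depth."""
    depth = single_copy_depth(busco_depths)
    if depth <= 0:
        raise ValueError("single-copy depth must be positive")
    return total_read_bases / depth


# Toy, made-up inputs (not values from the paper).
toy_depths = [48.0, 50.0, 50.0, 52.0, 98.0]  # one gene at roughly double depth
toy_total_bases = 5_000_000_000              # 5 Gb of reads

toy_size = estimate_genome_size(toy_total_bases, toy_depths)
print(f"Estimated genome size: {toy_size / 1e6:.0f} Mb")
```

Because the depth is measured at genes expected to be single-copy rather than averaged across the whole assembly, the estimate depends less on how complete or how collapsed the assembly is.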


Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Bragg JG & Edwards RJ (preprint): Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.
[bioRxiv]

Abstract

Background: Telopea speciosissima, the New South Wales waratah, is an Australian endemic woody shrub in the family Proteaceae. Waratahs have great potential as a model clade to better understand processes of speciation, introgression and adaptation, and are significant from a horticultural perspective. Findings: Here, we report the first chromosome-level reference genome for T. speciosissima. Combining Oxford Nanopore long-reads, 10x Genomics Chromium linked-reads and Hi-C data, the assembly spans 823 Mb (scaffold N50 of 69.0 Mb) with 91.2 % of Embryophyta BUSCOs complete. We introduce a new method in Diploidocus (https://github.com/slimsuite/diploidocus) for classifying, curating and QC-filtering assembly scaffolds. We also present a new tool, DepthSizer (https://github.com/slimsuite/depthsizer), for genome size estimation from the read depth of single copy orthologues and find that the assembly is 93.9 % of the estimated genome size. The largest 11 scaffolds contained 94.1 % of the assembly, conforming to the expected number of chromosomes (2n = 22). Genome annotation predicted 40,158 protein-coding genes, 351 rRNAs and 728 tRNAs. Our results indicate that the waratah genome is highly repetitive, with a repeat content of 62.3 %. Conclusions: The T. speciosissima genome (Tspe_v1) will accelerate waratah evolutionary genomics and facilitate marker assisted approaches for breeding. Broadly, it represents an important new genomic resource of Proteaceae to support the conservation of flora in Australia and further afield.
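
As a quick sanity check on the headline numbers, the rounded figures in the abstract can be combined to back out two quantities it does not state directly. This is only arithmetic on the rounded values quoted above, so the results are approximate and may differ slightly from the exact figures in the paper; the variable names are ours.

```python
# Rounded values quoted in the abstract.
assembly_mb = 823.0            # assembly span in Mb
fraction_of_estimate = 0.939   # assembly is 93.9 % of the estimated genome size
top11_fraction = 0.941         # largest 11 scaffolds hold 94.1 % of the assembly

# Genome size implied by the two figures above (approximate).
implied_genome_size_mb = assembly_mb / fraction_of_estimate

# Sequence contained in the 11 chromosome-scale scaffolds (approximate).
top11_mb = assembly_mb * top11_fraction

print(f"Implied genome size: ~{implied_genome_size_mb:.0f} Mb")
print(f"Largest 11 scaffolds: ~{top11_mb:.0f} Mb")
```

In other words, the quoted figures imply an estimated genome size of roughly 876 Mb, with roughly 774 Mb of the assembly sitting in the 11 largest scaffolds.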