Thursday, 18 March 2021

Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome

Our latest genome paper is now out at BMC Genomics. This was the second collaboration by the team behind the German Shepherd Dog genome last year, led by Bill Ballard. This time, we used a combination of BGI short reads, ONT long reads, and Hi-C scaffolding to generate a chromosome-length assembly. This is one of the most intact and complete dog genomes generated to date, and joins only a handful of published breed-specific chromosome-length assemblies.

The Basenji is particularly interesting as it sits at the base of the dog breed family tree, making it a good unbiased reference for future comparisons between breeds.

The paper also has a few nice nuggets for those interested in genome assembly. In particular, our initial assembly had an artefact where the entire mitochondrial genome got assembled (in two copies) into the middle of one of the nuclear chromosomes. It is not entirely clear why this happened, but it was inserted into a NUMT (nuclear mitochondrial DNA insertion) fragment at that location. To make finding such things easier, we’ve released a new NUMT finding tool, NUMTFinder.

As with our previous dog genome, the German Shepherd Dog, we also observe that the tandem repeat of Amy2B (Amylase Alpha 2B) genes was assembled intact but with fewer copies than are present in the actual genome. (This gene is of interest for dog domestication and adaptations to a starch-rich diet.) Crucially, without looking at the raw sequencing data, it would not have been clear that the assembly under-represents Amy2B copy number. This kind of analysis can be repeated using the regcheck or regcnv run modes of Diploidocus, which estimates the copy number of a region based on its read depth versus the single-copy read depth determined from BUSCO single-copy complete genes.
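The depth-ratio idea can be sketched in a few lines of Python. This is a simplified illustration rather than the Diploidocus implementation: the function name, the use of the median as the single-copy depth estimate, and the example depths are all assumptions made for the sketch.

```python
from statistics import median

def estimate_copy_number(region_depths, single_copy_depths):
    """Estimate the copy number of a region from read depth.

    region_depths: per-base (or per-window) read depths across the region.
    single_copy_depths: read depths of single-copy loci, e.g. one value per
        BUSCO single-copy complete gene (hypothetical input format).
    """
    if not region_depths or not single_copy_depths:
        raise ValueError("Both depth lists must be non-empty")
    # Assumption: the median is used as a robust single-copy depth estimate.
    single_copy_depth = median(single_copy_depths)
    if single_copy_depth <= 0:
        raise ValueError("Single-copy depth must be positive")
    region_depth = sum(region_depths) / len(region_depths)
    # A region at N times the single-copy depth is present in roughly N copies.
    return region_depth / single_copy_depth

# Made-up depths: a region at ~400X in a genome sequenced to ~50X
print(estimate_copy_number([398, 402, 400], [48, 50, 52, 50, 49]))
```

If the assembled region already contains several copies of the repeat unit, the ratio gives the average number of genomic copies per assembled copy, so a value well above 1 flags a collapsed repeat.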

Overall, this presents a nice case study of the need for a bit of TLC and manual curation, even when you have some very impressive completeness and contiguity statistics.

Edwards RJ, Field MA, Ferguson JM, Dudchenko O, Keilwagen J, Rosen BD, Johnson GS, Rice ES, Hillier L, Hammond JM, Towarnicki SG, Omer A, Khan R, Skvortsova K, Bogdanovic O, Zammit RA, Lieberman Aiden E, Warren WC & Ballard JWO (2021): Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome. BMC Genomics 22:188


Background: Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine the impact the choice of reference genome makes with regard to reference genome quality and breed relatedness.

Results: Here, we report two high quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high quality CanFam_GSD assembly. By aligning short read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection.

Conclusions: The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids is more suited to a pangenome approach. Collectively this work highlights the importance the choice of reference genome makes in all variation studies.

Friday, 4 December 2020

EdwardsLab at #AusEvo2020

If you missed his talk at ABACBS2020, Jack will be presenting today at the Australasian Evolution Society 2020 Conference about The role of gene duplication in the evolution of snake venoms. Two conference presentations in two weeks - not a bad way to prepare for your Honours viva post-submission. Well done, Jack!

Also, Kat Stuart will be presenting her work on invasive starlings in Zoom 2 at 13:00 AEDT. Kat’s talks are always great to listen to:

  • Katarina Stuart: What drives invasion success? Using historical museum samples to examine evolution in an invasive passerine.

Wednesday, 25 November 2020

#ABACBS2020: Unsupervised orthologous gene tree enrichment for cost-effective phylogenomic analysis and a test case on waratahs (Telopea spp.)

Stephanie Chen, Maurizio Rossetto, Marlien van der Merwe, Hervé Sauquet, Patricia Lu-Irving, Jia-Yee Yap, William Studley, Greg Bourke, Jason Bragg, Richard J. Edwards


Whole-genome shotgun sequencing is becoming increasingly common in phylogenetic research due to the falling cost of whole genome sequencing compared to traditional methods which target subsets of genomes. However, there are few existing packages for assembling putatively orthologous loci from evolutionarily diverged samples and making alignments for phylogenetic analysis from these data. Additionally, although short-read Illumina sequencing data are highly accurate, it can be difficult to draw out meaningful phylogenomic inferences at low coverage, especially for non-model organisms for which there is no reference genome available.

We have developed a scalable method of rapidly generating species trees from short-read data without the need for a reference genome. The workflow involves (1) de novo genome assembly with ABySS at a range of k values, (2) extracting the most complete BUSCO (Benchmarking Universal Single-Copy Orthologs) genes from each set of assemblies with the BUSCO Compiler and Comparison tool (BUSCOMP), (3) generating gene trees, and (4) constructing a species tree.
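Step (2) amounts to picking, for each BUSCO gene, the best hit across a set of assemblies of the same sample. The Python sketch below illustrates that selection only; it is not BUSCOMP's actual logic, and the input structure, status ranking, and tie-breaking by length are assumptions made for illustration.

```python
# Hypothetical ranking of BUSCO statuses, best first (an assumption, not BUSCOMP's rules).
RANK = {"Complete": 3, "Duplicated": 2, "Fragmented": 1, "Missing": 0}

def best_busco_hits(assemblies):
    """Return {busco_id: assembly_name} for the best hit of each BUSCO gene.

    assemblies: {assembly_name: {busco_id: (status, length)}}, e.g. one entry
        per ABySS k value for a single sample (hypothetical format).
    """
    best = {}
    for name, hits in assemblies.items():
        for busco_id, (status, length) in hits.items():
            # Compare by status rank first, then by hit length.
            key = (RANK[status], length)
            if busco_id not in best or key > best[busco_id][0]:
                best[busco_id] = (key, name)
    # Drop genes that are missing from every assembly.
    return {b: name for b, (key, name) in best.items() if key[0] > 0}

# Made-up example: two assemblies of one sample at different k values
example = {
    "k41": {"B1": ("Complete", 900), "B2": ("Fragmented", 300)},
    "k61": {"B1": ("Complete", 950), "B2": ("Missing", 0)},
}
print(best_busco_hits(example))
```

The selected sequences for each gene would then be aligned across samples to build the gene trees in step (3).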

The workflow has been applied to a whole genome shotgun sequencing waratah (Telopea spp.) dataset of five species, comprising two samples from each of the seven lineages; there are three lineages of T. speciosissima (New South Wales waratah) – coastal, upland, and southern. We have also generated a reference genome for T. speciosissima, and examine the robustness of the workflow by comparison to a reference-based approach. It is anticipated that the workflow will maximise the recovery of informative data from genomic datasets for reproducible phylogenomic studies and be especially useful for non-model organisms.

#ABACBS2020: Whole transcripts in genome assembly, annotation, and assessment: the draft genome assembly of the globally invasive common starling, Sturnus vulgaris

Katarina Stuart, Yuanyuan Cheng, Lee Rollins & Richard J. Edwards


Native to the Palearctic, the common starling (Sturnus vulgaris) is a near-globally invasive passerine that has now colonised every continent barring Antarctica. Ecological interest in the species is two-fold – they are considered a conservation risk and crop pest within the invasive ranges, while recent decades have brought with them a worrying decline in starling numbers within historical native ranges. Despite the global interest in this species, there are still fundamental knowledge gaps in our understanding of the genetics and population differences of this species across their native and invasive range. We present the Australian S. vulgaris draft genome and transcriptome to be used as a reference for further investigation into evolutionary characterisation of this ecologically significant species. An initial 10x Genomics linked-read assembly was scaffolded and gap-filled with low coverage nanopore sequencing, complemented by PacBio Isoseq full-length transcript data. Isoseq data was incorporated into assembly scaffolding, annotation, and assembly assessment to inform workflow decisions. We produced a draft assembly with a scaffold N50 size of 72.5 Mb, and assess this alongside a North American S. vulgaris draft genome, previously assembled from Illumina data. Lastly, we use these different reference genomes, alongside a non-scaffolded version of the Australian S. vulgaris genome to assess how choice of reference genome affects common population genetic downstream analysis using a global whole genome resequencing data set.
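For readers less familiar with the scaffold N50 statistic quoted above: it is the length of the scaffold at which the running total, counting from the longest scaffold down, first reaches half of the total assembly length. A minimal Python sketch with made-up scaffold lengths (the function name is illustrative):

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("No lengths provided")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

# Made-up lengths: total 30, so N50 is reached at the second scaffold
print(n50([10, 8, 6, 4, 2]))
```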

Tuesday, 24 November 2020

#ABACBS2020: The role of gene duplication in the evolution of snake venoms

Jack Clarke, Vicki Thomson & Richard Edwards


Snakes are among the most venomous animals on the planet, using their venom for defence and the capturing of prey. Snake venoms have evolved independently of other venoms in other vertebrates, and there is considerable variation between species in their proteomic composition. One of the primary mechanisms through which snake venoms are thought to evolve is the duplication, recruitment and specialisation of proteins from other tissues. In some cases, this evolution is known to involve the tandem duplication of genes resulting in chromosomal clusters of venom genes in some gene families. We have recently sequenced and assembled the genomes of two highly venomous Australian snakes: Notechis scutatus (mainland tiger snake) and Pseudonaja textilis (eastern brown snake). In conjunction with publicly available proteomes from 10 other venomous snakes and 2 non-venomous snakes, these genomes provide an excellent opportunity to examine the role that duplication and neofunctionalisation have played in snake venom evolution.

We have analysed 43 protein families known to play a role in snake venom and examined their pattern of duplication in snakes, compared to high quality reference genomes of other reptiles and non-venomous vertebrates. We find evidence for extensive duplications across some of these families, but no clear enrichment for duplication in the evolution of venom specifically. Instead, we identify a trend where numerous duplications specific to venomous snakes occur in proteins that seem predisposed to evolve by duplication and specialisation, even in non-venomous vertebrates. A subset of high-quality snake genomes was then used to further explore the nature of duplications. While tandem gene duplication is evident in some larger families, it remains absent in many.

The snake venom metalloproteinase (SVMP) family provides an excellent case study, with multiple duplication events throughout its evolutionary history in vertebrates. Part of the broader ADAM (“a disintegrin and metalloproteinase”) family of single-pass transmembrane and secreted zinc proteases, SVMP appears to have expanded by independent tandem duplications in different snake lineages. We also identify a second ADAM subfamily, ADAM20, with an abundance of venomous snake-specific duplications. Ongoing work is exploring the possible role of ADAM20 proteins in snake venoms and the role that genome assembly quality has played in our ability to robustly detect the presence or absence of gene duplication events.

Monday, 2 November 2020

Antarctic desert soil bacteria exhibit high novel natural product potential, evaluated through long-read genome sequencing and comparative genomics

The third of our collaborative “controlled bacterial metagenome” de novo whole genome assembly projects was published in Environmental Microbiology in November. This was a fun collaboration with the Ferrari lab at UNSW, trying to maximise bang for buck by sequencing complete bacterial genomes with PacBio to identify biosynthetic gene clusters. As with a previous paper, we used pooled genomic DNA sequencing and were able to assemble complete genomes (and plasmids) for the 13 of 17 species that had sufficient depth of coverage. Coolest of all (if you'll excuse the pun), these were bugs from an Antarctic expedition! Head over to the Ferrari lab website to find out more about their research.

Benaud N, Edwards RJ, Amos TG, D’Agostino PM, Gutiérrez-Chávez C, Montgomery K, Nicetic I & Ferrari BC (2020). Antarctic desert soil bacteria exhibit high novel natural product potential, evaluated through long-read genome sequencing and comparative genomics. Environmental Microbiology.


Actinobacteria and Proteobacteria are important producers of bioactive natural products (NP), and these phyla dominate in the arid soils of Antarctica, where metabolic adaptations influence survival under harsh conditions. Biosynthetic gene clusters (BGCs), which encode NPs, are typically long and repetitious high G + C regions that are difficult to sequence with short‐read technologies. We sequenced 17 Antarctic soil bacteria from multi‐genome libraries, employing the long‐read PacBio platform, to optimize capture of BGCs and to facilitate a comprehensive analysis of their NP capacity. We report 13 complete bacterial genomes of high quality and contiguity, representing 10 different cold‐adapted genera including novel species. Antarctic BGCs exhibited low similarity to known compound BGCs (av. 31%), with an abundance of terpene, non‐ribosomal peptide and polyketide‐encoding clusters. Comparative genome analysis was used to map BGC variation between closely related strains from geographically distant environments. Results showed the greatest biosynthetic differences to be in a psychrotolerant Streptomyces strain, as well as a rare Actinobacteria genus, Kribbella, while two other Streptomyces spp. were surprisingly similar to known genomes. Streptomyces and Kribbella BGCs were predicted to encode antitumour, antifungal, antibacterial and biosurfactant‐like compounds, and the synthesis of NPs with antibacterial, antifungal and surfactant properties was confirmed through bioactivity assays.