Thursday 18 March 2021

Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome

Our latest genome paper is now out at BMC Genomics. This was a second collaboration by the team behind last year's German Shepherd Dog genome, led by Bill Ballard. This time, we used a combination of BGI short reads, ONT long reads, and Hi-C scaffolding to generate a chromosome-length assembly. This is one of the most intact and complete dog genomes generated to date, and joins only a handful of published breed-specific chromosome-length assemblies.

The Basenji is particularly interesting as it sits at the base of the dog breed family tree, making it a good unbiased reference for future comparisons between breeds.

The paper also has a few nice nuggets for those interested in genome assembly. Of particular interest, our initial assembly had an artefact where the entire mitochondrial genome was assembled (in two copies) into the middle of one of the nuclear chromosomes. It is not entirely clear why this happened, but the insertion sat within a NUMT (nuclear mitochondrial DNA insertion) fragment at that location. To make finding such things easier, we’ve released a new NUMT finding tool, NUMTFinder.

As with our previous dog genome, the German Shepherd Dog, we also observe that the tandem repeat of Amy2B (Amylase Alpha 2B) genes was assembled intact but with fewer copies than are present in the actual genome. (This gene is of interest for dog domestication and adaptations to a starch-rich diet.) Crucially, without looking at the raw sequencing data, it would not have been clear that the assembly under-represents Amy2B copy number. This kind of analysis can be repeated using the regcheck or regcnv run modes of Diploidocus, which estimate the copy number of a region from its read depth relative to the single-copy read depth determined from BUSCO single-copy complete genes.
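
To make the depth-ratio idea concrete, here is a minimal sketch of the arithmetic in Python. It is not the Diploidocus implementation: the function name, the use of the median as the depth summary, and the toy depth values are all assumptions made for illustration.

```python
from statistics import median


def estimate_copy_number(region_depths, single_copy_depths):
    """Estimate genomic copies per assembled copy of a region.

    Hypothetical helper: divides the typical read depth over the region
    by the typical read depth over known single-copy sequence (e.g.
    BUSCO single-copy complete genes). The median is an assumption here;
    other summaries of depth could be used instead.
    """
    if not region_depths or not single_copy_depths:
        raise ValueError("depth lists must be non-empty")
    single_copy_depth = median(single_copy_depths)
    if single_copy_depth <= 0:
        raise ValueError("single-copy depth must be positive")
    return median(region_depths) / single_copy_depth


# Toy per-base depths (invented numbers, not data from the paper).
busco_depths = [38, 40, 41, 39, 40, 42, 40]    # single-copy baseline
region_depths = [158, 160, 161, 162, 159]      # region of interest

copy_number = estimate_copy_number(region_depths, busco_depths)
print(f"Estimated copies per assembled copy: {copy_number:.1f}")
```

A ratio close to 1 suggests the assembled region represents a single genomic copy, while a ratio well above 1 suggests that reads from several genomic copies are piling up on one assembled copy, which is the signature of a collapsed repeat.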

Overall, this presents a nice case study of the need for a bit of TLC and manual curation, even when you have some very impressive completeness and contiguity statistics.

Edwards RJ, Field MA, Ferguson JM, Dudchenko O, Keilwagen J, Rosen BD, Johnson GS, Rice ES, Hillier L, Hammond JM, Towarnicki SG, Omer A, Khan R, Skvortsova K, Bogdanovic O, Zammit RA, Lieberman Aiden E, Warren WC & Ballard JWO (2021): Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome. BMC Genomics 22:188.


Background: Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine the impact the choice of reference genome makes with regard to reference genome quality and breed relatedness.

Results: Here, we report two high quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high quality CanFam_GSD assembly. By aligning short read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection.

Conclusions: The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids is more suited to a pangenome approach. Collectively this work highlights the importance the choice of reference genome makes in all variation studies.