Wednesday, 20 November 2024

Adventures in genome assembly curation and QC

It was great to have the opportunity to present some recent - and not so recent - work today to the Australian BioCommons Genomics Community meeting. You can find a link to the recording here, and the slides are up here if you missed it and are curious. I’ll add links to the tools discussed over time.

One of the highlights of my time in Australia has been the wonderful bioinformatics and genomics (and other) communities. I’ve been lucky to meet and work with so many awesome scientists and fantastic human beings.

Saturday, 16 November 2024

Repeat-Rich Regions Cause False-Positive Detection of NUMTs: A Case Study in Amphibians Using an Improved Cane Toad Reference Genome

Version 3* of the cane toad reference genome, aRhiMar1.3, is now officially out and published in Genome Biology and Evolution. [*It’s only the second published genome, but for internal reasons the original draft genome was version 2!]

The focus of the paper itself is confirming the lack of Nuclear Mitochondrial fragments (a.k.a. NUMTs) in the cane toad genome, which could impact whole-mitogenome analysis of genetic diversity in cane toads. We were pretty surprised when we first looked for NUMTs in the cane toad genome and could not find any! The draft genome is pretty drafty, especially in terms of missing repetitive regions, so an updated long-read assembly was important to rule out a false negative result.

The new genome is in much better shape, with an extra 922 Mbp (>95% repeats) and a 15x increase in scaffold N50 (to 2.5 Mbp). (My biggest regret with the original paper was not sticking to my guns and including an early DepthSizer genome size estimate of 3.5 Gbp, which has subsequently turned out to be correct.) The cane toad genome remains a tough nut to crack, and we didn’t quite reach the magic 1 Mb contig N50 (we got to 860 kb), but the functional completeness was markedly improved and we are pretty confident that the continued absence of NUMT detection is a real phenomenon and does not simply reflect technical limitations.
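For anyone less familiar with the N50 statistic quoted above, here is a minimal Python sketch of how it is usually calculated: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are made up for illustration; this is not the code used for the paper, and dedicated assembly-stats tools handle gaps and other details that this ignores.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths (hypothetical, not cane toad data): total = 300, half = 150.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # 100 + 80 = 180 >= 150, so this prints 80
```

The same function works for contigs or scaffolds; only the input lengths differ.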

Watch this space for a chromosome-level cane toad genome, which is still in the works.

Cheung K, Rollins LA, Hammond JM, Barton K, Ferguson JM, Eyck HJF, Shine R & Edwards RJ (2024): Repeat-rich regions cause false positive detection of NUMTs - a case study in amphibians using an improved cane toad reference genome. Genome Biology and Evolution evae246. [Gen Biol Evol] [bioRxiv] [PubMed]

Mitochondrial DNA (mtDNA) has been widely used in genetics research for decades. Contamination from nuclear DNA of mitochondrial origin (NUMTs) can confound studies of phylogenetic relationships and mtDNA heteroplasmy. Homology searches with mtDNA are widely used to detect NUMTs in the nuclear genome. Nevertheless, false-positive detection of NUMTs is common when handling repeat-rich sequences, while fragmented genomes might result in missing true NUMTs. In this study, we investigated different NUMT detection methods and how the quality of the genome assembly affects them. We presented an improved nuclear genome assembly (aRhiMar1.3) of the invasive cane toad (Rhinella marina) with additional long-read Nanopore and 10× linked-read sequencing. The final assembly was 3.47 Gb in length with 91.3% of tetrapod universal single-copy orthologs (n = 5,310), indicating the gene-containing regions were well assembled. We used 3 complementary methods (NUMTFinder, dinumt, and PALMER) to study the NUMT landscape of the cane toad genome. All 3 methods yielded consistent results, showing very few NUMTs in the cane toad genome. Furthermore, we expanded NUMT detection analyses to other amphibians and confirmed a weak relationship between genome size and the number of NUMTs present in the nuclear genome. Amphibians are repeat-rich, and we show that the number of NUMTs found in highly repetitive genomes is prone to inflation when using homology-based detection without filters. Together, this study provides an exemplar of how to robustly identify NUMTs in complex genomes when confounding effects on mtDNA analyses are a concern.

Tuesday, 5 November 2024

#ABACBS2024 Poster 102: Improving phased Hifiasm assemblies with 20 kb ONT reads

After a great presentation this morning by Emma de Jong on our High-Quality Genomes for Australian Lutjanidae Species (abstract below), if you’re at ABACBS2024 then please drop by Poster #102 to find out about some of the work we’re doing with ONT data.

Abstracts

Improving phased Hifiasm assemblies with 20 kb ONT reads

Richard J Edwards, Adrianne Doran, Emma de Jong, Lara Parata, Shannon Corrigan

The quality and quantity of genome assemblies have improved dramatically over recent years. Many large-scale genome projects combine assembly of HiFi and HiC reads using Hifiasm to produce contiguous phased assemblies, scaffolded to chromosome-level. Nevertheless, HiFi reads are typically under 25 kb and can still struggle to assemble long, low diversity repeat regions. Obtaining ultra-long (100 kb or longer) ONT reads to solve this problem remains a significant challenge due to technical constraints and DNA sample requirements. Here, we explore the utility of using standard ONT long reads (20 kb or more) as “ultra-long” input to improve phased Hifiasm assemblies for 22 species of bony fish (genome size 627 Mb to 1.54 Gb). We also explore whether the new --telo-m mode in Hifiasm v0.9.0 improves telomere prediction. Incorporating 20kb+ ONT reads (7.8X to 93.5X) significantly increased assembly contiguity. BUSCO Completeness was not significantly altered, although there was some re-partitioning of BUSCO genes between phased haplotypes for some species. Improvement did not strongly correlate with read depth (either HiFi or ONT), suggesting that the underlying read length distributions and/or specific genome features are more important for determining the outcome. Hifiasm --telo-m mode significantly increased telomere recovery, assembling over six times the number of gapless telomere-to-telomere chromosomes when combined with 20kb+ ONT reads. Verification of how these results translate to the ease of curation and/or quality of final HiC-scaffolded chromosome-level assemblies is ongoing, with a goal to determine whether the additional sample preparation and sequencing in the lab is cost-effective.
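As a rough illustration of what the depth of 20kb+ ONT reads means in the abstract above, the sketch below keeps only reads at or above a length cutoff and divides their summed length by the genome size. The function name, the default cutoff argument, and the toy numbers are assumptions for illustration only; the poster does not describe how depth was actually calculated.

```python
def long_read_depth(read_lengths, genome_size, min_length=20_000):
    """Approximate sequencing depth (X) from reads of at least min_length bp.

    Depth here is simply total retained bases divided by genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    retained_bases = sum(n for n in read_lengths if n >= min_length)
    return retained_bases / genome_size


# Toy example (hypothetical numbers): the 18 kb read is excluded,
# leaving 25 kb + 40 kb + 35 kb = 100 kb over a 10 kb "genome".
toy_reads = [25_000, 18_000, 40_000, 35_000]
print(long_read_depth(toy_reads, genome_size=10_000))  # prints 10.0
```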

High-Quality Genomes for Australian Lutjanidae Species

Emma de Jong, Lara Parata, Philipp E Bayer, Shannon Corrigan, Richard J Edwards

Lutjanidae (snappers) are highly valued in commercial and recreational fisheries worldwide and serve as indicator species of the health of marine environments and fishery bioregions in Western Australia. Comprehensive genomic mapping of immune gene families of Lutjanidae species is lacking, but this information is critical for understanding disease vulnerability and the impact of environmental stress, improving aquaculture efforts, and providing insights into the health of wild populations. Despite their importance, only 3 out of 113 Lutjanid species currently have available reference genomes, two of which are highly fragmented (>11,000 and >200,000 contigs), impacting studies on gene families relevant to aquaculture. In this study, we present high-quality chromosome-level reference genomes for 14 Australian Lutjanidae species across seven genera, generated using HiFi and HiC data. We present initial comparative genomic analyses, including immune gene content and chromosomal synteny analyses across species. These analyses provide insights into the genomic architecture and evolutionary relationships within Lutjanidae. Ongoing work aims to comprehensively map and compare the immune gene family repertoire across Lutjanidae genera, as well as Lethrinidae species as an outgroup, to determine genus-specific changes in genes (e.g. loss, selection, duplication) important for pathogen detection, antigen presentation, inflammation, and immune memory. These genome assemblies will serve as a foundational resource to the wider scientific community interested in Lutjanidae