Tuesday, 17 December 2024

#AusEvol2024 - Depth-based correction of gene duplications and losses in genome assemblies

The Australasian Evolution Society conference has always been one of my favourites, due to its laid-back culture of inclusivity and kindness. (And low cost!) It therefore feels quite fitting that my last conference as an Aussie academic was AES2024.

This talk was a bit of an update from my AES2021 presentation. It showcased some of the latest additions to DepthKopy, including depth-based copy number correction of genome features, such as rDNA genes, repeat families, or multicopy genes. This includes a feature that classifies multicopy “Duplicated” genes identified by BUSCO as true (biological) or false (artefactual) duplicates. TL;DR version: analysis of draft genome assemblies for 45 species of fish across five different depths/qualities indicates that DepthKopy can correct the copy number and total length of multicopy features to within 10% of the true number. (The lower-quality raw assemblies ranged from a 30% under-estimate to a 60% over-estimate.)

This will be of most importance when low-quality draft genomes are included in a comparative genomics analysis. However, even the best genome assemblies appear to have some “collapsed” or duplicated loci where the copy number in the assembly does not accurately reflect the copy number in the genome. DepthKopy is useful for exploring the magnitude of such disparities, and can help to identify and correct specific discrepancies in genes or features of interest.
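For anyone unfamiliar with the general idea behind depth-based copy number correction, here is a minimal sketch. It is not DepthKopy's actual implementation: the function names and numbers are made up for illustration, and it assumes the single-copy read depth has already been estimated elsewhere. The principle is simply that a region whose read depth is three times the single-copy depth probably represents three copies in the genome, however many copies made it into the assembly.

```python
from statistics import median


def estimate_copy_number(feature_depths, single_copy_depth):
    """Estimate copy number as (median depth over the feature) / (single-copy depth).

    Hypothetical helper for illustration only; not part of DepthKopy.
    """
    if single_copy_depth <= 0:
        raise ValueError("single_copy_depth must be positive")
    return median(feature_depths) / single_copy_depth


def corrected_length(assembled_length, copy_number):
    """Scale the assembled length of a feature by its estimated copy number."""
    return assembled_length * copy_number


# Made-up example: a 1 kb repeat assembled once ("collapsed"),
# with per-base read depths around 90X in a genome sequenced at 30X.
single_copy_depth = 30.0
collapsed_repeat_depths = [88, 91, 90, 92, 89]

copy_number = estimate_copy_number(collapsed_repeat_depths, single_copy_depth)
print(copy_number)                          # 3.0
print(corrected_length(1000, copy_number))  # 3000.0
```

The median is used here because it is less sensitive to a few bases with extreme depth than the mean; a real tool needs to be far more careful about how the single-copy depth is estimated and about regions with poor read mapping.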

Tuesday, 3 December 2024

So long, Ocean Genomes... and thanks for all the fish!

After a successful couple of years, today was my last day at UWA. I am proud of the team that I helped to build at the Minderoo OceanOmics Centre at UWA, and the things we have accomplished together. The UWA Oceans Institute has been a fantastic place to work, and I look forward to completing some exciting ongoing collaborations in my capacity as adjunct. It's been exciting to see Ocean Genomes grow from a concept with a largely empty lab to a fully-fledged genome factory capable of generating multiple high-quality genomes a week. The associated publications should hopefully follow soon.

Developments in DNA sequencing technology over the past few years have been immense, but the most impressive part for me has been witnessing the laboratory technical team optimising the sample preparations for sequencing. Everything gets so much harder when you move from human samples (the focus of most methods development and testing) into non-model organisms, and I am convinced that the quality of genomes we’ve been producing is in large part due to the quality of the DNA going into the sequencers.

I am now looking for my next challenge and am officially Open For Work. We’ll be moving back to Dublin at the end of January. If you are based in Ireland and need an experienced interdisciplinary problem solver with broad expertise across bioinformatics and biomolecular science, please get in touch! Academic and non-academic opportunities are welcome.

Wednesday, 20 November 2024

Adventures in genome assembly curation and QC

It was great to have the opportunity to present some recent - and not so recent - work today to the Australian BioCommons Genomics Community meeting. You can find a link to the recording here, and the slides are up here if you missed it and are curious. I’ll add some additional links to the tools discussed with time.

One of the highlights of my time in Australia has been the awesome bioinformatics and genomics (and other) communities. I’ve been lucky to meet and work with so many awesome scientists and fantastic human beings.

Saturday, 16 November 2024

Repeat-Rich Regions Cause False-Positive Detection of NUMTs: A Case Study in Amphibians Using an Improved Cane Toad Reference Genome

Version 3* of the cane toad reference genome, aRhiMar1.3, is now officially out and published in Genome Biology and Evolution. [*It’s only the second published genome, but for internal reasons the original draft genome was version 2!]

The focus of the paper itself is confirming the lack of Nuclear Mitochondrial fragments (a.k.a. NUMTs) in the cane toad genome, which could impact whole-mitogenome analysis of genetic diversity in cane toads. We were pretty surprised when we first looked for NUMTs in the cane toad genome and could not find any! The draft genome is pretty drafty, especially in terms of missing repetitive regions, so an updated long-read assembly was important to rule out a false negative result.

The new genome is in much better shape, with an extra 922 Mbp (>95% repeats) and a 15x increase in scaffold N50 (2.5 Mbp). (My biggest regret with the original paper was not sticking to my guns and including an early DepthSizer genome size estimate of 3.5 Gbp, which has subsequently turned out to be correct.) The cane toad genome remains a tough nut to crack, and we didn’t quite reach the magic 1 Mb contig N50 (860 kb), but the functional completeness was markedly improved and we are pretty confident that the continued absence of NUMT detection is a real phenomenon and does not simply reflect technical limitations.
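For readers new to assembly statistics, N50 is the length such that sequences of that length or longer contain at least half of the total assembly. A minimal sketch of the standard calculation follows; the function name and toy lengths are illustrative and have nothing to do with the cane toad data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total length 30, so we need to cover at least 15.
# 10 alone is not enough; 10 + 8 = 18 is, so the N50 is 8.
print(n50([10, 8, 5, 4, 3]))  # 8
```

The same function works for contig or scaffold N50; only the input lengths differ.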

Watch this space for a chromosome-level cane toad genome, which is still in the works.

Cheung K, Rollins LA, Hammond JM, Barton K, Ferguson JM, Eyck HJF, Shine R & Edwards RJ (2024): Repeat-rich regions cause false positive detection of NUMTs - a case study in amphibians using an improved cane toad reference genome. Genome Biology and Evolution evae246. [Gen Biol Evol] [bioRxiv] [PubMed]

Mitochondrial DNA (mtDNA) has been widely used in genetics research for decades. Contamination from nuclear DNA of mitochondrial origin (NUMTs) can confound studies of phylogenetic relationships and mtDNA heteroplasmy. Homology searches with mtDNA are widely used to detect NUMTs in the nuclear genome. Nevertheless, false-positive detection of NUMTs is common when handling repeat-rich sequences, while fragmented genomes might result in missing true NUMTs. In this study, we investigated different NUMT detection methods and how the quality of the genome assembly affects them. We presented an improved nuclear genome assembly (aRhiMar1.3) of the invasive cane toad (Rhinella marina) with additional long-read Nanopore and 10× linked-read sequencing. The final assembly was 3.47 Gb in length with 91.3% of tetrapod universal single-copy orthologs (n = 5,310), indicating the gene-containing regions were well assembled. We used 3 complementary methods (NUMTFinder, dinumt, and PALMER) to study the NUMT landscape of the cane toad genome. All 3 methods yielded consistent results, showing very few NUMTs in the cane toad genome. Furthermore, we expanded NUMT detection analyses to other amphibians and confirmed a weak relationship between genome size and the number of NUMTs present in the nuclear genome. Amphibians are repeat-rich, and we show that the number of NUMTs found in highly repetitive genomes is prone to inflation when using homology-based detection without filters. Together, this study provides an exemplar of how to robustly identify NUMTs in complex genomes when confounding effects on mtDNA analyses are a concern.

Tuesday, 5 November 2024

#ABACBS2024 Poster 102: Improving phased Hifiasm assemblies with 20 kb ONT reads

After a great presentation this morning by Emma de Jong on our High-Quality Genomes for Australian Lutjanidae Species (abstract below), if you’re at ABACBS2024 then please drop by Poster #102 to find out about some of the work we’re doing with ONT data.

Abstracts

Improving phased Hifiasm assemblies with 20 kb ONT reads

Richard J Edwards, Adrianne Doran, Emma de Jong, Lara Parata, Shannon Corrigan

The quality and quantity of genome assembly has improved dramatically over recent years. Many large-scale genome projects combine assembly of HiFi and HiC reads using Hifiasm to produce contiguous phased assemblies, scaffolded to chromosome-level. Nevertheless, HiFi reads are typically under 25 kb and can still struggle to assemble long, low diversity repeat regions. Obtaining ultra-long (100 kb or longer) ONT reads to solve this problem remains a significant challenge due to technical constraints and DNA sample requirements. Here, we explore the utility of using standard ONT long reads (20 kb or more) as “ultra-long” input to improve phased Hifiasm assemblies for 22 species of bony fish (Genome Size, 627 Mb to 1.54 Gb). We also explore whether the new --telo-m mode in Hifiasm v0.9.0 improves telomere prediction. Incorporating 20kb+ ONT reads (7.8X to 93.5X) significantly increased assembly contiguity. BUSCO Completeness was not significantly altered, although there was some re-partitioning of BUSCO genes between phased haplotypes for some species. Improvement did not strongly correlate with read depth (either HiFi or ONT), suggesting that the underlying read length distributions and/or specific genome features are more important for determining the outcome. Hifiasm --telo-m mode significantly increased telomere recovery, assembling over six times the number of gapless telomere-to-telomere chromosomes when combined with 20kb+ ONT reads. Verification of how these results translate to the ease of curation and/or quality of final HiC-scaffolded chromosome-level assemblies is ongoing, with a goal to determine whether the additional sample preparation and sequencing in the lab is cost-effective.

High-Quality Genomes for Australian Lutjanidae Species

Emma de Jong, Lara Parata, Philipp E Bayer, Shannon Corrigan, Richard J Edwards

Lutjanidae (snappers) are highly valued in commercial and recreational fisheries worldwide and serve as indicator species of the health of marine environments and fishery bioregions in Western Australia. Comprehensive genomic mapping of immune gene families of Lutjanidae species is lacking, but this information is critical for understanding disease vulnerability and the impact of environmental stress, improving aquaculture efforts, and providing insights into the health of wild populations. Despite their importance, only 3 out of 113 Lutjanid species currently have available reference genomes, two of which are highly fragmented (>11,000 and >200,000 contigs), impacting studies on gene families relevant to aquaculture. In this study, we present high-quality chromosome-level reference genomes for 14 Australian Lutjanidae species across seven genera, generated using HiFi and HiC data. We present initial comparative genomic analyses, including immune gene content and chromosomal synteny analyses across species. These analyses provide insights into the genomic architecture and evolutionary relationships within Lutjanidae. Ongoing work aims to comprehensively map and compare the immune gene family repertoire across Lutjanidae genera, as well as Lethrinidae species as an outgroup, to determine genus-specific changes in genes (e.g. loss, selection, duplication) important for pathogen detection, antigen presentation, inflammation, and immune memory. These genome assemblies will serve as a foundational resource to the wider scientific community interested in Lutjanidae.

Thursday, 19 September 2024

#PAGAustralia Poster 14: Synteny-Guided Semi-Automated Curation of Chromosome-Level Genome Assemblies

If you are attending PAG Australia 2024, come and have a chat at Poster 14 about easing the burden of chromosome-level assembly curation.

Abstract Text

Reference genomes are fundamental resources that underpin research across most aspects of modern biology. Technological improvements in the length and accuracy of long-read sequencing platforms, combined with Hi-C proximity ligation sequencing, have enabled the routine generation of highly contiguous phased assemblies, scaffolded to chromosome-level. Nevertheless, automated generation of perfect gapless “telomere-to-telomere” assemblies remains out of reach for most eukaryotic organisms. Scaffolding errors and false duplications can still occur, and assembly curation is now the main bottleneck for large-scale assembly projects. Here, I present a streamlined data workflow and scaffolding assessment for high-throughput manual curation of chromosome-level genome assemblies. Synteny between haplotypes, or closely related species, is combined with read mapping to orient, pair and visualise assembled chromosomes. Assembly gaps are classified according to scaffolding confidence, highlighting candidates for simple scaffolding corrections, such as inversions. Synteny visualisation, gap classification, and HiC contact maps are then combined to identify and document scaffolding edits with increased speed, precision and confidence. This accelerates the production of curated chromosome-level assemblies, and enables the identification of regions of the assembly that may require further attention. Individual tools used in the workflow (ChromSyn, Telociraptor, SynBad, PAFScaff, DepthKopy and DepthCharge) are available at https://github.com/slimsuite/.

Thursday, 5 September 2024

Origin and maintenance of large ribosomal RNA gene repeat size in mammals

Our latest paper is out as a Featured Article in the journal Genetics, featuring ONT data from both the cane toad and BABS Genome snake genomes. This paper looks at how ribosomal RNA gene repeats (a.k.a. rDNA repeats) have evolved in vertebrates to expand in size in mammals. For something so fundamental to the function of an organism - literally every process of every cell ultimately relies on rRNA - there is surprising diversity. These regions are traditionally hard to assemble with short reads, and still provide challenges for long-read assemblies, so the new era of high-quality long-read assemblies is likely to reveal a lot about their evolution.

  • Macdonald E, Whibley A, Waters PD, Patel H, Edwards RJ & Ganley ARD (2024): Origin and maintenance of large ribosomal RNA gene repeat size in mammals. Genetics 228(1): iyae121 [Genetics] [PubMed]

Abstract

The genes encoding ribosomal RNA are highly conserved across life and in almost all eukaryotes are present in large tandem repeat arrays called the rDNA. rDNA repeat unit size is conserved across most eukaryotes but has expanded dramatically in mammals, principally through the expansion of the intergenic spacer region that separates adjacent rRNA coding regions. Here, we used long-read sequence data from representatives of the major amniote lineages to determine where in amniote evolution rDNA unit size increased. We find that amniote rDNA unit sizes fall into two narrow size classes: “normal” (∼11–20 kb) in all amniotes except monotreme, marsupial, and eutherian mammals, which have “large” (∼35–45 kb) sizes. We confirm that increases in intergenic spacer length explain much of this mammalian size increase. However, in stark contrast to the uniformity of mammalian rDNA unit size, mammalian intergenic spacers differ greatly in sequence. These results suggest a large increase in intergenic spacer size occurred in a mammalian ancestor and has been maintained despite substantial sequence changes over the course of mammalian evolution. This points to a previously unrecognized constraint on the length of the intergenic spacer, a region that was thought to be largely neutral. We finish by speculating on possible causes of this constraint.

Tuesday, 9 July 2024

New pre-print: The Genomics for Australian Plants (GAP) framework initiative – developing genomic resources for understanding the evolution and conservation of the Australian flora

The Bioplatforms Australia Genomics for Australian Plants (GAP) initiative aims to sequence and assemble representative genomes of Australia’s unique flora, which boasts over 24,000 native vascular plant species evolved over millions of years. The program brings together academic groups, herbaria and botanic gardens from across the country to build genomic capacity and create valuable resources for the classification, conservation and utilisation of Australian plants. We were lucky enough to sequence one of the first GAP species, the NSW Waratah. Now, the capstone paper outlining the project and its key findings from multiple species is out as a pre-print at EcoEvoRxiv:

Simpson L, Cantrill DJ, Byrne M, Allnutt TR, King GJ, Lum M, Al Bkhetan Z, Andrew R, Baker WJ, Barrett MD, Batley J, Berry O, Binks RM, Bragg JG, Broadhurst L, Brown G, Bruhl J, Edwards RJ, Ferguson S, Forest F, Gustafsson J, Hammer TA, Holmes GD, Jackson CJ, James EA, Jones A, Kersey PJ, Leitch IJ, Maurin O, McLay TGB, Murphy DJ, Nargar K, Nauheimer L, Sauquet H, Schmidt-Lebuhn AN, Shepherd KA, Syme AE, Waycott M, Wilson TC, Crayn DM (preprint): The Genomics for Australian Plants (GAP) framework initiative – developing genomic resources for understanding the evolution and conservation of the Australian flora. EcoEvoRxiv DOI: https://doi.org/10.32942/X2RP70

The generation and analysis of genome-scale data—genomics—is driving a rapid increase in plant biodiversity knowledge. However, the speed and complexity of technological advance in genomics presents challenges for its widescale use in evolutionary and conservation biology. Here, we introduce and describe a national-scale collaboration conceived to build genomic resources and capability for understanding the Australian flora: the Genomics for Australian Plants (GAP) Framework Initiative. We outline (a) the history of the project including the collaborative framework, partners, and funding; (b) GAP principles such as rigour in design, sample verification and documentation, data management, and data accessibility; and (c) the structure of the consortium and its four activity streams (reference genomes, phylogenomics, conservation genomics, and training), with the rationale and aims for each of them. We show, through discussion of its successes and challenges, the value of this multi-institutional consortium approach and the enablers, such as well-curated collections and national collaborative research infrastructure, all of which have led to a substantial increase in capacity and delivery of biodiversity knowledge outcomes.

The initiative is about more than just reference genomes, with core activity in phylogenomics, conservation genomics and training too. For more information on the project and the resources generated (with more to come), read the paper and/or visit the GAP website.

Friday, 15 March 2024

Towards telomere-to-telomere fish genomes with Oxford Nanopore Technologies gap-filling

It was a pleasure to be invited to the “What You’re Missing Matters” tour at Perth, and present some ongoing work investigating the best way to incorporate Oxford Nanopore Technologies data into our high-quality HiFi+HiC fish genomes. The results are too preliminary to share here (and will soon be superseded) but do get in touch if it sounds interesting to you.

Thursday, 7 March 2024

Is developmental plasticity triggered by DNA methylation changes in the invasive cane toad (Rhinella marina)?

Second cane toad paper of the week! This time, we're revisiting invasive epigenomics.

Yagound B, Sarma RR, Edwards RJ, Richardson MF, Rodriguez Lopez CM, Crossland MR, Brown GP, DeVore JL, Shine R & Rollins LA (2024): Is developmental plasticity triggered by DNA methylation changes in the invasive cane toad (Rhinella marina)? Ecology and Evolution 14:e11127. [Ecol Evol] [PubMed] [bioRxiv]

Many organisms can adjust their development according to environmental conditions, including the presence of conspecifics. Although this developmental plasticity is common in amphibians, its underlying molecular mechanisms remain largely unknown. Exposure during development to either ‘cannibal cues’ from older conspecifics, or ‘alarm cues’ from injured conspecifics, causes reduced growth and survival in cane toad (Rhinella marina) tadpoles. Epigenetic modifications, such as changes in DNA methylation patterns, are a plausible mechanism underlying these developmental plastic responses. Here we tested this hypothesis, and asked whether cannibal cues and alarm cues trigger the same DNA methylation changes in developing cane toads. We found that exposure to both cannibal cues and alarm cues was associated with local changes in DNA methylation patterns. These DNA methylation changes affected genes putatively involved in developmental processes, but in different genomic regions for different conspecific-derived cues. Genetic background explains most of the epigenetic variation among individuals. Overall, the molecular mechanisms triggered by exposure to cannibal cues seem to differ from those triggered by alarm cues. Studies linking epigenetic modifications to transcriptional activity are needed to clarify the proximate mechanisms that regulate developmental plasticity in cane toads.

Monday, 4 March 2024

Whole-mitogenome analysis unveils previously undescribed genetic diversity in cane toads across their invasion trajectory

Congratulations to Kelton Cheung for getting her first PhD paper out. This one has been a long time brewing and involved quite a lot of data wrangling, but we got there in the end. Invasive cane toads might be a little more complex than we thought.

Cheung K, Amos TG, Shine R, DeVore JL, Ducatez S, Edwards RJ & Rollins LA (2024): Whole-mitogenome analysis unveils previously undescribed genetic diversity in cane toads across their invasion trajectory. Ecology and Evolution 14:e11115. [Ecol Evol] [PubMed] [bioRxiv]

Invasive species offer insights into rapid adaptation to novel environments. The iconic cane toad (Rhinella marina) is an excellent model for studying rapid adaptation during invasion. Previous research using the mitochondrial NADH dehydrogenase 3 (ND3) gene in Hawai’ian and Australian invasive populations found a single haplotype, indicating an extreme genetic bottleneck following introduction. Nuclear genetic diversity also exhibited reductions across the genome in these two populations. Here, we investigated the mitochondrial genomics of cane toads across this invasion trajectory. We created the first reference mitochondrial genome for this species using long-read sequence data. We combined whole-genome resequencing data of 15 toads with published transcriptomic data of 125 individuals to construct nearly complete mitochondrial genomes from the native (French Guiana) and introduced (Hawai’i and Australia) ranges for population genomic analyses. In agreement with previous investigations of these populations, we identified genetic bottlenecks in both Hawai’ian and Australian introduced populations, alongside evidence of population expansion in the invasive ranges. Although mitochondrial genetic diversity in introduced populations was reduced, our results revealed that it had been underestimated: we identified 45 mitochondrial haplotypes in Hawai’ian and Australian samples, none of which were found in the native range. Additionally, we identified two distinct groups of haplotypes from the native range, separated by a minimum of 110 base pairs (0.6%). These findings enhance our understanding of how invasion has shaped the genetic landscape of this species.

Sunday, 28 January 2024

Toward genome assemblies for all marine vertebrates: current landscape and challenges

The first Ocean Genomes paper is now out! This one is a small commentary piece, but some high-quality genomes are on their way - watch this space. Well, actually, watch this space at Genomes on a Tree!

de Jong E, Parata L, Bayer PE, Corrigan S & Edwards RJ (2024): Toward genome assemblies for all marine vertebrates: current landscape and challenges. Gigascience 13:giad119. [Gigascience] [PubMed]

Marine vertebrate biodiversity is fundamental to ocean ecosystem health but is threatened by climate change, overharvesting, and habitat degradation. High-quality reference genomes are valuable foundational scientific resources that can inform conservation efforts. Consequently, global consortia are striving to produce reference genomes for representatives of all life. Here, we summarize the current landscape of available marine vertebrate reference genomes, including their phylogenetic diversity and geographic hotspots of production. We discuss key logistical and technical challenges that remain to be overcome if we are to realize the vision of a comprehensive reference genome library of all marine vertebrates.

Thursday, 4 January 2024

Happy New Year! Updates coming soon...

It's been a really busy year or so, getting the Minderoo OceanOmics Centre at UWA fully staffed and operational, and blog content has suffered as a result. Stay tuned for a backlog of publications (see Publications tab) and other news - including some of the first outputs to come out of Ocean Genomes, and exciting updates to the Oceans Institute strategy.