We have three Pawsey student internships available this summer with the Ocean Genomes Laboratory in the Minderoo OceanOmics Centre at UWA. Closing date: 07 August 2023 at 17:00 AWST (Perth time). This is a 10-week, paid program open to exceptional undergraduate (2nd/3rd year), Honours, Master’s and PhD students. Apply at the CSIRO Application page. Please get in touch if you want to know more or are interested in a student research project in the lab.
The biodiversity of marine vertebrates is critical for the health of our ocean ecosystems, but is under immediate threat from climate change, pollution, overfishing and habitat destruction. To advance our understanding of how best to protect and sustain our ocean life, global efforts are underway (such as the Vertebrate Genomes Project; VGP) to establish a complete library of high-quality reference genomes for all ~22,000 marine vertebrate species.
Reference genomes are pivotal not only for answering fundamental questions in marine biology and evolution, but also for guiding the conservation of species most at risk within our changing oceans, and for accurately monitoring biodiversity.
This project utilises data generated in-house by either Illumina short-read or PacBio high-fidelity long-read sequencing of Australian marine vertebrate species. The primary objective is to optimise analysis workflows on Pawsey, covering the entire life cycle of the data from its raw format to the final outcome of a high-quality assembled genome. We have data across a diverse range of species, covering small to large genome sizes.
A containerised Pawsey workflow for Diploidocus (Project #10)
Bioinformatics in general, and genomics specifically, is replete with complex workflows that do not translate easily to HPC. Genomics pipelines frequently incorporate many different tools and/or in-built functions with very different computational requirements in terms of multithreading, memory and IO. The Diploidocus genome curation pipeline exemplifies this problem: some lengthy single-processor steps build on data produced by highly parallelised tools, such as minimap2. As well as adapting a specific mission-critical tool, this project will help identify and establish some general principles for optimising genomics code and workflows to run on Setonix.
Diploidocus is a published genome curation and clean-up tool that utilises several different underlying bioinformatics tools and in-built algorithms. Different steps (and tools) in the pipeline have markedly different CPU, IO and memory requirements, including some lengthy non-parallelised portions. This makes it hard to run efficiently on HPC: a single job sized for the most demanding step leaves allocated resources idle during the serial steps, while a job sized for the serial steps fails to take advantage of parallelisation when it is available.
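To make the cost of this mismatch concrete, here is a minimal Python sketch comparing the core-hours charged when a mixed pipeline runs as one job sized for its most parallel step against running each step with its own allocation. The step names, core counts and run times are hypothetical illustrations, not measurements from Diploidocus or Setonix.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    cores: int    # cores the step can actually use
    hours: float  # wall-clock hours for the step


def monolithic_core_hours(steps):
    """One job sized for the most parallel step and held for the whole run."""
    peak_cores = max(s.cores for s in steps)
    return peak_cores * sum(s.hours for s in steps)


def per_step_core_hours(steps):
    """Each step requests only the cores it can use."""
    return sum(s.cores * s.hours for s in steps)


# Hypothetical figures, for illustration only.
steps = [
    Step("parallel read mapping", cores=64, hours=2.0),
    Step("single-threaded curation step", cores=1, hours=10.0),
]

mono = monolithic_core_hours(steps)
split = per_step_core_hours(steps)
print(f"monolithic job: {mono:.0f} core-hours")
print(f"per-step jobs:  {split:.0f} core-hours")
print(f"idle fraction of monolithic job: {1 - split / mono:.0%}")
```

Under these assumed numbers, most of the monolithic allocation sits idle during the serial step, which is the kind of waste that per-process resource requests in a workflow manager are meant to avoid.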
The expected outcome of this project is a Nextflow workflow for the deployment of the Diploidocus pipeline on HPC. This will (a) increase in-house efficiency of HPC usage, and (b) make Diploidocus more attractive as a tool to other research groups.
A containerised Pawsey workflow for high-throughput phylogenomics (Project #14)
This project aims to produce a robust and efficient phylogenomics workflow for whole genome sequencing data.
One important application of genome assemblies is to test and improve the taxonomic classification of species using large-scale genome-wide phylogenetics, known as phylogenomics. There is a previously developed Snakemake workflow for the rapid generation of phylogenomic trees from low- to mid-coverage whole genome shotgun sequencing data. This pipeline (1) creates multiple rapid draft assemblies; (2) identifies an optimal set of orthologous genes per species using BUSCO and BUSCOMP; (3) generates a multiple sequence alignment per gene; (4) generates a phylogenetic tree per gene; and (5) generates a consensus tree from all the individual gene trees.
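The idea behind step (2), choosing a set of genes that is shared across species before building per-gene alignments and trees, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `min_fraction` threshold and the placeholder gene IDs are hypothetical, and this is not how BUSCO or BUSCOMP select genes internally.

```python
from collections import Counter


def shared_orthologues(complete_by_species, min_fraction=1.0):
    """Return gene IDs found in at least `min_fraction` of the species.

    `complete_by_species` maps a species name to the gene IDs reported
    as complete, single-copy orthologues for that species (hypothetical
    input format).
    """
    n_species = len(complete_by_species)
    counts = Counter(
        gene
        for genes in complete_by_species.values()
        for gene in set(genes)  # count each gene once per species
    )
    return sorted(g for g, c in counts.items() if c >= min_fraction * n_species)


# Placeholder species and gene IDs, for illustration only.
example = {
    "species_A": ["g1", "g2", "g3"],
    "species_B": ["g1", "g2"],
    "species_C": ["g1", "g4"],
}

strict = shared_orthologues(example)                     # present in every species
relaxed = shared_orthologues(example, min_fraction=0.6)  # present in at least 2 of 3
print(strict, relaxed)
```

A strict threshold gives complete alignments but fewer genes; a relaxed threshold keeps more genes at the cost of missing data in some alignments.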
There is now a requirement to (1) update the pipeline so that it is optimised for the high-coverage draft and reference genomes created by the Ocean Genomes Project, and (2) convert the pipeline from PBS/Snakemake to SLURM/Nextflow, in line with other genomics workflows being developed at the Minderoo OceanOmics Centre at UWA.
This project will adapt the wgs2tree workflow to optionally start from a set of existing genome assemblies and BUSCO orthologue annotations, and will implement it as a Nextflow/SLURM workflow optimised to run efficiently on Pawsey.