Monday 17 December 2018

What are we sequencing next? The waratah!

Thanks to seed funding from UNSW, we were able to sequence two rainforest tree species earlier this year in collaboration with the Royal Botanic Gardens and Domain Trust (RBGDT), Sydney. I am pleased to announce that, together with RBGDT and the Blue Mountains Botanic Garden, Mt Tomah, we won a bid to sequence one of the first genomes as part of the new Genomics for Australian Plants Framework Initiative by Bioplatforms Australia: the NSW state flower, the Waratah (Telopea speciosissima).

As announced recently, this is one of three species selected for the initial pilot study. Details will be sorted out in the new year, but we will be looking to use a combination of 10x Genomics linked reads and long-read sequencing (PacBio and/or Nanopore).

Collaborators on the project: M Rossetto1, M van der Merwe1, H Sauquet1, P Lu-Irving1, J Bragg1, G Bourke2, RJ Edwards3

  1. Royal Botanic Gardens and Domain Trust, Sydney
  2. Blue Mountains Botanic Garden, Mt Tomah
  3. The University of New South Wales, Sydney

Image: Telopea speciosissima, Suellen’s Garden, Falls Ck NSW: Photo, Suellen Harris

Wednesday 28 November 2018

EdwardsLab at #ABACBS2018

For those who missed it, there’s a (slightly old) poster version of my ABACBS 2018 talk - Sequencing snakes: Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake). If anything in the talk (except the repeat stuff) looks useful to you, this is a citeable poster:

Edwards RJ et al. Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake) [version 1; not peer reviewed]. F1000Research 2018, 7:753 (poster) (doi: 10.7490/f1000research.1115550.1)

We’re still developing the genome size prediction and BUSCO comparison/compilation tools, so get in touch if either of these looks useful to you.

ABACBS2018 Posters

We have three lab posters in Poster session 2 this morning:

  • Poster #16. Åsa Pérez-Bercoff, Using structural variant detection to resolve difficult regions of a genome assembly.

  • Poster #21. Kirsti Paulsen, Optimising intrinsic protein disorder prediction for short linear motif discovery.

  • Poster #26. Katarina Stuart, Evolution in invasive populations: using genomics to reveal drivers of invasion success in the Australian European starling (Sturnus vulgaris) introduction across Australia.

Also check out the posters of our UNSW neighbours from the Wilkins lab:

  • Poster #29. Chi Nam Ignatius (Igy) Pang, Benchmarking Protein Correlation Profiling datasets against reference protein complexes: case studies in S. cerevisiae.

  • Poster #44. Susan Corley, QuantSeq 3’ sequencing paired with Salmon quantification provides a fast reliable approach for high throughput transcriptomic analysis.

  • Poster #49. Xabier Vázquez-Campos, OTUreporter: an automated pipeline for the analysis and report of amplicon sequencing data.

Wednesday 31 October 2018

SLiMEnrich: computational assessment of protein–protein interaction data as a source of domain-motif interactions

Idrees S, Pérez-Bercoff Å & Edwards RJ. (2018) SLiMEnrich: computational assessment of protein–protein interaction data as a source of domain-motif interactions. PeerJ 6:e5858


Many important cellular processes involve protein–protein interactions (PPIs) mediated by a Short Linear Motif (SLiM) in one protein interacting with a globular domain in another. Despite their significance, these domain-motif interactions (DMIs) are typically low affinity, which makes them challenging to identify by classical experimental approaches, such as affinity pulldown mass spectrometry (AP-MS) and yeast two-hybrid (Y2H). DMIs are generally underrepresented in PPI networks as a result. A number of computational methods now exist to predict SLiMs and/or DMIs from experimental interaction data but it is yet to be established how effective different PPI detection methods are for capturing these low affinity SLiM-mediated interactions. Here, we introduce a new computational pipeline (SLiMEnrich) to assess how well a given source of PPI data captures DMIs and thus, by inference, how useful that data should be for SLiM discovery. SLiMEnrich interrogates a PPI network for pairs of interacting proteins in which the first protein is known or predicted to interact with the second protein via a DMI. Permutation tests compare the number of known/predicted DMIs to the expected distribution if the two sets of proteins are randomly associated. This provides an estimate of DMI enrichment within the data and the false positive rate for individual DMIs. As a case study, we detect significant DMI enrichment in a high-throughput Y2H human PPI study. SLiMEnrich analysis supports Y2H data as a source of DMIs and highlights the high false positive rates associated with naïve DMI prediction. SLiMEnrich is available as an R Shiny app. The code is open source and available via a GNU GPL v3 license at: A web server is available at:
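As a rough illustration of the permutation idea in the abstract above, here is a minimal Python sketch. It is not the SLiMEnrich code (which is an R Shiny app): the function name dmi_enrichment, the shuffling scheme (randomly re-pairing the second protein of each interaction) and the input formats are all assumptions made for illustration.

```python
import random


def dmi_enrichment(ppi_pairs, potential_dmis, n_perm=1000, seed=0):
    """Permutation test for DMI enrichment in a list of PPI pairs.

    ppi_pairs: list of (protein_a, protein_b) tuples (hypothetical input format).
    potential_dmis: set of (motif_protein, domain_protein) tuples that are
        known or predicted to interact via a DMI.
    Returns (observed, mean_random, enrichment, p_value).
    """
    rng = random.Random(seed)
    # Observed number of PPI pairs that are also known/predicted DMIs.
    observed = sum(1 for pair in ppi_pairs if pair in potential_dmis)

    firsts = [a for a, _ in ppi_pairs]
    seconds = [b for _, b in ppi_pairs]
    random_counts = []
    for _ in range(n_perm):
        # Randomly re-associate the two sets of proteins.
        rng.shuffle(seconds)
        count = sum(1 for pair in zip(firsts, seconds) if pair in potential_dmis)
        random_counts.append(count)

    mean_random = sum(random_counts) / n_perm
    # Empirical p-value with the usual +1 correction.
    as_extreme = sum(1 for c in random_counts if c >= observed)
    p_value = (1 + as_extreme) / (n_perm + 1)
    enrichment = observed / mean_random if mean_random > 0 else float("inf")
    return observed, mean_random, enrichment, p_value


# Toy example with made-up protein identifiers.
ppi = [("A", "X"), ("B", "Y"), ("C", "Z"), ("D", "W")]
dmis = {("A", "X"), ("B", "Y"), ("C", "Z"), ("D", "W")}
print(dmi_enrichment(ppi, dmis))
```

In this sketch, the ratio of the mean random count to the observed count gives a crude estimate of the false positive rate among individual DMIs.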

Ziying Zhang (Visiting Masters student)

Ziying Zhang is a visiting Masters student from Wageningen University in the Netherlands. She received a bachelor's degree in bioscience from Hainan University, China, in 2016 and started her MSc in bioinformatics at Wageningen University in February 2017. Ziying’s Masters thesis project focused on developing a downstream analysis toolbox for microbial diversity analysis, which executes SPARQL queries against RDF datasets of 16S rRNA before launching an analysis, in order to simplify the pre-processing procedures.

Ziying started an internship at the Edwards Lab at the end of October 2018, working on genome size prediction from kmer profiles. Her interests include genomics, genetics, bioinformatics, programming and data management.
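For readers unfamiliar with the idea, one common and very simple approach to kmer-based genome size estimation divides the total number of kmers by the depth of the main coverage peak. The sketch below assumes that approach. It is not necessarily the method being developed in the lab, and the function name, the min_depth error cutoff and the toy histogram are hypothetical.

```python
def estimate_genome_size(kmer_histogram, min_depth=5):
    """Estimate genome size from a kmer depth histogram.

    kmer_histogram: dict mapping depth -> number of distinct kmers seen at
        that depth.
    min_depth: depths below this are treated as sequencing errors and ignored
        (an assumed cutoff; real data needs this chosen from the profile).
    """
    usable = {d: n for d, n in kmer_histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no kmers at or above min_depth")
    # Depth with the most distinct kmers, taken as the main coverage peak.
    peak_depth = max(usable, key=usable.get)
    # Total kmer occurrences, excluding the low-depth error kmers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at low depth and a main peak at depth 20.
toy = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy))  # 200.0
```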


Wednesday 3 October 2018

Research snapshot - October 2018

One of the most important, interesting and challenging questions in biology is how new traits evolve at the molecular level. My lab employs sequence analysis techniques to interrogate protein and DNA sequences for the signals left behind by evolution. We are a bioinformatics lab but like to incorporate bench data through collaboration wherever possible.

Main Research

The core research in the lab is broadly divided into three main themes:

1. Short Linear Motifs (SLiMs)

Many protein-protein interactions are mediated by Short Linear Motifs (SLiMs): short stretches of proteins (5-15 amino acids long), of which only a few positions are critical to function. These motifs are vital for biological processes of fundamental importance, acting as ligands for molecular signalling, post-translational modifications and subcellular targeting. SLiMs have extremely compact protein interaction interfaces, generally encoded by fewer than four major affinity-/specificity-determining residues. Their small size enables high functional density and evolutionary plasticity, making them frequent products of convergent "ex nihilo" evolution. It also makes them challenging to identify, both experimentally and computationally.
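To make "only a few positions are critical" concrete, SLiMs are often written as regular expressions in which the fixed residues are spelled out and the rest are wildcards. The sketch below searches a made-up sequence for the well-known LxCxE pattern. The function name find_motif and the example sequence are hypothetical, and this is a plain regular expression search rather than any of the lab's tools.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every occurrence of a motif.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical sequence; "L.C.E" fixes three positions and leaves two free.
print(find_motif("MSTLYCYEQLNDSS", "L.C.E"))  # [(4, 'LYCYE')]
```

Short patterns like this match by chance very often, which is one reason motif prediction is hard.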

A major focus of the lab is the computational prediction of SLiMs from protein sequences. This research originated with Rich’s postdoctoral research, during which he developed sequence analysis methods for the rational design of biologically active short peptides. He subsequently developed SLiMDisc, one of the first algorithms for successfully predicting novel SLiMs from sequence data - and coined the term “SLiM” into the bargain. This subsequently led to the development of SLiMFinder, the first SLiM prediction algorithm able to estimate the statistical significance of motif predictions. SLiMFinder greatly increased the reliability of predictions. SLiMFinder has since spawned a number of motif discovery tools and webservers and is still arguably the most successful SLiM prediction tool on benchmarking data.

Current research is looking to develop these SLiM prediction tools further and apply them to important biological questions. Of particular interest is the molecular mimicry employed by viruses to interact with host proteins and the role of SLiMs in other diseases, such as cancer. Other work is concerned with the evolutionary dynamics of SLiMs within protein interaction networks.

2. The evolution of novel functions.

Previous work in the lab has focused on the evolution of functional specificity following gene duplication. Since moving to UNSW, activities have shifted more towards the use of PacBio long read sequencing and other cutting-edge sequencing technologies, working closely with the Ramaciotti Centre for Genomics. We are collaborating with industrial and academic partners to de novo sequence, assemble, annotate and interrogate the genomes of a selection of microbes with interesting metabolic abilities. Most notably, we have an ARC Linkage Grant with Microbiogen Pty Ltd. to understand how a strain of Saccharomyces cerevisiae has evolved to efficiently use xylose as a sole carbon source: something vital for second generation biofuel production that wild yeast cannot do. We are combining comparative genomics, evolutionary genetics, RNA-Seq transcriptomics, and competition assays to understand how the novel metabolism evolved. Through deep Illumina resequencing of evolving populations, and assembling reliable complete genomes of the founding ancestors, the ultimate goal is to trace how mutations have interacted with existing genetic variation during adaptive evolution.

3. Whole genome sequencing and assembly.

Following our experiences with de novo whole genome assembly in yeast, the lab is getting involved in an increasing number of genome sequencing projects. The biggest of these is leading the bioinformatics and assembly effort in a consortium to sequence the cane toad genome. The lab is also leading the BABS Genome project, sequencing two iconic Australian snakes for use in teaching and public engagement. We are a member of the Oz Mammals Genomics initiative, assisting with the sequencing and assembly of Australia's unique marsupial fauna. We also have a number of bacterial long-read whole genome sequencing collaborations.

Other Research Projects

In addition to the main research in the lab, we have a number of interdisciplinary collaborative projects applying bioinformatics tools and molecular evolution theory to experimental biology, often using large genomic, transcriptomic and/or proteomic datasets. These projects often involve the development of bespoke bioinformatics pipelines and a number of open source bioinformatics tools have been generated as a result. We frequently have small collaborations and/or undergraduate student research projects. Many of these are “on hold” waiting for the right person, or sometimes data, to come along. If you think that you have what it takes, get in touch!

Previous Research

The lab has been involved in a number of interdisciplinary collaborative projects applying bioinformatics tools and molecular evolution theory to experimental biology, often using large genomic, transcriptomic and/or proteomic datasets. These projects often involved the development of bespoke bioinformatics pipelines and a number of open source bioinformatics tools have been generated as a result. Please see the Publications and Lab software pages for more detail, or get in touch if something catches your eye and you want to find out more.

Tuesday 7 August 2018

Draft genome assembly of the invasive cane toad, Rhinella marina

Richard J Edwards, Daniel Enosi Tuipulotu, Timothy G Amos, Denis O’Meally, Mark F Richardson, Tonia L Russell, Marcelo Vallinoto, Miguel Carneiro, Nuno Ferrand, Marc R Wilkins, Fernando Sequeira, Lee A Rollins, Edward C Holmes, Richard Shine & Peter A White (2018): Draft genome assembly of the invasive cane toad, Rhinella marina. GigaScience 7(9):giy095. [GigaScience] [PubMed] [PDF]


Background. The cane toad (Rhinella marina, formerly Bufo marinus) is a species native to Central and South America that has spread across many regions of the globe. Cane toads are known for their rapid adaptation and deleterious impacts on native fauna in invaded regions. However, despite an iconic status, there are major gaps in our understanding of cane toad genetics. The availability of a genome would help to close these gaps and accelerate cane toad research.

Findings. We report a draft genome assembly for R. marina, the first of its kind for the Bufonidae family. We used a combination of long read PacBio RS II and short read Illumina HiSeq X sequencing to generate a total of 359.5 Gb of raw sequence data. The final hybrid assembly of 31,392 scaffolds was 2.55 Gb in length with a scaffold N50 of 168 kb. BUSCO analysis revealed that the assembly included full length or partial fragments of 90.6% of tetrapod universal single-copy orthologs (n = 3950), illustrating that the gene-containing regions have been well-assembled. Annotation predicted 25,846 protein coding genes with similarity to known proteins in SwissProt. Repeat sequences were estimated to account for 63.9% of the assembly.

Conclusion. The R. marina draft genome assembly will be an invaluable resource that can be used to further probe the biology of this invasive species. Future analysis of the genome will provide insights into cane toad evolution and enrich our understanding of their interplay with the ecosystem at large.
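For anyone unsure what the scaffold N50 in the Findings means: it is the length L such that scaffolds of length L or longer contain at least half of the total assembly. A minimal sketch of that standard definition follows. The function name n50 and the toy lengths are illustrative and unrelated to the cane toad data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


print(n50([10, 8, 6, 4, 2]))  # 8
```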

(More details to follow in future posts.)

Sunday 1 July 2018

Edwards Lab at Genetics Society of AustralAsia 2018

Åsa Pérez-Bercoff is attending GSA 2018 in Canberra this week. Today at 16:30, she will be presenting an updated and extended version of her SBRS18 talk, Investigating the evolution of complex, novel traits using whole genome sequencing and molecular palaeontology, as part of the Evolutionary Genetics session.

Find out what this plot means…

…and more!

Friday 15 June 2018

Investigating the evolution of complex, novel traits using whole genome sequencing and molecular palaeontology

Åsa Pérez-Bercoff, Psyche Arcenal, Anna Sophia Grobler, Philip J. L. Bell, Paul V. Attfield and Richard J. Edwards.

Åsa presented the latest updates from our ARC Linkage Project with Microbiogen Pty Ltd at the Sydney Bioinformatics Research Symposium 2018. She will be giving an expanded version of the talk at the Genetics Society of AustralAsia 2018 conference in Canberra (1-4 July), if you missed it.


Understanding how new biochemical pathways evolve in a sexually reproducing population is a complex and largely unanswered question. We are using PacBio whole-genome sequencing and deep population resequencing to explore the evolution of a novel biochemical pathway in yeast over several thousand generations. Growth of wild Saccharomyces cerevisiae strains on the pentose sugar xylose is barely perceptible. A mass-mated starting population was evolved under selection on Xylose Minimal Media with forced sexual mating every two months for four years, producing a population that can grow on and utilise xylose as its sole carbon source.

We are now using a novel “molecular palaeontology” approach to trace the evolutionary process and identify functionally significant loci under selection. Populations at seven key time points during their evolution have been sequenced using Illumina short-read sequencing. In addition, all the parental strains from the founding population have been subject to PacBio de novo whole-genome sequencing and assembly. By constructing reliable whole genomes of the ancestors of our populations, we can then trace the evolution of these populations over time. We can therefore track the trajectory of allele frequencies through time, identifying the contributions of different founding strains and novel mutations. We are using these data to estimate the proportions and regions of the genome that have evolved neutrally, under purifying selection, or adaptively in response to xylose selection. Our unique array of both extant and past, but not extinct, populations allows us to test popular models of molecular evolution.
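As a toy illustration of tracking an allele frequency trajectory through time, the sketch below converts per-time-point read counts at a single variant site into frequencies. The function name, the input format and the numbers are hypothetical, and real pooled resequencing data would also need quality filtering and error modelling that are not shown here.

```python
def allele_frequency_trajectory(counts):
    """Convert read counts at one site into allele frequencies over time.

    counts: list of (alt_reads, total_reads) tuples, one per sequenced time
        point, in chronological order.
    Time points with no coverage are reported as None.
    """
    return [alt / total if total > 0 else None for alt, total in counts]


# Made-up counts for a variant rising in frequency across three time points.
print(allele_frequency_trajectory([(0, 50), (10, 100), (45, 50)]))  # [0.0, 0.1, 0.9]
```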

Sequencing snakes: Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake)

Richard J Edwards, Timothy G Amos, Joshua Tang, Beni Cawood, Sabrina Rispin, Daniel Enosi Tuipulotu & Paul Waters.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF. Citation:

Edwards RJ et al. Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake) [version 1; not peer reviewed]. F1000Research 2018, 7:753 (poster) (doi: 10.7490/f1000research.1115550.1)


The precipitous drop in sequencing costs over recent years has seen the bottleneck in vertebrate whole genome sequencing (WGS) shift from data generation (sequencing) to data processing (assembly and annotation). Draft genomes generated from cheap shotgun Illumina sequencing tend to be highly fragmented with many tens of thousands of short contigs or scaffolds. This can be improved by preparing multiple paired end and “mate pair” libraries with different insert sizes, but this increases the cost of both sequencing and data storage/analysis. PacBio or Oxford Nanopore long read sequencing enables massive improvements in assembly quality but tends to be prohibitively expensive for organisms with large genome sizes, such as vertebrates. 10x Genomics Chromium “linked read” sequencing offers a solution to this problem. High molecular weight molecules of DNA are barcoded prior to standard shotgun Illumina sequencing. These barcodes can then be used for pseudo-long-read assembly, with improved handling of repetitive regions. Where heterozygous variants are dense enough, haplotypes can be phased to generate a “pseudodiploid” assembly with some regions represented as two alleles. This is all for the cost of an additional library prep with no extra sequencing. But does it work?

We have sequenced two of the deadliest venomous snakes in Australia using 10x Chromium linked reads: the mainland tiger snake (Notechis scutatus) and the eastern brown snake (Pseudonaja textilis). Supernova v2 assemblies of the data generated exceptionally high quality genomes for the price, with maximum scaffolds over 50 Mb and N50 values of 5.99 Mb for the tiger snake and 14.7 Mb for the brown snake. This was reflected in BUSCO (v2.0.1 short) completeness estimates of 87.3% (tiger snake) and 90.5% (brown snake). These data will be compared to tiger snake WGS using standard paired end Illumina NovaSeq shotgun sequencing, and discussed with respect to some of the downstream opportunities and challenges provided by pseudodiploid genome assemblies. In particular, BUSCO analysis of haploid, pseudodiploid, and non-redundant genome assemblies revealed some interesting and unexpected behaviour of this widely-used tool. We also present results from GenomeR, a Shiny app (in development) for batch kmer genome size estimation (

Snake genomes and ongoing annotation are being made available through the lab Web Apollo browser and search tool ( We welcome contact from anyone interested in getting involved with the annotation and analysis of these genomes.

Optimising intrinsic protein disorder prediction for short linear motif discovery

Kirsti M G Paulsen, Norman E Davey, Sobia Idrees, Åsa Pérez-Bercoff & Richard J Edwards.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF.


Short linear motifs (SLiMs) are short stretches of proteins that are directly involved in protein-protein interactions (PPI). Identifying SLiMs is important for understanding fundamental processes involved in normal cellular function. SLiMs are commonly only 3 - 10 amino acids in length and form low affinity interactions. This makes them ideal for fast cellular processes, such as cell signalling or response to stimuli, but also difficult to identify experimentally. As a result, many computational SLiM prediction methods have been developed. In order to increase the signal to noise ratio of SLiM predictions, different sequence masking techniques have been developed. These attempt to screen out areas that are unlikely to contain SLiMs and thereby preferentially eliminate the random nonfunctional sequences. One widely implemented masking strategy is to remove protein regions that form stable three-dimensional structures; SLiMs are typically found in regions of intrinsic disorder that are natively unstructured in their unbound form. To date, there has been no systematic study of how best to predict intrinsically disordered protein regions for SLiM discovery. Poor quality predictions will not have the desired noise-removal, while over-stringent masking will remove too many true positives. The aim of this study is to compare how ten different disorder prediction methods affect SLiM occurrence prediction and to identify the best method and settings for this purpose. The disorder prediction scores for each residue in the human proteome were obtained from the MobiDB database. Further, this study aims to investigate whether the optimal disorder masking settings for occurrence SLiM prediction are the same for de novo SLiM prediction and for identification of SLiM mediated PPIs.
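The disorder masking described above can be sketched in a few lines: residues whose disorder score falls below a cutoff are replaced with a mask character before motif searching. This is a minimal sketch under assumed conventions (scores between 0 and 1, higher meaning more disordered, a cutoff of 0.5 and "X" as the mask character). The function name and example values are hypothetical and are not taken from the study.

```python
def mask_ordered_regions(sequence, disorder_scores, threshold=0.5, mask_char="X"):
    """Mask residues predicted to be ordered (disorder score below threshold).

    sequence: protein sequence string.
    disorder_scores: one per-residue score per position in the sequence.
    """
    if len(sequence) != len(disorder_scores):
        raise ValueError("sequence and scores must be the same length")
    return "".join(
        residue if score >= threshold else mask_char
        for residue, score in zip(sequence, disorder_scores)
    )


# Made-up sequence and scores.
print(mask_ordered_regions("MKTAYIA", [0.9, 0.8, 0.2, 0.1, 0.6, 0.5, 0.3]))  # MKXXYIX
```

Raising the threshold masks more of the sequence, which is the trade-off between noise removal and lost true positives discussed in the abstract.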

Evaluation of protein-protein interaction detection methods as a source of capturing domain-motif interactions

Sobia Idrees, Richard J Edwards

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF.


One of the main pursuits in proteomics is to understand the complex network of protein-protein interactions (PPI) that underpin biological processes. Two major classes of PPI are domain-domain interactions (DDI) between globular proteins, and domain-motif interactions (DMI) between a globular domain and a short linear motif (SLiM) in its partner. Advances in high-throughput experimental techniques have been applied at large-scale in an attempt to characterise the interactome of various organisms. However, PPI networks being identified by these high-throughput experiments have low resolution as compared to low-throughput technologies, such as protein co-crystallization. Furthermore, large-scale approaches may be poor at capturing low affinity or transient interactions, which include the majority of known DMI. To date, several studies have been conducted to identify how well these PPI data can capture protein complexes, but the ability of high-throughput PPI-detection methods to capture DMI remains a largely unanswered question.

To help systems biologists choose appropriate methods for predicting different types of interactions, we conducted a comprehensive comparison study on existing high-throughput PPI datasets. We have integrated PPI data, SLiM predictions, domain compositions and known SLiM-domain binding partnerships to identify possible DMI and DDI within interactomes. We identify PPI data that are enriched for DMI or DDI versus a background expectation generated by randomising the PPI within the network. Despite returning relatively few experimentally validated DMI when compared to interaction databases, we present evidence that high-throughput PPI data are enriched for DMI and thus potentially useful for the prediction of novel SLiMs. We discuss the relative merits of co-fractionation followed by mass spectrometry (CoFrac-MS), affinity purification coupled mass spectrometry (AP-MS), and yeast two-hybrid (Y2H) for capturing DMI and DDI, as well as potential quality versus quantity trade-offs in DMI prediction.

Thursday 14 June 2018

Edwards lab at #SBRS2018

Come visit our posters at the Sydney Bioinformatics Research Symposium 2018! Details and high res versions to follow...

Monday 5 February 2018

Katarina Stuart (PhD Student)

Katarina Stuart completed her undergraduate degree at the University of Sydney, with an honours thesis on the evolution and trait plasticity of the invasive cane toad.

She commenced her PhD at UNSW in February 2018 under the primary supervision of Lee Ann Rollins. Her thesis project aims to use genomics to investigate evolution in the European starling (Sturnus vulgaris). The global nature of starling invasions provides an important opportunity to examine evolutionary questions about invasion success, and how this is linked to similarities or emerging differences in their genome post-colonisation and during range expansion. Major components of her thesis involve updating the starling genome, and exploring the population genetics of Australia’s starling invasion in relation to their native range counterparts, both modern and historic.

Katarina’s general interests are in invasive species and their dynamic evolutionary change over geographical and temporal ranges.

[LinkedIn | Twitter]