Friday 15 June 2018

Investigating the evolution of complex, novel traits using whole genome sequencing and molecular palaeontology

Åsa Pérez-Bercoff, Psyche Arcenal, Anna Sophia Grobler, Philip J. L. Bell, Paul V. Attfield and Richard J. Edwards.

Åsa presented the latest updates from our ARC Linkage Project with Microbiogen Pty Ltd at the Sydney Bioinformatics Research Symposium 2018. She will be giving an expanded version of the talk at the Genetics Society of AustralAsia 2018 conference in Canberra (1-4 July), if you missed it.


Understanding how new biochemical pathways evolve in a sexually reproducing population is a complex and largely unanswered question. We are using PacBio whole-genome sequencing and deep population resequencing to explore the evolution of a novel biochemical pathway in yeast over several thousand generations. Growth of wild Saccharomyces cerevisiae strains on the pentose sugar xylose is barely perceptible. A mass-mated starting population was evolved under selection on Xylose Minimal Media with forced sexual mating every two months for four years, producing a population that can grow on and utilise xylose as its sole carbon source.

We are now using a novel “molecular palaeontology” approach to trace the evolutionary process and identify functionally significant loci under selection. Populations at seven key time points during their evolution have been sequenced using Illumina short-read sequencing. In addition, all the parental strains from the founding population have been subjected to PacBio de novo whole-genome sequencing and assembly. By constructing reliable whole genomes of the ancestors of our populations, we can trace the evolution of these populations over time. We can therefore track the trajectory of allele frequencies through time, identifying the contributions of different founding strains and novel mutations. We are using these data to estimate the proportions and regions of the genome that have evolved neutrally, under purifying selection, or adaptively in response to xylose selection. Our unique array of both extant and past, but not extinct, populations allows us to test popular models of molecular evolution.
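As a rough illustration of what tracking an allele frequency trajectory involves, the sketch below converts per-time-point read counts for one variant into alternative-allele frequencies. The function names and the read counts are hypothetical and are not taken from the project's pipeline or data.

```python
from typing import List, Tuple


def allele_frequencies(counts: List[Tuple[int, int]]) -> List[float]:
    """Return the alt-allele frequency at each time point.

    `counts` holds one (ref_reads, alt_reads) pair per sequenced time point.
    Time points with zero coverage yield NaN rather than a division error.
    """
    freqs = []
    for ref_reads, alt_reads in counts:
        depth = ref_reads + alt_reads
        freqs.append(alt_reads / depth if depth else float("nan"))
    return freqs


def net_change(freqs: List[float]) -> float:
    """Frequency change between the first and last time points."""
    return freqs[-1] - freqs[0]


# Hypothetical read counts for a single variant at seven time points.
example_counts = [(95, 5), (90, 10), (70, 30), (55, 45), (30, 70), (10, 90), (2, 98)]
trajectory = allele_frequencies(example_counts)
print([round(f, 2) for f in trajectory])
print(round(net_change(trajectory), 2))
```

A steadily rising trajectory like this one would be a candidate for adaptive change, but distinguishing selection from drift needs a proper population-genetic model that this sketch does not attempt.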

Sequencing snakes: Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake)

Richard J Edwards, Timothy G Amos, Joshua Tang, Beni Cawood, Sabrina Rispin, Daniel Enosi Tuipulotu & Paul Waters.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Citation:

Edwards RJ et al. Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake) [version 1; not peer reviewed]. F1000Research 2018, 7:753 (poster) (doi: 10.7490/f1000research.1115550.1)


The precipitous drop in sequencing costs over recent years has seen the bottleneck in vertebrate whole genome sequencing (WGS) shift from data generation (sequencing) to data processing (assembly and annotation). Draft genomes generated from cheap shotgun Illumina sequencing tend to be highly fragmented, with many tens of thousands of short contigs or scaffolds. This can be improved by preparing multiple paired-end and “mate pair” libraries with different insert sizes, but this increases the cost of both sequencing and data storage/analysis. PacBio or Oxford Nanopore long-read sequencing enables massive improvements in assembly quality but tends to be prohibitively expensive for organisms with large genome sizes, such as vertebrates. 10x Genomics Chromium “linked read” sequencing offers a solution to this problem. High molecular weight molecules of DNA are barcoded prior to standard shotgun Illumina sequencing. These barcodes can then be used for pseudo-long-read assembly, with improved handling of repetitive regions. Where heterozygous variants are dense enough, haplotypes can be phased to generate a “pseudodiploid” assembly with some regions represented as two alleles. This is all for the cost of an additional library prep with no extra sequencing. But does it work?

We have sequenced two of the deadliest venomous snakes in Australia using 10x Chromium linked reads: the mainland tiger snake (Notechis scutatus) and the eastern brown snake (Pseudonaja textilis). Supernova v2 assemblies of the data generated exceptionally high-quality genomes for the price, with maximum scaffolds over 50 Mb and N50 values of 5.99 Mb for the tiger snake and 14.7 Mb for the brown snake. This was reflected in BUSCO (v2.0.1 short) completeness estimates of 87.3% (tiger snake) and 90.5% (brown snake). These data will be compared to tiger snake WGS using standard paired-end Illumina NovaSeq shotgun sequencing, and discussed with respect to some of the downstream opportunities and challenges provided by pseudodiploid genome assemblies. In particular, BUSCO analysis of haploid, pseudodiploid, and non-redundant genome assemblies revealed some interesting and unexpected behaviour of this widely used tool. We also present results from GenomeR, a Shiny app (in development) for batch k-mer genome size estimation.
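For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of how it is usually computed from a list of scaffold lengths. The function name and the toy lengths are illustrative only and do not come from the snake assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy example: total = 100, cumulative sums are 40, 70, ... so N50 = 30.
toy_lengths = [40, 30, 20, 10]
print(n50(toy_lengths))
```

N50 rewards contiguity but says nothing about correctness, which is one reason it is reported alongside completeness measures such as BUSCO.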

Snake genomes and ongoing annotation are being made available through the lab Web Apollo browser and search tool. We welcome contact from anyone interested in getting involved with the annotation and analysis of these genomes.

Optimising intrinsic protein disorder prediction for short linear motif discovery

Kirsti M G Paulsen, Norman E Davey, Sobia Idrees, Åsa Pérez-Bercoff & Richard J Edwards.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.)


Short linear motifs (SLiMs) are short stretches of proteins that are directly involved in protein-protein interactions (PPI). Identifying SLiMs is important for understanding fundamental processes involved in normal cellular function. SLiMs are commonly only 3-10 amino acids in length and form low affinity interactions. This makes them ideal for fast cellular processes, such as cell signalling or response to stimuli, but also difficult to identify experimentally. As a result, many computational SLiM prediction methods have been developed. In order to increase the signal to noise ratio of SLiM predictions, different sequence masking techniques have been developed. These attempt to screen out areas that are unlikely to contain SLiMs and thereby preferentially eliminate the random nonfunctional sequences. One widely implemented masking strategy is to remove protein regions that form stable three-dimensional structures; SLiMs are typically found in regions of intrinsic disorder that are natively unstructured in their unbound form. To date, there has been no systematic study of how best to predict intrinsically disordered protein regions for SLiM discovery. Poor quality predictions will not achieve the desired noise removal, while over-stringent masking will remove too many true positives. The aim of this study is to compare how ten different disorder prediction methods affect SLiM occurrence prediction and to identify the best method and settings for this purpose. The disorder prediction scores for each residue in the human proteome were obtained from the MobiDB database. Further, this study aims to investigate whether the optimal disorder masking settings for SLiM occurrence prediction are the same for de novo SLiM prediction and for identification of SLiM-mediated PPIs.
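The masking idea can be sketched in a few lines: residues whose disorder score falls below a cut-off are replaced with a mask character before motif searching. Everything here is an assumption made for illustration, including the 0.5 threshold, the toy sequence and scores, and the simplified `P..P` pattern; it is not the study's actual tooling or settings.

```python
import re
from typing import List, Sequence, Tuple


def mask_ordered(sequence: str, disorder_scores: Sequence[float],
                 threshold: float = 0.5, mask_char: str = "X") -> str:
    """Replace residues predicted to be ordered (score < threshold) with mask_char."""
    if len(sequence) != len(disorder_scores):
        raise ValueError("sequence and disorder_scores must be the same length")
    return "".join(
        aa if score >= threshold else mask_char
        for aa, score in zip(sequence, disorder_scores)
    )


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping regex match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Toy protein: first 9 residues scored as ordered, last 9 as disordered.
toy_sequence = "ACDEPLLPGHIKPAAPLM"
toy_scores = [0.1] * 9 + [0.9] * 9
masked = mask_ordered(toy_sequence, toy_scores)

print(masked)
print(find_motif(toy_sequence, "P..P"))  # both occurrences before masking
print(find_motif(masked, "P..P"))        # only the disordered-region occurrence
```

Raising the threshold masks more of the protein, which removes more background matches but risks hiding true motifs; that trade-off is what the comparison of predictors and settings is meant to quantify.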

Evaluation of protein-protein interaction detection methods as a source of capturing domain-motif interactions

Sobia Idrees, Richard J Edwards

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.)


One of the main pursuits in proteomics is to understand the complex network of protein-protein interactions (PPI) that underpin biological processes. Two major classes of PPI are domain-domain interactions (DDI) between globular proteins, and domain-motif interactions (DMI) between a globular domain and a short linear motif (SLiM) in its partner. Advances in high-throughput experimental techniques have been applied at large scale in an attempt to characterise the interactome of various organisms. However, PPI networks identified by these high-throughput experiments have low resolution compared to low-throughput technologies, such as protein co-crystallization. Furthermore, large-scale approaches may be poor at capturing low affinity or transient interactions, which include the majority of known DMI. To date, several studies have been conducted to identify how well these PPI data can capture protein complexes, but the ability of high-throughput PPI-detection methods to capture DMI remains a largely unanswered question.

To help systems biologists choose appropriate methods for predicting different types of interactions, we conducted a comprehensive comparison study on existing high-throughput PPI datasets. We have integrated PPI data, SLiM predictions, domain compositions and known SLiM-domain binding partnerships to identify possible DMI and DDI within interactomes. We identify PPI data that are enriched for DMI or DDI versus a background expectation generated by randomising the PPI within the network. Despite returning relatively few experimentally validated DMI when compared to interaction databases, we present evidence that high-throughput PPI data are enriched for DMI and thus potentially useful for the prediction of novel SLiMs. We discuss the relative merits of co-fractionation followed by mass spectrometry (CoFrac-MS), affinity purification coupled to mass spectrometry (AP-MS), and yeast two-hybrid (Y2H) for capturing DMI and DDI, as well as potential quality versus quantity trade-offs in DMI prediction.
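One simple way to picture an enrichment test against a randomised background is a permutation test: count the PPI pairs supported by a known motif-domain partnership, then repeat the count after shuffling interaction partners. The sketch below assumes plain partner shuffling, which may differ from the randomisation actually used in the study, and all protein names and pairs are hypothetical.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def count_supported(pairs: List[Pair], supported: Set[Pair]) -> int:
    """Number of PPI pairs that match a known motif-domain partnership."""
    return sum(1 for pair in pairs if pair in supported)


def shuffle_partners(pairs: List[Pair], rng: random.Random) -> List[Pair]:
    """Randomise the network by shuffling the second member of every pair.

    Each protein keeps its number of interactions; only the pairing changes.
    """
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    rng.shuffle(right)
    return list(zip(left, right))


def enrichment(pairs: List[Pair], supported: Set[Pair],
               n_random: int = 1000, seed: int = 0) -> Tuple[int, float, float]:
    """Return (observed count, mean randomised count, empirical p-value)."""
    rng = random.Random(seed)
    observed = count_supported(pairs, supported)
    random_counts = [
        count_supported(shuffle_partners(pairs, rng), supported)
        for _ in range(n_random)
    ]
    mean_random = sum(random_counts) / n_random
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (1 + sum(c >= observed for c in random_counts)) / (n_random + 1)
    return observed, mean_random, p_value


# Hypothetical PPI pairs (motif-containing protein, domain-containing protein).
toy_ppi = [("A", "D1"), ("B", "D2"), ("C", "D3"), ("E", "D4"), ("F", "D5")]
toy_supported = {("A", "D1"), ("B", "D2"), ("C", "D3")}
observed, mean_random, p_value = enrichment(toy_ppi, toy_supported)
print(observed, round(mean_random, 2), round(p_value, 3))
```

The ratio of the observed count to the mean randomised count gives an enrichment score, and comparing that score across datasets is one way to frame the quality versus quantity trade-off mentioned above.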

Thursday 14 June 2018

Edwards lab at #SBRS2018

Come visit our posters at the Sydney Bioinformatics Research Symposium 2018! Details and high res versions to follow...