Wednesday, 31 October 2018

SLiMEnrich: computational assessment of protein–protein interaction data as a source of domain-motif interactions

Idrees S, Pérez-Bercoff Å & Edwards RJ. (2018) SLiMEnrich: computational assessment of protein–protein interaction data as a source of domain-motif interactions. PeerJ 6:e5858


Many important cellular processes involve protein–protein interactions (PPIs) mediated by a Short Linear Motif (SLiM) in one protein interacting with a globular domain in another. Despite their significance, these domain-motif interactions (DMIs) are typically low affinity, which makes them challenging to identify by classical experimental approaches, such as affinity pulldown mass spectrometry (AP-MS) and yeast two-hybrid (Y2H). DMIs are generally underrepresented in PPI networks as a result. A number of computational methods now exist to predict SLiMs and/or DMIs from experimental interaction data but it is yet to be established how effective different PPI detection methods are for capturing these low affinity SLiM-mediated interactions. Here, we introduce a new computational pipeline (SLiMEnrich) to assess how well a given source of PPI data captures DMIs and thus, by inference, how useful that data should be for SLiM discovery. SLiMEnrich interrogates a PPI network for pairs of interacting proteins in which the first protein is known or predicted to interact with the second protein via a DMI. Permutation tests compare the number of known/predicted DMIs to the expected distribution if the two sets of proteins are randomly associated. This provides an estimate of DMI enrichment within the data and the false positive rate for individual DMIs. As a case study, we detect significant DMI enrichment in a high-throughput Y2H human PPI study. SLiMEnrich analysis supports Y2H data as a source of DMIs and highlights the high false positive rates associated with naïve DMI prediction. SLiMEnrich is available as an R Shiny app. The code is open source and available via a GNU GPL v3 license at: A web server is available at:
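The permutation test described above can be illustrated with a minimal Python sketch. This is not the SLiMEnrich code (which is an R Shiny app); the function names, the shuffling scheme (permuting the second protein of each PPI pair) and the false positive rate estimate (mean random count divided by observed count) are simplifying assumptions made here for illustration, and the protein identifiers are invented.

```python
import random

def count_dmis(ppis, potential_dmis):
    """Count interacting pairs that are also known/predicted DMI pairs."""
    return sum(1 for pair in ppis if pair in potential_dmis)

def dmi_enrichment(ppis, potential_dmis, n_perm=1000, seed=0):
    """Compare the observed DMI count with counts from shuffled PPI pairs.

    ppis: list of (motif_protein, domain_protein) tuples.
    potential_dmis: set of (motif_protein, domain_protein) tuples.
    """
    rng = random.Random(seed)
    observed = count_dmis(ppis, potential_dmis)
    first = [a for a, _ in ppis]
    second = [b for _, b in ppis]
    random_counts = []
    for _ in range(n_perm):
        shuffled = second[:]
        rng.shuffle(shuffled)  # randomly re-associate the two protein sets
        random_counts.append(count_dmis(zip(first, shuffled), potential_dmis))
    mean_random = sum(random_counts) / n_perm
    # Empirical p-value: how often random data match or beat the observed count.
    p_value = (1 + sum(c >= observed for c in random_counts)) / (n_perm + 1)
    enrichment = observed / mean_random if mean_random else float("inf")
    # Assumed FPR estimate: share of observed DMIs expected by chance alone.
    fpr = min(1.0, mean_random / observed) if observed else float("nan")
    return {"observed": observed, "mean_random": mean_random,
            "enrichment": enrichment, "p_value": p_value, "fpr": fpr}

if __name__ == "__main__":
    # Hypothetical toy data: four PPIs, all of which are potential DMIs.
    toy_ppis = [("A", "X"), ("B", "Y"), ("C", "Z"), ("D", "W")]
    print(dmi_enrichment(toy_ppis, set(toy_ppis)))
```

Adding one to both the numerator and denominator of the p-value keeps it above zero when no permutation reaches the observed count, a common convention for empirical permutation tests.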

Tuesday, 7 August 2018

Draft genome assembly of the invasive cane toad, Rhinella marina

Richard J Edwards, Daniel Enosi Tuipulotu, Timothy G Amos, Denis O’Meally, Mark F Richardson, Tonia L Russell, Marcelo Vallinoto, Miguel Carneiro, Nuno Ferrand, Marc R Wilkins, Fernando Sequeira, Lee A Rollins, Edward C Holmes, Richard Shine & Peter A White (2018): Draft genome assembly of the invasive cane toad, Rhinella marina. GigaScience giy095 (Adv. access, 07 August 2018)


Background. The cane toad (Rhinella marina formerly Bufo marinus) is a species native to Central and South America that has spread across many regions of the globe. Cane toads are known for their rapid adaptation and deleterious impacts on native fauna in invaded regions. However, despite an iconic status, there are major gaps in our understanding of cane toad genetics. The availability of a genome would help to close these gaps and accelerate cane toad research.

Findings. We report a draft genome assembly for R. marina, the first of its kind for the Bufonidae family. We used a combination of long read PacBio RS II and short read Illumina HiSeq X sequencing to generate a total of 359.5 Gb of raw sequence data. The final hybrid assembly of 31,392 scaffolds was 2.55 Gb in length with a scaffold N50 of 168 kb. BUSCO analysis revealed that the assembly included full length or partial fragments of 90.6% of tetrapod universal single-copy orthologs (n = 3950), illustrating that the gene-containing regions have been well-assembled. Annotation predicted 25,846 protein coding genes with similarity to known proteins in SwissProt. Repeat sequences were estimated to account for 63.9% of the assembly.
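For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of the standard definition: the length of the scaffold at which the cumulative length of scaffolds, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative only and are not taken from the cane toad assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that takes us to half the total length.
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")

if __name__ == "__main__":
    # Hypothetical scaffold lengths in bp.
    print(n50([2, 3, 4, 5, 6]))
```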

Conclusion. The R. marina draft genome assembly will be an invaluable resource that can be used to further probe the biology of this invasive species. Future analysis of the genome will provide insights into cane toad evolution and enrich our understanding of their interplay with the ecosystem at large.

(More details to follow in future posts.)

Sunday, 1 July 2018

Edwards Lab at Genetics Society of AustralAsia 2018

Åsa Pérez-Bercoff is attending GSA 2018 in Canberra this week. Today at 16:30, she will be presenting an updated and extended version of her SBRS18 talk, "Investigating the evolution of complex, novel traits using whole genome sequencing and molecular palaeontology", as part of the Evolutionary Genetics session.

Find out what this plot means…

…and more!

Friday, 15 June 2018

Investigating the evolution of complex, novel traits using whole genome sequencing and molecular palaeontology

Åsa Pérez-Bercoff, Psyche Arcenal, Anna Sophia Grobler, Philip J. L. Bell, Paul V. Attfield and Richard J. Edwards.

Åsa presented the latest updates from our ARC Linkage Project with Microbiogen Pty Ltd at the Sydney Bioinformatics Research Symposium 2018. If you missed it, she will be giving an expanded version of the talk at the Genetics Society of AustralAsia 2018 conference in Canberra (1-4 July).


Understanding how new biochemical pathways evolve in a sexually reproducing population is a complex and largely unanswered question. We are using PacBio whole-genome sequencing and deep population resequencing to explore the evolution of a novel biochemical pathway in yeast over several thousand generations. Growth of wild Saccharomyces cerevisiae strains on the pentose sugar xylose is barely perceptible. A mass-mated starting population was evolved under selection on Xylose Minimal Media with forced sexual mating every two months for four years, producing a population that can grow on and utilise xylose as its sole carbon source.

We are now using a novel “molecular palaeontology” approach to trace the evolutionary process and identify functionally significant loci under selection. Populations at seven key time points during their evolution have been sequenced using Illumina short-read sequencing. In addition, all the parental strains from the founding population have been subject to PacBio de novo whole-genome sequencing and assembly. By constructing reliable whole genomes of the ancestors of our populations, we can trace the evolution of these populations over time. We can therefore track the trajectory of allele frequencies through time, identifying the contributions of different founding strains and novel mutations. We are using these data to estimate the proportions and regions of the genome that have evolved neutrally, under purifying selection, or adaptively in response to xylose selection. Our unique array of both extant and past, but not extinct, populations allows us to test popular models of molecular evolution.

Sequencing snakes: Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake)

Richard J Edwards, Timothy G Amos, Joshua Tang, Beni Cawood, Sabrina Rispin, Daniel Enosi Tuipulotu & Paul Waters.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF. Citation:

Edwards RJ et al. Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake) [version 1; not peer reviewed]. F1000Research 2018, 7:753 (poster) (doi: 10.7490/f1000research.1115550.1)


The precipitous drop in sequencing costs over recent years has seen the bottleneck in vertebrate whole genome sequencing (WGS) shift from data generation (sequencing) to data processing (assembly and annotation). Draft genomes generated from cheap shotgun Illumina sequencing tend to be highly fragmented with many tens of thousands of short contigs or scaffolds. This can be improved by preparing multiple paired end and “mate pair” libraries with different insert sizes, but this increases the cost of both sequencing and data storage/analysis. PacBio or Oxford Nanopore long read sequencing enables massive improvements in assembly quality but tends to be prohibitively expensive for organisms with large genome sizes, such as vertebrates. 10x Genomics Chromium “linked read” sequencing offers a solution to this problem. High molecular weight molecules of DNA are barcoded prior to standard shotgun Illumina sequencing. These barcodes can then be used for pseudo-long-read assembly, with improved handling of repetitive regions. Where heterozygous variants are dense enough, haplotypes can be phased to generate a “pseudodiploid” assembly with some regions represented as two alleles. This is all for the cost of an additional library prep with no extra sequencing. But does it work?

We have sequenced two of the deadliest venomous snakes in Australia using 10x Chromium linked reads: the mainland tiger snake (Notechis scutatus) and the eastern brown snake (Pseudonaja textilis). Supernova v2 assemblies of the data generated exceptionally high quality genomes for the price, with maximum scaffolds over 50 Mb and N50 values of 5.99 Mb for the tiger snake and 14.7 Mb for the brown snake. This was reflected in BUSCO (v2.0.1 short) completeness estimates of 87.3% (tiger snake) and 90.5% (brown snake). These data will be compared to tiger snake WGS using standard paired end Illumina NovaSeq shotgun sequencing, and discussed with respect to some of the downstream opportunities and challenges provided by pseudodiploid genome assemblies. In particular, BUSCO analysis of haploid, pseudodiploid, and non-redundant genome assemblies revealed some interesting and unexpected behaviour of this widely-used tool. We also present results from GenomeR, a Shiny app (in development) for batch kmer genome size estimation (

Snake genomes and ongoing annotation are being made available through the lab Web Apollo browser and search tool ( We welcome contact from anyone interested in getting involved with the annotation and analysis of these genomes.

Optimising intrinsic protein disorder prediction for short linear motif discovery

Kirsti M G Paulsen, Norman E Davey, Sobia Idrees, Åsa Pérez-Bercoff & Richard J Edwards.

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF.


Short linear motifs (SLiMs) are short stretches of proteins that are directly involved in protein-protein interactions (PPI). Identifying SLiMs is important for understanding fundamental processes involved in normal cellular function. SLiMs are commonly only 3-10 amino acids in length and form low affinity interactions. This makes them ideal for fast cellular processes, such as cell signalling or response to stimuli, but also difficult to identify experimentally. As a result, many computational SLiM prediction methods have been developed. In order to increase the signal to noise ratio of SLiM predictions, different sequence masking techniques have been developed. These attempt to screen out areas that are unlikely to contain SLiMs and thereby preferentially eliminate the random nonfunctional sequences. One widely implemented masking strategy is to remove protein regions that form stable three-dimensional structures; SLiMs are typically found in regions of intrinsic disorder that are natively unstructured in their unbound form. To date, there has been no systematic study of how best to predict intrinsically disordered protein regions for SLiM discovery. Poor quality predictions will not achieve the desired noise removal, while over-stringent masking will remove too many true positives. The aim of this study is to compare how ten different disorder prediction methods affect SLiM occurrence prediction and to identify the best method and settings for this purpose. The disorder prediction scores for each residue in the human proteome were obtained from the MobiDB database. Further, this study aims to investigate whether the optimal disorder masking settings for SLiM occurrence prediction are the same for de novo SLiM prediction and for identification of SLiM mediated PPIs.
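The disorder masking strategy described above can be sketched in a few lines of Python. This is an illustration only: the function name, the 0.5 threshold, the "X" mask character and the per-residue scores are all assumptions made here, and the snippet does not reproduce any particular disorder predictor, the MobiDB data format, or the settings evaluated in the study.

```python
def mask_ordered(sequence, scores, threshold=0.5, mask_char="X"):
    """Mask residues whose disorder score falls below the threshold.

    sequence: protein sequence as a string.
    scores: per-residue disorder scores (same length as sequence),
            where higher values mean more disordered.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must be the same length")
    # Keep predicted-disordered residues; replace predicted-ordered ones.
    return "".join(
        aa if score >= threshold else mask_char
        for aa, score in zip(sequence, scores)
    )

if __name__ == "__main__":
    # Hypothetical six-residue peptide with made-up disorder scores.
    print(mask_ordered("MKTPSL", [0.1, 0.2, 0.6, 0.9, 0.5, 0.3]))
```

Raising the threshold masks more of the sequence, which is the trade-off noted above: stricter masking removes more noise but risks discarding true SLiM occurrences.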

Evaluation of protein-protein interaction detection methods as a source of capturing domain-motif interactions

Sobia Idrees, Richard J Edwards

This work was presented at the Sydney Bioinformatics Research Symposium 2018. (Abstract below.) Click on thumbnail for full resolution PDF.


One of the main pursuits in proteomics is to understand the complex network of protein-protein interactions (PPI) that underpin biological processes. Two major classes of PPI are domain-domain interactions (DDI) between globular proteins, and domain-motif interactions (DMI) between a globular domain and a short linear motif (SLiM) in its partner. Advances in high-throughput experimental techniques have been applied at large-scale in an attempt to characterise the interactome of various organisms. However, PPI networks being identified by these high-throughput experiments have low resolution as compared to low-throughput technologies, such as protein co-crystallization. Furthermore, large-scale approaches may be poor at capturing low affinity or transient interactions, which include the majority of known DMI. To date, several studies have been conducted to identify how well these PPI data can capture protein complexes, but the ability of high-throughput PPI-detection methods to capture DMI remains a largely unanswered question.

To help systems biologists choose appropriate methods for predicting different types of interactions, we conducted a comprehensive comparison study on existing high-throughput PPI datasets. We have integrated PPI data, SLiM predictions, domain compositions and known SLiM-domain binding partnerships to identify possible DMI and DDI within interactomes. We identify PPI data that are enriched for DMI or DDI versus a background expectation generated by randomising the PPI within the network. Despite returning relatively few experimentally validated DMI when compared to interaction databases, we present evidence that high-throughput PPI data are enriched for DMI and thus potentially useful for the prediction of novel SLiMs. We discuss the relative merits of co-fractionation followed by mass spectrometry (CoFrac-MS), affinity purification coupled mass spectrometry (AP-MS), and yeast two hybrid (Y2H) for capturing DMI and DDI, as well as potential quality versus quantity trade-offs in DMI prediction.