Tuesday, 2 July 2019

#GSA2019 - BUSCOMP: BUSCO Compilation and Comparison for Assessing Completeness in Multiple Genome Assemblies

Richard J. Edwards

If you are at the Genetics Society of Australasia Conference 2019, then come and hear me talk at 11:30 in Symposium 4B – Genomics & Bioinformatics (1). If you cannot make it, or loved the talk so much you want to look at it again, the slides are available on F1000Research:

  • Edwards RJ (2019): BUSCOMP: BUSCO compilation and comparison – Assessing completeness in multiple genome assemblies [version 1; not peer reviewed]. F1000Research 8:995 (slides)
    (doi: 10.7490/f1000research.1116972.1)

Abstract

Advances in DNA sequencing technology and free availability of bioinformatics tools have placed de novo genome assembly of complex organisms firmly in the domain of individual labs and small consortia. Nevertheless, the assemblies produced are often fragmented and incomplete. Optimal assembly depends on the size, repeat landscape, ploidy and heterozygosity of the genome, which are often unknown. It is therefore common practice to try multiple assembly strategies, which creates a bottleneck in assessing and comparing the resulting assemblies.

BUSCO [1] is a powerful and popular tool that estimates genome completeness using gene prediction and curated models of single-copy protein orthologues. BUSCO assessments combine genome completeness, contiguity, and accuracy to rate genes as “Complete (Single Copy)”, “Duplicated”, “Fragmented” or “Missing”. However, results can be counterintuitive and lack robustness when comparing multiple assemblies of the same genome. Adding/removing scaffolds can alter the BUSCO genes returned by the rest of the assembly [2], while low sequence quality may reduce “completeness” scores and miss genes that are present in the assembly [3].

BUSCOMP (BUSCO Compilation and Comparison) is designed to complement BUSCO and identify/overcome these issues. BUSCOMP first compiles a non-redundant maximal set of the highest-scoring single-copy complete sequences for as many BUSCO genes as possible. These are then searched against assemblies using Minimap2 [4], converted into global alignment statistics, and used to robustly re-rate genes as Complete (Single/Duplicated), Fragmented/Partial or Missing. On test data from three organisms (yeast, cane toad and mainland tiger snake), BUSCOMP (1) gives consistent results when re-running the same assembly, (2) is not affected by adding or removing non-BUSCO-containing scaffolds, and (3) is minimally affected by assembly quality. This makes BUSCOMP ideal to run alongside BUSCO when trying to compare and rank genome assemblies, even in the absence of error-correction.
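The compilation step described above can be illustrated with a minimal Python sketch. This is not BUSCOMP's actual code: the `BuscoHit` record, the `compile_best_complete` function and the length tie-break are all hypothetical, and the sketch assumes that per-assembly BUSCO results have already been parsed into records holding a gene ID, a status and a score. It keeps, for each BUSCO gene, the single highest-scoring hit rated "Complete" in any assembly.

```python
from typing import Dict, Iterable, NamedTuple


class BuscoHit(NamedTuple):
    """One BUSCO result row (hypothetical record, not BUSCOMP's own type)."""
    assembly: str   # label of the assembly the hit came from
    busco_id: str   # BUSCO gene identifier
    status: str     # "Complete", "Duplicated", "Fragmented" or "Missing"
    score: float    # BUSCO score reported for the hit
    length: int     # length reported for the hit


def compile_best_complete(hits: Iterable[BuscoHit]) -> Dict[str, BuscoHit]:
    """Return one best single-copy Complete hit per BUSCO gene."""
    best: Dict[str, BuscoHit] = {}
    for hit in hits:
        # Only single-copy complete hits are candidates for the compiled set.
        if hit.status != "Complete":
            continue
        current = best.get(hit.busco_id)
        # Prefer the higher score; breaking ties on length is an assumption.
        if current is None or (hit.score, hit.length) > (current.score, current.length):
            best[hit.busco_id] = hit
    return best


# Made-up example data for two assemblies.
hits = [
    BuscoHit("asm1", "geneA", "Complete", 510.2, 300),
    BuscoHit("asm2", "geneA", "Complete", 498.0, 310),
    BuscoHit("asm1", "geneB", "Fragmented", 120.5, 90),
    BuscoHit("asm2", "geneB", "Complete", 401.7, 250),
    BuscoHit("asm1", "geneC", "Duplicated", 600.0, 400),
]
compiled = compile_best_complete(hits)
for busco_id in sorted(compiled):
    print(busco_id, compiled[busco_id].assembly, compiled[busco_id].score)
```

In this toy example, geneA is taken from asm1 and geneB from asm2, while geneC is left out because it is never rated single-copy Complete. In the workflow described above, the compiled sequences would then be searched against every assembly with Minimap2.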

BUSCOMP is freely available at https://github.com/slimsuite/buscomp under a GNU GPL v3 license.

  1. Simão FA et al. (2015) Bioinformatics 31:3210–3212
  2. Edwards RJ et al. (2018) F1000Research 7:753
  3. Edwards RJ et al. (2018) GigaScience 7:giy095
  4. Li H (2018) Bioinformatics 34:3094–3100
