Validation and development of sequence-based tools to analyse the human gut virome

TS_Thesis_Corrections.Final.pdf(6.89 MB)
Full Text E-thesis
Sutton, Thomas D. S.
University College Cork
The gut microbiome is a complex community of microorganisms that interacts closely with the human host and is believed to play an important role in the maintenance of human health. The viral component of this community is referred to as the human gut virome and is dominated by bacteriophage. Bacteriophage are central to microbial ecosystems, facilitating nutrient turnover and horizontal gene transfer and driving bacterial diversity. In this way the gut virome is believed to interact closely with the human host by shaping the composition and function of the gut microbiome. However, the gut virome also represents one of the biggest gaps in our understanding of the microbiome, as it is dominated by unknown bacteriophage targeting unknown bacterial hosts and with uncharacterised downstream functions. These challenges mean that virome research relies heavily on sequence-based approaches and metagenomics to identify compositional patterns and targets for future characterisation.

A typical virome study involves physical and chemical separation of individual virions from the cellular components of the microbiome and from the contents of the faecal, luminal or mucosal sample from which they came. A viral metagenome is then generated by extracting virome DNA and/or RNA for sequencing on a given platform. The sequencing reads are quality filtered and assembled to reconstruct the viral genomes in the original sample. The abundance of these assemblies is then estimated by aligning the sequencing reads to them and performing statistical analysis. However, each step in a virome analysis pipeline has the potential to distort the final viral community and, given the unknown nature of the virome, this distortion is difficult to identify and characterise. As a result, conclusions are often drawn from virome studies without fully appreciating the impact of the analysis methods on the findings.
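The abundance-estimation step described above can be illustrated with a minimal sketch. It assumes that the number of reads aligned to each assembled contig is already available and that abundance is normalised by contig length; the abstract does not state which normalisation the thesis uses, so the function name, the reads-per-kilobase scheme and the example values are all hypothetical.

```python
def relative_abundance(read_counts, contig_lengths):
    """Return length-normalised relative abundance per contig.

    Assumption (not taken from the thesis): abundance is computed as
    reads per kilobase of contig, then scaled so the values sum to 1.
    """
    reads_per_kb = {}
    for contig, count in read_counts.items():
        length = contig_lengths[contig]
        if length <= 0:
            raise ValueError(f"contig {contig!r} has non-positive length")
        reads_per_kb[contig] = count / (length / 1000)

    total = sum(reads_per_kb.values())
    if total == 0:
        # No reads aligned to any contig.
        return {contig: 0.0 for contig in reads_per_kb}
    return {contig: value / total for contig, value in reads_per_kb.items()}


# Hypothetical example: a long contig with many aligned reads
# and a short contig with few.
counts = {"contig_A": 900, "contig_B": 100}
lengths = {"contig_A": 90_000, "contig_B": 5_000}
abundance = relative_abundance(counts, lengths)
```

In this made-up example contig_A attracts nine times as many reads as contig_B, yet contig_B comes out as the more abundant of the two once contig length is taken into account. A choice as small as the normalisation scheme can therefore change the apparent community composition.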
This thesis examines the major steps in sequence-based virome analysis pipelines, highlighting how choices made at each step of an analysis protocol can impact the final conclusions drawn from a study. In doing so, we have changed our perspective of the human gut virome and challenged previous assumptions. Chapter One discusses the current understanding of the virome field, giving particular attention to how the analysis methods and challenges affect our view of the virome. In Chapter Two, we focus on the assembly step of virome analysis pipelines. This step is of particular importance to virome studies, as an assembler’s ability to recover viral sequences can ultimately determine the amount of sequence information used in that study. We compared all short-read assembly programs used in virome studies to date across mock communities, simulated datasets and real datasets. We found that not all assemblers are equal, and that the choice of assembler can drastically affect the conclusions that can be drawn from a virome study. These findings call the comparability of different virome studies into question and suggest that previous virome studies would benefit from reanalysis using improved assembly methods and re-examination of the conclusions drawn.

As discussed, the human gut virome is dominated by “viral dark matter”: sequences that share no homology with those in reference databases. However, the majority of what is currently known about the virome in human health and disease is based on the minor fraction of viral sequences collated in these databases. This presents a serious gap in our understanding and was the primary focus of Chapter Three. We reanalysed a keystone inflammatory bowel disease (IBD) dataset, which had formed the foundation of much of what we knew about the virome in IBD. We developed a new approach to analysing the virome beyond the identifiable minority and, by doing so, significantly changed our understanding of the virome in IBD.
In the final chapter, we directed our attention to possibly the most important aspect of a sequence-based study: the sequencing approach itself. This step bridges the gap between the biological information in a virome and the digital information that is analysed. As with all steps in a virome analysis pipeline, this has serious implications for the final conclusions of the study. We described the use of long-read sequencing in the human gut virome and the benefits and challenges associated with this technology. We also found that the ability of amplified short-read sequencing libraries to represent the gut virome was limited, but that alternative library preparation methods and long-read sequencing platforms may be able to address these limitations. These findings imply that much of what we know about the human gut virome may be linked to sequencing performance, rather than the biology of the community itself. These three major aspects of virome analysis pipelines highlight the importance of considering the impact of the analysis approach when interpreting the results of virome data and complex biological systems in general.
Phage, Virome, Bioinformatics, Next-generation sequencing, Long-read sequencing, Metagenomics, Assembly, Microbiome, Inflammatory bowel disease
Sutton, T. D. S. 2019. Validation and development of sequence-based tools to analyse the human gut virome. PhD Thesis, University College Cork.