Restriction lift date: 2028-12-31
Optimisation of combinatory machine learning techniques for advancing human microbiome research
Date
2025
Authors
O'Sullivan, Jill
Publisher
University College Cork
Abstract
The human gut microbiota, which comprises trillions of microorganisms, has been studied extensively over the last two decades and is now widely believed to play a role in many human disorders, including Inflammatory Bowel Disease (IBD). However, IBD is considered a multifactorial disease, with genetic susceptibility, environmental factors and changes in the immune response also contributing to its progression. Multi-omics provides novel opportunities to integrate multiple forms of molecular information from both the human host and the gut microbiome, allowing researchers to expand their understanding of complex diseases such as IBD. Multi-omics datasets are, however, complex, high-dimensional and noisy, and contain large numbers of highly correlated features, which can make them difficult to work with. To date, several methods have been developed to integrate and analyse these datasets, with advanced techniques such as machine learning (ML) becoming popular. Unfortunately, there is currently no widely accepted best approach to such analysis, and only a few researchers have attempted to train ML models on integrated host-microbe datasets. This thesis explores the potential of ML applied to host-microbe multi-omics data for IBD-related tasks, including predicting disease relapses and classifying disease types.
In Chapter II, a large multi-omics dataset from a cohort of adult patients with IBD was examined. Using more traditional analyses, we found that most single-omics datasets could distinguish IBD subtypes from non-IBD controls, and we associated the microbiome, host transcriptome and host methylome with disease type and inflammation status. These methods were, however, not able to distinguish relapse and remission groups in patients with Crohn’s disease (CD) and ulcerative colitis (UC). Instead, training an ensemble of XGBoost models on an integrated host-microbe dataset of multiple omics types achieved better results when classifying these two groups.
Recognizing that these models may be influenced by confounders of long-term illness, we assessed whether such a multi-omics ML approach could be effective in a treatment-naïve cohort of patients with IBD. In Chapter III, we explored various omics combinations to evaluate their ability to predict relapse within six months of diagnosis in a cohort of patients with paediatric UC. As in Chapter II, promising results were observed from models trained on multiple omics types, with combinations of host epigenome and microbial features often showing the best performance.
The findings in these chapters highlight the potential of ML analysis in IBD research. In Chapter IV, I conducted a comparative analysis of multiple ML pipelines, highlighting several approaches that show promising results for a multi-class disease classification task. In the initial analysis of a dataset comprising microbiome, host transcriptome and host genotype features, models such as PLS-DA, Support Vector Machines and penalised multinomial regression demonstrated the best performance. When the analysis was repeated on a dataset consisting of microbiome and metabolome features, Random Forest and XGBoost models showed the best accuracy. This suggests that the best algorithm may be problem- and/or dataset-specific, and that several algorithms should be compared before choosing a final model. While many of the top-performing models were trained on multiple omics types, similar performance was achieved with a single-omics dataset. This raises the question of whether multi-omics integration is always necessary; its benefit should be assessed before datasets are integrated.
Overall, the work undertaken in this thesis highlights the potential of host-microbe integration when predicting IBD-related outcomes, particularly future disease relapses. We are aware, however, that our models were trained on quite limited sample sizes and would likely have benefited from more training samples. Furthermore, we found that care should be taken when integrating these datasets, as the addition of features may not always be beneficial to the task at hand. While we understand that identifying suitable datasets to validate our findings will be difficult, external validation is essential to ensure the generalizability of our models. Until such datasets become available, I hope the work described herein will help guide future studies by highlighting potentially informative data types.
Keywords
Gut microbiome, Inflammatory Bowel Disease, Machine learning, Host-microbe multi-omics
Citation
O'Sullivan, J. 2025. Optimisation of combinatory machine learning techniques for advancing human microbiome research. PhD Thesis, University College Cork.
