Integration of multimodal data from disparate sources to identify disease subtypes


Biology (Basel). 2022 Feb 24;11(3):360. doi:10.3390/biology11030360.


Studies conducted over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcome. However, because the disease is heterogeneous, data from a single modality are not sufficient to understand the risk of progression or to differentiate between long- and short-term survivors. Using a scientifically developed and tested deep learning approach that leverages aggregated information collected from multiple repositories with multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to more accurate and more robust prediction of disease progression. Here, we propose an autoencoder-based multimodal data fusion system, in which a fusion encoder flexibly integrates collective information available through multiple studies with partially coupled data. Our results on a fully controlled simulation-based study showed that inferring missing data through the proposed data fusion pipeline yields a predictor that is superior to baseline predictors trained with missing modalities. The results further showed that short-term and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be successfully differentiated with an AUC of 0.94, 0.75, and 0.96, respectively.
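
To make the idea of fusing partially coupled modalities more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one linear encoder and decoder per modality, a fusion step that simply averages the latent codes of whichever modalities a patient has, and random untrained weights. The feature sizes, the latent size, and the names `fuse` and `reconstruct` are all hypothetical. It shows only the forward pass: a patient with no methylation data still gets a fused code, from which a methylation profile can be decoded.

```python
import math
import random

random.seed(0)

# Hypothetical feature sizes per modality and a shared latent size.
DIMS = {"mrna": 20, "methylation": 15, "mirna": 10}
LATENT = 4


def rand_matrix(rows, cols):
    """Small random weight matrix (untrained; for illustration only)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(x, w):
    """Multiply row vector x (length = rows of w) by matrix w."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


# One encoder and one decoder per modality.
enc = {m: rand_matrix(d, LATENT) for m, d in DIMS.items()}
dec = {m: rand_matrix(LATENT, d) for m, d in DIMS.items()}


def fuse(sample):
    """Average the latent codes of the modalities present in `sample`."""
    codes = [
        [math.tanh(v) for v in matvec(x, enc[m])]
        for m, x in sample.items()
        if x is not None
    ]
    if not codes:
        raise ValueError("at least one modality is required")
    return [sum(vals) / len(codes) for vals in zip(*codes)]


def reconstruct(z):
    """Decode the fused code into every modality, including missing ones."""
    return {m: matvec(z, dec[m]) for m in DIMS}


# A patient from a study that measured mRNA and miRNA but not methylation.
patient = {
    "mrna": [random.gauss(0.0, 1.0) for _ in range(DIMS["mrna"])],
    "methylation": None,
    "mirna": [random.gauss(0.0, 1.0) for _ in range(DIMS["mirna"])],
}

z = fuse(patient)
imputed = reconstruct(z)
print(len(z), len(imputed["methylation"]))
```

In a trained system the weights would be learned by minimizing reconstruction error on the observed modalities, and the fused code `z` (or the imputed profiles) would then be passed to a survival classifier. Averaging is just one simple fusion choice assumed here; the abstract does not specify how the fusion encoder combines modalities.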

PMID: 35336734 | DOI: 10.3390/biology11030360