
Project 3

Matthieu Foll edited this page Nov 27, 2024 · 25 revisions

Project 3: Multimodal analysis of molecular data and encoded medical image data

Background

This project aims to build a Multi-Omics Factor Analysis (MOFA) model that combines molecular data with morphological features produced by a self-supervised deep learning pipeline applied to whole slide images (WSIs). The objective is to uncover potential associations between the molecular and morphological profiles of lung neuroendocrine neoplasms (LNENs).


Molecular MOFA

The script Scripts/Script_ExpMethAltCNV_lungNENomicsCombined.R (up to line 111) shows how the data must be formatted to create the input matrices for MOFA. This script generates the R object Data/MOFA_molecular/MOFAobjectB.RData, which contains the formatted matrices for several omics data types: RNA sequencing, copy number variants, methylation, and mutation data. To examine in more detail how MOFAobjectB was constructed, see the raw molecular data in the directory Data/Molecular.

Use the MOFAobjectB object stored in Data/MOFA_molecular/MOFAobjectB.RData as input to the script Scripts/MOFA_script.R, which runs the MOFA analysis. The script slurm_script/demo_MOFA.sh provides an example of how to run MOFA_script.R on a SLURM cluster. MOFA_script.R generates the following outputs:

  • Data/MOFA_molecular/MOFAobject_trained.hdf5
  • Data/MOFA_molecular/MOFAobject.trained.RData

For exploratory analysis of these outputs, refer to Scripts/Script_ExpMethAltCNV_lungNENomicsCombined.R (after line 111). Additional tutorials for MOFA analysis can be found here:


Whole Slide Images (WSIs) Data - Morphological features

WSIs are large microscopic images of tumours. To process them using deep learning and GPUs, WSIs are divided into smaller patches, referred to as tiles. These tiles are processed using a self-supervised deep learning method called Barlow Twins. This process generates encoded vectors for each tile, which are then clustered to derive morphological partitions.

The file /Data/WSI_DL_outputs/LeidenComK_75_res3_r1_repartition_by_samples_filtered_projectMG.csv contains the proportion of tiles in each morphological partition for each sample. These distributions are interpreted as the morphological profiles of the patients, summarising information extracted from approximately 10,000 tiles per WSI.
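As an illustration of what a morphological profile is, the minimal sketch below computes the proportion of tiles per partition for a single sample. It is written in Python for brevity (the project scripts themselves are in R); the function name, the toy tile labels, and the number of partitions are hypothetical and are not taken from the pipeline.

```python
from collections import Counter


def morphological_profile(tile_partitions, all_partitions):
    """Return the fraction of a sample's tiles that fall in each partition."""
    total = len(tile_partitions)
    if total == 0:
        raise ValueError("sample has no tiles")
    counts = Counter(tile_partitions)
    # Partitions with no tile in this sample get a proportion of 0.
    return {p: counts.get(p, 0) / total for p in all_partitions}


# Hypothetical toy input: the partition label assigned to each tile of one WSI.
tiles = [0, 0, 1, 2, 2, 2, 0, 1]
profile = morphological_profile(tiles, all_partitions=[0, 1, 2, 3])
print(profile)
```

Each row of the CSV file above can be read as one such profile: the proportions for a sample sum to 1 across partitions.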


Tasks

Task 1: Generate a New MOFA Input Object

Create a MOFA input object that combines the molecular and morphological features. To achieve this:

  1. Match the identifier of each WSI (column WSI_id in /Data/WSI_DL_outputs/LeidenComK_75_res3_r1_repartition_by_samples_filtered_projectMG.csv) with the corresponding sample_id of the molecular MOFA, using the correspondence table Data/TechnicalData/key_clinical_data_patients_with_WSI.csv.
  2. Adapt Scripts/Script_ExpMethAltCNV_lungNENomicsCombined.R to create a new MOFA object.
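The matching logic in step 1 can be sketched as follows. This is a minimal illustration in Python using tiny in-memory stand-ins for the two CSV files, not the project's R code. The partition column names (partition_0, partition_1), the toy identifiers, and the assumption that the key table exposes both sample_id and WSI_id columns are hypothetical and should be checked against the real files. The sketch assumes the usual MOFA convention of one matrix per data type with features in rows and samples in columns, and fills samples without a WSI with missing values.

```python
import csv
import io
import math

# Hypothetical miniature stand-ins for the two CSV files (the real files
# contain more rows and columns).
profiles_csv = """WSI_id,partition_0,partition_1
W1,0.7,0.3
W2,0.2,0.8
"""
key_csv = """sample_id,WSI_id
S1,W1
S2,W2
"""
# Hypothetical sample order of the molecular matrices; S3 has no WSI.
molecular_samples = ["S1", "S2", "S3"]


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Map each WSI identifier to its molecular sample identifier.
wsi_to_sample = {row["WSI_id"]: row["sample_id"] for row in read_rows(key_csv)}

# Re-key the morphological profiles by sample_id.
profile_by_sample = {}
for row in read_rows(profiles_csv):
    sample = wsi_to_sample.get(row["WSI_id"])
    if sample is None:
        continue  # WSI with no matching molecular sample
    profile_by_sample[sample] = {
        name: float(value) for name, value in row.items() if name != "WSI_id"
    }

features = sorted({name for prof in profile_by_sample.values() for name in prof})

# Features x samples matrix, with NaN where a sample has no morphological data.
morpho_view = [
    [profile_by_sample.get(s, {}).get(f, math.nan) for s in molecular_samples]
    for f in features
]

for name, values in zip(features, morpho_view):
    print(name, values)
```

Keeping the sample order identical to that of the molecular matrices means the morphological matrix can be added as one more data type alongside the existing ones.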

Task 2: Run MOFA with Combined Features

Adapt and reuse the following scripts to run MOFA on the combined molecular and morphological features:

  • Scripts/MOFA_script.R
  • slurm_script/demo_MOFA.sh

Task 3: Explore the Latent Space

Use the output objects from the MOFA analysis to explore the resulting latent space:

  1. Check associations between latent factors and data types (see examples in Scripts/Script_ExpMethAltCNV_lungNENomicsCombined.R, after line 111).
  2. Explore associations between the latent space and key clinical variables. Clinical data can be found in:
    • Data/TechnicalData/key_clinical_data_patients_with_WSI.csv (restricted to the 192 patients with both molecular and morphological features).
    • Data/TechnicalData/key_clinical_data_patients_all_patients.csv (all patients).

Here is a list of the most important variables:

  • archtype_label_combined in Data/TechnicalData/key_clinical_data_patients_with_WSI.csv, corresponding to the molecular groups. In Data/TechnicalData/key_clinical_data_patients_all_patients.csv, this is named archetype_k4_LF3_label.
  • consensus_pathology in Data/TechnicalData/key_clinical_data_patients_with_WSI.csv, representing the histological type. In Data/TechnicalData/key_clinical_data_patients_all_patients.csv, it is named type.
  • age_cat in Data/TechnicalData/key_clinical_data_patients_all_patients.csv (age category) or age_corrected in Data/TechnicalData/key_clinical_data_patients_with_WSI.csv.
  • sex (male or female).
  • localisation_corrected in Data/TechnicalData/key_clinical_data_patients_with_WSI.csv (referred to as location in Data/TechnicalData/key_clinical_data_patients_all_patients.csv), indicating the tumour's position relative to the trachea.

Feel free to explore additional variables based on your analytical goals.
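As a starting point for step 2, the sketch below shows two simple ways to relate a latent factor to a clinical variable: per-group means for a categorical variable and a Pearson correlation for a continuous one. It is a hedged illustration in Python with invented toy values; the sample identifiers, factor scores, group labels, and ages are all hypothetical. In practice the factor values would come from the trained MOFA object and the covariates from the clinical tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy values: scores of one latent factor and two covariates.
factor1 = {"S1": -1.2, "S2": -0.8, "S3": 1.1, "S4": 0.9}
group = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}    # categorical variable
age = {"S1": 70.0, "S2": 65.0, "S3": 40.0, "S4": 45.0}  # continuous variable


def group_means(scores, labels):
    """Mean factor score within each category of a clinical variable."""
    by_label = defaultdict(list)
    for sample, score in scores.items():
        if sample in labels:  # skip samples with no clinical annotation
            by_label[labels[sample]].append(score)
    return {label: mean(values) for label, values in by_label.items()}


def pearson(scores, covariate):
    """Pearson correlation computed over samples present in both inputs."""
    shared = [s for s in scores if s in covariate]
    x = [scores[s] for s in shared]
    y = [covariate[s] for s in shared]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


means = group_means(factor1, group)
r = pearson(factor1, age)
print(means, round(r, 3))
```

Clearly separated group means or a strong correlation suggest that the factor captures variation linked to that variable. A formal statistical test would still be needed before drawing conclusions.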


Key Numbers and Information

  • 319 patients have at least one type of omics data. The molecular MOFA includes these 319 patients. Their sample IDs can be found in:

    • Data/MOFA_molecular/MOFAobjectB.RData (MOFAobjectB@samples_metadata)
    • Data/MOFA_molecular/MOFAobject.trained.RData (MOFAobject.trained@samples_metadata)
  • For technical information about these samples, use the file Data/TechnicalData/combined_public_lungNENomics_technical_data.RData (column sample_id).

  • 192 patients are associated with a WSI. Each WSI has a unique identifier in the column WSI_id of /Data/WSI_DL_outputs/LeidenComK_75_res3_r1_repartition_by_samples_filtered_projectMG.csv.

  • The correspondence between sample_id (molecular data) and WSI_id is in Data/TechnicalData/key_clinical_data_patients_with_WSI.csv.

  • Clinical data are available in:

    • Data/TechnicalData/key_clinical_data_patients_with_WSI.csv
    • Data/TechnicalData/key_clinical_data_patients_all_patients.csv

References

  • MOFA: Argelaguet, Ricard, Velten, Britta, Arnol, Damien, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology, 2018, vol. 14, no 6, p. e8124.

  • Deep Learning Pipeline: https://www.nature.com/articles/s41467-024-48666-7


Good Luck with Your Exploration!

If you encounter any issues, please contact: