Omics-and-integration-principles

To provide the users with all the tools to understand each omics data and how to analyse them we wrote this detailed guide to address the users to the sources where understand the principles of each omics.

Transcriptomics:

Bulk RNA sequencing (RNA-seq) is a widely used technique to analyze the transcriptome, which represents the complete set of RNA molecules in a population of cells. It provides insights into gene expression, splicing events, and transcript abundance across different conditions or treatments. Common steps include:
  1. RNA Isolation: RNA is extracted from the sample, typically using methods that preserve the integrity of the RNA. The RNA Integrity Number (RIN) is the value to estimate the RNA integrity. A score above 6-7, on a RIN scale from 0 to 10 is required to asses a good RNA quality.
  2. cDNA Library Preparation and sequencing: The RNA is converted into complementary DNA (cDNA) using reverse transcription. The cDNA is then fragmented, and adapters are added to facilitate sequencing. the library is sequenced on platforms such as Illumina, generating millions of short nucleotide sequences called reads. These reads represent fragments of the original RNA molecules. This video illustrates the method: Link
  3. Mapping and Matrix Data Generation: The sequencing reads are mapped to a reference genome or transcriptome using specialized software like STAR, HISAT2, or Bowtie2. Mapping aligns the reads to specific regions, helping identify which genes or transcripts are expressed. Once mapped, the number of reads aligning to each gene is counted, generating a count matrix. Each row represents a gene, and each column corresponds to a sample. The count matrix is the primary data format used for downstream analysis.


  4. Statistical Properties of RNA-seq Data: RNA-seq count data is not normally distributed but follows discrete distributions, often resembling a negative binomial distribution due to overdispersion (variability greater than expected under a Poisson model). The data may have large dynamic ranges (from very low to very high expression, defined as “homoscedasticity” ), and sequencing depth variability ( “Size factor” ) that must be accounted for during analysis.
  5. -DESeq2: This R packages model RNA-seq count data using a negative binomial distribution. It performs Size factor normalization and statistical testing to account for variability in sequencing depth and dispersion. By default, in BiomiX the counts are normalized by variance stabilizing transformation (vst) to correct the homoscedasticity. -Limma: This package is suitable for normalized data that follow an approximately normal (Gaussian) distribution. Examples (RPKM, FPKM and TPM, vst normalized).

Metabolomics:

Metabolomics is the comprehensive study of metabolites—small molecules involved in metabolism—within a biological system. These metabolites reflect the biochemical activities of cells, tissues, or organisms and provide insights into their physiological and pathological states. Here is an explanation of the principles: Link

Metabolomics typically involves the extraction of metabolites from biological samples (e.g., tissues, blood, or urine), followed by analysis using high-throughput techniques such as mass spectrometry (MS) or nuclear magnetic resonance (NMR) spectroscopy. These methods measure the abundance of metabolites, enabling their identification and quantification. Common steps in this analysis include:

Data generation and Peak Detection:: Identifying metabolite signals in raw data.
  • Feature Alignment: Aligning signals across samples to ensure consistency.
  • Normalization: Correcting for variations in sample processing or instrument performance.
  • These steps are not included in BiomiX, but a guideline of how perform these steps is reported here: doi: 10.3390/metabo10010028 Literature reference Metaboanalyst Raw Spectra Processing Metaboanalyst Raw Spectra Processing pdf
  • Annotation and Identification: Assigning chemical identities to detected features using databases like HMDB (Human Metabolome Database) or KEGG.
  • Data Distribution and Statistical Analysis: Metabolomic data often exhibit non-normal distributions and require appropriate statistical tools
  • Pathway Analysis: Tools like MetaboAnalyst link metabolites to biological pathways for functional insights
  • Targeted vs. Untargeted Metabolomics

    Targeted Metabolomics

    In targeted metabolomics, researchers focus on a predefined set of metabolites, often selected based on prior knowledge or specific hypotheses.

    Untargeted Metabolomics

    Untargeted metabolomics aims to detect as many metabolites as possible in a sample, providing an unbiased snapshot of the metabolic state.

    Source image: https://www.drugtargetreview.com/article/33033/the-next-revolution-in-stratified-medicine-molecular-phenotyping-with-metabolomics/

    Methylomics

    Methylomics involves studying DNA methylation, an essential epigenetic modification where methyl groups are added to cytosines, typically in CpG dinucleotides. This modification regulates gene expression and plays a crucial role in development, cellular differentiation, and disease processes such as cancer. The Illumina Infinium HumanMethylation450 BeadChip (450K) and MethylationEPIC BeadChip (EPIC) arrays are widely used platforms for profiling genome-wide DNA methylation.

    Principles of the techniques: Link

    Common Steps:

    450K Array

    EPIC Array

    Output Data

    The arrays generate beta values (ranging from 0 to 1), representing the fraction of methylation at each CpG site. M-values can also be provided. Before using them in BiomiX, we suggest converting them in R:

            
            → Open R studio
            if (!require("BiocManager", quietly = TRUE))
                install.packages("BiocManager")
            BiocManager::install("lumi")
    
            library(lumi)
            m <- read.table("path/to/the/matrix")
            m <- m2beta(m)
            write.table(m, file="Matrix_converted_to_beta_value", sep = "\t", quote = FALSE, row.names = FALSE)
            
        

    Detection p-values: These ensure the reliability of each site, and poor-quality data can be filtered out.



    Main interface guide

    The main interface structure:

  • Condition. Group to analyse. Group to analyse, it is directly displayed from the Metadata column “CONDITION”. Avoid using spaces in the conditions, i.e (No treated → No_treated)
  • Control. Reference group, compared with the condition one. Avoid use spaces in the conditions, i.e (No treated → No_treated) .
  • Output. Directory where save the output results of the analysis.
  • Omics input grid:

  • Input. Input slot number.
  • Preview-QC. Opens a shiny app that visualizes, transforms the data and removes variables or sample outliers.
  • Single omics Analysis. Check the box to analyse that omic by our single omics pipelines.
  • Data type. Select the type of omics uploaded.
  • Integration. Check the box to include those omics in the MOFA integration.
  • Label. Omics name, used to name the output folder and filter the sample based on the sample names.
  • Selection. If selected filters the samples based on the sample name using a regex derived by the label name. It requires the "SAMPLE_TYPE" column (optional). It supposes the input matrix contains samples from different tissues or cells (convoluted matrix), and this source is also defined in the sample ID (e.g., ID = 25658_BLymphocyte, 25658_GlialCells). In that case, this column allows analysis to be performed only on data from the selected tissues or cells.
  • Data upload. Click on the button to upload the matrix. Here, it is possible to modify the BiomiX format (.tsv,xls,xlsx) matrix.
  • Integration grid:

  • Integration. Do you want to perform the integration with the uploaded omics data?.
  • Method. Which integration method? (Only MOFA is available for now).
  • N° Factors. How many MOFA factors would you like to calculate for the integration? We recommend running the tuned mode (automatic) by selecting 0. Once the top three MOFA models are identified, you can re-run the analysis focusing on the model of interest by selecting the corresponding number of MOFA factors. This step is particularly important for visualizing heatmaps and correlations of the significant factors within the chosen model (select the relevant discriminant factor in within the Factor to explore input).
  • Factor to explore. Which factor do you want to explore graphically? (Contributors, heatmap clustering etc...). It is suggested to set it to 1 for the first exploratory analysis. When the model and the total number of factors are defined at this point, it is suggested to set this parameter to visualize the factor of interest graphically.
  • Omics overlay.What is the minimum number of omics a sample must have to be included in the integration? For MOFA, it is crucial to maximize the number of shared samples across the datasets. An omics dataset with significant differences in sample coverage can disproportionately influence the model, causing the predominant feature to be overweighted. To ensure balanced integration, at least 50% of the samples should be shared across all omics.
  • Open advance option. Button to open the advance option interface
  • Open BiomiX chatbot. Button to open the BiomiX chatbot
  • Start Analysis. Button to start the analysis.




  • Guide to the advanced option interface

    Advanced options interface (General section):

  • Log2FC threshold. Log2FC threshold value for significant results.
  • P.adj threshold. Adjusted p.value threshold value for significant results.
  • Gene Panel. Button to upload the gene panel for the subgrouping analysis. Specifically, the standard deviation score (Z activity score) uses the counts normalized by the number of reads to compare the expression of each gene (g) in each disease or treated sample (s) with the mean expression of the CTRLs divided by the standard deviation of the CTRLs as in the following equation:



    The higher the standard deviation shift is in a gene, the higher the gene expression is in the condition compared to the CTRLs.
  • Array type. The type of chip from which the data comes (450K/EPIC).
  • N° of genes within the panel with score > one or two. Parameters to define the positive/negative subgrouping by the gene panel. The user can modify the number of genes that must have a Z-score > one or two of the control standard deviation. By default, according to the Kirou score on interferon signal, the samples with three genes with a score > 2 or 10 genes with a score > 1 are labeled positive. We strongly suggest testing different combinations of parameters for the analysis focusing on the subgrouping, as each gene panel may show differences in standard deviation changes. To check if the subgrouping is consistent, it is possible to compare it with the heatmap clustering (Gene_panel_subgroups...pdf).
  • Removal of positive control samples. If checked, the controls positive for the gene panel can be excluded from the downstream analysis. This is relevant to exclude controls not respecting certain features (e.i healthy patients with occurring infections spotted as high inflammation)
  • N° top DE genes in heatmap Number of top genes(or metabolites) to visualize in the heatmaps. We suggest a value around 25, too many features visualized could disrupt the plot.
  • Clustering distance. Clustering distance used in the heatmaps, both in subgrouping and single omics analysis. Here a guideline for the distance choice:
  • Clustering Distance Recommendation

    Details

    1. Euclidean Distance

    Description: The straight-line distance between two points in a multidimensional space. It is the most commonly used distance metric.

    Formula: \(d = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}\)

    2. Maximum Distance (Chebyshev Distance)

    Description: Measures the maximum absolute difference between dimensions.

    Formula: \(d = \max(|x_i - y_i|)\)

    3. Manhattan Distance (Taxicab or City Block Distance)

    Description: The sum of the absolute differences across dimensions.

    Formula: \(d = \sum_{i=1}^n |x_i - y_i|\)

    4. Canberra Distance

    Description: A weighted metric where differences are normalized by the sum of absolute values of the variables.

    Formula: \(d = \sum_{i=1}^n \frac{|x_i - y_i|}{|x_i| + |y_i|}\)

    5. Binary Distance

    Description: Used for binary data, measuring dissimilarity based on matches (1) and mismatches (0).

    6. Minkowski Distance

    Description: A generalized form of Euclidean and Manhattan distances. It uses a parameter p, where p=2 gives Euclidean distance and p=1 gives Manhattan distance.

    Formula: \(d = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}\)

    7. Pearson Correlation Distance

    Description: Measures how well two variables are linearly related. The distance is \(1 - r\), where \(r\) is the Pearson correlation coefficient.

    8. Spearman Correlation Distance

    Description: Measures the monotonic relationship between variables. Similar to Pearson but uses ranks instead of raw values.

    9. Kendall Correlation Distance

    Description: Measures ordinal association between two variables by comparing concordant and discordant pairs.



  • Clustering Methods. Clustering method used in the heatmaps, both in subgrouping and single omics analysis.
  • Clustering Methods Recommendation

    Clustering methods used in the heatmaps, both in subgrouping and single-omics analysis. Here is a guideline about the method choice.

    Method Pros Cons Best for
    Ward.D/D2 Compact clusters, minimizes variance Sensitive to scale/outliers Continuous, normalized data
    Single Handles irregular shapes Prone to chaining Non-elliptical shapes
    Complete Compact clusters, avoids chaining Sensitive to outliers Compact, well-separated clusters
    Average Balances compactness and shape sensitivity May lose precision with irregular clusters General-purpose clustering
    McQuitty Equal cluster importance, simple Assumes equal weight for all clusters Equal-weight clustering scenarios
    Median Robust to outliers, flexible Uneven cluster sizes, harder to interpret Median-based distance datasets
    Centroid Geometrically intuitive clusters Non-hierarchical results possible Centroid-representative data

    1. Ward’s Method (Ward.D and Ward.D2)

    Ward.D: Minimizes the total within-cluster variance by merging clusters that result in the smallest increase in total variance.

    Ward.D2: A variation of Ward.D that uses squared Euclidean distances for calculating dissimilarity.

    2. Single Linkage

    Description: Also known as the "nearest neighbor" method, it merges clusters based on the smallest pairwise distance between their points.

    Formula: \(d(A, B) = \min(d(i, j))\) for all \(i \in A, j \in B\).

    3. Complete Linkage

    Description: Also known as the "farthest neighbor" method, it merges clusters based on the largest pairwise distance between their points.

    Formula: \(d(A, B) = \max(d(i, j))\) for all \(i \in A, j \in B\).

    4. Average Linkage (UPGMA)

    Description: Merges clusters based on the average pairwise distance between their points.

    Formula: \(d(A, B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(i, j)\)

    5. McQuitty’s Method (WPGMA)

    Description: A weighted version of average linkage, where the contribution of each cluster to the distance is equal regardless of size.

    Formula: Similar to UPGMA, but weights are evenly distributed regardless of cluster sizes.

    6. Median Method (WPGMC)

    Description: A weighted method that computes the median distance between clusters to decide merges.

    Formula: Based on cluster medians rather than means or minima/maxima.

    7. Centroid Method (UPGMC)

    Description: Similar to Ward’s method, but uses the centroid of each cluster as the basis for merging.

    Formula: \(d(A, B) = \|\text{Centroid}_A - \text{Centroid}_B\|\)

    Summary

    Each clustering method has its strengths and weaknesses. The best choice depends on the dataset's properties and the goals of the analysis. For example:

  • CPU threads. Number of CPU threads used in your PC in parallel.
  • N° MOFA input features. Number of top features used in the MOFA analysis. For metabolomics, the number of features typically ranges between 100 and 1000. While there is no upper limit for the number of features selected in MOFA analysis, it is recommended to balance the number of features across the different omics layers to prevent one omic from dominating the analysis. This recommendation also applies to undefined databases, which may include only a few features. Using a similar number of features across omics is advised for more balanced integration. Additionally, it is possible to increase the number of input features to capture more variance from each omic, but this will result in slower computation times.


  • Advanced options interface (Metabolomics section):

    Targeted-metabolomics

    This option supports metabolomics data obtained from any analytical platform such as LC-MS, Gas chromatography coupled to mass spectrometry (GC-MS), Capillary electrophoresis coupled to mass spectrometry (CE-MS) or Nuclear magnetic resonance (NMR), provided the metabolites' biological identities are available.

  • Metabolite annotation. Which annotation (HMDB, KEGG, compound name) are used in the uploaded annotated metabolomics matrices.
  • MS1:

    This option supports MS1 annotation, where users can upload MS1 files, containing mass-to-charge ratio (m/z) and retention type. The retention time, despite is not used directly in the ms1 annotation but is only copy pasted in the results. This analysis allows the annotation based only on the m/z of the metabolites (called annotation of Level 3).

    Source image: Nichole A. Reisdorph et all, A Perspective and Framework for Developing Sample Type Specific Databases for LC/MS-Based Clinical Metabolomics

  • Ion mode. Type of ionization used in the metabolomics data acquisition.
  • -Positive ionization: Compounds gain a proton (+H+) or other cations (e.g., Na+, K+) during ionization.

    -Negative ionization: Compounds lose a proton (−H+) or gain an electron during ionization

    -Neutral: Neutral species are not directly ionized but can be inferred after processing data from positive and/or negative ion modes (e.g., reconstructing mass of neutral species).

  • M/Z Tolerance ppm. Ppm tolerance used in the MS1 annotation. 15 ppm by default, its reduction reduce the missannotation but also the ratio of annotated metabolite. The ppm error (accuracy) of the MS method should be taken into account for the choice of this parameter.
  • Examples:

    -Quadrupole Mass Spectrometer > 100 ppm

    -Time-of-Flight (TOF) 1-10 ppm

    -Orbitrap Mass Spectrometer <1 ppm

    -Fourier-Transform Ion Cyclotron Resonance (FT-ICR) Sub-ppm (as low as 0.1 ppm in optimal conditions)

    It means that m/z values coming from Orbitrap Mass Spectrometer could be match with a ppm = 1.

  • Adduct positive mode. Type of positive adduct generated during the metabolomics data acquisition. This should take into account the solvent used. Available adducts: M+H, M+2H, M+Na, M+NH4, M+H-H2O.
  • Adduct negative mode. Type of negative adduct generated during the metabolomics data acquisition. This should take into account the solvent used. Available adducts: M-H, M-Cl, M-FA-H, M+NH4, M-H-H2O.
  • Adduct neutral mode. If checked will consider these data as neutral.
  • MS1 files upload. Button were upload the MS1 annotation with the corresponding input number. It is required a column called name including the peak (matched with the one in the metabolomics matrix), mass/charge ration m/z and retention time in minutes (RT_min) and or seconds (RT_sec).




  • Databases MS1. Databases consulted by the MS1 annotation (By Ceu Mass Mediator). Available databases: HMDB, LipidMaps, Metlin, Kegg.
  • MS2:

    This option supports MS1 and MS2 annotation, where users can upload MS1 files containing mass-to-charge ratio (m/z) and retention type. The retention time is not used directly in the ms1 annotation but only copy pasted in the results. This analysis allows annotations based only on the m/z of the metabolites. Then, the MS2 (containing the fragmentation spectra of the metabolites) is matched with the databases to retrieve a more confident annotation.(called annotation of Level 2).

  • mz match MS1/2. Mass charge ratio tolerance for the match between MS1 annotation and the fragmentation spectra MS2 provided within the .mzml or .mgf files.
  • RT match MS1/2. Retention time tolerance for the match between MS1 annotation and the fragmentation spectra MS2 provided within the .mzml or .mgf files.
  • Column. Type of column used in the liquid chromatography.
  • Databases MS2. Databases consulted by the MS2 annotation via Tidymass package. Availables: HMDB, MONA, MASSBANK.It is possible here to set up a priority, where the database having the higher priority will annotate first the unannotated peaks, if another annotation is identified in another database will be discarded and not replaced with the one already available.
  • MS2 directory Directory containing the fragmentation spectra (.mzml or .mgf)
  • The other option not mentioned here are reported in the MS1 section, as they are involved in the MS1 annotation before the MS2 one.



    Advanced options interface (Metadata section):

  • Column name Column used for the sample filtering
  • Column name Type of data in the defined column
  • Threshold/Factor selected Threshold to filter the samples (format: ">= 90" / "== 0.56") or specific condition (format: "== male" / "== treated")


  • Advanced options interface (MOFA section):

    MOFA parameters

  • Max iteration. Maximum number of MOFA iterations in the model training.
  • Convergence mode. Speed to train the MOFA model. The longer the more accurate.
  • Threshold contribution weight. Threshold to isolate the top MOFA factor contributors. Among the 5% top percent contributors (prefiltered by BiomiX). The contributors are normalized based on the highest contributor, resulting in scores ranging from 0 to 1. The user can define a contribution threshold to reduce the number of contributors to examine; this threshold is set to 0.5 by default.
  • Discriminant MOFA factors interpretation

  • Type of research (Bibliography). The type of document where the contributors will be researched in Pubmed. Abstract/Title or Text word.
  • N° articles (Bibliography). Number of articles consulted by BiomiX to select articles related to the significant factors. Attention, Pubmed server has a max number of request and the research of too many contributors in too many articles can drive the request interruption with the server. In this case reduce the number of articles consulted or number of contributors researched.There are 300 articles retrived by default.
  • N° top contributors (Bibliography). Top contributors of each omics researched in the PubMed abstracts. Ten contributors are selected by default
  • N° keywords extracted (Bibliography). Number of keywords extracted in each article related to significant MOFA factors. These are the keywords actively extracted selecting the most cited words of the abstract by natural language processing.
  • P.adj threshold (Pathway mining). Adjusted p.value threshold to consider significant a biological pathway.
  • Pathways shown (Pathway mining). Number of biological pathways visualized as output in the pdf reports,, 10 pathways are shown by default.
  • Numerical (Clinical). Check the box to correlate the clinical data (Numeric) with the significant MOFA factor and then upload the clinical data. The numerical clinical data must be numeric and not categorical.
  • Binary (Clinical). Check the box to correlate the clinical data (Binary) with the significant MOFA factor. The binary data should be composed by 0 and 1, otherwise using the following binomial label: Yes/No, Male/Female, M/F, m/f, Present/No.
  • Save advanced options (Save parameters). Click on the blue button to save the advanced parameters.