Omics-and-integration-principles
To provide the users with all the tools to understand each omics data and how to analyse them we wrote this detailed guide to address the users to the sources where understand the principles of each omics.
Transcriptomics:
Bulk RNA sequencing (RNA-seq) is a widely used technique to analyze the transcriptome, which represents the complete set of RNA molecules in a population of cells. It provides insights into gene expression, splicing events, and transcript abundance across different conditions or treatments. Common steps include:
- RNA Isolation: RNA is extracted from the sample, typically using methods that preserve the integrity of the RNA. The RNA Integrity Number (RIN) is the value to estimate the RNA integrity. A score above 6-7, on a RIN scale from 0 to 10 is required to asses a good RNA quality.
- cDNA Library Preparation and sequencing: The RNA is converted into complementary DNA (cDNA) using reverse transcription. The cDNA is then fragmented, and adapters are added to facilitate sequencing. the library is sequenced on platforms such as Illumina, generating millions of short nucleotide sequences called reads. These reads represent fragments of the original RNA molecules. This video illustrates the method: Link
- Mapping and Matrix Data Generation: The sequencing reads are mapped to a reference genome or transcriptome using specialized software like STAR, HISAT2, or Bowtie2. Mapping aligns the reads to specific regions, helping identify which genes or transcripts are expressed. Once mapped, the number of reads aligning to each gene is counted, generating a count matrix. Each row represents a gene, and each column corresponds to a sample. The count matrix is the primary data format used for downstream analysis.
- Statistical Properties of RNA-seq Data: RNA-seq count data is not normally distributed but follows discrete distributions, often resembling a negative binomial distribution due to overdispersion (variability greater than expected under a Poisson model). The data may have large dynamic ranges (from very low to very high expression, defined as “homoscedasticity” ), and sequencing depth variability ( “Size factor” ) that must be accounted for during analysis.
-DESeq2: This R packages model RNA-seq count data using a negative binomial distribution. It performs Size factor normalization and statistical testing to account for variability in sequencing depth and dispersion. By default, in BiomiX the counts are normalized by variance stabilizing transformation (vst) to correct the homoscedasticity.
-Limma: This package is suitable for normalized data that follow an approximately normal (Gaussian) distribution. Examples (RPKM, FPKM and TPM, vst normalized).
Metabolomics is the comprehensive study of metabolites—small molecules involved in metabolism—within a biological system. These metabolites reflect the biochemical activities of cells, tissues, or organisms and provide insights into their physiological and pathological states. Here is an explanation of the principles: Link
Metabolomics typically involves the extraction of metabolites from biological samples (e.g., tissues, blood, or urine), followed by analysis using high-throughput techniques such as mass spectrometry (MS) or nuclear magnetic resonance (NMR) spectroscopy. These methods measure the abundance of metabolites, enabling their identification and quantification. Common steps in this analysis include:
Data generation and Peak Detection:: Identifying metabolite signals in raw data.
Feature Alignment: Aligning signals across samples to ensure consistency.
Normalization: Correcting for variations in sample processing or instrument performance.
These steps are not included in BiomiX, but a guideline of how perform these steps is reported here:
doi: 10.3390/metabo10010028
Literature reference
Metaboanalyst Raw Spectra Processing
Metaboanalyst Raw Spectra Processing pdf
Annotation and Identification: Assigning chemical identities to detected features using databases like HMDB (Human Metabolome Database) or KEGG.
Data Distribution and Statistical Analysis: Metabolomic data often exhibit non-normal distributions and require appropriate statistical tools
Pathway Analysis: Tools like MetaboAnalyst link metabolites to biological pathways for functional insights
Targeted vs. Untargeted Metabolomics
Targeted Metabolomics
In targeted metabolomics, researchers focus on a predefined set of metabolites, often selected based on prior knowledge or specific hypotheses.
- Advantages: High sensitivity, precision, and quantification accuracy.
- Applications: Quantitative analysis of specific pathways or biomarkers, such as glucose levels in diabetes research.
- Limitations: Restricted to a limited number of metabolites, potentially missing novel or unexpected findings.
Untargeted Metabolomics
Untargeted metabolomics aims to detect as many metabolites as possible in a sample, providing an unbiased snapshot of the metabolic state.
- Advantages: Comprehensive analysis, enabling the discovery of unknown metabolites and broad profiling of pathways.
- Applications: Biomarker discovery, hypothesis generation, and exploration of novel biochemical processes.
- Limitations: Lower precision for quantification, challenges in identifying unknown metabolites, and greater complexity in data analysis.
Source image: https://www.drugtargetreview.com/article/33033/the-next-revolution-in-stratified-medicine-molecular-phenotyping-with-metabolomics/
Methylomics
Methylomics involves studying DNA methylation, an essential epigenetic modification where methyl groups are added to cytosines, typically in CpG dinucleotides. This modification regulates gene expression and plays a crucial role in development, cellular differentiation, and disease processes such as cancer. The Illumina Infinium HumanMethylation450 BeadChip (450K) and MethylationEPIC BeadChip (EPIC) arrays are widely used platforms for profiling genome-wide DNA methylation.
Principles of the techniques: Link
Common Steps:
- Data generation: Both arrays require bisulfite-converted DNA (~500 ng per sample), where unmethylated cytosines are converted to uracil while methylated cytosines remain unchanged.
450K Array
- Coverage: The 450K array targets ~485,000 CpG sites, including promoters, gene bodies, and intergenic regions.
- Target Regions: Cover CpG islands, shores, shelves, and open seas.
EPIC Array
- Coverage: The EPIC array extends this coverage to >850,000 CpG sites, adding enhancers and regulatory elements identified by projects like ENCODE and FANTOM.
- Target Regions: Both arrays cover CpG islands, shores, shelves, and open seas and include additional enhancer regions and non-CpG methylation sites, improving its utility for diverse studies.
Output Data
The arrays generate beta values (ranging from 0 to 1), representing the fraction of methylation at each CpG site. M-values can also be provided. Before using them in BiomiX, we suggest converting them in R:
→ Open R studio
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("lumi")
library(lumi)
m <- read.table("path/to/the/matrix")
m <- m2beta(m)
write.table(m, file="Matrix_converted_to_beta_value", sep = "\t", quote = FALSE, row.names = FALSE)
Detection p-values: These ensure the reliability of each site, and poor-quality data can be filtered out.
Main interface guide
The main interface structure:
Condition. Group to analyse. Group to analyse, it is directly displayed from the Metadata column “CONDITION”. Avoid using spaces in the conditions, i.e (No treated → No_treated)
Control. Reference group, compared with the condition one. Avoid use spaces in the conditions, i.e (No treated → No_treated) .
Output. Directory where save the output results of the analysis.
Omics input grid:
Input. Input slot number.
Preview-QC. Opens a shiny app that visualizes, transforms the data and removes variables or sample outliers.
Single omics Analysis. Check the box to analyse that omic by our single omics pipelines.
Data type. Select the type of omics uploaded.
Integration. Check the box to include those omics in the MOFA integration.
Label. Omics name, used to name the output folder and filter the sample based on the sample names.
Selection. If selected filters the samples based on the sample name using a regex derived by the label name. It requires the "SAMPLE_TYPE" column (optional). It supposes the input matrix contains samples from different tissues or cells (convoluted matrix), and this source is also defined in the sample ID (e.g., ID = 25658_BLymphocyte, 25658_GlialCells). In that case, this column allows analysis to be performed only on data from the selected tissues or cells.
Data upload. Click on the button to upload the matrix. Here, it is possible to modify the BiomiX format (.tsv,xls,xlsx) matrix.
Integration grid:
Integration. Do you want to perform the integration with the uploaded omics data?.
Method. Which integration method? (Only MOFA is available for now).
N° Factors. How many MOFA factors would you like to calculate for the integration? We recommend running the tuned mode (automatic) by selecting 0. Once the top three MOFA models are identified, you can re-run the analysis focusing on the model of interest by selecting the corresponding number of MOFA factors. This step is particularly important for visualizing heatmaps and correlations of the significant factors within the chosen model (select the relevant discriminant factor in within the Factor to explore input).
Factor to explore. Which factor do you want to explore graphically? (Contributors, heatmap clustering etc...). It is suggested to set it to 1 for the first exploratory analysis. When the model and the total number of factors are defined at this point, it is suggested to set this parameter to visualize the factor of interest graphically.
Omics overlay.What is the minimum number of omics a sample must have to be included in the integration? For MOFA, it is crucial to maximize the number of shared samples across the datasets. An omics dataset with significant differences in sample coverage can disproportionately influence the model, causing the predominant feature to be overweighted. To ensure balanced integration, at least 50% of the samples should be shared across all omics.
Open advance option. Button to open the advance option interface
Open BiomiX chatbot. Button to open the BiomiX chatbot
Start Analysis. Button to start the analysis.
Guide to the advanced option interface
Advanced options interface (General section):
Log2FC threshold. Log2FC threshold value for significant results.
P.adj threshold. Adjusted p.value threshold value for significant results.
Gene Panel. Button to upload the gene panel for the subgrouping analysis. Specifically, the standard deviation score (Z activity score) uses the counts normalized by the number of reads to compare the expression of each gene (g) in each disease or treated sample (s) with the mean expression of the CTRLs divided by the standard deviation of the CTRLs as in the following equation:
The higher the standard deviation shift is in a gene, the higher the gene expression is in the condition compared to the CTRLs.
Array type. The type of chip from which the data comes (450K/EPIC).
N° of genes within the panel with score > one or two. Parameters to define the positive/negative subgrouping by the gene panel. The user can modify the number of genes that must have a Z-score > one or two of the control standard deviation. By default, according to the Kirou score on interferon signal, the samples with three genes with a score > 2 or 10 genes with a score > 1 are labeled positive. We strongly suggest testing different combinations of parameters for the analysis focusing on the subgrouping, as each gene panel may show differences in standard deviation changes. To check if the subgrouping is consistent, it is possible to compare it with the heatmap clustering (Gene_panel_subgroups...pdf).
Removal of positive control samples. If checked, the controls positive for the gene panel can be excluded from the downstream analysis. This is relevant to exclude controls not respecting certain features (e.i healthy patients with occurring infections spotted as high inflammation)
N° top DE genes in heatmap Number of top genes(or metabolites) to visualize in the heatmaps. We suggest a value around 25, too many features visualized could disrupt the plot.
Clustering distance. Clustering distance used in the heatmaps, both in subgrouping and single omics analysis. Here a guideline for the distance choice:
Clustering Distance Recommendation
- Euclidean: Use for normalized continuous data.
- Maximum: Use when the largest difference in any dimension is critical.
- Manhattan: Good for high-dimensional data or when axis-aligned differences matter.
- Canberra: Ideal for data with varying scales or sparse datasets.
- Binary: Use exclusively for binary data.
- Minkowski: Provides flexibility; choose based on p (e.g., Euclidean or Manhattan).
- Pearson: Focus on linear relationships in normalized data.
- Spearman: Use for monotonic relationships or rank-based data.
- Kendall: Best for ordinal data or when robustness to noise is required.
Details
1. Euclidean Distance
Description: The straight-line distance between two points in a multidimensional space. It is the most commonly used distance metric.
Formula: \(d = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}\)
- Pros:
- Intuitive and easy to compute.
- Works well with continuous and normalized data.
- Cons:
- Sensitive to scale and outliers.
- Assumes all dimensions are equally important.
- Use Case: Suitable when data is continuous and has been normalized or standardized. Avoid for high-dimensional data due to the "curse of dimensionality."
2. Maximum Distance (Chebyshev Distance)
Description: Measures the maximum absolute difference between dimensions.
Formula: \(d = \max(|x_i - y_i|)\)
- Pros:
- Captures the largest difference, making it sensitive to outliers in one dimension.
- Simple to compute.
- Cons:
- Ignores variation in other dimensions.
- Not ideal for general clustering tasks.
- Use Case: Useful when extreme differences in any one dimension are critical, such as quality control or anomaly detection.
3. Manhattan Distance (Taxicab or City Block Distance)
Description: The sum of the absolute differences across dimensions.
Formula: \(d = \sum_{i=1}^n |x_i - y_i|\)
- Pros:
- Less sensitive to outliers than Euclidean distance.
- Works well with high-dimensional data.
- Cons:
- Sensitive to the scale of data.
- Assumes all dimensions are equally weighted.
- Use Case: Ideal for high-dimensional data, particularly when differences along each axis are meaningful.
4. Canberra Distance
Description: A weighted metric where differences are normalized by the sum of absolute values of the variables.
Formula: \(d = \sum_{i=1}^n \frac{|x_i - y_i|}{|x_i| + |y_i|}\)
- Pros:
- Gives higher weight to smaller differences.
- Effective when variables have different scales.
- Cons:
- Undefined if any variable has a value of 0.
- More computationally expensive.
- Use Case: Suitable for sparse data or data with different scales.
5. Binary Distance
Description: Used for binary data, measuring dissimilarity based on matches (1) and mismatches (0).
- Pros:
- Simple and interpretable for binary datasets.
- Directly applicable to categorical or binary data.
- Cons:
- Limited to binary datasets.
- Ignores magnitude differences in non-binary data.
- Use Case: Ideal for datasets with binary attributes, such as presence/absence matrices.
6. Minkowski Distance
Description: A generalized form of Euclidean and Manhattan distances. It uses a parameter p, where p=2 gives Euclidean distance and p=1 gives Manhattan distance.
Formula: \(d = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}\)
- Pros:
- Flexible; allows customization through p.
- Includes Euclidean and Manhattan distances as special cases.
- Cons:
- Choosing the appropriate p can be non-intuitive.
- Sensitive to the scale of data.
- Use Case: Use when you want flexibility in defining the distance metric based on the problem.
7. Pearson Correlation Distance
Description: Measures how well two variables are linearly related. The distance is \(1 - r\), where \(r\) is the Pearson correlation coefficient.
- Pros:
- Captures linear relationships between variables.
- Scale-independent.
- Cons:
- Assumes a linear relationship.
- Sensitive to outliers.
- Use Case: Ideal for clustering gene expression data or other datasets where linear relationships are important.
8. Spearman Correlation Distance
Description: Measures the monotonic relationship between variables. Similar to Pearson but uses ranks instead of raw values.
- Pros:
- Robust to non-linear relationships.
- Less sensitive to outliers than Pearson.
- Cons:
- Ignores the magnitude of differences.
- Slightly more complex computation.
- Use Case: Use when relationships between variables are monotonic rather than linear.
9. Kendall Correlation Distance
Description: Measures ordinal association between two variables by comparing concordant and discordant pairs.
- Pros:
- Robust to noise and outliers.
- Effective for ordinal or rank-based data.
- Cons:
- Computationally intensive for large datasets.
- Ignores magnitude differences.
- Use Case: Suitable for datasets with ordinal or rank-based data.
Clustering Methods. Clustering method used in the heatmaps, both in subgrouping and single omics analysis.
Clustering Methods Recommendation
Clustering methods used in the heatmaps, both in subgrouping and single-omics analysis. Here is a guideline about the method choice.
Method |
Pros |
Cons |
Best for |
Ward.D/D2 |
Compact clusters, minimizes variance |
Sensitive to scale/outliers |
Continuous, normalized data |
Single |
Handles irregular shapes |
Prone to chaining |
Non-elliptical shapes |
Complete |
Compact clusters, avoids chaining |
Sensitive to outliers |
Compact, well-separated clusters |
Average |
Balances compactness and shape sensitivity |
May lose precision with irregular clusters |
General-purpose clustering |
McQuitty |
Equal cluster importance, simple |
Assumes equal weight for all clusters |
Equal-weight clustering scenarios |
Median |
Robust to outliers, flexible |
Uneven cluster sizes, harder to interpret |
Median-based distance datasets |
Centroid |
Geometrically intuitive clusters |
Non-hierarchical results possible |
Centroid-representative data |
1. Ward’s Method (Ward.D and Ward.D2)
Ward.D: Minimizes the total within-cluster variance by merging clusters that result in the smallest increase in total variance.
Ward.D2: A variation of Ward.D that uses squared Euclidean distances for calculating dissimilarity.
- Pros:
- Produces compact, spherical clusters.
- Works well for continuous and normalized data.
- Reduces noise by emphasizing variance minimization.
- Cons:
- Sensitive to outliers and scaling of variables.
- Assumes that clusters have similar sizes and densities.
- Use Case:
- Use Ward.D for general hierarchical clustering with continuous data.
- Use Ward.D2 for clustering tasks where squared distances are more appropriate.
2. Single Linkage
Description: Also known as the "nearest neighbor" method, it merges clusters based on the smallest pairwise distance between their points.
Formula: \(d(A, B) = \min(d(i, j))\) for all \(i \in A, j \in B\).
- Pros:
- Can handle non-elliptical shapes (e.g., chaining clusters).
- Computationally simple.
- Cons:
- Prone to "chaining," where distant points may be linked via a series of close intermediate points.
- Tends to create long, elongated clusters.
- Use Case: Suitable for data with irregular or elongated shapes. Avoid when compact clusters are desired.
3. Complete Linkage
Description: Also known as the "farthest neighbor" method, it merges clusters based on the largest pairwise distance between their points.
Formula: \(d(A, B) = \max(d(i, j))\) for all \(i \in A, j \in B\).
- Pros:
- Produces compact and tightly bound clusters.
- Less prone to chaining compared to single linkage.
- Cons:
- Sensitive to outliers.
- May break apart elongated clusters.
- Use Case: Best when compact, well-separated clusters are desired. Avoid for datasets with irregular shapes.
4. Average Linkage (UPGMA)
Description: Merges clusters based on the average pairwise distance between their points.
Formula: \(d(A, B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(i, j)\)
- Pros:
- Balances compactness and sensitivity to cluster shapes.
- Works well for most datasets, including those with moderate noise.
- Cons:
- Assumes equal weights for all clusters, which may not be realistic.
- Can lose precision with highly irregular clusters.
- Use Case: A general-purpose method suitable for many types of data.
5. McQuitty’s Method (WPGMA)
Description: A weighted version of average linkage, where the contribution of each cluster to the distance is equal regardless of size.
Formula: Similar to UPGMA, but weights are evenly distributed regardless of cluster sizes.
- Pros:
- Simpler and faster than UPGMA.
- Treats all clusters equally, regardless of size.
- Cons:
- Assumes equal cluster importance, which may not always be appropriate.
- Less robust than UPGMA.
- Use Case: Useful when equal importance for all clusters is desired.
6. Median Method (WPGMC)
Description: A weighted method that computes the median distance between clusters to decide merges.
Formula: Based on cluster medians rather than means or minima/maxima.
- Pros:
- Balances robustness and flexibility.
- Less sensitive to outliers than some other methods.
- Cons:
- May create clusters with uneven sizes.
- Less commonly used and harder to interpret.
- Use Case: Suitable for datasets where median-based distances provide meaningful clustering.
7. Centroid Method (UPGMC)
Description: Similar to Ward’s method, but uses the centroid of each cluster as the basis for merging.
Formula: \(d(A, B) = \|\text{Centroid}_A - \text{Centroid}_B\|\)
- Pros:
- Produces geometrically interpretable clusters.
- Suitable for continuous and normalized data.
- Cons:
- May create non-hierarchical results if centroids from smaller clusters differ significantly.
- Sensitive to outliers.
- Use Case: Use for datasets where the centroid offers a meaningful representation of the cluster.
Summary
Each clustering method has its strengths and weaknesses. The best choice depends on the dataset's properties and the goals of the analysis. For example:
- Ward.D/D2: Best for compact, spherical clusters and normalized data.
- Single Linkage: Effective for irregular or elongated shapes.
- Complete Linkage: Suitable for compact and well-separated clusters.
- Average Linkage: General-purpose and balances shape sensitivity.
- McQuitty: Ideal when equal cluster importance is required.
- Median: Robust for uneven sizes and flexible in median-based datasets.
- Centroid: Best for datasets where centroid interpretation is meaningful.
CPU threads. Number of CPU threads used in your PC in parallel.
N° MOFA input features. Number of top features used in the MOFA analysis. For metabolomics, the number of features typically ranges between 100 and 1000. While there is no upper limit for the number of features selected in MOFA analysis, it is recommended to balance the number of features across the different omics layers to prevent one omic from dominating the analysis. This recommendation also applies to undefined databases, which may include only a few features. Using a similar number of features across omics is advised for more balanced integration. Additionally, it is possible to increase the number of input features to capture more variance from each omic, but this will result in slower computation times.
This option supports metabolomics data obtained from any analytical platform such as LC-MS, Gas chromatography coupled to mass spectrometry (GC-MS), Capillary electrophoresis coupled to mass spectrometry (CE-MS) or Nuclear magnetic resonance (NMR), provided the metabolites' biological identities are available.
Metabolite annotation. Which annotation (HMDB, KEGG, compound name) are used in the uploaded annotated metabolomics matrices.
MS1:
This option supports MS1 annotation, where users can upload MS1 files, containing mass-to-charge ratio (m/z) and retention type. The retention time, despite is not used directly in the ms1 annotation but is only copy pasted in the results. This analysis allows the annotation based only on the m/z of the metabolites (called annotation of Level 3).
Source image: Nichole A. Reisdorph et all, A Perspective and Framework for Developing Sample Type Specific Databases for LC/MS-Based Clinical Metabolomics
Ion mode. Type of ionization used in the metabolomics data acquisition.
-Positive ionization: Compounds gain a proton (+H+) or other cations (e.g., Na+, K+) during ionization.
-Negative ionization: Compounds lose a proton (−H+) or gain an electron during ionization
-Neutral: Neutral species are not directly ionized but can be inferred after processing data from positive and/or negative ion modes (e.g., reconstructing mass of neutral species).
M/Z Tolerance ppm. Ppm tolerance used in the MS1 annotation. 15 ppm by default, its reduction reduce the missannotation but also the ratio of annotated metabolite. The ppm error (accuracy) of the MS method should be taken into account for the choice of this parameter.
Examples:
-Quadrupole Mass Spectrometer > 100 ppm
-Time-of-Flight (TOF) 1-10 ppm
-Orbitrap Mass Spectrometer <1 ppm
-Fourier-Transform Ion Cyclotron Resonance (FT-ICR) Sub-ppm (as low as 0.1 ppm in optimal conditions)
It means that m/z values coming from Orbitrap Mass Spectrometer could be match with a ppm = 1.
Adduct positive mode. Type of positive adduct generated during the metabolomics data acquisition. This should take into account the solvent used. Available adducts: M+H, M+2H, M+Na, M+NH4, M+H-H2O.
Adduct negative mode. Type of negative adduct generated during the metabolomics data acquisition. This should take into account the solvent used. Available adducts: M-H, M-Cl, M-FA-H, M+NH4, M-H-H2O.
Adduct neutral mode. If checked will consider these data as neutral.
MS1 files upload. Button were upload the MS1 annotation with the corresponding input number. It is required a column called name including the peak (matched with the one in the metabolomics matrix), mass/charge ration m/z and retention time in minutes (RT_min) and or seconds (RT_sec).
Databases MS1. Databases consulted by the MS1 annotation (By Ceu Mass Mediator). Available databases: HMDB, LipidMaps, Metlin, Kegg.
MS2:
This option supports MS1 and MS2 annotation, where users can upload MS1 files containing mass-to-charge ratio (m/z) and retention type. The retention time is not used directly in the ms1 annotation but only copy pasted in the results. This analysis allows annotations based only on the m/z of the metabolites. Then, the MS2 (containing the fragmentation spectra of the metabolites) is matched with the databases to retrieve a more confident annotation.(called annotation of Level 2).
mz match MS1/2. Mass charge ratio tolerance for the match between MS1 annotation and the fragmentation spectra MS2 provided within the .mzml or .mgf files.
RT match MS1/2. Retention time tolerance for the match between MS1 annotation and the fragmentation spectra MS2 provided within the .mzml or .mgf files.
Column. Type of column used in the liquid chromatography.
Databases MS2. Databases consulted by the MS2 annotation via Tidymass package. Availables: HMDB, MONA, MASSBANK.It is possible here to set up a priority, where the database having the higher priority will annotate first the unannotated peaks, if another annotation is identified in another database will be discarded and not replaced with the one already available.
MS2 directory Directory containing the fragmentation spectra (.mzml or .mgf)
The other option not mentioned here are reported in the MS1 section, as they are involved in the MS1 annotation before the MS2 one.
Column name Column used for the sample filtering
Column name Type of data in the defined column
Threshold/Factor selected Threshold to filter the samples (format: ">= 90" / "== 0.56") or specific condition (format: "== male" / "== treated")
Advanced options interface (MOFA section):
MOFA parameters
Max iteration. Maximum number of MOFA iterations in the model training.
Convergence mode. Speed to train the MOFA model. The longer the more accurate.
Threshold contribution weight. Threshold to isolate the top MOFA factor contributors. Among the 5% top percent contributors (prefiltered by BiomiX). The contributors are normalized based on the highest contributor, resulting in scores ranging from 0 to 1. The user can define a contribution threshold to reduce the number of contributors to examine; this threshold is set to 0.5 by default.
Discriminant MOFA factors interpretation
Type of research (Bibliography). The type of document where the contributors will be researched in Pubmed. Abstract/Title or Text word.
N° articles (Bibliography). Number of articles consulted by BiomiX to select articles related to the significant factors. Attention, Pubmed server has a max number of request and the research of too many contributors in too many articles can drive the request interruption with the server. In this case reduce the number of articles consulted or number of contributors researched.There are 300 articles retrived by default.
N° top contributors (Bibliography). Top contributors of each omics researched in the PubMed abstracts. Ten contributors are selected by default
N° keywords extracted (Bibliography). Number of keywords extracted in each article related to significant MOFA factors. These are the keywords actively extracted selecting the most cited words of the abstract by natural language processing.
P.adj threshold (Pathway mining). Adjusted p.value threshold to consider significant a biological pathway.
Pathways shown (Pathway mining). Number of biological pathways visualized as output in the pdf reports,, 10 pathways are shown by default.
Numerical (Clinical). Check the box to correlate the clinical data (Numeric) with the significant MOFA factor and then upload the clinical data. The numerical clinical data must be numeric and not categorical.
Binary (Clinical). Check the box to correlate the clinical data (Binary) with the significant MOFA factor. The binary data should be composed by 0 and 1, otherwise using the following binomial label: Yes/No, Male/Female, M/F, m/f, Present/No.
Save advanced options (Save parameters). Click on the blue button to save the advanced parameters.