| CJUH-JLU-China - iGEM 2023

Bioinformatics Analysis

Overview

Highlights:

  1. Feature importance analysis, including Random Forest-based analysis and cooperative game theory-based SHAP value that takes into account the interdependencies among features, has been applied in iGEM's bioinformatic analysis.
  2. A comprehensive analysis including a variety of bioinformatic methods has been applied to achieve accurate identification of target miRNA, which can be served as a pipeline for similar iGEM projects.

Introduction: Altered expression of miRNA is associated with breast cancer. However, identifying the optimal miRNA among numerous candidates as breast cancer-specific biomarkers has proven to be akin to finding a needle in a haystack. To address this challenge, we employed various approaches including differential expression analysis, feature importance analysis based on random forests, further analysis based on SHAP value, survival analysis, and literature search to select the optimal miRNA candidates.

Methods and Results: We commenced our research by undertaking a differential expression analysis, initially utilizing data from the Cancer Genome Atlas (TCGA) database and the microarray dataset GSE59247 [11]. Through the process of intersecting significantly upregulated miRNAs, we identified nine candidate miRNAs that exhibited significant upregulation consistently across both datasets. We constructed ROC curves for these 9 candidate miRNAs based on TCGA-BRCA dataset, and refined our selection by focusing on miRNAs with AUC values approaching unity (>0.9). Concurrently, we harnessed the power of a random forest classifier to classify the TCGA-BRCA dataset, utilizing miRNAs in the TCGA-BRCA dataset as distinctive features and calculating their respective feature importance scores. The selection of candidate miRNAs was rigorously guided by the dual criteria of significant upregulation and elevated feature importance. We conducted a comprehensive analysis employing SHAP (SHapley Additive exPlanations) values to delve deeper into feature importance. Finally, we did survival analysis with Kaplan-Meier curves. After a comprehensive investigation involving multiple analytical methodologies and a thorough review of relevant literatures, we validated miR-21-5p as a robust biomarker for breast cancer screening.

Figure 1: Procedure of bioinformatic analysis

Keywords: miRNA; Differential expression analysis; Feature importance; Random forest; SHAP value; Survival analysis

Differential Expression Analysis

During the initiation and progression of breast cancer, expression levels of certain miRNAs undergo changes. Some miRNAs may be upregulated, while others may exhibit decreased expression. Since our study utilizes lateral flow strips to detect miRNA with increased expression, we decide to select biomarkers from upregulated miRNAs in breast cancer for BreFast detection.

Preliminary selection of miRNAs

We performed differential expression analysis on both the GSE59247 dataset and the TCGA-BRCA dataset. By setting the filtering criteria to |logFC| > 1 and PValue < 0.05, the number of miRNAs with changed expression was shown in Table 1.

Dataset Upregulation Downregulation
GSE59247 118 98
TCGA-BRCA 226 107
Table 1 : Numbers of upregulated and downregulated miRNAs from differential expression analysis

The intersection of upregulated miRNAs of both GSE59247 and TCGA-BRCA datasets was obtained, and plotted with a Venn diagram in figure 2:

Figure 2: Venn diagram showing the numbers of upregulated miRNAs of GSE59247 and TCGA-BRCA

The intersection included 9 miRNAs, and the corresponding logFC and PValue values of these miRNA were shown in the following table:

miRNA logFC PValue
hsa-miR-21-5p 1.46 4.64E-05
hsa-miR-183-5p 3.96 5.29E-04
hsa-miR-96-5p 4.14 1.53E-03
hsa-miR-592 1.81 3.55E-05
hsa-miR-3923 1.98 4.10E-05
hsa-miR-885-5p 1.77 2.29E-04
hsa-miR-184 2.83 3.16E-02
hsa-miR-339-3p 1.31 1.69E-02
hsa-miR-142-5p 2.57 6.72E-03
Table 2: LogFC and PValue values of the 9 miRNA from GSE59247 dataset
miRNA logFC PValue
hsa-miR-21-5p 2.20 1.36E-03
hsa-miR-183-5p 2.98 5.65E-17
hsa-miR-96-5p 3.30 2.16E-99
hsa-miR-592 1.67 1.82E-23
hsa-miR-3923 1.66 9.73E-23
hsa-miR-885-5p 1.04 7.31E-08
hsa-miR-184 1.10 1.59E-10
hsa-miR-339-3p 1.26 8.15E-02
hsa-miR-142-5p 2.58 4.31E-77
Table 3: LogFC and PValue values of the 9 miRNA from TCGA-BRCA dataset

Volcano plots were generated from GSE59247 and TCGA-BRCA datasets respectively, as shown in figure 3 and figure 4:

Figure 3: Volcano plot for GSE59247
Figure 4: Volcano plot for TCGA-BRCA

Further selection of miRNAs

To further refine the selection of optimal miRNAs, we plotted ROC curves for these 9 miRNAs. Since the sample size of GSE59247 dataset was relative small, we used the TCGA-BRCA dataset to generate ROC curves, as shown in figure 5, which demonstrated that miR-21-5p, miR-183-5p, and miR-96-5p have good diagnostic value with AUC > 0.9:

Figure 5: ROC curves of the 9 miRNAs generated from TCGA-BRCA dataset

Feature Importance Analysis

From previous section, we have identified 3 upregulated miRNAs through differential expression analysis and ROC curve analysis. However, which miRNA is the beast biomarker for our Brefast detection remains a question. To further narrow down our selection from another perspective, we treat each miRNA as a feature, thus transforming this problem into a machine learning feature importance problem.

After comparing various mainstream feature importance analysis methods, such as linear models and tree models, we decided to choose Random Forest to conduct feature importance analysis for our data, since random forest achieved higher accuracy on the test set compared to other methods.

While tree models, based on the Gini index, can reflect the classification value of a feature (each miRNA) concerning sample (each patient) classification, their interpretability still has certain limitations, especially when it comes to not considering the interactions between different features. Considering the possibility of strong interactions among different miRNAs, our team has employed SHAP (SHapley Additive exPlanations) analysis, a methodology based on cooperative game theory, to conduct further analysis on the feature importance derived from the Random Forest model. This approach enhanced our understanding of the significance of each miRNA in the context of interactions among different miRNAs.

Feature Importance based on Random Forest

Random forest is an ensemble learning model that selects miRNAs by constructing multiple decision trees and combining their predictive results. Based on the feature importance output of the random forest model, miRNAs with key predictive abilities are selected as detection targets, thereby improving the accuracy and reliability of the screening process.

Each miRNA was treated as a feature, and the labels were divided into positive and negative groups based on whether the samples were cancer patients. The dataset was divided into training data and testing data in an 8:2 ratio. Grid search and five-fold cross-validation were used to find the optimal parameters (achieving an accuracy of 0.98924 in five-fold cross-validation). The binary classification task was performed using a random forest model, and the accuracy reached 1 on the testing dataset. Feature importance was calculated based on the Gini index, and the top ten ranked features were shown in the following figure.

Figure 6: The feature importance of the top ten miRNAs analyzed by random forest, where the upregulated and downregulated miRNAs have been labeled based on the results of differential expression analysis of TCGA data.

Notably, miR-21-5p ranked first in terms of feature importance among the upregulated miRNAs.

SHAP Values Analysis

Machine learning models are becoming increasingly complex, powerful, and able to make accurate predictions. However, as these models become more and more like "black boxes," it's even harder to understand how they make predictions now, which leads to a growing need to improve the interpretability and explainability of machine learning models. One of the most promising tools to solve this problem is SHAP (SHapley Additive exPlanations) value, which measures how much each feature (such as income, age, credit score, etc.) contributes to the model's prediction. Therefore, SHAP value can not only identify which features are most important for the model, but also help to understand how they affect the outcome.

Actually, SHAP value can be used to explain the output of any machine learning models. It uses a game theory approach that measures each player's contribution to the final outcome. In machine learning, each feature is assigned an importance value representing its contribution to the model's output, and SHAP values show how each feature affects the final prediction, the significance of each feature compared to others, and the model's reliance on the interaction between features.[1]

We compute the average SHAP values for each feature (miRNA) and select the top ten ranked miRNAs to create the bar chart in Figure 7.

Figure 7: SHAP value analysis of the feature importance derived from Random Forest, where the upregulated and downregulated miRNAs have been labeled based on the results of differential expression analysis of TCGA data.

Notably, miR-21-5p ranked first in the SHAP value analysis among of miRNA, indicating that it is the best choice of breast cancer-associated miRNA for BreFast detection.

Survival Analysis

Although we have identified candidate miRNAs through differential expression analysis, the significance of differential expression alone cannot be directly used as the basis for selecting cancer biomarkers. Generally, miRNAs that are significantly upregulated in differential expression analysis are expected to promote cancer progression. However, not all upregulated miRNAs in breast cancer exhibit significant effects on the disease progression of cancer patients. This indicates that these miRNAs may not have a direct regulatory role in cancer but are instead regulated by other factors in cancer. Therefore, these miRNAs might not be suitable biomarkers for cancer screening.

We would like to use a miRNA that have impact on the progression of breast cancer as the biomarker for Brefast screening. In this context, survival analysis can compare the effects of candidate miRNAs on cancer progression by grouping patients with either high or low expression of these miRNAs. It validates whether the candidate miRNAs genuinely affect cancer progression and provides parameter values for comparing the effects of different miRNAs.

The horizontal axis of the survival analysis curve represents survival time after diagnosis of patients, and the differences in the survival curve can be better demonstrated when the survival time is long enough. According to available data, the median survival time for TCGA-BRCA is 27.7 months, and the average survival time for Luminal cases in the Metabric dataset is 130.5 months, with a maximum survival time of 337 months. Based on this information, we have chosen 180 months as the horizontal axis for the survival analysis curve. Noted that the mainstream breast cancer follow-up time is 180 months, this is chosen as the maximum unit of the horizontal axis of our survival analysis curve. We have plotted the survival curves for the selected miR-21-5p using data from the Metabric and TCGA datasets as shown in figure 8:

Figure 8: The survival curves for miR-21-5p, where Figure (a) represents data from the Metabric database, and Figure (b) represents data from the TCGA dataset

Literature Research

Apart from bioinformatics analysis, we also read literatures to confirm that miR-21-5p is the best miRNA candidate to be used as the breast-cancer-specific biomarkers for screening by BreFast.

A comprehensive review on International Journal of Molecular Sciences summarized 75 independent clinical studies. 20 miRNAs were suggested to be breast-tumor-specific biomarkers, which were supported by at least two independent studies in the same biological specimen (serum, plasma, or whole blood). 8 out 20 of them demonstrated similar tendency of change in expression: miR-21 (12 up versus 1 down), miR-155 (14 up versus 1 down), miR-10b (5 up, only serum), miR-373 (3 up, only serum), miR-652 (3 down, only serum), miR-425 (2 down, only serum), miR-29a (2 up, only serum), and miR-148b (2 down, only serum). Others miRNA, such as miR145 displayed different tendency of change in expression.[2]

Figure 9: Pyramidal graph of the direction of miRNA expression (microRNA concentration in breast cancer cases versus controls) by type of specimens (only microRNAs that were analyzed in two or more independent studies). (A) whole blood, (B) plasma; (C) all specimens; (D) serum serum

As a result, miR-21, miR-155, miR-10b were supported by multiple studies as highly associated with breast cancer, while the predominant form of miR-21 was miR-21-5p.[3,4]

Conclusion

Through an intensive investigation that uses a variety of bioinformatics analysis, including differential expression analysis, ROC curves, random forest classification, SHAP values analysis, survival analysis and literature research, it has been demonstrated that miR-21-5p is the optimal breast cancer-associated biomarker for BreFast screening.

References

[1] https://www.datacamp.com/tutorial/introduction-to-shap-values-machine-learning-interpretability

[2] Padroni L, De Marco L, Dansero L, et al. An Epidemiological Systematic Review with Meta-Analysis on Biomarker Role of Circulating MicroRNAs in Breast Cancer Incidence. Int J Mol Sci. 2023;24(4):3910. Published 2023 Feb 15. doi:10.3390/ijms24043910

[3] Weng S, Lin D, Lai S, et al. Highly sensitive and reliable detection of microRNA for clinically disease surveillance using SERS biosensor integrated with catalytic hairpin assembly amplification technology. Biosens Bioelectron. 2022;208:114236. doi:10.1016/j.bios.2022.114236

[4] Fan T, Mao Y, Sun Q, et al. Branched rolling circle amplification method for measuring serum circulating microRNA levels for early breast cancer detection [published correction appears in Cancer Sci. 2020 Jan;111(1):297-298]. Cancer Sci. 2018;109(9):2897-2906. doi:10.1111/cas.13725

[5] Lanczky A, Gyorffy B: Web-Based Survival Analysis Tool Tailored for Medical Research (KMplot): Development and Implementation, J Med Internet Res, 2021 Jul 26;23(7):e27633. doi: 10.2196/27633.

[6] Lánczky A, Nagy Á, Bottai G, Munkácsy G, Szabó A, Santarpia L, Győrffy B. miRpower: a web-tool to validate survival-associated miRNAs utilizing expression data from 2178 breast cancer patients. Breast Cancer Res Treat. 2016 Dec;160(3):439-446. doi: 10.1007/s10549-016-4013-7.

[7] Lesurf R, Aure MR, Mørk HH, Vitelli V; Oslo Breast Cancer Research Consortium (OSBREAC); Lundgren S, Børresen-Dale AL, Kristensen V, Wärnberg F, Hallett M, Sørlie T. Molecular Features of Subtype-Specific Progression from Ductal Carcinoma In Situ to Invasive Breast Cancer. Cell Rep. 2016 Jul 26;16(4):1166-1179. doi: 10.1016/j.celrep.2016.06.051. Epub 2016 Jul 7. (data accessible at NCBI GEO database (Edgar et al., 2002), accession GSE59247).

[8] An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics doi: 10.1016/j.cell.2018.02.052

[9] Survival outcomes are associated with genomic instability in luminal breast cancers. doi: 10.1371/journal.pone.0245042

[10] Zhao L, He L, Jiang L, et al. Differential expression analysis of miRNAs in gastric adenocarcinoma tissues based on bioinformatics and its clinical significance. Shandong Medical Journal, 2018, 58(02), 21-24.

[11]https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE59247