Bioinformatics Analysis
Overview
Highlights:
-
Feature importance analysis, including Random Forest-based analysis and cooperative game
theory-based SHAP value that takes into account the interdependencies among features, has been
applied in iGEM's bioinformatic analysis.
- A comprehensive analysis including a variety of bioinformatic methods has been applied to achieve
accurate identification of target miRNA, which can be served as a pipeline for similar iGEM
projects.
Introduction:
Altered expression of miRNA is associated with breast cancer. However, identifying the optimal miRNA
among numerous candidates as breast cancer-specific biomarkers has proven to be akin to finding a
needle in a haystack. To address this challenge, we employed various approaches including
differential expression analysis, feature importance analysis based on random forests, further
analysis based on SHAP value, survival analysis, and literature search to select the optimal miRNA
candidates.
Methods and Results:
We commenced our research by undertaking a differential expression analysis, initially utilizing
data from the Cancer Genome Atlas (TCGA) database and the microarray dataset GSE59247 [11]. Through
the process of intersecting significantly upregulated miRNAs, we identified nine candidate miRNAs
that exhibited significant upregulation consistently across both datasets. We constructed ROC curves
for
these 9 candidate miRNAs based on TCGA-BRCA dataset, and refined our selection by focusing on miRNAs
with AUC values approaching unity (>0.9). Concurrently, we harnessed the power of a
random forest classifier to classify the TCGA-BRCA dataset, utilizing miRNAs in the TCGA-BRCA
dataset as distinctive features and calculating their respective feature importance scores. The
selection of candidate miRNAs was rigorously guided by the dual criteria of significant upregulation
and elevated feature importance. We conducted a comprehensive analysis employing SHAP (SHapley
Additive exPlanations) values to delve deeper into feature importance. Finally, we did survival
analysis with Kaplan-Meier curves. After a comprehensive investigation involving multiple analytical
methodologies and a thorough review of relevant literatures, we validated miR-21-5p as a robust
biomarker for breast cancer screening.
Figure 1: Procedure of bioinformatic analysis
Keywords: miRNA; Differential expression
analysis; Feature
importance;
Random forest; SHAP value;
Survival analysis
Differential Expression Analysis
During the initiation and progression of breast cancer, expression levels of certain miRNAs undergo
changes. Some miRNAs may be upregulated, while others may exhibit decreased expression. Since our study
utilizes lateral flow strips to detect miRNA with increased expression, we decide to select biomarkers
from upregulated miRNAs in breast cancer for BreFast detection.
Preliminary selection of miRNAs
We performed differential expression analysis on both the GSE59247 dataset and the TCGA-BRCA
dataset. By
setting the filtering criteria to |logFC| > 1 and PValue < 0.05, the number of miRNAs with
changed expression was shown in Table 1.
Dataset |
Upregulation |
Downregulation |
GSE59247 |
118 |
98 |
TCGA-BRCA |
226 |
107 |
Table 1 : Numbers of upregulated and downregulated miRNAs from differential
expression analysis
The intersection of upregulated miRNAs of both GSE59247 and TCGA-BRCA datasets was obtained, and
plotted with a Venn diagram in figure 2:
Figure 2: Venn diagram showing the numbers of upregulated miRNAs of GSE59247
and TCGA-BRCA
The intersection included 9 miRNAs, and the corresponding logFC and PValue values of these miRNA were
shown in the following table:
miRNA |
logFC |
PValue |
hsa-miR-21-5p |
1.46 |
4.64E-05 |
hsa-miR-183-5p |
3.96 |
5.29E-04 |
hsa-miR-96-5p |
4.14 |
1.53E-03 |
hsa-miR-592 |
1.81 |
3.55E-05 |
hsa-miR-3923 |
1.98 |
4.10E-05 |
hsa-miR-885-5p |
1.77 |
2.29E-04 |
hsa-miR-184 |
2.83 |
3.16E-02 |
hsa-miR-339-3p |
1.31 |
1.69E-02 |
hsa-miR-142-5p |
2.57 |
6.72E-03 |
Table 2: LogFC and PValue values of the 9 miRNA from GSE59247 dataset
miRNA |
logFC |
PValue |
hsa-miR-21-5p |
2.20 |
1.36E-03 |
hsa-miR-183-5p |
2.98 |
5.65E-17 |
hsa-miR-96-5p |
3.30 |
2.16E-99 |
hsa-miR-592 |
1.67 |
1.82E-23 |
hsa-miR-3923 |
1.66 |
9.73E-23 |
hsa-miR-885-5p |
1.04 |
7.31E-08 |
hsa-miR-184 |
1.10 |
1.59E-10 |
hsa-miR-339-3p |
1.26 |
8.15E-02 |
hsa-miR-142-5p |
2.58 |
4.31E-77 |
Table 3: LogFC and PValue values of the 9 miRNA from TCGA-BRCA dataset
Volcano plots were generated from GSE59247 and TCGA-BRCA datasets respectively, as shown in figure 3
and figure 4:
Figure 3: Volcano plot for GSE59247
Figure 4: Volcano plot for TCGA-BRCA
Further selection of miRNAs
To further refine the selection of optimal miRNAs, we plotted ROC curves for these 9 miRNAs. Since
the sample size of GSE59247 dataset was relative small, we used the TCGA-BRCA dataset to generate
ROC curves, as shown in figure 5, which demonstrated that miR-21-5p, miR-183-5p, and miR-96-5p have
good diagnostic value with AUC > 0.9:
Figure 5: ROC curves of the 9 miRNAs generated from TCGA-BRCA dataset
Feature Importance Analysis
From previous section, we have identified 3 upregulated miRNAs through differential expression
analysis
and ROC curve analysis. However, which miRNA is the beast biomarker for our Brefast detection
remains a
question. To further narrow down our selection from another perspective, we treat each miRNA as a
feature, thus transforming this problem into a machine learning feature importance problem.
After comparing various mainstream feature importance analysis methods, such as linear models and
tree models, we decided to choose Random Forest to conduct feature importance analysis for our data,
since random forest achieved higher accuracy on the test set compared to other methods.
While tree models, based on the Gini index, can reflect the classification value of a feature (each
miRNA) concerning sample (each patient) classification, their interpretability still has certain
limitations, especially when it comes to not considering the interactions between different
features. Considering the possibility of strong interactions among different miRNAs, our team has
employed SHAP (SHapley Additive exPlanations) analysis, a methodology based on cooperative game
theory, to conduct further analysis on the feature importance derived from the Random Forest model.
This approach enhanced our understanding of the significance of each miRNA in the context of
interactions among different miRNAs.
Feature Importance based on Random Forest
Random forest is an ensemble learning model that selects miRNAs by constructing multiple decision
trees and combining their predictive results. Based on the feature importance output of the random
forest model, miRNAs with key predictive abilities are selected as detection targets, thereby
improving the accuracy and reliability of the screening process.
Each miRNA was treated as a feature, and the labels were divided into positive and negative groups
based on whether the samples were cancer patients. The dataset was divided into training data and
testing data in an 8:2 ratio. Grid search and five-fold cross-validation were used to find the
optimal parameters (achieving an accuracy of 0.98924 in five-fold cross-validation). The binary
classification task was performed using a random forest model, and the accuracy reached 1 on the
testing dataset. Feature importance was calculated based on the Gini index, and the top ten ranked
features were shown in the following figure.
Figure 6: The feature importance of the top ten miRNAs analyzed by random
forest, where the upregulated and downregulated miRNAs have been labeled based on the results of
differential expression analysis of TCGA data.
Notably, miR-21-5p ranked first in terms of feature importance among the upregulated miRNAs.
SHAP Values Analysis
Machine learning models are becoming increasingly complex, powerful, and able to make accurate
predictions. However, as these models become more and more like "black boxes," it's even harder to
understand how they make predictions now, which leads to a growing need to improve the
interpretability and explainability of machine learning models. One of the most promising tools to
solve this problem is SHAP (SHapley Additive exPlanations) value, which measures how much each
feature (such as income, age, credit score, etc.) contributes to the model's prediction. Therefore,
SHAP value can not only identify which features are most important for the model, but also help to
understand how they affect the outcome.
Actually, SHAP value can be used to explain the output of any machine learning models. It uses a
game
theory approach that measures each player's contribution to the final outcome. In machine learning,
each
feature is assigned an importance value representing its contribution to the model's output, and
SHAP
values show how each feature affects the final prediction, the significance of each feature compared
to
others, and the model's reliance on the interaction between features.[1]
We compute the average SHAP values for each feature (miRNA) and select the top ten ranked miRNAs to
create the bar chart in Figure 7.
Figure 7: SHAP value analysis of the feature importance derived from Random
Forest, where the upregulated and downregulated miRNAs have been labeled based on the results of
differential expression analysis of TCGA data.
Notably, miR-21-5p ranked first in the SHAP value analysis among of miRNA, indicating that it is the
best choice of breast cancer-associated miRNA for BreFast detection.
Survival Analysis
Although we have identified candidate miRNAs through differential expression analysis, the
significance of differential expression alone cannot be directly used as the basis for selecting
cancer biomarkers. Generally, miRNAs that are significantly upregulated in differential expression
analysis are expected to promote cancer progression. However, not all upregulated miRNAs in breast
cancer exhibit significant effects on the disease progression of cancer patients. This indicates
that these miRNAs may not have a direct regulatory role in cancer but are instead regulated by other
factors in cancer. Therefore, these miRNAs might not be suitable biomarkers for cancer screening.
We would like to use a miRNA that have impact on the progression of breast cancer as the biomarker for
Brefast screening. In this context, survival analysis can compare the effects of candidate miRNAs on
cancer progression by grouping patients with either high or low expression of these miRNAs. It validates
whether the candidate miRNAs genuinely affect cancer progression and provides parameter values for
comparing the effects of different miRNAs.
The horizontal axis of the survival analysis curve represents survival time after diagnosis of
patients, and the differences in the survival curve can be better demonstrated when the survival
time is long enough. According to available data, the median survival time for TCGA-BRCA is 27.7
months, and the average survival time for Luminal cases in the Metabric dataset is 130.5 months,
with a maximum survival time of 337 months. Based on this information, we have chosen 180 months as
the horizontal axis for the survival analysis curve. Noted that the mainstream breast cancer
follow-up time is 180 months, this is chosen as the maximum unit of the horizontal axis of our
survival analysis curve. We have plotted the survival curves for the selected miR-21-5p using data
from the Metabric and TCGA datasets as shown in figure 8:
Figure 8: The survival curves for miR-21-5p, where Figure (a) represents
data from the Metabric database, and Figure (b) represents data from the TCGA dataset
Literature Research
Apart from bioinformatics analysis, we also read literatures to confirm that miR-21-5p is the best miRNA
candidate to be used as the breast-cancer-specific biomarkers for screening by BreFast.
A comprehensive review on International Journal of Molecular Sciences summarized 75 independent
clinical studies. 20 miRNAs were suggested to be breast-tumor-specific biomarkers, which were
supported by at least two independent studies in the same biological specimen (serum, plasma, or
whole blood). 8 out 20 of them demonstrated similar tendency of change in expression: miR-21 (12 up
versus 1 down), miR-155 (14 up versus 1 down), miR-10b (5 up, only serum), miR-373 (3 up, only
serum), miR-652 (3 down, only serum), miR-425 (2 down, only serum), miR-29a (2 up, only serum), and
miR-148b (2 down, only serum). Others miRNA, such as miR145 displayed different tendency of change
in expression.[2]
Figure 9: Pyramidal graph of the direction of miRNA expression (microRNA
concentration in breast cancer cases versus controls) by type of specimens (only microRNAs that were
analyzed in two or more independent studies). (A) whole blood, (B) plasma; (C) all specimens; (D) serum
serum
As a result, miR-21, miR-155, miR-10b were supported by multiple studies as highly associated with breast
cancer, while the predominant form of miR-21 was miR-21-5p.[3,4]
Conclusion
Through an intensive investigation that uses a variety of bioinformatics analysis, including
differential expression analysis, ROC curves, random forest classification, SHAP values analysis,
survival analysis and literature research, it has been demonstrated that miR-21-5p is the optimal breast
cancer-associated biomarker for BreFast screening.
References
[1] https://www.datacamp.com/tutorial/introduction-to-shap-values-machine-learning-interpretability
[2] Padroni L, De Marco L, Dansero L, et al. An Epidemiological Systematic Review with Meta-Analysis
on
Biomarker Role of Circulating MicroRNAs in Breast Cancer Incidence. Int J Mol Sci. 2023;24(4):3910.
Published 2023 Feb 15. doi:10.3390/ijms24043910
[3] Weng S, Lin D, Lai S, et al. Highly sensitive and reliable detection of microRNA for clinically
disease
surveillance using SERS biosensor integrated with catalytic hairpin assembly amplification
technology.
Biosens Bioelectron. 2022;208:114236. doi:10.1016/j.bios.2022.114236
[4] Fan T, Mao Y, Sun Q, et al. Branched rolling circle amplification method for measuring serum
circulating
microRNA levels for early breast cancer detection [published correction appears in Cancer Sci. 2020
Jan;111(1):297-298]. Cancer Sci. 2018;109(9):2897-2906. doi:10.1111/cas.13725
[5] Lanczky A, Gyorffy B: Web-Based Survival Analysis Tool Tailored for Medical Research (KMplot):
Development and Implementation, J Med Internet Res, 2021 Jul 26;23(7):e27633. doi: 10.2196/27633.
[6] Lánczky A, Nagy Á, Bottai G, Munkácsy G, Szabó A, Santarpia L, Győrffy B. miRpower: a web-tool
to
validate survival-associated miRNAs utilizing expression data from 2178 breast cancer patients.
Breast
Cancer Res Treat. 2016 Dec;160(3):439-446. doi: 10.1007/s10549-016-4013-7.
[7] Lesurf R, Aure MR, Mørk HH, Vitelli V; Oslo Breast Cancer Research Consortium (OSBREAC);
Lundgren S,
Børresen-Dale AL, Kristensen V, Wärnberg F, Hallett M, Sørlie T. Molecular Features of
Subtype-Specific
Progression from Ductal Carcinoma In Situ to Invasive Breast Cancer. Cell Rep. 2016 Jul
26;16(4):1166-1179. doi: 10.1016/j.celrep.2016.06.051. Epub 2016 Jul 7. (data accessible at NCBI GEO
database (Edgar et al., 2002), accession GSE59247).
[8] An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome
Analytics doi: 10.1016/j.cell.2018.02.052
[9] Survival outcomes are associated with genomic instability in luminal breast cancers. doi:
10.1371/journal.pone.0245042
[10] Zhao L, He L, Jiang L, et al. Differential expression analysis of miRNAs in gastric
adenocarcinoma
tissues based on bioinformatics and its clinical significance. Shandong Medical Journal, 2018,
58(02),
21-24.
[11]https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE59247