Overview
In this modeling, we attempted to develop an ideal CRC screening biodevice in terms of structure and expression system from the early stages of modeling using only one biomarker to using two biomarkers. This aims to increase the sensitivity and stability of our biodevice which is the main goal of a screening tool. Structurally, we optimized with the help of software such as NUPACK to design RNA Toehold Switch (RTS) and Loop-Initiated RNA Activator (LIRA) candidates automatically, and analysis of the stability of the ideal secondary structure, we also optimized the design of LIRA by making simulation models of ON OFF Ratio and ON OFF difference in silico by testing several types of metrics to get the metrics that best fit our case, not only stop there, through the LIRA design optimization process we try to explore the data we obtained from the process to gain insight into which parameters contribute most to the optimal results of our LIRA design through multivariate regression models. Meanwhile, in the expression system, we optimized by means of kinetic modeling simulations using Ordinary Differential Equations of both RTS and LIRA expression systems, with parameter improvements and miRNA initial condition ranges that are more in line with reality as well as increasing the number of ODEs in LIRA kinetic modeling simulations to obtain more expected visible reporter duration stability and proof that our OR Gate expression system can work well.
Some of dry lab work using Python and we happily share our documentations with you
here!Early Design RNA Toehold Switch (RTS)
Purpose
Initially, we designed our biodevice to detect one of CRC biomarkers (miR21) (Ajuyah et al. 2019; Yu et al. 2015) using RNA Toehold Switch (RTS). We were doing the design process to get:
An RTS ideal and stable structure. Our meaning of the RTS ideal structure is an RBS and start codon of the RTS structure that will be opened (not complement with other nucleotide) whenever It encounters miR21—so it biodevice is sensitive to biomarker—and will remain closed when there is no miR21—so it biodevice doesn’t generate a false positive result.
Get biodevice which only requires the minimum possible biomarker concentration, but still reaches the minimum reporter concentration needed to be clearly visible with the expected visible duration.
Method and Modelling
Based on the decision in the design phases of cycle 2 on the
Engineering Page. We decided to test two types of RTS, there are RTS first generation and RTS second generation. In order to achieve those two above purposes, we used two methods for each RTS type, including
NUPACK Analysis Job for ideal structure stability analysis of RTS to achieve the first purpose. We analyze RTS first generation and RTS second generation which consists of subparts that are listed in
this page. Several input parameters for secondary structure stability analysis of RTS first generation and RTS second generation, including type of nucleic acid, temperature, number of nucleic acid strands, melting temperature, maximum number of complexes formed, nucleic acid sequence, and other options if necessary, are included in Web-based NUPACK Job Analysis for analysis. The results of the analysis are then interpreted to determine the stability of the secondary structure of the RTS first generation and RTS second generation designs.
Kinetic modeling simulation for expression system analysis of RTS to achieve second purpose The preparation of Ordinary Differential Equation (ODE) produces RTS modeling parameters. The Kinetic Modeling RTS simulation begins by finding the parameter values involved in the RTS system. The parameter values are then input into the kinetic modeling equation which is then simulated using the Python programming language. In this RTS experiment, four kinetic modeling simulations were carried out, namely for RTS first generation with an initial extracellular miR21 concentration of 2 M and 0.2 M and RTS second generation with an initial extracellular miR21 concentration of 2 M and 0.2 M.
Results
Structural Stability Analysis
The stability of the RTS secondary structure was analyzed by observing the MFE value, equilibrium concentration, secondary structure of each species, and the complexes formed. The MFE value and secondary structure are observed to determine whether the desired secondary structure has the most negative MFE value. This is because the more negative the MFE value of a secondary structure, the more stable the structure is and the chemical reaction equilibrium will tend to lead to the formation of that structure. The results of the analysis are presented as follows
Table 1. Structural stability analysis results of the RTS first generation and RTS second generation
Based on Table 1, on section RTS first generation + miR21 or the first-row, It can be seen that equilibrium tends to lead to the formation of the RTS-miR21 complex. This is possibly because the formation of complex structures is more stable or can occur more spontaneously than single structures in each species (Zadeh et al., 2011). The stability of this structure increases as the Minimum Free Energy (MFE) value becomes increasingly negative (Dirks et al., 2007). Then, in the secondary structure and MFE analysis image, it can be seen that the double strand at the bottom of the structure shows a dark red color, indicating that the probability of complementation between the two strands tends to be high because it has a probability value of 1 or close to 1 (Dirks and Pierce, 2003 and Dirks and Pierce, 2004). Meanwhile, the double strand at the top of the structure shows a light green color, which indicates that the probability of complementation between the two strands is lower than the double strand at the bottom of the structure, thus allowing the double strand to separate during translation because this area is the RBS and start codon area.
Meanwhile, in the second line (RTS second generation + antimiR21 + miR21) it can be seen that equilibrium tends to lead to the formation of the RTS-antimiR21-miR21 complex. Then, in the secondary structure and MFE analysis image, It can be seen that the MFE value of the Open Toehold Switch (OTS) on the RTS second generation is more negative than on the RTS first generation. This shows that the structure is more stable (Dirks and Pierce, 2003 and Dirks and Pierce, 2004). However, judging from the constituent bases in the figure below the equilibrium concentration section, it can be seen that the stable structure of RTS second generation is not an open RTS structure (the hairpin containing the RBS is still closed). This will likely prevent ribosomes from carrying out translation which will result in a decrease in reporter expression levels. This is different from what happens in RTS first generation, where the hairpin structure of RTS opens when complemented with miR21 so that RTS changes from a Close Toehold Switch (CTS) state to an OTS state.
Kinetic Modeling Simulation for Expression System Analysis of RTS
The preparation of the ODE begins by creating a chemical equation for each reaction from the working mechanism scheme for the chromoprotein-sensitive miR21 biodevice design in the design phase of
Engineering cycle 2. Next, the chemical equation is translated into ODE.
Equation for RTS
Kinetic Modeling Simulation Result
RTS first generation | RTS second generation |
| |
| |
| |
Table 2. Kinetic Modelling Result
Based on the figure in row 3 and the data result (Table 2), it can be seen that the maximum concentration of aeBlue chromoprotein in RTS first generation was reached at a value of 0.000516 M and at 1252.26 seconds or around 21 minutes after extracellular miR21 with an initial concentration of 2 M first diffused into in EcN with biodevice. At that time the intensity of the blue color is maximum so it will appear the clearest visually. Based on the concentration fluctuation data result, the duration of aeBlue chromoprotein can be clearly seen for approximately 2 hours, starting from 46 seconds after miR21 first enters the EcN with the biodevice.
Meanwhile, the maximum concentration of aeBlue chromoprotein in RTS second generation was reached at a value of 0.041281 M and at 1098.76 seconds or around 18 minutes after extracellular miR21 with an initial concentration of 2 M first diffused into the EcN with the biodevice. Based on the concentration fluctuation data in the result, the duration of aeBlue chromoprotein can be clearly seen for approximately 7.6 hours, starting from 2.8 seconds after miR21 first enters the EcN with the biodevice. This is because based on the literature, aeBlue chromoprotein can be seen clearly at a minimum concentration of 3.86 x 10-5 M, which this concentration can be achieved after 46 seconds for miR21 to enter the EcN with a biodevice on the RTS first generation and after 2.8 seconds for miR21 to enter. into the EcN with a biodevice on the RTS second generation. The relationship between the clarity of the color of aeBlue chromoprotein and its concentration is shown in the following table document.
Calculation of protein MR based on its constituent amino acids (https://rest.uniprot.org/uniprotkb/Q2VFP3.fasta) via the website https://www.cusabio.com/m-299.html to obtain MR aeBlue = 25912.6279 g/mol.
Meanwhile, kinetic modeling simulations on RTS First generation and RTS second generation with an initial concentration of extracellular miR21 of 0.2 M gave results, namely RTS first generation reached a maximum value of aeBlue chromoprotein concentration of 0.000176 M at 332 seconds with a duration of observation of aeBlue chromoprotein of approximately for 0.87 hours starting from 44.5 seconds of entry of miR21. Meanwhile, the RTS second generation reached a maximum value of aeBlue chromoprotein concentration of 0.019965 M at 405.38 seconds with a duration of observation of aeBlue chromoprotein of approximately 6.6 hours starting from 3 seconds of entry of miR21.
Early Design LIRA
Purpose
The General purpose of this process is to improve the previous step. We want to increase the sensitivity and get the more ideal structure of our biodevice by redesigning our RTS to be a Loop-Initiated RNA Activator (LIRA) that can detect at least one of two biomarkers which are upregulated in different CRC phases, those are miR21 (advanced adenoma) and miR92a (early adenoma) (Okugawa et al. 2014). This strategy has been inspired by LIRA OR Gate from research by Ma et al. 2022.
In this early step, our specific purposes are to determine where is important parts of LIRA located, including the Ribosome Binding Site (RBS), Start Codon, and Detector Region; and get LIRA's ideal target structure when there is no miRNA biomarker, there is one of two miRNA biomarkers, also there are two miRNA biomarkers. So we can automate design for generate LIRA in the next step.
Method and Modelling
The steps that we do include
Simulate LIRA OR Gate sequence from Journal Reference using Analysis Job in NUPACK website
Determining parts of LIRA including RBS, Start Codon, and Detector Region.
Substitute detector regions so the LIRA substituted can detect miRNA CRC biomarkers
Resimulate LIRA substituted using Analysis Job on NUPACK website
Modify LIRA nucleotide(s) until we get the expected ideal structure of our LIRA
We define the expected ideal structure of LIRA are:
RBS and start codon of LIRA remain enclosed when there is no miR21, miR92a, also miR21 and miR92a so it can minimize the possibility of false positive result.
RBS and start codon of LIRA will be open when there is miR21, miR92a, or both so it can minimize the possibility of false negative results.
Results
Below images are the results of simulating the LIRA sequence and its inputs (RNA sequence to be detected) from Journal Reference on NUPACK Analysis Job:
Input Sequence(s) | Nucleotide Identity* | Nucleotide Probability |
LIRA OR Gate: GGGATTCATTTCACATCTCCTAATCCAGTCGTGGATGGGCTCTGTTTCCGTATTCTGTGAAGCCCTAGGGTCCGATACAGAAACAGAGCCCATCCACGACTGGAATGGCTCTGTTTCATCTTAAAGTCCTTGTAACAGTCGTCAAGACGAAACTAAGCCATAGAGGAGATGACAAATGAATAACCTGGCGGCAGCGCAAAAG | | |
LIRA OR Gate + Input A. Input A: GGGCCAGTGACTTGTCACTGGGAACGGACCCTAGGGCTTCACAGAATACGGAAACGAC | | |
LIRA OR Gate + Input B. Input B: GGGCTCACCTGCCAAGGTGAGAGCGACGACTGTTACAAGGACTTTAAGATGAAACGAC | | |
LIRA OR Gate + Input A + Input B | | |
*Color code: Detector and Input A, Detector and Input B, RBS, Start Codon
From those results, we substitute detector regions and modify nucleotides so the LIRA can detect miRNA CRC biomarkers and have expected ideal structures. Here are the final modification results:
Input Sequence(s) | Nucleotide Identity* | Nucleotide Probability |
LIRA OR Gate modification: TCATTACACATCTCCTGGGCACAAGCATCAACATCAGTCTGATAAGCTATTGAGTATTGTGCATACAATACACAGGCCGGGACAAGTGCAATACTGTCCGTTGTATAGAGGAGATGACAAATGAAATAACCTGGCGGCAGCGCAAAAG | | |
LIRA OR Gate modification + miR21. miR21: UAGCUUAUCAGACUGAUGUUGA | | |
LIRA OR Gate modification + miR92a. miR92a: UAUUGCACUUGUCCCGGCCUGU | | |
LIRA OR Gate modification + miR21 + miR92a | | |
*Color code: Detector and miR21, Detector and miR92a, RBS, Start Codon
If we look at the results in the two tables above, it appears that LIRA, both LIRA from journal references and modified LIRA, still have complementary nucleotides in the RBS range to the start codon. However, this complement has a probability value that tends to be less than 0.5, where this probability value is even smaller in the modified LIRA (which means it is more in line with expectations). Based on in vitro tests in reference journals, with the probability value of RBS to the start codon in the reference LIRA, this LIRA can already work to detect input A and/or input B, so we predict that our modified LIRA with an even smaller probability value will also be able to work for detecting miR21 and/or miR92a.
Toehold LIRA Design Optimization
Purpose
The purposes of the optimization process are modeling the relationship between the variables that affect the efficiency of biomarker performance as a basis for designing and optimizing a more efficient and effective LiRA toehold switch design against miRNA-92a and miRNA-21
Independent Variable
For each complex variations: a+c+b, a+b+c, a+b, a+c, a+a with a : sequence toehold LiRA; b : miRNA-21; c: miRNA-92a we have several independent variable such as MFE (Minimum Free Energy), Partition Function, Free Energy, Ensemble Size, MFE proxy structure, MFE stack energy, concentration (T1) in test tube, and concentration rank in test tube (T1 rank). For each sequences, we also splitting the sequences into 3 sub-sequences (ABS to miR21 bottom seal, toehold miR21to miR92a top seal, and miR92a bottom seal-2 to end of sequences than get MFE and GC content for each sub-sequences. For each sequences, we also get some parameters like GC content, structure probability, and defect ensemble.
Variation of Dependent Variable
Explore the influence of other parameters on several performance metrics that show the effectiveness of Toehold Switch miRNA performance.
ON OFF Ratio
on/off ratio average=number of complexes∑i=1Non/off ratioi⋅⋅⋅⋅(1)
ON level
on level average=number of complexes∑i=1Non leveli⋅⋅⋅⋅(2)
OFF level
off level average=number of complexes∑i=1Noff leveli⋅⋅⋅⋅(3)
ON OFF ratio a-c-b
on/off ratio a-c-b=off level a−c−bon level a−c−b⋅⋅⋅⋅(4)
ON OFF Ratio weighted
on/off ratio weighted=number of complexes∑i=1nranki+11⋅on/off ratioi⋅⋅⋅⋅(5)
ON level weighted
on level average weighted=number of complexes∑i=1nranki+11⋅on leveli⋅⋅⋅⋅(6)
OFF level weighted
off level average weighted=number of complexes∑i=1nranki+11⋅off leveli⋅⋅⋅⋅(7)
ON minus OFF Average
on minus off=n∑i=1non leveli−n∑i=1noff leveli⋅⋅⋅⋅(8)
ON minus OFF weighted
on minus off weighted=n∑i=1nranki+11⋅on leveli−n∑i=1nranki+11⋅off leveli⋅⋅⋅⋅(9)
Figure 1. Toehold Optimization Design Pipeline
Optimization design steps consists of the following :
Gathering Sequences of LIRA
Objective: to obtain a LiRA sequence design that meets constraints and desires
Process:
Design process done on the web NUPACK (Zadeh et al., 2011) with input consists of sequence, domains, strands, complexes, target structures, constraints, and additional settings. Domain settings consist of LIRA with length 179 sequences, miR21 with length 22 sequences, and miR92a with length 22 sequences. Strands settings consist of LIRA, miRNA-21, and miRNA-92a. Complexes and target structures (Figure 2) settings consist of LIRA, LIRA-miR92a, and LIRA-miR21. Constraints settings consist of GC content limitation (0.2 to 0.6) and banned sequences followed the illegal sites rules on iGEM web. Detail setup on LIRA Design Job using NUPACK can be accessed below
Figure 2. Target Structure for LiRA, LiRA-miR92a, and LiRA-miR21.
Output: LiRA sequence with each domains sequence predictions
Gathering Parameters of Sequences LIRA
Objective: obtain properties and feature engineering parameters from LIRA sequences
Input: 10 LIRA sequences
Output: parameter properties and on off performance metrics
Process:
Analysis Job by NUPACK (Dirks et al., 2007 and Fornace et al., 2020)
MFE of sliced domain parts: ABS-mir21 bottom seal1; toehold mir21-mir92 top seal; mir92 bottom seal 2 - end
Minimum Free Energy of sliced domain parts: ABS-mir21 bottom seal1; toehold mir21-mir92 top seal; mir92 bottom seal 2 - end using ViennaRNA (R. Lorenz et al., 2011 and I. L. Hofacker, 1994).
Choosing the Right Metrics
ON OFF Idea Concept
Figure 3. Scheme of ON OFF Approached Metrics
Performance of screening tools can be measured by its ON OFF performances. ON level is protein reporter (AeBlue-chromoprotein) abundance that expresses switch (LiRA) when there are switch and trigger (miR21 and miR92a). On other way, OFF level is protein reporter (AeBlue-chromoprotein) abundance that expresses switch (LiRA) when there are switch and non-trigger (other than miR21 and miR92a), These definition are wetlab definition, but we want to give data driven design to wetlab, so we did analogical approach to ON-OFF metrics by evaluate its predicted secondary structure probability (Dirks and Pierce, 2003 and Dirks and Pierce, 2004) from NUPACK calculation.
Technically, uncomplement structure from RBS (Ribosom Binding Site) to start codon when LiRA alone will be expressed when there are biomarkers targeted around. There will be 4 conditions to fulfill there are: 1) when there aren’t miR21 nor miR92a around, 2) there only miR21, 3) there only miR92a, and 4) there are miR21 and miR92a around. In 1st conditions, we targeted to get a high OFF level or less to no AeBlue expressed by LiRA. While for other conditions, we targeted to get a high ON level or significant AeBlue expressed by LiRA. More uncomplement structure while bind to targeted biomarkers logically show easier and more switch LIRA expressed to open up (ON) then more AeBlue expressed. On other way, more complement/less uncomplement structure while not bind to targeted biomarkers/alone technically show much stronger expression of OFF logic switch. Mathematically expression of that logic developed as sum of probabilities of complement and uncomplement structure from sub-sequence RBS until start codon.
off-level=LRBS−to−startcodon∑i=1nProbic⋅⋅⋅(10) on-level=LRBS−to−startcodon∑i=1nProbio⋅⋅⋅(11) Infinity Issue: ON Minus OFF
Figure 4. 6th Sequence Probabilities and Identities using NUPACK Utilities Job
From the 10 sequences collected, there is a sequence (the 6th sequence) that has non-complementary secondary structures from the RBS to the start codon. In this way, the ON Level in the sequence structure (in the a-c-b complex or LIRA-miR92a-miR21) is very high and the absence of a complementary structure causes the OFF level to be 0. This brings new problems to the approached metric on off ratio (Equation 1) which begins with the inifinity value in Equation (4) although by our hypothesis, this is what we are aiming for, when the highest possible ON Level and the lowest possible OFF Level happen in the same time. In data processing, this has an effect on processing using libraries such as Scikit-Learn and AutoGluon that do not accept infinite data input so that sequence 6 is excluded from processing on the metrics ON OFF ratio average, ON OFF ratio a-c-b, and ON OFF ratio weighted. Apart from that, this also dissolves other metrics in the average ON OFF ratio so that the ON OFF ratio cannot be seen in other complexes. Therefore, we added the ON minus OFF average metric (Equation 8) obtained from the ON level minus OFF level of each complex then sum and divide it by the number of complexes.
Bias Issue in Complexes Probabilities
Apart from providing output in the form of complex and sequence properties, NUPACK analysis also produces a prediction analysis of the complex concentration in the test tube if it is in the same test tube with miR92a and miR21. These concentration predictions are sorted based on the predicted complex with the highest concentration. In average calculation, the probability between one complex to another is equal. In fact, through the NUPACK Analysis simulation, the probability of each different complex is shown from the predicted concentration of each different complex. Therefore, the rank of probabilities is used as the weight for each complex's metric in equations (5), (6), (7), and (9). Since the higher the ranking order, the less possible it is, the relationship between rank of probabilities and ON OFF level metrics for each complex is inversely proportional.
Considering Lonely LIRA in Test Tube
weighted on/off ratio average+(off levela−on levela)=numberofcomplexes∑i=1nranki+11⋅on/off ratioi+(off levela−on levela)⋅⋅⋅(12) When we choose the best sequence, we want to consider the least possible structure LIRA to be in ON mode when it rather be alone in the test tube. We only add this as best sequences consideration because this consideration comes after seeing 10 sequences we designed have ‘a’ complex at a fairly high ranking (rank 1-7) so to consider these 10 sequences tendency, we mitigate the condition by adding OFF level of ‘a’ complex minus ON level of ‘a’ complex for each sequence. This condition is important because when LIRA alone has an ON structure that is very possible to trigger the AeBlue protein, the switch RNA function fails to develop.
Exploratory Data
Statistical inference analysis of 10 sequences designed using Design Job NUPACK show standard deviation, mean/average, and summation of probabilities in each sequence. Through the mean and cumulative values, we can know probabilities of its secondary structure for each sequence in each complex. Other ways, standard deviation show us variability of secondary structure probabilities in each index of sequence for each complex. By that, we can conclude which sequence is the best sequence so we can recommend it to wetlab. On complexes that we targeted, we aim for higher mean and cumulative values while having very low standard deviation that show high secondary structure probabilities in each index of the sequences.
Standard Deviation
Figure 5. ON and OFF Standard Deviation Distributions for Each Complex
Standard deviation of each sequence in each complex shows the variety of probabilities in each complement and non-complement structure. Standard deviation analysis want to know how consistent/stabil the secondary structure of sequences. It hopes to be as low as possible, especially sequences that show expected characteristics in mean and cumulative values. Some sequences have high standard deviation relative to other sequences such as 5, 8 in on level and 7 in off level. The 5th sequence only has high standard deviation in a and a+a complex, so we can ignore the high variability of it.
Mean
Figure 6. ON and OFF Mean Distribution for Each Complex
In the mean of ON Level, we aim for higher mean on a+c+b, a+b+c, a+b, and a+c while lower on OFF Level that show LIRA binds to miRNA. While aim for lower mean on a and a+a complexes in ON Level but higher on OFF Level since both 2 are non-targeted complexes on test tube with LIRA+miR21+miR92a. Most sequences show similar ON Level mean for each complexes, only 2nd, 3th, 4th, and 5th sequence that show expected characteristics such as lower ON Level mean on a+a and a while higher on other complexes. For OFF Level, most sequences characteristics have expected characteristics such as higher on OFF Level mean on a+a and a, while lower OFF Level mean on other complexes. Only 1st sequence that have opposite characteristics such as lower OFF Level mean on a+a and a, while higher on others.
Cumulative
Figure 7. ON and OFF Sum Distribution for Each Complex
In the cumulative of ON Level, we aim for higher cumulative on a+c+b, a+b+c, a+b, and a+c while lower on OFF Level that show LIRA binds to miRNA. While aim for lower cumulative on a and a+a complexes in ON Level but higher on OFF Level since both 2 are non-targeted complexes on test tube with LIRA+miR21+miR92a. Most sequences show expected ON Level cumulative for each complex, only 4th, 6th, and 8th sequence that show opposite of expected characteristics such as ON Level cumulative on a+a are higher than ON Level cumulatives of other complexes. For OFF Level, all sequences characteristics have expected characteristics such as higher OFF Level cumulative on a+a and a than OFF Level cumulative of other complexes. Something unique happens in the 10th sequence that shows very low of ON cumulative and very high of OFF cumulative on a and a+a complexes relative to other cumulative values in complexes, so the 10th sequence can be considered despite the opposite expected on off mean calculated.
The ‘Lonely LIRA’ Metrics
Figure 8. ‘The Lonely LIRA’ Metrics Calculated for Every Sequence (Sequence 0 refer to Early Design LIRA)
From the exploration data above, sequences that can still be considered to use are 2th, 3th, 5th and 10th sequences. Using the equation of ‘The Lonely LIRA’ (12), those 4 sequences have the highest metrics from another. The 6th sequence also gets the same value as the 5th sequence but the 6th sequence might be not performing well due to NUPACK analysis results that show that it's both good in ON and OFF at any complexes by looking into its mean and cumulative values (include a and a+a).
#1 Most Likely Complex to be Formed
Figure 9. Concentration of Each Complex Formed in Test Tube
Surprisingly, 10 sequences from Design Job NUPACK have similar characteristics to each other. One of them is concentration rank probabilities (Figure 9) in the test tube. The lower the rank means it has a higher possibility to happen in the test tube when there is miR21, miR92a, and LIRA in the same place. Complex of a+c+b (LIRA-miR92a-miR21) being most likely to happen in the test tube for 9 out of 10 sequences.
Multivariate Regression Model
Multivariate regression is used when the regression case has several variables or parameters. From the gathering parameters process, we got 49 parameters as independent variables to processed and 9 metrics as dependent variables. To build multivariate regression pipeline model, data should be pre-processed as numerical format, no null values, standard scaling, and split into test and train in ratio 1:4 before put them into the model. After build the model, we evaluate the model performances. There are some performance metrics of regression such as RMSE (Root Mean Square) and R-squared. This section will explain about the evaluation process of regression model and evaluate parameters effect on approached metrics of ON-OFF performance.
Table 3. Summary of Regression Performance Metrics for Each Dependent Variables
Validation of Regression: Are We on The Right Track?
Clarifying model validity
By the assumption that most parameters correlated into secondary structures, so happen to be correlated to on-off performances, especially approached metrics, regression models start with linear regression for each regression target or dependent variables with all parameters. Table x shows negative R-squared and some high RMSE in several dependent variables for 49 parameters linear regression. Negative R-squared means that the model explains less of the variability of the dependent variable than a horizontal line of mean value for all observations. Negative R-squared might happen due to complexity of the data/not enough data to show the patterns. Low variability of sequences yet a lot of parameters used might be too complex for linear regression.
Throwing unnecessary parameters
To reduce complexity of the data, one of the approaches is feature selection by filtering out parameters that are less correlated into approached metrics. By using Spearman Correlation and threshold 0.7 for all correlation parameters into all dependent variables, parameters are reduced into 19 parameters. This approach affects higher R-squared and lower RMSE on dependent variables/approached metrics that use weighted equations. Despite most of the R-squared still being negative, weighted approached metrics that interpret probabilities of each complex more fit into high value parameters. Another hypothesis for the regression case negative R-squared happens due to limited data used or it's actually not linearly correlated into each dependent variable. Despite bad model interpretability performance, feature selection resulted in positive R-squared on ON Level weighted.
Get better model for limited data
To know if there would be any model that can predict regression of approached ON-OFF metrics and get better R-squared results, we try to use AutoML by AWS: AutoGluon (Erickson et al., 2020). AutoGluon is known as one of the best ways to get easy comparisons of several models and known by its algorithm that also tries to use the stack and ensemble of several machine learning models. So, we run auto gluon regression on kaggle. From each metric, we got at least one model that got positive R-squared in each approached metric. These efforts show that with very small data, there also needs some effort on model building to get expected model metrics. From the result of Table X, higher R-squared are on ON Level weighted and ON Level approached metrics that show 51% of variability in the ON Level is explained by the independent variables in the model.
Clarifying our general approached metrics
Having several approached metrics as dependent variables, we could evaluate each metrics by the performance metrics shown in Table X. From there we can assume, are the approached metrics interpret the LIRA switch performances successfully? By not being really complex so it still can be modeled by the regression algorithm.
Figure 10. Distribution of Approached Metrics (a) ON OFF ratio based; (b) ON minus OFF based
ON OFF Ratio based compared to ON minus OFF based
In section Infinity Issue: ON Minus OFF, there is a new metric called ON minus OFF Average to handle interpretability of approached metrics for 0 OFF Level. Later on, this metric also developed into a weighted version. From Table X, ON minus OFF always has a tendency to have a lower RMSE score than ON OFF ratio. But the R-squared of ON minus OFF is only higher than the ON OFF ratio when it's calculated by the weighted equation or using AutoGluon. Significantly lower RMSE of ON minus OFF based approached metrics show less outlier in ON minus OFF based approached metrics. It can be seen also in Figure 10 (a) and 10 (b) that show different ranges of metrics value. So it can be concluded that ON minus OFF based approached metrics are easier metrics to be modeled due to less outliers shown by lower RMSE and can be more computational efficient since they perform better when using filtered parameters.
Unweighted compared to weighted approached metrics
In section Bias Issue in Complexes Probabilities, there is a new metrics version of every metric before. The unweighted metrics such as Equation (1), (2), (3), (4), and (8) have the weighted version in Equation (5), (6), (7), and (9). Table 3 shows that in linear regression, all weighted versions of approached metrics got lower RMSE than unweighted versions. It can be interpreted that weighted versions can catch lower outlier and error/deviation. The distribution in Fig. X (a) and X (b) also show that the range of weighted versions is lower than unweighted versions across sequences. Weighted version approaches also show positive impact into linear regression models using 19 parameters so it's more computationally efficient. From the AutoGluon result, the weighted version always had an R-squared score lower than the unweighted version while using similar model schemes. From all the above, it can be concluded that using weighted versions might be significantly better on less parameters in less complex models but not significantly better in interpreting variability of the weighted versions itself when using more complex models.
How to Get High Sensitivity?
Table 4. Feature Importance > 0.05 from AutoGluon
ON Level
Figure 11. Spearman Correlation on ON Level weighted & unweighted approached metrics
From the Spearman correlation, both in ON Level approached metrics, the most correlated parameters are Minimum Free Energy on sub-sequence Toehold-miR21 until miR92 top seal. Other variables mostly come from free energy family parameters of complexes a+a and a+c. Some complexes have high negative correlation of its partition function to ON Level or weighted ON Level. Besides, on best model by AutoGluon (using NeuralNetTorch) the most important parameter is gc_miR92bottomseal2-end or GC content in sub-sequence of 2nd bottom seal of miR92 until end for ON Level, while for weighted ON Level it’s has several importance parameters there are partition function of a+b complex, MFE of a+b complex, and concentration on test tube at a+c complex. P-value maybe can’t be interpreted since we only have 10 sequences.
OFF Level
Figure 12. Spearman Correlation on OFF Level weighted and unweighted metrics
From the Spearman correlation, OFF Level approached metrics, the most correlated parameters are partition function of complex a+c. Other variables mostly come from free energy and partition function family parameters of complexes a+a and a+c. It also has negative correlation with MFE of toehold miR21 until miR92a top seal that is consistent with ON Level with positive correlation. Besides, the best model by AutoGluon (using an ensemble model consisting of NeuralNetFastAI, XGBoost, and NeuralNetTorch) the most important parameter is MFE of a+c+b complex. P-value maybe can’t be interpreted since we only have 10 sequences. Despite having highest standard deviation between other parameters, MFE of a+c+b has highest feature importance for targeted label (Weighted OFF Level Average). This might happen possibly by variation of data at a good rate.
ON minus OFF
Figure 13. Spearman Correlation on ON minus OFF Level weighted & unweighted approached metrics
From the Spearman correlation, both in ON minus OFF Level approached metrics, the most correlated parameters are Minimum Free Energy on sub-sequence of Toehold-miR21 until miR92 top seal. Other variables mostly come from free energy family parameters of complex a+c. High negative correlation also gets in from GC content on sub-sequence ABS (Annotated regulatory Binding Site) until 1st bottom seal of miR21. Besides, the best model by AutoGluon (using NeuralNetTorch) the most important parameter is gc_miR92bottomseal2-end or GC content in sub-sequence of 2nd bottom seal of miR92 until end for ON minus OFF Average. P-value maybe can’t be interpreted since we only have 10 sequences. Despite having highest standard deviation between other parameters, MFE of a+c+b has highest feature importance for targeted label (Weighted OFF Level Average). This might happen possibly by variation of data at a good rate.
In context design screening methods for CRC, the performance of design methods prioritize more to get high sensitivity. Getting high sensitivity in context of succesful LIRA switch can be translated to high ON level. By correlation and feature importance analysis before, to aim for better ON Level, the design process itself can be explored in the sub-sequence of Toehold-miR21 until miR92 top seal and 2nd bottom seal of miR92 until end. There are also several parameters based properties of sequences such as partition function, free energy, and concentration prediction on the test tube. While the complexes that should be more explored are a+a, a+b, and a+c complexes.
Therefore, from Toehold LIRA Design Optimization pipeline, we can reproduce this pipeline as design consideration for another RNA switch. From this pipeline, we consider which LIRA sequence we choose to apply in wetlab by doing exploratory data analysis and learn how sequence and its parameters affect to ON Level, OFF Level, and ON minus OFF Level.
Kinetic Modeling for LIRA OR Gate Expression System
Purpose
Our aim in carrying out this simulation is to obtain insight into the sensitivity of LIRA in terms of increasing the concentration of aeBlue (reporter) that can be produced and increasing the duration of visible aeBlue as a consequence of varying the miRNA within a certain miRNA concentration range. Because we have not gotten the level of miRNA concentration in CRC patients at various stages. So we decided to simulate the LIRA expression system with initial miRNA concentrations in the range of important constituents on the human extracellular fluid from the lowest normal value until the highest value of approximate short-term nonlethal limit, that is 1.2x10-3 M - 175x10-3 M (Hall and Hall 2020).
Method and Modeling
The steps we took included:
Understanding the mechanism of the LIRA expression system in EcN in the colonic lumen environment
Translate the expression system into chemical equations
Translating chemical equations into ODEs
Setting parameter values and initial conditions used in the ODEs
Simulate using Python programming language and interpret ODE simulation result
Results
Mechanism of The LIRA Expression System in EcN in The Colonic Lumen Environment
This expression system begins with the diffusion of miRNA from the extracellular fluid of CRC cells. This diffusion is assumed to be possible because:
EcN bacteria can specifically colonize colorectal tumor tissue with the highest specific colonization ability in tumors compared to other bacteria (Alizadeh et al., 2020).
The EcN bacterial parts that help specific colonization are the K5 capsule and F1C fimbria (Chiang and Huang, 2021) which will interact with TLR-4 and TLR-5 in colorectal tumor cells (Hafez et al., 2010).
miR 21 and miR 92a are excreted by CRC cells and can be found and it can be seen that there is upregulation of the expression of these 2 miRNAs in fecal samples from CRC patients (Okugawa et al. 2014).
miRNAs can diffuse into Escherichia coli cells and regulate gene expression in these bacteria (Liu et al. 2016).
The miRNA that has diffused into the cell then activates/turns ON LIRA, where this LIRA mRNA is expressed constitutively (continuously) because we use a constitutive promoter (BBa_K1614000).
Activation of LIRA occurs by complementation (complex formation) between one or both of the miRNA biomarkers with one or both of the miRNA detectors of LIRA so that the RBS and start codon of LIRA are previously closed by a hairpin loop structure or what we call it Close LIRA. The switch (CLS) becomes open in the RBS and starts codon because the hairpin loop is open or we call it the Open LIRA Switch (OLS).
OLS which is complemented with one of the miRNAs can produce aeBlue because it is in accordance with our LIRA design which can work with an OR gate. So in this expression system, there will be 3 types of OLS, namely: OLS1 which is a complex between CLS and miR21, OLS2 which is a complex between CLS and miR92a, and OLSc which is a complex between CLS, miR21, and miR92a. In the process, OLSc is formed from the formation of a complex between OLS1 and miR92a or OLS2 and miR21.
These three types of OLS will then be translated so that they can produce aeBlue chromoprotein which is located upstream of LIRA and is still in the same transcription unit as LIRA.
The concentration of aeBlue produced will increase as the OLS concentration increases and then will decrease as the OLS concentration decreases and the rate of degradation of the aeBlue chromoprotein itself. Likewise, other substances such as OLS, CLS, and miRNA also have their own degradation rates which will influence the fluctuations between these substances frequently over time, where these fluctuation trends will be observable in the kinetic modeling simulation stage later.
Translate Expression System into Chemical Equations, ODEs, setting parameters value
Mechanisms of the LIRA expression system are translated into the following equations with certain parameters values:
Simulate and interpret ODE simulation result
We simulate two conditions for biomarker existence. The first one is when two biomarkers exist and the other one is when only one biomarker exists. We simulate this to ensure that our OR Gate mechanism can work properly. For sub-conditions, we simulate each condition with various concentrations of biomarkers in the range 1.2x10-3 M - 175x10-3 M. Here are the results:
No. | All Substance fluctuations | Zoom in (without CLS fluctuation) |
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
6 | | |
Based on the results above, in the LIRA simulation with 2 biomarkers at the lowest normal miRNA concentration value (1.2x10-3 M), the results obtained were that the visible concentration of aeBlue lasted for around 12.15 hours. Then, when the concentration was increased 10x from the normal value to 1.2x10-2 M (second row) the result was that the visible concentration of aeBlue increased to last for around 14.53 hours. Finally, when the concentration was changed to the highest value of the concentration range, namely 0.175 M, the result was that the visible duration of aeBlue increased to around 16.85 hours.
Meanwhile, in the simulation with 1 biomarker at the lowest normal miRNA concentration value (1.2x10-3 M), 10x the lowest normal value (1.2x10-2 M), and the highest value of 0.175 M, the visible duration of aeBlue was 13.94 hours, 14.60 hours, respectively. and 17.47 hours. The value obtained is not much different from the simulation using 2 biomarkers, this is proof that our LIRA OR Gate expression system can work.
Then, if we observe the trend of the graph, especially the one that displays all fluctuations in substances (column 2), it can be seen that the CLS value reaches a constant level at a certain time. This is because the gene which is the template for CLS production is assumed to be constant and the value is included in the Ktr constant (Transcription Rate Constant). This assumption is because the number of genes tends to be constant in an organism. Apart from that, the constant amount of CLS is also caused by the use of a constitutive promoter that continuously transcribes LIRA, so that CLS will always be available to detect biomarkers.
If we review the literature, information is obtained that the transit time for food from the beginning of the colon until it comes out as feces is approximately 12-24 hours (Sensoy 2021). There were also those who stated that the transit duration in each section was as long as The ascending (9.5 ± 2.3 hours) and descending colon (5.5 ± 4.1 hours) had shorter transit times than the sigmoid-rectum section (12.7 ± 2.1 hours) (p < 0.001, p < 0.01, respectively), and the transverse colon (4.2 ± 2.1 hours) (Tomita et al. 2011), as well as data that states that the majority of CRC cases occur in the sigmoid-rectum, namely 59% in men and 46% in women (source: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bowel-cancer/incidence#heading-Three)
(Source: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bowel-cancer/incidence#heading-Three )
So our biodevice still has the possibility of being able to detect the presence of CRC because at the input miRNA with the lowest concentration, namely the lowest normal value (1.2x10-3M), the visible color of aeBlue can last for approximately 12.15 hours. Then, when it is increased 10x from the normal concentration, the duration increases to 14.53 hours, based on the data, this duration has entered the range of 12-24 hours (Sensoy 2021) and enters the transit time range in the sigmoid-rectum (12.7 ± 2.1 hours) (Tomita et al. al. 2011) which is the part with the largest distribution percentage of cancer spread bases on the figure above. So it can be predicted that the possibility of the blue color of aeBlue chromoprotein produced by our biodevice can still be seen in the feces of CRC sufferers (miRNA levels that are upregulated above normal). Moreover, the simulation we carried out still does not include the dynamics of the EcN cell population, which if we include this parameter, it is possible that our visible AeBlue biodevice can last longer and can be adjusted based on the dose of EcN probiotic taken.
Therefore, from here we also learned that for future development of the DBTL model and cycle, it is necessary to consider the dynamic parameters of the EcN population and adjust the dose based on each person's circadian defecation rhythm. Apart from that, adjustments can also be made in terms of the circadian rhythm of defecation, for example by making probiotic preparations which can also facilitate the defecation process, or other alternative adjustments as simple as advising EcN biodevice consumers to eat healthy foods rich in fiber which can facilitate digestion, we are sure This improvement can be carried out in stages until a biodevice design that is close to ideal is obtained.
Ajuyah, P., Hill, M., Ahadi, A., Lu, J., Hutvagner, G., and Tran, N. 2019. MicroRNA (miRNA)-to-miRNA Regulation of Programmed Cell Death 4 (PDCD4). Molecular and cellular biology. 39(18): e00086-19.
Alizadeh, S., Barzegari, A., Esmaeili, A., and Omidi, Y. 2020. Designing a light-activated recombinant alpha hemolysin for colorectal cancer targeting. BioImpacts : BI. 10(3): 187–193.
Alon, U. 2020. An introduction to systems biology: design principles of biological circuits. Second Edition. Taylor & Francis Group: Boca Raton (FL).
Baabu, P.R.S., Srinivasan, S., Nagarajan, S., Muthamilselvan, S., Selvi, T., Suresh, R.R., and Palaniappan, A. 2022. End-to-end computational approach to the design of RNA biosensors for detecting miRNA biomarkers of cervical cancer. Synthetic and systems biotechnology. 7(2): 802–814.
Dirks, R.M. and Pierce, N.A. 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of computational chemistry. 24(13): 1664–1677.
Dirks, R.M. and Pierce, N.A. 2004. An algorithm for computing nucleic acid base‐pairing probabilities including pseudoknots.Journal of computational chemistry. 25(10): 1295–1304.
Dirks, R.M., Bois, J.S., Schaeffer, J.M., Winfree, E., and Pierce, N.A. 2007. Thermodynamic analysis of interacting nucleic acid strands. SIAM review. 49(1): 65–88.
Erickson, N, Mueller, J, Shirkov, A, Zhang, H, Larroy, P, Li, M, and Smola, A. 2020. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv preprint arXiv:2003.06505.
Fornace, ME, Porubsky, NJ, and Pierce, NA. 2020. A unified dynamic programming framework for the analysis of interacting nucleic acid strands: enhanced models, scalability, and speed. ACS Synth Biol, 9(10), 2665-2678.
Hafez, M., Hayes, K., Goldrick, M., Grencis, R.K., and Roberts, I. S. 2010. The K5 capsule of Escherichia coli strain Nissle 1917 is important in stimulating expression of Toll-like receptor 5, CD14, MyD88, and TRIF together with the induction of interleukin-8 expression via the mitogen-activated protein kinase pathway in epithelial cells. Infection and immunity. 78(5): 2153–2162.
Hall, J.E. and Hall, M.E., 2020. Guyton and Hall textbook of medical physiology e-Book. Elsevier Health Sciences.
I. L. Hofacker. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte fuer Chemie, 125(2.0), 167-188.
Levine, E., McHale, P., and Levine, H. 2007. Small regulatory RNAs may sharpen spatial expression patterns. PLoS computational biology. 3(11): e233.
Liu, S., da Cunha, A.P., Rezende, R.M., Cialic, R., Wei, Z., Bry, L., Comstock, L.E., Gandhi, R., and Weiner, H.L. 2016. The Host Shapes the Gut Microbiota via Fecal MicroRNA. Cell host & microbe. 19(1): 32–43.
Okugawa, Y., Toiyama, Y. and Goel, A., 2014. An update on microRNAs as colorectal cancer biomarkers: where are we and what’s next?. Expert review of molecular diagnostics, 14(8), pp.999-1021. https://doi.org/10.1586/14737159.2014.946907
R. Lorenz, S. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, I. L. Hofacker (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(nan), 26.
Sensoy, I., 2021. A review on the food digestion in the digestive tract and the used in vitro models. Current research in food science, 4, pp.308-319. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8134715/
Stögbauer, T., Windhager, L., Zimmer, R., and Rädler, J.O. 2012. Experiment and mathematical modeling of gene expression dynamics in a cell-free system. Integrative Biology. 4(5): 494–501.
Tomita, R., Igarashi, S., Ikeda, T., Sugito, K., Sakurai, K., Fujisaki, S., Koshinaga, T. and Shibata, M., 2011. Study of segmental colonic transit time in healthy men. Hepato-gastroenterology, 58(110-111), pp.1519-1522. https://pubmed.ncbi.nlm.nih.gov/21940311/
Yu, Y., Nangia-Makker, P., Farhana, L., G Rajendra, S., Levi, E., and Majumdar, A.P. 2015. miR-21 and miR-145 cooperation in regulation of colon cancer stem cells. Molecular cancer. 14: 98.
Zadeh, J.N., Steenberg, C.D., Bois, J.S., Wolfe, B.R., Pierce, M.B., Khan, A.R., Dirks, R.M., and Pierce, N.A. 2011. NUPACK: Analysis and design of nucleic acid systems. Journal of computational chemistry. 32(1): 170–173.