Overview

Our modelling objectives were twofold. First, to characterise the theoretical underpinnings of our diametric selection marker GFP-sacB, we developed an ODE model simulating the effect of sucrose and IPTG on cell lysis. The model fit our experimental results well, validating its formulation.

Second, to advance RNA interaction studies and improve their design process, we implemented two deep learning models. The first is a predictive model that adapts a “Siamese” large language model architecture and can be trained to predict an RNA switch’s properties. Trained on our large dataset, the model was able to predict binding interactions (R² = 0.937) between any category of RNA.

The second model applies the transformer architecture and demonstrates its capability to design RNA switches. The model was shown to generate Toehold switches with at most two base-pair differences from the expected sequence.

ODE model for sacB

🥳 The purpose of the model is to capture the effect of sucrose and IPTG on cell lysis, using optical density (OD) measurements.

With the ODE model, we hope to map changes in optical density (OD600) of the targeted cells to changes in IPTG, so as to characterise our negative selection marker for future iGEMers.


Mechanism breakdown

  • Phase I

    From the sacB gene to levansucrase (transcription and translation)

  • Phase II

    Enzymatic reaction between levansucrase and sucrose

  • Phase III

    Levan's effect on cell lysis

  • Model

    The model is built from these first principles, with the aid of experimental data

Preliminary analysis of experimental data

Wet lab data were analysed to identify any trends or special features to be considered for the model. This step is important for determining the areas of focus for the ODE modelling.


General trends

From the wet lab data, we understand that increases in sucrose and IPTG result in a reduction in OD, as more levan is produced, causing more cell lysis.

Detailed analysis

With reference to Figure 3, when sucrose is between 0% and 0.1% (by interpolation), the OD at 6 hours is similar across IPTG levels. However, a distinct trend emerges above 0.2% sucrose, where the 0 IPTG curve deviates from the other IPTG levels. This indicates that IPTG does affect OD, and that at higher IPTG levels it is less effective to reduce OD by increasing sucrose concentration alone.

Amongst the 25 µM, 35 µM, and 45 µM IPTG curves, the graph shows a point at around 0.5% sucrose where the OD levels nearly coincide. Before this point, 45 µM IPTG is the most effective at reducing OD; after it, 25 µM IPTG becomes the most effective, while 35 µM IPTG remains second most effective across all sucrose levels.


This change in effectiveness corresponds to Figure 2, which shows that as time progresses, GFP expression decreases at higher IPTG and sucrose concentrations. The 0.5% and 1% sucrose curves stop increasing towards the later time points in Figures 2c and 2d. In Figure 2d, the GFP expression level for 1% sucrose even drops to second lowest amongst all curves in the figure.

GFP expression can be used as a proxy for the expression profile of levansucrase, as suggested by the mechanism diagram (Figure 1). Hence, reduced GFP levels at higher IPTG and sucrose concentrations imply a decrease in levansucrase. As levansucrase decreases, the production of levan also decreases, making it less effective at reducing OD. Additionally, the decrease in GFP could be due to mutations in the gene resulting from increased metabolic burden.

Model Formulation and Parameters

The four equations consist of Hill equations and standard modelling expressions for synthesis and degradation. The fourth equation, for OD(t), is of particular interest:

  1. Equation 4 takes into account sucrose’s inducible and inhibitory effects on cell lysis, as well as levan’s inhibitory effect on cell lysis, since we understand cell lysis to result from the accumulation of compounds in the cytoplasm.
  2. The OD at 0 IPTG can also be captured when there is no levan production: the inhibitory Hill term in sucrose limits the increase in OD driven by the inducible Hill term in sucrose.
  3. ODmax caps the increase in OD at a ceiling value.

The parameter values were found by fitting the numerical solution of the equations to the wet lab data and optimising the fit.
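To make this workflow concrete, the sketch below shows how such a system can be simulated and fitted in Python with SciPy. The equations, parameter names, and data points here are illustrative stand-ins rather than our actual model, which is given in the figures above.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def hill(x, k, n):
    """Activating Hill term; (1 - hill(x, k, n)) gives the inhibitory form."""
    return x**n / (k**n + x**n)

def rhs(t, y, sucrose, iptg, p):
    """Illustrative two-state system (levan, OD); p is a vector of placeholder parameters."""
    levan, od = y
    d_levan = p[0] * hill(iptg, p[1], 2) * hill(sucrose, p[2], 2) - p[3] * levan
    d_od = (p[4] * od * (1 - od / p[5]) * (1 - hill(sucrose, p[6], 2))  # growth capped by ODmax
            - p[7] * hill(levan, p[8], 2) * od)                          # levan-driven lysis
    return [d_levan, d_od]

def simulate_od(p, sucrose, iptg, t_eval):
    sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), y0=[0.0, 0.05],
                    t_eval=t_eval, args=(sucrose, iptg, p))
    return sol.y[1]

def sse(p, t_eval, od_measured, sucrose, iptg):
    """Sum of squared errors between simulated and measured OD."""
    return np.sum((simulate_od(p, sucrose, iptg, t_eval) - od_measured) ** 2)

# Hypothetical OD readings at 0.5% sucrose and 25 uM IPTG; replace with real wet lab data.
t_eval = np.linspace(0, 6, 13)
od_measured = 0.05 + 0.4 * (1 - np.exp(-0.5 * t_eval))
p0 = np.full(9, 0.5)   # initial parameter guesses
fit = minimize(sse, p0, args=(t_eval, od_measured, 0.5, 25.0), method="Nelder-Mead")
print(fit.x)
```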

Model Performance


Note that the dots denote actual wetlab data and the continuous lines denote model prediction.


The levan graph reproduces the trends shown in Figure 2, including the exponential increase and the relative effect of different sucrose concentrations on levan production. Figures 5a and 5b each show stabilising behaviour, as predicted by theory.

Implications and future improvements

Implications

The model captures the relationships between levan, IPTG, and sucrose well. This generic framework for deriving ODEs may be applied to other similar cases.

We would also like to highlight that, if a new project were to be started with sacB, the equations and model could be used to predict which sucrose and IPTG concentrations produce optimal cell death. The model can also drive the formulation of hypotheses, saving the experimenter cost and time.

Improvements

We acknowledge that there are some discrepancies between predicted and experimental trends when applying the generic parameters. These could be reduced by adding further Hill terms to the main set of ODEs, increasing precision at the cost of complexity. However, this seems of limited value, as the current model already captures most of the features of the experimental data (the relative effect on OD, for instance).

For future improvements, it would be interesting to investigate the fluctuations between data points and interpret the trend as more than two phases (inducing and inhibiting, in our case). The diffusion of sucrose and levan inside the cytoplasm could also be modelled to increase the model's accuracy.



Deep learning

🥳 The goal of modelling is to create and deploy predictive and generative models for RNA switches.

Initially, we explored traditional modelling methods before transitioning to advanced deep learning models.

Phase I: Traditional Modelling

Due to our project’s focus on RNA switches, we aimed to develop predictive models to characterise their behaviour. As a proof of concept, we aimed to develop models for Toehold switches. Our objective was to predict the ON/OFF ratio of these switches, in line with our overall project goal of creating fundamental tools for RNA interaction studies.


We began with the most fundamental models. Linear modelling takes manually crafted “features” from a dataset and predicts a numerical value. For our dataset, the manual features included sections of the Toehold switch, such as the stem, and the minimum free energy of the binding-region subsequence (Figure 6). The predicted values were the activity of the switch in its ON and OFF states. We also predicted the ON/OFF ratio of the switch to approximate its fold-change activity.


For this predictive workflow, two approaches were tested: a fitted linear regression model and a multilayer perceptron. Our objective for this phase was to determine whether more advanced models were truly necessary to process RNA sequence data.

Linear Regression Fitting

Linear regression models the relationship between the dependent variable and the independent variables in a linear fashion by fitting the data points to a linear equation (Hayes, 2023).
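The standard form of the equation (shown here in place of the original figure) is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$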


where X₁ to Xₙ are the features of the data, β₀ to βₙ are the coefficients, and Y is the predicted value. In our case, Y is the ON value, the OFF value, or the ON/OFF ratio.


The linear regressor aimed to reproduce the findings of Angenent-Mari et al. (2020) with a simpler methodology. Simple models like linear regression provide clear interpretations of the impact of each feature on the output. Furthermore, simpler models with fewer parameters help prevent overfitting, which occurs when a model does not generalise well to unseen data (IBM, n.d.).

Methods

We designed our model to have 30 independent variables and 3 dependent variables, as in Figure 1. The dataset used for linear regression was from “A deep learning approach to programmable RNA switches” (Angenent-Mari et al., 2020), which, after filtering out null values, contained 135,793 data points.

We evaluated the model using two metrics, Pearson’s R and R² (the coefficient of determination). Pearson’s R establishes whether there is a linear relationship between the predicted and actual values. R² measures how accurate the model is; an R² of 0.7, for example, means that 70% of the variation in the data can be explained by the model. Hence, Pearson’s R measures whether the model picks up on a trend, while R² measures how much of the deviation from that trend the model explains. We computed both metrics using the r_regression (sklearn.feature_selection.r_regression, n.d.) and r2_score (sklearn.metrics.r2_score, n.d.) functions from scikit-learn. We also calculated the mean squared error (MSE), which quantifies the model's prediction accuracy.
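As a minimal illustration of this evaluation, the three metrics can be computed as follows; the arrays below are placeholders standing in for our measured and predicted values.

```python
import numpy as np
from sklearn.feature_selection import r_regression
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder arrays standing in for measured and predicted ON/OFF ratios.
y_true = np.array([0.10, 0.40, 0.35, 0.80, 0.60])
y_pred = np.array([0.15, 0.35, 0.30, 0.70, 0.65])

pearson_r = r_regression(y_pred.reshape(-1, 1), y_true)[0]  # linear correlation with the truth
r_squared = r2_score(y_true, y_pred)                        # fraction of variance explained
mse = mean_squared_error(y_true, y_pred)                    # average squared prediction error
print(pearson_r, r_squared, mse)
```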

Results

To provide a visual representation of these correlations, we plotted a Pearson's correlation heatmap.



From these results, we concluded that the linear regression model lacked the predictive power necessary for RNA switches. Based on the low Pearson’s R, we hypothesised that there may not be a linear relationship between the switch sequence and its performance. Furthermore, we believe that the specifically engineered features, such as the minimum free energy of the switch’s components, add noise to the data rather than explainability.


Hence, we proceeded to explore the use of neural networks. Neural Networks are able to model non-linear relationships in data. For this aim, we chose to design the simplest neural network possible, a Multilayer Perceptron (MLP).


Multilayer Perceptron

MLPs are the simplest type of neural network. In contrast to the linear regression model, the MLP is capable of directly interpreting sequence data after it is transformed into a “one-hot encoded” vector (Giosue & Di Gangi, 2017).

Each nucleotide becomes a vector of size 4, with a single 1 marking the nucleotide present at that position and 0s for all other nucleotides. For example, the nucleotide A becomes [1,0,0,0], T becomes [0,1,0,0], and the sequence AT is [[1,0,0,0], [0,1,0,0]].
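A minimal sketch of this encoding is shown below; the A/T/G/C index ordering matches the example above and is otherwise arbitrary.

```python
import numpy as np

NUCLEOTIDE_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}

def one_hot(sequence: str) -> np.ndarray:
    """Convert a nucleotide sequence into a (length, 4) one-hot matrix."""
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, nucleotide in enumerate(sequence):
        encoded[position, NUCLEOTIDE_INDEX[nucleotide]] = 1.0
    return encoded

print(one_hot("AT"))  # [[1, 0, 0, 0], [0, 1, 0, 0]]
```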


This MLP architecture follows that of Angenent-Mari et al. However, we had to reimplement the model, as the Python packages used in their implementation, such as TensorFlow, have since undergone substantial changes and their original code is deprecated.


Furthermore, implementing the model ourselves enhanced our understanding of how the input data is processed and fed into networks, which aided us in designing more advanced models.

Methods

The model has three hidden layers, in which the number of nodes progressively decreases from 128 to 64 to 32, followed by an output layer of 3 nodes.
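A minimal sketch of this layer structure in TensorFlow/Keras is shown below; the input length, activation functions, and loss are our assumptions rather than the exact training configuration.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 59  # illustrative switch length (assumption); the input is the flattened one-hot matrix

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN * 4,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(3),                      # ON value, OFF value, ON/OFF ratio
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```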


The model was trained for 200 epochs. The accuracy and loss graphs suggest that the model converges at around 100 epochs. After fine-tuning the model, these were the best results we achieved.

Results



Despite referencing the source work, the results proved somewhat underwhelming. The model had a poor mean absolute error of around 0.15. While the R² had improved, it was still lower than the benchmark. Deep learning models are stochastic in nature, and despite using the same architecture, we remained uncertain about how the MLP could be optimised to reproduce the benchmark results.


However, the increase in correlation from linear regression to the MLP gave us some assurance that deep learning was indeed the right direction for our project. At the same time, our team decided to broaden the project to the study of RNA interactions, in line with the wet lab team's experimental results.

Hence, we made 2 critical decisions:

  1. To enhance our ability to extract information from the RNA inputs, we aimed to use Language Models, which are specially designed to understand sequences.
  2. To model interactions, we aimed to design architectures that utilised data from both the RNA switch and its target sequences.

Phase II: Language modelling

Based on the results of our previous models, we decided to use language models. Furthermore, we were inspired to work towards a predictive-generative workflow after interviews with Dr Jacqueline Valeri and Mr Aiden Riley.

To do so, the modelling for our project split into two different directions.

One was to take inspiration from Large Language Models (Naveed et al., 2023) to predict RNA interaction behaviours, including switch activity. The other was to apply the transformer model (Vaswani et al., 2017) to generate an RNA switch given its target, without explicitly telling the model how to achieve that.

Prediction using Siamese Large Language Models

For the predictor model, we took inspiration from the Large Language Models that power systems such as GPT-4. These models are designed to interpret language through a formula called attention (Vaswani et al., 2017).

This formula is used to weigh how “connected” words in a sentence are to one another.
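Concretely, the scaled dot-product attention from Vaswani et al. (2017) is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from the input tokens, and $d_k$ is the key dimension.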

For example, the model will learn that “take” and “picture” are very related and thus have a high attention score (Voita, 2019). We extended this concept to the language of nucleotides as well.

Each nucleotide is treated as a “word”. The darker the region in the heatmap, the more related the regions are. It is important to note that the attention will lead to the model learning intrasequence interactions, which can be seen as learning the structure of the switch and not just complementarity.

This is achieved through a process called tokenization in our model.

In our case, we tokenized by codon pairs to reduce the length of the sequence. Each “token” has a numerical value associated with it for the model to process.
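A simplified sketch of such a tokenizer is shown below. The vocabulary construction here (all 64 codons plus reserved IDs) is illustrative: our actual tokenizer groups tokens into codon pairs and also handles special tokens and padding.

```python
from itertools import product

# IDs 0 and 1 are reserved for padding and unknown tokens; codons start at 2.
CODON_TO_ID = {"".join(codon): i + 2 for i, codon in enumerate(product("AUGC", repeat=3))}

def tokenize(rna: str) -> list[int]:
    """Split an RNA sequence into codon tokens and map each to a numerical ID."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    return [CODON_TO_ID.get(codon, 1) for codon in codons]

print(tokenize("AUGGCAUAA"))
```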

Bidirectional Encoder Representations from Transformers (BERT)

We chose to use an architecture called Bidirectional Encoder Representations from Transformers (BERT), which has multiple stacked layers to calculate “attention”. In practice, this results in each layer “attending” to the sequence to different degrees. With each layer, the model learns more complex relationships.

However, traditional BERT models are computationally expensive and require massive datasets to train. Hence, we took inspiration from Siamese models, which create a duplicate of the model for each input sequence: in our case, the RNA switch and its target. The two copies share the weights they learn. This approach was used by a model known as Sentence-BERT (Reimers & Gurevych, 2019), which determines whether two sentences are similar to one another. We adapted this idea to predict interactions instead, and trained our model from scratch. As a result of our Siamese interaction approach, we were able to train the model with half the number of parameters, maintain the effectiveness of the BERT model, and extract information about interactions directly.
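To make the weight-sharing idea concrete, the sketch below shows a deliberately small Siamese encoder in TensorFlow/Keras. It is far shallower than our actual model, and the vocabulary size, sequence length, dimensions, and sigmoid interaction head are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 68   # e.g. 64 codons plus special tokens (assumption)
MAX_LEN = 512     # maximum token length (assumption)
EMB_DIM = 128

def build_encoder() -> Model:
    """A single BERT-style block: embedding, self-attention, normalisation, pooling."""
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
    attention = layers.MultiHeadAttention(num_heads=4, key_dim=EMB_DIM // 4)(x, x)
    x = layers.LayerNormalization()(x + attention)
    x = layers.GlobalAveragePooling1D()(x)        # pool token embeddings into one vector
    return Model(tokens, x, name="shared_encoder")

# Siamese setup: the SAME encoder (shared weights) embeds both the switch and its target.
encoder = build_encoder()
switch_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="switch")
target_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="target")
merged = layers.Concatenate()([encoder(switch_in), encoder(target_in)])
interaction = layers.Dense(1, activation="sigmoid")(merged)   # predicted interaction score
siamese = Model([switch_in, target_in], interaction)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
```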

Results

The first step was to pretrain the model on a large corpus of RNA interaction data gathered from RNAInter (Kang et al., 2021). The input was a pair of RNA sequences that interact with one another; each RNA could vary in length between 1 and 1536, with longer sequences truncated. During the fine-tuning stage, a variety of RNAs were used, such as mRNAs and miRNAs. The model was trained for 5 epochs, owing to our team's limited computational resources. Even so, we observed significant results.


Following these results, we returned to the Angenent-Mari et al., dataset. There was a substantial reduction in the loss compared to the MLP.


There was a dramatic reduction in the loss, to approximately 0.056 for the validation set (in orange) after 25 epochs. The training set (in blue) initially had a higher loss, likely due to the use of a regularisation technique called dropout, in which a percentage of the model's nodes is randomly removed at each training iteration; all nodes are used during validation.

Our MLP model achieved a mean squared error of 0.15 during validation, so this model represents nearly a threefold improvement. This validates our transfer learning approach, showing that pretraining allowed more information to be extracted from the sequences.

To confirm whether our model learned from the data, we measured two independent correlation metrics: the Pearson R value and the Spearman rank correlation. Both measure how closely the model's predictions relate to the true values. We achieved a respectable 0.56 and 0.57 for Pearson R and Spearman rank respectively, indicating a modest correlation between our predicted results and the actual data. However, the R² (coefficient of determination) still leaves room for improvement, reaching approximately 0.33.

We hypothesise that the difference in results is due to the data generation method. The data from Angenent-Mari et al., was generated by fusing both the toehold switch and the trigger together. However, our model was trained on free RNA-RNA interactions.

Hence, to fully validate the potential of our model, data on free toehold switches will need to be generated. Our wet lab team's data generation protocol will be able to achieve this, allowing future teams to continue our work.

Improvements

Currently, our model models interactions implicitly by concatenating the outputs of the two language models. In practice, this means the results of both models are handled as one-dimensional sequences.

To better capture interaction information, we can take inspiration from convolutional networks: instead of a one-dimensional sequence, we could create a two-dimensional matrix with one sequence along each axis.
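A sketch of this idea, assuming per-token embeddings are available from each language model, is shown below.

```python
import numpy as np

# Hypothetical per-token embeddings from the two encoders: (sequence_length, embedding_dim).
switch_embeddings = np.random.rand(30, 128)
target_embeddings = np.random.rand(25, 128)

# Pairwise dot products give a (switch_length, target_length) interaction map,
# which a 2D convolutional network could then process like an image.
interaction_map = switch_embeddings @ target_embeddings.T
print(interaction_map.shape)  # (30, 25)
```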

Generation using Transformers

After successfully implementing predictive models using RNA switches and deep learning, the only remaining frontier for us was the development of a generative model.

Our next goal was to design RNA switches, and after an extensive literature review and consideration of our human practices work, we chose to employ the transformer architecture (Vaswani et al., 2017) for this purpose. The transformer stands out for its versatility: it can handle sequences of varying lengths and manage imbalanced switch-target sequence lengths effectively. The architecture is also easy to comprehend and work with, allowing researchers to grasp the underlying mechanisms swiftly and ultimately accelerating the development process.

It is structured around two major components: encoders and decoders.


Encoders focus on understanding the input sequences, breaking them down into smaller tokens, while decoders generate output sequences based on the previously processed information.


This clear division of tasks simplifies the overall workflow and enhances the interpretability of the model, making it a good starting choice for switch-target generation.

Results

The model architecture is similar to that of Vaswani et al., with 6 encoders and 6 decoders and 8 heads in the multi-head attention layers. The transformer was trained for a total of 75 epochs on the dataset from “A deep learning approach to programmable RNA switches” (Angenent-Mari et al., 2020). In that dataset, switches with an ON/OFF ratio of less than 2 were filtered out, so we followed their lead (Angenent-Mari et al., 2020); null data points were also removed. After this processing step, 176,113 sequences were used for training.
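As a sketch of this preprocessing step (the DataFrame and column names below are toy placeholders, not the dataset's actual headers):

```python
import pandas as pd

# Toy stand-in for the Angenent-Mari et al. dataset (column names are hypothetical).
df = pd.DataFrame({
    "switch_sequence": ["AUGGCA...", "AUGCCU...", None],
    "trigger_sequence": ["UGCCAU...", "AGGCAU...", "UAAGGC..."],
    "on_off_ratio": [5.2, 1.4, 3.8],
})

df = df.dropna(subset=["switch_sequence", "trigger_sequence", "on_off_ratio"])  # drop null datapoints
df = df[df["on_off_ratio"] >= 2]   # keep only switches with an ON/OFF ratio of at least 2
print(len(df), "sequences retained for training")
```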

Upon analysing the training loss curve, it is evident that the loss has reached a stable state after 75 epochs. The loss serves as a measure of the model's performance, with lower values indicating predictions closer to the actual targets in the training data. We achieved a loss value of 0.70, which we interpret as the model accurately predicting roughly 30% of the sequences within the dataset.


Next, we investigated how our model performed on the validation set which was withheld during training.

The accompanying figure illustrates a random sample from the validation set, generated by predicting the switch given its target using the transformer with frozen weights.

The transformer showcases its proficiency in accurately generating the switch based on the provided target. It is noteworthy that the switch and the generated column are identical, affirming that the transformer has effectively learned the RNA switch template within the dataset.


We conducted a thorough analysis by calculating both the character error rate (CER) and the word error rate (WER) between the expected and predicted sequences in the validation set.

CER indicates the percentage of characters that were incorrectly predicted (Char Error Rate, n.d.), while WER indicates the percentage of words that were incorrectly predicted (Word Error Rate, n.d.). In our model, a “word” refers to a codon in the sequence, while a “character” refers to an individual nucleotide.
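For illustration, both metrics can be computed with TorchMetrics as cited above; the sequences here are toy placeholders, with codons separated by spaces so that WER operates at the codon level.

```python
from torchmetrics.text import CharErrorRate, WordErrorRate

expected = ["AUG GCA UAA", "AUG CCC UGA"]
generated = ["AUG GCA UAA", "AUG CCU UGA"]

cer = CharErrorRate()   # character-level (nucleotide) errors
wer = WordErrorRate()   # word-level (codon) errors
print(cer(generated, expected), wer(generated, expected))
```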

Throughout most of the sequences, a consistent error rate of 0.06 was observed. However, notable spikes in error, up to 0.012, were identified between batch steps 100 and 150 and from step 240 onwards. These spikes are attributed to variations in switch design, as an exact template was not uniformly applied in the Angenent-Mari et al. dataset; consequently, error rates increased for sequences that deviated from the established design distribution.


Improvements

Firstly, a more robust dataset covering a larger design space would improve the transformer's performance on outliers and its generalisability. Given our wet lab's data generation workflow, such future work is possible.

Secondly, the transformer architecture is a good starting point, but other deep learning architectures would be interesting to explore, such as the Wasserstein Generative Adversarial Network (WGAN), Variational Autoencoders (VAEs), and even a decoder-only transformer.


Conclusions and Future work

Both models demonstrate the transformative potential of AI in accelerating Synthetic Biology design. In the future, we aim to integrate these models into OTTER, our own user interface for advanced deep learning models in Synthetic Biology.

The predictor model demonstrates enhanced predictive capabilities for the properties of RNA switches. It achieves this by capturing information about the RNA “language” and by using a novel architecture for modelling interactions. Moreover, we designed the model to be generalisable to any dataset of RNA-RNA interactions.

Concurrently, we developed a generative model for RNA switches, a novel approach utilizing transformers.

Moving forward, our objective is to integrate both models to not only generate switches but also predict their characteristics and optimise their performance. Given the promising results attained by these models, we anticipate their ability to learn from any data generated by the wet lab's Signal Workflow, further advancing our research.

References

  1. Protein production in the E. coli cell envelope. (n.d.). Semantic Scholar. https://www.semanticscholar.org/paper/Protein-production-in-the-E.-coli-cell-envelope-Baumgarten/450e1ff26bf32bc4146daf4c6786c4d002434586
  2. Angenent-Mari, N. M., Garruss, A. S., Soenksen, L. R., Church, G., & Collins, J. J. (2020). A deep learning approach to programmable RNA switches. Nature Communications, 11(1), 1–12. https://doi.org/10.1038/s41467-020-18677-1
  3. Char Error Rate. (n.d.). TorchMetrics. Retrieved October 11, 2023, from https://torchmetrics.readthedocs.io/en/stable/text/char_error_rate.html
  4. Hayes, A. (2023, April 29). Multiple Linear Regression (MLR): Definition, Formula, and Example. Investopedia. https://www.investopedia.com/terms/m/mlr.asp
  5. IBM. (n.d.). What is Overfitting? Retrieved October 11, 2023, from https://www.ibm.com/topics/overfitting
  6. sklearn.feature_selection.r_regression. (n.d.). scikit-learn. Retrieved October 11, 2023, from https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.r_regression.html
  7. sklearn.metrics.r2_score. (n.d.). scikit-learn. Retrieved October 11, 2023, from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
  8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.
  9. Word Error Rate. (n.d.). TorchMetrics. Retrieved October 11, 2023, from https://torchmetrics.readthedocs.io/en/stable/text/word_error_rate.html
  10. Giosue, B., & Di Gangi, M. A. (2017). Deep learning architectures for DNA sequence classification. 10147, 249–259. https://doi.org/10.1007/978-3-319-52962-2_14
  11. Voita, E. (2019). The Story of Heads. https://lena-voita.github.io/posts/acl19_heads.html
  12. Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., & Mian, A. (2023). A Comprehensive Overview of Large Language Models.
  13. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
  14. Kang, J., Tang, Q., He, J., Le, L., Yang, N., Yu, S., Wang, M., Zhang, Y., Lin, J. S., Cui, T., Hu, Y., Tan, P., Cheng, J., Zheng, H., Wang, D., Su, X., Chen, W., & Huang, Y. (2021). RNAInter v4.0: RNA interactome repository with redefined confidence scoring system and improved accessibility. Nucleic Acids Research, 50(D1), D326–D332. https://doi.org/10.1093/nar/gkab997