We have developed a user-friendly website featuring a publicly accessible web tool that serves as the interface for
our dry lab model and optimizer. Upon accessing the website, users are presented with a choice between two distinct
tools: the Relative Expression Prediction Tool and the RBS Optimization Tool.
The Relative Expression Prediction Tool accepts the following input parameters: RBS Sequence, Coding Sequence, Temperature,
Gram Stain of Chassis, 16S rRNA Sequence. It subsequently provides the relative expression level in terms of the RBS
rate, represented on a logarithmic scale.
On the other hand, the RBS Optimization Tool also receives the following inputs: RBS Sequence, Coding Sequence,
Target Relative Expression (expressed in terms of the RBS rate on a logarithmic scale), Temperature, Gram Stain of
Chassis, and 16S rRNA Sequence. This tool then furnishes the optimized RBS sequence along with its corresponding
predicted (logarithmically scaled) RBS rate.
https://synthopedia.igemiitm.in/myapp/
We have built a machine learning model that can predict gene expression levels based on the gene sequence, the ribosome binding site (RBS) flanking it, the prokaryotic chassis it has been cloned in, and the temperature at which the chassis is being cultured. Using the predictions of this model, we have also built a genetic algorithm-based optimizer that can modify the RBS sequence so as to achieve a user-defined desired gene expression level. The details of both these aspects of our modelling are provided on this page.
To train our model, we used a dataset of 1014 mRNA sequences curated by
Reis and Salis [1]. We slightly modified this dataset with an additional
column describing whether the organism that the mRNA is being expressed in
is Gram-positive or Gram-negative.
We performed some minimal pre-processing by converting all sequences to
uppercase and replacing thymine (T) with uracil (U). Additionally, we
converted the mean expression values (which are given in arbitrary units)
to a logarithmic scale since we were interested in the “order of
magnitude” of the expression level.
The next step of our modelling was feature engineering, where we used the
existing data to find meaningful information that can aid in predicting
the expression level.
#1 and #2 - 16S rRNA hybridization energy and spacing
The X-crystallographic studies by Yusupov et al. [2] show that the last 8
nucleotides of the 16S ribosomal rRNA bind to the mRNA such that the
hybridization energy is minimized. This binding is important for the 30S
subunit to dock with the mRNA.
The spacing between the Shine-Dalgarno sequence and the start codon also
influences the rate of translation. Any deviation from the optimal spacing
will result in a lower efficiency. Vellanoweth and Rabinowitz [3] found
that the optimal spacing differs in Gram-positive and Gram-negative
bacteria. It is 9 nucleotides in Gram-positives and 7 nucleotides in
Gram-negatives. They also found that Gram-positive bacteria are better at
translation when the spacing is longer, but Gram-negative bacteria can
tolerate shorter spacings better.
To find the location on the mRNA where the rRNA binds, we scanned the mRNA
from the 5’ to the 3’ end of its 5’-UTR sequence and computed the binding
energy in each 8 nucleotide window. The binding energy was computed using
the ViennaRNA library [4]. To account for the effect of spacing, we
penalised non-optimal spacing using a Gaussian-like distribution and
multiplied the penalty with the binding energy. We took into account the
findings of Vellanoweth and Rabinowitz while designing this penalty term.
There is potential for this penalty function to be modified. The rRNA will
bind to that location on the mRNA where the penalised binding energy is
minimum.
We observed that stronger binding leads to higher expression.
The most frequently occurring Shine-Dalgarno sequence appeared to be UAAGGAGG. This was consistent with existing literature [5].
We also saw that spacings close to the optimum give rise to higher expressions, while extreme spacings give rise to lower expressions.
Feature #3 - Interaction with S1 ribosomal protein
Komarova et al. [6] found that the S1 protein present in the ribosome's
30S subunit interacts with A/U-rich sequences upstream of the
Shine-Dalgarno site. This interaction is believed to protect the mRNA from
degradation by RNases, which also use A/U-rich sequences as recognition
sites. The insertion of A/U-rich elements upstream of the Shine-Dalgarno
sequence enhances translation efficiency. X-ray crystallographic studies
by Sengupta et al. [7] found that the S1 protein interacts with 11
nucleotides upstream of the Shine-Dalgarno sequence.
To quantify this interaction, we simply calculated the number of adenine
and uracil nucleotides in the mRNA's putative region of interaction with
the S1 protein. We called this number the A/U score.
We saw that mRNAs with higher A/U scores tend to have higher expression
levels.
Features #4 and #5 - mRNA folding kinetics and accessibility
The region of the mRNA containing the RBS can transiently fold and unfold.
The ribosome can only bind to the RBS if it is accessible. This means that
the region surrounding the RBS should be free of secondary structures, or
if secondary structures are present, they should be easily removable
(i.e., lesser work required to unfold them).
Hüttenhofer and Noller [8] found that the 30S subunit interacts with 54-57
nucleotides of the mRNA. We considered this region, centered around the
start codon, while determining mRNA folding energy and accessibility.
We crudely quantified the accessibility by considering the number of
paired and unpaired nucleotides in the secondary structure of the region
under consideration. We called this the accessibility score.
We saw that having a higher folding energy (i.e., lesser work required for
unfolding) also leads to higher expression levels.
We also saw that a higher accessibility score leads to higher expression levels, on average.
Feature #6 - Standby Sites
If the RBS is transiently folded, the ribosome may bind to standby sites
upstream of the Shine-Dalgarno sequence [9]. This site has to be easily
accessible to the ribosome. Thus, it should have minimal secondary
structure formation. The standby site should also not be too far from the
Shine-Dalgarno sequence.
We defined the accessibility of the standby site by considering the number
of paired and unpaired nucleotides in the standby site. We penalised large
gaps between the standby site and the Shine-Dalgarno sequence by dividing
the accessibility by the length of the gap.
We observed that more accessible standby sites have a higher expression
level than less accessible standby sites.
Feature #7 - Start codon
The start codon in the coding sequence can also influence translation
efficiency. Hecht et al. [10] characterized the strengths of the 64
possible start codons. Using their experimental data, we developed a
scoring system for the different start codons.
We saw that when the codon score was lower, the expression level was
reduced.
Feature #8 - Gram stain
We represented Gram-negative bacteria by ‘0’ and Gram-positive bacteria by ‘1’ to help our model distinguish
between the two.
The next step was to actually train and test our model. For this, we performed grid search cross-validation on a
Random Forest Regressor to find its best hyperparameters. The grid search showed us that the best combination of
hyperparameters for the model were max_depth=80, max_features=3, min_samples_leaf=3, min_samples_split=8,
n_estimators=50 for random_state=23 with both the train-test split and the Random Forest Regressor itself.
After finding the best hyperparameters, we divided the dataset into different combinations of training sets and
testing sets by setting different random seeds during the train-test split. The size of the testing set was set
as 1% of the size of the whole dataset. We trained the fully-tuned model and tested it with the corresponding
testing set for 1000 different random seeds. We compared our performance with the existing calculators (RBS
Calculator v1.0 [12], v1.1 [13], v2.0 [14], v2.1 [1], RBS Designer [15], UTR Designer [16], EMOPEC [17]) for
each of these train-test splits.
On average, the performance of each of these models over the 1000 different testing splits were:
Our model has the second best performance, just slightly below the RBS Calculator v2.1. Something to note is
that our model has a much better performance than EMOPEC, which also uses a Random Forest Regressor for its
predictions.
We then decided to test our model on another dataset of 16779 mRNA sequences whose expression was quantified by
FlowSeq. This dataset was curated by Reis and Salis [1] as well. We compared our performance on this dataset
with the other calculators.
Our model has the best performance among those compared, despite the inherent noise present in a FlowSeq
experiment.
When we checked what features our model deemed were important to make a prediction, we found that binding energy
and folding energy play an important role.
Using the predictions of our model, we sought to build an optimization algorithm that can provide the “best” RBS
sequence for a given gene sequence and desired expression level. We found that a genetic algorithm has the best
performance for such an optimization.
The genetic algorithm, an evolutionary algorithm inspired by natural principles, optimizes sequences through a
combination of mutations (involving random base changes with a certain probability), crossovers (which merges
sequences with rates closest to the desired values), and selections (selecting the best sequence from the
current generation as the parent for the next).
The process begins with random mutations applied to the bases within the RBS sequence, followed by a checking of
the translation initiation rate. Subsequently, four parents are chosen for a single-point crossover operation,
wherein the RBS sequences are divided at a specific point, and all bases beyond that point are exchanged. This
results in four additional RBS sequences, each of which undergoes rate evaluation. Among these sequences, those
exhibiting rates closest to the desired values are identified as the starting RBS sequences for the next
generation. This iterative process continues until the rate closely approximates the desired target.
[1] Reis, A. C., & Salis, H. M. (2020). An automated model test system for
systematic development and improvement of gene expression models. ACS
Synthetic Biology, 9(11), 3145–3156.
https://doi.org/10.1021/acssynbio.0c00394
[2] Yusupov, M., Cate, J., & Noller, H. F. (2001). The Path of Messenger
RNA through the Ribosome. Cell, 106(2), 233–241.
https://doi.org/10.1016/s0092-8674(01)00435-4
[3] Vellanoweth, R. L., & Rabinowitz, J. C. (1992). The influence of
ribosome-binding-site elements on translational efficiency in Bacillus
subtilis and Escherichia coli in vivo. Molecular Microbiology, 6(9),
1105–1114. https://doi.org/10.1111/j.1365-2958.1992.tb01548.x
[4] Lorenz, R., Bernhart, S. H., Siederdissen, C. H. Z., Tafer, H., Flamm,
C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0.
Algorithms for Molecular Biology, 6(1).
https://doi.org/10.1186/1748-7188-6-26
[5] Shine, J., & Dalgarno, L. (1974). The 3′-Terminal Sequence of
Escherichia coli 16S Ribosomal RNA: Complementarity to Nonsense Triplets
and Ribosome Binding Sites. Proceedings of the National Academy of
Sciences of the United States of America, 71(4), 1342–1346.
https://doi.org/10.1073/pnas.71.4.1342
[6] Komarova, A. V., Tchufistova, L. S., Dreyfus, M., & Boni, I. V.
(2005). AU-Rich Sequences within 5′ Untranslated Leaders Enhance
Translation and Stabilize mRNA in Escherichia coli. Journal of
Bacteriology, 187(4), 1344–1349.
https://doi.org/10.1128/jb.187.4.1344-1349.2005
[7] Sengupta, J., Agrawal, R. K., & Frank, J. (2001). Visualization of
protein S1 within the 30S ribosomal subunit and its interaction with
messenger RNA. Proceedings of the National Academy of Sciences of the
United States of America, 98(21), 11991–11996.
https://doi.org/10.1073/pnas.211266898
[8] Hüttenhofer, A., & Noller, H. F. (1994). Footprinting mRNA-ribosome
complexes with chemical probes. The EMBO Journal, 13(16), 3892–3901.
https://doi.org/10.1002/j.1460-2075.1994.tb06700.x
[9] De Smit, M. H., & Van Duin, J. (2003). Translational Standby Sites:
How Ribosomes May Deal with the Rapid Folding Kinetics of mRNA. Journal of
Molecular Biology, 331(4), 737–743.
https://doi.org/10.1016/s0022-2836(03)00809-x
[10] Hecht, A., Glasgow, J. E., Jaschke, P. R., Bawazer, L. A., Munson,
M., Cochran, J. R., Endy, D., & Salit, M. (2017). Measurements of
translation initiation from all 64 codons in E. coli. Nucleic Acids
Research, 45(7), 3615–3626. https://doi.org/10.1093/nar/gkx070
[11] Borujeni, A. E., & Salis, H. M. (2016). Translation Initiation is
Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism.
Journal of the American Chemical Society, 138(22), 7016–7023.
https://doi.org/10.1021/jacs.6b01453
[12] Salis, H. M., Mirsky, E., & Voigt, C. A. (2009). Automated design of
synthetic ribosome binding sites to control protein expression. Nature
Biotechnology, 27(10), 946–950. https://doi.org/10.1038/nbt.1568