Software

We have built a machine learning model that can predict gene expression levels based on the gene sequence, the ribosome binding site (RBS) flanking it, the prokaryotic chassis it has been cloned in, and the temperature at which the chassis is being cultured. Using the predictions of this model, we have also built a genetic algorithm-based optimizer that can modify the RBS sequence so as to achieve a user-defined desired gene expression level. The details of both these aspects of our modelling are provided on this page.

To train our model, we used a dataset of 1014 mRNA sequences curated by Reis and Salis [1]. We slightly modified this dataset with an additional column describing whether the organism that the mRNA is being expressed in is Gram-positive or Gram-negative.

We performed some minimal pre-processing by converting all sequences to uppercase and replacing thymine (T) with uracil (U). Additionally, we converted the mean expression values (which are given in arbitrary units) to a logarithmic scale since we were interested in the “order of magnitude” of the expression level.

The next step of our modelling was feature engineering, where we used the existing data to find meaningful information that can aid in predicting the expression level.

Feature Engineering

#1 and #2 - 16S rRNA hybridization energy and spacing

The X-crystallographic studies by Yusupov et al. [2] show that the last 8 nucleotides of the 16S ribosomal rRNA bind to the mRNA such that the hybridization energy is minimized. This binding is important for the 30S subunit to dock with the mRNA.

The spacing between the Shine-Dalgarno sequence and the start codon also influences the rate of translation. Any deviation from the optimal spacing will result in a lower efficiency. Vellanoweth and Rabinowitz [3] found that the optimal spacing differs in Gram-positive and Gram-negative bacteria. It is 9 nucleotides in Gram-positives and 7 nucleotides in Gram-negatives. They also found that Gram-positive bacteria are better at translation when the spacing is longer, but Gram-negative bacteria can tolerate shorter spacings better.

To find the location on the mRNA where the rRNA binds, we scanned the mRNA from the 5’ to the 3’ end of its 5’-UTR sequence and computed the binding energy in each 8 nucleotide window. The binding energy was computed using the ViennaRNA library [4]. To account for the effect of spacing, we penalised non-optimal spacing using a Gaussian-like distribution and multiplied the penalty with the binding energy. We took into account the findings of Vellanoweth and Rabinowitz while designing this penalty term. There is potential for this penalty function to be modified. The rRNA will bind to that location on the mRNA where the penalised binding energy is minimum.

We observed that stronger binding leads to higher expression.

The most frequently occurring Shine-Dalgarno sequence appeared to be UAAGGAGG. This was consistent with existing literature [5].

We also saw that spacings close to the optimum give rise to higher expressions, while extreme spacings give rise to lower expressions.

Feature #3 - Interaction with S1 ribosomal protein

Komarova et al. [6] found that the S1 protein present in the ribosome's 30S subunit interacts with A/U-rich sequences upstream of the Shine-Dalgarno site. This interaction is believed to protect the mRNA from degradation by RNases, which also use A/U-rich sequences as recognition sites. The insertion of A/U-rich elements upstream of the Shine-Dalgarno sequence enhances translation efficiency. X-ray crystallographic studies by Sengupta et al. [7] found that the S1 protein interacts with 11 nucleotides upstream of the Shine-Dalgarno sequence.

To quantify this interaction, we simply calculated the number of adenine and uracil nucleotides in the mRNA's putative region of interaction with the S1 protein. We called this number the A/U score.

We saw that mRNAs with higher A/U scores tend to have higher expression levels.

Features #4 and #5 - mRNA folding kinetics and accessibility

The region of the mRNA containing the RBS can transiently fold and unfold. The ribosome can only bind to the RBS if it is accessible. This means that the region surrounding the RBS should be free of secondary structures, or if secondary structures are present, they should be easily removable (i.e., lesser work required to unfold them).

Hüttenhofer and Noller [8] found that the 30S subunit interacts with 54-57 nucleotides of the mRNA. We considered this region, centered around the start codon, while determining mRNA folding energy and accessibility.

We crudely quantified the accessibility by considering the number of paired and unpaired nucleotides in the secondary structure of the region under consideration. We called this the accessibility score.

We saw that having a higher folding energy (i.e., lesser work required for unfolding) also leads to higher expression levels.

We also saw that a higher accessibility score leads to higher expression levels, on average.

Feature #6 - Standby Sites

If the RBS is transiently folded, the ribosome may bind to standby sites upstream of the Shine-Dalgarno sequence [9]. This site has to be easily accessible to the ribosome. Thus, it should have minimal secondary structure formation. The standby site should also not be too far from the Shine-Dalgarno sequence.

We defined the accessibility of the standby site by considering the number of paired and unpaired nucleotides in the standby site. We penalised large gaps between the standby site and the Shine-Dalgarno sequence by dividing the accessibility by the length of the gap.

We observed that more accessible standby sites have a higher expression level than less accessible standby sites.

Feature #7 - Start codon

The start codon in the coding sequence can also influence translation efficiency. Hecht et al. [10] characterized the strengths of the 64 possible start codons. Using their experimental data, we developed a scoring system for the different start codons.

We saw that when the codon score was lower, the expression level was reduced.

Feature #8 - Gram stain

We represented Gram-negative bacteria by ‘0’ and Gram-positive bacteria by ‘1’ to help our model distinguish between the two.

Model Training and Testing

The next step was to actually train and test our model. For this, we performed grid search cross-validation on a Random Forest Regressor to find its best hyperparameters. The grid search showed us that the best combination of hyperparameters for the model were max_depth=80, max_features=3, min_samples_leaf=3, min_samples_split=8, n_estimators=50 for random_state=23 with both the train-test split and the Random Forest Regressor itself.

After finding the best hyperparameters, we divided the dataset into different combinations of training sets and testing sets by setting different random seeds during the train-test split. The size of the testing set was set as 1% of the size of the whole dataset. We trained the fully-tuned model and tested it with the corresponding testing set for 1000 different random seeds. We compared our performance with the existing calculators (RBS Calculator v1.0 [12], v1.1 [13], v2.0 [14], v2.1 [1], RBS Designer [15], UTR Designer [16], EMOPEC [17]) for each of these train-test splits.

On average, the performance of each of these models over the 1000 different testing splits were:

Our model has the second best performance, just slightly below the RBS Calculator v2.1. Something to note is that our model has a much better performance than EMOPEC, which also uses a Random Forest Regressor for its predictions.

We then decided to test our model on another dataset of 16779 mRNA sequences whose expression was quantified by FlowSeq. This dataset was curated by Reis and Salis [1] as well. We compared our performance on this dataset with the other calculators.

Our model has the best performance among those compared, despite the inherent noise present in a FlowSeq experiment.

When we checked what features our model deemed were important to make a prediction, we found that binding energy and folding energy play an important role.

Optimization Algorithm

Using the predictions of our model, we sought to build an optimization algorithm that can provide the “best” RBS sequence for a given gene sequence and desired expression level. We found that a genetic algorithm has the best performance for such an optimization.

The genetic algorithm, an evolutionary algorithm inspired by natural principles, optimizes sequences through a combination of mutations (involving random base changes with a certain probability), crossovers (which merges sequences with rates closest to the desired values), and selections (selecting the best sequence from the current generation as the parent for the next).

The process begins with random mutations applied to the bases within the RBS sequence, followed by a checking of the translation initiation rate. Subsequently, four parents are chosen for a single-point crossover operation, wherein the RBS sequences are divided at a specific point, and all bases beyond that point are exchanged. This results in four additional RBS sequences, each of which undergoes rate evaluation. Among these sequences, those exhibiting rates closest to the desired values are identified as the starting RBS sequences for the next generation. This iterative process continues until the rate closely approximates the desired target.

References

[1] Reis, A. C., & Salis, H. M. (2020). An automated model test system for systematic development and improvement of gene expression models. ACS Synthetic Biology, 9(11), 3145–3156. https://doi.org/10.1021/acssynbio.0c00394

[2] Yusupov, M., Cate, J., & Noller, H. F. (2001). The Path of Messenger RNA through the Ribosome. Cell, 106(2), 233–241. https://doi.org/10.1016/s0092-8674(01)00435-4

[3] Vellanoweth, R. L., & Rabinowitz, J. C. (1992). The influence of ribosome-binding-site elements on translational efficiency in Bacillus subtilis and Escherichia coli in vivo. Molecular Microbiology, 6(9), 1105–1114. https://doi.org/10.1111/j.1365-2958.1992.tb01548.x

[4] Lorenz, R., Bernhart, S. H., Siederdissen, C. H. Z., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1). https://doi.org/10.1186/1748-7188-6-26

[5] Shine, J., & Dalgarno, L. (1974). The 3′-Terminal Sequence of Escherichia coli 16S Ribosomal RNA: Complementarity to Nonsense Triplets and Ribosome Binding Sites. Proceedings of the National Academy of Sciences of the United States of America, 71(4), 1342–1346. https://doi.org/10.1073/pnas.71.4.1342

[6] Komarova, A. V., Tchufistova, L. S., Dreyfus, M., & Boni, I. V. (2005). AU-Rich Sequences within 5′ Untranslated Leaders Enhance Translation and Stabilize mRNA in Escherichia coli. Journal of Bacteriology, 187(4), 1344–1349. https://doi.org/10.1128/jb.187.4.1344-1349.2005

[7] Sengupta, J., Agrawal, R. K., & Frank, J. (2001). Visualization of protein S1 within the 30S ribosomal subunit and its interaction with messenger RNA. Proceedings of the National Academy of Sciences of the United States of America, 98(21), 11991–11996. https://doi.org/10.1073/pnas.211266898

[8] Hüttenhofer, A., & Noller, H. F. (1994). Footprinting mRNA-ribosome complexes with chemical probes. The EMBO Journal, 13(16), 3892–3901. https://doi.org/10.1002/j.1460-2075.1994.tb06700.x

[9] De Smit, M. H., & Van Duin, J. (2003). Translational Standby Sites: How Ribosomes May Deal with the Rapid Folding Kinetics of mRNA. Journal of Molecular Biology, 331(4), 737–743. https://doi.org/10.1016/s0022-2836(03)00809-x

[10] Hecht, A., Glasgow, J. E., Jaschke, P. R., Bawazer, L. A., Munson, M., Cochran, J. R., Endy, D., & Salit, M. (2017). Measurements of translation initiation from all 64 codons in E. coli. Nucleic Acids Research, 45(7), 3615–3626. https://doi.org/10.1093/nar/gkx070

[11] Borujeni, A. E., & Salis, H. M. (2016). Translation Initiation is Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism. Journal of the American Chemical Society, 138(22), 7016–7023. https://doi.org/10.1021/jacs.6b01453

[12] Salis, H. M., Mirsky, E., & Voigt, C. A. (2009). Automated design of synthetic ribosome binding sites to control protein expression. Nature Biotechnology, 27(10), 946–950. https://doi.org/10.1038/nbt.1568

Expression Prediction andRBS Optimization

Feature Engineering

Model Training and Testing

Optimization Algorithm

References

Expression Prediction and
RBS Optimization