Model

"A good model is like a good map: it is not the territory itself, but it can help us to navigate through it." - Albert Einstein

Abstract

Our model is the first to predict plasmid copy number from the origin of replication (Ori) sequence. Based on the model, we designed software that can be used to design Oris with a desired plasmid copy number. For example, this will allow synthetic biologists to tune the expression of genes of interest and thereby optimize protein production with minimal metabolic burden on the cell. Maximizing protein yield can make production more cost-effective, which can benefit many industries. See the Home page for additional applications of our model.

After examining several predictive machine learning models, we chose a voting model that averages two different models, the ensemble learning model CatBoost and a Random Forest, since this combination produced the best results (see the Model selection section).

To train the models, we extracted thousands of sequence-related features for the promoters of priming RNA (RNAp) and inhibitory RNA (RNAi), the two genes encoded in the origin of replication sequence (see the Project Description page for background about RNAp and RNAi).

The features were based on various biophysical and statistical models. They include, among others, promoter binding energy, promoter strength, nucleotide counts, one-hot encoding of nucleotides, sequence motifs, de novo sequence motifs, and various features predicted from the secondary structure of RNA (see the Model Features section). The model achieved accurate results, with a Pearson correlation of up to 0.91 (p-value = \(4.06 \times 10^{-50}\)) and a Spearman correlation of 0.81 (p-value = \(1.42 \times 10^{-30}\)) between the model predictions and a test set that was not used to infer the model (see the Model results section).

We also created a bioinformatics data analysis pipeline for ColE1-like plasmids to gain new insights, from an evolutionary perspective, about the central modulators of plasmid copy number: RNAp, RNAi, and their corresponding promoters. This will enable us to control the copy number in a wider set of hosts in the future (see the Bioinformatics Analysis section). In addition, we used our model as the ground truth for our software tool.

In conclusion, our modeling approach includes data searching, data preparation, feature design, various types of machine learning, biophysical predictions, statistical analysis, and more. The model was validated and is the basis of our software.

Objectives

Our aim is to develop a computational model and software that enable us to design plasmids with a required copy number by manipulating the plasmid's origin of replication (Ori).

By using our model, we can understand how copy number is encoded in the origin of replication; we can use this knowledge to predict the copy number from the Ori sequence and to design new Oris. This will allow synthetic biologists to tune the expression of genes of interest for various objectives, such as alternative proteins for food tech, biosensors, metabolic engineering, and DNA vaccines, by optimizing protein production with minimal metabolic burden on the cell.

A gradient of plasmid copy numbers can be achieved by controlling the interaction between the priming RNA and the inhibitory RNA. To maximize copy number, RNAp should have a high affinity for the DNA strand, and RNAi should be repressed. The amounts of RNAp and RNAi in the host cell are controlled by their promoter sequences, located in the plasmid's Ori (see more details on the Project Description page).

Data

RNA polymerase recognizes DNA sequences through the promoter. In bacteria, promoter identification is mediated by the σ factor subunit of RNA polymerase. The housekeeping σ70 factor of E. coli recognizes two DNA sequence elements: the '−10' hexamer (5′-TATAAT-3′), located at positions -12 to -7, and the '−35' hexamer (5′-TTGACA-3′), located at positions -35 to -30 [1].

In 2022, Rouches et al. [2] generated a library of pUC19 plasmids, each containing a different promoter sequence for either the priming RNA (RNAp) or the inhibitory RNA (RNAi), where the variations are restricted to the promoter region of either RNAp or RNAi. We used these data to model the effects of the promoter sequences on the copy number.

The data contains 366 inhibitory RNA promoter variants and 833 priming RNA promoter variants with their corresponding copy number.

Below is the copy number distribution of RNAp (figure 1) and RNAi (figure 2):

Figure 1: Copy number distribution of RNAp.
Figure 2: Copy number distribution of RNAi.

The variants were edited at positions (-33,-30), (-11,-8), and +1 for priming RNA and (-33,-30), (-10,-7), and +1 for inhibitory RNA.

Table 1: Table of RNAi promoter sequences from Rouches et al. [2].

The original promoter sequence for the priming RNA is 'TTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTT', and for the inhibitory RNA, 'TTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGA'.

According to the data, the RNAp and RNAi promoters were each mutated at 9 positions. We assume these positions were chosen because they correspond to the -10 and -35 hexamers.

The data contain additional features such as growth rate and predicted promoter strength (in units of \(k_BT\)).

Following the model's training, our biology team created additional variants not found in the initial database and measured their copy number. These supplementary parts were used to fine-tune the models' predictions.

Model Features

We extracted thousands of sequence-related features for the RNAp and RNAi promoters, including promoter binding energy, promoter strength, PSSM scores, nucleotide counts, one-hot encoding, motifs, features based on the secondary structure of RNA (see more details in the RNA secondary structure features section), and more. Below we briefly review some of these features.

Motifs

Motif files were downloaded from (1) the DPINTERACT database [3], a comprehensive library of binding sites for E. coli DNA-binding proteins containing 68 motifs between 10 and 50 bp in width (average width 24.5 bp), and (2) the SwissRegulon E. coli database [4], which provides genome-wide annotations of regulatory motifs, promoters, and TF binding sites (TFBSs) in promoter regions across model organisms and contains 97 motifs between 6 and 30 bp in width (average width 20.1 bp).

The score and p-value of each motif were calculated with the FIMO module; sequences with no matching motif were assigned a score of 0 and a p-value of 1.

PSSM

The promoter sequences of the priming RNA and inhibitory RNA corresponding to the top and bottom 20% of copy number values were used to compute position-specific scoring matrices (PSSMs), a commonly used representation of motifs (patterns) in biological sequences, for differentiating between high and low copy number plasmids. The PSSM score of each sequence is calculated based on the high copy number (top 20%) matrix.
The PSSM results are illustrated in the figures:

Figure 3: PSSM for priming RNA.
Figure 4: PSSM for inhibitory RNA.
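A PSSM of this kind can be computed roughly as shown below. This is a minimal sketch under our own assumptions (uniform 0.25 background, pseudocount of 1; function names are ours):

```python
import numpy as np

NUCS = "ACGT"

def build_pssm(sequences, pseudocount=1.0):
    """Log-odds PSSM (vs. a uniform 0.25 background) from equal-length sequences."""
    length = len(sequences[0])
    counts = np.full((4, length), pseudocount)
    for seq in sequences:
        for pos, nuc in enumerate(seq):
            counts[NUCS.index(nuc), pos] += 1
    freqs = counts / counts.sum(axis=0)      # per-position nucleotide frequencies
    return np.log2(freqs / 0.25)

def pssm_score(pssm, seq):
    """Score one sequence: sum of its per-position log-odds values."""
    return sum(pssm[NUCS.index(nuc), pos] for pos, nuc in enumerate(seq))

# high_cn_promoters: promoters from the top 20% of copy number values (hypothetical)
# pssm = build_pssm(high_cn_promoters)
# feature_value = pssm_score(pssm, some_promoter)
```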

Promoter strength

The promoter strength was calculated using an energy matrix (in \(k_BT\) units) taken from Brewster et al. [5]. The matrix covers base pairs [-41:-1], where +1 denotes the transcription start site. Each row corresponds to a given nucleotide and each column to a given position. The promoter strength was calculated for the entire sequence and for the two edited zones of the priming and inhibitory RNA promoters.

Figure 5: RNA polymerase binding energy matrix. Colors indicate the contribution of each basepair to the total binding energy [5].
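Scoring a promoter against such an energy matrix reduces to summing per-position entries. The sketch below assumes the matrix is available as a 4 × 41 NumPy array; `brewster_matrix` is a hypothetical variable whose values must come from [5]:

```python
import numpy as np

NUCS = "ACGT"

def promoter_strength(energy_matrix, seq):
    """Sum the per-position binding energies (in kBT) of a promoter region.

    `energy_matrix` covers positions [-41:-1]: rows A, C, G, T; one column
    per position. Its values are not reproduced here; see Brewster et al. [5].
    """
    assert len(seq) == energy_matrix.shape[1]
    return float(sum(energy_matrix[NUCS.index(nuc), pos]
                     for pos, nuc in enumerate(seq)))

# strength = promoter_strength(brewster_matrix, promoter_41mer)  # hypothetical inputs
```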

One hot encoding features

Each promoter sequence was split into 144 (= 36 × 4) columns, where each column represents a (position, nucleotide) combination whose value is set to 1 if it matches the current sequence and 0 otherwise.
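A minimal sketch of this encoding (function name and column ordering are ours):

```python
import numpy as np

NUCS = "ACGT"

def one_hot(seq):
    """Encode a 36-nt promoter as 144 binary features,
    one per (position, nucleotide) combination."""
    vec = np.zeros(len(seq) * 4, dtype=int)
    for pos, nuc in enumerate(seq):
        vec[pos * 4 + NUCS.index(nuc)] = 1
    return vec
```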

Nucleotide counts features

We used nucleotide count features from an article by Muhammod et al. [6], including:

Feature Name Number of features Feature Description
zCurve 3 features x_axis =(ΣA+ΣG)-(ΣC+ΣT)
y_axis =(ΣA+ΣC)-(ΣG+ΣT)
z_axis =(ΣA+ΣT)-(ΣG+ΣC)
gcContent 1 feature (ΣG+ΣC)/(ΣA+ΣC+ΣG+ΣT) * 100%
ATGC ratio 1 feature (ΣA+ΣT)/(ΣG+ΣC)
Cumulative Skew 2 features GC Skew=(ΣG-ΣC)/(ΣG+ΣC)
AT Skew=(ΣA-ΣT)/(ΣA+ΣT)
Pseudo KNC when –kGap = 2, 20 features Features will be numbers of A, C, G, T, AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT of the whole sequence of DNA respectively.
monoMonoKGap when –kGap = 2, 32 features Features will be numbers of A_A, A_C, A_G, A_T, C_A, C_C, C_G, C_T, G_A, G_C, G_G, G_T, T_A, T_C, T_G, T_T, A__A, A__C, A__G, A__T, C__A, C__C, C__G, C__T, G__A, G__C, G__G, G__T, T__A, T__C, T__G, T__T of the whole sequence of DNA respectively.
monoDiKGap when –kGap = 2, 128 features Feature structure will be X_XX, and X__XX of the whole sequence of DNA respectively.
monoTriKGap when –kGap = 2, 512 features Feature structure will be X_XXX, and X__XXX of the whole sequence of DNA respectively.
diMonoKGap when –kGap = 2, 128 features Feature structure will be XX_X, and XX__X of the whole sequence of DNA respectively.
diDiKGap when –kGap = 2, 512 features Feature structure will be XX_XX, and XX__XX of the whole sequence of DNA respectively.
diTriKGap when –kGap = 2, 2048 features Feature structure will be XX_XXX, and XX__XXX of the whole sequence of DNA respectively.
triMonoKGap when –kGap = 2, 512 features Feature structure will be XXX_X, and XXX__X of the whole sequence of DNA respectively.
triDiKGap when –kGap = 2, 2048 features Feature structure will be XXX_XX, and XXX__XX of the whole sequence of DNA respectively.
Table 2: Nucleotide counts features and their description.
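As an illustration, the simple counting features in the first rows of Table 2 can be computed as follows (function name is ours):

```python
def composition_features(seq):
    """zCurve, GC content, ATGC ratio, and cumulative skew features from Table 2."""
    a, c, g, t = (seq.count(n) for n in "ACGT")
    return {
        "zcurve_x": (a + g) - (c + t),
        "zcurve_y": (a + c) - (g + t),
        "zcurve_z": (a + t) - (g + c),
        "gc_content": 100.0 * (g + c) / (a + c + g + t),
        "atgc_ratio": (a + t) / (g + c),
        "gc_skew": (g - c) / (g + c),
        "at_skew": (a - t) / (a + t),
    }
```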

dG total and TX rate

\(\Delta G_{total}\) is the difference in free energy between an unbound promoter and the complex of the promoter, RNA polymerase, and the σ70 factor subunit of RNA polymerase with a stable transcriptional bubble. It is used to predict a promoter's transcription (TX) rate relative to a reference promoter sequence with a calculated \(\Delta G_{total,ref}\) and a measured \(TX_{ref}\). \(\Delta G_{total,ref}\) was arbitrarily set to zero, so that stronger, more favourable interaction free energies have more negative values compared with the reference promoter, which has a low transcription rate [7].

Figure 6: The interaction strengths between RNA polymerase (including its σ70 factor subunit) and promoter DNA [7].

Entropy

Entropy measures the variability of nucleotides at a given position. It is a way to assess how conserved or variable a position is among the sequences. In our case, the data are promoter sequences, and entropy is most relevant for the positions where the researchers inserted mutations.
First, based on our sequences (all of the same length), we compute the probability of each nucleotide appearing at each index (for each index we calculate the ratio \(\dfrac{\text{count of nucleotide } N}{\text{number of sequences}}\), so that we get four ratios per index, one for each nucleotide). We use these probabilities to calculate the entropy per sequence as follows:

$$S = -\frac{\sum_{i=0}^{n} p_i \log_2(p_i)}{2}$$

Where \(p_i\) is the probability, calculated as above, of the nucleotide observed at index \(i\). We divide the value by 2 for normalization, since 2 bits is the maximum entropy of a four-letter alphabet [8].
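A minimal sketch of this computation, assuming all sequences have equal length (function names are ours):

```python
import numpy as np

NUCS = "ACGT"

def position_probabilities(sequences):
    """Per-index probability of each nucleotide across equal-length sequences."""
    probs = np.zeros((4, len(sequences[0])))
    for seq in sequences:
        for pos, nuc in enumerate(seq):
            probs[NUCS.index(nuc), pos] += 1
    return probs / len(sequences)

def sequence_entropy(probs, seq):
    """S = -(1/2) * sum_i p_i * log2(p_i), with p_i the dataset-wide probability
    of this sequence's nucleotide at index i (always > 0 for observed sequences)."""
    p = np.array([probs[NUCS.index(nuc), pos] for pos, nuc in enumerate(seq)])
    return float(-np.sum(p * np.log2(p)) / 2.0)
```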

De Novo Motif Detection

We wanted to search for de novo motifs that appear at higher rates in groups of promoters with a high copy number than in a reference group of promoters with a low copy number, and vice versa, and then use those motifs' scores as features for a different set of promoters to avoid overfitting.

To do so, we have used HOMER (Hypergeometric Optimization of Motif EnRichment), which is a suite of tools for Motif Discovery and next-gen sequencing analysis. It is a collection of command-line programs. HOMER was primarily written as a de novo motif discovery algorithm and is well suited for finding 8-20 bp motifs in large-scale genomics data [9].
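As an illustration, HOMER's fasta-mode de novo discovery could be invoked as below. The file names and motif lengths are hypothetical, and the exact options we used are not listed here:

```python
import subprocess

# findMotifs.pl in fasta mode: target sequences vs. an explicit background set.
# Here, high copy number promoters are the targets and low copy number
# promoters the background (file names are illustrative).
subprocess.run(
    ["findMotifs.pl", "high_cn_promoters.fa", "fasta", "homer_out/",
     "-fasta", "low_cn_promoters.fa", "-len", "8,10,12"],
    check=True,
)
```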

Data preparation

As part of our analysis, we divided the data into three sets: a 70% training set, a 15% validation set, and a 15% test set. During this process, we ensured that the copy number distribution remained consistent with the original dataset.

Figure 7: Pie diagram of the data split.

Since the variants were designed to carry mutations in only one promoter, we trained one model to predict copy number from RNAp promoter variants and another from RNAi promoter variants.
Numeric features were scaled using a standard scaler, and zero-variance features were removed.
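A sketch of this split and preprocessing, assuming `X` and `y` hold the features and copy numbers; stratifying on quantile bins of the continuous target is our assumed mechanism for keeping the distributions consistent:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Bin the continuous copy number into quantiles so stratified splitting
# preserves its distribution across all three sets.
bins = pd.qcut(y, q=10, labels=False)
X_train, X_tmp, y_train, y_tmp, b_train, b_tmp = train_test_split(
    X, y, bins, test_size=0.30, stratify=bins, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=b_tmp, random_state=0)

# Drop zero-variance features, then standardize; both fit on the training set only.
selector = VarianceThreshold(0.0).fit(X_train)
scaler = StandardScaler().fit(selector.transform(X_train))
X_train_p = scaler.transform(selector.transform(X_train))
X_val_p = scaler.transform(selector.transform(X_val))
X_test_p = scaler.transform(selector.transform(X_test))
```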

Engineering cycle insights

The entire model pipeline for the RNAp data (model selection, feature selection, hyperparameter optimization), as shown on this page, was initially performed on the data from the article [2] and tested on our biological results, producing poor correlations. Hence, we contacted the author of the article to understand where our model fell short, since it performed very well on the test set. We learned that between the copy number published in the article's data and the actual measurements there is an additional step that fits the relative copy number, the direct result of the biological measurement, to a final copy number. With this new knowledge, we decided to improve the fitting of the relative copy number to the actual copy number measurement, and we implemented our own fitting function based on about 30% more data points. The final model shown here is the one trained on the updated data, which achieved better correlations with the copy number measurements we saved for validation purposes (see more details on the Engineering Cycle page).

Model selection

We had four groups of candidate models: neural networks, linear regression models, Random Forest, and ensemble learning models. We wanted to choose a model that met several criteria: model performance, model complexity (we want the model to be as explainable as possible), and prevention of overfitting.

Neural Networks

A Neural Network is a machine-learning technique that is composed of layers of interconnected processing units, with input, hidden, and output layers. Each unit is connected to the units in adjacent layers, with each connection having a weight value. The weights are estimated with a backpropagation process, where they are adjusted in order to minimize the average classification or regression error. Inputs are multiplied by these weights and summed at each unit, and then passed through an activation function, such as a sigmoid, tanh, or rectified linear unit (ReLU), before being passed to the next layer [10].

Neural networks excel at complex tasks, but they are highly complex and can easily overfit if not regularized. They are challenging to interpret due to their "black-box" nature.

Linear Regression models

Linear regression is widely used in predictive analysis when the prediction target can be quantified and modeled using multiple input variables. It identifies linear relationships between dependent and independent variables by analyzing and learning from existing training data [11].

Linear regression models have low to moderate performance but are straightforward and easy to understand. They have low complexity, low risk of overfitting, and high interpretability via coefficient analysis.

Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression model that performs feature selection and regularization, making it a powerful tool for predicting outcomes in high-dimensional datasets.
The lasso model uses L1 regularization, which adds a penalty term equal to the absolute value of the coefficients of the regression parameters in order to shrink the coefficients of less important features to zero. This results in a more parsimonious model that only includes the most relevant features in the final prediction while also improving the model's performance by reducing overfitting.

Ridge regression

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated [12]. It is especially useful for mitigating multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters [13]. Generally, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias [14].

Elastic Net

The elastic net is a regularization and variable selection method. The elastic net is particularly useful when the number of predictors is much bigger than the number of observations [15]. The elastic net is a regularization method that linearly combines L1 and L2 penalties, which are used in LASSO and ridge regression methods respectively [16].

Random Forest model

The random forest algorithm works by constructing a multitude of decision trees on different subsets of the training data. Each decision tree is constructed using a random subset of the features, and at each split in the tree, a random subset of the features is considered. This process of feature randomness helps to reduce overfitting and improve the generalization performance of the model.

Once the individual decision trees have been constructed, the random forest algorithm makes predictions by averaging the predictions of the individual trees. In the case of regression tasks, the average of the individual predictions is used to determine the final prediction [17].

Random forests offer a good balance between performance and complexity. They are moderately complex and less prone to overfitting. Their interpretability is moderate through feature importance scores.

Ensemble Learning models

Ensemble learning is a technique that uses the "wisdom of the crowd" principle and builds a strong predictive model by combining the predictions of multiple weak models, typically decision trees.
There are two main classes of ensemble learning methods:

  • Bagging method builds models in parallel using a random subset of data (sampling with replacement).
  • Boosting method builds “weak” models in sequence using the whole data, with each model improving on the previous model’s error.

Gradient Boosting Trees, CatBoost, and XGBoost are all variations of gradient boosting algorithms. All of them build decision trees that are trained sequentially, using the error of the previous tree to adjust the learning and eventually minimize the loss function. Gradient boosting is primarily used to reduce the bias error of the model. In terms of the bias-variance tradeoff, it is a greedy algorithm that can quickly overfit a training dataset. However, this overfitting can be controlled by shrinkage (learning rate), tree constraints (pruning), and regularization, which are implemented in the CatBoost and XGBoost models [18].

Ensemble learning models combine multiple learners to achieve high performance. They generally have moderate complexity, a moderate risk of overfitting, and moderate interpretability through feature importance.

Gradient Boosting Trees

In each iteration, a tree is built using all the features, based on the residual errors of the previous tree. Each tree in the ensemble is shrunk by multiplying it by the learning rate (\(\eta\)), which ranges between 0 and 1.
The final prediction is given by the formula:

$$y_{\text{pred}} = y_1 + \eta r_1 + \eta r_2 + \dots + \eta r_N$$

where \(y_1\) is the prediction of the first tree, \(r_1\) is the prediction of the second tree (trained on the residuals of the first), and so on. This process is illustrated in Figure 8.

Figure 8: Illustration of ensemble models, as detailed in the Gradient Boosting Trees section above. Each gray box represents a tree built in the ensemble model. The error residuals of one tree are used to train the next tree. The final prediction is the sum of the trees' predictions, each multiplied by \(\eta\) (the learning rate).
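To make the formula above concrete, here is a minimal squared-error boosting loop. It initializes with the mean rather than a first tree; names and defaults are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Minimal squared-error gradient boosting: each tree fits the residuals
    of the current ensemble and contributes eta times its prediction."""
    pred = np.full(len(y), float(np.mean(y)))   # initial prediction
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - pred)
        pred += eta * tree.predict(X)           # y_pred = y1 + eta*r1 + eta*r2 + ...
        trees.append(tree)
    return float(np.mean(y)), trees

def boosting_predict(base, trees, X, eta=0.1):
    return base + eta * sum(tree.predict(X) for tree in trees)
```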

XGBoost

XGBoost is based on gradient boosting trees and additionally includes a regularization term and tree pruning to prevent overfitting.

Cat-Boost

CatBoost is based on gradient boosting trees. It adapts the learning rate during training and incorporates built-in cross-validation and other techniques to mitigate overfitting.

Property | Neural network | Linear regression models | Ensemble learning models | Random Forest
Model performance | High | Low to moderate | Moderate to high | Moderate to high
Model complexity | High | Low | Moderate to high | Moderate
Model overfitting | High | Low | Moderate to low | Moderate
Model interpretability | Low | High | Moderate | Moderate
Table 3: A summary of the models’ properties.

“The best approach […] is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.” [19]

After splitting our data, in order to find the best model, we performed hyperparameter tuning for each model using Optuna [20] (see more details in the hyperparameters tuning section) and evaluated the R-squared and MAE on the train and validation sets. This approach allowed us to examine both model performance and model overfitting.

These are the results for the RNAp model:

Figure 9: RNAp model exploration, showing Pearson and Spearman correlation results on the train vs. validation sets (columns can be hidden and shown by clicking on the labels).

These are the results for the RNAi model:

Figure 10: RNAi model exploration, showing Pearson and Spearman correlation results on the train vs. validation sets (columns can be hidden and shown by clicking on the labels).

In light of these results, we decided to proceed to the next steps, feature selection and hyperparameter optimization, with the following models: XGBoost, CatBoost, LightGBM, and Random Forest.

Evaluation methods

R squared

In order to evaluate the models' performance, the R-squared score was used. It represents the proportion of the variance (of y) explained by the independent variables in the model. It provides an indication of goodness of fit and, therefore, a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

The estimated R-squared is defined below where y is the real value and \(\hat{y}\) is the predicted value:

$$R^2 (y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where \(\bar{y}\) = \(\frac{1}{n} \sum_{i=1}^{n} y_i\) and \( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \) = \( \sum_{i=1}^{n} \epsilon_i^2 \)

Correlations

Additionally, we used Spearman and Pearson correlations. Spearman assesses how well the relationship between two variables can be described using a monotonic function, and Pearson is a measure of linear correlation between two sets of data.

MAE

We also calculated the mean absolute error (MAE). MAE is a measure of the errors between paired observations expressing the same phenomenon; in our case, pairs of predicted and observed copy numbers. MAE is calculated as the sum of absolute errors divided by the sample size:

$$MAE = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} = \frac{\sum_{i=1}^{n} |e_i|}{n}$$

Confidence Intervals

In order to assess the accuracy of our regression models' predictions, we used confidence intervals.

Confidence intervals are a useful way to quantify the uncertainty in predictions from statistical models. They are characterized by two elements: an interval \([x_l, x_u]\) and a confidence level C.

The confidence level C is the probability that the true value of the parameter of interest lies within the interval \([x_l, x_u]\). For example, a 95% confidence interval means that we are 95% confident that the true value of the parameter lies within the interval.

We obtained these confidence intervals by training two models, one for the upper bound and one for the lower bound, using a quantile regression objective [21]. Quantile regression models the quantiles of a response variable rather than its mean; the quantiles of a response variable are the values below which a certain percentage of the observations fall. The most important hyperparameter of the quantile objective is alpha, a value between 0 and 1 that controls the relative importance of underpredictions and overpredictions; it represents the quantile of the target variable that the model is trying to predict. The quantile regression loss can be described as a scaled and rotated MAE function.

While CatBoost incorporates a native quantile regression objective, XGBoost lacks one because the quantile loss is non-smooth. To address this gap, we followed [21] to implement a quantile regression objective for XGBoost. For more information, refer to [21].
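A minimal sketch of the CatBoost side, using its native quantile objective to bracket a 90% interval. Hyperparameters are omitted, and `X_train`, `y_train`, `X_test`, `y_test` are assumed to exist:

```python
from catboost import CatBoostRegressor

# Two quantile models bracket a 90% interval (alpha = 0.05 and 0.95).
lower = CatBoostRegressor(loss_function="Quantile:alpha=0.05", verbose=False)
upper = CatBoostRegressor(loss_function="Quantile:alpha=0.95", verbose=False)
lower.fit(X_train, y_train)
upper.fit(X_train, y_train)

y_lo, y_hi = lower.predict(X_test), upper.predict(X_test)
coverage = ((y_test >= y_lo) & (y_test <= y_hi)).mean()  # fraction of points inside
```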

Feature selection

In our research workflow, feature selection was a crucial step in enhancing the effectiveness of our machine learning models. First, we calculated the mutual information (MI) between each feature and the copy number, selecting as candidates those with MI values greater than the mean MI. Next, we assessed the MI between the remaining features, identifying pairs of features with correlations in the top percentile. For each correlated pair, we kept the feature with the higher MI with the copy number, ensuring the retention of the most informative attributes. A sketch of this procedure appears below.
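For tractability, the sketch approximates the feature-feature redundancy screen with absolute Pearson correlation rather than pairwise MI, and the percentile cutoff shown is illustrative; `X` is assumed to be a NumPy feature matrix:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Step 1: keep features whose MI with the copy number exceeds the mean MI.
mi = mutual_info_regression(X, y, random_state=0)
keep = np.where(mi > mi.mean())[0]

# Step 2: among redundant pairs, drop the member with the lower MI to the target.
corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
np.fill_diagonal(corr, 0.0)
cutoff = np.percentile(corr, 99)            # "top percentile" of pairwise similarity
dropped = set()
for i in range(len(keep)):
    for j in range(i + 1, len(keep)):
        if corr[i, j] >= cutoff:
            dropped.add(i if mi[keep[i]] < mi[keep[j]] else j)
selected = [keep[i] for i in range(len(keep)) if i not in dropped]
```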

Finally, to further refine our feature set and optimize model interpretability, we applied the Boruta-SHAP approach.

Boruta is an algorithm for selecting relevant features in a dataset, primarily designed for use with Random Forests [22]. It falls under the category of "wrapper" methods, as it employs a machine learning model to retain only the features that are pertinent to the model's learning objective.

Boruta-SHAP is a feature selection algorithm that combines the Boruta algorithm with SHAP values to select important features.

Boruta is a tree-based feature selection algorithm that works by creating random shadow features (noise) and comparing the importance of the original features to that of the shadow features. Features that are more important than the shadow features are kept, while the rest are removed.

SHAP [23] values are a way to measure the individual impact of each feature on the prediction of a model. SHAP values are calculated by considering all possible combinations of features and their impact on the prediction.

In Boruta-SHAP, SHAP values are used to rank the importance of the features; the features with the highest SHAP values are then selected by the Boruta algorithm.

Boruta-SHAP has several advantages over other feature selection algorithms, including:

  • It is robust to noise and outliers.
  • It can handle high-dimensional data.
  • It can select features that are important for both linear and non-linear models.
  • It can select features that are important for both classification and regression tasks.

By integrating this feature selection process, we obtained a comprehensive understanding of feature importance. This final selection of features, chosen based on their impact on model predictions, allowed us to train a more efficient and interpretable machine learning model, thereby advancing the quality and relevance of our research outcomes.

Features that survive these feature selection steps are considered the most important for the model's performance and are retained for training the final machine learning model.

Exploratory Data Analysis for the selected features

Below are the features chosen and their importance for both the Cat-Boost (figure 11) and Random Forest (figure 12) models for the RNAp model.

Figure 11: Priming RNA feature importance for the CatBoost model; the higher the feature importance score (Gini score), the higher the feature's contribution to the predicted copy number.
Figure 12: Priming RNA feature importance for the Random Forest model; the higher the feature importance score (Gini score), the higher the feature's contribution to the predicted copy number.

Below are the features chosen and their importance for Cat-Boost (figure 13) model for the RNAi model.

Figure 13: Inhibitory RNA feature importance for the CatBoost model; the higher the feature importance score (Gini score), the higher the feature's contribution to the predicted copy number.

The top 10 most important features used in each chosen model were analyzed in order to understand the models and the way the features affect copy number. Below are the results showing the chosen features with their corresponding copy numbers and distributions, for the Random Forest model of the RNAp model (figure 14) and for the CatBoost model of the RNAi model (figure 15).

Figure 14: The features chosen with their corresponding copy number and the features distributions for the Random Forest model for the RNAp model.
Figure 15: The features chosen with their corresponding copy number and the features distributions for the Cat-Boost model for the RNAi model.

One interesting conclusion from the analysis above relates to the feature dG_total: a specific range (not too high, not too low) of this free energy feature contributes to a higher copy number.

Hyperparameters tuning using Optuna

Optuna is a powerful and versatile Python library for hyperparameter tuning that has gained significant attention in machine learning and optimization. Optuna automates the search for optimal hyperparameter configurations using Bayesian optimization; each trial employs a different set of hyperparameter values and evaluates the R-squared score on the validation set [20]. In this way, the best set of hyperparameters for our data is selected for each model. A sketch of such a study appears below.
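This minimal sketch tunes a CatBoost regressor; the search space shown is illustrative, not the one we actually tuned:

```python
import optuna
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score

def objective(trial):
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "iterations": trial.suggest_int("iterations", 200, 2000),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
    }
    model = CatBoostRegressor(**params, verbose=False).fit(X_train, y_train)
    return r2_score(y_val, model.predict(X_val))   # maximize validation R-squared

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
best_params = study.best_params
```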

Here is a glimpse of our hyperparameter optimization study.

Figure 16: Distribution of the objective score (R squared) over the hyperparameter optimization search for the Random Forest model of RNAp.
Figure 17: Distribution of the objective score (R squared) over the hyperparameter optimization search for the CatBoost model of RNAi.

Model results

The performance of our models on the test set from the data of [2] appears in figures 18 (RNAp promoter) and 19 (RNAi promoter). As can be seen, in both cases we obtained a very high correlation between the model predictions and the measurements, demonstrating the model's ability to predict the copy number of plasmids that were not used in its training.

For RNAp, the chosen model is a voting model in which the final prediction is the average of the predictions of the ensemble learning model CatBoost and the Random Forest. We chose this model as it tends to give both high Spearman and high Pearson correlations, which were higher than those of the other models.

Figure 18: RNAp results showing actual copy number vs. predicted copy number, with a Pearson correlation of 0.91 (p-value = \(4.06 \times 10^{-50}\)) and a Spearman correlation of 0.81 (p-value = \(1.42 \times 10^{-30}\)).

For RNAi, the chosen model is the ensemble learning model CatBoost. We chose this model as it had the highest Spearman and Pearson correlations.

Figure 19: RNAi results showing actual copy number vs. predicted copy number, with a Pearson correlation of 0.79 (p-value = \(5.13 \times 10^{-13}\)) and a Spearman correlation of 0.82 (p-value = \(1.62 \times 10^{-14}\)).

Confidence Intervals

Below are the confidence interval results of our models:

Figure 20: More than 62.70% of the RNAp observations fall within the 90% confidence interval.

RNA secondary structure features

The model based on inhibitory RNA promoters produced less accurate results than the model based on priming RNA promoters. Therefore, we decided to add features that could help improve it. The inhibitory RNA (RNAi, or RNA I) and its promoter are complementary to part of the priming RNA (RNAp, or RNA II) (figure 21). Thus, a mutation in the promoter of RNAi causes a mutation in RNAp itself and can alter its secondary structure. We therefore set out to generate features that describe the RNAp secondary structure and apply them to the model based on the inhibitory RNA promoters.

Figure 21: Diagram of RNAi and RNAp showing that RNAi and its promoter are complementary to a part of RNAp.

The replication of the ColE1 plasmid family, including pUC19, is a highly regulated process coordinated by the interplay of two pivotal RNA molecules: RNAi, transcribed from the antisense promoter P1, and RNAp, known as the pre-primer RNA, transcribed from the sense promoter P2 [24]. This regulation hinges on a remarkable phenomenon, the persistence of RNAp on the DNA template, an essential prerequisite for initiating replication.
This persistence is made possible through a specific interaction: a G-rich loop within the RNAp transcript, situated 265 nucleotides upstream of the origin, pairs with a C-rich stretch on the DNA template strand positioned just 20 nucleotides upstream of the origin [25]. The balance between two distinct RNAp conformations plays a pivotal role in determining the percentage of transcripts capable of forming the persistent hybrid necessary for DNA synthesis initiation [25].

RNAp beta (β) sequences can pair either with a complementary upstream alpha (α) sequence or with an equivalent downstream gamma (γ) sequence. After approximately 120 nucleotides have been transcribed by RNA polymerase, the nascent RNAp transcript folds into a secondary structure composed of three stem-loop domains, I, II, and III. However, this conformation is only temporary: the polymerization of several more nucleotides disrupts stem-loop III and creates an extended stem-loop, IV, that is stabilized by hydrogen bonding between the alpha and beta sequences. As transcription progresses, two structures are possible. Switching from alpha-beta to beta-gamma pairing perturbs the RNAp conformation, determining whether the transcript is competent for RNA-DNA hybridization and primer formation. RNA molecules in the alpha-beta conformation form the persistent hybrid necessary for DNA synthesis initiation, while transcripts in the beta-gamma conformation do not [25].

Figure 22: Illustration of the genetic map of the region essential for the initiation of ColE1 DNA replication, RNAi and RNAp. Below are the two alternative conformations of RNAp, alpha-beta and beta-gamma [25].
Figure 23: Illustration of the model explaining the formation of the persistent hybrid between RNAp and the DNA template at the origin [25].

The RNAi conformation consists of three stem-loops and an unpaired 5' region nine nucleotides long. RNAi binds to RNAp by complementary base pairing. The interaction between RNAi and RNAp affects the folding of RNAp such that beta-gamma pairing is favored. In vitro, RNAi affects primer formation only when added within a short period after the initiation of RNAp transcription [25].

Figure 24: Illustration showing the interaction between RNAi and RNAp that leads to the formation of a hybrid molecule that prevents the formation of the alpha-beta structure [25].
Figure 25: The ColE1 ORI sequence is presented as originally described [24].

We used the sequences of RNAi and RNAp as found in the pUC19 plasmid sequence obtained from Addgene (Plasmid #50005) [26], and the mutations of the RNAi promoter as described by Rouches et al. [2]. The variants were edited at 9 positions, (-33,-30), (-10,-7), and +1 in the RNAi promoter, which correspond to positions 112, (118,121), and (142,145) in the RNAp sequence. It should be noted that the RNAi promoter found in pUC19, which is complementary to RNAp, is included in the data containing the RNAi promoter variants and their copy numbers.

We used the Python package ViennaRNA [27], which includes the tools RNAfold and RNAeval. RNAfold calculates the minimum free energy (MFE) secondary structure of an RNA and the centroid secondary structure, which is the secondary structure with minimal base pair distance to all other secondary structures in the Boltzmann ensemble. RNAfold also calculates the probability of each base to bond with another base along the RNA sequence; these probabilities are presented as colors on the RNA secondary structure images. RNAeval calculates the free energy of a given secondary structure from the free energy of each base pair bond. Our aim is to generate features that describe the RNAp secondary structure and characterize the alpha, beta, and gamma sequences, based on the mutations in the RNAi promoter. A sketch of these calls appears below.
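The sketch assumes `rnap_sequence` is a variable holding the RNAp sequence; the calls follow the ViennaRNA Python API:

```python
import RNA  # ViennaRNA Python bindings

seq = rnap_sequence[:200]        # e.g. the first 200 nt, capturing alpha-beta pairing
fc = RNA.fold_compound(seq)

mfe_structure, mfe = fc.mfe()    # minimum free energy structure
fc.pf()                          # partition function, required before centroid()/bpp()
centroid_structure, _ = fc.centroid()   # centroid of the Boltzmann ensemble
bpp = fc.bpp()                   # base-pairing probability matrix (1-indexed)

# RNAeval equivalent: free energy of an arbitrary structure for this sequence
energy = fc.eval_structure(mfe_structure)
```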

We calculated the RNA secondary structure of RNAp as found in the pUC19 plasmid using RNAfold, obtaining the two secondary structures discussed above. We calculated the secondary structure of RNAp at several lengths: 108 bases (the length of RNAi), 200 bases (to capture the alpha-beta structure), and 553 bases (the full length of RNAp).

Below are some examples of the secondary structure of RNAp and RNAi for illustration and verification of the structures described in the literature.

The MFE secondary structure of RNAp as found in pUC19 at full length is shown on the right (figure 27), and the centroid secondary structure on the left (figure 26).

Figure 26: The centroid secondary structure of RNAp as found in pUC19 at full length.
Figure 27: The MFE secondary structure of RNAp as found in pUC19 at full length.




The MFE secondary structure of RNAp as found in pUC19 at a length of 108 bases, with annotations of stem-loops 1, 2, and 3, is shown on the right (figure 29). In addition, the centroid secondary structure is shown on the left (figure 28).

Figure 28: The centroid secondary structure of RNAp as found in pUC19 at a length of 108 bases.
Figure 29: The MFE secondary structure of RNAp as found in pUC19 at a length of 108 bases.




The MFE secondary structure of RNAp as found in pUC19 at a length of 200 bases, with annotations of the alpha sequence, beta sequence, -10 hexamer mutations, -35 hexamer mutations, and the +1 mutation, is shown on top (figure 30). In addition, the centroid secondary structure is shown below (figure 31).

Figure 30: The MFE secondary structure of RNAp as found in pUC19 at a length of 200 bases.
Figure 31: The centroid secondary structure of RNAp as found in pUC19 at a length of 200 bases.

Comparison with the ColE1 ori sequence discussed above shows that, in both structures at a length of 200 bases, the alpha-beta pairing conformation and stem-loop IV can be found. This structure is associated with primer formation and corresponds to the high copy number of the RNAi promoter that is complementary to this RNAp.

The MFE secondary structure of RNAi as found in pUC19, with annotations of stem-loops 1, 2, and 3, is shown on the right (figure 33). In addition, the centroid secondary structure is shown on the left (figure 32).

Figure 32: The centroid secondary structure of RNAi as found in pUC19.
Figure 33: The MFE secondary structure of RNAi as found in pUC19.




The MFE secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at full length, is shown on the right (figure 35), and the centroid secondary structure on the left (figure 34).

Figure 34: The centroid secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at full length.
Figure 35: The MFE secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at full length.





The MFE secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at a length of 200 bases with annotations of alpha sequence, beta sequence, -10 hexamer mutations, -35 hexamer mutations and the +1 mutation, is shown on top (figure 36). In addition, the centroid secondary structure is shown below (figure 37).

Figure 36: The MFE secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at a length of 200 bases.
Figure 37: The centroid secondary structure of RNAp corresponding to the RNAi promoter with the highest copy number at a length of 200 bases.

We developed many features describing the RNAp secondary structure based on the RNAfold and RNAeval calculations. The features are computed for a range of RNAp lengths in order to describe its structure during transcription:

Feature Group Name Number of Features Feature Description
Base pairs match ratio 22 features We calculate the following ratio: \(\dfrac{\text{number of paired bases in a specific area}}{\text{number of paired bases in the pUC19 sequence in this area}}\).
We get the paired bases from RNAfold's output.
Specific features examples:
  • alpha_beta_match_seq_end_130 – it means we used the first 130 bases of the RNA sequences and checked alpha-beta area.
  • alpha_beta_extended_match_seq_end_400 – it means we used the first 400 bases of the RNA sequences and checked extended alpha-beta area.

Note: we used varied lengths of RNA because during its formation it changes shapes, and it can be interesting to see the difference in structures in different stages.

Base pairing probabilities 22,877 features RNAFold can output in an additional file the probability of each nucleotide to pair with a nearby nucleotide. We parsed the file and took each such probability as a feature.
Specific feature example:
  • seq_end_400_375_385 – it means we used the first 400 bases of the RNA sequences and checked the probability of the nucleotide in index 375 to pair with nucleotide in index 385.
MFE and Centroid comparison 1,730 features RNAfold can predict the secondary structure of a given RNA sequence using 'MFE' and 'centroid'. We compared how similar the two predictions are; RNAfold's documentation states that the more similar they are, the more reliable the prediction, so we thought this could be a good feature as well.
The features indicate whether base pairs at different indexes exist in both predicted structures.
Specific feature example:
  • seq_end_180 mfe == centroid 118_141 – it means that for each RNA sequence we used the first 180 bases as input to RNAfold and got two predicted secondary structures, MFE and centroid; for each base pair we check whether it appears in both structures, in this case the pair of nucleotides at indexes 118 and 141.
Stem loops MFE 20 features Another output we can extract is the minimal free energy (MFE) of specific areas in the secondary structure (using RNAFold and then RNAeval – another tool in ViennaRNA package).
Specific feature example:
  • mfe_seq_end_300_sl_IV – it means we used the first 300 bases of the RNA sequences and calculated MFE for stem loop IV.
Base pairing of specific areas 6 features We checked how many bases are paired in specific areas in the predicted secondary structure of the RNA sequences.
Specific feature example:
  • base_pairing_mfe_alpha_range – it means we checked in the MFE predicted structure how many bases are paired in the alpha area.
RNA form representation 36 features We tried to represent the form using some basic characteristics of the secondary structure output. We calculated:
  • Number of unpaired bases
  • Number of ‘arcs’ – a term we use for continuous sequence of unpaired bases that form an open ended circular shape.
  • Number of ‘interior loops’ – a term we use for continuous sequences of unpaired bases that form an interior loop shape (similar to the tip of RNA hairpin)
Specific features examples:
  • unpaired_cnt_seq_end_82 - counts number of bases that don’t pair with other bases in the first 82 bases of the RNA sequences.
  • arc_cnt_seq_end_102 – counts how many open ended circles we find in the first 102 bases of the RNA sequences.
  • interior_loop_cnt_seq_end_172 – counts how many interior loops we find in the first 172 bases of the RNA sequences.
RNA form in Window 28 features These features include: mfe, unpaired bases count, ‘arcs’ count and ‘interior loops’ count but in local windows of size 70. Since the bonds of RNA’s secondary structure are usually local it makes sense to check these features in small windows.
Specific features examples:
  • mfe_window_130_200 – calculates MFE of RNA between index 130 to 200.
  • unpaired_cnt_window_120_190 – counts number of bases that don’t pair with other bases in the secondary structure of RNA between index 120 to 190.
  • arc_cnt_window_110_180 – counts how many open ended circles we find in the window (here between index 110 to 180).
  • interior_loop_cnt_window_100_170 – counts how many interior loops we find in the window (here between 100 to 170).
Table 4: Features describing RNAp secondary structure.

SHAP explainability

SHAP [23] (explainable AI with Shapley values) is a model-agnostic method for interpreting the impact of individual features on model predictions. It assigns a Shapley value to each feature, quantifying its contribution to the model's output for a specific instance. Features with high absolute Shapley values have a substantial influence on predictions.

We used SHAP to evaluate the chosen features and their effect on the model's decision process. In this way, we can determine which features are critical and which values of each feature contribute to a high copy number. A minimal usage sketch appears below.
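The sketch assumes a trained tree model (`model`), a pandas test matrix (`X_test`), and a row index of interest (`idx`):

```python
import shap

explainer = shap.TreeExplainer(model)           # trained CatBoost / Random Forest
shap_values = explainer.shap_values(X_test)

# Per-sequence breakdown, e.g. for the sequence with the maximal copy number
shap.force_plot(explainer.expected_value, shap_values[idx], X_test.iloc[idx])

# Global view: mean absolute SHAP value per feature
shap.summary_plot(shap_values, X_test, plot_type="bar")
```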

The graphs below show the SHAP value of each feature chosen during the feature selection process, for both the RNAp and RNAi models. This is done for the sequence with the maximal copy number (figures 38 and 39), the sequence with the minimal copy number (figures 40 and 41), and the sequence with the median copy number (figures 42 and 43). Each graph shows every feature's contribution to the copy number relative to the mean copy number.

It can be seen that for the sequence with the maximal copy number, the feature values induce a higher copy number. Conversely, for the sequence with the minimal copy number, the feature values induce a lower copy number. For example, in the sequence with the median copy number for the RNAp model (figure 42), negative SHAP values of the feature dG_total contribute to a lower copy number, while positive SHAP values of the feature A_A_count contribute to a higher copy number. Another example is the sequence with the median copy number for the RNAi model (figure 43), where the features seq_end_483_66_75 and seq_end_80_66_75 contribute to a lower copy number.

Figure 38: SHAP value of maximum copy number predicted sequence for RNAp model.
Figure 39: SHAP value of maximum copy number predicted sequence for RNAi model.
Figure 40: SHAP value of minimum copy number predicted sequence for RNAp model.
Figure 41: SHAP value of minimum copy number predicted sequence for RNAi model.
Figure 42: SHAP value of median copy number predicted sequence for RNAp model.
Figure 43: SHAP value of median copy number predicted sequence for RNAi model.

The graphs below show the absolute mean SHAP value of each feature chosen during the feature selection process, for the RNAp model (figure 44) and the RNAi model (figure 45). As can be seen, the feature with the highest SHAP values, and the greatest influence on the copy number in the RNAp model, is dG_total.

Figure 44: Absolute mean SHAP value for each feature chosen in RNAp model.
Figure 45: Absolute mean SHAP value for each feature chosen in RNAi model.

The graphs below show the SHAP value for each value of the features dG_total (figure 46) and Tx_rate (figure 47). We noticed that there is a specific range of dG_total and Tx_rate values with positive SHAP values, meaning that these values contribute to a higher copy number relative to the mean copy number.

Figure 46: SHAP value for each value of the feature dG_total. There is a specific range of dG_total values with positive SHAP values.
Figure 47: SHAP value for each value of the feature Tx_rate. There is a specific range of Tx_rate values with positive SHAP values.

Once we noticed these specific ranges, we checked the copy number in relation to the dG_total values. High copy number sequences lie in the same dG_total range as observed before (figure 48). We validated this conclusion statistically with an enrichment p-value and found it significant. In conclusion, the behavior of dG_total is more complex than initially expected. In addition, high copy number sequences lie in a similar Tx_rate range as observed before (figure 49).

Figure 48: Original copy number as function of the feature dG_total. There is a specific range of dG_total values with high copy number.
Figure 49: Original copy number as function of the feature Tx_rate. There is a specific range of Tx_rate values with high copy number.

ColE1-like Plasmids Bioinformatics Analysis

Motivation

Based on our discussion with Alagene, we were encouraged to expand our tool to different plasmids and hosts for use by the scientific community. We gathered ColE1-like plasmids and analyzed them against the E. coli pUC19 plasmid to detect evolutionarily conserved locations.

We created a bioinformatics data analysis pipeline for ColE1-like plasmids to gain new insights about the central modulators of the plasmid copy number: RNAp, RNAi, and their corresponding promoters. Specifically, as explained below, we clustered RNAp promoters based on their sequence similarity to find distinct promoter groups. We believe that each cluster represents a novel family of promoters that can regulate RNAp expression. We plan to test the effect of each such family in the lab. This information will improve our model in two dimensions:

  1. Adding additional degrees of freedom to the modifications we can do to the RNAp promoter.
  2. Finding promoters that can work in various additional hosts.

We have created a code pipeline that includes the following components (more details appear in the next sub-sections):

  • Download ColE1-like plasmid fasta files.
  • Develop a null model for evaluating the quality of the alignment score of a sequence against our RNAi or RNAp sequence.
  • Given a plasmid, extract RNAi and RNAp with alignment scores above the null model threshold.
  • Extract the promoters of RNAi and RNAp as the 40 bp upstream of each gene.
  • For each pair of promoters, calculate a distance measure based on the ChimeraARS algorithm.
  • Cluster the promoters based on the calculated chimera scores, using different clustering algorithms.
  • Extract interesting consensus sequences from the different clusters using Multiple Sequence Alignment.
  • Test the consensus sequences in the lab (planned).
Figure 50: Bioinformatics analysis pipeline.

Data

We gathered ColE1-like plasmids from the paper "Evolution of ColE1-like plasmids across γ-Proteobacteria: From bacteriocin production to antimicrobial resistance" [28]. The initial dataset included more than 1,000 plasmids from 62 host species that share the RNAi/RNAp replication system (see figures 51 and 52 and table 5).

Figure 51: Number of plasmid sequences in the dataset per plasmid host genus.
Figure 52: Number of plasmids in the dataset vs. top 15 host species.
Table 5: Table of number of plasmid sequences in our dataset per plasmid host species.

RNAi and RNAp Detection

Given a plasmid, we find RNAi and RNAp by local alignment to the pUC19 sequences:

RNAI (forward) = AACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCA

RNAP_seq (forward) = GCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTTCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAA

We ensure that, in our data, RNAi and RNAp lie on different strands of each plasmid and that their alignment scores are higher than our null model thresholds.

Null Model

To create a null model, we randomly sampled 1,000 locations in the E. coli genome and aligned them against the RNAi and RNAp sequences above. Our null model thresholds (the minimum thresholds for RNAp and RNAi) were set to the maximum alignment scores observed in the random sampling.

As seen in the figure below, the minimum threshold for RNAi was set to 374, corresponding to the maximum alignment score between the forward and reversed RNAi sequences, while the threshold for RNAp was established at 1117. A sketch of this procedure follows.
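This sketch uses Biopython's PairwiseAligner; the genome file path and the scoring scheme are our assumptions, so absolute scores will differ from the thresholds quoted above:

```python
import random
from Bio import SeqIO
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "local"        # local alignment; scoring parameters are assumed

genome = str(SeqIO.read("e_coli_genome.fasta", "fasta").seq)  # hypothetical path

def null_threshold(query, n_samples=1000):
    """Maximum local-alignment score of `query` against random genome windows."""
    scores = []
    for _ in range(n_samples):
        start = random.randrange(len(genome) - len(query))
        scores.append(aligner.score(genome[start:start + len(query)], query))
    return max(scores)

rnai_threshold = null_threshold(RNAI_FORWARD)   # reported above as 374
rnap_threshold = null_threshold(RNAP_FORWARD)   # reported above as 1117
```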

Figure 53: RNAi and RNAp alignment scores histograms for null model generation.

Promoters Extraction

After finding RNAi and RNAp on different strands with scores above the thresholds, we extracted the 40 bp upstream of each to obtain their corresponding promoters.

Then, we created the following table with the promoters:

Table 6: Part of the table of promoters from ColE1-like plasmids we have created.

ChimeraARS Score

ChimeraARS (Chimera Average Repetitive Substring) [29] measures the tendency of a genomic sequence to include long subsequences that appear in a reference sequence or a reference set of genomic sequences. It can replace classical alignment scores when the homology is not very high.

For each extracted promoter, we calculated its ChimeraARS score using the following algorithm:

  1. For position i in the gene, look for the longest nucleotide substring which begins at position i and appears in the reference.
  2. The substring’s length is this position’s score.
  3. The sequence’s score (including start and stop codons) is the average position score.

The chimera score calculation used the pUC19 promoters as a reference. The advantages of ChimeraARS are that it is fast and robust in finding similarities to a reference sequence. A minimal implementation sketch follows.
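This is a straightforward (quadratic) version of the algorithm above; the published tool is more efficient, but this version makes the definition explicit:

```python
def chimera_ars(seq, references):
    """ChimeraARS: for each position i, find the length of the longest substring
    starting at i that appears in at least one reference sequence;
    the score is the average of these lengths."""
    total = 0
    for i in range(len(seq)):
        length = 0
        while i + length < len(seq) and any(
                seq[i:i + length + 1] in ref for ref in references):
            length += 1
        total += length
    return total / len(seq)

# score = chimera_ars(promoter, puc19_promoters)   # pUC19 promoters as reference
```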

Clustering

To achieve robust insights and characterize the extracted promoters, we divided them into distinct groups and analyzed each group individually.

We computed chimera scores for each promoter by comparing it to the other promoters, and then used these scores as features in our clustering algorithm. We did this for both RNAi and RNAp.

Our clustering pipeline uses a dataset of 800 features representing chimera scores, a crucial aspect in understanding the characteristics of the promoters. To select the most effective model, we treated seven state-of-the-art clustering algorithms as hyperparameters in a grid search; each model's own hyperparameters were further optimized through a dedicated grid search. The performance metric used to gauge clustering quality consisted of two components: a silhouette score greater than 0.4, combined with the minimum of the Davies-Bouldin score divided by the Calinski-Harabasz score. The algorithms evaluated included DBSCAN, HDBSCAN, K-Means, DPCA (as introduced in [30], [31]), Affinity Propagation, Gaussian Mixture, and Spectral Clustering. To plot the clusters, PCA was used [32]. A reduced sketch of this selection loop appears below.
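The sketch covers only two of the seven algorithms, with illustrative parameter ranges; `X` is the chimera-score feature matrix:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def criterion(X, labels):
    """Silhouette > 0.4 gate, then Davies-Bouldin / Calinski-Harabasz (lower is better)."""
    if len(set(labels)) < 2 or silhouette_score(X, labels) <= 0.4:
        return None
    return davies_bouldin_score(X, labels) / calinski_harabasz_score(X, labels)

candidates = [KMeans(n_clusters=k, n_init=10) for k in range(2, 50)]
candidates += [SpectralClustering(n_clusters=k) for k in range(2, 50)]

best = None
for algo in candidates:
    labels = algo.fit_predict(X)
    score = criterion(X, labels)
    if score is not None and (best is None or score < best[0]):
        best = (score, algo, labels)
```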

Figure 54: Best clustering algorithms (Spectral Clustering, HDBSCAN, Gaussian Mixture, and DPCA) based on a silhouette score > 0.4 and the minimal Davies-Bouldin score / Calinski-Harabasz score metric. Each panel shows the different clusters (point colors) plotted against the two major PCA components (PC1 and PC2).

The figure above reveals distinct clusters formed by all four algorithms. Each color group in a cluster plot represents a group of observations that the algorithm clustered together based on the metric used. The top-performing algorithms were HDBSCAN and Spectral Clustering, achieving scores of 0.32 and 0.52, respectively, with 17 and 46 clusters.

Clusters Analysis

The clustering algorithm yielded different promoter clusters; on each, we ran a Multiple Sequence Alignment using Clustalo [33] against the pUC19 RNAp promoter and found the consensus sequence of each cluster. As stated above, one of the goals of this analysis was to find genomic evidence of ColE1-like plasmids in other hosts, so we focused on clusters with substantial representation of hosts other than E. coli. One of the clusters contained sequences belonging exclusively to the Yersiniaceae family, a member of the order Enterobacteriales.

A subset of the clusters from our final RNAp promoter clustering is shown below:

Figure 55A: Visualization of part of the final clusters of RNAp promoters after multiple sequence alignment, using FluentDNA [34].

Each color corresponds to a different base, as shown in the legend. Each cluster comprises a stack of RNAp promoter sequences that, after multiple sequence alignment, differ from one another only by minor changes.

Figure 55B: Visualization of part of the final clusters of RNAp promoters after multiple sequence alignment, using FluentDNA [34].

Although we have not yet performed validating experiments due to time constraints, our findings suggest that a ColE1 system can be found in other organisms.

Conclusions

In conclusion, our modeling approach includes data searching, data preparation, feature design, model selection, feature selection, hyperparameter tuning, and model evaluation. To develop our models, we combined various types of tools, including machine learning, biophysical modeling, bioinformatics, statistics, and molecular evolution.

We believe that the strategy used here can be applied by future iGEM teams to a variety of problems in synthetic biology. We also believe that the datasets (e.g., plasmid sequences) and code we developed can be reused in future iGEM projects.

Our models were validated via comparison to various types of measurements, demonstrating high correlations with experimental data in all cases.

We believe our model contributes to a better understanding of the replication system of the ColE1 plasmid family. It provided insight into the effect of the origin of replication sequence, as well as the effects of the inhibitory and priming RNAs, on copy number. Additionally, it helped us understand which factors influence the copy number the most. For example, our analysis indicates that one fundamental feature affecting the copy number is dG_total, which contributes to a higher copy number only when it falls within a specific range. Additional important features are the folding structure of RNAp and various DNA sequence motifs in the promoters of RNAi and RNAp.

In addition, our analyses suggest that the ColE1 system can be found in other organisms; thus, in the future our tool can be adapted relatively easily to design plasmid copy numbers in a wide range of hosts.

In summary, we believe that our model, and the software based on it, have the potential to make a significant impact on the field of synthetic biology, and we are excited to share them with the wider scientific community. We have also filed a patent application for our model, which will allow us to protect our intellectual property while continuing to develop and refine it.

References

  1. S. S. Singh, A. Typas, R. Hengge, and D. C. Grainger, “Escherichia coli σ70 senses sequence and conformation of the promoter spacer region,” Nucleic Acids Res, vol. 39, no. 12, p. 5109, Jul. 2011, doi: 10.1093/NAR/GKR080.
  2. M. V. Rouches, Y. Xu, L. B. G. Cortes, and G. Lambert, “A plasmid system with tunable copy number,” Nature Communications 2022 13:1, vol. 13, no. 1, pp. 1–12, Jul. 2022, doi: 10.1038/s41467-022-31422-0.
  3. K. Robison, A. M. McGuire, and G. M. Church, “A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome,” J Mol Biol, vol. 284, no. 2, pp. 241–254, Nov. 1998, doi: 10.1006/JMBI.1998.2160.
  4. M. Pachkov, P. J. Balwierz, P. Arnold, E. Ozonov, and E. Van Nimwegen, “SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates,” Nucleic Acids Res, vol. 41, no. Database issue, p. D214, Jan. 2013, doi: 10.1093/NAR/GKS1145.
  5. R. C. Brewster, D. L. Jones, and R. Phillips, “Tuning Promoter Strength through RNA Polymerase Binding Site Design in Escherichia coli,” PLoS Comput Biol, vol. 8, no. 12, p. e1002811, 2012, doi: 10.1371/JOURNAL.PCBI.1002811.
  6. R. Muhammod, S. Ahmed, D. M. Farid, S. Shatabda, A. Sharma, and A. Dehzangi, “PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences,” Bioinformatics, vol. 35, no. 19, pp. 3831–3833, Oct. 2019, doi: 10.1093/BIOINFORMATICS/BTZ165.
  7. T. L. LaFleur, A. Hossain, and H. M. Salis, “Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria,” Nature Communications 2022 13:1, vol. 13, no. 1, pp. 1–15, Sep. 2022, doi: 10.1038/s41467-022-32829-5.
  8. L. Baron, S. Atar, H. Zur, M. Roopin, E. Goz, and T. Tuller, “Computational based design and tracking of synthetic variants of Porcine circovirus reveal relations between silent genomic information and viral fitness,” Scientific Reports 2021 11:1, vol. 11, no. 1, pp. 1–17, May 2021, doi: 10.1038/s41598-021-89918-6.
  9. S. Heinz et al., “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities,” Mol Cell, vol. 38, no. 4, p. 576, May 2010, doi: 10.1016/J.MOLCEL.2010.05.004.
  10. A. Shrestha and A. Mahmood, “Review of deep learning algorithms and architectures,” IEEE Access, vol. 7, pp. 53040–53065, 2019, doi: 10.1109/ACCESS.2019.2912200.
  11. H. Il Lim, “A linear regression approach to modeling software characteristics for classifying similar software,” Proceedings - International Computer Software and Applications Conference, vol. 1, pp. 942–943, Jul. 2019, doi: 10.1109/COMPSAC.2019.00152.
  12. D. E. Hilt and D. W. Seegrist, Ridge: A Computer Program for Calculating Ridge Regression Estimates. Upper Darby, Pa.: U.S. Dept. of Agriculture, Forest Service, Northeastern Forest Experiment Station, 1977. doi: 10.5962/bhl.title.68934.
  13. P. Kennedy, A Guide to Econometrics, 5th ed. Cambridge, MA: MIT Press, 2003. Accessed: Oct. 07, 2023. [Online]. Available: https://EconPapers.repec.org/RePEc:mtp:titles:026261183x
  14. M. Gruber, Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators. CRC Press, 2020. Accessed: Oct. 07, 2023. [Online]. Available: https://www.routledge.com/Improving-Efficiency-by-Shrinkage-The-James--Stein-and-Ridge-Regression/Gruber/p/book/9780367579364
  15. H. Zou and T. Hastie, “Regularization and Variable Selection Via the Elastic Net,” J R Stat Soc Series B Stat Methodol, vol. 67, no. 2, pp. 301–320, Apr. 2005, doi: 10.1111/J.1467-9868.2005.00503.X.
  16. L. Guo, “Extreme Learning Machine with Elastic Net Regularization,” Intelligent Automation And Soft Computing, vol. 26, no. 3, pp. 421–427, 2020, doi: 10.32604/iasc.2020.013918.
  17. L. Breiman, “Random forests,” Mach Learn, vol. 45, no. 1, pp. 5–32, Oct. 2001, doi: 10.1023/A:1010933404324/METRICS.
  18. “CatBoost vs. LightGBM vs. XGBoost | by Kay Jan Wong | Towards Data Science.” Accessed: Oct. 18, 2023. [Online]. Available: https://towardsdatascience.com/catboost-vs-lightgbm-vs-xgboost-c80f40662924
  19. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics. New York: Springer, p. 222.
  20. T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2623–2631, Jul. 2019, doi: 10.1145/3292500.3330701.
  21. “Confidence intervals for XGBoost | Towards Data Science.” Accessed: Oct. 07, 2023. [Online]. Available: https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
  22. M. B. Kursa and W. R. Rudnicki, “Feature Selection with the Boruta Package,” J Stat Softw, vol. 36, no. 11, pp. 1–13, Sep. 2010, doi: 10.18637/JSS.V036.I11.
  23. S. M. Lundberg and S. I. Lee, “A Unified Approach to Interpreting Model Predictions,” Adv Neural Inf Process Syst, vol. 2017-December, pp. 4766–4775, May 2017, Accessed: Oct. 10, 2023. [Online]. Available: https://arxiv.org/abs/1705.07874v2
  24. M. Camps, “Modulation of ColE1-like plasmid replication for recombinant gene expression,” Recent Pat DNA Gene Seq, vol. 4, no. 1, pp. 58–73, Jan. 2010, doi: 10.2174/187221510790410822.
  25. H. Masukata and J. Tomizawa, “A mechanism of formation of a persistent hybrid between elongating RNA and template DNA,” Cell, vol. 62, no. 2, pp. 331–338, Jul. 1990, doi: 10.1016/0092-8674(90)90370-T.
  26. “Addgene: pUC19.” Accessed: Oct. 07, 2023. [Online]. Available: https://www.addgene.org/50005/
  27. R. Lorenz et al., “ViennaRNA Package 2.0,” Algorithms for Molecular Biology, vol. 6, no. 1, pp. 1–14, Nov. 2011, doi: 10.1186/1748-7188-6-26/TABLES/2.
  28. M. Ares-Arroyo, E. P. C. Rocha, and B. Gonzalez-Zorn, “Evolution of ColE1-like plasmids across γ-Proteobacteria: From bacteriocin production to antimicrobial resistance,” PLoS Genet, vol. 17, no. 11, Nov. 2021, doi: 10.1371/JOURNAL.PGEN.1009919.
  29. H. Zur and T. Tuller, “Exploiting hidden information interleaved in the redundancy of the genetic code without prior knowledge,” Bioinformatics, vol. 31, no. 8, pp. 1161–1168, Apr. 2015, doi: 10.1093/BIOINFORMATICS/BTU797.
  30. “Density peak clustering Sample by Python | Kaggle.” Accessed: Oct. 10, 2023. [Online]. Available: https://www.kaggle.com/code/guansuo/density-peak-clustering-sample-by-python
  31. A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014, doi: 10.1126/science.1242072.
  32. “2.3. Clustering — scikit-learn 1.3.1 documentation.” Accessed: Oct. 10, 2023. [Online]. Available: https://scikit-learn.org/stable/modules/clustering.html
  33. F. Sievers, G. J. Barton, and D. G. Higgins, “Multiple Sequence Alignments.” Wiley, pp. 227–250, 2020. Accessed: Oct. 07, 2023. [Online]. Available: https://discovery.dundee.ac.uk/en/publications/multiple-sequence-alignments
  34. J. Seaman and R. J. A. Buggs, “FluentDNA: Nucleotide Visualization of Whole Genomes, Annotations, and Alignments,” Front Genet, vol. 11, Apr. 2020, doi: 10.3389/FGENE.2020.00292.