Abstract

Our project followed the structured iGEM engineering cycle (figure 1). This cycle comprised distinct phases:

Design: We designed a machine learning (ML) model that predicts the copy number of a given plasmid by analyzing alterations in the RNA promoters that are integral to the origin of replication (ORI).

Build: We built a web tool that modifies ORI sequences to achieve a desired copy number and used it to design and generate a library of plasmids with various copy numbers.

Test: We experimentally measured the copy numbers of the plasmid library and compared them to the copy numbers predicted and designed by our model.

Learn: We used the measured copy numbers to improve our model and to gain new insights into the factors that contribute to plasmid copy number.

We performed additional “in-silico” engineering cycles to improve our models based on various types of public data and biological knowledge.

By following this engineering cycle, our project aims to advance the understanding of biological systems, enabling more precise and controlled manipulation of plasmid copy numbers for practical applications.

Design

In the “design” stage, we created a predictive model for plasmid copy number in E. coli. We chose E. coli as our model organism because of its extensive documentation, its wide usage, and its relevance to understanding plasmid copy number regulation. Our model's code was written in Python, a widely used language for data analysis and machine learning.

For our model (see more details in the Model page), we adopted a voting approach that averages the results of several ensemble learning models: Random Forest, XGBoost, LightGBM and CatBoost. After assessing various predictive machine learning models, this approach proved to be the most effective. We split our data into balanced train, validation, and test sets, with the goal of preventing data leakage and obtaining an accurate assessment of the results.
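The averaging step of such a voting approach can be sketched as follows. This is a minimal illustration rather than the project's actual code: the "models" are hypothetical stand-ins (plain Python callables) for trained Random Forest, XGBoost, LightGBM and CatBoost regressors, and the name `average_vote` is an assumption made for this example.

```python
from statistics import fmean
from typing import Callable, Sequence

# A "model" here is any callable that maps a feature vector to a predicted copy number.
Model = Callable[[Sequence[float]], float]


def average_vote(models: Sequence[Model], features: Sequence[float]) -> float:
    """Return the unweighted mean of the individual model predictions."""
    if not models:
        raise ValueError("at least one model is required")
    return fmean(model(features) for model in models)


# Toy stand-ins for four trained regressors (hypothetical, for illustration only).
toy_models = [
    lambda x: 100.0 + x[0],
    lambda x: 110.0 + x[0],
    lambda x: 90.0 + x[0],
    lambda x: 100.0 - x[0],
]

prediction = average_vote(toy_models, [4.0])
print(prediction)
```

Averaging tends to reduce the variance of the individual learners, which is one common motivation for combining several ensemble models in this way.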

A variety of sequence-related features were collected for RNAp and RNAi (see more details in the Project Description page). These features included promoter binding energy, promoter strength, nucleotide counts, one-hot encoding, known motifs, de-novo motifs, features based on the RNA's secondary structure, and more (see more details in the Model page) (figure 2).
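Two of the simpler feature types mentioned above, nucleotide counts and one-hot encoding, might be computed roughly as in the sketch below. The function names and the example sequence are assumptions chosen for illustration; the project's real feature pipeline is described in the Model page.

```python
from collections import Counter
from typing import Dict, List

ALPHABET = "ACGT"


def nucleotide_counts(seq: str) -> Dict[str, int]:
    """Count how often each DNA base occurs in the sequence."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in ALPHABET}


def one_hot(seq: str) -> List[List[int]]:
    """Encode each position as a 4-element indicator vector in A, C, G, T order."""
    return [[int(ch == base) for base in ALPHABET] for ch in seq.upper()]


# Illustrative six-nucleotide promoter fragment (not taken from the project's data).
example = "TTGACA"
counts = nucleotide_counts(example)
encoded = one_hot(example)
print(counts)
print(encoded[0])
```

Counts summarize composition regardless of position, while the one-hot matrix preserves position, so a model can weigh a mutation at one promoter position differently from the same mutation elsewhere.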

This design step formed the foundation of our project's capability to predict and control plasmid copy numbers.

In-Silico inner cycles

Due to a lack of experimental data regarding RNAi promoters, the model's initial performance was insufficient. We therefore searched for additional ways in which mutations in the RNAi promoter can affect plasmid copy number.

The inhibitory RNA (RNAi) and its promoter are complementary to a part of the priming RNA (RNAp). A mutation in the promoter of RNAi therefore also causes a mutation in RNAp itself and could alter its secondary structure (see more details in the Project Description page). For this reason, we developed many features that describe the RNAp secondary structure, based on RNAfold and RNAeval calculations, and applied them to the model based on the RNAi promoters. The features are calculated for a range of RNAp lengths in order to describe its structure during transcription (see more details in the Model page).

We therefore performed the following in-silico engineering cycle of designing, building, testing, and learning:
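The idea of computing structure features over a range of RNAp lengths can be sketched as below. This is only an illustration under stated assumptions: `prefix_folding_features` and `dummy_fold` are hypothetical names, and `dummy_fold` is a placeholder that does not model real folding energies. In practice one would pass in a real folding routine that returns a structure and a minimum free energy, for example `RNA.fold` from the ViennaRNA Python bindings if they are installed.

```python
from typing import Callable, Dict, Iterable, Tuple

# A folding function takes an RNA sequence and returns (structure, minimum free energy).
FoldFn = Callable[[str], Tuple[str, float]]


def prefix_folding_features(
    rnap: str, lengths: Iterable[int], fold: FoldFn
) -> Dict[int, float]:
    """Fold the first n nucleotides of RNAp for each n and return {n: energy}.

    Each prefix approximates the transcript as it exists part-way through
    transcription. Lengths outside 1..len(rnap) are skipped.
    """
    features: Dict[int, float] = {}
    for n in lengths:
        if 0 < n <= len(rnap):
            _structure, energy = fold(rnap[:n])
            features[n] = energy
    return features


def dummy_fold(seq: str) -> Tuple[str, float]:
    """Placeholder only: NOT a real energy model, used so the sketch runs anywhere."""
    return "." * len(seq), -0.5 * seq.count("G")


features = prefix_folding_features("GGGAAACCC", [3, 6, 20], dummy_fold)
print(features)
```

Keeping the folding routine as a parameter separates the feature logic from the external tool, so the same code can be exercised with a stub and run with a real folding program.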

Design: Various features based on RNAp folding during its transcription were designed to improve our model.

Build: We used software tools to calculate the designed features.

Test: We introduced the new features into our model and evaluated its performance after feature selection.

Learn: We saw that the folding features scored high and improved the prediction. We thus learned that structural changes in RNAp caused by alterations to RNAi may play a pivotal role in plasmid copy number levels, and we found a way to improve our model.

We performed several such in-silico iterations that improved our model (see more details in the Model page). Among others, the iterations included adding more relevant features (e.g. various known and de-novo sequence motifs), choosing the right model(s), and choosing the right hyperparameters for the models.

Build

In the "Build" step of our project, we developed two critical components: a web tool and a library of mutants.

Web Tool:
Based on our model, we designed a web tool whose primary purpose is to let users modify ORI sequences to achieve their desired plasmid copy numbers. The software aims to make plasmid copy number prediction and design more accessible. Through its user-friendly interface, users enter a desired copy number and receive a FASTA file containing an ORI sequence modified to match it.

Library of Mutants:
In the wet lab, we implemented a novel protocol to replace the origins of replication in various sequences. This process led to the creation of a library consisting of 15 distinct plasmid sequences, which played a pivotal role in validating and improving our predictive model. The construction of this library involved two distinct design cycles. The first cycle incorporated sequences sourced from previously published research articles, while the second cycle introduced novel sequences generated by our predictive model. Moreover, random mutations were introduced during the assembly process (see more details in the Notebook page). All variants within the library were validated by Sanger sequencing, confirming the presence of the intended mutations within the RNAp and RNAi promoters (figure 3).

*Figure 3: The mutant ORI sequences were inserted into the plasmid with an assembly method. Then, the insertion was validated with Sanger sequencing.*

Test

In the "Test" phase of our project, we evaluated our machine learning model's performance with a combination of validation techniques. In addition to assessing the model's accuracy with several evaluation metrics under cross-validation, we tested it primarily against the biological data generated in our laboratory.

To evaluate the real-world performance of our model, we compared the plasmid copy numbers it predicted with the copy numbers calculated from experimental results in our lab, which were not used to train the model (figure 4). This comparison was essential to validate the model's predictive capabilities (see more details in the Model page).

In the laboratory, we measured the copy numbers of the newly constructed plasmids. To ensure accuracy and minimize potential errors associated with DNA isolation procedures, we implemented a colony qPCR method (see more details in the Experiments page). Using this method, we successfully determined the plasmid copy numbers for the various plasmids within our library. We were also able to see the differences in copy number with the help of the chromoprotein gene we inserted into the plasmid (Table 1).
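The exact calculation used is given in the Experiments page and is not reproduced here. As a hedged illustration only, a widely used way to estimate plasmid copy number from qPCR compares the threshold cycle (Ct) of a plasmid amplicon with that of a single-copy chromosomal amplicon. The sketch below assumes this relative-quantification approach; the function name, the default amplification efficiencies, and the Ct values are all assumptions made for this example and are not measurements from the project.

```python
def relative_copy_number(
    ct_plasmid: float,
    ct_chromosome: float,
    eff_plasmid: float = 2.0,
    eff_chromosome: float = 2.0,
) -> float:
    """Estimate plasmid copies per chromosome from two Ct values.

    Assumes the chromosomal target is single-copy. An efficiency of 2.0 means
    the amplicon doubles every cycle (100% efficient PCR), in which case the
    formula reduces to 2 ** (ct_chromosome - ct_plasmid).
    """
    return (eff_chromosome ** ct_chromosome) / (eff_plasmid ** ct_plasmid)


# Made-up Ct values: the plasmid amplicon crosses the threshold 5 cycles earlier.
estimate = relative_copy_number(ct_plasmid=15.0, ct_chromosome=20.0)
print(estimate)
```

Because the estimate depends exponentially on the Ct difference, small errors in Ct or in the assumed efficiencies translate into large errors in the copy number, which is one reason measured efficiencies are normally preferred over the ideal value of 2.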

*Table 1: The new plasmids' calculated copy numbers and their colony pictures.*

Mutant	Calculated Copy Number	Colony Picture
Mut 1	11
Mut 2	164
Mut 3	53
Mut 4	188
Mut 5	120
Mut 6	68
Mut 7	197
Mut 8	232
Mut 9	174
Mut 10	119
Mut 11	197
Mut 12	177
Mut 13	167
Mut 14	153
Mut 15	144

When we compared the experimentally calculated copy numbers with the predictions made by our model, we observed a significant correlation between the two sets of results. This agreement between measurements and model predictions supports the reliability and effectiveness of our machine learning model.
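A comparison of this kind is typically summarized with a correlation coefficient. The text does not state which coefficient was used, so the sketch below shows both Pearson (linear agreement) and Spearman (rank agreement) using only the Python standard library. The two short lists are toy values invented for illustration; they are not the project's measurements or predictions.

```python
from math import sqrt
from statistics import fmean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values for illustration only.
measured = [10.0, 50.0, 120.0, 200.0]
predicted = [20.0, 40.0, 150.0, 180.0]
print(pearson(measured, predicted), spearman(measured, predicted))
```

Spearman is the natural choice when only the ordering of plasmids by copy number matters, while Pearson also rewards agreement in magnitude.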

Learn

After observing a difference between the model's predictions on the test set and our biological results, we approached the author of the original article [1]. In our conversation we learned about a hidden step between their biological results and the published copy numbers. Their biological measurements were of a relative copy number, that is, a value based on growth parameters. These measurements were highly correlated (in terms of ranking) with copy number; hence they used a polynomial fit based on ddPCR measurements to convert their measurements into final copy numbers.

With this understanding, and with our additional qPCR copy number measurements, we applied a mathematical transformation to the data used to train our model in order to improve its performance. Starting from the relative copy number, we improved the mathematical fit of the data to the measured copy number. For this purpose, we added 30% more data points to the sequences used for the fitting while setting aside an equal number of data points for validation. Analyzing the data, we observed a power-law behavior connecting the measured copy number to the relative copy number; therefore, we performed a linear regression on the logarithmic values to transform the relative copy numbers into copy number estimates. The new transformation was applied to all the relative copy numbers used to train our model (see more details in the Model page).

The model predictions based on the transformed data were compared to the validation data points, and the results showed a significant correlation with the validation data (Table 1). Applying this engineering cycle improved the model's accuracy and performance and yielded a significant correlation between the model predictions and the biological copy number measurements.
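The transformation described above can be sketched as an ordinary least-squares fit on logarithmic values. The sketch assumes that both the relative and the measured copy numbers are log-transformed, so the fitted line corresponds to a relation of the form measured = exp(a) * relative^b. The function names and the toy data points are illustrative assumptions, not the project's data or code.

```python
from math import exp, log
from statistics import fmean
from typing import Sequence, Tuple


def fit_log_log(
    relative: Sequence[float], measured: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of log(measured) = a + b * log(relative); returns (a, b)."""
    lx = [log(v) for v in relative]
    ly = [log(v) for v in measured]
    mx, my = fmean(lx), fmean(ly)
    slope = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum(
        (x - mx) ** 2 for x in lx
    )
    intercept = my - slope * mx
    return intercept, slope


def to_copy_number(relative_value: float, a: float, b: float) -> float:
    """Convert a relative copy number into an estimated copy number."""
    return exp(a) * relative_value ** b


# Toy points lying exactly on measured = 3 * relative ** 2 (illustration only).
relative = [1.0, 2.0, 4.0, 8.0]
measured = [3.0, 12.0, 48.0, 192.0]

a, b = fit_log_log(relative, measured)
estimate = to_copy_number(5.0, a, b)
print(b, exp(a), estimate)
```

Fitting in log space weights relative errors rather than absolute ones, which suits copy numbers that span more than an order of magnitude. Holding out part of the measurements for validation, as described above, guards against overfitting the transformation.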

*Figure 4A: Fit of relative copy number to biological copy number measurements, with 30% more data points.*

*Figure 4B: Correlation of the updated model predictions with the biological measurements held out for validation.*