Abstract

In our project, we introduced a predictive model for estimating pUC19 plasmid copy numbers based on the ORI and the regulatory sequences within this region. To this end, we evaluated and combined various machine learning models (e.g. neural network, ridge, lasso, ElasticNet, XGBoost, CatBoost, LightGBM and Random Forest) and generated thousands of features as input. Feature engineering was based on statistical and biophysical models such as de-novo sequence motifs, RNA structure predictions, and free energy of the interaction between the promoter and the RNA polymerase (see model page).

In addition, we developed a novel bioinformatics pipeline and analyzed Col-E1 plasmids from different organisms to gain new insights regarding the regulatory sequence in the ORI in a wide range of hosts (see model page).

Based on our models and pipeline, we designed a user-friendly software tool that enables any researcher to easily design a plasmid (or a set of plasmids) with a requested copy number(s) (see software page).

To validate the model, we have utilized homologous joining assembly and created an efficient and highly accessible protocol for substituting the origin of replication in pUC19 with a modified one. This protocol was used to create a library of 15 novel sequences of origin of replication sequences (see experiments page).

By measuring the copy number values for those sequences, we validated our model’s prediction capability; the Pearson correlation between the model predictions and the measurements is up to 0.91 (P-value= 4.06 x 10^-50) in one type of data and 0.652 (p- value=1.6X10^-2) in the second type of data. Furthermore, we show the impact of plasmid copy number on growth rates and protein expression. Details of the results can be found on this page.

Predictive Model

Our model is the first model to predict copy number from the origin of replication sequence. This will allow synthetic biologists to tune the expression of genes of interest and thereby optimize protein production with minimal metabolic burden on the cell. Maximizing protein yield can create a more cost-effective process, which can impact several industries. We chose to use the voting model, or an average of the best-performing models: CatBoost and Random Forest, after examining a few predictive machine learning models, and fitting the relative copy number to the new biological measurements. We extracted thousands of sequence-related features for priming RNA (RNAp) and inhibitory RNA (RNAi) promoters, which are the two genes encoded in the origin of replication sequence. Among others, the features used include promoter binding energy, promoter strength, nucleotide counts, one hot encoding of nucleotide positions, motifs, features based on the secondary structure of RNA, de-novo motifs and more (figures 1,2) (link to model page).

The main features chosen by the models include promoter binding energy strength related to RNAp and RNAi, position-specific scoring matrix (PSSM) values related to the distributions of nucleodtides in the promoters of RNAi and RNAp, nucleotide counts, one-hot encoding of the promoter sequence, and various sequence motifs from DPINTERACT [1] and SwissRegulon [2] databases.

Based on the chosen features, we can better understand the biophysics of plasmid copy numbers via the regulation of RNAp and RNAi; this knowledge will enable us to design efficient copy number engineering approaches. For example, the following features are used by our models:

Features such as dG (which is the difference in free energy between a bound and unbound promoter) and Tx rate (which is transcription initiation rate) that is related to the efficiency of RNA polymerase interaction with the DNA. We specifically found that excessively high or low dG (see figure 3) will lead to minimize copy number. A possible explanation for this result is that low values of dG (which represent a weak connection between RNA polymerase and DNA) lead to low transcription; we hypothesize that high dG values that representing a strong connection of the RNA polymerase to the DNA prevent the advancement of the RNA polymerase, therefore inhibiting the transcription (see more info in our model page).
Various features related to the RNA folding of the RNAp sequence which are related to the way RNAp and RNAi hybridize to each other and regulate the plasmid replication (link to model page).
Various DNA sequence motifs and position specific scoring matrices (PSSM) were found to contribute to the predictive power of the model. These motifs represent the binding sites of regulatory factors that effect RNAp and RNAi transcription.

datasetx — Figure 1: Illustration of ensemble models, detailed on the model page (Gradient Boosting Trees). Each gray box represents a tree built in the ensemble model. The error residuals of one tree are used to train the next tree. The final prediction is the sum of the tree predictions, multiplied by ƞ (the learning rate).

catredfeat — Figure 2: Priming RNA feature importance for different models; the higher the feature importance score is the higher feature contribution to the predicted copy number. We extracted thousands of sequence-related features for RNAp promoters including promoter binding energy, promoter strength, entropy, nucleotide counts, one hot encoding, motifs, features based on the secondary structure of RNA and more (detailed in the model page on model features section).

randforestfeat — Figure 2: Priming RNA feature importance for different models; the higher the feature importance score is the higher feature contribution to the predicted copy number. We extracted thousands of sequence-related features for RNAp promoters including promoter binding energy, promoter strength, entropy, nucleotide counts, one hot encoding, motifs, features based on the secondary structure of RNA and more (detailed in the model page on model features section).

dgtotal — Figure 3: Shapley Additive Explanations (SHAP - game theoretic approach to explain the output of any machine learning model [3]) values for each value of the feature dG_total. There is a specific range of dG_total values with positive SHAP value. This means that values in these region contribute to higher copy number and values out of this range contribute to lower copy number

The model achieved accurate results with a Pearson correlation of up to 0.91 (P-value= 4.06 x 10^-50) and Spearman correlation of 0.81 (P-value= 1.42 x 10^-30) between the model predictions and a test set partitioned from the article that was not used to train the model (see engineering cycle page). Moreover, the Pearson correlation with biological measurements we generated and saved for validation was 0.652 (P-value= 1.6X10 x 10^-2) (figure 4).

Figure 4A. model predictions vs. test set values which include measurement from the paper [4] that were not used to train the model.

Figure 4B. Model predictions vs. our biological measurements (blue points are related to copy number estimation with ddPCR and red points are related to copy number estimation with qPCR).

A User-Friendly Software

We designed a user-friendly web tool to revolutionize plasmid's copy numbers usage. With its friendly user interface (UI), this tool enables users to input plasmid sequences or select from existing ones, and generate FASTA files containing edited sequences engineered to satisfy their desired copy numbers. It offers the unique ability to generate multiple sequences in one go, streamlining experimentation. Moreover, it enhances the user experience by providing comprehensive plasmid maps with clear annotations, facilitating data interpretation, and easy result sharing. The software meets synthetic biology standard set through experimental investigation, and is thus useful for various projects such as bioproduction optimization and genetic circuit development. The tool's accessibility across various devices ensures effortless use, making it an invaluable asset for geneticists and biotechnologists (link to the tool , link to software page) (figure 5).

Bioinformatics Analysis of ColE1-like Plasmids in Various Hosts

We have also created a bioinformatics data analysis pipeline of ColE1-like plasmids to gain new insights about the central modulators of the plasmid copy number - RNAp, RNAi, and their corresponding promoters from an evolutionary perspective, to be able to control the copy number of a wider set of hosts in the future (figure 6) (see model page). Key steps included RNAi and RNAp detection based on sequence alignment algorithms, extraction of corresponding promoters based on a tailored algorithm, calculation of the sequence distances between pairs of promoters based on a measure called chimeraARS [5], and clustering the promoters based on these distances using various algorithms. The analysis revealed distinct promoter clusters, with a focus on identifying ColE1-like plasmids in different hosts. While experimental verification in underway, reportable results are not yet available. Preliminary observations of these results indicated the presence of the ColE1 system in various organisms, opening avenues for future research into plasmid evolution and host interactions.

rnapcluster — figure 6. Visualization of part of the final clusters of RNAp promoters after multiple sequence alignment using DNAFluent [1]. The figure includes the multiple sequence alignment of the RNAp promoter from each cluster. As can be seen, there are various RNAp promoters in nature that can be used for designing ORI sequences. See more details in model page).

Plasmid Copy Number Measurements and Correlation to the Model predictions

In the wet lab portion, we replaced the origins of replications with our newly developed protocol, as described in detail on the experiments page (link to experiments page). This innovative approach allowed us to construct a library comprising a total of 15 distinct plasmid sequences. These sequences played a crucial role in the verification and improvement of our model. The process of constructing this library involved two distinct cycles of design. The first cycle incorporated sequences from previously published research articles, while the second cycle introduced novel sequences generated by our model. Additionally, random mutations were introduced during the assembly process. It is worth noting that all variants within the library underwent validation through Sanger sequencing, confirming the presence of the intended mutations within the RNAp and RNAi promoters. Details of the mutations can be found in the provided table in the experiments section (link to experiments page).

Following the successful verification of mutations and sequence integrity, we proceeded to calculate the copy numbers of the newly constructed plasmids. To minimize potential errors stemming from DNA isolation procedures, we adopted a colony qPCR approach, as outlined in the experiment page (link to experiments page). To accurately determine plasmid copy numbers, we first established a calibration curve to assess the replication efficiencies of the specific primers used for both the plasmid and the bacterial genomic DNA. A comprehensive description of this calibration process can be found in the experiment page (link to experiments page).

By utilizing the colony qPCR method and the established calibration curve, we successfully calculated the plasmid copy numbers for the various constructs within our library. The results of this copy number analysis are presented in the accompanying figure, providing valuable insights into the replication dynamics of these engineered plasmids. Based on these findings, as anticipated, it is evident that substituting the promoter sequences at the origin of replication has an impact on the plasmid copy number (figure 7).

plascoopperchro — figure 7. Calculated plasmids copy numbers (y-axis) of different new mutants (x-axis) using colony qPCR. Error bars represent standard deviation. The data provides crucial insights into replication dynamics and stability, guiding further optimization and application.

When we compared the results of our measurements to the results of the model predictions, we saw that there is indeed a correlation between the calculations and the model's predictions. This further verifies the reliability of our model and shows that we were successful in providing the first machine-learning model for copy number engineering and prediction (figure 3B).

Effects of Plasmid Copy Number on Growth Rate and Biomanufactoring

After we were successfully able to show that our model can be used to engineer sequence modification to achieve a desired copy number, we conducted several experiments to analyze the copy number effects on different variables. First, we performed a growth experiment (detailed in experiments page) to test whether changes in the plasmid copy number influence the growth rate of the bacteria. The results in figure 8 demonstrate that the different mutants from our library expose different growth patterns.

Figure 8. Growth curves of various mutants (represented in different colors) over time (minutes) demonstrating significant variability in optical density (OD) measurements (y-axis). The distinct growth profiles highlight diverse physiological behaviors, emphasizing the intricate interplay of genetic factors and environmental conditions in shaping the growth dynamics of the mutant strains (the mutants strains are detailed in experiment page).

To establish the connection between the plasmid copy number and the growth curve, we created several features that capture diverse data points from the curve. The features are: 1. Linear region-exponential phase. 2. Lag phase. 3. Stationary phase 4. Area under the curve (figure 9).

mut7 — Figure 9. Detailed analysis of growth rates focusing on distinct phases of the growth curve. The features are marked on a selected curve: 1. Linear region-exponential phase, 2. Lag phase, 3. Stationary phase 4. Area under the curve. These features providing valuable insights into the dynamic growth behavior of the studied mutants.

auc — Figure 9. Detailed analysis of growth rates focusing on distinct phases of the growth curve. The features are marked on a selected curve: 1. Linear region-exponential phase, 2. Lag phase, 3. Stationary phase 4. Area under the curve. These features providing valuable insights into the dynamic growth behavior of the studied mutants.

Upon comparing the various features with the calculated copy numbers of the mutants, we observed considerable diversity among the results, exposing the large effect of the plasmid copy number on the growth curve (figure 10). Notably, a significant positive correlation emerged between the exponential phase's slope and the plasmid's copy number, contrary to our initial expectations (figure 11). We can assume that within the limited range of copy numbers we created, a higher plasmid count was matched with an elevated growth rate, likely due to the amplified expression of the antibiotic resistance gene in the plasmid. It is probable that further extension of the copy number range will result in the anticipated negative relationship, leading to a convex relation over the entire range. Further investigation is being performed to elucidate the underlying dynamics of this relationship and its implications for bioproduction optimization.

Figure 10. The association between the different features to the measured copy number exhibits large variability. Each sub-figure includes a dot plot of the feature values vs. plasmid copy number.

linreg — Figure 11. The correlation between the slope of the exponential phase exhibits a positive correlation to the plasmid copy number. P-values was calculated empirically based on a permutation test.

Additionally, our results show the connections between qPCR measurements of the mutant’s plasmid copy number to protein expression levels. We have used chromoprotein from the iGEM collection, which allows us to visually see different colors that correspond to protein levels and analyze them using image processing (figure 12). After sorting the color of the colonies according to their tones and comparing it to the plasmid copy number values, we were not surprised to find that the number of copies influences the colonies’ color and thus the protein expression levels. The mutants that showed a strong color had, as expected, a high number of copy numbers, and in contrast, the mutants that showed a weak color had a low number of copy numbers (figure 13).

chromopro — *Figure 12. Chromoprotein colors correspond to colonies with varying copy numbers.*

colorint — Figure 13. The connection between the plasmid copy number and the color of the colonies. Dark colonies exhibit averaged copy number of 165 while light color exhibit averaged copy number of 88. P-value= 2.27X10^-5 calculated with t-test.

Our analysis showed that the copy number influences both growth rate and protein levels, however, this relationship is complex supporting the need for developing sophisticated tools for copy number design.

Conclusions

In conclusion, our project has successfully introduced a robust predictive model for estimating and engineering pUC19 plasmid copy numbers, validated on a large-scale dataset with 366 RNAi promoters and 833 RNAp promoters with different plasmid copy numbers and comprehensive library of 15 modified plasmid sequences generated by a novel protocol. Our user-friendly software tool facilitates efficient plasmid design with desired copy numbers, while our bioinformatics analysis provides crucial insights into the modulators of plasmid replication in various hosts. The observed correlations between copy numbers, growth rates, and protein expression levels emphasize the model's practical significance in optimizing bioproduction processes. While our findings illuminate the potential of copy number engineering in synthetic biology, further research is performed now to unravel and model the intricate relationships between copy numbers and cellular dynamics, paving the way for groundbreaking advancements in the field.