Model


Modeling Fluid Flow in Isothermal Biochemical Synthesis (IBCS) Reactor

ANSYS Fluent is a computational fluid dynamics (CFD) software package we will be using to model fluid flow, chemical reactions and heat transfer. Our Product Development subteam will be designing an isothermal biochemical synthesis (IBCS) reactor for the efficient conversion of caffeine to paraxanthine/7-methylxanthine. Modeling of our IBCS reactor is important for a few reasons:

  1. The thermodynamics of the reactor must be modeled realistically to maximize the conversion rate within the reactor
  2. Modeling heat transfer will inform the use of a sous vide machine, as well as the method by which we maintain isothermal conditions in more complicated iterations
  3. Identifying potential pressure concentrations within our reactor early will guide our methodical scale-up process as we work to design a prototype for industrial scale-up
  4. Modeling both fluid flow and chemical reactions within the same software provides a comprehensive view
  5. Expanding this model and analyzing it in tandem will characterize all aspects of the reactor, providing a foundation for industrial scale-up and future prototyping

We started by creating a Poiseuille flow tube, which models the basic flow through the tube of the reactor. The mesh below models the upper half of the tube, and a full tube will be obtained by setting the object to be axisymmetric about the centerline. The full tube is a cylinder with a radius of 0.5 cm and a length of 10 cm.

Figure 1: Mesh of basic tube

The lower half of the tube is obtained by applying a reflection about the ZX plane. The obtained velocity contours show that there will be a higher velocity near the centerline, which sheds light on some of the challenges we could face when optimizing for yield.

Figure 2: Velocity Magnitude Contours of basic tube [m/s]

After experimenting with simple geometry, one bead was added to the simulation. First, the single bead was subtracted from the flow domain. The tube dimensions are the same as in the previous model, but with a bead of radius 0.2 cm centered 5 cm along the tube. The necessary boundary conditions were specified, and the whole volume in the simulation was meshed. Presuming laminar flow, the Reynolds number can be found.
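For a quick check on the laminar-flow presumption, the Reynolds number Re = ρvD/μ can be estimated from the stated tube geometry. The short R sketch below uses water-like properties and an assumed mean inlet velocity; these are illustrative values, not final design parameters.

```r
# Minimal sketch (assumed property values): Reynolds number for the tube,
# Re = rho * v * D / mu, using the 1 cm tube diameter from the model geometry.
rho <- 998      # kg/m^3, water at ~25 C (assumption)
mu  <- 1.0e-3   # Pa*s, dynamic viscosity of water (assumption)
v   <- 0.05     # m/s, assumed mean inlet velocity
D   <- 0.01     # m, tube diameter (radius 0.5 cm)

Re <- rho * v * D / mu
Re  # ~500 here, well below ~2300, so the laminar presumption is reasonable at this velocity
```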

Figure 3: Mesh of tube with bead

The velocity magnitude contours of this model were obtained below. From these, we can see that the upper and lower parts of the bead experience a higher velocity, translating to a lower reaction time and a higher exchange of reactants. However, the right and left parts experience a lower velocity, so there is less exchange of reagents and a higher reaction time. Although the changes in velocity around the bead are important to consider when optimizing the overall reaction rate and heat transfer in the tube, this problem may be lessened when the beads are uniformly layered within the tube, as the gaps between each bead will be more uniform.

Figure 4: Velocity Magnitude Contours of tube with bead [m/s]

Once the simulation for one bead was complete, the beads could be layered in one plane. To obtain this model, we created a 3D tube with the same radius and length and built a slice of 1 cm beads. The slice of beads was then duplicated across the tube using the pattern tool.

We began with a coarse mesh resolution to verify that the flow was moving in the correct direction.

Contours of concentrations represented in the reaction were plotted, comparing the locations of product formation, as well as the location of the reactants. When the fluid comes into contact with the beads, the velocity of the fluid changes, so the chemical reaction is adjusted accordingly. Additionally, to keep the walls at a constant temperature, the beads are modeled as heat sinks. This is because the beads absorb heat from within the fluid. Using the equation dU = dQ - dW, energy equations can be coupled to flow equations. Alternatively, heat flux boundary conditions can be used to account for how heat flows into the beads.
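To illustrate how the energy balance could be used to size the heat the beads must absorb, the sketch below applies a simple steady-state sensible-heat balance derived from dU = dQ − dW with negligible work; every numeric value is an assumed, illustrative input rather than a design parameter.

```r
# Minimal sketch (assumed values): heat the beads/walls must remove per second
# to hold the fluid isothermal, Q = m_dot * cp * dT.
rho <- 998      # kg/m^3, water-like fluid (assumption)
cp  <- 4182     # J/(kg*K), specific heat of water (assumption)
v   <- 0.05     # m/s, assumed mean inlet velocity
r   <- 0.005    # m, tube radius (0.5 cm, from the model geometry)
dT  <- 0.5      # K, hypothetical temperature rise to be removed

A     <- pi * r^2        # tube cross-sectional area, m^2
m_dot <- rho * v * A     # mass flow rate, kg/s
Q     <- m_dot * cp * dT # required heat removal, W
Q
```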

Figure 5: Turbulent modeled exposure of pressure on a bead from a flow
Figure 6: Reassessment of the change in velocity in an SST k-omega turbulence model that accounts for drag showed key differences, namely the diffusive velocity at the edge of the beads and the buildup of a more drastic low-velocity, more stagnant fluid environment behind the bead. The extreme shifts in velocity may necessitate a change in the scaffolding of the bead to ensure it is not damaged by the constant influx of fluid.

Looking at how the turbulence model affects the exposure of fluid on the bead, it is interesting to note that the greatest exposure of fluid to the bead occurs along the x-axis of the bead, both at the immediate point of contact and in the more stagnant drag environment on the opposite side of the flow. In contrast, the vertical poles experience the greatest fluid velocity, but on the bead itself there is low pressure, which would mean that there is a lack of contact between the fluid and the bead. The turbulent assessment is important to accurately gauge the relationship between the speed and drag of the fluid and its effect on contact with the bead.

In the third iteration, I looked at how the final design of the internal bead immobilizer would interact with a turbulent flow environment. By understanding the fluid movement, it would be possible to identify areas of lower-velocity depressions that could lead to deposit buildup, as well as areas of higher-velocity flow where degradation may be accelerated. Using the model that was developed in ANSYS, the geometry was shrink-wrapped to simplify the planar interfaces. While it would be possible to use the original design to calculate turbulent flow velocities, it would prove challenging due to the sheer number of faces the geometry would possess, and the shared topology would overcomplicate the design. Therefore, a tight shrink wrap with an enclosure of 0.1% around the body of the geometry allowed for simplification of the model without compromising the overall behavior of the design in response to the turbulent flow. A cylindrical enclosure was subsequently designed around the geometry to serve as the bounded region in which fluid would interact with the geometry.

Figure 7: Shrink-wrapped geometry, shown in green, demonstrates a compromise between simplification of the figure and preservation of the overall integrity and character of the product. The blue circle is the opening of the enclosure (rest not shown).

An automatic Gaussian surface mesh wrap was performed on the design, which yielded 89 million surface nodes and 20 million cells across the geometry and the enclosure. Though ideally this automatic specification would allow for an optimized iterative calculation of the geometry, due to the capacity of the software available, the number of nodes needed to be reduced to more economically develop a solution to the flow design. To perform an iterative reduction of surface nodes, the design was remeshed according to a reduction algorithm in which smaller cells were combined and solved as larger, more complex surface polyhedra. These aggregated polyhedra were enlarged based on an increasing maximum-size specification, and the software continued to merge polyhedra into larger cells until an exponential reduction in interfaces and nodes was achieved. I designed the reduction procedure as a way to exponentially decrease the number of volume subsets that would need to be iteratively solved without biasing the results toward an oversimplification of the geometry. Specifying smaller cells between the topology of the geometry and the enclosure ensured that the calculations near the surface of the immobilizer would remain accurate, while larger polyhedral meshing near the boundaries of the enclosure and within the immobilizer compensated for the level of precision required at the shared topology. The design was effectively simplified to 4 million nodes and 89,000 cells in the geometry, an effective 20x reduction in mesh size whilst still maintaining acceptable tolerances for the iterative mechanics.

Figure 8: From top left clockwise: original design of the polyhedral mesh was calculated to be 89 million nodes with 20 million cells, which surpassed the processing power and capabilities of the software available. Top right: overall geometry fully meshed, demonstrating a level of nodal precision that was arguably providing an optimized solution with the highest level of precision. Bottom right: the design for maxima and minima of cell volume originally specified. Bottom left: spliced view of the polyhedral cells.
Figure 9: Top: continuation of the reduction algorithm saw increasing tolerances applied for acceptable maxima and minima volume of cells, which reduced the nodal surfaces by aggregating neighboring cells into a larger polyhedral cell. Middle: closeup splices demonstrate smaller cell tolerances at the shared topology between the geometry and the enclosure to account for the level of precision required at the part where the fluid would interact with the geometry. Successive nodes both at the end of the enclosure and internal to the geometry have a larger cell to compensate for the overall number of cells required.

Having meshed the geometry, the initial governing equations could be set. Flow was modeled with the generalized k-omega (GEKO) turbulence model with standard conditions of viscosity and flow interface interactions. Fluid was run through at a theoretical velocity of 40 m/s, which was adjustable to the design specifications of the team. I set the governing flow to run parallel along the enclosure and initially planned a residual run of 100 iterations to give both the drag and lift coefficients time to stabilize. However, it was apparent that despite the optimization of the mesh above, the design was still incredibly complicated, and each iterative step took twenty minutes to complete. Therefore, the process was terminated at 62 iterations, as the coefficients of drag and lift had both stabilized, and though not all residuals had fallen below the initial tolerance value of 10^-3, the z-velocity residual, the most important one since the fluid was flowing along that axis, had fallen below 10^-3, which was acceptable. Despite the continuity and k residuals not falling below the initial 10^-3 target, the number of iterations run should still provide an overall accurate assessment of flows and stresses along the geometry, as those residuals were still within acceptable standards. In future iterations, I would be keen to run the design on software with greater processing capabilities, and I would prefer to reduce my time step to better simulate the flow, but for the purposes of the team, I decided the flow and iterative calculations were within acceptable standards to generally visualize points of stress and dynamic changes in fluid flow.

Figure 10: Top Left: enclosure fully in view, set with the appropriate governing conditions. Top Right: residual plots demonstrating the z-velocity residual falling below 10^-3 in 62 iterations, which was deemed acceptable for visualization. The drag and lift coefficients demonstrate a leveling out, which shows the overall iterative solution has reached acceptable levels of accuracy.

Upon visualization, it is apparent that in the body of the enclosure the velocity is rather constant, but the reams in the geometry do force some internal mixing and eddy currents at the intersection of the topology and the fluid region. The design shows eddy rifts that develop currents and depressions of flow between the screw rings of the design, where fluid is drawn into successively slower depressions that allow for mixing and interaction of the fluid solution with the immobilized beads placed on the reams of the geometry. Studying a cross-section further reinforces this notion. Pressure due to the fluid flow does not seem to affect the screw geometry except at one of the ends that is in greatest contact with the fluid, which shows the design is not prone to breakage due to turbulent flow.

Figure 11: Top Left: Total isometric profile highlighting low variation in velocity at edge boundaries. Top Right: side profile of turbulent fluid flow with the immobilization geometry sectioned. Bottom Left: turbulent fluid flow with the immobilization geometry incorporated, from a different angle. Bottom Right: Pressure profile of fluid on the immobilization geometry, highlighting the low overall strain placed on the immobilization geometry due to fluid pressure.

Modeling Error-Prone PCR (epPCR) in silico

Background

The premise of our project is to optimize caffeine-metabolizing enzymes, namely NdmA, NdmB, and NdmD. This optimization will allow us to generate variants that are superior to the original versions. If we wish to reduce the cost of 7-MX and provide a viable oral treatment for myopia, we need to be able to produce 7-MX faster than traditional methods allow. Literature review has indicated that a coculture of E. coli, with one strain producing NdmA + NdmD and the other producing NdmB + NdmD, yielded a conversion rate of 85%. Team Cornell will not be using a coculture due to the nature of our project. Thus, we look for alternative methods to optimize the enzymes and establish a conversion rate of 85% or higher.

Error-prone PCR (epPCR) is a technique used to randomly introduce mutations into a DNA sequence. It is typically used for studying protein structure and function, and for directed evolution experiments (which is a big part of our methodology, learn more about this process in our project description and competition video!).

In epPCR, mutations are usually introduced by using a low-fidelity DNA polymerase. Instead of using a new polymerase, we decided to lower the fidelity of New England Biolabs’ Q5 High-Fidelity polymerase by adding various amounts of MnCl2.

The following table illustrates the effect of MnCl2 on the fidelity of DNA polymerase:

Effect of MnCl2 on fidelity | Explanation

Higher Error Rate
  • Mn2+ ions are less specific in stabilizing correct base pairings during DNA replication
  • Decreased specificity = increased mutation rate

Template-Dependent Errors
  • Certain sequences may be more prone to errors due to the interaction between template DNA and Mn2+

Slower Replication
  • Mn2+ is a less efficient cofactor
  • Replication slows down, allowing more time for errors to occur

Figure 12: Explanation of the effect of MnCl2 on the fidelity of DNA polymerases

Strengths and Weaknesses of epPCR

The strengths and weaknesses of epPCR must be considered when constructing a simulation of it. The accuracy of this model will influence the method in which we decide to conduct epPCR experiments and utilize it for our ADE (adaptive/directed evolution) experiments.

Strengths

  • Generates a wide variety of mutations based on the concentration of MnCl2 in the PCR reaction
  • Can be applied to any DNA template (in our case, an entire plasmid)
  • Mutations are random
  • Diverse mutant libraries can be generated based on [MnCl2]

Weaknesses

  • Certain motifs may be more prone to mutations than others (consider mutation hotspots, which are incorporated in the model)
  • No control over the types of mutations induced (substitutions, insertions, deletions)
  • Primer design can become challenging
  • A balance must be struck between generating mutations vs increasing the likelihood of non-functional variants

At first glance, epPCR seems to be more trouble than it is worth. However, the needs of our project fit the strengths of epPCR. We are looking to generate random mutations because the tertiary structure of the N-demethylases is unknown. The generation of diverse mutant libraries directly supports our ADE cycle, meaning that this is a key strength of epPCR in relation to our project.

The model that will be developed in silico will focus on predicting the effect of the weaknesses of epPCR. We will incorporate each weakness into the model to generate a thorough simulation of the error-prone PCR. All simulations were performed in RStudio Version 2023.06.1 Build 524. The following iterations will be conducted/analyzed as follows:

  1. Define a basic epPCR function in RStudio based on mutation rate
  2. Modify epPCR function to predict [MnCl2]
  3. Nonlinear Iteration of epPCR function (predict [MnCl2])
  4. Increase complexity by introducing different mutation types
    1. Substitutions, Deletions, and Insertions
  5. Simulate effect of finalized DNA mutation model on amino acid primary sequence

Governing Assumptions

  1. Independence of Mutations: if one base is mutated, whether or not the next base is mutated is independent of the result of the previous base.
    1. In reality, adjacent mutations can influence each other due to the presence of DNA repair mechanisms
  2. Equal Probability of Mutation for Each Base: each base has an equal probability of being used in a mutation
    1. In reality, the propensity of a base to be mutated depends on the polymerase’s fidelity and the template strand
  3. Constant Mutation Rate: the entire sequence has a constant mutation rate (though this facet will change as the model increases in complexity)
    1. In reality, primer design, reaction conditions, and polymerase fidelity can affect the mutation rate
  4. No Incorporation Bias: All mutated bases are incorporated by DNA polymerase with equal efficiency
    1. In reality, a polymerase may have a bias regarding the incorporation of specific bases
  5. Constant Number of Doublings: 5 doublings were run per epPCR cycle
    1. In reality, the number of doublings can affect the amount of mutation, as established below

Basis of Model

A study performed by David Wilson and Tony Keefe in 2000 discusses “Random Mutagenesis by PCR”. They found that the prevalence of mutations varies with [MnCl2], the number of doublings, and the length of DNA in a linear fashion. The following figures demonstrate the pattern they found ([MnCl2] = 25 mM):

# of Doublings    Length of DNA (bp): 100    200    400    800    1600    Mutations per Nucleotide Position
5                                     0.33   0.66   1.3    2.6    5.3     0.0033
10                                    0.66   1.3    2.6    5.3    11      0.0066
20                                    1.3    2.6    5.3    11     21      0.013
30                                    2.0    4.0    7.9    16     32      0.02
50                                    3.3    6.6    13     26     53      0.033

Figure 13: Table of values correlating # of doublings, length of DNA template, and mutations per nucleotide position. The values directly underneath each length of DNA value represent average number of mutations per DNA template of a certain length after a certain number of doublings

Unfortunately, we were not able to collect sufficient data either to replicate these results or to accumulate our own data regarding mutations per nucleotide position at different [MnCl2]. We started the process, but due to circumstances out of our control (delays in shipments, incorrect primers), we could not collect the breadth of data that we hoped to. Nevertheless, the data provided by Wilson and Keefe served as the basis of all of our computational modeling regarding epPCR. Models will be improved as more data is experimentally determined. Specifically, the regression and interpolation models presented in steps 2 and 3 will use collected data as training data to help provide better predictions for [MnCl2].

Figure 14: Left: Linear relationship established between number of doublings and mutation per nucleotide position. It is independent of length of DNA. Right: Linear relationship established between length of DNA and total number of mutations per length. 5 curves plotted, each representing a different number of doublings. Establishes that number of mutations accrued as DNA length increases changes based on number of doublings. Logically, a greater number of doublings increases the total number of mutations per DNA strand.

Step 1: Simulating epPCR Based on Constant Mutation Rate

Expanding on the data found in Wilson and Keefe’s study, we can estimate a constant rate of mutation for pENGM42. The length of pENGM42 is 7181 bp. For this simulation, we assume that # of doublings = 5 to simplify the function and the model. Below is a screenshot of the implemented RStudio Code for this function with comments:

Figure 15: RStudio implementation of step 1. Likelihood of mutation is determined by comparing random number generated from uniform distribution and determining if the number is less than the value of the mutation rate, which equals 0.0033 (representing 5 doublings). If a mutation is to be generated, depending on the base pair being considered, a mutation will be generated so that an “A” will only be replaced by “T”, “C”, “G” and cannot be replaced by itself. Total number of mutations was recorded and returned.
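Because Figure 15 is a screenshot, a minimal R sketch of the same logic is reproduced below. It is reconstructed from the caption rather than copied from our code, and the function name simulate_eppcr is introduced here for illustration only.

```r
# Minimal sketch of the step 1 logic: each base mutates with probability equal
# to the mutation rate (0.0033 ~ 5 doublings), and a mutated base is replaced
# by a base other than itself. The total number of mutations is returned.
simulate_eppcr <- function(sequence, mutation_rate = 0.0033) {
  bases <- strsplit(sequence, "")[[1]]
  n_mut <- 0
  for (i in seq_along(bases)) {
    if (runif(1) < mutation_rate) {                        # does this position mutate?
      allowed  <- setdiff(c("A", "T", "C", "G"), bases[i]) # a base cannot replace itself
      bases[i] <- sample(allowed, 1)
      n_mut    <- n_mut + 1
    }
  }
  list(mutated = paste(bases, collapse = ""), n_mutations = n_mut)
}

# Example call on the 40 bp trial sequence described below
simulate_eppcr("AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG")
```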

We implemented the above code with this 40 bp sequence as a trial run “AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG”. After running the code 6 times, we finally received a mutated version of the sequence, pictured below:

Figure 16: Result of 40 bp sequence going through epPCR. 2 mutations were generated after running the code 6 times over.

The fact that we had to run this code 6 times over indicates that for a small 40 bp sequence like this, the rate of mutation is very low. Two mutations is a lot for a 40 bp segment of DNA, but it took 6 trials for any mutations to be generated. Thus, we can conclude that the number of mutations generated should not heavily affect the function of this sequence. However, we will assess this in a more complex iteration of this code.

To analyze the effect of epPCR with 5 doublings on our plasmid, we used the function on the full 7181 bp sequence of pENGM42. Due to the length of the plasmid, the input sequence was split into 5 input sequences and then all 5 sequences were pasted together in one sequence. The code was run 100 times to output the following data:

Figure 17: 100 iterations of epPCR simulated on pENGM42. Mutation rate set to 0.0033, 0.33%. Resembles a right-skewed version of a uniform distribution. Average number of mutations after 100 iterations is 24.28.
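Under the same assumptions, the 100-iteration run could be scripted roughly as follows, where pENGM42_seq is a hypothetical character string holding the full plasmid sequence and simulate_eppcr is the sketch shown earlier.

```r
# Hypothetical wrapper: repeat the step 1 simulation 100 times on the plasmid
# and summarize the number of mutations per iteration.
mutation_counts <- replicate(100, simulate_eppcr(pENGM42_seq, 0.0033)$n_mutations)
hist(mutation_counts, main = "Mutations per iteration", xlab = "Number of mutations")
mean(mutation_counts)
```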

The mutation rate of 0.33% is relatively low, but for our purposes, it is important that we don’t risk generating too many mutations. Note that the number of mutations can vary stochastically. In some experiments, you might observe more or fewer mutations due to chance. Replicating the experiment multiple times could provide a more robust understanding of the variability in the number and locations of mutations.

Furthermore, the low mutation rate is important because epPCR generates random mutations. In silico, it is easy to identify where a mutation is by comparing the original and mutated sequences against each other. However, it is not as easy to identify the location of mutations in experiments, because epPCR generates many different variants. Sanger sequencing or next-generation sequencing will generate different results for sequences because, within a sample, there can be thousands of variants.

However, we note that after 5 doublings, 99.7% of the sequence is maintained. This is important for three reasons:

  1. Preservation of sequence integrity: the conservation of the genetic sequence is vital for maintaining the functionality and stability of proteins.
  2. More Representative of Natural Genetic Variation: Sporadic mutations can represent natural genetic variation without significantly compromising the overall structure or function of the biological entity.
  3. Maintaining Experimental Control: Low mutation rates ensure that time is not wasted generating a large library full of non-functional variants. This allows us to actually use these variants in our ADE cycle.

To further validate that 5 doublings and a 0.33% mutation rate is a good option for our experiments, we ran this simple epPCR simulation with 10 doublings (0.66% mutation rate), 20 doublings (1.3% mutation rate), and 50 doublings (3.3% mutation rate). The results are presented below:

Figure 18: Top Left: 100 iterations of epPCR simulated on pENGM42. Mutation rate set to 0.0066, 0.66%. Average number of mutations after 100 iterations is 47.88. Top Right: 100 iterations of epPCR simulated on pENGM42. Mutation rate set to 0.013, 1.3%. Average number of mutations after 100 iterations is 94.98. Bottom: 100 iterations of epPCR simulated on pENGM42. Mutation rate set to 0.033, 3.3%. Average number of mutations after 100 iterations is 236.96.

The x-axis scale in these figures has changed drastically, though the distribution of each histogram looks very similar. Clearly, increasing the number of doublings affects the rate of mutation. Overall sequence preservation is 99.3%, 98.7%, and 96.7% for 10 doublings, 20 doublings, and 50 doublings, respectively. Conducting 100 iterations of these simulations makes the results more statistically robust.

The higher mutation rates lead to a diverse pool of genetic variants, thus increasing the amount of genetic diversity which could potentially help to identify a superior variant. However, the higher the number of mutations, the higher the likelihood of mutations occurring in critical regions of genes or proteins.

Therefore, we learned some important lessons from this rudimentary epPCR simulation:

  1. A balance must be struck between generating higher rates of mutation and risking protein function
  2. Our original choice of 5 doublings is a safe option
  3. It could be worth performing 10 doublings in epPCR reactions, but this should be a consideration made after collecting preliminary data
  4. The high number of iterations indicates that we can be confident in our experimental selections

Step 2: Suggesting MnCl2 Concentrations For a Specific Rate of Mutation

Wilson and Keefe’s data was all collected at 25 mM concentration of MnCl2. However, the concentration of MnCl2 also affects the generated mutant. The previous approach characterized in step 1 is beneficial for numerous reasons:

  1. Maintains consistent conditions: Consistency is crucial for reproducibility of our experiments. A crucial initial step of our experimental epPCR trials is to replicate Wilson and Keefe’s data.
  2. Standardization: allows for easy establishment of standard protocols, which is an important result that we look to present
  3. Predictable Mutagenesis: Users of this protocol can predict, to some extent, an expected “total number of mutations” for a template based on the linear relationship between [MnCl2] and the rate of mutation, which is valuable when designing experiments related to mutagenesis

However, the approach in step 1 has a few downfalls, which step 2 aims to correct.

  1. Low Number of Doublings: The major downside of step 1 is that in order for a 0.0033 mutation rate to be obtained, only 5 doublings can be performed. However, we look to incorporate the mutants in a plasmid. More copies are needed than 5 doublings can provide.
  2. Limited Flexibility: Constant MnCl2 concentrations limit the ability to fine-tune the mutation rate. The only alternative would be to increase the number of doublings, which is not the best way to fine-tune the mutation rate.

Thus, the guiding principle behind step 2 is the following: we specify a desired rate of mutation and a number of doublings, and the program suggests a concentration of MnCl2 to use via linear interpolation. This model has one key assumption: it assumes linearity between doublings and mutation rates. This relationship was demonstrated at an MnCl2 concentration of 25 mM. However, when expanding the model beyond the provided data, if there is a nonlinear relationship between doublings and mutation rates, this model will not be accurate. Below is the code for this function:

Figure 19: Code for linear interpolation to identify [MnCl2] based on number of doublings and desired mutation rate. The function considers known data, provided by Wilson and Keefe’s data. The function takes two parameters, number of doublings and desired mutation rate per nucleotide position. The approx() function is used for linear interpolation, and the predicted concentration is calculated via the above formula.
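Because Figure 19 is a screenshot, a minimal sketch of the interpolation idea is given below. The reference points come from the Wilson and Keefe table above (all at 25 mM), and the proportional scaling of [MnCl2] with the desired rate is an assumed form for illustration, not necessarily the exact formula in our code.

```r
# Minimal sketch: suggest an MnCl2 concentration for a desired mutation rate
# and number of doublings, using approx() over the 25 mM reference data.
suggest_MnCl2 <- function(desired_rate, doublings) {
  ref_doublings <- c(5, 10, 20, 30, 50)                  # from Figure 13
  ref_rates     <- c(0.0033, 0.0066, 0.013, 0.02, 0.033) # mutations per position at 25 mM

  # rate expected at 25 mM for the requested number of doublings
  rate_at_25mM <- approx(ref_doublings, ref_rates, xout = doublings)$y

  # linear assumption: [MnCl2] scales proportionally with the desired rate
  25 * desired_rate / rate_at_25mM
}

suggest_MnCl2(desired_rate = 0.0033, doublings = 30)  # predicted [MnCl2] in mM
```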

Linear interpolation is a very straightforward, flexible model that allows one to predict MnCl2 concentration for various desired mutation rates and doublings. It also does not rely on external libraries which makes this method easy to use in any RStudio environment. This is beneficial for iGEM teams who are not as experienced with RStudio.

However, linear interpolation relies on the linear assumption, which assumes that there is a linear relationship between the number of doublings and mutation rates. In reality, we know that the linear relationship is not always true (consider the presence of CpG dinucleotides and its effect on mutation). Therefore, for nonlinear relationships, this model will not be the most accurate.

The linear assumption is a valuable assumption to make as we saw in step 1 that there is a linear relationship between length of sequence and total number of mutations. Wilson and Keefe also established that there is a linear relationship between [MnCl2] and rate of mutation. We can replicate those results in silico by plotting the desired rate of mutation vs [MnCl2] using a constant doubling number of 30 (this is the number of doublings we perform in standard PCR reactions). The figure can be found below:

Figure 20: Top Left: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 10 doublings. Top Right: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 25 doublings. Bottom: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 30 doublings. Linear relationship established in all three trials indicating that under the linear assumption, mutation rate is independent of doubling times.

The observation of a linear relationship between the desired mutation rate and [MnCl2] suggests a consistent and predictable pattern. This linearity simplifies the prediction process, making it easier to estimate the required MnCl2 concentration for a desired rate of mutation. The figure above also demonstrates that under the linear assumption, the mutation rate per nucleotide position is independent of the number of doublings. However, [MnCl2] must be changed accordingly. This could have practical implications in experimental design, allowing iGEM teams to maintain a consistent rate of mutation regardless of the duration.

The linear relationship established here provides a simple, useful, predictive model. Given a desired mutation rate, iGEM teams can use this model to estimate the MnCl2 concentration needed to achieve the target rate. The results here are only predictions, they must be supported by performing experiments in the lab. Due to time constraints, we did not gain sufficient data to present regarding this. However, the model will continue to be refined and complexity will continue to be added to make this model realistic.

Step 3: Suggesting [MnCl2] For a Specific Rate of Mutation (Nonlinear Conditions)

The linear assumption is a great one to make for the sake of simplicity and efficiency of calculation/generation of curves. However, as discussed before, epPCR can follow some nonlinear behavior. An example of this is the effect of a previous mutation on the next nucleotide being added. DNA repair mechanisms can affect the prevalence of a mutation. Based on this example, we can identify a few reasons why nonlinear relationships must be considered for epPCR:

  1. Biological Processes Are Complex: As shown in the example, biological systems often involve intricate and nonlinear interactions due to the multifaceted nature of biochemical reactions
  2. Realism and Accuracy: Taking nonlinear relationships into account captures nuances and subtleties of biological systems in the model
  3. Data Driven Insights: Nonlinear models can extract meaningful insights from experimental data. This allows us and other iGEM teams to experiment with different [MnCl2], different number of doublings, and different lengths of DNA and compare that with the results of the model. This process can also help to improve the model.

Below is a picture of the code written for this simulation:

Figure 21: Nonlinear regression model for epPCR. The function takes two parameters: number of doublings and desired rate of mutation. The coefficients a and b define the nonlinear model: a represents the effect of doublings on [MnCl2], and b represents the effect of the desired mutation rate on [MnCl2].
Figure 22: Top Left: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 10 doublings. Top Right: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 25 doublings. Bottom: Relationship between desired mutation rate per nucleotide position and predicted [MnCl2] with 30 doublings. This is under the condition of a nonlinear regression model, but linear relationship generated anyway.
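For illustration, one way such a nonlinear fit could be set up in R is sketched below. The power-law form and the data frame train (assumed to hold measured columns doublings, rate, and conc) are hypothetical stand-ins, not our exact model, and the starting values would need tuning against real data.

```r
# Hypothetical sketch: nonlinear least-squares fit of [MnCl2] against the
# number of doublings and the desired mutation rate, assuming a data frame
# `train` with columns doublings, rate, and conc.
fit <- nls(conc ~ a * doublings^p + b * rate^q,
           data  = train,
           start = list(a = 1, p = 1, b = 1, q = 1))
summary(fit)

# Predicted [MnCl2] for a requested doubling count and mutation rate
predict(fit, newdata = data.frame(doublings = 30, rate = 0.0033))
```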

The nonlinear regression model generated a linear relationship between the desired mutation rate and estimated [MnCl2]. This seems counterintuitive, but we believe it could have occurred for a few reasons. The first is that epPCR may actually follow a linear relationship between desired mutation rate and predicted [MnCl2]. Therefore, predicting [MnCl2] may be more accurate under the linear assumption. However, this suggestion is unlikely considering that DNA synthesis is a complex biochemical process. There could also have been noise in the training data used for this model. In the presence of substantial noise, a nonlinear model will magnify the effects of that noise.

Lastly, if the training data were not precise, they may not reflect the true nonlinear behavior of the system. The training data were sourced primarily from Wilson and Keefe’s established dataset. If the nonlinear regression coefficients were not calculated properly, this could affect the results of the model. Furthermore, a polynomial regression model may better capture the nonlinear behavior of epPCR. However, when such a model was implemented, the training data were not accurate enough. More training data are required for a polynomial model to provide an accurate model of epPCR.

As an aside, we do not need to test our plasmid against this model due to the fact that we have already pre-chosen the rate of mutation per nucleotide position. While understanding the linear relationship is important, the model must be confirmed experimentally. Thus, this model informed our decision to experiment with different concentrations as we saw it as a necessary step to ensure that we get a large quantity of DNA as it undergoes epPCR. Furthermore, we have no confirmation whether epPCR is nonlinear. The data already established by Wilson and Keefe establishes that there is a linear relationship between doublings and rate of mutation. Logically, it is hard to establish a reason as to why linearity would not hold with different concentrations of MnCl2. We realize that this is a big assumption so we plan to test our hypothesis with a range of epPCR experiments to confirm either the linear or nonlinear model. Due to time constraints, we were not able to collect sufficient data to confirm our hypothesis. This data will be presented at the Grand Jamboree.

The nonlinear model, while still quite accurate, cannot fully encompass the multifaceted nature of biological processes. There are many technicalities to consider, but we will consider only a few of them, as these are the most relevant to the nature of ENERGEM. These technicalities include:

  • Single Nucleotide Polymorphisms (SNPs)
    • Insertions
    • Deletions
    • Substitutions
  • Mutation Context: the effect of neighboring bases on the mutation probability
  • Environmental Factors
    • Temperature
    • pH

Step 4: Incorporating Different Mutation Types

The mutations being considered in step 1 are substitutions, which are a form of SNPs. The model in step 1 is a simplified version that doesn’t reflect biological realism. In true biological systems, different types of mutations have different effects on protein structure and function. By modeling these mutation types, we can simulate the diversity of mutations found in the natural population. Though epPCR is a laboratory technique, the ADE cycle relies on natural selection as we look to isolate the best variants. Thus, mimicking biological realism is key. We focus on three mutation types: substitutions, insertions, and deletions.

A substitution will only affect a single amino acid in the protein sequence. This may or may not affect the function of the enzyme depending on the location of the mutation. In contrast, insertions and deletions cause frameshift mutations, altering the entire protein sequence downstream of the mutation site. Frameshift mutations often result in non-functional proteins; thus, understanding the likelihood of a frameshift mutation to occur is important. This model will focus on assessing the effect of frameshift mutations on the DNA sequence of samples and our plasmid. A future iteration will discuss the likelihood a mutation will generate a non-functional protein (it will take environmental considerations into account as well).

The assumptions of this simulation are as follows:

  1. [MnCl2] = 25 mM: This decision was made as Wilson and Keefe’s data is confirmed and it serves as a good comparison against this simulation
  2. Mutations are discrete events: this establishes the independence of mutations and ignores complex biochemical processes like replication slippage
  3. Homogenous Environment: Optimal reaction conditions were maintained during this simulation.
  4. The probability of each mutation type is equal: this helps simplify the model although in biological contexts, this is not true
  5. Linear Assumption: linear relationship between the number of doublings and mutation rates, as established by Wilson and Keefe’s data.

The figure below contains a commented version of the code for this simulation. This simulation is a sophisticated version of the one presented in step 1. The assumptions above are addressed (assumption 3 is indirectly addressed by not considering the effect of reaction conditions within the model).

Figure 23: Commented version of step 4 model. A function called simulate_mutations was defined and takes three parameters: sequence, desired rate of mutation, and number of doublings. If a mutation occurs, the type of mutation is randomly, nonspecifically selected. The function returns the mutated sequence, number of substitutions, insertions, and deletions, number of frameshift mutations, and the position at which a mutation occurs.

The sophistication of this model is well beyond the previous simulations presented. As such, we provide a more detailed breakdown of the function here; an illustrative sketch of its logic follows the list:

  1. The function “simulate_mutations” takes three parameters: the original DNA sequence, the probability of a mutation occurring, and the number of doublings in the PCR machine
  2. A mutation occurs if a randomly generated number from a uniform distribution is lower than the specified mutation rate
  3. Following assumption 4, the likelihood of each mutation type is equal. If a mutation occurs, the function randomly selects if a substitution, insertion, or deletion occurs
  4. The code for a substitution mutation is greatly simplified in this function. In step 1, we specified specific if-elseif-else statements depending on if the base in question was an A, T, G, or C. In this simulation, we sample from a vector containing all 4 bases, but the function will compare which base is mutating and automatically remove that base from the “allowed_bases” variable. So if an A is being mutated, it will only be replaced by T, C, or G. The same pattern holds for the other nucleotides
  5. If an insertion mutation occurs, a random base is inserted before the current position
  6. If a deletion mutation occurs, the base at the current position is removed
  7. The type and position of each mutation is recorded and stored
  8. Insertions and deletions are classified as frameshift mutations. For our purposes, if a frameshift occurs, the DNA sequence will likely yield a non-functional protein. Thus, keeping track of the number of frameshift mutations and the likelihood that one occurs is important information for our experimentation.
  9. A text-based graph is generated to visually see the occurrence of each mutation
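A minimal R sketch of this logic, reconstructed from the breakdown above rather than copied from Figure 23, is shown below; the name simulate_mutations_sketch and the exact return structure are illustrative choices.

```r
# Minimal sketch of the step 4 logic: substitutions, insertions, and deletions
# are equally likely (assumption 4), and insertions/deletions count as
# frameshift events. `doublings` is kept for interface parity; under the
# linear assumption the mutation_rate passed in already reflects it.
simulate_mutations_sketch <- function(sequence, mutation_rate, doublings) {
  bases <- strsplit(sequence, "")[[1]]
  out   <- character(0)
  n_sub <- n_ins <- n_del <- 0
  positions <- integer(0)
  types     <- character(0)

  for (i in seq_along(bases)) {
    if (runif(1) < mutation_rate) {
      type      <- sample(c("substitution", "insertion", "deletion"), 1)
      positions <- c(positions, i)
      types     <- c(types, type)
      if (type == "substitution") {
        allowed <- setdiff(c("A", "T", "C", "G"), bases[i])  # never replace with itself
        out     <- c(out, sample(allowed, 1))
        n_sub   <- n_sub + 1
      } else if (type == "insertion") {
        out   <- c(out, sample(c("A", "T", "C", "G"), 1), bases[i])  # insert before current base
        n_ins <- n_ins + 1
      } else {
        n_del <- n_del + 1                                   # deletion: drop current base
      }
    } else {
      out <- c(out, bases[i])
    }
  }

  list(mutated                = paste(out, collapse = ""),
       substitutions          = n_sub,
       insertions             = n_ins,
       deletions              = n_del,
       frameshifts            = n_ins + n_del,
       accumulated_frameshift = abs(n_ins - n_del),
       positions              = positions,
       types                  = types)
}

simulate_mutations_sketch("AAAAAAAAAATTTTTTTTTTGGGGGGGGGGCCCCCCCCCC", 0.0033, 30)
```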

This model also yields 5 important results: # of substitutions, deletions, and insertions, number of frameshift mutations, and the position of a mutation.

In reality, the position of a mutation does not matter since in the lab, we won’t know the position of a mutation until after sequencing. Furthermore, since NdmA, NdmB, and NdmD aren’t thoroughly documented, and since software tools like AlphaFold and Rosetta sometimes provide inaccurate models of protein structure, the position of a mutation is somewhat inconsequential to us in reality because there are no actionable steps we can take to correct it. However, this information is helpful as we look to improve this simulation and generate useful visualizations related to how our genetic parts and plasmid may be affected as they undergo epPCR. Figures can be generated based on the number of mutations observed and the type of mutation.

A big strength of this model is that it records the number of frameshift mutations. A single frameshift mutation can result in a completely nonfunctional mutant. However, if frameshift mutations occur in multiples of three (3, 6, 9, 12…), then only a few codons would be changed and the genetic part could still be worth testing. If the number of frameshift mutations is not a multiple of three, the mutant is most likely non-functional. The number of frameshift mutations is also directly impacted by the assumption that the probability of each mutation type is equal. In reality, this is not true. For example, frameshift mutations are more likely in regions of high repetition in sequences, as repetitive sequences are prone to slippage during DNA replication. We make the assumption that all mutation types have an equal probability to simplify the model and be more computationally efficient.

Below are the data and figures generated from an example sequence, “AAAAAAAAAATTTTTTTTTTGGGGGGGGGGCCCCCCCCCC”, with a mutation rate of 0.0033 and 30 doublings.

Figure 24: Top: Output of running function simulate_mutations with a mutation rate of 0.0033 and 30 doublings. 3 substitutions, 1 insertion, and 2 deletions were generated. Positions of mutations were listed, as well as what type of mutation they are. Bottom: Barplot showing number of substitutions, insertions, deletions, and frameshifts.

In a previous paragraph, we mentioned that if the number of frameshift mutations is a multiple of three, then the open reading frame would not be affected and the variant would be preserved. However, this simulation shows otherwise. The number of frameshift mutations is equal to 2, with one insertion and one deletion. The deletion cancels out the insertion. Thus, we modify our previous statement and instead assert that if |# of insertions − # of deletions| is a multiple of three, then the open reading frame will be maintained. We utilize this new criterion and rerun this simulation, shown below:

Figure 25: For all images, a modified sequence was used, “AAAAAAAAATTTTTTTTTGGGGGGGGGCCCCCCCCC” (1x A, T, G, C removed). Top: Output of the simulation run with the same original sequence, a mutation rate of 0.033, and 30 doublings. 2nd from top: Graph depicting mutation types and accumulated frameshift, or |# of insertions − # of deletions|. Bottom two images: A second trial with the same parameters. Note that the accumulated frameshift is 6, which is a multiple of three. The total number of frameshift mutations is 22.

Figure 25 shows some important information, the key being support for the hypothesis that |# of insertions − # of deletions| must be a multiple of three in order for the ORF to be maintained. We changed the length of the example sequence from 40 bp to 36 bp, since 36 is a multiple of three and a changed ORF is easier to identify when the original sequence contains a whole number of codons.
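As a small illustrative helper (not part of the original code), the revised criterion can be written as a one-line check:

```r
# Hypothetical helper: TRUE when the net frameshift |insertions - deletions|
# is a multiple of three, i.e. the reading frame is maintained.
orf_maintained <- function(n_insertions, n_deletions) {
  abs(n_insertions - n_deletions) %% 3 == 0
}

orf_maintained(1, 1)  # one insertion cancelled by one deletion -> TRUE
orf_maintained(3, 7)  # accumulated frameshift of 4             -> FALSE
```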

In the first two images of figure 25, the accumulated frameshift is 4, which changes the ORF from

“AAA AAA AAA TTT TTT TTT GGG GGG GGG CCC CCC CCC” to

“CAA AAA ACG TAG GTG AGT CGG TGA GAG AAT CC”

Note that 4 nucleotides were deleted, indicating that at least one amino acid would be lost. But note that the last two nucleotide bases do not form a codon. Thus, we can conclude that the ORF has been modified because two amino acids were lost. We also note that there are stark differences between the two sequences, indicating that a mutation rate per nucleotide of 0.033 is too high. If a total of 37 mutations were induced in a 36 bp sequence, the rate of mutation would almost certainly result in nearly 100% non-functional variants. Thus, we need to ensure that mutations occur but do not affect the amino acid sequence above a certain threshold. Before we develop the amino acid directed model, we ran our plasmid through simulate_mutations with a mutation rate of 0.0033 and 30 doublings. The results are presented below.

Figure 26: pENGM42 sequence run through simulate_mutations. A very high number of mutations was generated with a mutation rate of 0.0033 and 30 doublings.

The images in Figure 26 show that our selected mutation rate of 0.0033 mutations per nucleotide position may be too high. It generated a total of 713 mutations which would almost certainly result in a non-functional protein. Depending on where the frameshifts could occur, this plasmid may code for entirely new proteins. Due to the length of the plasmid (7181 bp), we did not look for the location of each point mutation.

This simulation, though far from perfect, heavily influenced our plan for epPCR experimentation. We hypothesized that a mutation rate of 0.00033 mutations per nucleotide position (0.033%) could be better, as it would induce 10x fewer mutations under the linear assumption. We tested our hypothesis with simulate_mutations, but this time we ran 100 iterations to analyze the statistical significance of the results.

Figure 27: 100 simulations conducted of pENGM42 entire plasmid sequence with 30 doublings and mutation per nucleotide position = 0.00033. Top: RStudio output of average number of substitutions, insertions, deletions, and frameshift mutations. Also shows RStudio ANOVA output (analysis to follow). Bottom: barplot of average mutation types and average accumulated frameshifts. This simulation does not focus on amino acid sequence changes but only on DNA sequence. Note that the number of accumulated frameshift mutations is low, indicating that the overall length of the sequence should be the same under our hypothesized epPCR conditions.

As we predicted, the amount of each type of mutation decreased by a factor of 10 after we reduced the mutation rate per nucleotide position by a factor of 10 from 0.0033 to 0.00033. The average number of substitutions, insertions, deletions, and accumulated frameshift was 23, 24, 23.5, and 0.5, respectively, rounded to the nearest half number.

Interestingly, after 100 iterations, the accumulated number of frameshift mutations were very close to 0. Under the assumption that all mutation types are equally likely to occur, we can assume that this pattern may translate into real-world experiments. However, knowing that DNA replication can follow nonlinear replication patterns, we cannot place 100% trust in this prediction solely based on the data presented.

To further increase confidence in our results, a one-way ANOVA (analysis of variance) was performed. The ANOVA test is used to analyze the differences among group means in a sample. This analysis is most applicable to the simulation represented in figure 27 as we ran 100 iterations, collected the means, and plotted different groups against each other. ANOVA is also suitable for balanced data, which is when the number of observations in each group is equal. All four data points were collected 100 times and then averaged, meaning this condition is met. A one-way ANOVA was chosen because we are comparing the amount of mutation generated with a constant number of doublings.
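Assuming the 100 iterations are collected into a long-format data frame, the test itself is a short call; the data frame name results and its columns count and mutation_type are hypothetical labels for the simulation output.

```r
# Minimal sketch: one-way ANOVA of mutation counts across mutation types,
# assuming `results` has one row per iteration per group with columns
# `count` and `mutation_type` (substitution, insertion, deletion, frameshift).
results$mutation_type <- factor(results$mutation_type)

fit <- aov(count ~ mutation_type, data = results)
summary(fit)  # degrees of freedom, sums of squares, F-value, and p-value as reported in Figure 27
```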

Provided below is an analysis of the ANOVA results:

  1. Degrees of Freedom: Between groups (mutation types) = 3, Within groups (Residuals) = 396
  2. Sum of Squares: Between groups (mutation types) = 41851, Within groups (Residuals) = 10621
    1. A high sum of squares indicates that there is a lot of variability in the data. This could be due to the random nature of epPCR, and the assumptions that we made when developing this model
  3. Mean Squares: Between groups (mutation types) = 13950, Within groups (Residuals) = 27
    1. Indicates that trials between each group have little to no variation. However, the mean squares between groups is quite large, indicating large variability between groups. This could be due to the fact that the average number of frameshift mutations is lower than the other groups being considered in this ANOVA
  4. F-Value: 520.1
    1. A high F-value indicates that the means of at least one mutation type significantly differs from the others. This makes sense considering that the average number of frameshift mutations is considerably lower than the other groups being considered
  5. P-Value: 2 × 10^-16
    1. This is much lower than our chosen significance level 0.05. Suggests that there is strong evidence to reject the null hypothesis, indicating that there are highly significant differences among the mutation types
    2. At first glance, this seems illogical considering the bar graph. However, since the average number of frameshifts was included in this ANOVA, there will be high variation. This result confirms that the data we generated can represent experimental groups under the linear assumption
  6. Significance Levels: ***
    1. Three stars indicates that the results are statistically significant

The results of the ANOVA may have been skewed by the inclusion of the accumulated frameshift group. However, this does not affect our confidence in the results, since the significance level was three stars. Furthermore, by inspecting the data, we can see that there is little variation among the average numbers of insertions, deletions, and substitutions.

Overall, simulate_mutations provides a great deal of important information. Firstly, it helps us understand mutation dynamics, or the effect of different mutations on the overall sequence. We found that if the accumulated number of frameshift mutations is not a multiple of three, the ORF will not be the same and the epPCR’d DNA fragment will essentially be non-functional. Thus, during our sequencing rounds, if the length of the epPCR’d plasmid does not equal the length of the original plasmid, or does not differ from it by a multiple of three, we can classify the plasmid as non-functional.

Secondly, this model illustrated the need to consider amino acid sequence and not just the DNA sequence. Insertions and deletions can cause frameshifts that can drastically change the amino acid sequence as demonstrated by the analysis of figure 25. Therefore, the next and final iteration of this epPCR model will be to consider the effect of epPCR on the amino acid sequence and computationally determine an acceptable rate of mutation.

This model also serves as a tool for other iGEM teams who are looking to expand on epPCR as a method to generate a large library of mutants. This model would serve as the basis for the modeling that another iGEM team would conduct. The previous models were necessary as we worked to design this sophisticated iteration of epPCR simulation. However, this model can benefit from considering factors such as:

  1. Effect of pH and mutagens on epPCR
  2. Removing the linear assumption and working on nonlinear modeling
  3. Accounting for other complex mutations such as translocations, inversions, and duplications
  4. Use machine learning algorithms + extensive training data to improve the realism of this model under nonlinear conditions

Step 5: Simulate Effect on Amino Acid Sequence

This model is important for a few reasons. The first is the functional impact of epPCR on the amino acid sequence. Even a single amino acid change can alter the protein’s structure, function, or stability. Though we won’t be able to detect these changes via our ADE cycle, we need to ensure there are not too many changes; otherwise, we risk generating non-functional mutants, which wastes time and resources. Secondly, we must consider accumulated frameshift mutations. If this is not a multiple of three, the reading frame of the DNA sequence shifts, which leads to a completely different amino acid sequence downstream of the mutation site. Such alterations often result in premature stop codons, truncated and nonfunctional proteins, or entirely different amino acid sequences. Lastly, in every enzyme there are specific residues that must be maintained in order for the enzyme to retain its catalytic activity. For example, chymotrypsin relies on the catalytic triad (His-57, Asp-102, and Ser-195). The active site cannot be changed too much, or the catalytic activity of the enzyme will be lost.

This model will have the following assumptions:

  1. Linear Assumption: There is a linear relationship between the number of doublings and mutation rates.
    1. As aforementioned, DNA replication is a nonlinear process with many intricacies
  2. Linear Mapping from DNA to Amino Acids: each codon maps to a specific amino acid
    1. In reality, this relationship is complex due to factors like codon usage bias and wobble base pairing
  3. Unbiased Codon Usage: All codons are used uniformly, meaning that no one codon is more frequently used than another
  4. Stop Codons Lead to Truncated Proteins
  5. Absence of External Factors: this model will not consider the presence of chaperone proteins, cofactors, and post-translational modification

The purpose of this model is as follows:

  1. Predict functional consequences of epPCR on amino acid structure
  2. Use linear interpolation, as established in step 2, to suggest [MnCl2]
  3. Establish an acceptable threshold of mutation

The code attached below is an incomplete version of the finalized model. Thus, it serves as a template for future iGEM teams to improve upon. For our purposes, understanding the effect on amino acid sequence can be beneficial but it becomes inconsequential as our ADE cycle will indirectly tell us the effect of epPCR on a mutant’s amino acid sequence. Thus, this simulation is not necessary for Team Cornell’s epPCR modeling as our focus is on the effect of mutation on DNA and the optimal concentration of [MnCl2], which the other four simulations answer.

This final model combines code from step 2 (linear interpolation) and step 4 (effect of multiple mutation types) and adds a few things to it:

  1. A codon library to use the DNA sequence and convert it to an amino acid sequence
  2. A method to indicate how many stop codons are present
    1. If there is one stop codon at the end, this is an acceptable mutant
    2. If there are more than one stop codons, the protein becomes truncated at the first one and is non-functional
Figure 28: Incomplete code for the final function, simulate_mutations. This function combines many previous functions, adds code to convert DNA to an amino acid sequence, and performs mutations. The function is designed to suggest functionality based on whether the level of mutation falls within a certain experimentally determined threshold.

There are multiple functions within this version of simulate_mutations: mutate_dna, dna_to_aa, and suggest_MnCl2_concentration. Integrating all three of these functions into one larger function creates a modular design that allows for easy modification of the mutation, translation, and assessment logic. The commented code in figure 28 also provides a comprehensive overview of the simulation and what it is meant to do.

The function mutate_dna simulates mutations (substitutions, insertions, and deletions) in the input DNA sequence based on the given mutation rate and number of doublings. It provides a modular approach to handle mutation logic separately. This function can also be customized for different mutation scenarios.

dna_to_aa converts a DNA sequence into an amino acid sequence using a predefined codon-to-amino acid mapping (defined at the very beginning of the function). It is meant to translate codons into amino acids, similar to how ribosomes perform translation in a cell. It converts the input DNA sequence into its original amino acid sequence without considering post-translational modification. This function also calculates the number of differences between the mutated and original amino acid sequences. This number will be used to determine if a mutant can be classified as functional.
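Since Figure 28 is incomplete, a minimal sketch of the dna_to_aa idea is given below. The abbreviated codon table and the helper aa_differences are illustrative stand-ins (a full implementation would map all 64 codons), not our exact code.

```r
# Minimal sketch: translate a DNA string codon-by-codon with a lookup table and
# count amino acid differences between original and mutated sequences.
codon_table <- c(AAA = "K", AAT = "N", TTT = "F", GGG = "G", CCC = "P",
                 TAA = "*", TAG = "*", TGA = "*")   # "*" marks a stop codon

dna_to_aa <- function(dna) {
  codons <- substring(dna, seq(1, nchar(dna) - 2, by = 3),
                           seq(3, nchar(dna),     by = 3))
  aa <- unname(codon_table[codons])
  aa[is.na(aa)] <- "X"            # codon missing from this abbreviated table
  paste(aa, collapse = "")
}

aa_differences <- function(original_dna, mutated_dna) {
  a <- strsplit(dna_to_aa(original_dna), "")[[1]]
  b <- strsplit(dna_to_aa(mutated_dna),  "")[[1]]
  n <- min(length(a), length(b))
  sum(a[1:n] != b[1:n]) + abs(length(a) - length(b))  # mismatches plus length change
}

aa_differences("AAATTTGGGCCC", "AAATTTGGGCCG")  # one substitution in the last codon
```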

suggest_MnCl2_concentration predicts the required MnCl2 concentration based on the desired mutation rate and number of doublings, using linear interpolation over experimental data. MnCl2 concentration is a critical parameter in PCR-based mutagenesis techniques.

Lastly, this function determines usability based on the calculated percent difference between the mutated sequence and the original amino acid sequence. In this way, we can get a sense of when a mutant becomes very different from the original amino acid sequence and becomes non-functional.

The strengths of this code are that its functions are modular and that it is comprehensive: the code covers mutation simulation, translation, usability assessment, and MnCl2 concentration prediction, providing a holistic simulation model. However, the model has a few weaknesses which can be improved upon by future iGEM teams:

  1. Placeholder Logic: The mutation logic provided is a placeholder and lacks specific rules for mutations. A more detailed mutation model is needed for accurate simulations.
  2. Static Codon-to-Amino Acid Mapping: The codon-to-amino acid mapping is predefined and might not cover all genetic variations or organisms, limiting the accuracy of amino acid translation.
  3. Simplified Relationship: The linear relationship between mutation rate, doublings, and MnCl2 concentration might not capture the complexity of real-world experimental conditions.

The function simulate_mutations provides the basis for a more sophisticated epPCR simulation that future iGEM teams can improve upon. Though our function is incomplete, much of the logic behind generating certain types of mutations, translation from DNA to protein, assessing functionality, and estimating [MnCl2] has already been developed.