Modeling Fluid Flow in Isothermal Biochemical Synthesis (IBCS) Reactor
ANSYS Fluent is a computational fluid dynamics (CFD) software package that we will use to model fluid flow, chemical reactions, and heat transfer. Our Product Development subteam will be designing an isothermal biochemical synthesis (IBCS) reactor for the efficient conversion of caffeine to paraxanthine/7-methylxanthine. Modeling our IBCS reactor is important for several reasons:
 The thermodynamics of the reactor must be realistic to maximize the conversion rate within the reactor
 Modeling heat transfer will inform the use of a sous vide machine, as well as the method by which we maintain isothermal conditions in more complicated iterations
 Identifying potential pressure concentrations within our reactor early will guide our methodical scale-up process as we work to design a prototype for industrial scale-up
 Modeling both fluid flow and chemical reactions within the same software provides a comprehensive view
 Expanding this model and analyzing it in tandem will characterize all aspects of the reactor, providing a foundation for industrial scale-up and future prototyping
We started by creating a Poiseuille flow model, which captures the basic flow through the tube of the reactor. The mesh below models the upper half of the tube, and the full tube is obtained by setting the geometry to be axisymmetric about the centerline. The full tube is a cylinder with a radius of 0.5 cm and a length of 10 cm.
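As a quick sanity check on the velocity contours, the analytic Poiseuille profile can be evaluated directly. The sketch below is in Python rather than Fluent, and the mean velocity `v_avg` is an assumed placeholder, not a value from our simulation:

```python
# Analytic Poiseuille (fully developed laminar pipe flow) velocity
# profile, usable as a sanity check against the Fluent contours.
# v_avg is an assumed placeholder value, not a simulation output.

def poiseuille_velocity(r, R=0.005, v_avg=0.01):
    """Axial velocity (m/s) at radius r (m) in a pipe of radius R (m).

    v(r) = 2 * v_avg * (1 - (r/R)**2): the centerline velocity is
    twice the mean velocity, and the wall velocity is zero (no slip).
    """
    return 2.0 * v_avg * (1.0 - (r / R) ** 2)

print(poiseuille_velocity(0.0))    # centerline: 2 * v_avg = 0.02 m/s
print(poiseuille_velocity(0.005))  # wall: 0.0 m/s (no-slip condition)
```

The parabolic profile predicts exactly the behavior seen in the contours: velocity peaks on the centerline and vanishes at the wall.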
The lower half of the tube is obtained by applying a reflection about the ZX plane. The resulting velocity contours show that velocity is highest near the centerline, which sheds light on some of the challenges we could face when optimizing for yield.
After experimenting with simple geometry, one bead was added to the simulation. First, the single bead was subtracted from the flow domain. The tube dimensions are the same as in the previous model, but with a bead of radius 0.2 cm centered 5 cm along the tube. The necessary boundary conditions were specified, and the whole volume in the simulation was meshed. Assuming laminar flow, the Reynolds number can be computed.
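As a rough check on the laminar-flow assumption, the Reynolds number can be estimated from Re = ρvD/μ. The sketch below assumes water-like properties and an illustrative mean velocity; neither value is taken from the simulation:

```python
# Order-of-magnitude Reynolds number check for the tube (D = 1 cm).
# Water-like properties at ~25 C are assumed; the mean velocity is an
# illustrative value, not one taken from the simulation.

def reynolds_number(velocity, diameter, density=997.0, viscosity=8.9e-4):
    """Re = rho * v * D / mu for pipe flow (SI units)."""
    return density * velocity * diameter / viscosity

re = reynolds_number(velocity=0.01, diameter=0.01)
print(re)         # ~112, far below the ~2300 laminar-turbulent threshold
print(re < 2300)  # True: the laminar assumption is reasonable here
```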
The velocity magnitude contours of this model were obtained below. We can see that the upper and lower parts of the bead experience a higher velocity, translating to a lower reaction time and a higher exchange of reactants. However, the right and left parts experience a lower velocity, so there is less exchange of reagents and a higher reaction time. Although the changes in velocity around the bead are important for optimizing the overall reaction rate and heat transfer in the tube, this problem may be lessened when the beads are uniformly layered within the tube, as the gaps between the beads will be more uniform.
Once the simulation for one bead was complete, the beads could be layered in one plane. To obtain this model, we created a 3D tube with the same radius and length and built a slice of 1 cm beads. The slice of beads was then duplicated along the tube using the pattern tool.
We began with a coarse mesh resolution to verify that the flow was moving in the correct direction.
Contours of the concentrations in the reaction were plotted, comparing the locations of product formation with the locations of the reactants. When the fluid comes into contact with the beads, the velocity of the fluid changes, so the chemical reaction rate adjusts accordingly. Additionally, to keep the walls at a constant temperature, the beads are modeled as heat sinks, since they absorb heat from the fluid. Using the first law of thermodynamics, dU = dQ − dW, the energy equations can be coupled to the flow equations. Alternatively, heat flux boundary conditions can be used to account for how heat flows into the beads.
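The heat-sink picture can be illustrated with a back-of-the-envelope energy balance: with no shaft work (dW = 0), dU = dQ, so the heat the beads absorb equals the enthalpy drop of the fluid across the tube. All numbers below are illustrative assumptions, not simulation results:

```python
# Back-of-the-envelope energy balance for the bead heat-sink model:
# with no shaft work (dW = 0), dU = dQ, so the heat absorbed by the
# beads equals the enthalpy drop of the fluid across the tube.
# All numbers here are illustrative assumptions, not simulation results.

def heat_absorbed_by_beads(m_dot, cp, t_in, t_out):
    """Heat removed from the fluid (W): q = m_dot * cp * (T_in - T_out)."""
    return m_dot * cp * (t_in - t_out)

# Water (cp ~ 4186 J/(kg K)) flowing at 1 g/s and cooling by 0.5 K:
q = heat_absorbed_by_beads(m_dot=1e-3, cp=4186.0, t_in=298.65, t_out=298.15)
print(q)  # ~2.09 W that the bead heat sinks must absorb
```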
Looking at how the turbulence model affects the exposure of fluid to the bead, it is interesting to note that the greatest exposure occurs along the x-axis of the bead, both at the immediate point of contact and in the more stagnant wake on the opposite side of the flow. In contrast, the vertical poles experience the greatest fluid velocity, but the pressure on the bead itself is low there, meaning there is less contact between fluid and bead. The turbulence assessment is important for accurately gauging the relationship between the speed and drag of the fluid and their effect on contact with the bead.
In the third iteration, I looked at how the final design of the internal bead immobilizer would behave in a turbulent flow environment. By understanding the fluid movement, it is possible to iteratively identify areas of lower-velocity depressions that could lead to deposit buildup, as well as areas of higher-velocity flow that may experience accelerated degradation. Using the model developed in ANSYS, the geometry was shrink-wrapped to simplify the planar interfaces. While it would be possible to use the original design to calculate turbulent flow velocities, it would prove challenging due to the sheer number of faces the geometry possesses, and the shared topology would overcomplicate the design. Therefore, a tight shrink wrap with an enclosure of 0.1% of the body of the geometry allows for simplification of the model without compromising the overall behavior of the design in response to turbulent flow. A cylindrical enclosure was subsequently designed around the geometry to serve as the bounded region in which fluid interacts with the geometry.
An automatic Gaussian surface mesh wrap was performed on the design, which yielded 89 million surface nodes and 20 million cells across the geometry and the enclosure. Though ideally this automatic specification would allow for an optimized iterative calculation of the geometry, given the capacity of the software available, the number of nodes needed to be reduced to solve the flow design more economically. To perform an iterative reduction of surface nodes, the design was remeshed according to a reduction algorithm in which smaller nodes were combined and solved as larger surface polyhedra. These aggregating polyhedra enlarge based on an increasing maximum-size specification, and the software continued to merge polyhedra into larger regions until an exponential reduction in interfaces and nodes was achieved. I designed this reduction procedure as a way to sharply decrease the number of volume subsets that would need to be iteratively solved without biasing the results toward an oversimplification of the geometry. Specifying smaller nodes between the topology of the geometry and the enclosure ensured that the calculations near the surface of the immobilizer remained accurate, while larger polyhedral meshing near the boundaries of the enclosure and within the immobilizer compensated for the level of precision required at the shared topology. The design was effectively simplified to 4 million nodes and 89,000 cells in the geometry, an effective 20x reduction in mesh size while still maintaining acceptable tolerances for the iterative mechanics.
Having meshed the geometry, the governing equations could be set. Flow was modeled with the generalized GEKO turbulence model with standard conditions for viscosity and flow interface interactions. Fluid was run through at a theoretical 40 m/s flow velocity, which is adjustable to the design specifications of the team. I set the flow to run parallel along the enclosure and initially planned a residual run of 100 iterations, to give both the drag and lift coefficients time to stabilize. However, despite the mesh optimization described above, the design was still extremely complicated, and each iterative step took twenty minutes to complete. The process was therefore terminated at 62 iterations, as the coefficients of drag and lift had both stabilized, and though not all residuals had fallen below the initial tolerance of 10^{-3}, the z-velocity residual, the most important since the fluid flows along that axis, had fallen below 10^{-3}, which was acceptable. Despite the continuity and k residuals not falling below the initial 10^{-3} target, the number of iterations run should still provide an overall accurate assessment of the flows and stresses along the geometry, as their residuals were within acceptable standards. In future iterations, I would be keen to run the design on software with greater processing capabilities, and I would prefer to reduce the time step to better simulate the flow, but for the purposes of the team, I judged the flow and iterative calculations sufficient to visualize points of stress and dynamic changes in fluid flow.
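The stopping rule described above can be sketched as a small helper: stop once the z-velocity residual is below tolerance and the drag coefficient has flattened out. The residual and coefficient histories below are invented for illustration, not taken from the actual Fluent run:

```python
# Sketch of the stopping rule used for the run: terminate once the
# z-velocity residual (the monitored one, since flow is along z) falls
# below tolerance and the drag coefficient has stabilized, even if
# continuity and k residuals remain above tolerance. The histories here
# are invented for illustration, not the actual Fluent logs.

def converged(residuals, cd_history, tol=1e-3, window=5, coeff_tol=1e-4):
    """residuals: dict of current residuals; cd_history: recent Cd values."""
    if residuals["z-velocity"] >= tol:
        return False
    recent = cd_history[-window:]
    # the coefficient is "stable" if it barely moves over recent iterations
    return max(recent) - min(recent) < coeff_tol

res = {"continuity": 4e-3, "k": 2e-3, "z-velocity": 8e-4}
cd = [0.52, 0.51, 0.508, 0.508, 0.508, 0.508, 0.508]
print(converged(res, cd))  # True: z-velocity below 1e-3 and Cd flat
```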
Upon visualization, it is apparent that in the body of the enclosure the velocity is fairly constant, but the threads of the geometry do force some internal mixing and eddy currents at the intersection of the topology and the fluid region. The design shows eddies that develop currents and depressions of flow between the screw rings, where fluid is drawn into successively slower depressions that allow mixing and interaction of the fluid with the immobilized beads placed on the threads of the geometry. Studying a cross-section further reinforces this picture. Pressure due to the fluid flow does not appear to affect the screw geometry except at the end in greatest contact with the fluid, which suggests the design is not prone to breakage under turbulent flow.
Modeling Error-Prone PCR (epPCR) in silico
Background
The premise of our project is to optimize the caffeine-metabolizing enzymes NdmA, NdmB, and NdmD. This optimization will allow us to generate variants that are superior to the original versions. If we wish to reduce the cost of 7MX and provide a viable oral treatment for myopia, we need to produce 7MX faster than traditional methods allow. Literature review indicated that a co-culture of E. coli, with one strain producing NdmA + NdmD and the other producing NdmB + NdmD, yielded a conversion rate of 85%. Team Cornell will not be using a co-culture due to the nature of our project. Thus, we looked for alternative methods to optimize the enzymes and establish a conversion rate of 85% or higher.
Error-prone PCR (epPCR) is a technique used to randomly introduce mutations into a DNA sequence. It is typically used for studying protein structure and function and for directed evolution experiments (a big part of our methodology; learn more about this process in our project description and competition video!).
In epPCR, mutations are usually introduced by using a low-fidelity DNA polymerase. Instead of using a new polymerase, we decided to lower the fidelity of New England Biolabs' Q5 High-Fidelity polymerase by adding various amounts of MnCl_{2}.
The following table illustrates the effect of MnCl_{2} on the fidelity of DNA polymerase:
Effects of MnCl_{2} on fidelity:
 Higher error rate
 Template-dependent errors
 Slower replication
Figure 12: Explanation of the effect of MnCl_{2} on the fidelity of DNA polymerases
Strengths and Weaknesses of epPCR
The strengths and weaknesses of epPCR must be considered when constructing a simulation of it. The accuracy of this model will influence the method in which we decide to conduct epPCR experiments and utilize it for our ADE (adaptive/directed evolution) experiments.
Strengths
 Generates a wide variety of mutations based on the concentration of MnCl_{2} in the PCR reaction
 Can be applied to any DNA template (in our case, an entire plasmid)
 Mutations are random
 Diverse mutant libraries can be generated based on [MnCl_{2}]
Weaknesses
 Certain motifs may be more prone to mutations than others (consider mutation hotspots, which are incorporated in the model)
 No control over the types of mutations induced (substitutions, insertions, deletions)
 Primer design can become challenging
 A balance must be struck between generating mutations and increasing the likelihood of nonfunctional variants
At first glance, epPCR seems to be more trouble than it's worth. However, the needs of our project fit the strengths of epPCR. We are looking to generate random mutations, as the tertiary structure of the N-demethylases is unknown. The generation of diverse mutant libraries maps directly onto our ADE cycle, making this a key strength of epPCR for our project.
The model developed in silico will focus on predicting the effect of the weaknesses of epPCR. We will incorporate each weakness into the model to generate a thorough simulation of error-prone PCR. All simulations were performed in RStudio Version 2023.06.1 Build 524. The following iterations will be conducted and analyzed as follows:
 Define a basic epPCR function in RStudio based on mutation rate
 Modify epPCR function to predict [MnCl_{2}]
 Nonlinear Iteration of epPCR function (predict [MnCl_{2}])
 Increase complexity by introducing different mutation types
 Substitutions, Deletions, and Insertions
 Simulate effect of finalized DNA mutation model on amino acid primary sequence
Governing Assumptions
 Independence of Mutations: if one base is mutated, whether or not the next base is mutated is independent of the result of the previous base.
 In reality, adjacent mutations can influence each other due to the presence of DNA repair mechanisms
 Equal Probability of Mutation for Each Base: each base has an equal probability of being used in a mutation
 In reality, the propensity of a base to be mutated depends on the polymerase’s fidelity and the template strand
 Constant Mutation Rate: the entire sequence has a constant mutation rate (though this facet will change as the model increases in complexity)
 In reality, primer design, reaction conditions, and polymerase fidelity can affect the mutation rate
 No Incorporation Bias: All mutated bases are incorporated by DNA polymerase with equal efficiency
 In reality, a polymerase may have a bias regarding the incorporation of specific bases by the polymerase enzyme
 Constant Number of Doublings: 5 doublings were run per epPCR simulation
 In reality, the number of doublings can affect the amount of mutation, as established below
Basis of Model
A study performed by David Wilson and Tony Keefe in 2000 discusses “Random Mutagenesis by PCR”. They found that the prevalence of mutations varies linearly with [MnCl_{2}], the number of doublings, and the length of the DNA template. The following table demonstrates the pattern they found ([MnCl_{2}] = 25 mM):
| # of Doublings | 100 bp | 200 bp | 400 bp | 800 bp | 1600 bp | Mutations per Nucleotide Position |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.33 | 0.66 | 1.3 | 2.6 | 5.3 | 0.0033 |
| 10 | 0.66 | 1.3 | 2.6 | 5.3 | 11 | 0.0066 |
| 20 | 1.3 | 2.6 | 5.3 | 11 | 21 | 0.013 |
| 30 | 2.0 | 4.0 | 7.9 | 16 | 32 | 0.02 |
| 50 | 3.3 | 6.6 | 13 | 26 | 53 | 0.033 |
Figure 13: Table of values correlating the number of doublings, the length of the DNA template, and mutations per nucleotide position. The value under each DNA length is the average number of mutations per template of that length after the given number of doublings
Unfortunately, we were not able to collect sufficient data to replicate these results or to accumulate our own data on mutations per nucleotide position at different [MnCl_{2}]. We started the process, but due to circumstances outside our control (delays in shipments, incorrect primers), we could not collect the breadth of data we had hoped to. Nevertheless, the data provided by Wilson and Keefe served as the basis of all of our computational modeling of epPCR. The models will be improved as more data is experimentally determined. Specifically, the regression and interpolation models presented in steps 2 and 3 will use collected data as training data to provide better predictions of [MnCl_{2}].
Step 1: Simulating epPCR Based on Constant Mutation Rate
Expanding on the data found in Wilson and Keefe's study, we can estimate a constant rate of mutation for pENGM42. The length of pENGM42 is 7181 bp. For this simulation, we assume that the number of doublings is 5 to simplify the function and the model. Below is a screenshot of the implemented RStudio code for this function, with comments:
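For readers without access to the screenshot, here is a minimal sketch of the same logic (our actual implementation is in R; this is an equivalent Python illustration): each base mutates independently with a constant probability, and a mutated base is replaced by one of the other three bases.

```python
import random

# Minimal Python sketch of the step-1 logic (our real code is in R):
# each base mutates independently with a constant per-base probability
# (assumptions 1-3), and a mutated base becomes one of the other three.

def simulate_eppcr(sequence, mutation_rate=0.0033, seed=None):
    rng = random.Random(seed)
    mutated, positions = [], []
    for i, base in enumerate(sequence):
        if rng.random() < mutation_rate:
            allowed = [b for b in "ATGC" if b != base]  # exclude current base
            mutated.append(rng.choice(allowed))
            positions.append(i + 1)  # record 1-indexed mutation position
        else:
            mutated.append(base)
    return "".join(mutated), positions

trial = "AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG"  # the 40 bp trial sequence
mutant, hits = simulate_eppcr(trial, seed=1)
print(len(mutant) == len(trial))  # True: substitutions preserve length
print(hits)  # mutated positions for this run (frequently an empty list)
```

Because the per-base probability is only 0.0033, most runs on a 40 bp sequence produce no mutations at all, which is exactly the behavior described below.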
We ran the above code on this 40 bp trial sequence: “AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG”. After running the code 6 times, we finally obtained a mutated version of the sequence, pictured below:
The fact that we had to run this code 6 times indicates that for a small 40 bp sequence, the mutation rate is very low. Two mutations is a lot for a 40 bp segment of DNA, but it took 6 trials for any mutations to be generated. Thus, we can conclude that the number of mutations generated should not heavily affect the function of this sequence. However, we will assess this in a more complex iteration of this code.
To analyze the effect of epPCR with 5 doublings on our plasmid, we ran the function on the full 7181 bp sequence of pENGM42. Due to the length of the plasmid, the input was split into 5 sequences, which were then pasted together into one. The code was run 100 times to output the following data:
The mutation rate of 0.33% is relatively low, but for our purposes, it is important that we don’t risk generating too many mutations. Note that the number of mutations can vary stochastically. In some experiments, you might observe more or fewer mutations due to chance. Replicating the experiment multiple times could provide a more robust understanding of the variability in the number and locations of mutations.
Furthermore, the low mutation rate is important because epPCR generates random mutations. In silico, it is easy to identify where a mutation is by comparing the original and mutated sequences. However, it is not as easy to identify the location of mutations in experiments, because epPCR generates many different variants. Sanger or next-generation sequencing will return different results for different molecules, because a single sample can contain thousands of variants.
However, we note that after 5 doublings, 99.7% of the sequence is maintained. This is important for three reasons:
 Preservation of sequence integrity: the conservation of the genetic sequence is vital for maintaining the functionality and stability of proteins.
 More Representative of Natural Genetic Variation: Sporadic mutations can represent natural genetic variation without significantly compromising the overall structure or function of the biological entity.
 Maintaining Experimental Control: Low mutation rates ensure that time is not wasted generating a large library full of nonfunctional variants. This allows us to actually use these variants in our ADE cycle.
To further validate that 5 doublings and a 0.33% mutation rate is a good option for our experiments, we ran this simple epPCR simulation with 10 doublings (0.66% mutation rate), 20 doublings (1.3% mutation rate), and 50 doublings (3.3% mutation rate). The results are presented below:
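The scaling behind these rates follows directly from Wilson and Keefe's table: at 25 mM MnCl_{2}, the per-nucleotide mutation rate grows by roughly 0.00066 per doubling (0.0033 at 5 doublings). A short illustrative sketch of this relationship (in Python; our simulations themselves were run in R):

```python
# The per-nucleotide mutation rate in Wilson and Keefe's table grows
# linearly with the number of doublings at 25 mM MnCl2: 0.0033 at
# 5 doublings implies roughly 0.00066 per doubling. Illustrative sketch
# only; the team's simulations were run in R.

RATE_PER_DOUBLING = 0.0033 / 5  # inferred from the 5-doubling row

def mutation_rate(doublings):
    return RATE_PER_DOUBLING * doublings

def percent_preserved(doublings):
    return 100.0 * (1.0 - mutation_rate(doublings))

for d in (5, 10, 20, 50):
    print(d, round(mutation_rate(d), 4), round(percent_preserved(d), 1))
# 5 doublings -> rate 0.0033 (99.7% preserved); 10 -> 0.0066 (99.3%);
# 20 -> 0.0132 (98.7%); 50 -> 0.033 (96.7%)
```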
The x-axis scales in these figures have changed drastically, though the distribution in each histogram looks very similar. Clearly, increasing the number of doublings increases the rate of mutation. Overall sequence preservation is 99.3%, 98.7%, and 96.7% for 10, 20, and 50 doublings, respectively. Conducting 100 iterations of each simulation increases our confidence in the statistical reliability of the results.
The higher mutation rates lead to a diverse pool of genetic variants, thus increasing the amount of genetic diversity which could potentially help to identify a superior variant. However, the higher the number of mutations, the higher the likelihood of mutations occurring in critical regions of genes or proteins.
Therefore, we learned some important lessons from this rudimentary epPCR simulation:
 A balance must be struck between generating higher rates of mutation and risking protein function
 Our original choice of 5 doublings is a safe option
 It could be worth performing 10 doublings in epPCR reactions, but this should be a consideration made after collecting preliminary data
 The high number of iterations means we can be confident in our experimental selections
Step 2: Suggesting MnCl_{2} Concentrations For a Specific Rate of Mutation
Wilson and Keefe’s data was all collected at 25 mM concentration of MnCl_{2}. However, the concentration of MnCl_{2} also affects the generated mutant. The previous approach characterized in step 1 is beneficial for numerous reasons:
 Maintains consistent conditions: Consistency is crucial for reproducibility of our experiments. A crucial initial step of our experimental epPCR trials is to replicate Wilson and Keefe’s data.
 Standardization: allows for easy establishment of standard protocols, which is an important result that we look to present
 Predictable Mutagenesis: Users of this protocol can predict, to some extent, the expected total number of mutations of a template based on the linear relationship between [MnCl_{2}] and mutation rate, which is valuable when designing mutagenesis experiments
However, the approach in step 1 has a few downfalls, which step 2 aims to correct.
 Low Number of Doublings: The major downside of step 1 is that in order for a 0.0033 mutation rate to be obtained, only 5 doublings can be performed. However, we look to incorporate the mutants in a plasmid. More copies are needed than what can be found in 5 doublings.
 Limited Flexibility: Constant MnCl_{2} concentrations limit the ability to finetune the mutation rate. The way to do that would be to increase the number of doublings which is not the best method to go about finetuning the mutation rate.
Thus, the guiding principle behind step 2 is the following: the user supplies a desired rate of mutation and a number of doublings, and the program suggests a concentration of MnCl_{2} via linear interpolation. This model has one key assumption: linearity between doublings and mutation rates. This relationship was demonstrated at a MnCl_{2} concentration of 25 mM; however, when extrapolating beyond the provided data, if the relationship between doublings and mutation rates is nonlinear, the model will not be accurate. Below is the code for this function:
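A minimal Python sketch of this idea (our actual function is in R): convert the desired rate to a per-doubling rate, then interpolate between calibration points. Only the 25 mM point is anchored in Wilson and Keefe's data; the other calibration pairs below are hypothetical placeholders for future experimental data.

```python
# Python sketch of the step-2 interpolation (our function is in R):
# convert the desired mutation rate to a per-doubling rate, then linearly
# interpolate MnCl2 concentration between calibration points.
# Only the 25 mM point comes from Wilson and Keefe; the 0 mM and 50 mM
# pairs below are hypothetical placeholders.

def suggest_mncl2(desired_rate, doublings, calibration):
    """calibration: (mncl2_mM, rate_per_doubling) pairs."""
    per_doubling = desired_rate / doublings
    pts = sorted(calibration, key=lambda p: p[1])  # sort by rate
    for (c0, r0), (c1, r1) in zip(pts, pts[1:]):
        if r0 <= per_doubling <= r1:
            frac = (per_doubling - r0) / (r1 - r0)  # linear interpolation
            return c0 + frac * (c1 - c0)
    raise ValueError("desired rate outside calibrated range")

cal = [(0.0, 0.0), (25.0, 0.00066), (50.0, 0.00132)]
print(suggest_mncl2(desired_rate=0.0033, doublings=5, calibration=cal))  # ~25.0
```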
Linear interpolation is a very straightforward, flexible model that allows one to predict MnCl_{2} concentration for various desired mutation rates and doublings. It also does not rely on external libraries which makes this method easy to use in any RStudio environment. This is beneficial for iGEM teams who are not as experienced with RStudio.
However, linear interpolation relies on the linear assumption, which assumes that there is a linear relationship between the number of doublings and mutation rates. In reality, we know that the linear relationship is not always true (consider the presence of CpG dinucleotides and its effect on mutation). Therefore, for nonlinear relationships, this model will not be the most accurate.
The linear assumption is a valuable one, as we saw in step 1 that there is a linear relationship between sequence length and total number of mutations. Wilson and Keefe also established a linear relationship between [MnCl_{2}] and mutation rate. We can replicate those results in silico by plotting the desired rate of mutation vs. [MnCl_{2}] using a constant doubling number of 30 (the number of doublings we perform in standard PCR reactions). The figure can be found below:
The observation of a linear relationship between the desired mutation rate and [MnCl_{2}] suggests a consistent and predictable pattern. This linearity simplifies the prediction process, making it easier to estimate the required MnCl_{2} concentration for a desired rate of mutation. The figure above also demonstrates that under the linear assumption, the mutation rate per nucleotide position is independent of the number of doublings. However, [MnCl_{2}] must be changed accordingly. This could have practical implications in experimental design, allowing iGEM teams to maintain a consistent rate of mutation regardless of the duration.
The linear relationship established here provides a simple, useful, predictive model. Given a desired mutation rate, iGEM teams can use this model to estimate the MnCl_{2} concentration needed to achieve the target rate. The results here are only predictions; they must be supported by performing experiments in the lab. Due to time constraints, we did not collect sufficient data to present on this. However, the model will continue to be refined, and complexity will continue to be added to make it more realistic.
Step 3: Suggesting [MnCl_{2}] For a Specific Rate of Mutation (Nonlinear Conditions)
The linear assumption is a great one to make for the sake of simplicity and efficiency of calculation. However, as discussed before, epPCR can exhibit nonlinear behavior. An example is the effect of a previous mutation on the next nucleotide being added: DNA repair mechanisms can affect the prevalence of a mutation. Based on this example, we can identify a few reasons why nonlinear relationships must be considered for epPCR:
 Biological Processes Are Complex: As shown in the example, biological systems often involve intricate and nonlinear interactions due to the multifaceted nature of biochemical reactions
 Realism and Accuracy: Taking nonlinear relationships into account captures nuances and subtleties of biological systems in the model
 Data Driven Insights: Nonlinear models can extract meaningful insights from experimental data. This allows us and other iGEM teams to experiment with different [MnCl_{2}], different number of doublings, and different lengths of DNA and compare that with the results of the model. This process can also help to improve the model.
Below is a picture of the code written for this simulation:
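In place of the screenshot, here is a Python sketch of one way such a nonlinear model can be set up (our implementation is in R and may differ): fit a power law, rate = a·[MnCl_{2}]^b, by linear regression in log-log space, then invert it to suggest a concentration. The training points below are synthetic stand-ins, not experimental measurements.

```python
import math

# Sketch of a nonlinear (power-law) model for [MnCl2] prediction:
# fit rate = a * c^b by linear regression in log-log space, then invert.
# The training data below is synthetic, generated from a known power law,
# so the fit should recover the exponent used to create it.

def fit_power_law(concs, rates):
    xs = [math.log(c) for c in concs]
    ys = [math.log(r) for r in rates]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = math.exp(ybar - b * xbar)
    return a, b

def invert_power_law(desired_rate, a, b):
    # rate = a * c^b  ->  c = (rate / a)^(1/b)
    return (desired_rate / a) ** (1.0 / b)

concs = [5.0, 10.0, 25.0, 50.0]
rates = [0.0002 * c ** 1.5 for c in concs]  # synthetic power-law data
a, b = fit_power_law(concs, rates)
print(round(b, 3))  # recovers the exponent used to generate the data
print(invert_power_law(0.0002 * 25.0 ** 1.5, a, b))  # ~25.0 mM
```

Notably, if the fitted exponent b comes out near 1, the power law collapses to a straight line, which is one way a nonlinear fit can return an effectively linear prediction for [MnCl_{2}].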
The nonlinear regression model generated a linear relationship between desired mutation rate and estimated [MnCl_{2}]. This seems counterintuitive, but it could have occurred for a few reasons. The first is that epPCR may actually follow a linear relationship between desired mutation rate and predicted [MnCl_{2}]; in that case, predicting [MnCl_{2}] may be more accurate under the linear assumption. However, this is unlikely, considering that DNA synthesis is a complex biochemical process. Alternatively, there could have been noise in the training data used for this model; in the presence of substantial noise, a nonlinear model will magnify its effects.
Lastly, if the training data was not precise, it may not reflect the true nonlinear behavior of the system. The training data was sourced primarily from Wilson and Keefe's established data. If the nonlinear regression coefficients were not calculated properly, this could affect the results of the model. A polynomial regression model may better capture the nonlinear behavior of epPCR; however, when such a model was implemented, the training data was not accurate enough. More training data is required for a polynomial model to describe epPCR accurately.
As an aside, we do not need to test our plasmid against this model, because we have already chosen the rate of mutation per nucleotide position. While understanding the linear relationship is important, the model must be confirmed experimentally. Thus, this model informed our decision to experiment with different concentrations, which we saw as a necessary step to ensure that we obtain a large quantity of DNA as it undergoes epPCR. Furthermore, we have no confirmation of whether epPCR is nonlinear. The data established by Wilson and Keefe shows a linear relationship between doublings and rate of mutation, and it is hard to find a reason why linearity would not hold at different concentrations of MnCl_{2}. We recognize that this is a big assumption, so we plan to test our hypothesis with a range of epPCR experiments to confirm either the linear or the nonlinear model. Due to time constraints, we were not able to collect sufficient data to confirm our hypothesis; this data will be presented at the Grand Jamboree.
The nonlinear model, while useful, still cannot fully capture the multifaceted nature of biological processes. There are many technicalities to consider, but we will consider only the few that are most relevant to ENERGEM. These technicalities include:
 Single Nucleotide Polymorphisms (SNPs)
 Insertions
 Deletions
 Substitutions
 Mutation Context: the effect of neighboring bases on the mutation probability
 Environmental Factors
 Temperature
 pH
Step 4: Incorporating Different Mutation Types
The mutations considered in step 1 are substitutions, a form of SNP. The model in step 1 is a simplified version that doesn't reflect biological realism. In real biological systems, different types of mutations have different effects on protein structure and function. By modeling these mutation types, we can simulate the diversity of mutations found in natural populations. Though epPCR is a laboratory technique, the ADE cycle relies on selection as we look to isolate the best variants, so mimicking biological realism is key. We focus on three mutation types: substitutions, insertions, and deletions.
A substitution will only affect a single amino acid in the protein sequence. This may or may not affect the function of the enzyme, depending on the location of the mutation. In contrast, insertions and deletions cause frameshift mutations, altering the entire protein sequence downstream of the mutation site. Frameshift mutations often result in nonfunctional proteins; thus, understanding the likelihood of a frameshift mutation occurring is important. This model will focus on assessing the effect of frameshift mutations on the DNA sequences of samples and our plasmid. A future iteration will discuss the likelihood that a mutation generates a nonfunctional protein (it will take environmental considerations into account as well).
The assumptions of this simulation are as follows:
 [MnCl_{2}] = 25 mM: This decision was made as Wilson and Keefe’s data is confirmed and it serves as a good comparison against this simulation
 Mutations are discrete events: this establishes independence of mutations and ignores complex biochemical processes like replication slippage
 Homogenous Environment: Optimal reaction conditions were maintained during this simulation.
 The probability of each mutation type is equal: this helps simplify the model although in biological contexts, this is not true
 Linear Assumption: linear relationship between the number of doublings and mutation rates, as established by Wilson and Keefe’s data.
The figure below contains a commented version of the code for this simulation. This simulation is a more sophisticated version of the one presented in step 1. The assumptions above are addressed (assumption 3 is indirectly addressed by not considering the effect of reaction conditions within the model).
The sophistication of this model is well beyond the previous simulations presented. As such, we provide a more detailed breakdown of the function here:
 The function “simulate_mutations” takes three parameters: the original DNA sequence, the probability of a mutation occurring, and the number of doublings in the PCR machine
 A mutation occurs if a randomly generated number from a uniform distribution is lower than the specified mutation rate
 Following assumption 4, the likelihood of each mutation type is equal. If a mutation occurs, the function randomly selects whether a substitution, insertion, or deletion occurs
 The code for a substitution mutation is greatly simplified in this function. In step 1, we wrote separate if-else-if-else branches depending on whether the base in question was an A, T, G, or C. In this simulation, we sample from a vector containing all four bases, but the function checks which base is mutating and automatically removes that base from the “allowed_bases” variable. So if an A is being mutated, it can only be replaced by a T, C, or G; the same pattern holds for the other nucleotides
 If an insertion mutation occurs, a random base is inserted before the current position
 If a deletion mutation occurs, the base at the current position is removed
 The type and position of each mutation is recorded and stored
 Insertions and deletions are classified as frameshift mutations. For our purposes, if a frameshift occurs, the DNA sequence will likely yield a nonfunctional protein. Thus, keeping track of the number of frameshift mutations and the likelihood that one occurs is important information for our experimentation.
 A text-based graph is generated to visualize the occurrence of each mutation
This model also yields five important results: the numbers of substitutions, deletions, and insertions; the number of frameshift mutations; and the position of each mutation.
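The breakdown above can be condensed into a minimal Python sketch. This is an illustration of the described logic, not the team's actual code; applying the per-base test once per doubling is one simple way to encode the linear assumption, and the `seed` parameter is ours, added for reproducibility:

```python
import random

def simulate_mutations(sequence, mutation_rate, doublings, seed=None):
    """Sketch of the per-base mutation loop described above.

    Mutations are discrete, independent events, and each mutation type
    is equally likely (assumptions 2 and 4).
    """
    rng = random.Random(seed)
    bases = "ATGC"
    mutations = []  # (type, position) records
    for _ in range(doublings):
        new_seq = []
        for pos, base in enumerate(sequence):
            # A mutation occurs when a uniform draw falls below the rate
            if rng.random() < mutation_rate:
                kind = rng.choice(["substitution", "insertion", "deletion"])
                mutations.append((kind, pos))
                if kind == "substitution":
                    # Replace with any base except the current one
                    allowed = [b for b in bases if b != base]
                    new_seq.append(rng.choice(allowed))
                elif kind == "insertion":
                    # Insert a random base before the current position
                    new_seq.append(rng.choice(bases))
                    new_seq.append(base)
                # Deletion: append nothing, dropping the current base
            else:
                new_seq.append(base)
        sequence = "".join(new_seq)
    counts = {k: sum(1 for t, _ in mutations if t == k)
              for k in ("substitution", "insertion", "deletion")}
    # Insertions and deletions are both classified as frameshifts
    counts["frameshift"] = counts["insertion"] + counts["deletion"]
    return sequence, counts, mutations
```

Because each insertion lengthens the sequence by one base and each deletion shortens it by one, the length change of the returned sequence always equals insertions minus deletions, which is the quantity analyzed later.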
In practice, the position of a mutation matters less, since in the lab we won't know where a mutation occurred until after sequencing. Furthermore, because NdmA, NdmB, and NdmD aren't thoroughly documented, and because software tools like AlphaFold and Rosetta sometimes produce inaccurate models of protein structure, the position of a mutation is somewhat inconsequential: there are no actionable steps we can take to correct it. However, this information is helpful as we look to improve this simulation and generate useful visualizations of how our genetic parts and plasmid may be affected as they undergo epPCR. Figures can be generated based on the number and type of mutations observed.
A major strength of this model is that it records the number of frameshift mutations. A single frameshift mutation can result in a completely nonfunctional mutant. However, if frameshift mutations occur in multiples of three (3, 6, 9, 12, ...), only a few codons would be changed and the mutant may still be worth testing. If the number of frameshift mutations is not a multiple of three, the mutant is most likely nonfunctional. The number of frameshift mutations is also directly affected by the assumption that each mutation type is equally probable. In reality, this is not true: for example, frameshift mutations are more likely in highly repetitive regions, since repetitive sequences are prone to slippage during DNA replication. We assume equal probabilities for all mutation types to simplify the model and keep it computationally efficient.
Below are the data and figures generated from an example sequence, “AAAAAAAAAATTTTTTTTTTGGGGGGGGGGCCCCCCCCCC”, with a mutation rate of 0.0033 and 30 doublings.
In a previous paragraph, we mentioned that if the number of frameshift mutations is a multiple of three, then the open reading frame would not be affected and the variant would be preserved. However, this simulation shows otherwise: the number of frameshift mutations is 2, with one insertion and one deletion, and the deletion cancels out the insertion. Thus, we modify our previous statement and instead assert that if the number of insertions minus the number of deletions is a multiple of three, then the open reading frame will be maintained. We apply this new criterion and rerun the simulation, shown below:
Figure 25 shows some important information, the key result being support for the hypothesis that insertions minus deletions must be a multiple of three for the ORF to be maintained. We changed the length of the example sequence from 40 bp to 36 bp: since 36 is a multiple of three, a changed ORF is easier to identify because the original number of codons is an integer.
In the first two images of figure 25, the accumulated frameshift is 4, which changes the ORF from
“AAA AAA AAA TTT TTT TTT GGG GGG GGG CCC CCC CCC” to
“CAA AAA ACG TAG GTG AGT CGG TGA GAG AAT CC”
Note that the mutated sequence is a net 4 nucleotides shorter, indicating that at least one amino acid would be lost, and that the last two nucleotide bases no longer form a complete codon. Thus, we can conclude that the ORF has been modified, because two amino acids were lost. We also note stark differences between the two sequences, indicating that a mutation rate of 0.033 per nucleotide is too high: if a total of 37 mutations were induced in a 36 bp sequence, that rate of mutation would almost certainly result in nearly 100% nonfunctional variants. Thus, we need to ensure that mutations occur but do not alter the amino acid sequence beyond a certain threshold. Before developing the amino acid directed model, we ran our plasmid through simulate_mutations with a mutation rate of 0.0033 and 30 doublings. The results are presented below.
The images in Figure 26 show that our selected mutation rate of 0.0033 mutations per nucleotide position may be too high. It generated a total of 713 mutations which would almost certainly result in a nonfunctional protein. Depending on where the frameshifts could occur, this plasmid may code for entirely new proteins. Due to the length of the plasmid (7181 bp), we did not look for the location of each point mutation.
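As a sanity check, the observed total is consistent with a simple expected-value estimate under the linear assumption (illustrative arithmetic only):

```python
rate = 0.0033      # mutations per nucleotide position per doubling
length = 7181      # plasmid length in bp
doublings = 30

# Expected mutation count if mutations accumulate linearly with doublings
expected = rate * length * doublings
print(round(expected))  # 711, close to the 713 mutations observed
```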
This simulation, though far from perfect, heavily influenced our plan for epPCR experimentation. We hypothesized that a mutation rate of 0.00033 mutations per nucleotide position (0.033%) could be better, as under the linear assumption it would induce one-tenth as many mutations. We tested our hypothesis with simulate_mutations, but this time we ran 100 iterations so we could analyze the statistical significance of the results.
As we predicted, the amount of each type of mutation decreased by a factor of 10 after we reduced the mutation rate per nucleotide position by a factor of 10, from 0.0033 to 0.00033. The average numbers of substitutions, insertions, and deletions and the average accumulated frameshift were 23, 24, 23.5, and 0.5, respectively, rounded to the nearest half.
Interestingly, after 100 iterations, the accumulated number of frameshift mutations was very close to 0. Under the assumption that all mutation types are equally likely to occur, this pattern may translate into real-world experiments. However, knowing that DNA replication can follow nonlinear patterns, we cannot place complete trust in this prediction based solely on the data presented.
To further increase confidence in our results, a one-way ANOVA (analysis of variance) was performed. The ANOVA test analyzes the differences among group means in a sample. This analysis is most applicable to the simulation represented in figure 27, as we ran 100 iterations, collected the means, and plotted the groups against each other. ANOVA is also suitable for balanced data, in which the number of observations in each group is equal; all four quantities were collected 100 times and then averaged, so this condition is met. A one-way ANOVA was chosen because we are comparing the amount of mutation generated with a constant number of doublings.
Provided below is an analysis of the ANOVA results:
 Degrees of Freedom: Between groups (mutation types) = 3, Within groups (Residuals) = 396
 Sum of Squares: Between groups (mutation types) = 41851, Within groups (Residuals) = 10621
 A high sum of squares indicates that there is a lot of variability in the data. This could be due to the random nature of epPCR, and the assumptions that we made when developing this model
 Mean Squares: Between groups (mutation types) = 13950, Within groups (Residuals) = 27
 The small within-group mean square indicates that trials within each group vary little. The between-group mean square, however, is quite large, indicating substantial variability between groups. This is likely because the average number of frameshift mutations is much lower than the averages of the other groups considered in this ANOVA
 F-Value: 520.1
 A high Fvalue indicates that the means of at least one mutation type significantly differs from the others. This makes sense considering that the average number of frameshift mutations is considerably lower than the other groups being considered
 P-Value: 2 × 10^{-16}
 This is much lower than our chosen significance level 0.05. Suggests that there is strong evidence to reject the null hypothesis, indicating that there are highly significant differences among the mutation types
 At first glance, this seems illogical considering the bar graph. However, since the average number of frameshifts was included in this ANOVA, there will be high variation. This result confirms that the data we generated can represent experimental groups under the linear assumption
 Significance Levels: ***
 Three stars indicate that the results are statistically significant
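The degrees of freedom reported above follow directly from the group structure (four mutation-type groups with 100 observations each), as this from-scratch sketch of a one-way ANOVA shows. It is run on placeholder data, not the team's simulation output:

```python
def one_way_anova(groups):
    """One-way ANOVA computed from first principles.

    Returns (df_between, df_within, F). In the analysis above, the four
    groups would be the per-iteration counts of each mutation type.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between = k - 1        # 4 groups -> 3
    df_within = n_total - k   # 400 observations - 4 groups -> 396
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_value

# Four placeholder groups of 100 observations reproduce the reported df
df_b, df_w, _ = one_way_anova([list(range(100)) for _ in range(4)])
print(df_b, df_w)  # 3 396
```

A large F-value, as reported above, arises exactly when the between-group mean square dwarfs the within-group mean square, which is what including the near-zero frameshift group guarantees.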
The results of the ANOVA may have been skewed by the inclusion of the accumulated frameshift count. However, this does not undermine our confidence in the results, since the significance level was three stars. Furthermore, by inspecting the data we can see that there is little difference between the average numbers of insertions, deletions, and substitutions.
Overall, simulate_mutations provides several important insights. Firstly, it helps us understand mutation dynamics, or the effect of different mutations on the overall sequence. We found that if the accumulated number of frameshift mutations is not a multiple of three, the ORF will change and the epPCR'd DNA fragment will essentially be nonfunctional. Thus, during our sequencing rounds, if the mutated plasmid's length differs from the original plasmid's length by an amount that is not a multiple of three, we can classify the plasmid as nonfunctional.
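The sequencing-round criterion described above can be expressed as a one-line length check (a sketch; the function name is ours):

```python
def orf_preserved(original_len, mutated_len):
    """True when the length difference (net insertions minus deletions)
    is a multiple of three, so the downstream reading frame survives."""
    return (mutated_len - original_len) % 3 == 0

# Example: a sequence shortened by 4 bp has a broken reading frame,
# while one shortened by 3 bp keeps its frame (losing one codon):
# orf_preserved(36, 32) -> False; orf_preserved(36, 33) -> True
```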
Secondly, this model illustrated the need to consider amino acid sequence and not just the DNA sequence. Insertions and deletions can cause frameshifts that can drastically change the amino acid sequence as demonstrated by the analysis of figure 25. Therefore, the next and final iteration of this epPCR model will be to consider the effect of epPCR on the amino acid sequence and computationally determine an acceptable rate of mutation.
This model also serves as a tool for other iGEM teams looking to expand on epPCR as a method to generate a large library of mutants, and it can serve as the basis for their own modeling. The previous models were necessary steps as we worked to design this more sophisticated iteration of the epPCR simulation. However, this model could benefit from considering factors such as:
 Effect of pH and mutagens on epPCR
 Removing the linear assumption and working on nonlinear modeling
 Accounting for other complex mutations such as translocations, inversions, and duplications
 Using machine learning algorithms and extensive training data to improve the realism of this model under nonlinear conditions
Step 5: Simulate Effect on Amino Acid Sequence
This model is important for a few reasons. The first is the functional impact of epPCR on the amino acid sequence. Even a single amino acid change can alter a protein's structure, function, or stability. Though we won't be able to detect these changes via our ADE cycle, we need to ensure there are not too many changes; otherwise we risk generating nonfunctional mutants, which wastes time and resources. Secondly, we must consider accumulated frameshift mutations. If the accumulated frameshift is not a multiple of three, the reading frame of the DNA sequence shifts, leading to a completely different amino acid sequence downstream of the mutation site. Such alterations often result in premature stop codons, truncated and nonfunctional proteins, or entirely different amino acid sequences. Lastly, every enzyme has specific residues that must be maintained for it to retain its catalytic activity. For example, chymotrypsin relies on the catalytic triad (His57, Asp102, and Ser195). The active site cannot be changed too much, or the catalytic activity of the enzyme will be lost.
This model will have the following assumptions:
 Linear Assumption: There is a linear relationship between the number of doublings and mutation rates.
 As aforementioned, DNA replication is a nonlinear process with many intricacies
 Linear Mapping from DNA to Amino Acids: each codon maps to a specific amino acid
 In reality, this relationship is complex due to factors like codon usage bias and wobble base pairing
 Unbiased Codon Usage: all codons are used uniformly, meaning that no one codon is used more frequently than another
 Stop Codons Lead to Truncated Proteins
 Absence of External Factors: this model will not consider the presence of chaperone proteins, cofactors, or post-translational modification
The purpose of this model is as follows:
 Predict functional consequences of epPCR on amino acid structure
 Use linear interpolation, as established in step 2, to suggest [MnCl_{2}]
 Establish an acceptable threshold of mutation
The code attached below is an incomplete version of the finalized model. Thus, it serves as a template for future iGEM teams to improve upon. For our purposes, understanding the effect on amino acid sequence can be beneficial but it becomes inconsequential as our ADE cycle will indirectly tell us the effect of epPCR on a mutant’s amino acid sequence. Thus, this simulation is not necessary for Team Cornell’s epPCR modeling as our focus is on the effect of mutation on DNA and the optimal concentration of [MnCl_{2}], which the other four simulations answer.
This final model combines code from step 2 (linear interpolation) and step 4 (effect of multiple mutation types) and adds a few things:
 A codon library to use the DNA sequence and convert it to an amino acid sequence
 A method to indicate how many stop codons are present
 If there is one stop codon at the end, this is an acceptable mutant
 If there is more than one stop codon, the protein becomes truncated at the first one and is nonfunctional
There are multiple functions within simulate_mutations: mutate_dna, dna_to_aa, and suggest_MnCl2_concentration. Integrating all three of these functions into one larger function creates a modular design that allows easy modification of the mutation, translation, and assessment logic. The commented code in figure 28 also provides a comprehensive overview of the simulation and its purpose.
The function mutate_dna simulates mutations (substitutions, insertions, and deletions) in the input DNA sequence based on the given mutation rate and number of doublings. It provides a modular approach to handle mutation logic separately. This function can also be customized for different mutation scenarios.
dna_to_aa converts a DNA sequence into an amino acid sequence using a predefined codon-to-amino acid mapping (defined at the very beginning of the function). It translates codons into amino acids, similar to how ribosomes perform translation in a cell, without considering post-translational modification. This function also calculates the number of differences between the mutated and original amino acid sequences; this number is used to determine whether a mutant can be classified as functional.
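A sketch of the translation and difference-counting logic described above. The codon table here is deliberately abbreviated for illustration; the actual function defines the full 64-codon standard table:

```python
# Abbreviated codon table for illustration only; the actual function
# defines the full 64-codon standard table. '*' marks a stop codon.
CODON_TABLE = {
    "AAA": "K", "TTT": "F", "GGG": "G", "CCC": "P",
    "ATG": "M", "TAA": "*", "TAG": "*", "TGA": "*",
}

def dna_to_aa(dna):
    """Translate codon by codon, dropping any incomplete trailing codon.
    Codons missing from this abbreviated table are rendered as 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - len(dna) % 3, 3))

def count_differences(original_aa, mutated_aa):
    """Positions at which the two sequences disagree, plus any length
    difference; used to judge whether a mutant is still functional."""
    mismatches = sum(a != b for a, b in zip(original_aa, mutated_aa))
    return mismatches + abs(len(original_aa) - len(mutated_aa))
```

Dropping the incomplete trailing codon mirrors the earlier observation for figure 25, where a net deletion of 4 nucleotides left two bases that no longer formed a codon.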
suggest_MnCl2_concentration predicts the required MnCl_{2} concentration for a desired mutation rate and number of doublings using linear interpolation over experimental data. [MnCl_{2}] is a critical parameter in PCR-based mutagenesis techniques.
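The interpolation step can be sketched as follows. The calibration points below are placeholders chosen purely for illustration, not Wilson and Keefe's actual measurements:

```python
def suggest_MnCl2_concentration(target_rate, calibration):
    """Linearly interpolate [MnCl2] (mM) from (mutation rate, mM) pairs,
    clamping to the nearest calibration point outside the measured range."""
    pts = sorted(calibration)
    if target_rate <= pts[0][0]:
        return pts[0][1]
    if target_rate >= pts[-1][0]:
        return pts[-1][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= target_rate <= x1:
            return y0 + (y1 - y0) * (target_rate - x0) / (x1 - x0)

# Hypothetical calibration (mutation rate per nucleotide, [MnCl2] in mM);
# a real run would substitute the experimentally measured pairs.
calib = [(0.00033, 0.0), (0.0033, 25.0)]
```

Clamping at the edges is a design choice: extrapolating a linear fit beyond the measured [MnCl_{2}] range would be unreliable, since mutation rate does not scale linearly with MnCl_{2} far outside the calibrated window.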
Lastly, this function determines usability based on the calculated percent difference between the mutated sequence and the original amino acid sequence. In this way, we can get a sense of when a mutant becomes very different from the original amino acid sequence and becomes nonfunctional.
The strengths of this code are its modularity and comprehensiveness: it covers mutation simulation, translation, usability assessment, and MnCl_{2} concentration prediction, providing a holistic simulation model. However, the model has a few weaknesses which can be improved upon by future iGEM teams:
 Placeholder Logic: The mutation logic provided is a placeholder and lacks specific rules for mutations. A more detailed mutation model is needed for accurate simulations.
 Static Codon-to-Amino Acid Mapping: The codon-to-amino acid mapping is predefined and might not cover all genetic variations or organisms, limiting the accuracy of amino acid translation.
 Simplified Relationship: The linear relationship between mutation rate, doublings, and MnCl_{2} concentration might not capture the complexity of realworld experimental conditions.
The function simulate_mutations provides the basis for a more sophisticated epPCR simulation that future iGEM teams can improve upon. Though our function is incomplete, much of the logic behind generating certain types of mutations, translation from DNA to protein, assessing functionality, and estimating [MnCl_{2}] has already been developed.