Promoter Generation


A promoter is a region of DNA that initiates the transcription of a particular gene. It is located upstream(towards the 5’ end) of the gene and provides a binding site for RNA polymerase, the enzyme responsible for transcribing DNA into RNA. Promoters play a crucial role in gene expression as they regulate the rate at which genes are transcribed and control when and where gene expression occurs.

Since Chlamydomonas reinhardtii is not an easy chassis to deal with, optimizing promoters in Chlamydomonas reinhardtii can have following purposes and benefits:

  • Enhancing gene expression:

    by optimizing the promoter sequence, the rate of transcription initiation can be increased, leading to higher levels of protein production from the target gene. This is particularly important when trying to achieve high levels of expression for specific genes of interest.
  • Fine-tuning gene expression:

    Promoters can be optimized to allow precise conrtol over the timing, location, and level of gene expression. This is beneficial for applications where gene expression needs to be regulated or modulated, such as in metabolic engineering or synthetic biology.


Actually, there are no accurate gene annotation of promoters in Chlamydomonas reinhardtii, so we refer to a method mentioned in the previous published article (Novel cis-regulatory elements as synthetic promoters to drive recombinant protein expression from the Chlamydomonas reinhardtii nuclear genome).

Firstly, the 500bp upstream sequence of each Chlamydomonas reinhardtii gene was extracted from the database and stored in excel file. Next, we try to train all the data with two generation models, GAN and VAE. Their overall structure can be seen in Figure 1.2.1 and Figure 1.2.2.

Next, we would like to elaborate the principle of GAN specifically. Now, we have a generator and a discriminator. To learn the generator pgo over data x, we define a prior on input noise variables pz(z)={z(1),...,z(m)}, then represent a mapping to data space as G(z;θg), where G is a differentiable function represented by a multilayer perceptrion with parameters θg. We also define a second multilayer perceptron D(x;θd) that outputs a single scalar. D(x) represents the probability that x came from the data rather than pg. And we update the discriminator by ascending its stochastic gradient:

We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We update the generator by ascending its stochastic gradient:

We simultaneously train G to minimize log(1-D(G(z))). In other words, D and G play the following two-player minimax game with value function V(G,D):

However, how to make discriminator and generator fit each other is exactly a thorny problem. Based on experiments, for G fixed, the optimal discriminator D is

Besides, if G and D have enough capacity, and at each step of Algorithm, the discriminator is allowed to reach its optimum given G, and pg is updated so as to improve the criterion

Then pg converges to pdata.


Despite the gene sequences we extracted are not well correlated and the Pearson correlation coefficient is only 0.21. We firstly trained our adversarial nets on the extracted datasets with the setting of epochs = 1000, learning rate=1e-3. Results are shown in Figure 1.3.1.

It is obviously seen that in the early stage of training, the ability of the generator is relatively weak, and the discriminator is also weak at this time, but it can still distinguish the generated samples from the real samples with sufficient accuracy, so that D(x) performs really well and G(x)’s accuracy is within acceptable limits too. However, as we continue training, the discriminator gets too successful that the generator gradient vanishes and learns nothing, the generator collapses! Actually, in many applications, GAN is hard to tune parameters due to the unavoidable unbalance between the generator and discriminator causing overfitting and their highly sensitivity to the hyperparameter selections. Therefore, we adjust model’s parameters and reduce the training epoch to 20. The loss of our model can be seen in the Figure 1.3.2.


Generally, here we introduce two generative model, VAE and GAN to optimize promoter design. As there is no accurate gene annotation in Chlamydomonas reinhardtii , we adopt a new method to cut the promoter sequence out, then run the VAE and GAN to generate new sequences, which achieve a desirable performance.

Source Code

Modeling of Starch Metabolism


In the part of starch metabolism modification, wet experiments knocked out enzymes such as SBE, AGPase, and GWD. But the results of the wet experiment were qualitative, we recognize the need to predict outcomes under conditions that cannot be controlled through wet experiments. For example, how the growth rate of the mutant Chlamydomonas varies, how the activity of different enzymes affects the amount of starch accumulated, and how enzymes in the starch branching process affect starch branching. To address this, we intend to employ modeling techniques to anticipate results and optimize our approach.

Growth of Mutants of Relevant Genes


With limited resources, the growth and reproduction curves of organisms generally conform to the logistic growth equation, a differential equation, in the case of limited resources:

There N is the total number of individuals in the population, t is time, r is the population growth potential index, and K is the environmental maximum holding capacity.

The meaning of this equation is that when a species moves into a new ecosystem, its population will change. Assuming that the starting population of the species is less than the maximum holding capacity of the environment, the population will grow. This equation is one of the best mathematical models for describing the pattern of population growth under conditions of limited resources.

The laboratory conditions for culturing Chlamydomonas reinhardtii are theoretically consistent with the applicability of the logistic equation. The objective of gene editing is to enhance starch accumulation. Therefore, this part of the study assesses the growth and reproduction of mutant Chlamydomonas reinhardtii. Ideally, the goal is to obtain mutants with growth rates no lower than that of the wild type, facilitating increased starch production.

Data collection in the laboratory involved the following methods:

  • Daily sampling of both the wild type (srt) and different mutants of Chlamydomonas cultures, followed by dilution and measurement of their OD (optical density) values.

  • Samples were taken every eight hours for a total of ten times.

  • These data, in conjunction with Matlab's lsqcurvefit function, were used to fit a nonlinear curve to the growth data and calculate the r-value of the logistic equation under these conditions.


    The data for measuring the OD values of Chlamydomonas are shown below:

    Due to the limitation of the maximum OD value that can be measured by the spectrophotometer, it can be seen that the growth of the wild type and each mutant basically conforms to the first half of the s-shaped curve of the logistic Steele's equation, and most of them have not yet reached the plateau period. We therefore fitted the lsqurvefit function in matlab to calculate the growth rates of the different mutants and predict the growth afterwards. The results are as follows:

    After calculating the r values for the wild type as well as for the different mutants by fitting, the difference in growth rate between the mutants and the wild type was tested for significance by t-test, which can be calculated for the mutants respectively as below:

    When the degree of freedom is 5 and a is taken as 0.05, the table is checked to get the p-value of 2.571 for two-sided t-test, which shows that there is a significant difference in the growth rate of sbe3 and gwd1 mutants. gwd1 mutant has a significantly faster growth rate than the wild type and has an increase in the starch yield, which is a very compliant mutant. gwd2, agp4, and sbe2 mutants do not have any significant difference, which indicates that the growth rate is not slower than the wild type, also ideal mutant. sbe3 growth rate was significantly slowed down, which may affect the starch yield.

    Modeling of Starch Amount in Mutants


    This model is based on the Michaelis-Menten equation for enzymatic reaction rate compliance.

    Here y is the product concentration, x is the substrate concentration, v_max is the maximum rate of the enzymatic reaction, which is determined by a combination of the enzyme concentration and enzyme activity, and k_m represents the affinity of the enzyme, and the greater the affinity of the enzyme for the substrate, the smaller the value of k_m.

    In this simplified model, we narrow our focus to the two primary variables that drive changes in starch accumulation: GWD (Glucan-Water Dikinase) and AGPase (ADP-Glucose Pyrophosphorylase) activity. The model encompasses two central metabolic processes:

  • Starch Synthesis: AGPase plays a pivotal role by converting glucose monophosphate, a precursor of starch synthesis, into ADP-glucose.

  • Starch Catabolism: This process involves phosphorylated degradation catalyzed by GWDase (Glucan-Water Dikinase). To streamline the model, we simplify the representation of branching from glucose hexaphosphate to glucose monophosphate, starch branching, and other degradation pathways, as these processes are considered constant and do not significantly impact the overall starch accumulation outcome within the scope of this mode.

  • Using the Michaelis-Menten equation, with values for the enzyme's key parameters k_m and v_maxx obtained from Brenda's website, the following equations can be listed to represent the catalytic rates of the two enzymes.

    We establish the temporal changes in the concentrations of glucose ([G]) and starch ([S]) as two separate arrays, with GAPase and GWD acting independently. To simulate the ongoing production of three-carbon sugars through photosynthesis, we introduce the variable 'P' to represent the steady growth of glucose. This approach allows us to formulate two distinct sets of differential equations.

    We employed the ode45 function in Matlab to solve the differential equations until the concentrations of starch and glucose reached a steady state. At that juncture, we recorded the cumulative amount of starch and created a corresponding plot.


    Starch accumulation was predicted according to the model as follows. The horizontal axis is the activity of GWD and AGPase, respectively, and the vertical axis is the amount of starch accumulation.

    Observed from the gwd and agpase axes, respectively:

    From the resultant graphs, it can be seen that the lower the AGPase activity, the smaller the starch accumulation; the lower the GWD activity, the larger the starch accumulation. This matches with the qualitative results of starch detection in the wet experiment. At the same time, the effects of AGPase and GWD on the amount of starch accumulation were similar because the two enzymes had similar Km values and similar affinity for the substrate.

    Modeling of Starch Branching in Mutants


    The starch branching process consists of three primary stages. Initially, activated ADP-glucose is added to the ends of starch chains, a reaction catalyzed by starch synthase (SS). Subsequently, starch branching enzyme (SBE) intervenes by removing a portion of the starch chain under specific criteria, including the cut chain length and the remaining chain length, to create branches within the starch structure. Lastly, starch debranching enzyme (DBE) catalyzes a more intricate reaction, responsible for cutting the starch branches and removing them.

    It's worth noting that the process governed by starch debranching enzymes is somewhat intricate. This is because starch branching in plants adheres to certain conditions: the length of the truncated branched chain must not be shorter than a defined value, Xmin, and the length of the starch chain remaining in the cut should not be shorter than X0. In homologous organisms, this value tends to remain relatively constant.

    From the literature, we learn about a model used to represent plant starch branching systems. In this model, starch branching is depicted through an infinite-dimensional vector, denoted as N(x). This vector represents the number of branched chains with straight-chain starch branching chain lengths of x. For instance, N(5) indicates the count of branched chains with a branch length of 5.

    The catalytic average rates of the other three enzymes are a_SS , a_DBE , and a_SBE, focusing on the effect of the three processes on N(x), respectively.

    SS can extend a starch chain of length x-1 by x or a starch chain of length x by x+1, so for length x the effect is:

    DBE catalyzes the removal of starch branches, therefore a chain of length x the action is

    The role of SBE is rather intricate. To represent this complexity, we introduce a step function H(x), where H(x) equals 1 when x is greater than 0 and 0 when x is less than 0.

    On one hand, when x exceeds X_min + X_0, SBE cuts the longer starch strands, thereby reducing the number of branches with a length of x.

    On the other hand, when SBE cuts chains of a length greater than X, all sites are cut with equal probability. This results in both the chains being cut down to form new branches and the chains left behind likely having a length of x. This process corresponds to the second and third terms in the following equation.

    The rate of change in starch chain length is obtained by summing the three enzymes acting together, i.e.

    Chemical reactions in organisms generally reach a steady state condition. Thus when the rate of change of starch chain length is zero,

    Shift the terms and combine to get [4]

    At this point the equation changes from three parameters to two, with β representing the activity of the SBE with respect to the SS, i.e., the branching capacity of the system, and γ representing the activity of the DBE with respect to the SS, i.e., the capacity of the system for starch degradation pathways.


    The starch branching situation is shown below, the horizontal axis is the value of beta and gamma, respectively, setting the parameters: xmin=7, x0=6

    With the longest starch branch length set to N=20, we establish a threshold value of threshold=5. In this context, starch chains with lengths exceeding five monomers are identified as biased straight chains. The sum of the number of branched chains with lengths surpassing the threshold serves as the y-value on the vertical axis. This value effectively characterizes the starch branching, where a higher y-value indicates a greater orientation towards straight-chain starch.

    Observing the effect of beta on the y-value when the gamma value is constant:

    It can be seen that the larger the relative activity of SBE, the smaller the amount of straight-chain starch. When the relative activity of DBE is large enough, the effect of SBE on the amount of straight-chain starch becomes smaller because degradation dominates at this point compared to branching. The model is not supported by wet experimental data for technical reasons, but it is consistent with common sense.


    Wet Lab Data Issues

    Due to insufficient wet experimental data, the results of some models could not be verified, e.g., only qualitative data were available for starch accumulation, and there was a complete lack of data for starch branching. Therefore, if the wet experimental data can be improved to obtain more accurate quantitative data, the existing model can be fitted to form a closed loop. At the same time, the model can be improved under the guidance of the wet experimental data, such as adding or subtracting terms and adjusting parameters in the modeling equation.

    Parameter Selection Issues

    The issue of parameter selection is fundamentally linked to the availability of wet experimental data. In the starch metabolism model, enzyme-related parameters ideally should be derived from rigorous wet experimental measurements conducted under specific culture conditions. However, due to technical constraints, these parameters were instead sourced from a database integrated into the model. Nevertheless, it's important to note that the enzyme kinetics of the specific organism, Chlamydomonas reinhardtii, are not extensively studied, and data are relatively limited. Consequently, the model relies on enzyme data from species as closely related as possible, which can introduce potential sources of error.

    • [1]Radakovits, R., Jinkerson, R. E., & Darzins, A. Posewitz,. MC. (2010). Genetic engineering of algae for enhanced biofuel production. Eukaryotic Cell, 9(4), 486-501
      [3]Chi A W,G R G. Molecular weight distributions of starch branches reveal genetic constraints on biosynthesis.[J]. Biomacromolecules,2010,11(12).
      [4]Chi A W,G R G. Molecular weight distributions of starch branches reveal genetic constraints on biosynthesis.[J]. Biomacromolecules,2010,11(12).