Model | ShanghaiTech-China

Overview

In our design, we involved the luminescence and rhythm changes of cyanobacteria, and the analysis of these two aspects is very important. Therefore, we designed five models to simulate the circadian rhythm, metabolic changes, selection of luminescent proteins, and the luminescence process of cyanobacteria.

In the first model, we used ODEs to simulate the circadian rhythm protein KaiABC and its mechanism in cyanobacteria.

In the second model, we used flow balance analysis (FBA) to analyze the nutritional status of PCC7942.

In the third model, we successfully predicted the brightness of hundreds of fluorescent proteins with the help of LSTM systems. And the model has been proved to be accurate by wetlab-works.

In the fourth model, we conducted probable fluorescent proteins that never exist on earth with the help of GANs system, which may surpass the brightness of all fluorescent proteins in wild nature.

In the fifth model, this study employed molecular dynamics simulations with Python and GROMACS to model and analyze the luminescence efficiency within experimental BRET systems. System preparation, molecular structure acquisition, and GROMACS formatting were followed by simulation setup, energy minimization, and equilibration. Subsequent production runs captured BRET system dynamics, while Python scripts processed trajectory data for key parameters, enabling the calculation of BRET efficiency. These simulations provided insights into donor-acceptor interactions, distances, and orientations, shedding light on the dynamic nature of BRET under various conditions. This integrated approach enhances our understanding of molecular interactions in biological systems and holds promise for diverse research applications.

Circadian Rhythms ODES

The circadian rhythm is produced by organisms in order to adapt to the objectively existing day and night changes in nature. It originates inside the cell and can do it on its own without external regulation. The biological clock is realized by the organism spontaneously through three functions of timing, internal and external coupling and output signal.

Timing is done through a series of slow biochemical processes, resulting in an oscillation period approaching 24 hours. To ensure that these temporal information are synchronized with the environment, circadian clocks need to be adjusted through internal and external coupling mechanisms. Finally, time information is transmitted to other parts of the cell through the signal output mechanism, thus forming the biological clock of the organism ^[1].

This diurnal rhythm was manifested in the concentration changes of KaiA, KaiB and KaiC in Synechococcus sp. PCC7942. To put it simply, KaiA, KaiB and KaiC protein in the cell will regulate the phosphorylation and dephosphorylation of KaiC protein itself, and the oscillation process of phosphorylation and dephosphorylation lasts exactly 24 hours , thus forming the timing mechanism of the PCC7942 biological clock ^[2].

KaiC is a hexameric enzyme, which can self-phosphorylate and self-dephosphorylate at two sites of serine 431 (S431) and threonine 432 (T432), which makes KaiC protein have four states: two None of the sites are phosphorylated-U(unphosphorylated), only phosphorylated at the serine 431 site-S(S431), only phosphorylated at the threonine 432 site-T(T432), at two positions Phosphorylated-ST occurred in all spots. As shown in the figure, KaiA protein can promote the three processes of U to T, T to ST, and S to ST, and inhibit the dephosphorylation of ST to S, while both KaiB and S can inactivate KaiA protein ^[3].

According to Rust MJ's method, we can write the chemical equation for this process:

(14) Pcu => Pct (Phosphorylation at T432)

(15) Pct => Pcb (Phosphorylation at S431)

(16) Pcb => Pcs (Dephosphorylation at T432)

(17) Pcs => Pcu (Dephosphorylation at S431)

(18) Pcu => Pcs (Phosphorylation at S431)

(19) Pcs + KaiA => Pcb (Phosphorylation at T432)

(20) Pcb => Pct (Dephosphorylation at S431)

(21) Pct => Pcu (Dephosphorylation at T432)

Because KaiC protein is a hexamer, and both KaiA and KaiB are dimers, we can write the chemical equation of the polymerization process, and only the polymerization of KaiC protein is reversible during this process, while the polymerization of KaiA and KaiB proteins are irreversible processes ^[4].

(7) 2Pamo => Pa (Dimer formation process)

(11) 2Pbmo => Pb (Dimer formation process)

(12) 6Pcmo => Pcu (The formation of hexamers)

(13) Pcu => 6Pcmo (The disintegration process of the hexamer)

In addition, KaiB protein and KaiC protein have a common promoter KaiBC ^[5], and its transcription is positively regulated by transcription factor RpaA ^[6], which is related to the effect of ATP content on phosphorylation rate ^[7], the change of dissociation rate in cyanobacteria, and the effect of light on transcription and translation in cyanobacteria. The regulation of the rate together constitutes the internal and external coupling mechanism of the PCC7942 circadian clock.

Since RpaA is activated by KaiC when KaiC is active. When KaiC is inactive, RpaA is inactivated by KaiC ^[8]. Based on this, we can write the chemical equation for the activation and inactivation process of RpaA. Since the phosphorylated KaiC protein at different sites does not show significant differences in the effects of RpaA transcription factors, the same reaction rate is used when writing its rate constant:

(1) Rac + Pcu => Rin + Pcu (Inactivation of RpaA)

(2) Rin + Pcs => Rac + Pcs (Activation of RpaA)

(3) Rin + Pct => Rac + Pct (Activation of RpaA)

(4) Rin + Pcb => Rac + Pcb (Activation of RpaA)

And the transcription and translation chemical equations of the three proteins KaiA, KaiB, and KaiC can be written:

(5) Ga => Ga + Ma

(6) Ma => Ma + Pamo

(8) Gbc => GBC + Mb + Mc

(9) Mb => Mb + Pbmo

(10) Mc => Mc + Pcmo

To sum up, all the reaction equations we involve are as follows:

(1) Rac + Pcu => Rin + Pcu (Inactivation of RpaA)

(2) Rin + Pcs => Rac + Pcs (Activation of RpaA)

(3) Rin + Pct => Rac + Pct (Activation of RpaA)

(4) Rin + Pcb => Rac + Pcb (Activation of RpaA)

(5) Ga => Ga + Ma

(6) Ma => Ma + Pamo

(7) 2Pamo => Pa (Dimer formation process)

(8) Gbc => GBC + Mb + Mc

(9) Mb => Mb + Pbmo

(10) Mc => Mc + Pcmo

(11) 2Pbmo => Pb (Dimer formation process)

(12) 6Pcmo => Pcu (The formation of hexamers)

(13) Pcu => 6Pcmo (The disintegration process of the hexamer)

(14) Pcu => Pct (Phosphorylation at T432)

(15) Pct => Pcb (Phosphorylation at S431)

(16) Pcb => Pcs (Dephosphorylation at T432)

(17) Pcs => Pcu (Dephosphorylation at S431)

(18) Pcu => Pcs (Phosphorylation at S431)

(19) Pcs + KaiA => Pcb (Phosphorylation at T432)

(20) Pcb => Pct (Dephosphorylation at S431)

(21) Pct => Pcu (Dephosphorylation at T432)

We have now summarized all our reaction equations and can begin writing ordinary differential equations.

When writing the ordinary differential equation of the above reaction equation, you need to make some assumptions that can simplify the formula ^[9]:

1. The change of all substances is linear.

2. Generally, translation is much faster than transcription, and both are performed simultaneously in prokaryotic cells, so we omit the presence of mRNA in the ODE equation.

3. Since KaiB and KaiC are co-transcribed, it is assumed that they have the same concentration of mRNA ([MB] = [MC]).

4. Assume that protein and mRNA degradation is not regulated, i.e. it is not dependent on the concentration of other molecules in the cell.

5. It is believed that S, T and ST respectively activate the RpaA transcription factor at the same rate.

6. Since the polymerization intermediate product is unstable, it is simplified to a one-step polymerization reaction.

7. Since the coefficient Q10 of the PCC7942 reaction rate with temperature is only 1.1, the influence of temperature is not considered in the follow-up model ^[10].

Since KaiA and RpaA are synthesized at a constant rate ^[11], the transcription process is simulated without using the Hill function, and dmp is introduced as a dissociation constant, and the activation and inactivation of RpaA and the polymerization process of KaiA protein are added:

`(1)\frac{d[R\text{in}]}{dt}=ktr-kact*[R\text{in}]*([Pcs]+[Pct]+[Pcb])+k\text{in}a*[Rac]*[Pcu]-dmp*[R\text{in}]`

`(2)\frac{d[Rac]}{dt}=kact*[R\text{in}]*([Pcs]+[Pct]+[Pcb])-k\text{in}a*[Rac]*[Pcu]-dmp*[Rac]`

`(3)\frac{d[Pamo]}{dt}=ktsa-2*ka*[Pamo]^{2}-dmp*[Pamo]`

`(7)\frac{d[Pa]}{dt}=ka*[Pamo]^{2}-ksb*[Pcs]-dmp*[Pa]`

The transcription of KaiBC was simulated using the Hill function ^[12] with a Hill coefficient of 4 ^[13]. It can be seen that the transcriptional synergy between Rac and KaiBC promoters is very high:

`(6)h^{+}([Rac],Kbc\text{,nbc})=\frac{[Rac]^{\nbc}}{(Kbc^{nbc}+[Rac]^{nbc})}`

Introduce the Hill function into the transcription process of KaiB and KaiC:

`(4)\frac{d[Pbmo]}{dt}=h^{+}([Rac],Kbc,nbc)*ktbc-2*kb*[Pbmo]^{2}-dmp*[Pbmo]`

`(5)\frac{d[Pcmo]}{dt}=h^{+}([Rac],Kbc,nbc)*ktbc+6*dcmo*[Pcu]-6*kc*[Pcmo]^{6}-dmp*[Pcmo]`

And write the function of KaiB and KaiC:

`(8)\frac{d[Pb]}{dt}=kb*[Pbmo]^{2}-dmp*[Pb]`

`(9)\frac{d[Pcu]}{dt}=kc*[Pcmo]*dcmo*[Pcu]+ktu*[Pct]+ksu*[Pcs]-kut*[Pcu]`
`-kus*[Pcu]-drmp*[Pcu]`

`(10)\frac{d[Pct]}{dt}=kut*[Pcu]+kbt*[Pcb]-ktu*[Pct]-ktb*[Pct]-dmp*[Pct]`

`(11)\frac{d[Pcs]}{dt}=kus*[Pcu]+kbs*[Pcb]-ksu*[Pcs]-ksb*[Pcs]-dmp*[Pcs]`

`(12)\frac{d[Pcb]}{dt}=ktb*[Pct]+ksb*[Pcs]-kbt*[Pcb]-kbs*[Pcb]-dmp*[Pcb]`

Regulation of phosphorylation rate by introduction of KaiA according to function in Rust MJ paper ^[2]:

`(13)A=max\{0,[Pa]-2*m*([Pcs]+[Pb])\}`

`(14)[AC]=\frac{A+[Pcu]+kd-\sqrt{(A+[Pcu]+kd)^2-4*A*[Pcu]}}{2}`

`(15)kxy=\text{kxy}0+\frac{\text{kxy}1^*[\text{AC}]}{\frac{\text{[Pcu]}}{\text{Khalf +[AC]/[Pcu]}}}`

Import the above 15 equations into matlab, use the ODE45 numerical method to simulate, and get the image:

All substances have reached a stable state. We try to draw only the four forms of KaiC proteins:

It can be seen that the four forms of KaiC protein have reached a stable state, but the effective oscillation form has not yet formed in cyanobacteria, so the change of ATP content is introduced to form a stable rhythm.

According to Rust MJ's new research in 2011, the specific ATP rate can be introduced into the phosphorylation process rate of KaiC, but this process does not affect the dephosphorylation rate ^[7]:

`(15) kxy=\text{kxy}0+\frac{\text{kxy}1*[AC]}{\frac{[Pcu]}{\text{Khalf }+[AC]/[Pcu]}}` (At dephosphorylation)

`(16) kxy=\text{kxy}0+\frac{\text{kxy}1*ratp*[AC]}{\frac{[Pcu]}{\text{Khalf}+[AC]/[Pcu]}}` (At phosphorylation)

`(17) ratp=\frac{[ATP]}{[ATP]+krel*[ADP]}`

To sum up, we can now write all the rate equations:

`(1)\frac{d[R\text{in}]}{dt}=ktr-kact*[R\text{in}]*([Pcs]+[Pct]+[Pcb])+k\text{in}a*[Rac]*[Pcu]-dmp*[R\text{in}]`

`(2)\frac{d[Rac]}{dt}=kact*[R\text{in}]*([Pcs]+[Pct]+[Pcb])-k\text{in}a*[Rac]*[Pcu]-dmp*[Rac]`

`(3)\frac{d[Pamo]}{dt}=ktsa-2*ka*[Pamo]^{2}-dmp*[Pamo]`

`(4)\frac{d[Pbmo]}{dt}=h^{+}([Rac],Kbc,nbc)*ktbc-2*kb*[Pbmo]^{2}-dmp*[Pbmo]`

`(5)\frac{d[Pcmo]}{dt}=h^{+}([Rac],Kbc,nbc)*ktbc+6*dcmo*[Pcu]-6*kc*[Pcmo]^{6}-dmp*[Pcmo]`

`(6)h^{+}([Rac],Kbc\text{,nbc})=\frac{[Rac]^{\nbc}}{(Kbc^{nbc}+[Rac]^{nbc})}`

`(7)\frac{d[Pa]}{dt}=ka*[Pamo]^{2}-ksb*[Pcs]-dmp*[Pa]`

`(8)\frac{d[Pb]}{dt}=kb*[Pbmo]^{2}-dmp*[Pb]`

`(9)\frac{d[Pcu]}{dt}=kc*[Pcmo]*dcmo*[Pcu]+ktu*[Pct]+ksu*[Pcs]-kut*[Pcu]`
`-kus*[Pcu]-drmp*[Pcu]`

`(10)\frac{d[Pct]}{dt}=kut*[Pcu]+kbt*[Pcb]-ktu*[Pct]-ktb*[Pct]-dmp*[Pct]`

`(11)\frac{d[Pcs]}{dt}=kus*[Pcu]+kbs*[Pcb]-ksu*[Pcs]-ksb*[Pcs]-dmp*[Pcs]`

`(12)\frac{d[Pcb]}{dt}=ktb*[Pct]+ksb*[Pcs]-kbt*[Pcb]-kbs*[Pcb]-dmp*[Pcb]`

`(13)A=max\{0,[Pa]-2*m*([Pcs]+[Pb])\}`

`(14)[AC]=\frac{A+[Pcu]+kd-\sqrt{(A+[Pcu]+kd)^2-4*A*[Pcu]}}{2}`

`(15) kxy=\text{kxy}0+\frac{\text{kxy}1*[AC]}{\frac{[Pcu]}{\text{Khalf }+[AC]/[Pcu]}}` (At dephosphorylation)

`(16) kxy=\text{kxy}0+\frac{\text{kxy}1*ratp*[AC]}{\frac{[Pcu]}{\text{Khalf}+[AC]/[Pcu]}}` (At phosphorylation)

`(17) ratp=\frac{[ATP]}{[ATP]+krel*[ADP]}`

And adjust the metabolic rate and gene transcription rate according to the specific ATP rate. However, because there is no quantitative relationship research on the specific ATP rate in Synechococcus PCC7942 over time, only in the paper of Takano S, it was found that the specific ATP rate will drop to 80% in the dark, but there is no process of changing the rate Research, so the logistics function is used to fit this process ^[14]:

The fitting result is:

$$\frac{1.8}{1+\mathrm{e}^{\frac{\ln0.8}6(x-6)}}$$

Here we can put forward a hypothetical simplified model, thinking that 00:00-06:00 and 18:00-24:00 are nights every day, and the rest are days. According to a series of linear changes, the logistics function in three intervals can be obtained:

$$\frac{1.8}{1+\mathrm{e}^{\frac{\ln0.8}6(x)}}+1.8$$

Time between 00:00 and 06:00

$$\frac{1.8}{1+\mathrm{e}^{\frac{\ln0.8}6(x-12)}}$$

Time between 06:00 and 18:00

$$\frac{1.8}{1+\mathrm{e}^{\frac{\ln0.8}6(x-24)}}+1.8$$

Time between 18:00 and 24:00

Bring the specific ATP rate into the model, and use the ODE45 numerical method of matlab to solve it again, and get the image:

All materials have formed stable oscillations, and we still focus on the four forms of KaiC protein:

It can be seen that the concentration of the four states of KaiC forms a good oscillation, and the cycle is 24h, which coincides with the biological clock of PCC7942.

A complete list of abbreviations, substance descriptions, rate constants and their sources, chemical reaction equations, ODEs, matlab codes are in the appendix.

View our source codes on igem gitlab.

FBA in PCC7942

Flux Balance Analysis (FBA) is a computational method commonly used in systems biology to study biological networks, especially metabolic networks. The core idea is to predict the growth rate of a cell or the rate of other biological processes by balancing the input and output metabolite fluxes under given environmental conditions ^[15].

The basic principle is to take advantage of a specific rate or "flow" that exists for every metabolic reaction in an organism. Under steady-state conditions, we believe that the production rate and consumption rate of each metabolite are balanced, which means that in the metabolic network of the organism, the total input flow of each node is equal to the total output flow.

In addition to equilibrium constraints, there are other constraints, including but not limited to the upper and lower limits of enzyme reaction rates, nutrient uptake rates, and artificially prescribed simulation conditions.

FBA often uses linear programming to find a flow distribution that satisfies all equilibrium constraints to maximize or minimize a certain objective function, such as maximizing biomass production. In this model we refer to the experimental method of Triana J. , maximizing the flow rate of the biomass reaction ^[16].

Since it often takes several years to construct a metabolic network, we found Triana J's model on the BioModels Database, named Triana2014 - Genome-scale metabolic network of Synechococcus elongatus (iSyf715), and the link is https://www.ebi.ac.uk/biomodels/MODEL1006230034#Files ^[17].

After obtaining the name MODEL1407310000_url.xml from the link, we started to simulate. Since the tool BioMet(18)used in the literature is no longer available, we decided to use python's xml.etree.ElementTree library to import the file and perform subsequent processing.

First, we obtained the file information and learned that the model involved a total of 902 reactions and 915 substances.

We then renamed the two compartments and exported the substances in the compartment Cytosol as the S matrix of linear programming.

And in a 902*1 0 matrix, change the value corresponding to biomass to 1 as the finding condition.

Then we defined the upper and lower limits of each reaction. First, we set the upper and lower limits of all reactions to 10000 and -10000, and then additionally changed the constraints of several reactions:

Reaction Name	Upper Limit	Lower Limit
CO2in	0	1.99
H2CO3transport	0	1.99
nitrateTRANS-RXN59G-237	160	160
_COtransp	-10	10
_lightII	0	1.96
_lightI	0	1.96
sulfateTRANS-RXN59G-407	-104	104

Then we imported the matrix into MATLAB and used the linprog() function for linear programming. We found that the growth of biomass was 2.193 times the nitrate absorption rate, the ATP production rate was 294.556 times the nitrate absorption rate, and the glycogen production rate was 294.556 times the nitrate absorption rate. 5.894 times the absorption rate.

We then used our design to increase sucrose reserves by inhibiting the expression of GlgC protein during the day, and then eliminate this effect at night to achieve the need to store energy during the day for consumption at night ^[19].

So we queried the reaction list. The reaction in which the GlgC protein participated was alpha-D-glucose 1-phosphate + ATP -> ADP-D-glucose + diphosphate. The reaction name was 2.7.7.27. We limited the rate of this reaction to 0. , and then linear programming was solved again, and it was found that the growth of biomass was 1.772 times the nitrate absorption rate, the ATP production rate was 226.919 times the nitrate absorption rate, and the glycogen production rate was 4.264 times the nitrate absorption rate.

We tabulate it and calculate it:

	GlgC Involved	No GlgC Involved	Change Rate
Nitrate Absorption Rate	1	1	0%
Growth Rate	2.193	1.772	-19.20%
ATP Production Rate	294.556	226.919	-22.96%
Glycogen Production Rate	5.894	4.264	-27.66%

It can be seen that the glycogen production rate, ATP production rate, and cell growth rate all dropped significantly during the day. According to the conservation of materials, the sucrose production rate increased significantly, and our design can be basically realized.

View oue source codes on igem gitlab.

LSTM prediction system

Background

Currently, there are over 900 recorded fluorescent proteins. However, the brightness information for these proteins is incomplete, with nearly half of them lacking recorded fluorescence intensity data. The fluorescence intensity of fluorescent proteins can vary by more than 1600-fold.

To identify the brightest fluorescent proteins suitable for the LAMPs project, I attempted to use an LSTM system to establish a model for fluorescence intensity. The following is an introduction to the LSTM system.

LSTM (Long Short-Term Memory) is a variant of recurrent neural networks (RNNs) specifically designed to handle and model long-term dependencies in data. Compared to traditional RNNs, LSTM introduces a mechanism called "gate units" to better capture and manage long-term dependencies in time series data.

In traditional RNNs, information is passed from one time step to the next through recurrent connections. However, when dealing with long time sequences, RNNs may encounter the problem of vanishing or exploding gradients, making it difficult to effectively capture and utilize long-range dependencies. LSTM addresses this issue by introducing gate mechanisms.

The core component of an LSTM network is the LSTM cell, which consists of an input gate, a forget gate, an output gate, and a cell state. The computation process of an LSTM cell can be summarized as follows:

Input Gate: Determines the extent to which the cell state should be updated. It takes the input and the previous time step's hidden state, applies a sigmoid activation function, and yields a value between 0 and 1, indicating the importance of each element.

Forget Gate: Determines the extent to which information should be forgotten from the cell state. Similarly, it takes the input and the previous time step's hidden state, applies a sigmoid activation function, and yields a value between 0 and 1, indicating the degree of forgetting for each element.

Update Cell State: The cell state is updated by multiplying the input gate and the forget gate with the previous time step's cell state and adding the current input after being passed through a tanh activation function. This allows the LSTM to selectively remember or forget certain information.

Output Gate: Determines the current time step's hidden state. First, the input and the previous time step's hidden state are combined and passed through a sigmoid activation function, resulting in a value between 0 and 1, indicating the degree of output for each element. Then, the current cell state is passed through a tanh activation function and multiplied by the output gate to obtain the final hidden state.

Model Construction Process

In the process of model construction, considering the conservation of fluorescent protein fluorescent domains, I directly established the mapping from protein sequences to fluorescence intensity, rather than from protein sequences to protein tertiary structures and then to fluorescence intensity.

First, I downloaded the basic information of all fluorescent proteins from the website https://www.fpbase.org/. Then, I encoded the protein sequences represented as string sequences into integer sequences for model computation. Next, I selected proteins with fluorescence intensity information as the training set and trained a discriminator model for fluorescence intensity. Finally, I input the protein sequences without fluorescence intensity information into the discriminator model to obtain predicted fluorescence intensity values.

Dif f iculties Encountered

Due to issues such as model overfitting, the loss value and val_loss value remained high, indicating insufficient effectiveness of the learning iterations. I optimized the model by normalizing the fluorescence intensity, reducing model complexity, and employing other methods. However, due to the limited availability of training data for fluorescent proteins, the model still has limitations.

The above is a portion of the data used for training.

When I input a string sequence into the trained discriminator, the discriminator will return a score indicating the fluorescence intensity of that sequence (with a maximum score of 1 after data normalization).

Experimental Veri f ication

I selected several fluorescent proteins with high existing fluorescence intensity and wavelengths suitable for fluorescence resonance energy transfer as Series A. For Series B, I chose proteins without fluorescence intensity information but with high predicted values from the model. Plasmids were constructed and transformed into Escherichia coli (BL21, DH5`\alpha`) for expression. The results showed that the fluorescence intensities of proteins in Series B, which had similar predicted values to Series A, were indeed close to those of Series A (in nature, fluorescent protein fluorescence intensities can vary by more than 1600-fold). For more detailed information, please refer to the wetlab section.

<- Series A Series B ->

Partial Code:

View our source codes on igem gitlab.

GAN production system

Background

In order to overcome the limitations of nature and further enhance the fluorescence intensity of fluorescent proteins, I am using a generative adversarial network (GAN) to computationally generate protein sequences that potentially have higher fluorescence intensity.

Here is an introduction to Generative Adversarial Networks (GANs):

Generative Adversarial Networks (GANs) are a class of machine learning models that have gained significant attention in recent years. GANs consist of two neural networks, namely the generator and the discriminator, which are trained in an adversarial manner.

The generator network takes random noise as input and learns to generate synthetic data samples, such as images, text, or other structured data, that resemble real data. The goal of the generator is to generate samples that are indistinguishable from the real data.

On the other hand, the discriminator network is trained to distinguish between real data samples and generated (fake) samples. It learns to classify whether a given sample is real or fake. The discriminator is trained using a combination of real and generated data samples.

During the training process, the generator and discriminator are pitted against each other in a minimax game. The generator aims to fool the discriminator by generating highly realistic samples, while the discriminator aims to accurately classify real and fake samples. This adversarial process drives both networks to improve over time.

As training progresses, the generator learns to generate samples that become increasingly similar to the real data distribution, while the discriminator becomes more adept at distinguishing between real and fake samples. Ideally, this leads to the generator producing high-quality samples that are difficult for the discriminator to differentiate from real data.

Model establishment process

The process involves constructing a generator model and using the LSTM system from the previous model as the discriminator. The generator is trained with feedback from the discriminator. Once training is completed, it can generate amino acid sequences that are estimated to have high fluorescence intensity. Among them, sequences with higher estimated values are selected as the C series. However, due to time constraints, the C series has not been synthesized in the laboratory and tested for expression in Escherichia coli.

<- Series C ->

Molecular Dynamic Simulation

Method 1-Gromacs

The first method is to use GROMACS. This method offers high simulation accuracy, but it has strict requirements for protein PDB files. Therefore, it was not used when conducting large-scale tests on AI-generated sequences. So, in this section, only YFP and luciferase were used as examples for validation.

Gromacs molecular dynamic simulation procedures:

Donor and Acceptor Molecules: We chose Yellow Fluorescent Protein (YFP) as the donor molecule and Luciferase as the acceptor molecule. The PDB structure files for these molecules can be obtained from specialized biological databases such as the Protein Data Bank (PDB).

Building Molecular Models: During the model preparation phase, we utilized built-in tools within GROMACS (e.g., pdb2gmx) to process PDB files. This included adding hydrogen atoms, removing water molecules, and handling any ions or co-factors. (Here we use PyMol to eliminate water molecules.)This ensures accuracy and stability in the molecular models for simulation.

Defining the Simulation Box: We created a cubic simulation box, ensuring its dimensions were sufficiently large to accommodate the donor, acceptor, and an ample number of water molecules to simulate the aqueous environment found in biological systems. The size of the simulation box is typically optimized to prevent unwanted interactions.

Solvation: To mimic a solution environment, we added water molecules to the simulation box. Typically, we employed an appropriate water model (e.g., TIP3P or SPC) to describe the behavior of water molecules. Solvation of the simulation box is a critical step in atomic simulations.

Ion Addition: To maintain the neutrality of the simulation system and simulate physiological salt concentrations, we added an appropriate number of ions. This ensures charge balance and system stability.

Energy Minimization: Energy minimization runs were performed using GROMACS' grompp and mdrun commands. This step helps eliminate unfavorable conformations and reduces the system's energy to achieve a stable starting point.

Temperature and Pressure Equilibration: To stabilize the simulation system, NVT (temperature) and NPT (pressure) equilibration runs were conducted. In these steps, the system was gradually brought to the target temperature and pressure, and equilibrium simulations were carried out for a specified duration.

Simulation Parameters: Appropriate simulation parameters, including time step, temperature control, and pressure coupling methods, were chosen. Generally, a sufficiently small time step is selected to capture fast molecular motion while ensuring simulation efficiency.

Simulation Execution: Using the GROMACS mdrun command, we performed the production molecular dynamics simulation. This step covered the entire time range of the simulation and typically required significant computational resources.

Extraction of Key Data: We employed GROMACS-provided tools and scripts to extract key parameters, such as donor-acceptor distances and orientations, from the simulation trajectory data.

BRET Efficiency Calculation: BRET efficiency is often calculated based on energy transfer rates and distances. We utilized information obtained from the simulation data to calculate these parameters and subsequently determine BRET efficiency.

The following codes can be directly run in python. Also, the "gmx" part can be directly run in PowerShell.

Also, here lists all the .mdp files

Method 2-Python

Python molecular dynamic simulation procedures:

Building the Simulation System: We began by establishing a virtual three-dimensional simulation system that included luminescent protein molecules. This system aimed to mimic the microenvironment within a cell, accounting for the presence of solvent molecules.

Initial Positions and Velocities: Setting the initial positions and velocities of the luminescent protein molecules was a crucial starting point for the simulation. These initial conditions could be determined based on experimental data or theoretical assumptions.

Brownian Motion Simulation: We employed a Brownian motion model to describe the random motion of luminescent protein molecules. Brownian motion represents the random diffusion of molecules in a liquid, influenced by collisions with surrounding molecules. This model is typically simulated using random number generation.

Numerical Integration Methods: We used numerical integration methods such as the Euler method or implicit methods to compute the new positions and velocities of molecules. These methods predict the position and velocity at the next time step based on the current position and velocity, as well as the simulation's time step.

Recording Trajectory Data: During the simulation runs, we recorded the changes in position and velocity of molecules over time, creating trajectory data. These data allowed us to reconstruct the path of molecular motion.

Results:

Here shows results for most promising sequences predicted by the last model. Each sequence was predicted with five different structures through AlphaFold 2.

Conclusion:

Our simulation revealed distinct molecular diffusion patterns within the cellular context. We observed that luminescent protein molecules exhibited both free diffusion and constrained motion near cellular structures. This suggests that the local microenvironment influences their movement.

By tracking the distances between luminescent protein pairs over time, we observed fluctuations in donor-acceptor distances. These fluctuations have implications for BRET efficiency, as energy transfer efficiency is highly dependent on the proximity of the donor and acceptor molecules. We discovered a strong correlation between changes in donor-acceptor distances and BRET efficiency. When donor and acceptor molecules were in close proximity, the simulated BRET efficiency increased, demonstrating the impact of molecular motion on energy transfer.

The insights gained from this simulation offer valuable implications for BRET studies. Researchers can use this knowledge to better understand and predict BRET efficiency based on the dynamic behavior of luminescent protein molecules.

Our simulations indicate a significant influence of molecular motion on BRET efficiency. Fluctuations in the distances between donor and acceptor molecules can lead to variations in BRET efficiency. This underscores the importance of considering molecular motion when interpreting experimental data and designing BRET experiments.

Python random motion simulation offers a high degree of control over simulation conditions, allowing precise adjustments of parameters to simulate different scenarios. Compared to laboratory experiments, molecular dynamics simulations save significant resources and time. This method is not only cost-effective but also capable of predicting outcomes before conducting experiments. Simulations grant us an in-depth understanding of molecular dynamic behavior and interactions, providing robust guidance for further experimental research.

Simulations are based on theoretical models, thus their accuracy is constrained by model selection. Different models may yield different results. Random motion simulations are approximate methods that overlook some details, such as quantum effects, which could introduce errors in certain cases. Simulations may require substantial computational resources and time, especially for complex systems and long-time scales.

Experimental Verif ication:

I selected several fluorescent proteins with high existing fluorescence intensity and wavelengths suitable for fluorescence resonance energy transfer as Series A. For Series B, I chose proteins without fluorescence intensity information but with high predicted values from the model. Plasmids were constructed and transformed into Escherichia coli (BL21, DH5`\alpha`) for expression. The results showed that the fluorescence intensities of proteins in Series B, which had similar predicted values to Series A, were indeed close to those of Series A (in nature, fluorescent protein fluorescence intensities can vary by more than 1600-fold). For more detailed information, please refer to the wetlab.

From left to right, they are labeled as A1, A2, B2, and the blank control.

The verification data of the model in wetlab is attached below. They are the emission spectra results of A series B series proteins and blank control in the enzyme marker and the luminous map of A series proteins and B series proteins in the laboratory.

In the enzyme marker, the peak brightness of A-series proteins and B-series proteins when the excitation light frequency is about 510nm, which is in line with the theory and conforms to the principle of fluorescence resonance energy transfer.

View our source codes on igem gitlab.

Reference

Model 1:

[1] Swan JA, Golden SS, LiWang A, Partch CL. Structure, function, and mechanism of the core circadian clock in cyanobacteria. J Biol Chem. 2018 Apr 6;293(14):5026-5034. doi: 10.1074/jbc.TM117.001433. Epub 2018 Feb 13. PMID: 29440392; PMCID: PMC5892564.

[2] Rust MJ, Markson JS, Lane WS, Fisher DS, O'Shea EK. Ordered phosphorylation governs oscillation of a three-protein circadian clock. Science. 2007 Nov 2;318(5851):809-12. doi: 10.1126/science.1148596. Epub 2007 Oct 4. PMID: 17916691; PMCID: PMC2427396.

[3] Welkie DG, Rubin BE, Chang YG, Diamond S, Rifkin SA, LiWang A, Golden SS. Genome-wide fitness assessment during diurnal growth reveals an expanded role of the cyanobacterial circadian clock protein KaiA. Proc Natl Acad Sci U S A. 2018 Jul 24;115(30):E7174-E7183. doi: 10.1073/pnas.1802940115. Epub 2018 Jul 10. PMID: 29991601; PMCID: PMC6064986.

[4] Mehra A, Hong CI, Shi M, Loros JJ, Dunlap JC, Ruoff P. Circadian rhythmicity by autocatalysis. PLoS Comput Biol. 2006 Jul 28;2(7):e96. doi: 10.1371/journal.pcbi.0020096. Epub 2006 Jun 8. PMID: 16863394; PMCID: PMC1523307.

[5] Kageyama H, Nishiwaki T, Nakajima M, Iwasaki H, Oyama T, Kondo T. Cyanobacterial circadian pacemaker: Kai protein complex dynamics in the KaiC phosphorylation cycle in vitro. Mol Cell. 2006 Jul 21;23(2):161-71. doi: 10.1016/j.molcel.2006.05.039. PMID: 16857583.

[6] Takai N, et al. A KaiC-associating SasA-RpaA two-component regulatory system as a major circadian timing mediator in cyanobacteria. Proc Natl Acad Sci USA. 2006;103:12109-12114.

[7] Rust MJ, Golden SS, O'Shea EK. Light-driven changes in energy metabolism directly entrain the cyanobacterial circadian oscillator. Science. 2011;331(6014):220-223.

[8] Vijayan V, Zuzow R, O'Shea EK. Oscillations in supercoiling drive circadian gene expression in cyanobacteria. Proc Natl Acad Sci U S A. 2009 Dec 29;106(52):22564-8. doi: 10.1073/pnas.0912673106. Epub 2009 Dec 14. PMID: 20018699; PMCID: PMC2799730.

[9] Polynikis A, Hogan SJ, di Bernardo M. Comparing different ODE modelling approaches for gene regulatory networks. J Theor Biol. 2009 Dec 21;261(4):511-30. doi: 10.1016/j.jtbi.2009.07.040. Epub 2009 Aug 6. PMID: 19665034.

[10] Masato Nakajima et al. ,Reconstitution of Circadian Oscillation of Cyanobacterial KaiC Phosphorylation in Vitro.Science308,414-415(2005).DOI:10.1126/science.1108451

[11] Ito H, et al. Cyanobacterial daily life with Kai-based circadian and diurnal genome-wide transcriptional control in Synechococcus elongatus. Proc Natl Acad Sci USA. 2009;106:14168-14173.

[12] Chai LE, Loh SK, Low ST, Mohamad MS, Deris S, Zakaria Z. A review on the computational approaches for gene regulatory network construction. Comput Biol Med. 2014 May;48:55-65. doi: 10.1016/j.compbiomed.2014.02.011. Epub 2014 Feb 24. PMID: 24637147.

[13] Zwicker D, Lubensky DK, ten Wolde PR. Robust circadian clocks from coupled protein-modification and transcription-translation cycles. Proc Natl Acad Sci U S A. 2010 Dec 28;107(52):22540-5. doi: 10.1073/pnas.1007613107. Epub 2010 Dec 13. PMID: 21149676; PMCID: PMC3012469

[14] Takano S, Tomita J, Sonoike K, Iwasaki H. The initiation of nocturnal dormancy in Synechococcus as an active process. BMC Biol. 2015 Jun 10;13:36. doi: 10.1186/s12915-015-0144-2. PMID: 26058805; PMCID: PMC4494158.

Model 2:

[15]Orth, J., Thiele, I. & Palsson, B. What is flux balance analysis?. Nat Biotechnol 28, 245-248 (2010). https://doi.org/10.1038/nbt.1614

[16]Triana J, Montagud A, Siurana M, Fuente D, Urchueguía A, Gamermann D, Torres J, Tena J, de Córdoba PF, Urchueguía JF. Generation and Evaluation of a Genome-Scale Metabolic Network Model of Synechococcus elongatus PCC7942. Metabolites. 2014 Aug 20;4(3):680-98. doi: 10.3390/metabo4030680. PMID: 25141288; PMCID: PMC4192687.

[17]Li C, Donizelli M, Rodriguez N, Dharuri H, Endler L, Chelliah V, Li L, He E, Henry A, Stefan MI, Snoep JL, Hucka M, Le Novère N, Laibe C. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Syst Biol. 2010 Jun 29;4:92. doi: 10.1186/1752-0509-4-92. PMID: 20587024; PMCID: PMC2909940.

[18]Cvijovic M, Olivares-Hernández R, Agren R, Dahr N, Vongsangnak W, Nookaew I, Patil KR, Nielsen J. BioMet Toolbox: genome-wide analysis of metabolism. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W144-9. doi: 10.1093/nar/gkq404. Epub 2010 May 18. PMID: 20483918; PMCID: PMC2896146.

[19]Qiao C, Duan Y, Zhang M, Hagemann M, Luo Q, Lu X. Effects of Reduced and Enhanced Glycogen Pools on Salt-Induced Sucrose Production in a Sucrose-Secreting Strain of Synechococcus elongatus PCC 7942. Appl Environ Microbiol. 2018 Jan 2;84(2):e02023-17. doi: 10.1128/AEM.02023-17. PMID: 29101204; PMCID: PMC5752869.

Model 3:

No reference

Model 4:

No reference

Model 5:

[20] Martín-Betancor, Keila et al. “Construction of a self-luminescent cyanobacterial bioreporter that detects a broad range of bioavailable heavy metals in aquatic environments.” Frontiers in microbiology vol. 6 186. 9 Mar. 2015, doi:10.3389/fmicb.2015.00186

[21] Yang, Jun et al. “Construction of luminescent Escherichia coli via expressing lux operons and their application on toxicity test.” Applied microbiology and biotechnology vol. 106,18 (2022): 6317-6333. doi:10.1007/s00253-022-12136-1

[22] Wu, Y.; Jiang, T. Developments in FRET- and BRET-Based Biosensors. Micromachines 2022, 13, 1789. https://doi.org/10.3390/mi13101789

[23] Rodriguez, Erik A et al. “The Growing and Glowing Toolbox of Fluorescent and Photoactive Proteins.” Trends in biochemical sciences vol. 42,2 (2017): 111-129. doi:10.1016/j.tibs.2016.09.010

[24] Yao Xu, David W. Piston, and Carl Hirschie JohnsonAuthors Info & Affiliations January 5, 1999 96 (1) 151-156 https://doi.org/10.1073/pnas.96.1.151

[25] El Khamlichi, Chayma et al. “Bioluminescence Resonance Energy Transfer as a Method to Study Protein-Protein Interactions: Application to G Protein Coupled Receptor Biology.” Molecules (Basel, Switzerland) vol. 24,3 537. 1 Feb. 2019, doi:10.3390/molecules24030537

[26] Kaku, Tomomi et al. “Enhanced brightness of bacterial luciferase by bioluminescence resonance energy transfer.” Scientific reports vol. 11,1 14994. 22 Jul. 2021, doi:10.1038/s41598-021-94551-4

[27] Maria Kowalski-Jahn et al. ,Frizzled BRET sensors based on bioorthogonal labeling of unnatural amino acids reveal WNT-induced dynamics of the cysteine-rich domain.Sci. Adv.7,eabj7917(2021).DOI:10.1126/sciadv.abj7917

Visit our model on igem gitlab for more details.