MATHEMATICAL MODELING
1
Introduction

For the mathematical modeling module, we addressed the following two main parts of the problem:

1. Constructing an ordinary differential equation model to simulate the change in the content of the nontoxic product ochratoxin α during the degradation of the toxin ochratoxin A in wine;

2. Predicting the secondary structure of the T3 and C3 proteins with a random forest-based machine learning model, which complements the subsequent prediction of tertiary structure.

2
Changes in substance content based on ordinary differential equations

Not every process in a genetic circuit can be quantitatively observed and monitored, so the responses of biological systems cannot always be predicted accurately from experiments alone. It is therefore important to verify the feasibility of certain experiments and to guide them through mathematical modeling before wet lab work begins.

We employed Python programming to build the Biological Expression model based on Ordinary Differential Equations (ODEs). This approach not only ensured the stability of the experiment but also significantly reduced both the cost and time required for the research.

Symbolic descriptions and assumptions

Symbolic description:

Model assumptions:

1. The copy number of plasmids is kept constant.

2. The degradation rates of mRNA and protein are constant.

3. The molecular species are uniformly distributed inside the cell.

4. Some of the chemical reactions are in equilibrium or steady state.

ODE

In our project, given the limited thermal stability and acid-base tolerance of carboxypeptidase A (referred to as M-CPA), we needed to immobilize the mature enzyme following the truncation of its propeptide and signal peptide segments. This was achieved by binding it to a mesh-like structure formed by SpyCatcher and SpyTag (referred to as S-S). The goal was to create immobilized enzyme complexes for the degradation of OTA (ochratoxin A) to form live functional materials. This enzymatic process converts OTA in wine into phenylalanine and the non-toxic metabolite, ochratoxin α (referred to as OTα), which is considered safe for human consumption.

Within this process, there are four pathways of chemical reactions, for which we establish Ordinary Differential Equations (ODEs) based on compiled chemical reaction equations and reaction rates.

(1) Transcription and translation of protein

In practice, we directly synthesize the mature enzyme that forms a mesh-like complex with SpyCatcher and SpyTag (referred to as M-CPA_S-S), so the transcription and translation of this protein are involved, namely:

\(DNA_{M-CPA_{-}S-S} \stackrel{Tc_{M-CPA_{-}S-S}}{\longrightarrow} mRNA_{M-CPA_{-}S-S}\)

\(mRNA_{M-CPA_{-}S-S} \stackrel{\text{Ts}_{M-CPA_{-}S-S}}{\longrightarrow} M-CPA_{-}S-S\)

(2) OTA degradation

The function of M-CPA is to degrade OTA in wine into phenylalanine (Phe) and the non-toxic compound OTα. M-CPA is already immobilized within the M-CPA_S-S complex, which directly participates in the degradation reaction. The reaction proceeds without any intermediate products, namely:

\(OTA \xrightarrow[M-CPA_{-}S-S]{k_{1}} OT\alpha + Phe\)

(3) Degradation of \(mRNA_{M-CPA_{-}S-S}\) and M-CPA_S-S

During the degradation of OTA, M-CPA_S-S and \(mRNA_{M-CPA_{-}S-S}\) are themselves degraded to varying degrees, namely:

\({mRNA_{M-CPA_{-}S-S} \xrightarrow{Dm_{M-CPA_{-}S-S}} \emptyset}\)

\(M-CPA_{-}S-S \xrightarrow{Ds_{M-CPA_{-}S-S}} \emptyset\)

During transcription and translation, the actual production environment where the system operates is subject to slight disturbances caused by environmental factors. Therefore, we employed ordinary differential equation models based on the Michaelis-Menten equation and the law of mass action. The aim is to explore trends in the variations of different substances during the reaction process and investigate the extent to which these minor disturbances impact the system's stability.

The Michaelis-Menten equation is a rate equation relating the initial rate of an enzyme-catalyzed reaction to substrate concentration. The law of mass action states that the reaction rate is directly proportional to the active mass of the reactants, where the active mass essentially refers to concentration. Therefore, the transcription of \(mRNA_{M-CPA_{-}S-S}\) can be described by a kinetic model that combines a Michaelis-Menten production term with the translation rate \(Ts_{M-CPA_{-}S-S}\) and the degradation rate \(Dm_{M-CPA_{-}S-S}\), expressed by the following equation:

\(\frac{d [mRNA_{\text{M-CPA}_{-}\text{S-S}}]}{dt} = \frac{V_m [DNA_{\text{M-CPA}_{-}\text{S-S}}]}{K_m + [DNA_{\text{M-CPA}_{-}\text{S-S}}]} - Ts_{\text{M-CPA}_{-}\text{S-S}} - Dm_{\text{M-CPA}_{-}\text{S-S}}\)

Assuming that, during the translation of \(mRNA_{M-CPA_{-}S-S}\), the half-life of the translated protein is independent of the enzyme's concentration and of its own concentration, the translation process can be described by the following equation:

\(\frac{d[{M-CPA_{-}\text{S-S} }]}{dt} = Ts_{M-CPA_{-}\text{S-S}} - Ds_{M-CPA_{-}\text{S-S}}\)
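As a rough illustration of how the two equations above could be integrated numerically, the following Python sketch uses the forward Euler method. It assumes first-order forms for the rate terms (Ts = k_ts·[mRNA], Dm = k_dm·[mRNA], Ds = k_ds·[M-CPA_S-S]), which the text does not specify, and every parameter value is a placeholder rather than a measured or fitted value.

```python
def simulate_expression(dna=1.0, v_m=1.0, k_m=0.5, k_ts=0.1, k_dm=0.1,
                        k_ds=0.02, t_end=1000.0, dt=0.01):
    """Forward-Euler integration of the mRNA and protein equations.

    Assumptions (not taken from the source): the rate terms are first order,
    Ts = k_ts*[mRNA], Dm = k_dm*[mRNA], Ds = k_ds*[protein], and all
    parameter values are arbitrary placeholders in arbitrary units.
    """
    # Michaelis-Menten transcription term; constant because [DNA] is constant
    production = v_m * dna / (k_m + dna)
    mrna, protein = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        ts = k_ts * mrna      # translation rate
        dm = k_dm * mrna      # mRNA degradation rate
        ds = k_ds * protein   # protein degradation rate
        d_mrna = production - ts - dm   # d[mRNA]/dt, as written in the text
        d_protein = ts - ds             # d[M-CPA_S-S]/dt
        mrna += dt * d_mrna
        protein += dt * d_protein
    return mrna, protein


mrna_final, protein_final = simulate_expression()
print(f"mRNA = {mrna_final:.4f}, M-CPA_S-S = {protein_final:.4f}")
```

Under these assumptions both species relax to a steady state in which production balances loss, which is consistent with the constant-degradation-rate assumption listed earlier.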

In the experimental process, OTA is catalytically decomposed into OTα and Phe under the action of M-CPA_S-S. Since there are no intermediate products during the reaction, and there is no significant difference in M-CPA_S-S concentration before and after the reaction, we assume that the concentration of M-CPA_S-S remains constant throughout. The change in OTα concentration in the solution can then be simulated using a logistic model, analogous to the logistic growth of a single population: the relative growth rate of OTα decreases as its concentration approaches the maximum. Additionally, because the volume of the solution containing OTα remains constant during the experiment, changes in the amount of OTα can be represented by changes in its concentration. Therefore, the rate of change in OTα concentration can be described by the following equation:

\(V_{OT\alpha} = \frac{dC}{dt} = k_1C \left(1 - \frac{C}{C_m}\right)\)

where \(k_1\) is the reaction rate constant, \(C\) is the concentration of OTα, and \(C_m\) is the maximum concentration of OTα during the reaction.
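The logistic equation above has the closed-form solution \(C(t) = \frac{C_m}{1 + (C_m/C_0 - 1)e^{-k_1 t}}\), where \(C_0\) is the initial OTα concentration. The sketch below evaluates this solution and cross-checks it against a numerical RK4 integration. The values of `K1` and `C_MAX` are hypothetical placeholders, since the fitted values are not given in the text; `C0` is the 0.001 μg/mL initial OTα concentration quoted for the perturbation study.

```python
import math


def logistic_closed_form(t, c0, k1, c_max):
    """Analytic solution of dC/dt = k1*C*(1 - C/c_max) with C(0) = c0."""
    return c_max / (1.0 + (c_max / c0 - 1.0) * math.exp(-k1 * t))


def logistic_rk4(t_end, c0, k1, c_max, steps=10_000):
    """Integrate the same equation numerically with the classical RK4 scheme."""
    def rate(c):
        return k1 * c * (1.0 - c / c_max)

    h = t_end / steps
    c = c0
    for _ in range(steps):
        s1 = rate(c)
        s2 = rate(c + 0.5 * h * s1)
        s3 = rate(c + 0.5 * h * s2)
        s4 = rate(c + h * s3)
        c += h * (s1 + 2.0 * s2 + 2.0 * s3 + s4) / 6.0
    return c


C0 = 0.001      # ug/mL, initial OTα concentration quoted in the text
K1 = 2e-4       # 1/s, hypothetical rate constant (not from the source)
C_MAX = 1.0     # ug/mL, hypothetical maximum OTα concentration
T_END = 8.64e4  # s, i.e. 24 h

for t in (0.0, 1e4, 4e4, T_END):
    print(f"t = {t:8.0f} s  C = {logistic_closed_form(t, C0, K1, C_MAX):.6f} ug/mL")
```

The numerical and analytic results agree closely, so either can be used to produce a curve of the kind shown in Fig. 1 once real parameter values are substituted.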

When the initial concentration of OTA is 1 μg/mL, the OTα concentration over time, as obtained through computational simulation, is illustrated in Fig. 1:

Fig. 1 Concentration trends of OTα

In the actual experimental process, environmental factors such as light exposure and temperature can have an impact. To demonstrate that our reaction can proceed stably under typical environmental conditions, we introduced perturbations to the initial concentration of OTα (baseline 0.001 μg/mL) in the simulation. The perturbations ranged from -15% to +15% at seven different levels. Through our model simulation, we obtained the trends in OTα variations shown in Fig. 2:

Fig. 2 Trends of OTα with different levels of disturbance

Finally, we took the concentration of OTα near equilibrium at \(t = 8.64 \times 10^{4}\) s under different perturbation levels as the vertical axis, with perturbation level as the horizontal axis, to create a graph for analyzing the model's stability, as shown in Fig. 3:

Fig. 3 OTα concentration when the reaction reaches equilibrium at different levels of disturbance

Upon analysis, it can be observed that even under the most severe perturbation, the error does not exceed 0.011%. In other words, under various disturbances, the concentration of OTα converges consistently. This demonstrates that our reaction exhibits good stability and can operate effectively under different environmental conditions.
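The perturbation analysis described above can be sketched as follows. The seven levels are assumed to be evenly spaced in steps of 5% (the text gives only the range and the count), and `k1` and `c_max` are hypothetical placeholders, so the printed errors illustrate the procedure and are not expected to reproduce the 0.011% figure.

```python
import math

# Seven perturbation levels in percent; even 5% spacing is an assumption
PERTURBATION_LEVELS = [-15, -10, -5, 0, 5, 10, 15]


def logistic_closed_form(t, c0, k1, c_max):
    """Analytic solution of dC/dt = k1*C*(1 - C/c_max) with C(0) = c0."""
    return c_max / (1.0 + (c_max / c0 - 1.0) * math.exp(-k1 * t))


def relative_errors(c0=0.001, k1=2e-4, c_max=1.0, t_end=8.64e4):
    """Percent deviation of C(t_end) from the unperturbed run, per level.

    k1 and c_max are hypothetical placeholders, not values from the source.
    """
    baseline = logistic_closed_form(t_end, c0, k1, c_max)
    errors = {}
    for level in PERTURBATION_LEVELS:
        perturbed_c0 = c0 * (1.0 + level / 100.0)
        final = logistic_closed_form(t_end, perturbed_c0, k1, c_max)
        errors[level] = abs(final - baseline) / baseline * 100.0
    return errors


for level, err in relative_errors().items():
    print(f"perturbation {level:+3d}%  ->  relative error {err:.6f}%")
```

Because the logistic curve saturates at \(C_m\) regardless of the starting point, the deviation at the final time shrinks as the reaction approaches equilibrium, which is the behavior summarized in Fig. 3.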

3
Protein Secondary Structure Prediction Based on Machine Learning

Given that the structures of both the T3 protein and the C3 protein are unknown, we begin by predicting their secondary structures from sequence using machine learning before establishing their tertiary structures. The predictions can complement the subsequent tertiary structure predictions.

For protein secondary structure prediction, we build on the fundamental theory that “the protein's secondary structure is entirely determined by its primary structure” and apply machine learning methods to the prediction process.

Machine learning is a specialized field that studies how computers can simulate or replicate human learning behaviors to acquire new knowledge or skills, reorganize existing knowledge structures to continually improve their performance, and enhance specific algorithmic performance in experiential learning. It has now extended its applications to the field of bioinformatics, with a wide range of practical uses.

Feature Extraction

DSSP assigns secondary structure according to the repeating pattern of hydrogen bonds in the three-dimensional structural coordinates of a protein, and divides it into eight states: H (α-helix), G (3-helix or \(3_{10}\) helix), I (5-helix or π-helix), E (extended strand participating in a β-ladder), B (residue in an isolated β-bridge), T (hydrogen-bonded turn), S (bend), and - (other). The 8-state scheme is a refinement of the 3-state scheme: H (α-helix), E (β-strand), and C (coil). These three states are the components of the secondary structure and are also our classification targets. We recorded them as 1, 2, and 3, respectively.
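A minimal sketch of reducing the eight DSSP states to the three target classes is given below. The text does not state which reduction was used; the mapping here (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention and is an assumption. The numeric labels follow the 1, 2, 3 ordering stated above.

```python
# Assumed 8-state to 3-state reduction; the source does not specify its mapping
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "-": "C",   # turns, bends and everything else
}

# Numeric class labels as stated in the text: H = 1, E = 2, C = 3
CLASS_LABEL = {"H": 1, "E": 2, "C": 3}


def reduce_dssp(states):
    """Convert a string of DSSP 8-state codes into a list of 1/2/3 labels."""
    return [CLASS_LABEL[DSSP_TO_3STATE[s]] for s in states]


print(reduce_dssp("HHHGG-EEBTS"))
```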

The classification results are represented as shown in Table 1:

Table 1. Classification

Because our available data consist of amino acid sequences, which represent the protein's primary structure, we use a 20-bit encoding in which each of the 20 amino acids sets exactly one bit, as shown in Table 2:

Table 2. Encoding

The secondary structure of the amino acid at a specific position is affected by the properties of its neighboring amino acids. Therefore, we constructed the data set with a sliding window. To improve the prediction accuracy as much as possible, we took a sequence of five adjacent amino acids as one data point. Taking the sequence AVLIF as an example, to predict the central residue L, two residues are taken from each side, giving five residues in total, encoded as follows:

10000000000000000000

00000000000000000100

00000000010000000000

00000001000000000000

00001000000000000000

To predict the first amino acid A, two empty (all-zero) positions are added before A so that the window still contains five positions, which amounts to padding with 40 zero bits. The encoding result is:

00000000000000000000

00000000000000000000

10000000000000000000

00000000000000000100

00000000010000000000

We used these encoded windows as the features for training the machine learning models.
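The windowed encoding above can be sketched in a few lines of Python. The bit order is an assumption: alphabetical order of the one-letter amino acid codes, which is inferred from the AVLIF example and reproduces both encodings shown, but Table 2 may define it differently. The function names are illustrative.

```python
# Assumed bit order: alphabetical one-letter codes (inferred from the AVLIF example)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return the 20-bit string for one residue; None gives the all-zero padding row."""
    bits = ["0"] * len(AMINO_ACIDS)
    if residue is not None:
        bits[AMINO_ACIDS.index(residue)] = "1"
    return "".join(bits)


def window_features(sequence, center, half_width=2):
    """Encode the window of 2*half_width + 1 residues centered on `center`.

    Positions that fall outside the sequence are padded with all-zero rows.
    """
    rows = []
    for pos in range(center - half_width, center + half_width + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else None
        rows.append(one_hot(residue))
    return rows


# Window centered on L (index 2), then on the first residue A (index 0)
for row in window_features("AVLIF", 2):
    print(row)
print()
for row in window_features("AVLIF", 0):
    print(row)
```

Concatenating the five rows gives a 100-bit feature vector per residue, and sliding `center` along the sequence yields one data point for every position.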

Algorithm Construction

Since machine learning methods encompass a wide range, in order to better uncover the relationship between primary and secondary structures, we have decided to employ a variety of machine learning techniques and compare the models trained on each.

After careful selection, we constructed models using random forests, support vector machines, and neural networks, and compared their performance in recognizing protein secondary structures. This allowed us to choose the best predictive model for estimating the secondary structures of the unknown T3 and C3 proteins.

Support Vector Machine (SVM):

The fundamental idea can be summarized as follows: first, a nonlinear transformation maps the input into a higher-dimensional space; then, in this new space, the optimal linear classification surface is determined. The principle is shown in Fig. 4.

Fig. 4 Schematic diagram of support vector machine principle

Random Forest:

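A random forest can be seen as a combination of multiple decision trees, where each tree is trained on a randomly sampled subset of the data and uses a randomly selected subset of the features as inputs. Aggregating the results of the individual trees gives the final prediction, and the samples left out of each tree's training set provide an estimate of the error on the out-of-bag data. The principle is shown in Fig. 5.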

Fig. 5 Schematic diagram of random forest principle

For the solution, we utilize the 'randomForest()' function from the 'randomForest' package in the R Programming Language. Simultaneously, we select an importance measure matrix for the input variables and input the extracted features for model training.
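The models themselves were trained with R's `randomForest`. Purely to illustrate the three ideas described above (sampling the data with replacement, selecting a random feature subset, and aggregating tree votes), here is a small Python sketch. It does not train any trees, the helper names are hypothetical, and the choice of 10 candidate features out of 100 simply uses the square root of the feature count as an example.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Draw n_samples row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, n_selected, rng):
    """Pick the candidate features considered at one split, without replacement."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
in_bag, out_of_bag = bootstrap_split(10, rng)
# 100 features = 5 window positions x 20 bits; 10 candidates is an example choice
features = random_feature_subset(100, 10, rng)

print("in-bag rows:    ", in_bag)
print("out-of-bag rows:", out_of_bag)
print("split features: ", features)
print("forest vote:    ", majority_vote([1, 3, 3, 2, 3]))
```

Each out-of-bag row can be scored using only the trees that did not see it during training, which is how the out-of-bag error estimate is obtained without a separate validation set.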

Neural Networks:

A neural network is composed of individual units interconnected with each other, where each unit takes numerical inputs and produces outputs. The network must be trained using a specific learning criterion before it can function. When the network makes errors, the learning process reduces the likelihood of making the same errors again. The principle is shown in Fig. 6.

Fig. 6 Schematic diagram of neural network principle

For the solution, we opt for the Back Propagation (B-P) neural network model and begin by normalizing the extracted feature data. Next, we employ the 'neuralnet()' function from the R Programming Language's 'neuralnet' package. We choose to train the model with one hidden layer containing 40 neurons, using the Sigmoid activation function, and set a termination condition after 100,000 iterations.

Screening and prediction

The accuracy, sensitivity, and specificity of the three trained machine learning models are shown in Table 3:

(The specific training model's prediction results can be found in the attached document labeled 'result_train'.)

Table 3. Model Training Metrics

It can be observed that the three models have fairly similar training performance, with Random Forest showing slightly higher accuracy and significantly faster execution. Therefore, we have applied it to the prediction of the secondary structures of T3 protein and C3 protein.
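For reference, accuracy, sensitivity, and specificity for a three-class problem can be computed as in the sketch below. It treats each class in a one-vs-rest fashion, which is an assumption, since the text does not say how the per-class indicators in Table 3 were derived. The example labels are made up for illustration and are not results from the trained models.

```python
def one_vs_rest_metrics(y_true, y_pred, labels=(1, 2, 3)):
    """Return overall accuracy and per-class (sensitivity, specificity)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = sum(t != label and p != label for t, p in pairs)
        # Guard against classes that never occur in y_true (or always occur)
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        per_class[label] = (sensitivity, specificity)
    return accuracy, per_class


# Made-up labels for illustration only (1 = H, 2 = E, 3 = C)
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
accuracy, per_class = one_vs_rest_metrics(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}")
for label, (sens, spec) in per_class.items():
    print(f"class {label}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```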

The final prediction results can be found in the attached documents labeled "T3predict" and "C3predict".

5
References

Zhou, Zhihua. Machine Learning. Tsinghua University Press, 2016, pp. 121-139, 401-492.

Li, Hang. Statistical Learning. Tsinghua University Press, 2012, pp. 95-135.

Kabsch, Wolfgang, and Christian Sander. "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features." Biopolymers: Original Research on Biomolecules 22.12 (1983): 2577-2637.

Linlin Wu, and Shuo Xu. "Protein secondary structure prediction based on multi-SVM ensemble." Chinese Journal of BioInformatics, 3 (2010): 187-190.

Click to download Code and Result