The Data and Model Used in Our Project
Natural Language Processing (NLP) is now widely applied in protein engineering, and large protein language models have been shown to predict the biological properties of proteins with high precision. In this study, we collected a dataset of more than 40,000 protein sequences and their melting temperatures (Tm) from Kaggle. Based on this dataset, we designed DARWINS (Directed AGO Renewal With Ideal proteiN thermal-Stability), a melting-temperature prediction model focusing on AGO protein sequences. The name pays homage to Darwin, who, as the founder of evolutionary theory, proposed evolution by natural selection in On the Origin of Species and overturned earlier doctrines such as the immutability of species; fittingly, our project centers on the directed evolution of AGO.
The core of DARWINS is a pre-trained model with a BERT-style Transformer-encoder architecture that captures the semantic information and latent features of AGO sequences. These features are then fed into a Multilayer Perceptron (MLP) for Tm prediction.
Compared with traditional biological experiments, DARWINS can assess the thermal stability of protein sequences far more rapidly, and it is sensitive to protein mutations. We expect DARWINS to assist directed-evolution efforts to obtain AGO mutants with higher thermal stability, and eventually to be applied in scenarios such as nucleic acid detection platforms and rapid viral or cancer diagnosis.
NLP is a computational technique for automatically analyzing and representing human language; it is a field at the intersection of computer science, artificial intelligence, and linguistics concerned with the interaction between computers and human (natural) language. NLP handles characters, phrases, sentences, and even complex text or speech data[1], extracting the concepts or meanings of the text to form a quantitative output[2]. A complete natural language processing project typically includes the following phases:
Data preprocessing: This phase mainly involves the acquisition and cleaning of raw data such as text, corpora, and speech. The data are then converted into a suitable input format through steps such as tokenization and stemming.
Word embedding: In this stage, words are mapped into a vector space so that computers can better handle natural language. Commonly used word-vector models include one-hot encoding and Word2Vec.
Model training: In this phase, deep learning is often employed to train NLP models. As a major branch of machine learning, deep learning uses multi-layered neural networks to learn the internal relationships within sample data, so that machines can eventually analyze and understand text, images, and sound much as humans do. Deep learning-based architectures such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and the Transformer have proven to be vital contributors to the development of NLP.
The most basic idea of deep learning is to optimize the loss function by nonlinear optimization. The loss function characterizes the difference between the model's predicted values and the true values. Since the predicted values depend directly on the model parameters, minimizing the loss function yields a set of parameters for the optimal model. A deep learning model computes and updates the loss function in two steps, forward propagation and back propagation, to tune the model parameters optimally.
① Forward propagation
Forward propagation is the process of taking the output of one layer of a neural network as the input of the next layer and computing that next layer's output.
Let \(w_{kj}^l\) be the weight from the kth neuron in layer l - 1 to the jth neuron in layer l, \(b_j^l\) the bias of the jth neuron in layer l, \(a_j^l\) the output of the jth neuron in layer l, and σ the activation function. Writing \(z_j^l\) for the weighted input of the jth neuron in layer l, the output of that neuron can be expressed as
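\[ z_j^l = \sum_k w_{kj}^l\, a_k^{l-1} + b_j^l, \qquad a_j^l = \sigma\!\left(z_j^l\right) \]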
We can write this more concisely in matrix form:
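\[ z^l = W^l a^{l-1} + b^l, \qquad a^l = \sigma\!\left(z^l\right) \]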
② Back propagation
Back propagation is the basis of neural network training. Its core principle is the chain rule from calculus: the gradient of the loss function with respect to each network parameter is computed and used to update the parameter values, so that the network's predictions move closer to the true values.
Let F be the loss function of the network, n the number of samples, L the number of layers, and y(x) the true value. We take the quadratic cost function as an example to illustrate the back propagation process, defined as follows:
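\[ F = \frac{1}{2n} \sum_x \left\lVert y(x) - a^L(x) \right\rVert^2 \]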
F is clearly a function of \(a^L(x)\). We define \(\delta_j^l\) to describe the error of the jth neuron in layer l. For the error of the output layer L, the chain rule gives
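\[ \delta_j^L \equiv \frac{\partial F}{\partial z_j^L} = \frac{\partial F}{\partial a_j^L}\, \sigma'\!\left(z_j^L\right) \]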
\(\partial F/\partial a_j^L\) follows directly from differentiating F. The error at each remaining layer can then be obtained from the error transfer equation, a recursive relation that links the error at layer l to the error at layer l + 1:
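\[ \delta_j^l = \left( \sum_k w_{jk}^{l+1}\, \delta_k^{l+1} \right) \sigma'\!\left(z_j^l\right) \]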
Finally, based on the per-layer errors, we can calculate the rate of change of the cost function F with respect to the weight and bias parameters of each layer and update the network parameters by gradient descent:
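\[ \frac{\partial F}{\partial w_{kj}^l} = a_k^{l-1}\, \delta_j^l, \qquad \frac{\partial F}{\partial b_j^l} = \delta_j^l, \qquad w_{kj}^l \leftarrow w_{kj}^l - \eta\, \frac{\partial F}{\partial w_{kj}^l}, \qquad b_j^l \leftarrow b_j^l - \eta\, \frac{\partial F}{\partial b_j^l} \]
where \(\eta\) is the learning rate.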
Model evaluation and application: This phase focuses on evaluating model performance on a validation dataset and applying the model to real-world problems such as sentiment analysis and machine translation. Perplexity is a key metric for evaluating how good a language model is. Consider a sentence S consisting of k words:
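\[ S = w_1 w_2 \cdots w_k \]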
We can define P(S), the probability of this sentence, by the chain rule:
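\[ P(S) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_k \mid w_1, w_2, \ldots, w_{k-1}) \]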
The perplexity \(PP(S)\) evaluates how likely the model is to produce a normal sentence:
\[ PP(S) = P(S)^{-\frac{1}{k}} = \sqrt[k]{\frac{1}{P(w_1 w_2 \cdots w_k)}} \]
To summarize, the higher the occurrence probability of a normal sentence, the lower the perplexity, and the better the model performance.
Large protein language modeling is a specific application of NLP. It mainly serves protein engineering and protein generation technology, and can accelerate the study of the structure and function of mutant sequences. A large protein language model extracts information from massive protein sequence databases and captures the semantics and properties of a given sequence, fully exploring the connection between protein sequences and protein functions.
The Transformer, a landmark model in NLP, has been shown to have great potential for protein classification and generation tasks, while BERT, an important development built on the Transformer, uses Masked Language Modeling (MLM) so that the model fully learns the semantic information and associations among the amino acids of a protein sequence during training. Many large protein language models rely on BERT-style or Transformer-style training, such as ESM-2[3] and ProGen[4].
BERT (Bidirectional Encoder Representations from Transformers) is a pre-training model proposed by Google AI in 2018, which showed impressive performance on the machine reading comprehension benchmark SQuAD 1.1[5]. The architecture of BERT is a multi-layer bidirectional Transformer encoder. The bidirectional architecture enables BERT to simultaneously use the words before and after a particular word in a text, and this training method produces a deeper understanding of semantic information than a traditional unidirectional language model.
BERT is pre-trained using two unsupervised tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). These allow BERT to learn token-level information and understand the connections between different sentences.
Our DARWINS construction process mainly uses MLM.
For convenience, the MLM task can be likened to a cloze test. The introduction of MLM gives BERT the ability to understand context. Specifically, 15% of the tokens in a given sentence are randomly selected and replaced according to the following rules:
① There is an 80% probability that the token will be replaced with [MASK].
② There is a 10% probability that the token will be replaced with any token in the dictionary.
③ There is a 10% probability that the token will remain unchanged.
Subsequently, the model is asked to predict the identity of each replaced token from the remaining tokens. It is worth noting that not all of the selected tokens are replaced with [MASK]. This is because [MASK] tokens never appear in the inputs of downstream BERT fine-tuning tasks, and we cannot guarantee that the token at the corresponding position of the input text is correct. The replacement scheme above therefore makes the model rely more on contextual information and gives it some error-correction capability. In BERT's MLM task, let M be the set of replaced tokens, V the dictionary size (prediction is essentially a V-way classification over the dictionary at each replaced position), and θ the set of model parameters. A loss function can be built from the accuracy of the predictions at the replaced positions, denoted as
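\[ \mathcal{L}(\theta) = - \sum_{i \in M} \log p\!\left(x_i \mid x_{\backslash M};\, \theta\right) \]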
The formula indicates that the model is more accurate when, at each replaced position, it assigns a higher probability to the original token.
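As an illustration, the following is a minimal sketch of this 80%/10%/10% replacement rule applied to a list of integer token IDs; the MASK_ID and VOCAB values here are hypothetical placeholders rather than the actual BERT vocabulary.

```python
import random

MASK_ID = 0                  # hypothetical ID of the [MASK] token
VOCAB = list(range(1, 26))   # hypothetical dictionary of token IDs


def mask_tokens(tokens, select_prob=0.15):
    """Apply the MLM replacement rule to a list of token IDs.

    Each token is selected with probability 15%; a selected token is
    replaced by [MASK] 80% of the time, by a random token 10% of the
    time, and left unchanged 10% of the time.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)                         # token the model must recover
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                   # 10%: keep unchanged
        else:
            targets.append(None)                        # not selected; no prediction needed
            corrupted.append(tok)
    return corrupted, targets
```

The positions with a non-empty target are exactly the set M that appears in the loss above.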
ESM-2 is a large protein language model based on the BERT architecture and trained by Masked Language Modeling (MLM): the amino acids at positions that have not been masked out are used to predict the identities of amino acids that have been randomly masked out, which allows ESM-2 to better capture the contextual information of protein sequences and thereby learn structural information. Specifically, for a given amino acid sequence x, each amino acid has a 15% probability of being selected for modification. A selected amino acid is replaced with the mask token with 80% probability, replaced with a random amino acid with 10% probability, and left unchanged with 10% probability[4]. The masked protein sequence is given to the model, which must predict the original amino acids; this allows the model to fully learn the dependencies between amino acids. Letting M denote the set of masked positions, we define the following negative log-likelihood as the model's optimization objective
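\[ \mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{x}\, \mathbb{E}_{M} \left[ - \sum_{i \in M} \log p\!\left(x_i \mid x_{\backslash M}\right) \right] \]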
where \(p(x_i \mid x_{\backslash M})\) is the probability, given the set of unmasked amino acids \(x_{\backslash M}\), that masked position i is predicted to be the amino acid at that position in the original sequence. Minimizing this function drives the model's predictions at the masked positions toward the original values.
We perform Tm prediction with an MLP. Specifically, the MLP accepts the tensor of latent protein features produced by ESM-2 and outputs the Tm of the sequence
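\[ \hat{T}_m = \mathrm{MLP}(h) \]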
where h is the latent representation of the protein sequence.
We use the mean squared error as the loss function to measure the difference between the predicted and true values, and the optimizer is Adam, with the batch size set to 100 and the learning rate set to \(10^{-4}\). We also introduce early stopping to prevent overfitting. The loss function of the model can be expressed by the following equation
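\[ F = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{T}_{m,i} - T_{m,i} \right)^2 \]
where \(\hat{T}_{m,i}\) is the predicted melting temperature of the ith sequence and \(T_{m,i}\) is its measured value.

To make the training setup concrete, the following PyTorch sketch shows an MLP regression head trained with mean-squared-error loss, Adam, a batch size of 100, and a learning rate of \(10^{-4}\), with early stopping. The hidden width, the number of epochs, and the patience value are hypothetical placeholders, the `features` tensor is assumed to hold pre-computed ESM-2 sequence representations (one vector per sequence), and a real run would monitor a held-out validation loss for early stopping rather than the training loss used here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TmHead(nn.Module):
    """MLP regression head mapping an ESM-2 sequence representation to Tm."""

    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)


def train_tm_head(features: torch.Tensor, tm: torch.Tensor,
                  max_epochs: int = 200, patience: int = 5) -> TmHead:
    """Train the head on pre-computed ESM-2 features of shape (n_seqs, dim)."""
    loader = DataLoader(TensorDataset(features, tm),
                        batch_size=100, shuffle=True)          # batch size 100
    model = TmHead(features.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, lr = 1e-4
    loss_fn = nn.MSELoss()                                     # mean squared error

    best, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for h, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(h), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Early stopping: halt once the loss stops improving for `patience` epochs.
        if epoch_loss < best:
            best, bad_epochs = epoch_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model
```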
ESM-2 was originally designed for protein structure prediction, which differs from our training goal. We therefore tried TemPL, a fine-tuned model based on the ESM-2 architecture. TemPL was trained on a comprehensive dataset of 96 million strain optimal growth temperatures (OGTs); since a strain's OGT is closely related to the optimal activity temperature and stability of its proteins, this pre-training allows the model to learn the association between a protein sequence and its optimal temperature. We expect DARWINS to show even better Tm prediction after this fine-tuning.
After initial training, DARWINS did not show very good predictive performance, so we used two approaches to improve the model. The first was to improve the model architecture: we doubled the dimensionality of the model's output tensor by updating the network structure of the pre-trained model, aiming to obtain a more accurate representation of protein-sequence features and thereby improve the accuracy of Tm prediction. The second was the introduction of mutant protein sequences, which we expect to improve the model's sensitivity and accuracy on mutant sequences.
For more details, please refer to Engineering.
[1] Nadkarni P M, Ohno-Machado L, Chapman W W. Natural language processing: an introduction. Journal of the American Medical Informatics Association, 2011, 18(5): 544-551.
[2] Locke S, Bashall A, Al-Adely S, et al. Natural language processing in medicine: a review. Trends in Anaesthesia and Critical Care, 2021, 38: 4-9.
[3] Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023, 379: 1123-1130.
[4] Madani A, Krause B, Greene E R, et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 2023, 41: 1099-1106.
[5] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.