Engineering Success | SJTU-software

1. Project Implementation

1.1 Data Collection and Preparation

The project uses a large language model to extract protein sequence features for the subsequent prediction work, which requires a large amount of protein sequence data. We used Argonaute protein data provided by Novozymes in a public dataset on Kaggle. To keep sequences that are meaningful for the downstream task, we selected sequences with lengths between 300 and 800 amino acids. After data cleaning, 13,917 valid sequences remained, which were randomly split into a training set and a test set in an 8:2 ratio. Additionally, in collaboration with Professor Hong Liang's research group, we obtained experimentally verified Tm values for 28 KmAgo protein mutant sequences, which were used for validation and for subsequent work on the directed evolution of KmAgo.
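As a rough illustration of the length filter and the 8:2 random split described above, the following minimal sketch uses only the Python standard library. The function names, the inclusive length bounds, the fixed seed, and the toy records are assumptions made for this example and are not taken from the project code.

```python
import random

def filter_by_length(records, min_len=300, max_len=800):
    """Keep (sequence, tm) pairs whose length lies in [min_len, max_len].

    Inclusive bounds are an assumption of this sketch.
    """
    return [(seq, tm) for seq, tm in records if min_len <= len(seq) <= max_len]

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it at the training fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy records: synthetic poly-alanine sequences with made-up Tm values.
lengths = [120, 300, 450, 800, 801, 650, 500, 310, 799, 305, 400, 700]
toy = [("A" * n, 50.0 + i) for i, n in enumerate(lengths)]

kept = filter_by_length(toy)          # drops the 120- and 801-residue entries
train, test = train_test_split(kept)  # 8 training records, 2 test records
```

Fixing the seed makes the split reproducible, which helps when different model versions are compared on the same test set.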

1.2 Model Building and Training

1) Protein Sequence Feature Extraction

The first language model we used was ESM-2, a protein language model that utilizes masked language modeling (MLM) to capture amino acid dependencies in protein sequences. The protein sequence features and amino acid dependencies obtained through ESM-2 can be used for various downstream tasks.
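To make the masked language modeling objective concrete, here is a minimal sketch of the masking step alone: some residues are hidden and the model is trained to recover them from the surrounding context. The 15% fraction, the `<mask>` token string, and the function name are assumptions of this example. The sketch also simplifies the usual BERT-style recipe, which sometimes substitutes a random residue or leaves the residue unchanged instead of always inserting the mask token.

```python
import random

MASK = "<mask>"  # placeholder token name, assumed for this sketch

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens, targets): tokens is the residue list with some entries
    replaced by MASK, and targets maps each masked position to the original
    residue the model would be asked to predict.
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets

# Arbitrary 20-residue toy sequence, not a real Argonaute fragment.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFS")
```

Because the model has to infer each hidden residue from its neighbors, the internal representations it learns encode the dependencies between amino acids that the downstream predictor later uses as features.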

To improve the feature extraction step, we subsequently tried the TemPL language model. TemPL is based on the BERT language model architecture and employs the ESM-2 model framework. During fine-tuning, it was trained on a large number of temperature-related labels, so we expected protein sequence features extracted with TemPL to outperform those from ESM-2. However, the actual results did not match this expectation.

2) Comparison of Results from Two Models

We trained models on the Argonaute protein data using both ESM-2 and TemPL, and the fitting results are shown below:

1.3 Building the Thermal Stability Prediction Platform

Our team developed a thermal stability prediction platform. The platform uses the extracted protein sequence features for temperature prediction, allowing users to submit sequence data and obtain the predicted optimal temperature for each sequence, including the value we are particularly focused on, the Tm of Ago proteins. Development combined front-end web design with back-end model integration. We used Nginx as a reverse proxy and built the web interface with HTML, CSS, and JavaScript. The Flask framework in Python handles user data upload and download.
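One step any such back end needs is turning an uploaded file into individual sequences before they reach the model. The sketch below shows one way this could look, assuming uploads arrive as FASTA text. The upload format, the function names, and the 20-letter validation alphabet are assumptions of this example and do not describe the platform's actual code.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - VALID_RESIDUES)

# Toy upload with two made-up sequences; the second contains an invalid 'X'.
upload = ">seq1 toy\nMKTAY\nIAKQR\n\n>seq2 toy\nmkxa\n"
records = parse_fasta(upload)
problems = {name: invalid_residues(seq) for name, seq in records}
```

Validating residues before prediction lets the server reject malformed input with a clear message instead of failing inside the model.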

You can access our platform by clicking the button below:

2. Model Optimization

2.1 First Optimization: DARWINS 2.0

After consulting the literature, we adjusted the DARWINS 1.0 model. We changed the parameter file used for feature extraction, which produces a larger feature matrix for each protein sequence (the output increased from 1280 dimensions in version 1.0 to 2560 dimensions). At the same time, data from the 28 KmAgo mutant sequences, obtained in collaboration with the Hong Liang research group, were added to the training set. Training then yielded the optimized model, DARWINS 2.0. The modified model aims to extract protein sequence features in a more detailed and accurate manner and to perform better in mutation prediction, including on sequences such as KmAgo. The following are the fitting results of the DARWINS 2.0 model on the test set after training:

The second version fits the test set markedly better than the first version (R increased from 0.49 to 0.85).
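For readers who want to reproduce this kind of comparison, the sketch below computes a Pearson correlation coefficient between predicted and measured values. It assumes that the reported R is the Pearson coefficient, which the text does not state explicitly; the function name and the toy numbers are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up measured and predicted Tm values, for illustration only.
measured = [48.0, 52.5, 55.0, 60.0, 63.5]
predicted = [49.0, 51.0, 56.5, 58.0, 65.0]
r = pearson_r(measured, predicted)
```

R measures how well predictions track the measured ordering and spread, not the absolute error, so it is worth reporting an error metric alongside it when comparing regression models.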

2.2 Second Optimization: Binary Classification Model DARWINS 3.0

Our original training dataset consisted mostly of wild-type Argonaute protein data and contained little mutation data, which may limit the model's usefulness for directed evolution. To address this, we decided to train the model on a separate mutation dataset, aiming for higher reliability and competitiveness in predicting the thermo-stability-directed evolution of KmAgo.

After careful consideration, we chose the mutation dataset used to train PremPS. We first filtered this dataset, retaining 3092 sequences with lengths between 100 and 300 amino acids. The dataset was then split in a 7:3 ratio for training and testing.

Because the thermo-stability of a mutant depends strongly on its wild-type, we modified the DARWINS 2.0 model, transforming it from a regression model into the binary classification model DARWINS 3.0. Given a mutated sequence, this model predicts whether the thermo-stability of the mutant is increased or decreased relative to the wild-type, providing qualitative guidance for directed evolution design.
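The change from regression to classification amounts to replacing a numeric target with a label that records only the direction of the change relative to the wild-type. A minimal sketch of that labeling rule is shown here, assuming a stability measure for which larger means more stable (Tm in this example). The function name, the `margin` parameter, and the numbers are hypothetical; for datasets labeled with ddG, the sign convention has to be checked before applying a rule like this.

```python
def stability_label(tm_mutant, tm_wild_type, margin=0.0):
    """Return 1 if the mutant is more thermostable than the wild-type, else 0.

    margin is a hypothetical dead zone: differences at or below it are
    labeled as "not increased".
    """
    return 1 if (tm_mutant - tm_wild_type) > margin else 0

# Made-up Tm values in degrees Celsius, for illustration only.
wild_type_tm = 60.0
mutant_tms = [62.5, 58.0, 60.0]
labels = [stability_label(tm, wild_type_tm) for tm in mutant_tms]
```

Framing the task this way removes the need to predict absolute temperatures, which vary widely between proteins, and keeps only the question that matters for choosing mutations.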
The following are the performance results of DARWINS 3.0 on the test set:

2.3 Testing and Validation

During platform development, it is crucial to validate the accuracy of the thermo-stability prediction models. We compared our model with three mainstream models: PremPS, FoldX, and DynaMut2. During our replication efforts, we found that these three models only provide web-based prediction services and do not support batch submissions, so testing their performance on a larger dataset was not feasible. In addition, the three models differ in how their outputs reflect thermo-stability. For instance, PremPS outputs ddG, which reflects how much more or less thermo-stable the mutant is compared to the wild-type. In contrast, our models DARWINS 1.0 and DARWINS 2.0 output Tm values, which directly indicate temperature. Given these differences, we chose to perform binary classification predictions on the 28 mutated sequences of KmAgo using these three models and compared the results with our binary classification model, DARWINS 3.0. The following are the prediction results of these three models on the mutated Ago dataset:

We also compared DARWINS 3.0 with the current three mainstream models using the original dataset of 28 sequences. The comparative results are as follows:

The DARWINS 3.0 model is broadly comparable to these three models on several metrics. At the same time, it shows a significant improvement in recall, indicating that our model better captures mutations that increase the thermo-stability of protein sequences and is less likely to miss true positives. In addition, our web platform allows multiple sequences to be uploaded and predicted at once, which is more convenient for researchers than the web-based services of the other three models.
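For reference, the metrics involved in such a comparison can be computed from true and predicted labels as in the sketch below, where label 1 stands for "thermo-stability increased". The function name and the toy labels are made up for this example and are not the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Recall is the fraction of truly stabilizing mutations that the model flags, so a higher recall means fewer beneficial mutations are overlooked when candidates are chosen for wet lab testing.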

2.4 User Feedback

We provided the web-based software to the Tsinghua team and to senior students in our collaborating lab for testing. For details, please refer to the Contribution page.

3. Protein Engineering

Of the three versions, we used the two regression models, DARWINS 1.0 and DARWINS 2.0, to predict the Tm values of KmAgo protein sequences after one round of mutation. Our goal is to identify mutation sites and the corresponding mutated sequences that can increase the Tm value, providing guidance for subsequent thermo-stability-directed evolution. Below is the visualization of Tm values after mutation at some selected positions: (The complete heatmap is not practical to display on the web due to size limitations.)
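A mutation scan of this kind can be enumerated with a small generator, sketched here under the assumption that "one round of mutation" means single amino acid substitutions. The function name, the 1-based labels in the style of N541A, and the three-residue toy sequence are illustrative choices for this example. Each generated sequence would then be passed to a regression model to fill one cell of the heatmap.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_site_mutants(wild_type, positions=None):
    """Yield (label, mutant_sequence) for every single substitution.

    Positions are 1-based; labels read <original><position><replacement>.
    If positions is None, every position in the sequence is scanned.
    """
    if positions is None:
        positions = range(1, len(wild_type) + 1)
    for pos in positions:
        original = wild_type[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != original:
                mutant = wild_type[:pos - 1] + aa + wild_type[pos:]
                yield f"{original}{pos}{aa}", mutant

# Toy wild-type of three residues: 3 positions x 19 substitutions each.
mutants = list(single_site_mutants("MKT"))
```

A full scan grows as 19 times the sequence length, which is why only selected positions are practical to show in a single figure.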

We also visualized the mutation sites of the existing 28 mutated sequences, for which the experimental results are already known.

The heatmaps make it easy to see how the predicted values vary for each mutation type and to identify sequence positions where mutations are more likely to increase thermo-stability. For example, in the heatmaps above, positions such as N541 and Q510 show an increase in thermo-stability in the predictions of both model versions.
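The agreement between the two model versions can also be checked programmatically. The sketch below keeps only the mutations that both models predict to lie above their own wild-type baseline and ranks them by the smaller of the two predicted gains. The function name, the ranking rule, and all numbers are hypothetical and do not correspond to actual DARWINS predictions.

```python
def consensus_gains(pred_a, pred_b, wt_a, wt_b):
    """Return mutation labels predicted to raise Tm by both models.

    pred_a and pred_b map mutation labels to predicted Tm values; wt_a and
    wt_b are each model's prediction for the wild-type. Results are sorted
    by the smaller of the two predicted gains, largest first, with the
    label as a tie-breaker.
    """
    shared = pred_a.keys() & pred_b.keys()
    gains = {m: min(pred_a[m] - wt_a, pred_b[m] - wt_b) for m in shared}
    winners = [m for m, gain in gains.items() if gain > 0]
    return sorted(winners, key=lambda m: (-gains[m], m))

# Made-up predictions for toy mutation labels, for illustration only.
model_1 = {"A1C": 61.0, "A1D": 59.0, "K2R": 62.5}
model_2 = {"A1C": 60.5, "A1D": 61.0, "K2R": 63.0}
candidates = consensus_gains(model_1, model_2, wt_a=60.0, wt_b=60.0)
```

Ranking by the smaller gain is a conservative choice: a mutation scores highly only if both models agree that the improvement is substantial.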

4. Conclusion

In this project, we trained a high-precision Ago protein thermal stability prediction model and developed an interactive online prediction platform. Users can input sequence data and receive results. Additionally, we used this model to assist in the directed evolution of KmAgo protein, resulting in more stable protein sequences. The test results indicate the effectiveness of our model. We have provided detailed user instructions and plan to offer ongoing model optimization and maintenance to assist researchers in various fields in wet lab experiments and other research activities related to Ago proteins.

To ensure the success of our project, we strictly followed the standard engineering cycle. We invested considerable effort to make sure that the project succeeds and generates value for scientific research.