Engineering Success


Background

Inspired by the humanization of mouse-derived antibodies, our project starts from antibody sequences obtained in other species against an antigen of the target species (such as mouse-derived antibodies), automatically generates antibody sequences that may offer better efficacy and lower immunogenicity, and scores them comprehensively. That is, given a specific antibody sequence from one species and a target species for which a corresponding antibody is desired, we aim to generate a database of potential antibodies for that target species. The generated antibodies should preserve the CDR regions of the original antibody as much as possible, because these regions perform the specific recognition and binding of the antigen that gives an antibody drug its function.

The most critical point for generating data is to ensure that it conforms as closely as possible to the characteristics of antibody data for the target species. In other words, we aim to minimize the immunogenicity of the generated antibody data for the target species.

Currently, mainstream antibody humanization tools primarily address this issue, using machine learning or deep learning to measure the immunogenicity of a sequence with respect to humans, i.e., to measure the homology of the generated FR segments relative to human antibody data. These algorithms have made progress in the humanization of mouse antibodies, and some humanized antibody drugs have been designed with them. We, however, hope to design antibodies suitable for more species.

overview
Figure 1. Project overview

To ensure the effectiveness and reliability of our model in antibody design, we rigorously followed engineering iteration requirements. We conducted five iterations following the Design-Build-Test-Learn cycle and, through various results and discussions with experts, summarized the areas where the project can be further improved.

Initial Model

Initial Model Design

Our project is mainly divided into three parts: potential antibody sequence generation, species-specificity scoring of antibody sequences, and antibody structure scoring. The species-specificity scoring extends current mainstream tools to multiple species and is the focus of our project. On the design side, we divide this work into three components: data encoding and processing, model construction, and training and debugging. The overall framework of the model is a critical issue that we needed to iterate on and explore continuously. The iterative process presented below therefore revolves around the core task of species-specificity scoring of antibody sequences.

We used Python as the primary programming language and experimented with multiple machine learning and deep learning libraries. In the first attempt, we chose to build the model using the basic ‘scikit-learn’ library.

Initial Model Build

To validate the feasibility of our project, we used a relatively simple encoding method and machine learning models to complete this task.

The species-specificity score of an antibody sequence represents the probability that the sequence belongs to the specified species. The higher the probability, the better the sequence conforms to the antibody patterns of that species. In other words, we have good reason to believe that sequences with higher scores have lower immunogenicity in the specified species and are safer for use in that organism.

We approached this scoring task from the perspective of natural language processing, modeling species-specificity scoring as a sequence multi-classification task and addressing it with deep learning.

After modeling the overall task, we also encoded the input data. Since we ultimately need to generate FR region segments for target species, we selected FR region segments from multiple species as input features and encoded them.

Taking the heavy chain of an antibody as an example, it consists of FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. To ensure the completeness of the generated data and to account for potential interactions between regions after the sequence folds, we concatenated the four FR segments (FR1–FR4) as input features.

Initial scoring pipeline
Figure 2. Initial scoring pipeline

In the first version of the solution, we employed the commonly used one-hot encoding method in the field of bioinformatics, where each amino acid is encoded into a 20-dimensional vector.

One-hot
Figure 3. One-hot
One-hot
Figure 4. One-hot encoding example
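As a concrete illustration, the snippet below is a minimal sketch (not our exact code) of this one-hot encoding; the function name, padding length, and example fragment are placeholders.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str, max_len: int) -> np.ndarray:
    """Encode an FR sequence as a flat one-hot vector, zero-padded to max_len residues."""
    encoding = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):
        if aa in AA_INDEX:                      # unknown residues (e.g. 'X') stay all-zero
            encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding.flatten()

# Example: encode a short FR fragment
print(one_hot_encode("EVQLVESGGGLVQPGGSLRLSCAAS", max_len=95).shape)  # (1900,)
```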

In the classifier, we utilized the classic Support Vector Machine (SVM) algorithm in the first version of our solution. We conducted classification attempts separately on the eight species with the highest antibody counts as well as on almost all the data. Our aim was to validate the separability of the antibody data using this approach.

SVM
Figure 5. SVM
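A hedged sketch of how such an SVM classifier can be trained with scikit-learn is shown below; the features and labels are random placeholders standing in for the encoded FR segments and species labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholders: one-hot encoded FR segments and species labels
X = np.random.rand(400, 1900)
y = np.random.randint(0, 8, size=400)       # e.g. the 8 most abundant species

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)               # classic SVM with an RBF kernel
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```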

Initial Model Test

The experiment achieved relatively high accuracy, demonstrating the feasibility of our initial attempt. However, one-hot encoding has a clear limitation: although it represents the differences between sequences cleanly and makes the data easy to separate, our analysis revealed that it cannot capture the close relationships between similar subtypes or the distant relationships between evolutionarily distant species. It cannot learn the true features of the antibody data.

Initial Model Learn

One-hot encoding gives the data high separability but loses valuable information. Based on this, we recognized the need for a more appropriate encoding method, one that incorporates both the similarities in the physicochemical properties of amino acids and the statistical patterns of natural genetic evolution. We investigated the mainstream amino acid encoding matrices, including the BLOSUM and PAM matrices, and improved our encoding method in the next version.

Version II

Version II - Design

In the first version of the model, a core issue arose from the overly naive encoding method used. Therefore, we conducted research into mainstream amino acid encoding matrices, including BLOSUM matrices, PAM matrices, and Position-Specific Scoring Matrices (PSSM). Ultimately, we chose to use the BLOSUM62 substitution scoring matrix, which is a widely applied encoding method in protein analysis. The BLOSUM62 matrix is based on statistical information from protein sequence alignments, and we have good reason to believe that it incorporates both the physicochemical similarities between amino acids and statistical patterns from natural genetic evolution.

Blosum62
Figure 6. Blosum62 matrix
Blosum62 example
Figure 7. Blosum62 encoding example
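A minimal sketch of BLOSUM62 encoding, assuming Biopython is available: each residue is replaced by its row of substitution scores against the 20 standard amino acids, so every residue again maps to a 20-dimensional vector.

```python
import numpy as np
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def blosum62_encode(sequence: str) -> np.ndarray:
    """Concatenate the BLOSUM62 score row of each residue into one feature vector."""
    rows = []
    for aa in sequence:
        if aa in AMINO_ACIDS:
            rows.append(np.array([BLOSUM62[aa, b] for b in AMINO_ACIDS], dtype=np.float32))
        else:                                   # unknown residues get an all-zero row
            rows.append(np.zeros(len(AMINO_ACIDS), dtype=np.float32))
    return np.concatenate(rows)

print(blosum62_encode("EVQLV").shape)  # (100,)
```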

Version II - Build

Keeping a setup similar to the previous version, we conducted a classification attempt on the 18 most abundant antibody species, using BLOSUM62 encoding with an SVM model.

Version II - Test

This attempt achieved a classification accuracy of 0.92. The confusion matrix below shows that the model performed well, with some confusion arising between similar subtypes. For instance, several classes are rodent species, and such confusion actually supports the rationality of the encoding and classification methods.

Version II test result
Figure 8. Version II test result

Version II - Learn

In the end, we settled on using the BLOSUM62 encoding and shifted our focus toward the design of a more effective model.

Version III

Version III - Design

Once we had determined the encoding method, we employed various traditional machine learning algorithms to compare their performance with the SVM method, with the aim of improving our model.

Version III - Build

In this phase, we conducted a horizontal comparison of three algorithms: SVM, KNN, and decision trees, to assess their performance within our model, which led to further updates and iterations.

When classifying the 18 most abundant species of antibodies, the three models showed relatively similar results. However, as the sample size increased, SVM exhibited superior performance.
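The comparison can be sketched as follows; this is an illustrative setup with random placeholders for the BLOSUM62-encoded data and labels and assumed hyperparameters, not our exact experiment.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 1900)                 # BLOSUM62-encoded sequences (placeholder)
y = np.random.randint(0, 18, size=500)        # 18 species labels (placeholder)

models = {
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=20),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```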

Version III - Test

Focusing on protein optimization design, Yuman Li, Jingfei Hou, and Yuan Chiang discovered significant exploration potential in enzyme optimization design. They proposed an enzyme optimization pipeline: predicting mutation sites based on MSA and HotSpot Wizard, constructing a sequence space for mutations, and predicting the activity and stability of potential sequences. Ziqian Wang and Zhan Shi attempted various algorithms, using BLOSUM encoding to construct a mutation sequence space for a specific enzyme, and used deep models to fit and predict activity and stability.

KNN
Figure 9. KNN
Comparison of three models
Figure 10. Comparison of three models

Version III - Learn

Analyzing the classification performance of the three models, we noted that the encoded antibody vectors reach a dimensionality of 1900, which counts as high-dimensional data. SVM is well suited to such high-dimensional data, whereas decision trees tend to grow overly complex structures on it, which resulted in the poorest performance.

Additionally, we conducted t-SNE dimensionality reduction analysis on the antibody data to observe the separability of the 18 classes of data.
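The analysis follows the usual scikit-learn recipe; the sketch below is illustrative, and the perplexity and other parameters are assumptions rather than our exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.rand(500, 1900)                 # encoded antibody sequences (placeholder)
y = np.random.randint(0, 18, size=500)        # species labels (placeholder)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab20", s=5)
plt.title("t-SNE of encoded antibody sequences")
plt.show()
```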

TSNE plot
Figure 11. TSNE plot of 18 species

As the volume of data continues to increase, the performance of traditional machine learning models gradually deteriorates. Furthermore, a large proportion of the collected species have only small samples, and antibody data is difficult to augment adequately. A significant problem therefore arises in which small-sample data is overwhelmed by large-sample data. Based on this, we aim to explore deep learning models with stronger feature extraction capabilities for classification and scoring.

Version IV

Version IV - Design

After accumulating experience from previous model iterations, we decided to leverage the PyTorch deep learning framework and employ state-of-the-art deep learning models for designing our classifier.

Version IV - Build

Deep learning models are widely recognized for their strong feature extraction capabilities. In the context of our antibody sequence scoring and classification task, we experimented with LSTM (Long Short-Term Memory) models and Transformer models, which are well-suited for natural language processing tasks.

LSTM structure
Figure 12. LSTM structure
Transformer structure
Figure 13. Transformer structure
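As an illustration of the kind of Transformer-based classifier we experimented with, here is a compact PyTorch sketch; the class name, dimensions, and layer counts are placeholders rather than our exact architecture.

```python
import torch
import torch.nn as nn

class AntibodyTransformerClassifier(nn.Module):
    def __init__(self, n_tokens=21, d_model=64, n_heads=4, n_layers=2, n_classes=70, max_len=160):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, d_model)      # 20 amino acids + padding token
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):                            # tokens: (batch, seq_len) residue ids
        h = self.embed(tokens) + self.pos[:, :tokens.size(1)]
        h = self.encoder(h)
        return self.head(h.mean(dim=1))                   # mean-pool over positions, then classify

model = AntibodyTransformerClassifier()
print(model(torch.randint(0, 21, (8, 160))).shape)        # torch.Size([8, 70])
```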

Version IV - Test

We attempted to use LSTM and Transformer models for the classification of antibodies from 70 different species. The LSTM model achieved a classification accuracy of 0.87, while the Transformer model achieved a classification accuracy of 0.90. The Transformer model exhibited stronger feature extraction capabilities.

model comparison
Figure 14. Comparison of deep learning models

While the deep learning models achieved an accuracy of 0.90 in the classification of 70 species, there was a significant issue of imbalanced data sample sizes. This means that a few species dominated almost all of the data, while the majority of species had much smaller amounts of data.

Version IV - Learn

The high accuracy was mainly due to the models favoring the learning of features from the species with large sample sizes and misclassifying data from species with small sample sizes into the categories with larger sample sizes.

preference of feature learning
Figure 15. Confusion matrix shows preference of feature learning

Based on this, we aim to adopt new models in an effort to address the issue of small-sample data being overwhelmed by large-sample data.

Version V

Version V - Design

The core issue at the moment is that there are too many biological species, and the training data is not balanced. The use of neural network classification heads with a predefined number of classes not only results in insufficient learning from small-sample data but also makes it challenging to extend the model to newly added species.

Through research in the field of bioinformatics, we have learned that the concept of one-class models has shown promising practical results in studies at the cellular level. One-class models are often used to address data detection, classification, and anomaly detection when there is only one class of samples. These models are trained using only samples from a single class, learning the characteristics of that single class as thoroughly as possible. A common decision-making approach in the field of bioinformatics is to extract template feature vectors from single-class samples and use similarity calculations between new data and template feature vectors, along with a threshold, to make decisions.

One significant advantage of one-class models is their strong scalability. When new data for a new species is collected, we only need to train a one-class model for that specific species, which greatly saves computational resources and update time.

Therefore, we have decided to change the existing classification approach and explore the use of one-class models.

Version V - Build

We conducted research on existing one-class models and found that SVM-based models, particularly the One-Class SVM, have shown superior performance in previous classifications. As a result, we primarily focused on experimenting with the One-Class SVM.

Performance of different models
Figure 16. Performance of different models

We trained a separate One-Class SVM model for each class of data. The goal of the One-Class SVM model is to enclose the single-class data within the smallest possible hypersphere, with the primary parameter being the error tolerance rate. We partitioned an equal proportion of the test set for each class of data. Since all classes of data exist in the same dimensional space, we made classification decisions based on the distance between the test data and the soft classification boundary of each One-Class SVM model.

One-Class SVM
Figure 17. One-Class SVM
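The per-species setup can be sketched as follows; this is an illustrative sketch with placeholder data, and in scikit-learn's OneClassSVM the error tolerance rate corresponds to the nu parameter.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def train_one_class_models(data_by_species, nu=0.05):
    """Fit one One-Class SVM per species; nu is the error tolerance rate."""
    return {sp: OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X)
            for sp, X in data_by_species.items()}

def classify(models, X_test):
    """Assign each test point to the species whose soft boundary it lies deepest inside."""
    species = list(models)
    distances = np.stack([models[sp].decision_function(X_test) for sp in species], axis=1)
    return [species[i] for i in distances.argmax(axis=1)]

# Placeholder data: encoded sequences for a large and a small species
data = {"human": np.random.rand(300, 1900), "mouse": np.random.rand(40, 1900)}
models = train_one_class_models(data)
print(classify(models, np.random.rand(5, 1900)))
```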

Version V - Test

We selected eight species, including those with both large and small data volumes. Through multiple attempts, we observed that the model's performance was significantly influenced by the error tolerance rate parameter. When adjusting this parameter separately for each species, we were able to achieve 1.0 accuracy. However, this tuning approach is not suitable for scaling up to accommodate the addition of new species.

Performance of One-Class SVM
Figure 18. Performance of One-Class SVM

Based on this, we conducted a new t-SNE dimensionality reduction analysis and observed significant outliers in the data, which were severely affecting the position of the soft classification boundaries of the One-Class SVM models.

Version V - Learn

TSNE clustering diagram
Figure 19. TSNE clustering diagram

The TSNE clustering diagram shows that the quality of the collected antibody sequences is uneven; some may result from high-frequency somatic mutation or may be non-natural sequences. We need to eliminate these outliers to ensure data quality and thereby reduce the immunogenicity of the designed antibody sequences.

Version VI

Version VI - Design

Because we observed significant outliers, and to prevent them from excessively interfering with the One-Class SVM models, we use a One-Class SVM to identify clear outlier points and remove them. The more compact data remaining for each class can then be used to complete the classification and scoring tasks with SVM models.

Version VI - Build

We added an outlier-detection step to our model. The updated pipeline is as follows:

Version VI model
Figure 20. Version VI model
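A hedged sketch of this pipeline: a One-Class SVM per species flags clear outliers, the flagged points are dropped, and the multi-class SVM is trained on the cleaned data (the nu setting and data here are placeholders).

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

def remove_outliers(X, y, nu=0.05):
    """Keep only the points each species' One-Class SVM marks as inliers."""
    keep = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        detector = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X[idx])
        keep[idx] = detector.predict(X[idx]) == 1         # +1 = inlier, -1 = outlier
    return X[keep], y[keep]

X = np.random.rand(400, 1900)                             # encoded sequences (placeholder)
y = np.random.randint(0, 8, size=400)                     # 8 species (placeholder)
X_clean, y_clean = remove_outliers(X, y)
clf = SVC(kernel="rbf").fit(X_clean, y_clean)
```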

Version VI - Test

This approach effectively achieved the removal of outlier data points. Under this model construction, we successfully improved accuracy and achieved a classification accuracy of 1.0 for the previously selected eight classes of data.

TSNE plot with outlier detection
Figure 21. TSNE plot with outlier detection
Version VI performance
Figure 22. Version VI performance

Version VI - Learn

Outlier detection is clearly necessary in our model. However, as the number of species increased, traditional machine learning models continued to show weak feature extraction capability. We therefore carried the one-class concept forward and combined it with deep learning models.

Version VII

Version VII - Design

A completely new attempt: deep one class

The advantage of one-class models is that they train on single-class data, which can result in more compact representations. However, there's no guarantee that the model extracts genuinely useful antibody features. On the other hand, deep learning multi-class methods ensure strong feature extraction capabilities but can't solve the serious misclassification problem caused by imbalanced multi-species data.

We drew inspiration from Deep One-Class models used for image anomaly detection and weighted the descriptive loss of the multi-class task along with the compactness loss of the one-class model. We used this weighted loss to update the parameters of the deep learning model, ensuring that the model has strong feature extraction capabilities while also training the single-class data to be more compact.

Version VII - Build

Training stage
Figure 23. Training stage of DOC model
Loss function

The final training model consists of two paths, as shown in the diagram.

The lower training path takes as input the single-class antibody data for a specified species. Here, we use a Transformer Encoder model with strong feature extraction capabilities and suitability for protein sequences to extract features from the antibody data. The Transformer Encoder model does not include a classification head; its final output is the extracted feature vector. We use the variance of data within a batch as the loss function. Smaller variance indicates more compact data, ensuring that this training path trains the model to represent the specified species' data more compactly.

\(m_i = \frac{1}{n-1} \sum_{j\neq i} x_j\), \(z_i = x_i - m_i\)

\(l_C = \frac{1}{nk} \sum_{i=1}^{n} z_i^T z_i\)
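Written out directly from these definitions, the compactness loss is straightforward to implement; the PyTorch sketch below is a minimal version of it.

```python
import torch

def compactness_loss(features: torch.Tensor) -> torch.Tensor:
    """features: (n, k) batch of encoder outputs for the single target species."""
    n, k = features.shape
    m = (features.sum(dim=0, keepdim=True) - features) / (n - 1)   # m_i: mean of the other samples
    z = features - m                                               # z_i = x_i - m_i
    return (z * z).sum() / (n * k)                                 # l_C = (1/nk) * sum_i z_i^T z_i

print(compactness_loss(torch.randn(32, 64)))
```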

The upper training path uses multi-class antibody data from other species and also passes through a Transformer Encoder model. This Transformer Encoder model has the same structure and shares parameters with the model in the lower training path. The extracted features are then passed through a classification layer to perform multi-class classification, with the final cross-entropy loss used as the training loss. Smaller cross-entropy indicates stronger feature extraction capabilities in the model.

The weighted loss obtained from both training paths is used to update the shared parameters of the Transformer Encoder, achieving the desired model's performance.

Version VII - Test

We applied the above method to classify species with sample sizes ranging from 20 to 200 and achieved an accuracy of 0.92. However, this performance was not yet satisfactory.

Performance of DOC model
Figure 25. Performance of DOC model

Considering that small-sample species as target classes may not have enough data to support the data requirements of the Transformer Encoder model, we implemented improvements involving model pretraining and parameter transfer.

In the actual model training, some single-class species have too few samples to train the Transformer Encoder model from scratch. We therefore treated the training process described above as a fine-tuning step: before performing it, we first trained the Transformer Encoder and the classification layer on the entire dataset, and then transferred the resulting pretrained parameters into that training process for fine-tuning.
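The transfer itself is an ordinary save/load of model parameters; the sketch below uses a tiny stand-in module instead of our real Transformer Encoder.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):                     # stand-in for the Transformer Encoder + head
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(1900, 64)           # feature extractor
        self.head = nn.Linear(64, 70)             # classification head used during pretraining

    def forward(self, x):
        return self.head(self.body(x))

# 1) Pretrain on the whole multi-species dataset (training loop omitted), then save the parameters.
pretrained = TinyEncoder()
torch.save(pretrained.state_dict(), "pretrained.pt")

# 2) Initialize the fine-tuning model with the pretrained parameters before the two-path training.
fine_tuned = TinyEncoder()
fine_tuned.load_state_dict(torch.load("pretrained.pt"))
```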

However, even after these efforts, the model's performance was not as expected. We suspected that the training process for the weighted model had a significant impact on the model's performance. Therefore, we explored various approaches to address this issue.

Version VII - Learn

Attempted fine-tuning methods
Figure 26. Attempted fine-tuning methods

The first training approach showed the best classification performance, but the training results of deep learning models can be sensitive to parameters. Therefore, further improvements were made:

Considering that our model uses weighted loss, the choice of weighted parameters significantly affects the model's performance. Since the compactness loss should have a relatively small weight, we tried various parameter settings and ultimately chose a compactness loss weight of 0.1 and a descriptive loss weight of 1.0.

Additionally, we set the model training to fine-tune for 10 epochs, used the Adam optimizer with a learning rate of 0.0005, and applied the corresponding learning rate decay.

These parameter choices and optimizer settings were determined after multiple iterations, experiments, and analysis to achieve the optimal results.
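Putting these settings together, the fine-tuning loop looks roughly like the following self-contained sketch; the encoder, data batches, and learning-rate decay schedule (StepLR) are placeholders and assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the shared Transformer Encoder and its classification layer
encoder, head = nn.Linear(1900, 64), nn.Linear(64, 70)

def compactness_loss(z):
    n, k = z.shape
    m = (z.sum(0, keepdim=True) - z) / (n - 1)
    d = z - m
    return (d * d).sum() / (n * k)

W_COMPACT, W_DESCRIPTIVE = 0.1, 1.0                       # chosen loss weights
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # assumed decay schedule

for epoch in range(10):                                   # 10 fine-tuning epochs
    x_multi = torch.randn(32, 1900)                       # multi-species batch (placeholder)
    y_multi = torch.randint(0, 70, (32,))
    x_single = torch.randn(32, 1900)                      # target-species batch (placeholder)

    loss = (W_DESCRIPTIVE * F.cross_entropy(head(encoder(x_multi)), y_multi)
            + W_COMPACT * compactness_loss(encoder(x_single)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```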

Final Model

Final Model - Design

After completing the model training, we further updated and adjusted the testing process multiple times, and the final model was determined as follows.

Testing stage of DOC model
Figure 27. Testing stage of DOC model

In the field of bioinformatics, a common decision-making approach involves learning the features of single-class samples and obtaining template feature vectors for these single-class samples. When new data needs to be tested, decisions are made by calculating the similarity between the new data and template feature vectors and using a predefined threshold.

For image anomaly detection, multiple template feature vectors are used as the basis for classification.

classification method
Figure 28. Classification method

In terms of the overall workflow, our project can be understood as consisting of sequence encoding, sequence generation, classification, and scoring. The scoring phase includes both sequence scoring and structural scoring as mentioned before.

After conducting literature research and team discussions, we have determined the following sequence generation strategy:

In the sequence generation part, we integrate the features of input antibody data with the statistical patterns of antibody data for the target species. We adopt a strategy of conservative mutations to determine the amino acids with a high likelihood at each position. We then generate a potential library of antibody data sequences through permutations and combinations.

Afterward, we subject the generated sequences to the model specific to the target species to obtain corresponding sequence scores.

As for the other part of scoring, the structural scoring, we will leverage models like IgFold:

In the structural scoring part, we first fold the original sequence and the generated potential sequences using the IgFold model. This helps us determine whether the generated sequences can fold into complete structures. We also visually compare the structures of the original sequence and the generated sequences to identify differences.

This comprehensive approach integrates sequence generation and structural scoring to assess the quality and potential of the generated antibody sequences.

IgFold
Figure 29. IgFold antibody structure prediction

Final Model - Build

The main sequence generation strategies employed include:

Setting a threshold of 0.9: at each position, we take the most frequent target-species amino acids whose cumulative frequency reaches or exceeds 0.9. If a position is dominated by a single amino acid with a frequency of 0.9 or higher, we retain only that amino acid; for other positions, we keep two amino acids. Additionally, we take the amino acid at the corresponding position of the input sequence into account to minimize mutations.
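The sketch below is one possible reading of this strategy; the function names, the column data, and the exact tie-breaking rules are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def allowed_residues(column, input_aa, threshold=0.9, max_keep=2):
    """column: residues observed at this position across target-species antibodies."""
    freqs = Counter(column)
    total = sum(freqs.values())
    kept, cumulative = [], 0.0
    for aa, count in freqs.most_common():
        kept.append(aa)
        cumulative += count / total
        if cumulative >= threshold or len(kept) == max_keep:
            break
    if input_aa not in kept:                  # keep the input residue to minimize mutations
        kept.append(input_aa)
    return kept

def generate_library(columns, input_sequence):
    options = [allowed_residues(col, aa) for col, aa in zip(columns, input_sequence)]
    return ["".join(choice) for choice in product(*options)]

# Toy example: 3 positions, each with residues observed in the target species
columns = [list("EEEEQ"), list("VVVVL"), list("QQQKK")]
print(generate_library(columns, "EVK"))
```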

For sequence scoring, we randomly sampled data from the target species and averaged their output features from the model to create a template feature vector. We experimented with both the Pearson and Spearman coefficients to measure vector similarity, and the Pearson coefficient gave superior results.
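This decision step can be sketched as follows, with random placeholders standing in for the encoder's feature vectors.

```python
import numpy as np
from scipy.stats import pearsonr

def build_template(feature_vectors, n_samples=100, seed=0):
    """Average the features of randomly sampled species data into a template vector."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(feature_vectors), size=min(n_samples, len(feature_vectors)), replace=False)
    return np.asarray(feature_vectors)[idx].mean(axis=0)

def sequence_score(template, candidate_features):
    """Pearson correlation between the template and a generated sequence's features."""
    r, _ = pearsonr(template, candidate_features)
    return r

species_features = np.random.rand(500, 64)    # encoder outputs for the target species (placeholder)
template = build_template(species_features)
print(sequence_score(template, np.random.rand(64)))
```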

After obtaining the PDB (Protein Data Bank) files via IgFold, we perform a comparison of structural similarity between the original sequence and the generated sequence's PDB files using two commonly used metrics: TM-Score and RMSD (Root Mean Square Deviation). These metrics provide a reasonable structural scoring to assess the similarity between the structures of the original and generated sequences, mainly the structural similarity of the CDR regions that bind to the antigen.

Structure scoring
Figure 30. Structure scoring
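The RMSD part of this comparison can be sketched with Biopython; the file names are placeholders for the PDB files produced by IgFold, the example superimposes C-alpha atoms and assumes both models cover the same residues, and TM-Score would come from an external tool such as TM-align.

```python
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
ref = parser.get_structure("ref", "original.pdb")   # original sequence folded by IgFold (placeholder path)
gen = parser.get_structure("gen", "generated.pdb")  # generated sequence folded by IgFold (placeholder path)

# Collect C-alpha atoms from both structures
ref_ca = [atom for atom in ref.get_atoms() if atom.get_id() == "CA"]
gen_ca = [atom for atom in gen.get_atoms() if atom.get_id() == "CA"]

sup = Superimposer()
sup.set_atoms(ref_ca, gen_ca)                       # superimpose the generated structure onto the original
print("RMSD:", sup.rms)
```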

Final Model - Test

Sequence generation process
Figure 31. Sequence generation process
Software output
Figure 32. Software output

The sequence generation process went very smoothly. We obtained a large amount of potential sequence space. We further put these sequences into the model for scoring.

In the sequence scoring part, we found that this decision-making approach, while improving the model's performance on small-class samples, had the opposite effect on larger-class data. Through dimensionality reduction analysis, we discovered that due to the distribution patterns in the data, some large-class sample test points were actually farther away from the class center than small-class sample test points. This resulted in a deterioration in the classification performance for large-class samples.

distribution patterns
Figure 33. Distribution patterns shown in TSNE plot

To address this issue, we attempted two different approaches: using the K-Nearest Neighbors (KNN) classification decision method and applying the Pearson coefficient in combination with sample downsampling. Ultimately, we chose the latter approach, and the resulting classification and scoring performance is as follows.

resulting performance
Figure 34. Resulting performance

The model's classification accuracy reached 0.97, and the remaining confusion is primarily due to similarities between subtypes, which is reasonable. After further grouping the species according to these similar subtypes, the accuracy reached 1.0. Such high accuracy indicates that our sequence scoring is well founded.

Clustering
Figure 35. Clustering
Performance after clustering
Figure 36. Performance after clustering

Looking specifically at the species-specific scoring of sequences, the scoring also showed promising results: sequences score 0.9 or higher against their own species, the model handles similar subtypes well, and scores against other species remain below 0.6, which shows that our sequence scoring is very reasonable.

Sequence scoring example
Figure 37. Sequence scoring example

Sequences in our potential sequence space were also passed through the model to obtain sequence scores.

Generated sequences and scores
Figure 38. Generated sequences and scores

Our structural scoring also went well. First, the PDB files show that the generated sequences folded successfully into the typical antibody structure, with CDR regions relatively close to those of the original structure. To measure this similarity more precisely, we used the structure score, which characterizes structural agreement numerically.

PDB file generated by IgFold
Figure 39. PDB file generated by IgFold
Generated sequences and structure scores
Figure 40. Generated sequences and structure scores

For verification of the rationality of our final results, see the Results page of our wiki.

Final Learning

1. Protein Encoding:

Initially, we employed a simple one-hot encoding paradigm to represent amino acids as 20-dimensional vectors. However, this approach led to overly sparse representations in the feature space and overlooked the physicochemical properties of amino acids. Based on the results of our initial tests, we recognized the need for a more suitable encoding method that would capture both the similarities in physicochemical properties among amino acids and the statistical patterns arising from natural genetic evolution. Therefore, we opted for encoding using the BLOSUM62 matrix, which is based on sequence alignment information for proteins. We had good reason to believe that it encapsulated the similarity in physicochemical properties of amino acids and the statistical patterns arising from natural genetic evolution. Indeed, its performance proved to be excellent.

2. Choice of Classifier Model:

In a task characterized by a high-dimensional feature space and complex calculations with limited samples, traditional machine learning models did show promise. We experimented with three such models, and each had some level of success. However, through testing, we discovered that these traditional models did not perform as well as deep learning methods when it came to fitting a large number of data samples and facilitating data expansion. Consequently, we turned to deep neural networks, particularly models built on the Transformer architecture, which ultimately yielded superior results.

3. Scalability and Real-World Applicability:

As a system designed for antibody design across multiple species, it was imperative to ensure that our model possessed sufficient scalability. We encountered limitations when using traditional neural networks, especially with standard classification layers, as they did not easily adapt to our needs. Our task couldn't be simplified to a mere generation, classification, or scoring paradigm. After conducting thorough research, we ultimately adopted an iterative one-class approach, and through testing, we identified the impact of outlier data on experimental outcomes. In a subsequent iteration, we successfully employed outlier detection to mitigate this impact. Following comparative experiments, we firmly established the core of our model as "deep one class," continuously fine-tuning hyperparameters and decision-making processes based on experimental results. This approach has allowed us to continually incorporate new training samples in response to real-world application scenarios and accommodate the design of antibodies for novel species.

Areas for Improvement:

Currently, the system is presented in the form of Python engineering files, which may require users to have some programming knowledge to use effectively. Therefore, the next step is to develop a user-friendly interactive software that caters to end-to-end sequence generation requirements and can display the structural and scoring aspects based on user preferences.