Model

On this page, we introduce our antibody generation and scoring model in five parts: modeling overview, antibody sequence encoding, sequence species scoring, potential sequence generation, and structure scoring.

Modeling Overview

At present, many machine learning and deep learning tools have been developed for humanizing mouse antibodies; among them, tools that score humanized antibody sequences are the mainstream approach.

Our project starts from antibody sequences raised in other species against a target antigen (such as mouse-derived antibodies), automatically generates antibody sequences that may have better efficacy and lower immunogenicity, and scores them comprehensively. Beyond the basic statistics of antibody patterns for each species and the generation of potential sequences, the most critical step is to assign each generated sequence a species score for its target species.

Antibody scoring includes sequence scoring and structure scoring.

The sequence score represents the probability that a generated sequence belongs to the specified species. The higher the probability, the more consistent the sequence is with that species' antibody patterns, and the more reason we have to believe the sequence has low immunogenicity in that species, i.e. that the antibody is safer for organisms of that species.

Sequence scoring is currently the most mainstream approach to evaluating humanized mouse antibodies and is the core of our entire project. Based on the above, we frame sequence scoring as a sequence multi-classification task in natural language processing.

The structure score reflects the structural plausibility of the generated sequence. Since we generate only the FR regions of the antibody, they must be spliced with the CDR regions of the given sequence to obtain a complete antibody sequence. The structure score therefore accounts for this splicing: it indicates whether the generated sequence can fold into a reasonable structure, and how similar the folded antibody is to the original one.

The model overview is as follows.

Model overview
Figure 1. Model overview

Encoding

We model the species classification score of antibody sequences as a sequence multi-classification task and solve this multi-classification task through deep learning. Because we ultimately need to generate FR region fragments of target species, we select FR region fragments of multiple species as input features and encode them.

Let's take the heavy chain sequence of an antibody as an example. The heavy chain is composed of FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. To ensure the integrity of the generated data and to account for potential interactions between regions after the sequence folds, we spliced the four FR fragments (FR1 through FR4) together as the input feature.
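The splicing step above can be sketched as follows. This is a minimal illustration; the region sequences and boundaries shown are illustrative examples, not the project's actual data or numbering scheme.

```python
# Sketch: build the model input by splicing the four framework (FR)
# regions of a heavy chain and dropping the CDRs.
# Region sequences below are illustrative only.
def splice_framework(regions):
    """regions: dict with keys FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4."""
    return "".join(regions[k] for k in ("FR1", "FR2", "FR3", "FR4"))

example = {
    "FR1": "EVQLVESGGGLVQPGGSLRLSCAAS",
    "CDR1": "GFTFSSYA",
    "FR2": "MSWVRQAPGKGLEWVS",
    "CDR2": "AISGSGGST",
    "FR3": "YYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYC",
    "CDR3": "ARDRGGYYFDY",
    "FR4": "WGQGTLVTVSS",
}
fr_input = splice_framework(example)  # concatenated FR1+FR2+FR3+FR4
```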

Antibody structure
Figure 2. Antibody structure

We need to encode the antibody sequence so that it can be processed by a deep learning model. Antibody sequences are generally composed of the 20 common amino acids. A mainstream and simple encoding method in bioinformatics is one-hot encoding. This encoding gives the data high separability, but it carries no information about the physical and chemical properties of amino acids or the statistical laws of natural evolution. To uncover the real characteristics of the data, we investigated the mainstream amino acid encoding matrices, including the BLOSUM and PAM matrices.

The BLOSUM substitution scoring matrix is a log-odds matrix. Protein sequences whose pairwise identity exceeds a threshold are clustered into about 500 groups; multiple sequence alignment within each group yields conserved, gap-free regions, which are divided into more than 2,000 blocks. The identity threshold can take many values; the BLOSUM62 matrix uses an identity of at least 62%. From these blocks, the substitution frequencies of all pairs of the 20 amino acids are then counted.

The PAM (Point Accepted Mutation) matrix is generated from a point-mutation model of evolution, based on the hypothesis that if two amino acids replace each other frequently, nature accepts that substitution readily, so the substitution pair receives a high score. The basic PAM-1 matrix, obtained by statistical methods, reflects an average of one accepted mutation per 100 amino acids. Multiplying PAM-1 by itself n times gives PAM-n, which represents the accumulation of multiple mutations; taking the logarithm yields the corresponding substitution score matrix.
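The PAM-n construction described above can be sketched numerically. A toy 3-letter alphabet stands in for the 20 amino acids, and the mutation probabilities are invented for illustration; only the procedure (matrix power, then log-odds) mirrors the real PAM derivation.

```python
# Sketch: PAM-n from PAM-1 by repeated matrix multiplication, then
# log-odds scores. Toy 3x3 matrix; real PAM matrices are 20x20.
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    out = pam1
    for _ in range(n - 1):
        out = mat_mul(out, pam1)
    return out

def log_odds(pam, background):
    # score(i, j) = log( P(substitution) / background frequency )
    return [[math.log(pam[i][j] / background[i]) for j in range(len(pam))]
            for i in range(len(pam))]

# Toy PAM-1: each column is a mutation-probability distribution.
pam1 = [[0.98, 0.01, 0.01],
        [0.01, 0.98, 0.01],
        [0.01, 0.01, 0.98]]
pam2 = pam_n(pam1, 2)                      # two rounds of mutation
scores = log_odds(pam2, [1 / 3] * 3)       # log-odds score matrix
```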

Blosum62
Figure 3. Blosum62 matrix
Pam
Figure 4. Pam matrix
Antibody structure
Figure 5. Relationship between the matrices

Taken together, the BLOSUM62 matrix is suitable for comparing sequences with both distant and close genetic relationships, and it is a widely used scoring matrix. We therefore use BLOSUM62 for encoding: each amino acid is encoded as a 20-dimensional vector, and since heavy-chain FR data are generally around 95 residues long, each sequence is encoded as a 1900-dimensional vector. We also tried the other encoding methods mentioned above in subsequent classification tasks, including dimensionality reduction of the BLOSUM62 encoding; in the end, BLOSUM62 encoding performed best.
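The row-wise encoding scheme can be sketched as below. To keep the sketch self-contained, an identity (one-hot) placeholder matrix stands in for BLOSUM62; in the actual pipeline, each residue's row of the real BLOSUM62 matrix would be used instead.

```python
# Sketch: encode a sequence by concatenating each residue's 20-dim
# substitution-matrix row. Identity rows are placeholders here;
# swap in real BLOSUM62 rows in practice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MATRIX = {aa: [1.0 if aa == other else 0.0 for other in AMINO_ACIDS]
          for aa in AMINO_ACIDS}

def encode(seq):
    vec = []
    for aa in seq:
        vec.extend(MATRIX[aa])   # one 20-dim row per residue
    return vec

v = encode("EVQLV")  # 5 residues -> 100-dim vector
```

With a typical FR length of about 95 residues, this yields the 1900-dimensional vectors described above.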

Antibody structure
Figure 6. Encoding example

Sequence Scoring

Sequence scoring is the most important link in the entire project. Inspired by the humanization of mouse-derived antibodies, we hope to design antibodies suitable for more species, including cats, dogs, cows, pigs, and rodents. This expansion to more species is the significance of our project, and it also brings the project's biggest difficulty: a serious imbalance in the amount of data available for each species.

However, when traditional machine learning methods and deep learning models such as LSTMs and Transformers are used to classify antibody data, species with small samples are frequently misclassified. In our collected data, a large proportion of species have only small samples, reasonable data augmentation of antibody data is difficult, and these species urgently need antibody design solutions and tools.

Currently, for various data situations (multi-class data, single-class data, large and small sample sizes), there are many targeted solutions, such as direct multi-classification with deep learning models, fine-tuning, and one-class methods.

Antibody structure
Figure 7. Different deep learning strategies used for classification

To address the unbalanced, small-sample data, we adopted the one-class concept: train on only one class of samples to obtain an accurate template feature representation of that class. When evaluating the sequence score of a new sequence against a given species, we then only need to measure its similarity to that species' template vector; when the similarity exceeds a threshold, the new sample is judged to belong to that class. Since one-class training is itself sensitive to sample size, we also incorporate fine-tuning techniques.

A general multi-classification model can ensure strong feature extraction and learn the real characteristics of the data, but it cannot guarantee that small-sample data will not be overwhelmed by large-sample data.

A general one-class model can train single-class data compactly enough, but it cannot guarantee the model's feature extraction ability.

The training model we finally adopted combines the strong feature extraction capabilities of the multi-class deep learning model and the One Class model's ability to ensure that single-class data is not interfered with by other data.

We borrowed from the Deep One Class (DOC) model used in image anomaly detection: we take a weighted sum of the descriptive loss of the multi-classification task and the compactness loss of the one-class model, and use this weighted loss to update the deep learning model's parameters, so that the model both has strong enough feature extraction capabilities and trains the single-class data more compactly.

Antibody structure
Figure 8. Training frameworks of the proposed DOC method


The final training model is shown in the figure above, and the model is trained in two paths.

The input to the lower training path is antibody data from a single specified species. A Transformer Encoder, which has strong feature-extraction capabilities and is well suited to protein sequences, extracts features from the antibody data. This Transformer Encoder has no classification head; its final output is the extracted feature vector. We use the variance of the data within a batch as the loss: the smaller the variance, the more compact the data. This training path therefore ensures that the model trains the specified species' data more compactly.

\(m_i = \frac{1}{n-1} \sum_{j\neq i} x_j\), \(z_i = x_i - m_i\)

\(l_C = \frac{1}{nk} \sum_{i=1}^{n} z_i^T z_i\)
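The compactness loss above, where \(x_i\) is the feature vector of sample \(i\), \(n\) is the batch size, and \(k\) is the feature dimension, can be sketched directly from the formulas. This is a plain-Python illustration of the math, not the project's actual training code.

```python
# Sketch: compactness loss l_C for a batch of n feature vectors of
# dimension k, following the formulas above:
#   m_i = mean of the other n-1 vectors, z_i = x_i - m_i,
#   l_C = (1 / (n * k)) * sum_i z_i^T z_i
def compactness_loss(batch):
    n, k = len(batch), len(batch[0])
    total = 0.0
    for i, x in enumerate(batch):
        # m_i: mean over all vectors except x_i
        m = [sum(batch[j][d] for j in range(n) if j != i) / (n - 1)
             for d in range(k)]
        z = [x[d] - m[d] for d in range(k)]
        total += sum(zd * zd for zd in z)
    return total / (n * k)
```

An identical batch gives a loss of zero, and the loss grows as the batch spreads out, which is exactly the compactness behavior the lower path trains for.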

The upper training path uses antibody data from multiple other species, which also passes through a Transformer Encoder. This encoder has the same structure as, and shares parameters with, the one in the lower path. The extracted features pass through a classification layer to perform the multi-classification task, and the classification cross-entropy is used as the loss: the smaller the cross-entropy, the stronger the model's feature extraction ability.

The weighted sum of the two losses is then used to update the parameter-shared Transformer Encoder, achieving the behavior we envisioned.
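The combination of the two losses can be sketched as below, using the compactness weight of 0.1 stated in the training settings. The function names and the toy probability vector are illustrative; the real model computes cross-entropy over the classifier's softmax output.

```python
# Sketch: total loss = descriptive (cross-entropy) loss from the upper
# path + lambda * compactness loss from the lower path.
import math

def cross_entropy(probs, label):
    # probs: predicted class probabilities for one sample
    return -math.log(probs[label])

def total_loss(probs, label, compactness, lam=0.1):
    # lam = 0.1 is the compactness-loss weight used in training
    return cross_entropy(probs, label) + lam * compactness
```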


In actual training, considering that some single species have sample sizes too small to match the capacity of the Transformer Encoder, we treat the above training process as a fine-tuning stage. That is, before training the model above, we first train the Transformer Encoder and classification layer on all the data to obtain pre-trained parameters, then transfer these parameters into the training process above to fine-tune the model.

For training, we set the fine-tuning process to 10 epochs and the weight of the compactness loss in the weighted loss to 0.1; we chose the Adam optimizer with a learning rate of 0.0005 and a corresponding learning-rate decay schedule.

Optimizer parameters
Figure 12. Optimizer parameters

Loss definition
Figure 13. Loss definition

Training progress
Figure 14. Training progress

The model adopted during the testing phase is as follows:

Training progress
Figure 15. Testing frameworks of the proposed DOC method

For each species, we hold out a certain proportion of the data for testing, train the model with that species as the Target Class, and average the output features to obtain the species' template feature vector.

To score a test sequence against a species, we pass it through that species' model to obtain its output features, then compute the Pearson correlation coefficient between those features and the species' template feature vector as the sequence score.

Sequence Generation

In the sequence generation part, we combined the characteristics of the input antibody data with the statistical patterns of the target species' antibody data, adopted a conservative mutation strategy to determine the most likely amino acids at each site, and then generated a library of potential antibody sequences by permutation and combination.

The main strategies adopted are:

We set the threshold to 0.9 and, at each site, count the amino acids whose cumulative frequency in the target species reaches 0.9. If a single amino acid occurs at a frequency of 0.9 or above at a site, only that amino acid is retained; at other sites, two amino acids are retained. We also consider the amino acid at the corresponding position of the input sequence so as to introduce as few mutations as possible.
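The per-site selection and enumeration strategy can be sketched as follows. The site frequencies are invented for illustration, and the sketch omits the comparison against the input sequence's residues.

```python
# Sketch: keep the single dominant amino acid at a site if its
# frequency is >= 0.9, otherwise keep the top two; then enumerate
# candidate sequences with itertools.product.
from itertools import product

def site_candidates(freqs, threshold=0.9):
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] >= threshold:
        return [ranked[0][0]]          # one dominant residue
    return [aa for aa, _ in ranked[:2]]  # top two residues

site_freqs = [
    {"E": 0.95, "Q": 0.05},            # dominant residue kept alone
    {"V": 0.6, "L": 0.3, "I": 0.1},    # top two residues kept
]
candidates = [site_candidates(f) for f in site_freqs]
library = ["".join(seq) for seq in product(*candidates)]
```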

We then pass the generated sequence through the model of the specified species to give the corresponding sequence score.

Structure Scoring

In the structure scoring part, we first fold both the original sequence and the generated potential sequences with the IgFold model (github.com/Graylab/IgFold) to determine whether a generated sequence can fold into a complete structure, and to visually compare the structures of the original and generated sequences.

Antibody folding with IgFold
Figure 16. Antibody folding with IgFold

After obtaining the PDB files, we use two common structure-similarity measures, TM-score and RMSD, to compare the original sequence's structure with that of each generated sequence and give a reasonable structure score.
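The RMSD part of this comparison can be sketched on aligned coordinates. This assumes the two structures are already superimposed; real tools (and TM-score) also perform the optimal alignment, which is omitted here, and the coordinates below are toy values rather than real PDB atoms.

```python
# Sketch: RMSD between two aligned coordinate sets (e.g. C-alpha atoms
# taken from the two PDB files). Superposition is assumed done.
import math

def rmsd(coords_a, coords_b):
    n = len(coords_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / n)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

A lower RMSD (and a higher TM-score) indicates that the generated sequence folds into a structure closer to the original antibody's.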

Structure Scoring
Figure 17. Structure Scoring