Overview

Environments

In this project, our experiments were carried out mainly computationally. Most of the experiment code is written in Python, and the experiment environment is a virtual environment installed under Anaconda3. For the machine learning part, the framework used is PyTorch. To accelerate computation and simplify access, the experiments were run on the Beijing Cloud Supercomputer.

Below are all the dependency packages and their versions installed in the experimental environment:

            
              biopython==1.79
              certifi==2016.2.28
              cycler==0.10.0
              iminuit==1.5.4
              matplotlib==2.0.2
              numpy==1.12.1
              pandas==0.20.3
              patsy==0.4.1
              pyparsing==2.2.0
              python-dateutil==2.6.1
              pytz==2017.2
              scikit-learn==0.19.0
              scipy==0.19.1
              seaborn==0.8
              six==1.10.0
              sklearn-contrib-lightning==0.4.0
              statsmodels==0.8.0
              tmscoring==0.4.post0
              wincertstore==0.2

For the IgFold component used in structural scoring, refer to the installation instructions in the official IgFold documentation: github.com/Graylab/IgFold

BLAST Data Mining

We used BLAST for data mining because, compared with existing curated databases, it can collect sequences from a wider range of species, and because it lets us filter out sequences with very low homology scores to keep the data reliable.

The dataframe collected by BLAST contains the following columns: ID number; name (e.g., Ig heavy chain variable VDJ region); chain type (heavy or light); species (e.g., Sus scrofa); accession time; sequence; sequence length; and the entries for the individual regions FR1, CDR1, FR2, CDR2, and so on.

The BLAST workflow is shown above, and an overview of the collected data follows:

For heavy chains, we initially collected a total of 36,180 valid sequences in the FR-CDR region, covering 253 species names: 22 species with 100 or more sequences, 70 species with 10 or more, and 100 species with 3 or more. Unlike the abundant data on humans and mice, the existing data on many other species are very sparse. We then selected the core antibody data for a subset of species, and the resulting statistics are as follows:

Figure 3: Heavy chain data's species distribution
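The per-species thresholds quoted for the heavy chains (100 or more, 10 or more, 3 or more sequences) can be tallied with a few lines of Python. This is a minimal sketch, assuming one species label per collected sequence; the function name `species_coverage` and the toy labels are illustrative and do not come from the project code.

```python
from collections import Counter


def species_coverage(species_labels, thresholds=(100, 10, 3)):
    """Count how many species have at least `t` sequences, for each threshold t."""
    counts = Counter(species_labels)
    return {t: sum(1 for n in counts.values() if n >= t) for t in thresholds}


# Toy, made-up labels: one entry per collected sequence (not the real dataset).
labels = (
    ["Homo sapiens"] * 120
    + ["Mus musculus"] * 15
    + ["Sus scrofa"] * 4
    + ["Bos taurus"]
)
coverage = species_coverage(labels)
print(coverage)
```

Because the thresholds are cumulative, a species with 120 sequences is counted under all three of them, which matches how the figures above are reported.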

The word cloud of species outside the top ten by number of collected antibody sequences is as follows:

Sequence statistics for each species are analyzed below:

First, the figure below shows Species FR Region Portraits-Sequence Site Distribution:

Then we also counted the proportion of amino acids at each site, that is, Species FR Region Portraits-Site Amino Acid Ratio:

Here we display the statistics of only some species. These per-species statistics inform the later stages of the project.
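The site amino acid ratio described above can be sketched as follows. This is an illustration under assumptions: the sequences of one region are taken to be already aligned to the same length, and the function name `site_amino_acid_ratios` and the toy sequences are hypothetical.

```python
from collections import Counter


def site_amino_acid_ratios(aligned_seqs):
    """Return, for each site, a dict mapping residue -> fraction of sequences."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    ratios = []
    for site in range(length):
        counts = Counter(seq[site] for seq in aligned_seqs)
        total = sum(counts.values())
        ratios.append({aa: n / total for aa, n in counts.items()})
    return ratios


# Toy fragments for one species (made up for illustration).
fr1_toy = ["EVQL", "QVQL", "EVKL", "EVQL"]
ratios = site_amino_acid_ratios(fr1_toy)
for site, dist in enumerate(ratios, start=1):
    print(site, dist)
```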

Software Experimental Procedure

Our project is divided into three main parts: potential antibody sequence generation, antibody sequence speciation scoring, and antibody structure scoring. Antibody sequence speciation scoring, which extends the humanization scoring of current mainstream tools to multiple species, is the focus of our project. It consists of three parts: data encoding and processing, model construction and selection, and model debugging.

We attempted to classify the 18 species with the most antibody sequences using BLOSUM62 encoding and an SVM model, and achieved a classification accuracy of 0.92.
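One common way to turn a sequence into a numeric vector with a substitution matrix is to replace each residue by its row of scores. The sketch below shows that idea only; it is not necessarily the exact encoding we used. The function name `encode_sequence`, the fixed `length` with zero padding, and the 3-letter toy matrix are assumptions for illustration, and the toy scores are not real BLOSUM62 values. In practice the real matrix can be obtained from Biopython (`Bio.Align.substitution_matrices.load("BLOSUM62")`), and the resulting vectors passed to an SVM classifier such as scikit-learn's `sklearn.svm.SVC`.

```python
def encode_sequence(seq, matrix, alphabet, length):
    """Encode `seq` as a flat vector of substitution-matrix rows.

    Each residue becomes its row of scores against `alphabet`. Sequences are
    truncated or zero-padded to `length` residues so all vectors match in size.
    """
    vector = []
    for residue in seq[:length]:
        row = matrix[residue]
        vector.extend(row[other] for other in alphabet)
    missing = length - min(len(seq), length)
    vector.extend([0] * (missing * len(alphabet)))
    return vector


# Hypothetical 3-letter toy matrix (NOT real BLOSUM62 scores), illustration only.
toy_alphabet = "AVL"
toy_matrix = {
    "A": {"A": 4, "V": 0, "L": -1},
    "V": {"A": 0, "V": 4, "L": 1},
    "L": {"A": -1, "V": 1, "L": 4},
}
encoded = encode_sequence("AV", toy_matrix, toy_alphabet, length=3)
print(encoded)
```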

Traditional one-class machine learning models have poor feature extraction ability for small-sample, high-dimensional data. We therefore adopt a deep one-class model, i.e., a deep learning model combined with a one-class objective. A deep learning model with strong feature extraction capability, such as a Transformer, extracts features from the high-dimensional data to obtain the feature vector corresponding to one class, and a method similar to the one above is then used to make the category judgment or compute a score.

The advantage of the One Class model is that it is trained on only one class of data, which makes the learned representation more compact, but there is no guarantee that the model extracts truly useful antibody features. The deep learning multi-classification approach can guarantee sufficiently strong feature extraction, but it cannot solve the serious misclassification problem caused by the imbalanced sample sizes of multi-species data.

We borrow the Deep One Class model from image anomaly detection: we weight the descriptive loss of the multi-classification task and the compactness loss of the One Class model, and update the parameters of the deep learning model with the weighted loss. This ensures that the model has sufficiently strong feature extraction capability while also learning a more compact representation of the single-class data.
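The weighting of the two losses can be illustrated in plain Python. This is a sketch under stated assumptions, not our training code: the descriptive loss is taken to be softmax cross-entropy, the compactness loss is simplified to the mean squared distance of the single-class feature vectors from their batch mean, and the `weight` value is hypothetical (the weighting actually used is not stated here).

```python
import math


def descriptive_loss(logits, labels):
    """Mean softmax cross-entropy over a batch (multi-classification term)."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[label]
    return total / len(logits)


def compactness_loss(features):
    """Mean squared distance of each feature vector from the batch mean."""
    n, k = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(k)]
    return sum((f[j] - mean[j]) ** 2 for f in features for j in range(k)) / (n * k)


def combined_loss(logits, labels, target_features, weight=0.1):
    """Descriptive loss plus weighted compactness loss (`weight` is hypothetical)."""
    return descriptive_loss(logits, labels) + weight * compactness_loss(target_features)


# Toy batch: two samples, two classes, and tightly clustered target features.
loss = combined_loss(
    logits=[[2.0, 0.0], [0.0, 2.0]],
    labels=[0, 1],
    target_features=[[1.0, 1.0], [1.0, 1.0]],
)
print(loss)
```

A larger `weight` pushes the encoder toward tighter single-class clusters, while a smaller one favors the multi-class separation.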

In the actual training of the model, the sample size of some single-class species is small, and the Transformer Encoder model needs more data than is available for them. We therefore treat the above training process as the fine-tuning stage of the model. That is, before the above training, we first train the Transformer Encoder and the classification layer on all the data to obtain pre-trained parameters, and then transfer these parameters to the above training process to fine-tune the model.

For training, the fine-tuning process was set to 10 epochs, and the optimizer was Adam with a learning rate of 0.0005 and a corresponding learning rate decay.

These parameters and the optimizer were chosen after repeatedly adjusting the model and comparing the results.

The classification accuracy of the model reached 97%, and the remaining confusion is almost entirely between similar subtypes. This is supported by the fact that, after we regrouped the species according to similar subtypes, the accuracy reached 100%.
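The effect of regrouping similar subtypes on accuracy can be shown with a small sketch. The label names, the grouping, and the predictions below are made up for illustration and are not our data; the helper names `accuracy` and `merge_labels` are likewise hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def merge_labels(labels, groups):
    """Map each label to its group; labels without a group are kept as-is."""
    return [groups.get(label, label) for label in labels]


# Hypothetical fine-grained labels and predictions.
y_true = ["mouse_A", "mouse_B", "pig", "mouse_A"]
y_pred = ["mouse_B", "mouse_B", "pig", "mouse_A"]
groups = {"mouse_A": "mouse", "mouse_B": "mouse"}

fine = accuracy(y_true, y_pred)
coarse = accuracy(merge_labels(y_true, groups), merge_labels(y_pred, groups))
print(fine, coarse)
```

If the only errors are between labels in the same group, the coarse accuracy rises to 1.0, which is the pattern described above.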

t-SNE

To better analyze the encoding performance of the model, and to visualize the shape of our encoding in the high-dimensional space for analysis and debugging, we used t-SNE to reduce the dimensionality of the encoding space and visualize it.

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a commonly used technique for dimensionality reduction and visualization. It constructs a representation in a low-dimensional space based on the similarity between high-dimensional data points.

t-SNE achieves dimensionality reduction through two main steps. First, it computes the similarity between data points in the high-dimensional space. This is done by measuring the Euclidean distance or other similarity metrics between data points. Then, for each data point, t-SNE constructs a probability distribution based on its similarity, i.e., the probability of data points that are similar to it in the high-dimensional space.

In the second step, t-SNE tries to reconstruct these probability distributions in the low-dimensional space in order to minimize the difference between the high-dimensional space and the low-dimensional space. It uses optimization techniques such as gradient descent to minimize this difference and map the data points into the low-dimensional space.

t-SNE transforms the similarity between data points into probabilities: the similarity of data points in the original space is represented by a Gaussian-based joint distribution, and the similarity of data points in the embedding space is represented by a Student's t-distribution.
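In the standard formulation of t-SNE, these two similarity distributions can be written as below. The symbols are introduced here for illustration: x_i are the N high-dimensional points, y_i their low-dimensional images, and sigma_i a per-point bandwidth that t-SNE chooses from a user-set perplexity.

```latex
\begin{align}
p_{j \mid i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}, \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.
\end{align}
```

The heavy tails of the Student's t-distribution (one degree of freedom) let moderately distant points sit far apart in the embedding, which reduces crowding in two or three dimensions.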

The quality of the embedding is assessed by the KL divergence between the joint probability distributions in the original and embedding spaces. That is, the KL divergence is used as the loss function, which is minimized by a gradient descent algorithm until convergence.
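The loss can be made concrete with a small pure-Python sketch that computes the high-dimensional affinities, the low-dimensional affinities, and the KL divergence between them. It is a simplification for illustration: it uses one fixed bandwidth `sigma` for all points, whereas real t-SNE tunes a per-point bandwidth from the perplexity, and it only evaluates the loss rather than optimizing it. All function names and the toy points are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_affinities(points, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij (shared, fixed bandwidth)."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]


def low_dim_affinities(embedding):
    """Student's t (one degree of freedom) affinities q_ij."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all pairs with p_ij > 0."""
    return sum(
        pij * math.log(pij / max(qij, eps))
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Toy data: two tight pairs of points in 3-D.
points = [[0, 0, 0], [0, 0, 1], [5, 5, 5], [5, 5, 6]]
P = high_dim_affinities(points)
good = [[0, 0], [0, 1], [5, 5], [5, 6]]  # keeps each pair together
bad = [[0, 0], [5, 5], [0, 1], [5, 6]]   # splits each pair apart
kl_good = kl_divergence(P, low_dim_affinities(good))
kl_bad = kl_divergence(P, low_dim_affinities(bad))
print(kl_good, kl_bad)
```

The embedding that keeps neighbors together gives the lower KL divergence, which is the quantity gradient descent drives down.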

First of all, t-SNE can help us visualize high-dimensional datasets in low-dimensional space, thus revealing the intrinsic structure and relationships among the data. By mapping the high-dimensional data points into two- or three-dimensional space, we can observe the distribution of data points, clustering, and similarities between data points. This helps us discover important information such as patterns, trends and outliers in the data.

Secondly, t-SNE performs well in maintaining the local structure between data points. Compared with other dimensionality reduction methods, t-SNE is better at preserving the neighborhood relationships between data points in the high-dimensional space. This means that similar data points are closer and dissimilar data points are more separated in the visualization results, thus better presenting the similarities and differences of the data.

In addition, t-SNE visualization can help us discover hidden features in the data. By observing clusters or distributions in the low-dimensional space, we can infer some potential features or attributes present in the data. This is valuable for data analysis and machine learning tasks and can provide clues for subsequent analysis and modeling.

Using t-SNE, we get a better view of the distribution of the model's encoding results. It also helps us analyze the cases where errors occur, such as outliers and confounded classes, so that we can better iterate on the model.