Software


Overview

Our project generates antibody sequences for a target species from antibodies of other origins that bind the same target-species antigen, enabling computer-aided multi-species antibody design.

Inspired by the humanization of mouse-derived antibodies, our software starts from an antibody sequence of another origin (such as a mouse-derived antibody) raised against an antigen of the target species, automatically generates candidate sequences that may have better efficacy and lower immunogenicity in that species, and scores them comprehensively.

Currently, the heavy chain design covers 55 species, and the light chain design covers 9 common species.

Users can either enter an antibody sequence from their own experimental results or select an entry, by species and entry number, from the antibody database we provide, and then enter the name of the target species. Our model returns multiple candidate antibody sequences that aim to retain the original sequence's specific binding to its antigen while minimizing the risk of an immune response in the target species. We provide multiple scoring metrics for these candidates as well as for the original sequence.

Figure 1. Software pipeline overview

① Our project uses a large set of highly homologous antibody sequences collected with BLAST, covering almost all common animal species.

② Starting from the original sequence entered by the user, we generate mutations that increase the "target species identity" score. Mutations are applied only to the framework regions; the CDR residues are left unchanged to preserve the antibody's binding properties.

③ Using the collected multi-species antibody data, we built a Deep One Class model with a Transformer encoder, and from it a scoring model and a representative template feature vector for each species. On this basis, we compute a sequence score representing how closely a sequence resembles the target species (its degree of "speciesization").

④ We integrated the antibody structure prediction model IgFold, predicted the folded structure of the species-adapted sequences, and computed a structure score from the similarity between the resulting structure files.
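Step ② can be pictured with a minimal sketch. It assumes the sequence has already been split into FR and CDR segments; the `Regions` class, the `mutate_framework` helper, and the toy segment strings are hypothetical illustrations, not the project's code. The sketch shows only that a substitution lands in a framework segment while the CDRs are copied through unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regions:
    """Variable region split into framework (FR) and CDR segments."""
    fr: tuple   # (FR1, FR2, FR3, FR4)
    cdr: tuple  # (CDR1, CDR2, CDR3)

    def assemble(self) -> str:
        # Interleave the segments as FR1-CDR1-FR2-CDR2-FR3-CDR3-FR4.
        parts = []
        for i, fr_segment in enumerate(self.fr):
            parts.append(fr_segment)
            if i < len(self.cdr):
                parts.append(self.cdr[i])
        return "".join(parts)


def mutate_framework(regions, fr_index, position, new_residue):
    """Return a copy with one substitution in a framework segment.

    The CDR tuple is passed through untouched.
    """
    fr = list(regions.fr)
    segment = fr[fr_index]
    if not 0 <= position < len(segment):
        raise IndexError("position outside the framework segment")
    fr[fr_index] = segment[:position] + new_residue + segment[position + 1:]
    return Regions(tuple(fr), regions.cdr)


# Toy placeholder segments (hypothetical, not real antibody regions).
original = Regions(fr=("AAAA", "CCCC", "DDDD", "EEEE"), cdr=("xx", "yy", "zz"))
variant = mutate_framework(original, fr_index=1, position=2, new_residue="K")
```

How candidate substitutions are chosen to raise the identity score is not shown here; the sketch covers only the framework-only constraint.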

Target Users

Antibody drugs are produced by antibody engineering, which builds on cell engineering and genetic engineering. They offer high specificity and uniform properties, and can be prepared against specific targets. They are widely used to treat various diseases, especially tumors, and their application prospects have attracted much attention.

At present, in the field of pharmaceutical research and development and application, antibody drugs and antibody compound drugs are gradually occupying a dominant position in the market.

In addition to human demand for antibody drugs, many animals are also in urgent need of rapid development and application of antibody drugs.

Our project hopes to help groups committed to pet care and pet rescue. We hope to provide faster, AI-assisted support for antibody drug development for cats, dogs and other pets.

Our project hopes to help groups committed to animal husbandry and the development of veterinary drugs for livestock. We hope to accelerate the development of antibody drugs for livestock and poultry such as cattle, sheep, pigs and chickens, and to enable antibody drugs with fewer side effects and mild, long-lasting effects.

Our project hopes to help all groups who care about animal drug experiments. We hope to use deep learning tools to accelerate the development of multi-species antibody drugs, to make the most of existing antibody drugs and computer-assisted methods, and to reduce inefficient animal experiments in antibody drug development.


Innovation

Our project has three main innovations.

Our biggest innovation, and the main significance of our project, is extending the humanization of mouse antibodies to multiple species. To this end, we independently collected nearly 40,000 antibody sequences covering more than 200 species through BLAST. We screened the species and antibody data to build our own database, and on this basis we adapt antibodies to multiple species.

The second innovation is that we transferred the Deep One Class model, originally used for image anomaly detection, to multi-species scoring. By combining the One Class concept with the strong feature extraction of deep learning models, we obtain reasonable species scores for antibody data and address the problem of imbalanced sample data.

The third innovation is an interface for structural scoring, which is currently not available in mainstream humanization tools such as BioPhi and SAbPred. We call the IgFold model and use TM-Score and RMSD to score the similarity of the predicted protein PDB files.

Algorithm Summary

In terms of algorithms, our main contribution is a multi-species antibody sequence scoring tool. The model is designed for the severe imbalance in antibody data across species and prevents species with few samples from being overwhelmed by species with many samples.

The training scheme we finally adopted combines the strong feature extraction of a multi-class deep learning model with the One Class model's ability to keep single-class data from being interfered with by other data.

Borrowing from the Deep One Class model in image anomaly detection, we form a weighted sum of the descriptive loss of the multi-class task and the single-class compactness loss of the One Class model, and use this weighted loss to update the parameters of the deep learning model. This keeps the feature extraction strong while making the features of the single target class more compact.
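The weighted loss described above can be sketched in NumPy. This is an illustration under assumptions, not the project's training code: the descriptive loss is taken to be softmax cross-entropy, the compactness loss is taken to be the mean squared distance of each target-class feature from the mean of the other features in the batch, and the `weight` value is a hypothetical placeholder.

```python
import numpy as np


def descriptiveness_loss(logits, labels):
    """Mean softmax cross-entropy over a multi-class batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def compactness_loss(features):
    """Mean squared distance of each feature from the mean of the others.

    `features` has shape (n, k) and holds target-class samples only (n >= 2).
    """
    n, k = features.shape
    total = features.sum(axis=0, keepdims=True)
    mean_of_others = (total - features) / (n - 1)
    diff = features - mean_of_others
    return float((diff ** 2).sum() / (n * k))


def combined_loss(logits, labels, target_features, weight=0.1):
    """Weighted sum used to update the network (weight is an assumed value)."""
    return (descriptiveness_loss(logits, labels)
            + weight * compactness_loss(target_features))
```

A batch whose target-class features are all identical has zero compactness loss, so minimizing the second term pulls the target species' features together while the first term keeps the features discriminative.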

Figure 4. Training progress
Figure 5. Testing progress

For each species, we hold out a proportion of the data for testing, train a model with that species as the Target Class, and average the model's output features to obtain the template feature vector of the species.

To score a test sequence against a species, we pass the sequence through the model of that species to obtain its output features; the sequence score is the Pearson correlation coefficient between these features and the template feature vector of the species.
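The template and scoring steps can be illustrated with a short sketch. The function names are hypothetical and the feature vectors are assumed to be plain numeric arrays produced by the trained model; only the averaging and the Pearson correlation are shown.

```python
import numpy as np


def template_vector(features):
    """Average per-sequence feature vectors (rows) into one species template."""
    return np.asarray(features, dtype=float).mean(axis=0)


def sequence_score(feature, template):
    """Pearson correlation between one feature vector and the template."""
    x = np.asarray(feature, dtype=float)
    y = np.asarray(template, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return float((xc * yc).sum() / denom)
```

The score lies between -1 and 1; values closer to 1 mean the sequence's features vary in the same way as the species template.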

Usage

Here, we give a simple example of software usage.

Assume that you have a heavy chain variable region sequence that is derived from mice and binds the antigen for which you want an antibody drug in the target species:

EVMLVESGGGLVMPGGSLKLSCAAS GFTFSNYAMSWVRQIPEKRLEWVATI SIGGHFTYFPDSVKGRFTISRDNAKNTLYLRMSSLRSEDTAMYYCVR HEGYGRPYFDYWGQGTTLTVSS

The regions are, in order, FR1-CDR1-FR2-CDR2-FR3-CDR3-FR4. We accept sequences whose startnum is H1 or H2, that is, whose FR1 is 25 or 24 residues long; the subsequent FR fragments must also be complete.

Concatenate the FR fragments of this sequence and enter the complete FR sequence as the first line of the input_seq file. Then enter the three CDR fragments on the second, third and fourth lines. See the format below.

We use this input format because our generation method leaves the CDR residues unchanged in order to preserve the specificity of the antibody as far as possible.
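As a rough illustration of this layout, the sketch below builds the four lines from already-split segments. The `build_input_text` helper is hypothetical, the segments are toy placeholders, and joining the FR fragments without a separator is an assumption; check the format figure before relying on it.

```python
def build_input_text(frs, cdrs):
    """Build the four-line input: concatenated FRs, then CDR1, CDR2, CDR3.

    Assumes the FR fragments are joined without a separator.
    """
    if len(frs) != 4 or len(cdrs) != 3:
        raise ValueError("expected four FR fragments and three CDR fragments")
    if len(frs[0]) not in (24, 25):
        raise ValueError("FR1 must be 24 or 25 residues long")
    lines = ["".join(frs), *cdrs]
    return "\n".join(lines) + "\n"


# Toy placeholder segments (hypothetical, not real antibody regions).
text = build_input_text(
    frs=("A" * 25, "CCCC", "DDDD", "EEEE"),
    cdrs=("xx", "yy", "zz"),
)
```

The returned string can then be written to the input_seq file.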

Figure 6. Our input method

You can then enter the species you want to adapt your antibody sequence to in the species_target.txt file. For a list of species names supported by our project, please see the corresponding species file (below).

antibody_data file

After that, run seq_gen.py; the results of the sequence generation are written to the ans_com file. This is a CSV file with two columns: the first is the complete antibody sequence, and the second is the sequence score of that sequence for the target species.

The last row of the output is the original input sequence, together with its species score for comparison with the generated sequences.
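The sketch below shows one way to read such a file. It assumes there is no header row and that every row is `sequence,score`; the sample text uses made-up sequences and scores, not real output of the software.

```python
import csv
import io


def parse_results(csv_text):
    """Split result rows into (generated, original).

    Assumes no header row and that the last row is the original input.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    records = [(seq, float(score)) for seq, score in rows]
    if not records:
        raise ValueError("no rows found")
    *generated, original = records
    return generated, original


def best_generated(generated):
    """Return the generated (sequence, score) pair with the highest score."""
    return max(generated, key=lambda record: record[1])


# Made-up sample content for illustration only.
sample = "SEQA,0.91\nSEQB,0.87\nSEQO,0.42\n"
generated, original = parse_results(sample)
top = best_generated(generated)
```

For a real run, read the file contents first, for example with `open("ans_com.csv").read()`, and pass them to `parse_results`.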

Figure 10. Sequence generation
Figure 11. Software output

At this point, if you have configured the IgFold environment and want the structure score of an antibody sequence, write the sequence of your choice in structure_seq.txt. In this file we have already provided the generated sequence with the highest score for the target species. When choosing a sequence, you can use the sequence score as an indication of the degree of mutation. To analyze a different sequence, paste it from the results file into structure_seq.txt.

Figure 12. Structure_seq.txt usage

Afterwards, put the structure_seq.txt, ans_com.csv and structure_score.py files in the same directory and run structure_score.py in the configured environment. The script calls the IgFold interface in your environment and reads the original sequence from the ans_com file, the sequence with the highest sequence score, and the sequence you want to analyze from structure_seq.txt. It generates a PDB structure file for each of these sequences and scores the generated sequences by comparing their structures with that of the original sequence.

Finally, the script writes a scores.csv file, which includes the TM Score of each generated sequence against the original sequence and the TM Local Score of each site; the larger the score, the higher the similarity. The file also provides the RMSD (root mean square deviation); the smaller the RMSD, the higher the similarity.
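To make the RMSD metric concrete, here is a minimal NumPy sketch. It is not the code in structure_score.py: it assumes two equal-length arrays of matched atom coordinates (for example Cα atoms, already extracted from the PDB files) and superposes them with the Kabsch algorithm before measuring the deviation. TM-score is not covered.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape or p.ndim != 2 or p.shape[1] != 3:
        raise ValueError("expected two arrays of shape (N, 3)")
    # Remove translation by centering both sets on their centroids.
    pc = p - p.mean(axis=0)
    qc = q - q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(pc.T @ qc)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rotated = pc @ rotation.T
    return float(np.sqrt(((p_rotated - qc) ** 2).sum(axis=1).mean()))
```

Two structures that differ only by a rotation and a translation give an RMSD of about zero, which is why superposition comes before the comparison.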

Figure 13. Structure scores

You can also open the PDB files to view the folded structures of the generated sequence and the original sequence.

Figure 14. Generated PDB file