Inspiration
In our experimental design, we devised a substitutable nanobody in which the replaceable region of the antibody is tailored to the characteristics of the antigen. We initially considered immunological methods for obtaining the desired sequence, but this approach raised two problems. First, after consulting the company, we found that the experimental timeline for immunological methods was too long and the costs too high. Second, the resulting antibody sequences were not always expressed correctly, and did not always form the correct structure, in the plants to which they were administered.
To address these challenges, we set out to build a generative model for antibody sequences guided by the structural and sequence features of antigens. Compared with traditional experimental approaches, machine learning offers a high-throughput, efficient route to antibody sequence selection. Generative models can also produce sequences that do not occur in nature, which may function better than similar natural sequences. In addition, existing protein structure prediction models can predict a protein's structure from its sequence alone, giving a multidimensional and more intuitive assessment of protein function.
Requirement analysis
Our target audience is researchers who study antigen-antibody interactions and relevant industry professionals. We aim to develop user-friendly software that lets them use the tool and interpret the results independently, without outside expertise. Our core objective is a generative model for antibody sequences based on antigen structure. Finally, we commit to storing no personal information about users.
From these requirements, we conclude that our model should have the following attributes:
(1) High-throughput data processing, so that it can handle large training datasets.
(2) Good visualization, so that both users and researchers can adjust the model easily.
(3) An architecture that is as well suited as possible to antigen-antibody generation tasks.
Design
To facilitate further research and application by future teams and researchers, we decided to open-source the entire model framework, with user-friendly annotations for guidance. This lets users adjust parameters for their own tasks. We also created a visual interface so that people without programming expertise can try AI-assisted antibody design. The entire program is written in Python using the PyTorch and TensorFlow frameworks and stores no personal information. Our model aims to generate antibody sequences from antigen sequences, and we are considering how to include key structural information among the model's features. Users can choose among several protein sequence embedding options, using either the model's built-in embedding or external embedding features, which makes training more flexible.
Coding
Data Preprocessing
This section provides methods for processing PDB data. It reads antigen and antibody sequences from PDB files and categorizes them, and it also extracts structural information and converts it into 3D arrays for later processing. The processed files are saved in CSV format in the script's directory so that later scripts can read them. The vocabulary file we provide draws on several scoring matrices and amino acid encoding schemes, so users can customize sequence alignment methods and protein sequence encodings.
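As a rough illustration of the steps described above, the sketch below reads C-alpha ATOM records from PDB-format text, builds one sequence per chain, marks the C-alpha positions in a fixed-size 3D occupancy grid, and writes the sequences as CSV text. The function names, the grid size of 32, and the 2 Å voxel size are assumptions made for this example and are not taken from our scripts.

```python
import csv
import io

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_ca_atoms(pdb_text):
    """Return {chain_id: [(one_letter_residue, (x, y, z)), ...]} for C-alpha atoms."""
    chains = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name in columns 13-16, residue name in
        # 18-20, chain ID in 22, and x/y/z in columns 31-38, 39-46, 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue = THREE_TO_ONE.get(line[17:20], "X")  # "X" for unknown residues
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            chains.setdefault(line[21], []).append((residue, xyz))
    return chains


def chain_sequences(chains):
    """Join the residues of each chain into a one-letter sequence string."""
    return {cid: "".join(res for res, _ in atoms) for cid, atoms in chains.items()}


def voxelize(coords, grid_size=32, voxel=2.0):
    """Mark occupied cells in a grid_size^3 grid centred on the mean coordinate.

    Atoms that fall outside the grid are skipped (assumed behaviour for this sketch).
    """
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    if not coords:
        return grid
    centre = [sum(c[i] for c in coords) / len(coords) for i in range(3)]
    half = grid_size // 2
    for c in coords:
        idx = [int((c[i] - centre[i]) // voxel) + half for i in range(3)]
        if all(0 <= k < grid_size for k in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


def sequences_to_csv(sequences):
    """Serialize {chain_id: sequence} as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["chain", "sequence"])
    for cid in sorted(sequences):
        writer.writerow([cid, sequences[cid]])
    return buffer.getvalue()
```

The sketch ignores alternate locations, insertion codes, and HETATM records, which a production parser would need to handle.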
Model Core
We chose Python (3.x) as the primary language and the mature deep learning frameworks PyTorch and TensorFlow for model construction. The model's core consists of a Variational Autoencoder (VAE) and unidirectional or bidirectional LSTM models for encoding and generation. The loss function is the Kullback-Leibler divergence, which measures the difference between the fitted distribution and the actual distribution. The decoder is configured to generate sequences from default states without requiring specific input, which makes it easier to use. The model's core can also be called through the main function for training, model saving, sequence generation, and other operations.
Modular Components
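To make the Kullback-Leibler term concrete, here is a minimal sketch in plain Python of the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, together with the usual reparameterized sampling step. It assumes the encoder outputs a mean and a log-variance per latent dimension. The function names are hypothetical, and the sketch is not our PyTorch/TensorFlow implementation. VAE objectives usually pair this KL term with a reconstruction term, and the sketch covers only the KL part.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]
```

The KL term is zero when the posterior already equals the prior (zero mean, unit variance) and grows as the mean moves away from zero or the variance moves away from one.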
Ankh-Based Protein Sequence Embedding
The choice of protein sequence embedding method significantly affects how well the model trains. In our core model, to keep the design lightweight and easy to use, we use one-hot encoding from the TensorFlow framework directly and avoid additional embedding modes. However, to keep the model scalable and practical in the future, we also offer users the option of embeddings from Ankh, a large protein language model [2]. We expect these embeddings to improve performance in subsequent predictions.
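For readers unfamiliar with one-hot encoding, the following sketch shows the idea in plain Python rather than with the TensorFlow call used in our model. The alphabet order, the fixed length, and the choice to represent padding and unknown residues as all-zero rows are assumptions made for this example.

```python
# The 20 standard amino acids; the ordering here is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as max_len rows of 20 zeros and ones.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Unknown residue letters also give all-zero rows.
    """
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(sequence):
            idx = AA_INDEX.get(sequence[pos].upper())
            if idx is not None:
                row[idx] = 1
        rows.append(row)
    return rows
```

Each residue becomes a vector with a single 1, so the encoding carries no information about similarity between amino acids, which is one reason learned embeddings such as Ankh may help.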
ESM2-Based Structural Information Extraction
We aim to retain structural information that is often lost in seq2seq models but is crucial for antigen-antibody interactions. We therefore extract structural features as a reference for prediction; once re-encoded, these features contribute to antibody sequence prediction alongside the sequence features. We do this with the ESM2 protein language model developed by Meta AI (formerly Facebook AI Research). We also provide a framework for fine-tuning the large ESM2 model, so that researchers can customize it for their specific tasks.
VAE-Based Structural Information Extraction
We use a simple VAE model to learn from the structural information obtained from the PDB files. Our aim is for the extracted features, once re-encoded, to contribute to generating the final antibody sequences.
Database Introduction
We obtained a dataset of 866 antibody-antigen complexes in the Protein Data Bank (PDB) format (Figure 1A) from the Antibody Database (AbDb). These PDB files not only contain antigen and antibody sequence information but also store molecular coordinates of the complexes in space, aiding in the extraction of structural information.
Our Software
We have developed a web application for our model so that users and researchers can use our software conveniently. The frontend is built with HTML and Vue.
You do not need the software's other functions to generate an antibody. Enter your antigen sequence in the input box and click the magnifying glass icon, and the software generates the corresponding antibody along with the time and other relevant information.
On this interface, users can submit their original PDB files. The software will provide an encoded CSV file and an NPZ format 3D array to store the complex structure information of the file. These files will be automatically downloaded through the browser after processing.
In this interface, users can upload either the original PDB file or a CSV file containing antigen-antibody sequences for embedding using the Ankh model. After processing, the model will provide a CSV file containing embedding features for users to download through the browser.
In this interface, you can upload an antigen sequence or a PDB file. The software extracts antigen structure information either by training a VAE model from scratch or by using default parameters.
The above outlines the user workflow of our software. The integration of our frontend and backend still requires further work, however. The visual interface is currently in the testing phase, and we will continue to develop and improve it to ensure a better user experience.
Contribution to Synthetic Biology and the iGEM Community
Our software contributes significantly to the field of synthetic biology by offering compatibility with common data formats. We utilize CSV and PDB file formats, both widely used in the synthetic biology domain. This compatibility facilitates the use of our software for researchers and fellow iGEM community members.
Contribution to the Field of Synthetic Biology
Our model, while primarily trained on antigen-antibody complexes, fundamentally deals with protein-protein interactions. Therefore, in the realm of synthetic biology projects involving protein-protein interactions, such as the engineering of more efficient enzymes through synthetic biology techniques, our model can be employed for high-throughput sequence generation. Subsequently, by considering factors like structure, electrostatics, folding, and cleavage sites, researchers can evaluate the impact of these factors on protein performance before proceeding with wet lab experiments. This approach not only enables high-throughput data processing but also saves time, human resources, and financial resources. Thus, it makes a substantial contribution to protein-related synthetic biology research.
Contribution to Our Project
Our project involves replacing ID sequences of nanobodies for various diseases. The replacement sequences need to exhibit strong antigen affinity. This process typically involves time-intensive and costly immunoscreening. However, our model can generate a large number of sequences computationally, assisting in this task. This capability plays a crucial role in expanding our project's scope from rice blast disease to other diseases, significantly reducing the time and financial resources required.
Future Work
The model has not yet undergone experimental validation
Due to time constraints, we have not conducted wet lab experiments to confirm the affinity of the antibodies generated by the model, nor have we validated whether the protein structures align with our expected results. Therefore, the credibility of the model still requires subsequent experimental verification.
The data processing section needs further refinement
When constructing 3D models from PDB files, larger protein complexes extend beyond the boundaries of the preset 3D array, which leads to unknown errors. We intend to develop a dynamic modeling approach that improves the tolerance for reading and displaying the 3D models of PDB proteins correctly.
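One possible direction, sketched here under stated assumptions, is to derive the grid dimensions from the bounding box of each complex instead of fixing them in advance. The function names, the 2 Å voxel size, and the one-cell padding are hypothetical choices for this example, and this is not an implemented part of our software.

```python
import math


def fit_grid(coords, voxel=2.0, padding=1):
    """Return (origin, shape) of a voxel grid that contains every coordinate.

    `padding` adds that many empty cells on each side of the bounding box.
    """
    if not coords:
        raise ValueError("at least one coordinate is required")
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    origin = tuple(mins[i] - padding * voxel for i in range(3))
    shape = tuple(
        int(math.floor((maxs[i] - origin[i]) / voxel)) + 1 + padding
        for i in range(3)
    )
    return origin, shape


def to_index(coord, origin, voxel=2.0):
    """Map a coordinate to its (i, j, k) cell in a grid that starts at `origin`."""
    return tuple(
        int(math.floor((coord[i] - origin[i]) / voxel)) for i in range(3)
    )
```

Because the shape now varies between complexes, downstream models that expect a fixed input size would still need padding, cropping, or pooling to a common shape.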
The visualization interface still requires improvement
We need to create a more sophisticated, user-friendly interactive visualization interface. Due to time constraints, we did not implement the envisioned mouse-driven modular configuration of the model. We hope to improve the model's interactivity and ease of use in future work, so that researchers unfamiliar with programming can use it.
The re-encoding and modular interfaces need refinement
So far, we have outlined the modular approach and implemented the functions of the individual modules. Integrating the modules involves converting between different tensors and data types, so a standardized method is needed to ensure that data can be read and analyzed across modules. In particular, a fixed tensor shape must be specified so that different features can be combined. How the features extracted by the different modules should be integrated for the final sequence generation still requires further research and development.
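As one simple example of what a fixed shape could look like, the sketch below pads or truncates each module's feature vector to an agreed length and concatenates the results in a fixed order. The module names, the lengths, and zero padding are all assumptions for illustration, and concatenation is only one of several possible integration strategies.

```python
def fit_length(vector, length, pad_value=0.0):
    """Truncate or right-pad a feature vector so it has exactly `length` entries."""
    trimmed = list(vector)[:length]
    return trimmed + [pad_value] * (length - len(trimmed))


def fuse_features(features, lengths):
    """Concatenate per-module features into one fixed-length vector.

    features: {module_name: list of floats}
    lengths:  {module_name: fixed length agreed for that module}
    Modules are concatenated in sorted name order; a missing module
    contributes a block of padding values.
    """
    fused = []
    for name in sorted(lengths):
        fused.extend(fit_length(features.get(name, []), lengths[name]))
    return fused
```

With a layout like this, every fused vector has length equal to the sum of the agreed lengths, regardless of which modules produced output for a given input.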
Further refinement is required in the area of sequence generation
While the encoding part of our sequence generation model updates effectively during training, the decoding process that generates meaningful sequences still requires further development and debugging before the model is fully functional.
References
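For context, the simplest form of the missing decoding step is greedy decoding: at each position, take the most probable token from the decoder output and stop at an end-of-sequence token. The sketch below assumes the decoder yields one row of 21 probabilities per position (20 amino acids plus a hypothetical end token at index 20). This layout is an assumption for the example, not a description of our decoder.

```python
# Assumed output layout: 20 amino acids followed by one end-of-sequence token.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
END_INDEX = len(ALPHABET)  # hypothetical end-of-sequence index (20)


def greedy_decode(prob_rows):
    """Turn per-position probability rows into an amino acid sequence.

    Picks the highest-probability token at each position and stops at the
    first end-of-sequence token or when the rows run out.
    """
    residues = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)
        if best == END_INDEX:
            break
        residues.append(ALPHABET[best])
    return "".join(residues)
```

Greedy decoding is deterministic; sampling from each row instead would give more diverse candidate sequences at the cost of reproducibility.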
[1] Akbar, R. (2021). A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Reports, [online] 34(11), p.108856. doi:https://doi.org/10.1016/j.celrep.2021.108856.
[2] Elnaggar, A., Essam, H., Salah-Eldin, W., Moustafa, W., Elkerdawy, M., Rochereau, C. and Rost, B. (2023). Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. [online] arXiv.org. doi:https://doi.org/10.48550/arXiv.2301.06568.
[3] Kanduri, C., Pavlović, M., Scheffer, L., Motwani, K., Chernigovskaya, M., Greiff, V. and Sandve, G.K. (2022). Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification. GigaScience, 11. doi:https://doi.org/10.1093/gigascience/giac046.
[4] Rao, R.M. (2020). Transformer protein language models are unsupervised structure learners. [online] Available at: https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1.
[5] Ruffolo, J.A., Chu, L.-S., Mahajan, S.P. and Gray, J.J. (2023). Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nature Communications, [online] 14(1), p.2389. doi:https://doi.org/10.1038/s41467-023-38063-x.
[6] Ruffolo, J.A., Sulam, J. and Gray, J.J. (2021). Antibody structure prediction using interpretable deep learning. Patterns, p.100406. doi:https://doi.org/10.1016/j.patter.2021.100406.
[7] Schneider, C., Buchanan, A., Taddese, B. and Deane, C.M. (2021). DLAB: deep learning methods for structure-based virtual screening of antibodies. Bioinformatics, 38(2), pp.377–383. doi:https://doi.org/10.1093/bioinformatics/btab660.
[8] Wu, J.V., Wu, F., Jiang, B., Liu, W. and Zhao, P. (2022). tFold-Ab: Fast and Accurate Antibody Structure Prediction without Sequence Homologs. bioRxiv (Cold Spring Harbor Laboratory). doi:https://doi.org/10.1101/2022.11.10.515918.