Safety


Overview

Antibody drugs, a class of biologics, are widely used for immune-related diseases and are generally considered safe in clinical practice. Our model targets the laboratory research and development stage of the drug pipeline, with a strong emphasis on the safety of training data and the reliability of output sequences. We evaluated several categories of risk, reflecting the responsible nature of our design.

Safety of Antibody Drugs

Antibody drugs generally do not pose a risk of drug-drug interactions with common small-molecule drugs. Because of their high specificity, they are less likely to affect normal tissues, which lowers the risk of adverse reactions, and their lower immunogenicity reduces the risk of allergic reactions and immune-related adverse events. Antibiotics, for example, which are widely used in animal treatment, can kill beneficial bacteria and drive the development of antibiotic resistance, making subsequent treatment more difficult. Antibody drugs do not face these challenges and can also treat diseases caused by other pathogens, such as viral infections.

Data Safety

For our project, antibody sequence datasets for many species are relatively limited, especially compared with human antibody data. This scarcity makes it harder for AI models to design antibodies for these species, since there may not be enough samples for accurate prediction and optimization. If the dataset contains biases or errors, the unpredictability of the results can compromise safety and may lead to the design of antibodies with adverse effects. To address this, we performed homology screening: we retained only highly homologous data and manually filtered out entries that are not antibodies but other immunogenicity-related cellular products.
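A rough sketch of the kind of homology filter described above is shown below; the identity measure, threshold, and function names here are illustrative stand-ins, not our exact pipeline:

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def filter_homologous(seqs, reference, min_identity=0.7):
    """Keep only sequences sufficiently similar to a trusted reference."""
    return [s for s in seqs if sequence_identity(s, reference) >= min_identity]

# Example: a divergent, non-antibody-like sequence is dropped.
reference = "QVQLVQSGAE"
kept = filter_homologous(
    ["QVQLVQSGAE", "QVQLVESGGG", "AAAAAAAAAA"], reference, min_identity=0.7
)
```

In practice, such a filter would run on properly aligned sequences; the simple position-wise comparison here only illustrates the thresholding idea.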

In addition, we employed outlier detection. Because the quality of the collected antibody sequences varies, we observed under BLOSUM62 encoding that some data points from the same species lay far from the core of the distribution. (BLOSUM62 is a substitution matrix commonly used to score similarity between protein sequences.) We traced this to two causes: some sequences had undergone frequent in-cell mutation, while others were artificially engineered, non-natural sequences. These outliers degraded the accuracy of our species-specific scoring, so we removed them using the Local Outlier Factor (LOF) algorithm. The expected effect is as follows:

Outlier Detection
Figure 1: Demonstration of outlier detection
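A minimal sketch of the LOF step, using synthetic vectors as stand-ins for our real BLOSUM62-encoded sequences (the `n_neighbors` and `contamination` values here are illustrative, not our tuned settings):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic stand-in for BLOSUM62-encoded sequence vectors:
core = rng.normal(0, 1, size=(200, 20))      # dense cluster of one species
outliers = rng.uniform(-8, 8, size=(5, 20))  # scattered anomalous sequences
X = np.vstack([core, outliers])

# LOF labels each point: -1 = outlier, 1 = inlier.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.025)
labels = lof.fit_predict(X)

# Keep only the inliers for downstream species-specific scoring.
X_clean = X[labels == 1]
```

LOF scores each point by how isolated it is relative to the local density of its neighbors, which suits this setting: heavily mutated or engineered sequences sit far from the dense core of natural sequences for a species.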

Following this, we applied t-SNE visualization to give a more intuitive picture of the effectiveness of our outlier detection:

t-SNE visualization
Figure 2: Data t-SNE visualization
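The visualization step can be sketched as follows. The data here are synthetic stand-ins for encoded antibody sequences from two species, and in practice the 2-D embedding would be plotted with a tool such as matplotlib:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Two synthetic "species" clusters in a 30-dimensional encoding space:
X = np.vstack([
    rng.normal(0, 1, size=(100, 30)),
    rng.normal(6, 1, size=(100, 30)),
])

# Project to 2-D for visual inspection of cluster structure.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

If the encoding is informative, the two groups separate clearly in the 2-D embedding, which is exactly the property the figure above illustrates for real species data.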

As the figure shows, we accurately removed outliers, ensuring that the species-specific antibody sequence data we use is more representative and safer. This reduces interference with our model and lowers the likelihood of immunogenic reactions when the antibodies are used in a particular species, making antibody drugs safer.

Output Sequence Safety: The Interpretability Problem

Whether antibody drugs are intended for humans or for animals, their safety must be ensured before they are applied to living organisms. At this level, the limited interpretability of AI models becomes an important constraint.

During our Human Practices work, we visited and spoke with many companies. We found that AI adoption in the biomedical industry is still low. For example, some companies told us they are unwilling to use AI de novo design, currently very popular in academia, to design proteins, because they do not find its results credible enough.

Understanding how an AI model predicts and optimizes antibody sequences and properties is crucial to ensuring its reliability and safety. However, current deep learning models often have complex structures and huge numbers of parameters, which makes it difficult to explain the reasons behind their decisions and raises doubts about their trustworthiness. We improve the interpretability of our output by constraining mutation generation to conservative substitutions and by thoroughly screening for homologous data wherever possible.
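One simple way to enforce a conservative mutation strategy, sketched here with an illustrative subset of BLOSUM62 scores (our actual generation procedure may differ), is to accept a substitution only when its BLOSUM62 score is non-negative:

```python
# Illustrative subset of the real BLOSUM62 matrix (the full matrix covers
# all 20x20 residue pairs); non-negative scores mark substitutions between
# biochemically similar residues.
BLOSUM62_SUBSET = {
    ("I", "L"): 2, ("I", "V"): 3, ("K", "R"): 2,
    ("D", "E"): 2, ("S", "T"): 1, ("F", "W"): 1,
    ("G", "P"): -2, ("C", "W"): -2,
}

def is_conservative(orig: str, new: str, threshold: int = 0) -> bool:
    """Accept a proposed point mutation only if BLOSUM62 deems it conservative."""
    score = BLOSUM62_SUBSET.get((orig, new), BLOSUM62_SUBSET.get((new, orig)))
    if score is None:
        raise KeyError(f"pair ({orig}, {new}) not in illustrative subset")
    return score >= threshold

# Example: I -> L is conservative; G -> P is not.
```

Restricting generated mutations to such substitutions keeps output sequences close to natural antibody space, which makes their behavior easier to reason about.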

t-SNE visualization
Figure 3: t-SNE visualization

At the same time, we use visualization in many places in the project to enhance interpretability. For example, we used t-SNE when evaluating encoding quality and species specificity. As the figure shows, the encodings are highly separable, and data from similar species subtypes cluster together. These results indicate that our model behaves reasonably, which makes people more likely to trust our outputs and researchers more willing to try our model.

In addition, our structure-based scoring, aided by mature tools such as IgFold, quantifies the similarity between the spatial structure of a generated sequence and that of the original. This helps users judge the reliability of generated sequences: if the folded structure remains similar, the function is also likely to be preserved, which in effect provides a second safety net for the antibody.

Structure scoring
Figure 4: Structure scoring
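Structural similarity between a generated and an original antibody can be quantified, for example, by the backbone RMSD after optimal superposition. The Kabsch-based sketch below is a generic illustration of such a score, not IgFold's internal method:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Uses the Kabsch algorithm: center both point clouds, find the
    rotation minimizing the squared deviation via SVD, then compute RMSD.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                      # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])       # guard against improper rotations
    R = Vt.T @ D @ U.T               # optimal rotation mapping P onto Q
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

Applied to, say, the Cα coordinates parsed from an original and a generated PDB file, a low RMSD indicates that the generated sequence likely folds into a similar structure.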

Generated PDB file
Figure 5: Generated PDB file