Antimicrobial peptides are the core of our approach to treating acne, and finding suitable, effective antimicrobial peptides is crucial for wet-lab experiments; it is also a pressing issue for biologists at large. Our modeling uses artificial intelligence techniques to propose a multistage model ensemble pipeline (MMEP) consisting of classification, ranking, and regression modules. The method efficiently identifies antimicrobial peptides that meet our requirements while maintaining high accuracy: in our experiments it identifies antimicrobial peptides with 94.8% accuracy and predicts the corresponding minimum inhibitory concentration (MIC). To extend its benefits to a broader audience, we have developed a user-friendly website that allows biologists with no programming background to quickly grasp and apply the method. The following sections describe the method in detail.
Overview
Design of the machine-learning pipeline
We employ a data-driven approach to identify the optimal peptide candidates. The central achievement of this work is a plug-and-play machine learning pipeline that allows for a global search among candidate antimicrobial peptides. While it is theoretically possible to rely on a single machine learning model to determine the final predictions of minimum inhibitory concentration, this is often a risky choice as the inductive biases of any individual model can be problematic in this scenario. Therefore, to incorporate the principles of machine learning and data-driven thinking within a framework of global search, we compiled a set of models to expand the hypothesis space of the entire system and designed our methodology as a staged funnel-like filtering pipeline [1].
Our machine learning pipeline is constructed by matching the characteristics of each module to the properties of the dataset. The dataset is compiled from widely published external literature, so measurements depend heavily on differences between laboratory environments, and the training data inevitably carry noise and bias. The three modules trade off accuracy against robustness to this noise: the classification module has the coarsest predictive granularity but the strongest robustness to dataset noise; the ranking module, which orders peptide candidates by antibacterial activity regardless of their exact MIC values, offers moderate accuracy and moderate resistance to dataset bias; and the regression module, trained to predict MIC values directly, has the highest prediction accuracy but is the most susceptible to dataset bias. Based on this analysis of the models and data, we arranged the funnel to start with the classification module, continue with the ranking stage, and end with the regression model, so that the overall performance of the system is optimized.
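To make the funnel structure concrete, the sketch below illustrates how a candidate pool could flow through the three stages. This is our schematic illustration rather than the released implementation; the three callables and the shortlist size are placeholder assumptions.

```python
# Schematic of the three-stage MMEP funnel. The three callables stand in
# for the trained classification, ranking, and regression models
# described below; the shortlist size is an illustrative assumption.

def mmep_funnel(candidates, is_active, rank_score, predict_log_mic,
                shortlist_size=200):
    """Filter encoded peptide candidates through the funnel stages."""
    # Stage 1 (classification): coarse filtering of inactive candidates.
    pool = [c for c in candidates if is_active(c)]

    # Stage 2 (ranking): keep only the candidates predicted most active.
    pool = sorted(pool, key=rank_score, reverse=True)[:shortlist_size]

    # Stage 3 (regression): predict a precise log-MIC for the survivors.
    return [(c, predict_log_mic(c)) for c in pool]
```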
Selection of machine-learning models
With the funnel-shaped pipeline in place, the next step is to select a suitable model for each stage. The first stage of the funnel performs coarse filtering to eliminate the least promising peptide candidates from the pool. Specifically, we trained a binary classifier on our dataset to filter out candidates that lack antibacterial activity. We acknowledge that a classification model can neither rank candidates correctly nor predict their MIC values precisely; however, it is a very effective and direct way to remove negatives, especially when facing a large candidate pool in which most sequences are likely to be inactive. In our experiments we tried six popular machine learning algorithms, namely Logistic Regression (LR), Decision Tree (DT), Adaptive Boosting (AdaBoost), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Gradient Boosting Decision Tree (GBDT), and found that RF, XGBoost, and GBDT performed significantly better than the others. Because a misjudgment in this first stage would cause highly promising antibacterial peptides to be discarded, prediction accuracy is crucial here, so to enhance generalization we fused the three strongest algorithms. Experimental results show that the fused model (see Table 2) outperforms any single base learner.
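The text does not specify the fusion rule; one plausible realization is probability-averaged (soft) voting over the three base learners, sketched below with placeholder hyperparameters. Here `X_train` and `y_train` denote the encoded training features and binary activity labels.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from xgboost import XGBClassifier

# Soft-voting fusion of the three strongest base learners: each model
# contributes its predicted class probabilities, which are averaged.
fused = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=500, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=500)),
        ("gbdt", GradientBoostingClassifier(n_estimators=500)),
    ],
    voting="soft",
)
fused.fit(X_train, y_train)          # encoded features, binary AMP labels
proba = fused.predict_proba(X_test)  # averaged class probabilities
```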
We introduced a ranking module to further refine the MMEP funnel. The basic principle of this module is to randomly select pairs of training samples and optimize the model to rank each pair correctly according to their target values, training with a cross-entropy loss. The ranking module is robust to dataset bias and noise because it only concerns how training samples should be ordered by their experimental MIC values, deliberately disregarding the precise numbers. Given the inherent similarity between the classification and ranking tasks, we selected the best-performing base learner from the classification stage, the XGBoost model, for this stage. Although the fused model performs better, this stage demands less accuracy, so we reduced model complexity to strike a balance between accuracy and parameter count. Figure 6 shows a scatter plot of the predicted MIC rankings against the actual labels; the positive correlation between the two demonstrates the effectiveness of the trained ranking model.
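A hedged sketch of how pairwise ranking can be realized with XGBoost follows; the `rank:pairwise` objective optimizes a pairwise logistic (cross-entropy style) loss over sample pairs. Treating the whole training set as a single query group, and deriving relevance from log-MIC with `mic_train` as the training MIC values, are our assumptions.

```python
import numpy as np
from xgboost import XGBRanker

# Pairwise ranking: the model learns to order sample pairs so that the
# peptide with the lower experimental MIC receives the higher score.
# Lower MIC -> higher relevance; the virtual MIC of 8192 ug/mL assigned
# to negatives (see Datasets) maps to relevance 0.
relevance = np.log2(8192) - np.log2(mic_train)

ranker = XGBRanker(objective="rank:pairwise", n_estimators=300)
# XGBRanker requires query groups; treating the whole training set as
# one group is the simplest choice for a single global ranking task.
ranker.fit(X_train, relevance, group=[len(X_train)])

scores = ranker.predict(X_test)   # higher score = predicted more active
```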
As the final stage of the pipeline, we appended a regression stage. Ending the funnel at the ranking module is conceptually feasible, but the learned ranking function is not very sensitive to small differences in the target space, and the positive candidates collected at this stage may cluster within a narrow range of MIC scores; without an accuracy guarantee, this insensitivity could be catastrophic for the concluding stage of the funnel. We therefore trained a regression model on our training set using the standard L2 loss. To further improve predictive accuracy, we turned to deep learning. We experimented with various convolutional neural networks (CNNs) but found that, regardless of how we tuned the hyperparameters, their performance remained unsatisfactory. After continued trial and error, we found that bidirectional long short-term memory networks (BiLSTM), which generally exhibit stronger learning capabilities than CNNs in natural language processing, transfer well to biological sequence information: applying a BiLSTM to MIC prediction indeed achieved promising results. Please refer to Figures 7 and 8 for the specific outcomes.
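The sketch below shows one way such a BiLSTM regressor can be written in PyTorch; the vocabulary size, embedding dimension, and hidden dimension are illustrative assumptions, not the tuned hyperparameters.

```python
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    """Predicts log-MIC from a peptide encoded as amino acid indices."""
    def __init__(self, vocab_size=21, embed_dim=64, hidden_dim=128):
        super().__init__()
        # Trainable amino acid embeddings (20 residues + padding index 0).
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        x = self.embed(tokens)               # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)             # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)  # concat both directions
        return self.head(h).squeeze(-1)      # (batch,) predicted log-MIC

model = BiLSTMRegressor()
criterion = nn.MSELoss()   # the standard L2 loss used for training
```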
Results
Classification module
To assess the effectiveness of different feature representations, we tried seven feature encoding schemes, using XGBoost as the classifier throughout. As Figure 2 shows, AAC, CTDD, DistancePair, and DPC exhibit significant advantages over the other methods, so we selected these four encodings for fusion. The comparative results before and after fusion are presented in Table 1 and demonstrate a remarkable improvement in performance following the fusion.
Table 1. Classification performance of individual feature encodings and their fusion.

| Encoding | Sn(%) | Sp(%) | Acc(%) | MCC | AUROC |
| --- | --- | --- | --- | --- | --- |
| AAC | 93.392 | 94.138 | 93.764 | 0.875 | 0.982 |
| CTDD | 93.822 | 94.320 | 94.072 | 0.881 | 0.985 |
| DistancePair | 93.392 | 94.138 | 93.764 | 0.875 | 0.982 |
| DPC | 92.892 | 93.394 | 93.142 | 0.863 | 0.978 |
| Combined | 94.706 | 95.226 | 94.964 | 0.899 | 0.987 |
To select the final machine learning algorithm, we experimented with six different methods. As shown in Figure 3, XGBoost outperforms the other methods on all five evaluation metrics; overall, XGBoost, RF, and GBDT are the top three and hold a significant advantage over the remaining approaches.

We then integrated these top three methods, using five-fold cross-validation to mitigate experimental variability; Figure 4 illustrates the distribution of results across the five runs. As shown in Table 2, the fused model outperforms the individual models in sensitivity, specificity, accuracy, and Matthews correlation coefficient. While it slightly underperforms XGBoost in AUROC, the fused model demonstrates superior performance overall.
Table 2. Classification performance of the top three base learners and the fused model.

| Model | Sn(%) | Sp(%) | Acc(%) | MCC | AUROC |
| --- | --- | --- | --- | --- | --- |
| XGBoost | 94.624 | 94.570 | 94.600 | 0.892 | 0.987 |
| GBDT | 93.466 | 93.154 | 93.310 | 0.866 | 0.981 |
| RF | 94.624 | 94.372 | 94.498 | 0.890 | 0.982 |
| Combined | 94.798 | 94.852 | 94.822 | 0.897 | 0.985 |
The confusion matrix of the fused model on the independent test set displays detailed information about its predictions; P denotes positive samples and N negative samples. The model correctly predicts the vast majority of samples, with only a few errors.
Ranking module
The ranking results are depicted in Figure 6, where each point of the scatter plot represents the true and predicted rankings of one sample. Using a cutoff of 200, we find that for 77.5% of the samples the true and predicted rankings fall on the same side of the cutoff (both within the top 200 or both outside it), demonstrating the good discriminative ability of the ranking module.
Regression module
Figure 7 displays the MSE-versus-epoch curves for training the deep learning model. As the number of epochs increases, the MSE exhibits an overall downward trend. The loss over all samples stabilizes at approximately 60 epochs and that over positive samples after 90 epochs, indicating that the models have been fully trained. The MSE over all samples reaches its minimum of 0.654 after 115 epochs, whereas the MSE over positive samples reaches its minimum of 0.845 after 119 epochs. Figure 8 depicts the predicted versus actual logarithmic MIC values from the regression model; the majority of samples have a prediction deviation of less than 2.
Method
Datasets
In this study, the antimicrobial peptide (AMP) dataset was obtained from the external Grampa dataset [2], which compiles 51,345 experimental measurements of peptide segments sourced from open-access databases including APD [3], DADP [4], DBAASP [5], DRAMP [6], and YADAMP [7]. During preprocessing, we kept only C-terminally amidated sequences composed of the standard amino acids. Additionally, we retained only data exhibiting specific activity against S. aureus, the organism relevant to our laboratory experiments.
For AMPs annotated with multiple conflicting minimum inhibitory concentration (MIC) values, a simple geometric-mean strategy was employed to obtain the final measurement. To reduce the impact of sequence homology on the methodology, the original sequences were clustered using CD-HIT [8] with the sequence-identity threshold set to 0.7. Negative samples were obtained from the UniProt database: sequences of at most 40 residues that lack the "antimicrobial" annotation. An equal number of negative samples were randomly selected to construct a balanced dataset.
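As a small illustration of the aggregation step (column names and values here are hypothetical), the geometric mean can be computed as the exponential of the mean log-MIC:

```python
import numpy as np
import pandas as pd

# Collapse duplicate sequences: the geometric mean of k MIC values is
# the exponential of the arithmetic mean of their logarithms.
df = pd.DataFrame({"sequence": ["GLFDIVK", "GLFDIVK"],
                   "mic": [16.0, 64.0]})
agg = (df.groupby("sequence")["mic"]
         .apply(lambda m: np.exp(np.log(m).mean()))
         .reset_index(name="mic_geomean"))
print(agg)   # GLFDIVK -> 32.0
```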
To train the ranking and regression models, a virtual MIC value of 8192 μg/mL was assigned to all negative samples. Subsequently, all samples were divided into training and testing sets in an 8:2 ratio. The final dataset consisted of 8838 samples, with the training set comprising 3535 positive samples and 3535 negative samples, while the testing set comprised 884 positive samples and 884 negative samples.
Feature extraction
Through extensive experiments, we have found that employing different feature extraction methods for different modules can effectively enhance the performance of the model. Specifically, we employed two major types of feature extraction methods: (1) For the classification and ranking modules, we utilized physicochemical descriptors to represent the basic peptide features; (2) For the regression module, we employed a method similar to word embeddings in natural language processing to project each peptide into trainable amino acid embeddings. Below, we provide detailed descriptions of these methods:
Amino Acid Composition (AAC)
The Amino Acid Composition (AAC) encoding calculates the frequency of each amino acid type in a protein or peptide sequence. The frequencies of all 20 natural amino acids, namely "ACDEFGHIKLMNPQRSTVWY," can be computed as follows:
$$ f(t)=\frac{N(t)}{N}, \quad t \in\{A, C, D, \ldots, Y\} $$
Where N(t) represents the count of amino acid type t, and N represents the length of the protein or peptide sequence.
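A minimal implementation of this encoding (the example peptide is an arbitrary illustration):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Frequency f(t) = N(t) / N of each of the 20 natural amino acids."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[aa] / n for aa in AMINO_ACIDS]

# Example: a short peptide becomes a 20-dimensional feature vector.
print(aac("GLFDIVK"))
```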
Di-Peptide Composition (DPC)
Di-Peptide Composition (DPC) yields 400 descriptors, one for each ordered pair of amino acids. Beyond plain composition, they capture local ordering information that reflects physicochemical and structural properties, charge characteristics, and hydrophobicity, enabling analysis of a peptide's structure, properties, and function. They are computed as follows:
$$ D(r, s)=\frac{N_{r s}}{N-1}, r, s \in\{A, C, D, \cdots, Y\} $$
Where $N_{r s}$ represents the number of dipeptides composed of amino acid types r and s, and N is the length of the sequence.
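A corresponding sketch for DPC, producing the 400-dimensional vector:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400

def dpc(sequence):
    """Frequency D(r, s) = N_rs / (N - 1) of each of the 400 dipeptides."""
    n = len(sequence)
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(n - 1):            # a length-N sequence has N-1 dipeptides
        counts[sequence[i:i + 2]] += 1
    return [counts[dp] / (n - 1) for dp in DIPEPTIDES]

print(len(dpc("GLFDIVK")))   # 400-dimensional feature vector
```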
Distribution descriptor of Composition/Transition/Distribution (CTDD)
For each of the three residue groups of a physicochemical attribute (e.g., polar, neutral, and hydrophobic), the distribution descriptor consists of five values: the relative positions in the sequence of the first residue of the group and of the residues marking 25%, 50%, 75%, and 100% of that group's occurrences. For instance, to obtain the 25% value we scan from the first residue up to and including the residue at which 25% of the group's occurrences have appeared, and divide that residue's position by the length of the sequence.
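The sketch below computes the five distribution values for one residue group. The rounding convention at the 25/50/75% marks, and the polar group used in the example, are assumptions for illustration:

```python
def ctdd_group(sequence, group_residues):
    """Five distribution values for one residue group: relative positions
    (1-based position / length) of the first occurrence and of the
    residues marking 25%, 50%, 75%, and 100% of occurrences."""
    positions = [i + 1 for i, aa in enumerate(sequence)
                 if aa in group_residues]
    if not positions:
        return [0.0] * 5
    n, length = len(positions), len(sequence)
    marks = [positions[0]] + [positions[max(0, round(n * q) - 1)]
                              for q in (0.25, 0.5, 0.75, 1.0)]
    return [p / length for p in marks]

# Example: a commonly used 'polar' group of the hydrophobicity attribute.
print(ctdd_group("GLFDIVKGLFDIVK", set("RKEDQN")))
```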
PseAAC of distance-pair and reduced alphabet (DistancePair)
The descriptors incorporate amino acid distance and coupling information, together with reduced amino acid alphabets, into a general pseudo-amino acid composition vector. The reduced alphabets are defined as cp(13), cp(14), and cp(19):

$$ \begin{gathered} \mathrm{cp}(13)=\{\mathrm{MF}; \mathrm{IL}; \mathrm{V}; \mathrm{A}; \mathrm{C}; \mathrm{WYQHP}; \mathrm{G}; \mathrm{T}; \mathrm{S}; \mathrm{N}; \mathrm{RK}; \mathrm{D}; \mathrm{E}\} \\ \mathrm{cp}(14)=\{\mathrm{EIMV}; \mathrm{L}; \mathrm{F}; \mathrm{WY}; \mathrm{G}; \mathrm{P}; \mathrm{C}; \mathrm{A}; \mathrm{S}; \mathrm{T}; \mathrm{N}; \mathrm{HRKQ}; \mathrm{E}; \mathrm{D}\} \\ \mathrm{cp}(19)=\{\mathrm{P}; \mathrm{G}; \mathrm{E}; \mathrm{K}; \mathrm{R}; \mathrm{Q}; \mathrm{D}; \mathrm{S}; \mathrm{N}; \mathrm{T}; \mathrm{H}; \mathrm{C}; \mathrm{I}; \mathrm{V}; \mathrm{W}; \mathrm{YF}; \mathrm{A}; \mathrm{L}; \mathrm{M}\} \end{gathered} $$

Letters not separated by a semicolon (;) belong to the same cluster.
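For illustration, a sequence can be recoded under cp(13) by mapping each residue to its cluster index (a minimal sketch):

```python
CP13 = ["MF", "IL", "V", "A", "C", "WYQHP", "G", "T", "S", "N",
        "RK", "D", "E"]
# Map every amino acid to the index of its cp(13) cluster.
REDUCE = {aa: idx for idx, cluster in enumerate(CP13) for aa in cluster}

def reduce_sequence(sequence):
    """Recode each residue as its cp(13) cluster index."""
    return [REDUCE[aa] for aa in sequence]

print(reduce_sequence("GLFDIVK"))   # [6, 1, 0, 11, 1, 2, 10]
```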
Word embedding

Word embedding is a technique that captures the semantic and syntactic information of words by mapping them into a continuous vector space. For a given text corpus, word embedding algorithms learn the representation of each word in the vector space from its context. Generating good embeddings requires a large amount of textual data, since large corpora contain a wide range of words and relationships from which the model can learn rich semantic information. Treating protein sequences as text and extracting correlations between them in this way has become quite popular, and extensive practice has shown it to be a reliable feature extraction method.
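A minimal PyTorch sketch of this idea for peptides, mapping residues to indices and looking up trainable vectors; the index scheme and dimensions are assumptions, and the BiLSTM sketch above uses the same mechanism:

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding

# One trainable vector per amino acid, learned jointly with the model.
embed = nn.Embedding(num_embeddings=21, embedding_dim=64, padding_idx=0)

tokens = torch.tensor([[TOKEN[aa] for aa in "GLFDIVK"]])
vectors = embed(tokens)   # shape (1, 7, 64): trainable residue vectors
```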
Feature selection
If redundant or irrelevant information is used to represent a sequence, it can lead to overfitting and reduce the model's generalization ability. Fortunately, this problem can be mitigated by feature selection. Various effective feature selection techniques have been proposed, such as analysis of variance, binomial distribution, minimum redundancy maximum relevance, and the F-score; in our experiments we used the F-score, which measures the discriminatory power of a feature between two classes. The F-score is calculated as follows:
$$ F(i) = \frac{\left(\overline{\mathbf{x}}_i^{(+)}-\overline{\mathbf{x}}_i\right)^2+\left(\overline{\mathbf{x}}_i^{(-)}-\overline{\mathbf{x}}_i\right)^2}{\frac{1}{n_{+}-1} \sum_{k=1}^{n_{+}}\left(x_{k, i}^{(+)}-\overline{\mathbf{x}}_i^{(+)}\right)^2+\frac{1}{n_{-}-1} \sum_{k=1}^{n_{-}}\left(x_{k, i}^{(-)}-\overline{\mathbf{x}}_i^{(-)}\right)^2} $$
Where i indexes the features (so each feature has its own F-score); $\overline{\mathbf{x}}_i$, $\overline{\mathbf{x}}_i^{(+)}$, and $\overline{\mathbf{x}}_i^{(-)}$ are the averages of the i-th feature over all, positive, and negative samples, respectively; k indexes the instances; and $n_+$ and $n_-$ are the numbers of positive and negative samples. A higher F-score indicates a stronger discriminatory power of the feature.
Using the F-score method, we selected 352 of the original 635 features, removing irrelevant and redundant ones. This significantly reduced computational cost and improved the performance of the model.
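A direct NumPy implementation of the F-score formula above, keeping the top 352 features; `X_train` and `y_train` again denote the encoded features and binary labels:

```python
import numpy as np

def f_scores(X, y):
    """Per-feature F-score for a binary problem (formula above)."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    # ddof=1 gives the 1/(n-1)-normalized variances in the denominator.
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

# Keep the 352 most discriminative of the 635 original features.
scores = f_scores(X_train, y_train)
selected = np.argsort(scores)[::-1][:352]
X_train_sel = X_train[:, selected]
```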
Evaluation metrics
To evaluate the performance of the model, we used overall accuracy (Acc), sensitivity (Sn), specificity (Sp), the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUROC). These metrics are computed from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) using the following equations.
$$ \begin{gathered} \text{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \\ \mathrm{Sn}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\ \mathrm{Sp}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} \\ \mathrm{MCC}=\frac{\mathrm{TP} \times \mathrm{TN}-\mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}} \end{gathered} $$
AUROC is defined as the area under the receiver operating characteristic (ROC) curve. The ROC curve is plotted by varying a series of cut-off thresholds, with the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis.
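Assuming predicted labels `y_pred` and probability scores `y_score` for true labels `y_true`, these metrics can be computed with scikit-learn as follows (a sketch):

```python
from sklearn.metrics import (confusion_matrix, matthews_corrcoef,
                             roc_auc_score)

# confusion_matrix for binary labels returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)
sn  = tp / (tp + fn)                    # sensitivity (recall)
sp  = tn / (tn + fp)                    # specificity
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)    # y_score: predicted probabilities
```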
K-fold cross-validation and independent testing are widely used to evaluate machine learning models. In K-fold cross-validation, the original data is divided into K subsets (folds); each subset serves once as the validation set while the remaining K-1 subsets form the training set. The model is evaluated on the validation set of each fold, and the metric values are averaged to obtain the cross-validation score. In the current work, we employed 5-fold (K=5) cross-validation.
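A minimal scikit-learn sketch of this procedure; the model choice and the data names `X`, `y` are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5-fold cross-validation: each fold serves once as the validation set
# while the other four folds form the training set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=500, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())   # the averaged cross-validation score
```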
During independent testing, a completely different dataset is used, which means all samples are new to the trained model. This ensures that the model's performance is assessed on unseen data and can generalize well beyond the training dataset.
Mean Squared Error (MSE) is commonly used as the loss function for regression problems. It measures how closely the predicted values $\hat{Y}_i$ of a computational model match the true values $Y_i$; the smaller the MSE, the better the predictive performance of the model. The mathematical expression for MSE is as follows:
$$ \mathrm{MSE}=\frac{1}{n} \sum_{i=1}^n\left(Y_i-\hat{Y}_i\right)^2 $$
Contribution to future iGEM teams
To make this novel antimicrobial peptide screening method accessible to researchers without a programming background, we have developed a user-friendly website that lets such users apply the proposed model directly.
How to use it?
Drag and drop the well-organized data file into the execution box, then click the submit button to automatically execute each step of the MMEP algorithm.
The results are presented as follows, with the first column representing the top ten peptide sequences predicted to have the highest potential, and the second column indicating their corresponding predicted MIC values.
Future Outlook
Although the modeling work is complete, we believe that a plug-and-play general model of this kind still has room for optimization, to make it more practical and reliable and to benefit a larger number of biological researchers.
First, we will continue to enhance the website by adding modules for data preprocessing, feature selection, machine learning algorithm selection, pipeline assembly, and result visualization. This will enable users to complete the entire machine learning workflow within the website, saving significant time and improving the user experience.
Additionally, the website will require better server configuration to avoid issues such as lagging, inability to handle large-scale data, and slow processing speed. As the user base grows, we will continuously strengthen website maintenance while ensuring it remains free and open-source to promote the adoption of this work and assist a wider audience.
Finally, we will explore additional algorithms, such as the popular Transformer, BERT, and graph neural networks (GNNs), as well as fine-tuning pretrained language models to adapt them to downstream biosequence prediction tasks.