Abstract
In order to optimize CoPlat, we aim to build a classifier using machine learning techniques to identify potential adhesive proteins. Our goal is to discover potential adhesive proteins with higher adhesive strength than barnacle and mussel proteins. We then plan to validate the strength of these potential adhesives through the Wet Lab's functional tests.
Definition
Q: What is the "adhesion" we want?
A: Like the adhesive proteins of barnacles or mussels, the proteins we seek are not specific to a particular substrate and can attach to a wide range of surfaces.
Q: What are the "potentially" adhesive proteins we are looking for?
A: Proteins that are not explicitly stated to have non-specific, generalized adhesion in the functional descriptions of protein databases such as UniProt and PDB, but that may nevertheless exhibit generalized adhesion when expressed.
Method Description
First, we build a binary classifier for proteins. Second, we retrieve four datasets from UniProt, named test, test1, test2, and test3, each representing a protein dataset with different features. The classifier is then run on each of the four datasets separately to distinguish whether or not each protein has the potential adhesion we are looking for.
Data Preprocessing
We collected data on adhesion proteins from UniProt and compiled it into a positive database. Then, we analyzed the positive database and generated a negative one with a length distribution matching that of the positive one.
First Stage: Positive Data Collection
We searched the function[CC] field on UniProt using terms such as "adhesive," "cohesive," "adhesion," and "sticky" to identify positive data.
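A minimal sketch of how such a query could be automated against the UniProt REST interface is shown below. The endpoint layout and the "cc_function" query field are assumptions based on the current UniProt API, not the team's exact retrieval script.

```python
# Hypothetical sketch: pull candidate positive entries from UniProt's REST API
# by searching the Function [CC] annotation for adhesion-related terms.
# The field name "cc_function" and the parameter names are assumptions; adjust
# them to match the query that was actually used.
import requests

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"
TERMS = ["adhesive", "cohesive", "adhesion", "sticky"]

def fetch_candidates(term, size=500):
    """Return (accession, sequence) pairs whose Function [CC] text mentions `term`."""
    params = {
        "query": f'cc_function:"{term}"',   # restrict the search to Function [CC]
        "fields": "accession,sequence",
        "format": "tsv",
        "size": size,
    }
    resp = requests.get(SEARCH_URL, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.text.strip().splitlines()[1:]          # drop the header row
    return [tuple(line.split("\t")) for line in rows]

positives = {acc: seq for term in TERMS for acc, seq in fetch_candidates(term)}
print(f"collected {len(positives)} unique candidate positive proteins")
```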
Second Stage: Positive Data Analysis
We constructed a pie chart of species distribution to visually examine the proportions of various species within the positive data.
According to the chart, Mouse-ear cress is the most common organism, accounting for 28% of the total, followed by Human, Mouse, and Fission yeast. Organisms appearing two times or fewer are grouped as "Other," which accounts for 46% of the total. This shows the distribution tendency of the positive data.
Here is our full version of the pie chart with all organisms.
Third Stage: Negative Data Collection
We applied the same search method but eliminated adhesion-related results. For instance, entries annotated with "cell adhesion", "cell-cell adhesion", and similar terms were removed so that the negative data does not overlap with adhesion-related entries in our positive database. We then analyzed this database and randomly generated ten negative databases that follow the length distribution of the positive database.
We can observe that most of the proteins are shorter than 500 amino acids.
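The following is a minimal sketch of how a length-matched negative set could be drawn. `positive_lengths` and `background_pool` (non-adhesive UniProt entries with adhesion hits already removed) are assumed inputs, and the binning scheme is illustrative rather than the team's exact procedure.

```python
# Draw negatives so their length distribution mirrors the positive database.
import random
from collections import defaultdict

def sample_length_matched(positive_lengths, background_pool, bin_width=50, seed=0):
    """background_pool: list of (accession, sequence) with adhesion-annotated entries removed."""
    rng = random.Random(seed)
    # Bucket the background proteins by length so we can mirror the positive bins.
    buckets = defaultdict(list)
    for acc, seq in background_pool:
        buckets[len(seq) // bin_width].append((acc, seq))
    negatives = []
    for length in positive_lengths:          # one negative per positive length
        bucket = buckets.get(length // bin_width)
        if bucket:
            negatives.append(rng.choice(bucket))
    return negatives

# Ten independent negative databases, as described above:
# negative_sets = [sample_length_matched(pos_lens, pool, seed=i) for i in range(10)]
```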
Model Construction
To establish our model, we needed to determine the encoding and machine learning methods. In the first stage, we employed the SVM machine learning model along with three different encoding methods to construct the ML model. We trained and tested this ML model and observed high accuracy. In the second stage, to check whether the model was overfitting, we compared it against GB (Gradient Boosting) and RF (Random Forest). Through the comparison of the three encoding methods and three models, we settled on SVM with the ESM2[6] encoding method as the final ML model.
First Stage: Protein Encoding
Protein encoding converts a protein's primary structure, i.e., its amino acid sequence, into a format suitable for use as input to machine learning models. In our project, we employed three encoding methods to determine which one works best.
a. AAPC Encoding Method
Transform a protein sequence into a vector by counting the occurrences of each amino acid pair.
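A minimal sketch of this encoding, assuming ordered dipeptide counts over the 20 standard amino acids (giving a 400-dimensional vector) and optional normalization by sequence length:

```python
# AAPC-style encoding: count every amino acid pair (dipeptide) in the sequence.
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]   # 400 ordered pairs
PAIR_INDEX = {p: i for i, p in enumerate(PAIRS)}

def aapc_encode(seq, normalize=True):
    vec = [0.0] * len(PAIRS)
    for i in range(len(seq) - 1):
        idx = PAIR_INDEX.get(seq[i:i + 2])
        if idx is not None:                 # skip pairs with non-standard symbols
            vec[idx] += 1
    if normalize and len(seq) > 1:
        vec = [v / (len(seq) - 1) for v in vec]
    return vec

print(len(aapc_encode("MKTAYIAKQR")))   # 400
```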
b. AA Index Encoding Method[1]
Transform a protein sequence into a vector based on each amino acid's chemical and physical features; each residue is mapped to its values on the 14 physicochemical property scales of the 20 standard amino acids (a sketch follows the list of properties below).
List of the 14 physicochemical properties:
- H11 & H12: hydrophobicity
- H2: hydrophilicity
- NCI: net charge index of side chains
- P11 & P12 : polarity
- P2: polarizability
- SASA: solvent-accessible surface area
- V: volume of side chains
- F: flexibility
- A1: accessibility
- E: exposed
- T: turns
- A2: antigenic
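Below is a minimal sketch of one way this encoding can be computed, assuming each protein is summarized by the per-scale average over its residues. The property values shown are illustrative placeholders, not the actual AAindex numbers used in the project.

```python
# AA index-style encoding: map each residue to its 14 physicochemical scale
# values and average each scale over the sequence.
PROPERTIES = ["H11", "H12", "H2", "NCI", "P11", "P12", "P2",
              "SASA", "V", "F", "A1", "E", "T", "A2"]

PROPERTY_TABLE = {
    # residue: [H11, H12, H2, NCI, P11, P12, P2, SASA, V, F, A1, E, T, A2]
    # Placeholder values for illustration only; the real numbers come from AAindex.
    "A": [0.6, 0.6, -0.5, 0.01, 8.1, 0.05, 0.05, 1.2, 27.5, 0.4, 6.6, 15.0, 0.7, 1.1],
    # ... entries for the remaining 19 standard amino acids ...
}

def aaindex_encode(seq):
    """Average each of the 14 property scales over all residues in the sequence."""
    rows = [PROPERTY_TABLE[aa] for aa in seq if aa in PROPERTY_TABLE]
    if not rows:
        return [0.0] * len(PROPERTIES)
    return [sum(col) / len(rows) for col in zip(*rows)]
```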
c. ESM Encoding Method
ESM (Evolutionary Scale Modeling) is a masked language model for analyzing the features and structure of proteins. By extracting the 1280-dimensional embedding from the final layer of the model, we obtain a vector representation of the protein. In summary, we employ these three encoding methods to transform amino acid sequences into formats suitable as input data for machine learning models.
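A minimal sketch of extracting such a representation with the fair-esm package is shown below. It assumes the 650M-parameter esm2_t33_650M_UR50D checkpoint, whose final (33rd) layer has width 1280, and mean-pools the per-residue embeddings; a different checkpoint or pooling choice would change the details.

```python
# ESM2 embedding sketch (pip install fair-esm torch).
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def esm_encode(name, seq):
    """Mean-pool the final-layer token embeddings into one 1280-d vector."""
    _, _, tokens = batch_converter([(name, seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]        # shape: (1, seq_len + 2, 1280)
    return reps[0, 1:len(seq) + 1].mean(0)   # drop BOS/EOS, average over residues

vec = esm_encode("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)   # torch.Size([1280])
```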
Second Stage: ML Model Testing and Analysis
In this section, we constructed a machine learning model and ensured its stability and excellent performance through various validation methods.
a. Model Selection
We chose the support vector machine (SVM) as our machine learning method instead of a deep learning model.
b. Training
i. Balanced training
To avoid training a biased model, we use the balanced training method (see the sketch after the jackknife description below).
ii. Jackknife (Leave-one-out) method
We utilize the Jackknife method to estimate our accuracy: a single sample is held out for testing in each iteration. In this way, the accuracy estimate does not depend on one particular train/test split.
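A minimal sketch of both steps with scikit-learn, assuming `X_pos` and `X_neg` are the encoded positive and negative feature matrices; the subsampling scheme and SVM hyperparameters are illustrative, not the team's exact settings.

```python
# Balanced training plus leave-one-out (jackknife) accuracy estimation.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def jackknife_accuracy(X_pos, X_neg, seed=0):
    # Balance the classes by subsampling the larger side down to the smaller one.
    rng = np.random.default_rng(seed)
    n = min(len(X_pos), len(X_neg))
    X = np.vstack([X_pos[rng.choice(len(X_pos), n, replace=False)],
                   X_neg[rng.choice(len(X_neg), n, replace=False)]])
    y = np.array([1] * n + [0] * n)
    clf = SVC(kernel="rbf", probability=True)
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())   # one held-out sample per fold
    return scores.mean()
```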
c. Testing
i. Stability Test
To verify that the accuracy of the model is not dependent on training with just one negative database, we employed seven distinct negative databases along with a consistent positive database for training. Our objective was to assess whether the model could deliver consistent predictions, regardless of the random assortment of negative databases.
Figure 5 illustrates that the accuracy obtained with each negative database does not show a significant difference.
d. Overfitting Testing
Since the accuracy of SVM-ESM is higher than that of SVM with the AAPC or AA index encoding methods, we suspected that overfitting might be occurring. We therefore compared the accuracy of Random Forest, Gradient Boosting, and SVM: if the differences were large, the SVM would be overfitting, and the results indicate that it is not.
Figures 6~8 indicate that, under the same encoding method, the accuracies of the three models are close to each other, which supports the conclusion that the SVM is not overfitting.
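A rough sketch of this comparison with scikit-learn is shown below; the model hyperparameters and the 5-fold cross-validation are assumptions, not the exact evaluation protocol.

```python
# Overfitting check: compare SVM, Gradient Boosting, and Random Forest on the
# same encoded data. Similar cross-validated accuracies across the three models
# argue against SVM-specific overfitting.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

MODELS = {
    "SVM": SVC(kernel="rbf"),
    "GB": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(n_estimators=200),
}

def compare_models(X, y, cv=5):
    return {name: cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in MODELS.items()}

# e.g. compare_models(X_esm, y) returns one mean accuracy per model.
```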
e. Determination of ML and Encoding Method
By pairing each encoding method with each machine learning model, we obtain nine combinations from which to find the best encoding method.
We can see that, for the same model, ESM encoding is the most accurate, so we chose it as our encoding method. Having established a machine learning model and validated its stability and performance, we can employ the model to identify potential adhesive proteins.
Picking the Potential Adhesive Proteins
Having established the machine learning model, the next step is to use this model to select potential adhesive proteins from those that have not previously been documented to possess adhesive properties. Within our positive dataset, the majority of proteins were less than 500 amino acids in length. Therefore, based on this length distribution, we fed the test datasets into our model for analysis.
The model predictions allow us to identify the proteins most likely to have adhesive properties. These proteins are then subjected to a feature comparison with our positive dataset.
a. Data Selection
We found that the positive adhesive proteins are mostly below 500 amino acids in length. We then input the four test datasets (test, test1~3) into the SVM-ESM model for classification.
b. Ranking
Proteins are ranked by predicted probability, and the top-ranked proteins are the most likely to be generally adhesive. After obtaining the probability for each protein, we can use protein analysis tools to assess the accuracy of the model's predictions.
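A minimal sketch of this ranking step, assuming `clf` is the fitted SVC with probability=True and `esm_encode` is the encoder sketched earlier; both names are illustrative.

```python
# Score every test protein with the trained classifier and keep the top candidates.
import numpy as np

def rank_candidates(clf, test_entries, encode, top_k=20):
    """test_entries: list of (accession, sequence); returns top_k by probability."""
    X = np.vstack([encode(acc, seq).numpy() for acc, seq in test_entries])
    probs = clf.predict_proba(X)[:, 1]              # probability of the adhesive class
    order = np.argsort(probs)[::-1][:top_k]
    return [(test_entries[i][0], float(probs[i])) for i in order]
```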
Feature Analysis
In addition to the results predicted by the model, it is essential to compare these proteins, which are most likely to have adhesive properties, against the common features present in our positive dataset. Through this further analysis, we can increase our confidence in the similarity between the proteins predicted by our model and the positive data. Below are the analytical tools we used:
a. AACP Comparison
AACP (Amino Acid Component Percentage) is obtained by counting the amino acid content of all proteins in a database. Using this statistic, we can see the trend in the amino acid content of naturally occurring adhesive proteins. By comparing it with the proteins we have identified, we can judge whether the identified proteins are potential adhesive proteins. We present and visually compare the positive and negative data in a bar chart.
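A rough sketch of how such an AACP comparison chart can be produced (not the team's exact plotting code):

```python
# Compute the overall amino acid composition of the positive and negative
# databases and plot them side by side as a bar chart.
from collections import Counter
import matplotlib.pyplot as plt

AA = "ACDEFGHIKLMNPQRSTVWY"

def aacp(sequences):
    counts = Counter(aa for seq in sequences for aa in seq if aa in AA)
    total = sum(counts.values())
    return [100.0 * counts[aa] / total for aa in AA]   # percentage per residue type

def plot_aacp(pos_seqs, neg_seqs):
    x = range(len(AA))
    width = 0.4
    plt.bar([i - width / 2 for i in x], aacp(pos_seqs), width, label="positive")
    plt.bar([i + width / 2 for i in x], aacp(neg_seqs), width, label="negative")
    plt.xticks(list(x), list(AA))
    plt.ylabel("composition (%)")
    plt.legend()
    plt.show()
```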
Conclusion of AACP
From Figure 10, we can quickly see the characteristics of the positive and negative data. For instance, the positive data contains more P, S, and T amino acids, and possible reasons for these three characteristics are as follows:
P Amino Acid rich:
1. Proline has cyclic side chains and rigidity, which helps to stabilize the protein structure.[2]
2. Proline-rich regions are found in proteins involved in biofilm adhesion.[3]
S Amino Acid rich:
Serine is usually found in protein regions that require flexibility. It can participate in hydrogen-bonding interactions and helps form secondary structures such as β-sheets and loops. Proteins that need to undergo structural changes or exhibit flexibility may have higher serine content.[4]
T Amino Acid rich:
Threonine is associated with focal adhesions, which connect cells to the substrate, anchoring the cell to the attached cellular substrate.[5]
b. AAPC Comparison and Visualization
AAPC (Amino Acid Pair Component), introduced above in the AAPC encoding method, counts the amino acid pairs in a protein sequence as features. Statistically, we can derive a 20*20 table of amino acid pairs, or a 26*26 table if the less common or ambiguous amino acid symbols are included.
The figure above shows that a 20*20 chart is too complicated to read, so we render it as a heatmap. We standardize the data into numbers in the range from 0 to 1000 and use a light-brown-to-black color gradient to represent the different values; the results are as follows:
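As a rough illustration of how such a heatmap can be produced (the colormap "copper_r" merely approximates the light-brown-to-black gradient described above; this is not the team's exact plotting code):

```python
# Aggregate amino acid pair counts over a database, rescale to 0~1000,
# and draw a 20x20 heatmap.
import numpy as np
import matplotlib.pyplot as plt

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(AA)}

def aapc_matrix(sequences):
    mat = np.zeros((20, 20))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a in IDX and b in IDX:
                mat[IDX[a], IDX[b]] += 1
    return 1000.0 * mat / mat.max()          # standardize into the 0~1000 range

def plot_heatmap(sequences):
    mat = aapc_matrix(sequences)
    plt.imshow(mat, cmap="copper_r")         # low = light brown, high = black
    plt.xticks(range(20), AA)
    plt.yticks(range(20), AA)
    plt.colorbar(label="scaled pair count (0~1000)")
    plt.show()
```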
c. InterPro
InterPro is a tool for analyzing protein characteristics. It compares an input sequence against existing data from different open-source databases, so we can learn as much as possible about the sequence, such as which family it belongs to and which functional domains it has. Using this tool, we can obtain more detailed information about the proteins and analyze them for potential widespread adhesion.
d. The MEME Suite
The MEME Suite provides multiple tools to analyze motifs: subsequences that occur repeatedly in one or more input DNA or protein sequences and may carry a functional or structural feature.
First, we utilized two motif discovery tools, MEME (Multiple Em for Motif Elicitation) and STREME (Sensitive, Thorough, Rapid, Enriched Motif Elicitation), to analyze the input sequences. Second, because the motifs these tools output are not necessarily known motifs, we used the motif comparison tool Tomtom to compare them against Prosite, a protein motif database. By identifying which known motif is most similar to each output motif, we could infer their possible functions. This helps us understand more about the sequence-based functional features of the positive pure data.
MEME requires manually setting the length of the motifs and how many motifs it should find; it then finds that number of motifs of the specified length in the input sequence set. After trying several parameter combinations, we decided on the following settings:
Lengths between 4~15 amino acids: a motif should be a short subsequence.
We tested motif counts between 3~9 and found that if the number is too small, some features are missed, while if the number is too large, the motif scores are too low to be credible. We finally settled on 5 motifs.
motif: EEGJTVFAPSDEAFK
Percentage of input sequences that contain this motif: 39.4%
The most similar known sequence from Prosite: PS00647 (THYMID_PHOSPHORYLASE)
The function of the most similar known sequence: Thymidine and pyrimidine-nucleoside phosphorylases signature
motif: VYGVDGVLLPEELFG
Percentage of input sequences that contain this motif: 21.2%
The most similar known sequence from Prosite: PS01095 (GH18_1)
The function of the most similar known sequence: Glycosyl hydrolases family 18 (GH18) active site signature
motif: EPVQJLLYHVJPEYY
Percentage of input sequences that contain this motif: 25.8%
The most similar known sequence from Prosite: PS00435 (PEROXIDASE_1)
The function of the most similar known sequence: Peroxidases proximal heme-ligand signature.
motif: DFIHTLLHYGGYNEM
Percentage of input sequences that contain this motif: 6.1%
The most similar known sequence from Prosite: PS01260 (BH4_1)
The function of the most similar known sequence: Apoptosis regulator, Bcl-2 family BH4 motif signature
motif: NITAILEKAPQFSTF
Percentage of input sequences that contain this motif: 19.7%
The most similar known sequence from Prosite: PS01236 (PDXT_SNO_1)
The function of the most similar known sequence: PdxT/SNO family signature
STREME automatically calculates all possible motifs in the input sequence set and ranks them by p-value, frequency of occurrence, and other criteria. The motifs with the highest overall scores are then output.
motif: DIYTDGRI
Percentage of input sequences that contain this motif: 19.7%
The most similar known sequence from Prosite: PS00887 (ILVD_EDD_2)
The function of the most similar known sequence: Dihydroxy-acid and 6-phosphogluconate dehydratases signature 2
Conclusion of the MEME Suite
As a result, the MEME Suite did not find motifs with a clearly adhesive function in the positive pure data, which means that:
1. Each adhesive protein's mechanism may differ, and it is difficult to find a common subsequence related to adhesion function.
2. The output sequences of the classifier trained on this dataset may have features similar to the motifs analyzed above. The motifs can therefore serve as one reference for the final manual check of whether an output sequence is similar to the positive pure data.
Based on the results of the feature analysis described above, several key insights can be drawn:
1. The factors contributing to protein adhesiveness are complex and diverse, making it difficult to identify a highly prevalent motif.
2. In terms of amino acid composition, proline (P), serine (S), and threonine (T) are critical components that contribute to a protein's adhesive properties. Proteins with high levels of these three amino acids are therefore more likely to be adhesive.
In light of these findings, we decided to select the top four proteins with the highest predictive scores generated by the model for further in-depth investigation.
Results
With the machine-learning classifier and protein feature analysis, we picked out the top candidates with the highest probabilities from our classifier and examined their Function[CC] entries in UniProt. In the end, we found four high-probability proteins that UniProt does not describe as having general adhesive functions.
1. EcpA(P35645)
Uniprot function[cc]: Fimbrial protein EcpA
Length (amino acids): 159
Our Conclusion in InterPro: EcpA belongs to a pilus family associated with cell adhesion, so it has the potential for general adhesion.
Family belonging to EcpA:
Pilin is a component of a polar flexible filament, which is involved in cell adhesion, microcolony formation, twitching motility, and transformation.
2. Nid1(P08460)
Uniprot function[cc]: Sulfated glycoprotein widely distributed in basement membranes and tightly associated with laminin. Also binds to collagen IV and perlecan. It probably has a role in cell-extracellular matrix interactions.
Length (amino acids): 324
Our Conclusion in InterPro: A domain of unknown function, which is also present in some cell adhesion glycoproteins, is hypothesized to have the potential for general adhesion.
Domain belonging to Nid1:
NIDO domain, an extracellular domain of unknown function. Some cell adhesion glycoproteins are known to contain a NIDO domain.
3. epd2(P28771)
Uniprot function[cc]: This may play a role in neural plasticity. May be involved during axon regeneration.
Length (amino acids): 221
Our Conclusion in InterPro: The family and domain of this protein suggest that it may be associated with collagen fibers, with the potential for general adhesion.
Family belonging to epd2:
Ependymin, found predominantly in the cerebrospinal fluid of teleost fish. A bound form of the glycoproteins is associated with fibrils.
4. zig-4(G5ECB1)
Uniprot function[cc]: Required for maintaining the axon position of PVQ and PVP neurons post-embryonically in the ventral nerve cord (VNC) by preventing axons from drifting into the opposite side of the VNC, which could occur during body growth and movement.
Length (amino acids): 253
Our Conclusion in InterPro: zig-4 belongs to a cell adhesion family and also contains a cell adhesion domain, so it has the potential for general adhesion.
Family belonging to zig-4:
Basigin-like, a family of neuronal cell adhesion molecules that belong to the immunoglobulin superfamily.
Domain belonging to zig-4:
EPENDYMIN, found predominantly in the cerebrospinal fluid of teleost fish. A bound form of the glycoproteins is associated with the extracellular matrix, probably with collagen.
After we predicted and analyzed these four proteins, we tried to express them in E. coli to verify the correctness of our classifier. We also confirmed that they have adhesive properties and can be used as adhesive proteins in CoPlat. More details about the functional tests of the potential adhesive proteins are in the Results section.
References
- Chen KH, Wang TF, Hu YJ. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC Bioinformatics. 2019;20(1):308. doi: 10.1186/s12859-019-2907-1.
- Yu H, Zhao Y, Guo C, Gan Y, Huang H. The role of proline substitutions within flexible regions on thermostability of luciferase. Biochim Biophys Acta. 2015 Jan;1854(1):65-72. doi: 10.1016/j.bbapap.2014.10.017. Epub 2014 Oct 30. PMID: 25448017.
- Yarawsky AE, English LR, Whitten ST, Herr AB. The Proline/Glycine-Rich Region of the Biofilm Adhesion Protein Aap Forms an Extended Stalk that Resists Compaction. J Mol Biol. 2017 Jan 20;429(2):261-279. doi: 10.1016/j.jmb.2016.11.017. Epub 2016 Nov 25. PMID: 27890783; PMCID: PMC5363081.
- van Rosmalen M, Krom M, Merkx M. Tuning the Flexibility of Glycine-Serine Linkers To Allow Rational Design of Multidomain Proteins. Biochemistry. 2017;56(50):6565-6574. doi: 10.1021/acs.biochem.7b00902.
- Brown MC, Perrotta JA, Turner CE. Serine and Threonine Phosphorylation of the Paxillin LIM Domains Regulates Paxillin Focal Adhesion Localization and Cell Adhesion to Fibronectin. Mol Biol Cell. 1998;9(7):1803-16. doi: 10.1091/mbc.9.7.1803. PMID: 9658172.