Model

Abstract

  In order to optimize CoPlat, we aim to create a classifier using machine learning techniques to identify potential adhesive proteins. Our goal is to discover potential adhesive proteins with higher adhesive strength than barnacle and mussel proteins. Subsequently, we plan to validate the strength of the potential adhesives through Wetlab’s functional tests.


Definition

Q: What is the "adhesion" we want?


A: Like the adhesive proteins in barnacles or mussels, the proteins are not specific to a particular substrate and can attach to a wide range of surfaces.



Q: What are the "potentially" adhesive proteins we are looking for?


A: Proteins that are not explicitly described as having non-specific, generalized adhesion in the functional descriptions of protein databases such as UniProt and PDB, but that may in fact have generalized adhesion.



Method Description

  First, we build a binary classifier for proteins. Second, we obtain four datasets from UniProt and name them test, test1, test2, and test3; each represents a protein dataset with different features. Then, we run the classifier on the four datasets separately to determine whether each protein has the potential adhesion we are looking for.

Figure 1. Flowchart of our model



Data Preprocessing


  We collected data on adhesion proteins from UniProt and compiled it into a positive database. Then, we analyzed the positive database and generated a negative one with a length distribution matching that of the positive one.


First Stage: Positive Data Gaining

  We used the function[cc] search field on UniProt, with terms such as "adhesive," "cohesive," "adhesion," and "sticky," to identify positive data.


Second Stage: Positive Data Analysis

  We constructed a pie chart of species distribution to visually examine the proportions of various species within the positive data.

Figure 2. The organism composition of our positive data

  According to the chart, Mouse-ear cress is the most common organism, accounting for 28% of the total, followed by Human, Mouse, and Fission yeast. Organisms that occur two times or fewer are grouped as “Other,” which accounts for 46% of the total. The chart shows how the positive data are distributed across species.

  Here is our full version of the pie chart with all organisms.

Figure 3. The organism composition of our positive data (full version with all organisms)

Third Stage: Negative Data Gaining

  We used the same search approach to exclude results. For instance, entries matched by "cell adhesion", "cell-cell adhesion", and similar terms were eliminated so that they would not contaminate our positive database. Then, we analyzed the positive database and randomly generated ten negative databases that follow its length distribution.

Figure 4. The length distribution of our positive data

  We can observe that most of the sequences are shorter than 500 amino acids.
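
  The length-matched sampling described above can be sketched as follows. This is only a minimal illustration under our own assumptions: the function name `sample_length_matched`, the 50-residue bin width, and the toy sequences are hypothetical and are not taken from the original pipeline.

```python
import random
from collections import defaultdict


def sample_length_matched(positive_seqs, candidate_seqs, bin_width=50, seed=0):
    """Pick one candidate per positive sequence from the same length bin."""
    rng = random.Random(seed)

    # Group the candidate (non-adhesive) sequences by length bin.
    pool = defaultdict(list)
    for seq in candidate_seqs:
        pool[len(seq) // bin_width].append(seq)
    for bucket in pool.values():
        rng.shuffle(bucket)

    # For each positive sequence, draw an unused candidate of similar length.
    negatives = []
    for seq in positive_seqs:
        bucket = pool.get(len(seq) // bin_width)
        if bucket:  # skip if no candidate of that length is left
            negatives.append(bucket.pop())
    return negatives


# Toy data only: real input would be sequences exported from UniProt.
positives = ["A" * 10, "C" * 120, "D" * 130]
candidates = ["G" * 20, "G" * 30, "H" * 110, "H" * 140, "K" * 900]

# Different seeds give different negative sets with the same length profile.
negative_sets = [
    sample_length_matched(positives, candidates, seed=s) for s in range(10)
]
```

  Each seed yields a different random negative set, which is one way to obtain several negative databases that all share the length profile of the positive one.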




Model Construction

  To establish our model, we needed to choose an encoding method and a machine learning method. In the first stage, we combined the SVM machine learning model with three different encoding methods, trained and tested each combination, and observed high accuracy. In the second stage, to check whether the model was overfitting, we compared SVM with GB (Gradient Boosting) and RF (Random Forest). After comparing the three encoding methods and the three models, we selected SVM with the ESM2[6] encoding method as the final ML model.


First Stage : Protein Encoding

  Protein encoding converts a protein's primary structure, that is, its amino acid sequence, into a format that can be used as input data for machine learning models. In our project, we tried three encoding methods to determine which one works best.

a. AAPC Encoding Method

  Transform a protein sequence into a vector by counting the number of amino acid pairs.
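
  As a rough illustration of this encoding, the sketch below counts adjacent amino acid pairs and returns a 400-dimensional vector. The names `aapc_encode` and `PAIR_INDEX` are our own, and the optional normalization is an assumption; the text above only states that pairs are counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def aapc_encode(sequence, normalize=False):
    """Count every adjacent amino acid pair in the sequence."""
    counts = [0] * len(PAIRS)
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        idx = PAIR_INDEX.get(seq[i:i + 2])
        if idx is not None:  # ignore pairs with non-standard symbols
            counts[idx] += 1
    if not normalize:
        return counts
    # Optional (assumed) step: convert counts to frequencies.
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


vector = aapc_encode("AAAC")  # contains the pairs AA, AA, AC
```

  With raw counts, longer proteins produce larger values, so frequencies may be preferable when sequence lengths vary widely.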

b. AA Index Encoding Method[1]

  Transform a protein sequence into a vector based on each amino acid's chemical and physical features; each amino acid is converted to its values on the 14 physicochemical property scales of the 20 standard amino acids.

Figure 4. Values of the 14 physicochemical property scales of the 20 standard amino acids

List of the 14 physicochemical properties:

  1. H11 & H12: hydrophobicity
  2. H2: hydrophilicity
  3. NCI: net charge index of side chains
  4. P11 & P12 : polarity
  5. P2: polarizability
  6. SASA: solvent-accessible surface area
  7. V: volume of side chains
  8. F: flexibility
  9. A1: accessibility
  10. E: exposed
  11. T: turns
  12. A2: antigenic
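
  A minimal sketch of a property-scale encoding is shown below. It does not use the 14 scales listed above, whose values are given only in the figure. As stand-ins, it uses the Kyte-Doolittle hydropathy scale and a simple side-chain charge, and it averages each scale over the sequence to get a fixed-length vector. Both the stand-in scales and the averaging step are our assumptions.

```python
# Stand-in scale 1: Kyte-Doolittle hydropathy values.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Stand-in scale 2: nominal side-chain charge (all other residues are 0).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})

SCALES = [HYDROPATHY, CHARGE]  # the real method would hold 14 scales here


def property_encode(sequence, scales=SCALES):
    """Average each property scale over the standard residues of a sequence."""
    residues = [aa for aa in sequence.upper() if aa in HYDROPATHY]
    if not residues:
        return [0.0] * len(scales)
    return [sum(scale[aa] for aa in residues) / len(residues) for scale in scales]


features = property_encode("AK")  # one alanine, one lysine
```

  The output length equals the number of scales, so with the 14 scales from the list above each protein would map to a 14-dimensional vector under this averaging scheme.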

c. ESM Encoding Method

  ESM (Evolutionary Scale Modeling) is a masked protein language model that captures the features and structure of proteins. By extracting the 1280-dimensional output of the model's final layer, we obtain a vector representation of the protein. In summary, we use these three encoding methods to transform amino acid sequences into input formats suitable for machine learning models.


Second Stage: ML Model Testing and Analyzing


  In this section, we constructed a machine learning model and ensured its stability and excellent performance through various validation methods.

a. Model Selection

  We chose a support vector machine (SVM) as our machine learning method instead of a deep learning model.


b. Training

i. Balance training
To avoid training a biased model, we use the balance training method.
ii. Jackknife (Leave-one-out) method
We use the Jackknife method to calculate our accuracy. In each round, a single sample is held out for testing and the model is trained on the rest, so every sample is tested exactly once. In this way, we can confirm that the accuracy does not depend on one particular split of the data set.
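
The leave-one-out procedure can be sketched as follows. To keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM; this stand-in, the function names, and the toy feature vectors are our assumptions and not the project's actual code.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean vector."""
    sums, counts = {}, {}
    for vec, label in zip(train_x, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1

    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and test on the held-out one."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy, balanced data: three "negative" (0) and three "positive" (1) samples.
toy_features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
toy_labels = [0, 0, 0, 1, 1, 1]
accuracy = jackknife_accuracy(toy_features, toy_labels)
```

Any classifier with the same `predict(train_x, train_y, x)` shape could be passed in place of the stand-in, including a wrapper around an SVM.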

c. Testing

i. Stability Test
  To verify that the accuracy of the model is not dependent on training with just one negative database, we employed seven distinct negative databases along with a consistent positive database for training. Our objective was to assess whether the model could deliver consistent predictions, regardless of the random assortment of negative databases.

Figure 5. The accuracy of our model with different negative training databases.

  Figure 5 shows that the accuracy does not differ significantly across the negative databases.


d. Overfitting Testing

  Since SVM with ESM encoding is more accurate than SVM with the AAPC or AA index encoding, we suspected that overfitting might have occurred. We therefore compared the accuracy of Random Forest, Gradient Boosting, and SVM. A large gap between SVM and the other two models would suggest that SVM overfits; the results show no such gap.

Figure 6. Accuracy of the three ML models with AAPC encoding

Figure 7. Accuracy of the three ML models with AA index encoding

Figure 8. Accuracy of the three ML models with ESM encoding

  Figures 6~8 indicate that, under the same encoding method, the accuracies of the three models are close to each other. This suggests that the SVM model is not overfitting.


e. Determination of ML and Encoding Method

  By pairing each encoding method with each machine learning model, we obtain nine combinations, which we compare to find the best encoding method.

Figure 9. Accuracy comparison of the same model with different encoding methods

  For each model, ESM encoding gives the highest accuracy, so we chose it as the final encoding method. After establishing the machine learning model and validating its stability and performance, we can use the model to identify potential adhesive proteins.




Pick the Potential Adhesive Protein


  Having established the machine learning model, the next step is to use this model to select potential adhesive proteins from those that have not been previously documented to possess adhesive properties. Within our positive dataset, it was observed that the majority of proteins were less than 500 amino acids in length. Therefore, based on this length distribution, we fed the test datasets into our model for analysis.
  The model predictions allow us to identify proteins most likely to have adhesive properties. These proteins are then subjected to a feature comparison with our positive dataset.

a. Data Selection

  We found that most of the positive adhesive proteins are shorter than 500 amino acids. We then fed the four test datasets (test, test1~3) into the SVM-ESM model for classification.

b. Ranking

  Proteins are ranked by predicted probability, and the top-ranked proteins are the most likely to be generally adhesive. After obtaining the probability for each protein, we use protein analysis tools to assess the accuracy of the model's predictions.
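
  The ranking step amounts to sorting candidates by the classifier's predicted probability. The sketch below assumes the probabilities are already available as a dictionary; the identifiers and scores are placeholders for illustration only, not real predictions.

```python
def rank_candidates(probabilities, top_k=4):
    """Return the top_k (protein, probability) pairs, highest probability first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Placeholder scores: hypothetical IDs and made-up probabilities.
example_scores = {
    "protein_A": 0.62,
    "protein_B": 0.97,
    "protein_C": 0.15,
    "protein_D": 0.88,
    "protein_E": 0.91,
}
top_hits = rank_candidates(example_scores, top_k=3)
```

  The selected entries would then go through the feature analysis and manual checks described in the following sections.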




Feature Analyzing


  In addition to the results predicted by the model, it is essential to compare these proteins, which are most likely to have adhesive properties, with the common features present in our positive dataset. Through further analysis, we can increase our confidence in the similarity between the proteins predicted by our model and the positive data. Below are the analytical tools we used to do this:


a. AACP Comparison

  AACP (Amino Acid Component Percentage) is obtained by counting the amino acid content of all proteins in a database. This statistic shows the trend in the amino acid content of adhesive proteins in nature. By comparing it with the proteins we have identified, we can judge whether they are potential adhesive proteins. We present the comparison between the positive and negative data as a bar chart.

Figure 10. AACP analysis of the positive data compared with the negative data
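
  The AACP statistic can be computed along the following lines. This is a sketch with our own function names (`aacp`, `composition_difference`) and toy sequences, not the data behind Figure 10.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aacp(sequences):
    """Percentage of each standard amino acid across a whole set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (100.0 * counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def composition_difference(positive_seqs, negative_seqs):
    """Positive minus negative percentage, largest enrichment first."""
    pos, neg = aacp(positive_seqs), aacp(negative_seqs)
    diff = {aa: pos[aa] - neg[aa] for aa in AMINO_ACIDS}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


# Toy sequences only, chosen so the arithmetic is easy to follow.
toy_positive = ["PPSS", "TTAA"]
toy_negative = ["AAAA", "GGGG"]
toy_profile = aacp(toy_positive)
toy_enrichment = composition_difference(toy_positive, toy_negative)
```

  Sorting by the difference puts the amino acids that are most over-represented in the positive set at the top of the list, which is the same comparison the bar chart makes visually.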

Conclusion of AACP

  From Figure 10, we can quickly see how the positive and negative data differ. For instance, the positive data contain more P, S, and T amino acids. Possible reasons for the enrichment of these three amino acids are as follows:


P Amino Acid rich:
  1. Proline has cyclic side chains and rigidity, which helps to stabilize the protein structure.[2]
  2. Proline-rich regions are found in proteins involved in biofilm adhesion.[3]
S Amino Acid rich:
  Serine is usually found in protein regions that require flexibility. It can participate in hydrogen bonding interactions and help form secondary structures such as β-sheets and rings. Proteins that need to undergo structural changes or exhibit flexibility may have higher levels of serine content. [4]
T Amino Acid rich:
  Threonine is associated with focal adhesion which connects cells to the substrate, anchoring the cell to the attached cellular substrate.[5]


b. AAPC Comparison and Visualization

  AAPC (Amino Acid Pair Component), introduced above as the AAPC encoding method, uses the number of amino acid pairs in the protein sequence as a feature. From these counts, we can derive a 20*20 table of amino acid pairs, or a 26*26 table when the less common or uncertain amino acid symbols are included.

Figure 11. 20*20 Matrix of AAPC symbols

  As the figure above shows, a 20*20 table of numbers is too complicated to read, so we visualize it as a heatmap. We rescale the data to a range from 0 to 1000 and represent the values with a color gradient from light brown to black; the results are as follows:

Figure 12. Visualized AAPC Matrix
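
  One way to rescale a count matrix to the 0 to 1000 range is min-max scaling, sketched below. The text does not say which rescaling was actually used, so min-max scaling and the function name `scale_to_range` are assumptions.

```python
def scale_to_range(matrix, upper=1000):
    """Min-max scale all values of a 2D list to integers in [0, upper]."""
    flat = [value for row in matrix for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:  # constant matrix: avoid division by zero
        return [[0 for _ in row] for row in matrix]
    span = highest - lowest
    return [[round((value - lowest) * upper / span) for value in row]
            for row in matrix]


# Toy 2*2 matrix standing in for the 20*20 pair-count table.
toy_counts = [[2, 4], [6, 10]]
toy_scaled = scale_to_range(toy_counts)
```

  The scaled values can then be mapped directly onto a color gradient, with 0 as the lightest color and 1000 as the darkest.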

c. InterPro

  InterPro is a tool for analyzing protein characteristics. It can utilize existing data from different open-source databases to compare an input sequence. We can find out as much as possible about the characteristics of the sequence, such as which family it belongs to, which functional domains it has, and so on. Using this tool, we can obtain more detailed information about proteins and analyze them for potential widespread adhesion.


d. The MEME Suite

  The MEME Suite provides multiple tools for analyzing motifs. A motif is a subsequence that occurs repeatedly in one or more input DNA or protein sequences and may carry a functional or structural feature.

Figure 13. Flowchart of MEME Suite tool.

  First, we used two motif discovery tools, MEME (Multiple EM for Motif Elicitation) and STREME (Sensitive, Thorough, Rapid, Enriched Motif Elicitation), to analyze the input sequences. Second, since the motifs these tools output are computed from the input and are not necessarily known motifs, we used the motif comparison tool Tomtom to compare them with Prosite, a protein motif database. Knowing which known motif is most similar to each output motif lets us infer its possible function. This helps us understand the functional features of the positive pure data at the sequence level.
  MEME requires the motif length and the number of motifs to be set manually. It then searches the input sequence set for that number of motifs of that length. After trying several combinations of parameters, we decided on the following settings:
  Motif length: 4~15 amino acids, since a motif should be a short subsequence.
  Number of motifs: we tested values between 3 and 9. If the number is too small, some features are missed; if it is too large, the motif scores become too low to be credible. In the end, we found 5 motifs.
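
  For the final manual check, a quick way to see how many sequences contain a given consensus motif is sketched below. This is only a rough approximation under our own assumptions: MEME represents motifs as position-specific probability matrices and identifies sites statistically, so a plain consensus-string match like this one would not be expected to reproduce the percentages MEME reports. The function names and toy sequences are hypothetical.

```python
import re

# IUPAC ambiguity codes for amino acids, e.g. J means leucine or isoleucine.
AMBIGUITY = {"J": "[IL]", "B": "[DN]", "Z": "[EQ]", "X": "."}


def motif_to_regex(motif):
    """Turn a consensus motif string into a compiled regular expression."""
    parts = [AMBIGUITY.get(symbol, re.escape(symbol)) for symbol in motif.upper()]
    return re.compile("".join(parts))


def motif_coverage(motif, sequences):
    """Percentage of sequences that contain at least one match of the motif."""
    if not sequences:
        return 0.0
    pattern = motif_to_regex(motif)
    hits = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return 100.0 * hits / len(sequences)


# Toy example: "AJC" matches "AIC" and "ALC" but not "AAC".
toy_sequences = ["GGAICGG", "ALC", "AAC", "TTTT"]
toy_coverage = motif_coverage("AJC", toy_sequences)
```

  The same helper could be applied to the classifier's output sequences to see which of them carry a consensus found in the positive data.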


Figure 14. MEME result motifs-1

motif: EEGJTVFAPSDEAFK
Percentage of input sequences that contain this motif: 39.4%
The most similar known sequence from Prosite: PS00647 (THYMID_PHOSPHORYLASE)
The function of the most similar known sequence: Thymidine and pyrimidine-nucleoside phosphorylases signature

Figure 15. MEME result motifs-2

motif: VYGVDGVLLPEELFG
Percentage of input sequences that contain this motif: 21.2%
The most similar known sequence from Prosite: PS01095 (GH18_1)
The function of the most similar known sequence: Glycosyl hydrolases family 18 (GH18) active site signature

Figure 16. MEME result motifs-3

motif: EPVQJLLYHVJPEYY
Percentage of input sequences that contain this motif: 25.8%
The most similar known sequence from Prosite: PS00435 (PEROXIDASE_1)
The function of the most similar known sequence: Peroxidases proximal heme-ligand signature.

Figure 17. MEME result motifs-4

motif: DFIHTLLHYGGYNEM
Percentage of input sequences that contain this motif: 6.1%
The most similar known sequence from Prosite: PS01260 (BH4_1)
The function of the most similar known sequence: Apoptosis regulator, Bcl-2 family BH4 motif signature

Figure 18. MEME result motifs-5

motif: NITAILEKAPQFSTF
Percentage of input sequences that contain this motif: 19.7%
The most similar known sequence from Prosite: PS01236 (PDXT_SNO_1)
The function of the most similar known sequence: PdxT/SNO family signature

STREME automatically calculates all possible motifs in the input sequence set and ranks the motifs by their p-value, frequency of occurrence, and more. Finally, the motifs with the higher overall score will be output.

Figure 19. STREME result motif

motif: DIYTDGRI
p-value: DIYTDGRI
Percentage of input sequences that contain this motif: 19.7%

Figure 20. STREME Motif Distribution

The most similar known sequence from Prosite: PS00887 (ILVD_EDD_2)
The function of the most similar known sequence: Dihydroxy-acid and 6-phosphogluconate dehydratases signature 2


Conclusion of MEME suite

As a result, the MEME Suite did not find motifs with a significant adhesive function in the positive pure data, which means that:
1. Each adhesive protein's mechanism may differ, so it is difficult to find a common subsequence related to adhesion function.
2. The output sequences of the classifier trained on this dataset may have features similar to the motifs we analyzed previously. Therefore, these motifs can serve as one reference in the final manual check of whether an output sequence resembles the positive pure data.

Based on the results of the feature analysis described above, several key insights can be drawn:
1. The factors contributing to protein adhesiveness are complex and diverse, which makes it difficult to identify a highly prevalent motif.
2. In terms of amino acid composition, proline (P), serine (S), and threonine (T) are enriched in the positive data and may contribute to a protein's adhesive properties. Therefore, proteins with high levels of these three amino acids are more likely to be adhesive.

In light of these findings, we decided to select the top four proteins with the highest predictive scores generated by the model for further in-depth investigation.




Result

  Using the machine-learning classifier and protein feature analysis, we picked the proteins with the highest predicted probabilities from our classifier and examined their Function[CC] annotations in UniProt. In the end, we found four high-probability proteins that UniProt does not describe as having general adhesive functions.


1. EcpA(P35645)

Uniprot function[cc]: Fimbrial protein EcpA
length(Amino acid): 159

Figure 21. EcpA(P35645) AAC comparison with positive database

Figure 22. EcpA(P35645) Structure Prediction with AlphaFold

  Our Conclusion in InterPro: EcpA belongs to a pilus family associated with cell adhesion, so it has the potential for general adhesion.

Figure 23. EcpA(P35645) InterPro Analyzing Result

Family belonging to EcpA:
  Pilin is a component of a polar flexible filament, which involves cell adhesion, microcolony formation, twitching motility, and transformation.


2. Nid1(P08460)

Uniprot function[cc]: Sulfated glycoprotein widely distributed in basement membranes and tightly associated with laminin. Also binds to collagen IV and perlecan. It probably has a role in cell-extracellular matrix interactions.
length(Amino acid): 324

Figure 24. Nid1(P08460) AAC comparison with positive database

Figure 25. Nid1(P08460) Structure Prediction with AlphaFold

Our Conclusion in InterPro: Nid1 contains a domain of unknown function that is also present in some cell adhesion glycoproteins, so we hypothesize that it has the potential for general adhesion.

Figure 26. Nid1(P08460) InterPro Analyzing Result

Domain belonging to Nid1:
NIDO domain, extracellular domain of unknown function. Some cell adhesion glycoproteins are known to contain a NIDO domain.


3. epd2(P28771)

Uniprot function[cc]: This may play a role in neural plasticity. May be involved during axon regeneration.
length(Amino acid): 221

Figure 27. epd2(P28771) AAC comparison with positive database

Figure 28. epd2(P28771) Structure Prediction with AlphaFold

Our Conclusion in InterPro: The family and domain of this protein suggest that it may be associated with collagen fibers, with the potential for general adhesion.

Figure 29. epd2(P28771) InterPro Analyzing Result

Family belonging to epd2:
Ependymin, found predominantly in the cerebrospinal fluid of teleost fish. A bound form of the glycoproteins is associated with fibrils.


4. zig-4(G5ECB1)

Uniprot function[cc]: Required for maintaining the axon position of PVQ and PVP neurons post embryonically in the ventral nerve cord (VNC) by preventing axons drifting into the opposite side of the VNC that could occur during body growth and movement.
length(Amino acid): 253

Figure 30. zig-4(G5ECB1) AAC comparison with positive database

Figure 31. zig-4(G5ECB1) Structure Prediction with AlphaFold

Our Conclusion in InterPro: zig-4 belongs to a cell adhesion family and also has a cell adhesion domain, so it has the potential for general adhesion.

Figure 32. zig-4(G5ECB1) InterPro Analyzing Result

Family belonging to zig-4:
Basigin-like, a family of neuronal cell adhesion molecules that belong to the immunoglobulin superfamily.

Domain belonging to zig-4:
EPENDYMIN, found predominantly in the cerebrospinal fluid of teleost fish. A bound form of the glycoproteins is associated with the extracellular matrix, probably with collagen.


After predicting and analyzing these four proteins, we expressed them in E. coli to verify the correctness of our classifier. We also confirmed that they have adhesive properties and can be used as adhesive proteins in CoPlat. More details about the functional test of potential adhesive proteins are in the Result part.