K-Spherefit | NJTech-China-A

K-SphereFit

Methods
High-dimensional sphere fitting to cluster branches
Results

C. K-SphereFit for quantitative evaluation of protein stability

Molecular simulation is an important tool for analyzing protein stability. However, the analysis needs to integrate the information of numerical characteristics and spatial structure of proteins throughout the simulation process, but this often requires a rich knowledge base of biology, mathematics and computers, which is not friendly to most researchers. We aim to create a model that directly judges the stability of proteins during the simulation process through quantitative metrics.

To further analyze the conformational features and properties of proteins throughout the period, our initial idea is to model the conformational data of proteins throughout the period and quantify the degree of separation between different clusters of proteins and the aggregation density of each cluster itself by five metrics: RMSD, NC, Rg, Hbond, and Sasa.

Since our model uses K-means clustering^[7] and high-dimensional sphere fitting, we call it the "K-SphereFit" model.

3.1 Methods

3.1.1 K-means algorithm for clustering

We went through the conformational data of the proteins and selected five of them, RMSD, NC, Rg, Hbond and Sasa, and analyzed the proteins by clustering them using the K-means algorithm and the silhouette coefficients, and the results of the clustering presented are the different cluster branches of the proteins as well as their number.

First, we normalize the protein conformation data to eliminate the effect of the scale between different metrics, and the data matrix X with n objects and m metrics is as follows:

The normalized data matrix is Z, and each element in Z is:

Then we perform K-means clustering on the normalized conformational data, and the flow of the K-means algorithm is shown in Figure 19:

Fig. 19 Flowchart of K-means algorithm

We choose the Euclidean distance as the distance D between two data objects:

where n is the number of indicators, here n = 5, x_i and y_i are the conformational data for protein indicator i at two different times.

We carry out K-means clustering for the two proteins before and after respectively, for the selection of the initialization class number k, we carry out the test from 2 to 8 sequentially, and judge the optimal number of clusters by calculating the silhouette coefficients obtained from different class numbers k. The larger the silhouette coefficients^[8] are, the better the clustering is represented by the value of k. The formula for the silhouette coefficients is as follows:

For each point, i is a sample in the clustered data, b_i is the minimum value of the distance from i to all samples of other clusters, and a_i is the average value of the distance from i to its own cluster. Finally, the average value of the contour coefficient is calculated for all the sample points.

3.2 High-dimensional sphere fitting to cluster branches

In Step 1, we have clustered the conformational data of the proteins, and we can get that there are two cluster branches for each protein, next, we are going to fit a high-dimensional sphere^[9] to these two cluster branches to explore the distances between the centers of mass of the different cluster branches, as well as the respective densities of the different cluster branches.

For the average distance between the centers of mass of different cluster branches, we can calculate it according to the formule:

Where n is the number of indicators, here n=5; k is the number of clusters, here k=2; center_ij is the conformational data of indicator l at the center of mass of the i-th cluster and center_ij is the conformational data of indicator l at the center of mass of the j-th cluster.

We envision fitting each cluster branch into a high-dimensional sphere, and we first need to calculate the radius of each cluster, which we alternatively take to be the average of the distances to the center of mass of all the points in a class, with the following formula:

Where n is the number of indicators, here n=5; k is the ordinal number of the cluster; and x_j is the conformational data of the protein indicator j at different times,center_ij is the conformational data of indicator j at the center of mass, and num is the number of conformational data points contained in a given cluster.

We are told that the formula for the volume of an n-dimensional sphere is as follows^[10]:

Where ⌈x⌉ stands for rounding upward to x and ⌊x⌋ stands for rounding downward to x.

Since we are clustering by five metrics, we can fit the cluster branches to a five-dimensional sphere, which can be obtained according to the volume equation for the five-dimensional sphere as:

After that we can calculate the density of each cluster as:

In our previous discussion, we mentioned that the conformation of a stable protein should exhibit an average distribution. In K-SphereFit, this translates to achieving a high density of clusters and minimizing the distances between different clusters. As a result, the following formula can be formulated:

By definition, this is an extremely large indicator, which means that a larger value means that the protein is approximately stable^[11].

3.3 Results

3.3.1 Analysis of simulation results of LSPETase in different oxidative environments using models

The histograms of the silhouette coefficients versus the number of clusters for the proteins in groups a and b are shown in Figures 20 and 21:

Fig. 20 Number of clusters for group a

Fig. 21 Number of clusters for group b

From Fig. 20 and Fig. 21, we can see that the silhouette coefficients for both proteins of groups a and b present the maximum value when the number of clusters is 2. Therefore, we both selected the number of clusters as 2 for K-means clustering, i.e., we obtained the two main cluster branches of proteins, and drew parallel line graphs of their clustering indexes, as shown in Fig. 22 and Fig. 23

Fig. 22 Parallel folded graphs for group a

Fig. 23 Parallel folded graphs for group b

From Figures 22 and 23, we can see that both group a and group b have more similar clustering bases, in which those with higher RMSD, lower NC, higher Rg and higher Sasa features are one group, and the rest are the other, and the results of clustering based on the RMSD, NC and Rg indexes are more significant, which explains the reasonableness of our clustering results. However, the effect of clustering based on the Hbond indicator is not very significant, i.e., the Hbond values of both classes of data are more similar.

Finally, to explore the relationship between multiple metrics, we drew a matrix of grouped scatterplots for group a and group b proteins, i.e., combining two and two metrics, and drawing the relationship between two different features through a scatterplot, as shown in Figure 24 and 25:

Fig. 24 Scatterplot matrix of groupings for group a

Fig. 25 Scatterplot matrix of groupings for group b

Where the main diagonal line represents a ladder plot of the probability density of the distribution of a given metric, and the "×" in the plot represents the location of the center of mass of each cluster branch.

From the figure, it can be seen that the characteristics between the indicators of group b proteins are relatively aggregated compared to group a proteins, and it can be seen initially that group b proteins are more stable.

By calculating, we can get the following results (all values calculated after normalizing the original data) as shown in Table 1:

The stability analysis of LSPETase under denaturing and oxidizing conditions by the present model is in agreement with the above as well as with the experiments. The oxidized condition showed better stability

3.3.2 Stability of WT (c), Q126L (d), M157S (e), and W185F (f) in aqueous solution using model analysis

This set of four simulations was performed in an aqueous solution, and the rest of the simulation information is consistent with that described above.

Fig. 26 Number of clusters for group c

Fig. 27 Number of clusters for group d

Fig. 28 Number of clusters for group e

Fig. 29 Number of clusters for group f

From Fig. 26 to Fig. 29, we can see that the silhouette coefficients for the proteins of both groups c and d present their maximum values when the number of clusters is 3, and the silhouette coefficients for the proteins of both groups e and f present their maximum values when the number of clusters is 2. Therefore, we selected the number of clusters of 3 and 2 for the K-means clustering and drew parallel line graphs of their clustering metrics, which are shown in Fig. 30 and Fig. 33:

Fig. 30 Number of clusters for group c

Fig. 31 Number of clusters for group d

Fig. 32 Number of clusters for group e

Fig. 33 Number of clusters for group f

As we can see from Figures 30 and 33, the basis of clustering is more similar for both groups c and d. As we can see from Figures 28 and 29, the basis of clustering is more similar for both groups e and f.

Finally, to explore the relationship between multiple indicators, we drew a scatterplot matrix of grouped proteins of groups c, d, e, and f, i.e., combining two and two indicators, and drawing the relationship between two different features through a scatterplot as shown in Figures 34 to 37:

Fig. 34 Scatterplot matrix of groupings for group c

Fig. 35 Scatterplot matrix of groupings for group d

Fig. 36 Scatterplot matrix of groupings for group e

Fig. 37 Scatterplot matrix of groupings for group f

From the figure, it can be seen that the characteristics between the indicators of the proteins in groups d, e, and f are relatively dispersed compared to the proteins in group c. Initially, it can be seen that the stability of the proteins in group c is higher.

By calculating we can get the following results (all values calculated after normalizing the original data) as shown in Table 3 and Table 4:

From the present model we can speculate that the reason why the binding free energies of Q126L (d), M157S (e), and W185F (f) are better than the wild type but the activities shown in the experiments are not high may be since the structures of the complexes are not stabilized in the degradation system.

References

1. Seo, H.; Kim, S.; Son, H. F.; Sagong, H.; Joo, S.; Kim, K. Production of Extracellular Petase From Ideonella Sakaiensis Using Sec-Dependent Signal Peptides in E. Coli. Biochem. Biophys. Res. Commun. 2019, 508(1), 250-255. DOI: 10.1016/j.bbrc.2018.11.087.

2. Cui, L.; Qiu, Y.; Liang, Y.; Du, C.; Dong, W.; Cheng, C.; He, B. Excretory Expression of Ispetase in E. Coli by an Enhancer of Signal Peptides and Enhanced Pet Hydrolysis. Int. J. Biol. Macromol. 2021, 188, 568-575. DOI: 10.1016/j.ijbiomac.2021.08.012.

3. Katyal, N.; Deep, S. Revisiting the Conundrum of Trehalose Stabilization. Physical chemistry chemical physics : PCCP. 2014, 16(48), 26746-26761. DOI: 10.1039/c4cp02914c (accessed 2014).

4. Lu, D.; Liu, Z. Dynamic Redox Environment-Intensified Disulfide Bond Shuffling for Protein Refolding in Vitro: Molecular Simulation and Experimental Validation. The Journal of Physical Chemistry B. 2008, 112(47), 15127-15133. DOI: 10.1021/jp804649g.

5. Arnittali, M.; Rissanou, A. N.; Amprazi, M.; Kokkinidis, M.; Harmandaris, V. Structure and Thermal Stability of Wtrop and Rm6 Proteins through All-Atom Molecular Dynamics Simulations and Experiments. International Journal of Molecular Sciences. 2021, 22(11), 5931. DOI: 10.3390/ijms22115931.

6. Wu, J.; Lv, J.; Zhao, L.; Zhao, R.; Gao, T.; Xu, Q.; Liu, D.; Yu, Q.; Ma, F. Exploring the Role of Microbial Proteins in Controlling Environmental Pollutants Based On Molecular Simulation. Sci. Total Environ. 2023, 905, 167028. DOI: 10.1016/j.scitotenv.2023.167028.

7. Feng, A. Automotive Product Analysis Based On Mp-Dp-Kmeans Clustering. IEEE, 2023;pp 305-311.

8. Bagirov, A. M.; Aliguliyev, R. M.; Sultanova, N. Finding Compact and Well-Separated Clusters: Clustering Using Silhouette Coefficients. Pattern Recognit. 2023, 135, 109144. DOI: 10.1016/j.patcog.2022.109144.

9. 3.2305.20051V1.

10. LIU, S.; PAN, Z.; CHENG, X. A Novel Fast Fractal Image Compression Method Based On Distance Clustering in High Dimensional Sphere Surface. Fractals-Complex Geom. Patterns Scaling Nat. Soc. 2017, 25(04), 1740004. DOI: 10.1142/S0218348X17400047.

11. Kang, S.; Kim, M.; Sun, J.; Lee, M.; Min, K. Prediction of Protein Aggregation Propensity Via Data-Driven Approaches. 2023. DOI: 10.48550/arxiv.2304.03293 (accessed 2023).

Code of K-SphereFit