Model

Introduction

Because of the time limits of iGEM and the time it takes to obtain results in the laboratory, we used computational tools both to take our project further and to support our laboratory findings. The aim of our project was to improve the degradation of PFOA by DeHa catalysis.

Fig 0) The method for predicting protein structure. The open-source tool ColabFold is used, which applies the AlphaFold2 model and MMseqs2 to predict structures. These are then verified by AlphaFold and with FoldSeek.

Computational analyses of the dehalogenases require protein structures. With the intention of performing docking simulations, we used machine learning tools to predict the structures of the four dehalogenases.

Method

Starting from the DNA sequences from the USAFA21 team, the protein structures were predicted using ColabFold. Each structure was then corrected and minimized with the preparation wizard in Schrödinger’s Maestro for further analysis. To check the reliability of the predictions, we considered the quality measures reported by AlphaFold and additionally used the structural alignment tool FoldSeek.
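The first step above, going from a DNA sequence to the protein sequence that a structure predictor takes as input, can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and an in-frame coding sequence; the function names `translate` and `to_fasta` are hypothetical and not part of our pipeline.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (assumption:
# the dehalogenase genes use the standard table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unknown codons (for example containing N) become X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(name, protein):
    """Format a protein sequence as a single FASTA record."""
    return f">{name}\n{protein}\n"
```

Calling `to_fasta("example", translate("ATGGCTAAATGA"))` produces a one-record FASTA string for the short peptide MAK, which has the same form as the input a structure prediction tool expects.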

ColabFold

ColabFold is open-source software that combines AlphaFold2 with MMseqs2 to perform rapid structure predictions [1].

AlphaFold searches databases for similar sequences and homologous structures, builds a multiple sequence alignment from them, and uses this information to predict the structure from the primary sequence [2].

To speed up the search for similar sequences and homologous structures, ColabFold replaces AlphaFold's native search with MMseqs2. The ColabFold authors report that this makes the search 40 to 60 times faster than in the native AlphaFold2 pipeline [1].

Evaluation

The model reports three quality measures for evaluating its results: (1) Predicted Local Distance Difference Test, (2) Predicted Aligned Error, and (3) Sequence Coverage.

Predicted Local Distance Difference Test (pLDDT)

The pLDDT is a per-residue estimate of how well the local distances between residues in the predicted structure would agree with the true structure, based on the lDDT score [3]. A high pLDDT indicates confidence in the local structure around a residue, and therefore in the structure of possible functionally important domains.

Scores range from 0 to 100. Values above 90 indicate very high confidence, while values below 50 indicate very low confidence.

Fig 1) The pLDDT results for DeHa 1.

For DeHa 2, the pLDDT score drops below 50 at the terminal region of the structure. The prediction carries little structural value there, but the low scores indicate a possibly disordered domain.

Like DeHa 2, the termini of DeHa 4 tend towards scores below 50, which removes structural value from these regions.

Like DeHa 2 and DeHa 4, the first ~20 residues and the terminal region of DeHa 5 lack structural value because the scores are below 50. This is most pronounced in the structure of DeHa 5, with its possibly disordered tail region.
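Low-confidence stretches like those described above can also be located programmatically. AlphaFold-based tools store the per-residue pLDDT in the B-factor column of the output PDB file. The sketch below reads that column for the CA atoms and groups consecutive residues below a cutoff; the function names and the cutoff of 50 are illustrative choices, and the parsing assumes standard fixed-width PDB `ATOM` records.

```python
def plddt_per_residue(pdb_text):
    """Return {residue_number: pLDDT} read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the 0-based slices below.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_stretches(scores, cutoff=50.0):
    """Group consecutive residues with pLDDT below `cutoff` into (start, end) ranges."""
    stretches = []
    start = prev = None
    for resnum in sorted(scores):
        if scores[resnum] >= cutoff:
            continue
        if start is None:
            start = resnum
        elif resnum != prev + 1:
            # Gap in the run of low-scoring residues: close the old stretch.
            stretches.append((start, prev))
            start = resnum
        prev = resnum
    if start is not None:
        stretches.append((start, prev))
    return stretches
```

A stretch that begins at the first residue or ends at the last residue corresponds to the low-scoring terminal regions discussed above.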

Predicted Aligned Error

pLDDT does not assess the position of domains relative to each other. For this, the Predicted Aligned Error (PAE) is used. It estimates, for every pair of residues, the expected positional error of one residue when the predicted and true structures are aligned on the other.

The scores are shown as a 2D plot with the residues on both axes, where the colour gives the expected error for each residue pair. A high error does not mean that two residues are far apart; it means that their position relative to each other is uncertain. The colour shifts from blue towards red as the error increases. A lower score is preferred.

Fig 2) The Predicted Aligned Error plots for the dehalogenases. Starting from the top, the plots are from DeHa 1, DeHa 2, DeHa 4 and DeHa 5.

The strong red bands along the left side and top of the plot show that residues 0 to ~10 have very little positional confidence relative to all other residues. In other words, the position of this N-terminal segment relative to the rest of the protein is poorly determined. This agrees with the pLDDT scores and points to a possibly disordered domain at the N-terminus.
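The red bands described above can be quantified by averaging the PAE over blocks of the matrix. The sketch below compares the error between an N-terminal segment and the rest of the protein with the error within the rest of the protein. It assumes the PAE matrix is already loaded as a square list of lists (for example from the `pae` entry of a prediction's scores JSON file, which is an assumption about the file layout); the function names and the default segment length of 10 are illustrative.

```python
def mean_pae(pae, rows, cols):
    """Average PAE over the block pae[i][j] for i in rows and j in cols (0-based)."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def n_terminal_vs_rest(pae, n_term_len=10):
    """Compare PAE between an N-terminal segment and the remaining residues."""
    n = len(pae)
    head = range(0, n_term_len)
    body = range(n_term_len, n)
    # The PAE matrix is not symmetric, so both off-diagonal blocks are averaged.
    between = (mean_pae(pae, head, body) + mean_pae(pae, body, head)) / 2
    return {
        "within_body": mean_pae(pae, body, body),
        "head_vs_body": between,
    }
```

A `head_vs_body` value much larger than `within_body` corresponds to the red bands at the edges of the plot: the segment itself may be folded, but its placement relative to the rest of the structure is uncertain.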

Sequence Coverage

The sequence coverage plot shows the number of homologous sequences found along the length of the query sequence. The colours indicate the level of sequence identity to the query. It can be a useful measure for predicting conserved domains if the level of identity to homologous proteins is high.

Fig 3) The Sequence Coverage results for DeHa 1.

Compared to the other dehalogenases, DeHa 5 has far fewer homologous sequences covering the beginning of its sequence. The other DeHas show good coverage over their entire sequences, which allows more confident conclusions.
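The two quantities shown in a sequence coverage plot can be computed from a multiple sequence alignment as sketched below. This is a simplified illustration that assumes every sequence is already aligned to the query, has the same length as the query, and uses `-` for gaps; the function names are hypothetical.

```python
def coverage_per_position(query, aligned):
    """Count, per query position, how many aligned sequences have a residue there."""
    counts = [0] * len(query)
    for seq in aligned:
        for i, aa in enumerate(seq[:len(query)]):
            if aa != "-":
                counts[i] += 1
    return counts


def identity_to_query(query, seq):
    """Fraction of query positions at which the aligned sequence has the same residue."""
    matches = sum(1 for q, s in zip(query, seq) if q == s)
    return matches / len(query)
```

A drop in `coverage_per_position` at the start of a sequence is what the plot shows for the beginning of DeHa 5, where few homologous sequences align.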

Verification with FoldSeek

To further verify the structures generated by ColabFold, FoldSeek was used to find structurally similar proteins. This indicates whether the predicted structures are consistent with the expected functions of the enzymes.

FoldSeek is a tool that aligns tertiary structures to those found in protein structure databases. In this verification, uncharacterized proteins from the databases were ignored, since we are interested in linking function to the predicted structures.

Querying the dehalogenases all gave results related to haloacid dehalogenases, with very similar structures. The alignment score (TM-score) and RMSD are given in the table below, and quantify the quality of the alignments:

Table 1) An overview of the scoring from FoldSeek. The TM-score measures the quality of the structural alignment, while RMSD is the root-mean-square deviation. The closer the TM-score is to 1 and the lower the RMSD, the better the alignment.
Enzyme    TM-score    RMSD (Å)
DeHa 1    0.88061     2.21
DeHa 2    0.86421     2.7
DeHa 4    0.96049     1.27
DeHa 5    0.73167     3.39
Fig 4) The alignment between DeHa 1 and a structurally similar haloacid dehalogenase.
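To make the two scores in Table 1 concrete, the sketch below computes RMSD and a TM-score from the distances between aligned residue pairs after superposition. It is a simplified illustration, not FoldSeek's implementation: it evaluates a single given superposition, whereas the reported TM-score is the maximum over superpositions. The function names are hypothetical, and distances are assumed to be in ångström.

```python
import math


def rmsd(distances):
    """Root-mean-square deviation over the aligned residue pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances, target_length):
    """TM-score of one superposition, normalised by the target length."""
    # Length-dependent distance scale d0; short chains use a floor of 0.5.
    if target_length <= 21:
        d0 = 0.5
    else:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

Because each residue pair contributes at most 1 and the sum is divided by the full target length, unaligned or poorly fitting regions lower the TM-score. This is consistent with DeHa 5, whose deviating terminal region gives it the lowest TM-score and highest RMSD in Table 1.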

Concluding Remarks

The pLDDT scores indicate reliable predictions for most of the structure of each dehalogenase. For DeHa 2, 4 and 5 the termini lack some structural value, especially for DeHa 5. The same can be said about DeHa 5's Predicted Aligned Error score, where a large uncertainty is found in the relative position of two domains, possibly indicating a disordered region. This region in DeHa 5 also has fewer hits among homologous sequences in the Sequence Coverage.

From the FoldSeek verification it is found that the dehalogenases are similar in tertiary structure to enzymes with functions related to the proposed function of the dehalogenases. DeHa 4 was found to have 45% sequence identity to a fluoroacetate dehalogenase specifically, while the other queried dehalogenases were found to have ~25% sequence identity to haloacid dehalogenases.

DeHa 4 has an almost identical structure to the fluoroacetate dehalogenase, as shown by the TM-score and RMSD. In the alignment with DeHa 5, the long, disordered terminal region is present, as suspected from the structure prediction evaluation. This deviation is also reflected in the scoring of DeHa 5, which has a lower TM-score of 0.73167 and a higher RMSD of 3.39. This disordered region could be problematic in the docking investigations if it is found to be involved with the active site (see Docking subpage).

These results support the validity of the structure predictions for the dehalogenases, as seen both in the evaluation from ColabFold and in the functional likeness found with FoldSeek. This strengthens the subsequent investigations that use the predicted structures.

  1. Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., & Steinegger, M. (2022). ColabFold: making protein folding accessible to all. Nature Methods, 19(6), 679-682. https://doi.org/10.1038/s41592-022-01488-1
  2. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., . . . Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. https://doi.org/10.1038/s41586-021-03819-2
  3. Mariani, V., Biasini, M., Barbato, A., & Schwede, T. (2013). lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21), 2722-2728. https://doi.org/10.1093/bioinformatics/btt473
  4. Lu, S., Wang, J., Chitsaz, F., Derbyshire, M. K., Geer, R. C., Gonzales, N. R., Gwadz, M., Hurwitz, D. I., Marchler, G. H., Song, J. S., Thanki, N., Yamashita, R. A., Yang, M., Zhang, D., Zheng, C., Lanczycki, C. J., & Marchler-Bauer, A. (2020). CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res, 48(D1), D265-D268. https://doi.org/10.1093/nar/gkz991
  5. Wang, Y., Xiang, Q., Zhou, Q., Xu, J., & Pei, D. (2021). Mini Review: Advances in 2-Haloacid Dehalogenases [Review]. Frontiers in Microbiology, 12. https://doi.org/10.3389/fmicb.2021.758886
  6. Ghattas, M. A., Raslan, N., Sadeq, A., Al Sorkhy, M., & Atatreh, N. (2016). Druggability analysis and classification of protein tyrosine phosphatase active sites. Drug Des Devel Ther, 10, 3197-3209. https://doi.org/10.2147/dddt.S111443
  7. Ghattas, M. A., Raslan, N., Sadeq, A., Al Sorkhy, M., & Atatreh, N. (2016). Druggability analysis and classification of protein tyrosine phosphatase active sites. Drug Des Devel Ther, 10, 3197-3209. https://doi.org/10.2147/dddt.S111443
  8. Isom, D. G., Castañeda, C. A., Cannon, B. R., Velu, P. D., & García-Moreno E., B. (2010). Charges in the hydrophobic interior of proteins. Proceedings of the National Academy of Sciences, 107(37), 16096-16100. https://doi.org/10.1073/pnas.1004213107
  9. Adamu, A., Wahab, R. A., Shamsir, M. S., Aliyu, F., & Huyop, F. (2017). Deciphering the catalytic amino acid residues of l-2-haloacid dehalogenase (DehL) from Rhizobium sp. RC1: An in silico analysis. Comput Biol Chem, 70, 125-132. https://doi.org/10.1016/j.compbiolchem.2017.08.007
  10. U.S. Environmental Protection Agency. The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. https://comptox.epa.gov/dashboard/
  11. Halgren, T. A., Murphy, R. B., Friesner, R. A., Beard, H. S., Frye, L. L., Pollard, W. T., & Banks, J. L. (2004). Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening. J. Med. Chem., 47, 1750-1759.
  12. Li, J., Abel, R., Zhu, K., Cao, Y., Zhao, S., & Friesner, R. A. (2011). The VSGB 2.0 model: a next generation energy model for high resolution protein structure modeling. Proteins, 79(10), 2794-2812. https://doi.org/10.1002/prot.23106
  13. Jacobson, M. P., Pincus, D. L., Rapp, C. S., Day, T. J. F., Honig, B., Shaw, D. E., & Friesner, R. A. (2004). A Hierarchical Approach to All-Atom Protein Loop Prediction. Proteins: Structure, Function and Bioinformatics, 55, 351-367.
  14. Friesner, R. A., Murphy, R. B., Repasky, M. P., Frye, L. L., Greenwood, J. R., Halgren, T. A., Sanschagrin, P. C., & Mainz, D. T. (2006). Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein-Ligand Complexes. J. Med. Chem., 49, 6177-6196.