Protein Modelling
Protein modelling is a predictive tool that can be used to forecast the effects of changes in the sequence or conformation of a protein. This can help guide experimental work and reduce the amount of trial-and-error in the lab.
Modelling in PyRosetta
PyRosetta is a powerful computational tool used in structural biology and protein engineering1. It is an interactive Python-based interface to the Rosetta molecular modeling suite, which allows users to design and predict the 3D structure of both small and large bio-molecules. PyRosetta offers a robust platform for understanding how changes in a protein's structure can influence its function and stability.
Our project harnesses the power of protein modeling to anticipate the effects of single-point mutations on the stability and function of various proteins integral to our goals. Utilizing PyRosetta, an industry-standard computational tool, we input the Protein Data Bank (PDB) files of our chosen proteins to simulate and assess the effects of these mutations. This approach not only aids in the reduction of trial-and-error in experimental work but also serves as a quality control measure, allowing us to monitor any potential mutations during our synthetic biology processes.
Objectives
The core aim of our simulation studies revolves around preserving protein functionality amidst potential mutations. Our objectives are delineated as follows:
-
Primary Goal: Ensuring that the function of a protein remains unaffected by single-point mutations. This understanding permits us to foresee the repercussions of spontaneous mutations during our synthetic biology undertakings.
-
Quality Control: By employing computational methods, we can verify the presence or absence of mutations in proteins throughout our processes. This capability is invaluable for monitoring and sustaining quality control within our project.
The motivations behind our protein selections and the design of our simulations are deeply rooted in these objectives, shaping our subsequent analyses.
Methods
Understanding the implications of point mutations on protein stability, especially when assessed using ΔΔG values, remains a vital component in the realm of protein bioinformatics. A paper by Hernández et al.2 delves into this subject, elucidating a streamlined method, KORPM, that employs a residue-based orientational potential focusing on just three backbone atoms for predicting stability changes upon mutation. The performance metrics and approach of this method, especially its efficiency and reduced risk of overfitting, underscore the rapid advancements in predictive modeling of protein stability.
In our work, we harness the capabilities of PyRosetta to compute ΔΔG values, aiming to derive insights on the effects of point mutations, especially on the MxeGyrA protein. While the approaches might seem distinct at first, there's a deeper convergence at the level of objective: both methodologies strive for accurate, efficient, and robust predictions of the effects of mutations on protein stability.
While Hernández et al. focus on circumventing the challenges posed by mutations at interfaces and other extreme conditions, such aspects can be incorporated as supplementary considerations in our research. It might serve to provide context, ensuring that our stability predictions account for broader biochemical contexts.
Protein Selection
Our simulations focused on proteins pivotal to our project's objectives, namely IL_10, Staygold, and both single and docked inteins:
-
IL_10 and Staygold: Selected as our proof-of-concept, these proteins allowed us to illustrate the feasibility of our method in predicting the consequences of mutations on protein function and stability.
-
Single and Docked Inteins: Incorporated as a part of our Cell-Free Protein Synthesis (CFPS) project, these proteins were chosen for simulation to preempt and counteract any potentially detrimental mutations during the CFPS procedure.
These proteins were chosen based on their distinct properties and roles in our project, with simulations designed to derive insights into potential mutation impacts on their function and stability.
PyRosetta Scripting
We began by inputting the Protein Data Bank (PDB) file of our protein into PyRosetta. The software carries out an extensive conformational sampling, probing a broad spectrum of the protein's possible structures and orientations. Particular emphasis was placed on understanding the role of single-point mutations and their implications for the stability of the protein, evaluated using the ΔΔG (ddG) score metric.
Inputs
The PyRosetta script's primary input is a Protein Data Bank (PDB) file. This file contains the 3D atomic coordinates of the protein under study. PDB files are a standardized format in structural biology, utilized to archive and disseminate information about 3D structures of proteins, nucleic acids, and complex assemblies.
Outputs
Upon execution, the script yields a Comma-Separated Values (CSV) file that corresponds to the input PDB file. This CSV file has headers subdivided into two segments: pre-existing data and simulation-generated data.
Pre-Existing Data
This encompasses the information extant prior to running the script based on the input PDB file:
- type: Specifies whether the amino acid is part of the main chain or a side chain.
- residue_number: Indicates the position of the amino acid in the protein sequence.
- previous_aa: The amino acid in the original protein structure.
- new_aa_1l: The one-letter code of the mutated amino acid.
- new_aa_3l: The three-letter code of the mutated amino acid.
- conversion: Describes the specific mutation that occurred, e.g., "Ala to Gly."
- new_seq: The sequence of the protein with the mutation incorporated.
Simulation-Generated Data
This set encapsulates data produced during the PyRosetta script's simulation:
- fa_score: Full-atom score, indicating the overall energy of the protein structure.
- ddg_score: Change in the free energy (ΔΔG) upon mutation, indicating the potential impact on protein stability.
- hbond_score: Energy score based on hydrogen bonding in the protein.
- sasa_score: Solvent accessible surface area score, hinting at potential protein-protein or protein-ligand interactions.
- secondary_structure: Prediction of the protein's secondary structural elements, such as alpha-helices and beta-sheets, post-mutation.
- diff_hbonds: Difference in the number of hydrogen bonds before and after the mutation.
- diff_sasa: Change in solvent-accessible surface area upon mutation.
- diff_secondary_structure: Changes in secondary structural elements due to mutation.
Results
In this section, we present the analysis and findings based on the execution of the PyRosetta script on our selected proteins. The outcomes will be discussed in light of both the pre-existing and simulation-generated data.
IL_10
ΔΔG Analysis
High ΔΔG Values: Among the top 100 ΔΔG values, it was observed that changing a non-arginine (R) amino acid to arginine caused a significant increase in the ΔΔG score. This suggests that such mutations could substantially destabilize the IL_10 protein.
Low ΔΔG Values: On the other hand, when looking at the lowest ΔΔG values, it was noticed that changing a wild-type amino acid to alanine (A), glycine (G), or serine (S) reduced the ΔΔG score substantially, implying that these mutations could stabilize the protein.
Solvent Accessible Surface Area (SASA) Analysis
High SASA Values: An examination of the SASA values that were above the mean plus two standard deviations showed that mutations which changed amino acids to tryptophan (W), tyrosine (Y), and glycine (G) seemed to increase the SASA value. Interestingly, all of the top SASA values were associated with a low ΔΔG score, indicating a potential correlation between these two parameters.
Low SASA Values: In contrast, examining the values below the mean minus two standard deviations, it was found that the majority of the low SASA values seemed to be associated with mutating tyrosine (Y), arginine (R), and lysine (K) to another amino acid. This implies that such mutations might decrease the protein's interaction with the solvent. Furthermore, residues at the 139th, 87th, and 129th positions seemed particularly susceptible to these drops in SASA, suggesting they could be crucial regions for the protein's solvent interaction.
StayGold
ΔΔG Analysis
High ΔΔG Values: Our findings showed that among the 15 mutants with the highest ΔΔG scores, 14 had proline (P) at the mutated position. This suggests that the introduction of proline can considerably destabilize the protein, indicated by the high ΔΔG score. Interestingly, the addition of this amino acid seemed to reduce the count of hydrogen bonds by one, though this impact might not be significant given the overall number of hydrogen bonds (over 250).
Low ΔΔG Values: Mutating the histidine residue (H), specifically the 382nd residue in the amino acid chain, significantly reduced the ΔΔG score. This implies that such a mutation could improve the stability of Staygold.
Residue 382: A deeper examination of residue 382 revealed that most mutations on this histidine residue resulted in a negative ΔΔG score, suggesting an increase in protein stability. The highest ΔΔG score at this position occurred when histidine was mutated to proline (P). Intriguingly, all mutations at this position caused minimal change to the SASA score.


SASA Analysis
General Observations: Most of the mutants' SASA scores were noticeably close to the wild-type's, and the range of scores was quite narrow.
High SASA Values: Our observations revealed that mutations resulting in tryptophan (W) generally led to a significantly high SASA score, implying increased exposure of the protein to the solvent.
Low SASA Values: On the other end of the spectrum, the lowest SASA scores seemed to be associated with mutations on arginine (R). These low SASA scores might indicate a decrease in the solvent interaction of these specific mutants.
Docked Intein
ΔΔG Analysis
High ΔΔG Values: Mutations that involved changing various amino acids to proline (P) resulted in higher ΔΔG scores. This indicates that the introduction of proline can considerably destabilize the protein. Notably, the mean ΔΔG score across all mutations was 340, whereas it rose to 1830 when an amino acid was changed to proline.
Low ΔΔG Values: Based on the lowest ΔΔG scores, the 56th residue, followed by the 161st, 252nd, and 333rd residues, were most likely to contribute to lower ΔΔG scores in the event of a mutation. These lower scores suggest increased protein stability following these specific mutations.

SASA Analysis
High SASA Values: Statistical analysis showed that mutations at residue numbers 22 and 257 caused significant increases in SASA scores. In general, mutations resulting in tryptophan (W) led to considerably high SASA scores. These observations suggest a possible increase in the protein's exposure to the solvent following these mutations.
Low SASA Values: Mutations at several residues, including 219, 243, 107, 79, 75, led to a significant decrease in SASA scores. Particularly, the arginine (R) residue at position 219 showed high sensitivity to mutations, indicated by a marked drop in the SASA score. This suggests a potential decrease in solvent interaction for these mutants.
Modelling with GROMACS
Introduction and Background
GROMACS3 is a widely used open source software suite for modeling proteins, in particular for doing molecular dynamics. Molecular dynamics solve Newton’s equations of motion for a system of N atoms4.
Two types of modelling was performed, a simple MD run of Staygold Intein-Protein in a box of water, and a Free energy of solvation of Staygold Intein-Protein in water.
Objectives
- Model the general stability of Staygold Intein-Protein.
- Model the thermodynamic nature of Staygold Intein-Protein in water solvent to aid in wetlab results.
Methods
These simulations were completed using GROMACS version 4.5.5 with CUDA, running on UBC's ARC Sockeye5. The OPLS6 forcefield was used for simulations pertaining to Staygold.
Basic Modelling
An exploratory MD run on the Staygold Intein-Protein in water to model its stability was done. Using the Staygold Intein-Protein pdb file provided by the wet lab team, GROMACS was used to generate the topology of the protein. Next, the protein is solvated with water in a dodecahedron box, chosen because it is one of the smallest boxes saves CPU time7. Then to remove charges in the system, counter ions of Na and Cl were added in replacement of a few water molecules. The CHARMM8 forcefeild was chosen because it's design most aligned with our protein.
To stabilise the system, energy minimisation using steepest decent9, theromostat coupling and pressure coupling was performed. Configuration files were based on GROMACS provided tutorials10 11, help from Sockeye, and GROMACS documentation12. Finally, an MD run of 500000 ns was performed; root mean square deviation, radius of gyration, and diffusion constant values were obtained.
Advanced Modelling
Next to model the Staygold Intein-Protein in water, the free energy of solvation was determined with GROMACS. The preparation steps done in basic modelling are also used in advanced modelling.
Solvation13 is the interaction of solvent and solute, with hydration being the subcase of water as the solvent. Calculating the Gibbs free energy of solvation for the Staygold Intein-Protein can give insight to the nature of its folding when in water. Protein folding is more favourable when, for example the interactions between water molecules and hydrophobic side chains are minimised.
Generally speaking, solvation requires a number of steps resulting in changes in energy. First, a cavity is created in the solvent to insert the solute. Then as the solute is inserted into the solvent, more changes in energy occur. Using GROMACS, free energy14 of solvation of a protein can be simulated using lambda (vector) states15. Each state represents the protein and solvent in a certain stage, with lambda 0 being the inistal state of protein seperated from solvent and the last lambda state being the final state of protein submerged in water.
Since each lambda state requires it’s own MD run, the time required to run these simulations on our local computers ended being around 2-3 days. On Sockeye, this was reduced to 1 day and 15 hours. Thus, for faster turnaround and to be able to debug these simulations quickly, UBC’s ARC Sockeye GPUs were used.
Results
Basic Modelling
- MSD: The mean square displacement (MSD), or diffusion constant, was obtained for both Staygold and IL-1016 with GROMACS17. This will aid in our math modelling.
- Radius of Gyration: indicator of protein structure compactness18. If the protein is stably folded, it will maintain a relatively steady value, otherwise if the protien unfolds then the value will change over time.



Advanced Modelling
The simulation of 20 lambdas were run on UBC Sockeye, taking about 5 days, with these results:

A Gibbs free energy value of 11249.38 +/- 70.50 kJ/mol (2688.666348 cal/mol) was obtained when running gmx bar
was run.
Temperature: 298 K
Detailed results in kT (see help for explanation):
lam_A lam_B DG +/- s_A +/- s_B +/- stdev +/-
0 1 241.01 0.87 30.29 0.55 33.63 0.59 227.90 121.08
1 2 249.42 0.66 28.74 0.82 33.77 0.48 204.09 19.69
2 3 257.77 0.41 27.43 0.34 33.80 0.75 156.99 33.52
3 4 266.15 0.57 27.02 0.55 34.06 0.22 146.05 9.63
4 5 273.67 0.59 27.38 0.87 35.26 0.15 177.62 47.37
5 6 275.48 1.12 31.69 0.68 36.94 1.18 243.29 100.59
6 7 280.43 0.35 29.12 0.50 36.11 0.57 186.29 99.48
7 8 280.83 0.51 34.94 0.51 38.22 0.53 435.09 952.01
8 9 284.65 0.89 32.30 0.72 38.76 0.96 296.55 132.49
9 10 286.78 0.49 34.73 0.52 41.76 0.79 323.55 365.78
10 11 287.63 0.82 35.39 1.06 45.94 1.07 371.44 5265.04
11 12 281.06 1.22 42.29 0.68 43.53 0.47 1632.14 8266.62
12 13 283.29 2.29 40.84 1.34 54.52 1.57 7783.97 147761.99
13 14 276.36 2.26 45.32 1.28 59.66 1.15 3564.57 101308.68
14 15 259.96 2.75 58.34 0.61 63.35 1.40 224492.47 1888281.83
15 16 245.36 4.22 65.97 1.46 87.78 6.43 11629420.57 8082484763.52
16 17 179.58 9.85 114.98 2.38 123.33 4.91 33300971512.31 123325441600798448.00
17 18 172.35 9.23 70.01 2.64 677.44 9.43 245.69 14.00
18 19 -141.55 1.91 40.08 2.03 98.29 2.72 333.80 9.98
Final results in kJ/mol:
point 0 - 1, DG 597.15 +/- 2.16
point 1 - 2, DG 617.99 +/- 1.64
point 2 - 3, DG 638.67 +/- 1.01
point 3 - 4, DG 659.45 +/- 1.40
point 4 - 5, DG 678.08 +/- 1.47
point 5 - 6, DG 682.56 +/- 2.78
point 6 - 7, DG 694.82 +/- 0.88
point 7 - 8, DG 695.82 +/- 1.25
point 8 - 9, DG 705.28 +/- 2.20
point 9 - 10, DG 710.56 +/- 1.21
point 10 - 11, DG 712.68 +/- 2.03
point 11 - 12, DG 696.38 +/- 3.03
point 12 - 13, DG 701.92 +/- 5.68
point 13 - 14, DG 684.73 +/- 5.61
point 14 - 15, DG 644.12 +/- 6.81
point 15 - 16, DG 607.94 +/- 10.46
point 16 - 17, DG 444.94 +/- 24.41
point 17 - 18, DG 427.04 +/- 22.88
point 18 - 19, DG -350.72 +/- 4.73
total 0 - 19, DG 11249.38 +/- 70.50
Analysis and Relation to the Project
Advanced Modelling
A solvation process is thermodynamically favored only if the overall Gibbs energy of the solution is decreased, compared to the Gibbs energy of the separated solvent and solid13.
A Gibbs free energy value of 11249.38 +/- 70.50 kJ/mol (2688.666348 cal/mol) indicates that the solvation process is not thermodynamically favored. This could mean there's unfavorable interactions with Staygold and water such as exposed hydrophobic side chains.
Thus, this should be taken into account when designing experiments with the Staygold protein, and perhaps another protein of choice may be better suited.
Footnotes
-
Chaudhury, S., Lyskov, S., & Gray, J. J. (2010). PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics, 26(5), 689-691. ↩
-
Hernández, I. M., Dehouck, Y., Bastolla, U., López-Blanco, J. R., & Chacón, P. (2023). Predicting protein stability changes upon mutation using a simple orientational potential. Structural Bioinformatics. ↩
-
GROMACS development team. (2022). Welcome to GROMACS. Welcome to GROMACS - GROMACS webpage https://www.gromacs.org documentation. https://www.gromacs.org/ ↩
-
GROMACS development team. (2023). Computational Chemistry and Molecular Modeling. Introduction - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/reference-manual/introduction.html#computational-chemistry-and-molecular-modeling ↩
-
UBC. (2023). UBC ARC SOCKEYE. UBC Arc Sockeye. https://arc.ubc.ca/ubc-arc-sockeye ↩
-
GROMACS development team. (2023b). OPLS. Force fields in GROMACS - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/user-guide/force-fields.html#opls ↩
-
GROMACS development team. (2023c). Periodic boundary conditions. Periodic boundary conditions - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/reference-manual/algorithms/periodic-boundary-conditions.html ↩
-
MacKerell Lab. (n.d.). CHARMM force field files. https://www.charmm.org/archive/charmm/resources/charmm-force-fields/ ↩
-
GROMACS development team. (2023b). Energy minimization. Energy minimization - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/reference-manual/algorithms/energy-minimization.html#em ↩
-
Blau, C., Zhmurov, A., Lindahl, E., Villa, A., & Jordan, J. (2021). Calculating Free Energy. Calculating free energy - GROMACS tutorials . https://tutorials.gromacs.org/docs/free-energy-of-solvation.html ↩
-
Lemkul, J. A. (2018). Gromacs tutorial. MD Tutorials. http://www.mdtutorials.com/gmx/free_energy/05_equil.html ↩
-
GROMACS development team. (2018). Getting good performance from mdrun. Getting good performance from mdrun - GROMACS 2018.8 documentation. https://manual.gromacs.org/2018-current/user-guide/mdrun-performance.html#gmx-mdrun-on-gpu ↩
-
Wikimedia Foundation. (2023, August 24). Solvation. Wikipedia. https://en.wikipedia.org/wiki/Solvation ↩ ↩2
-
GROMACS development team. (2023b). Free energy implementation. Free energy implementation - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/reference-manual/special/free-energy-implementation.html ↩
-
GROMACS development team. (2023b). Free energy interactions. Free energy interactions - GROMACS 2023.2 documentation. https://manual.gromacs.org/current/reference-manual/functions/free-energy-interactions.html ↩
-
RCSB Protein Data Bank. (2006, October 7). 2H24: Crystal Structure of human IL-10. RCSB PDB. https://www.rcsb.org/structure/2h24 ↩
-
GROMACS development team. (2023b). gmx msd. gmx msd - GROMACS 2023.2 documentation. https://manual.gromacs.org/documentation/current/onlinehelp/gmx-msd.html ↩
-
Lemkul, J. A. (2018a). Gromacs tutorial. MD Tutorials. http://www.mdtutorials.com/gmx/lysozyme/10_analysis2.html ↩
© 2023 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license.
The repository used to create this website is available at gitlab.igem.org/2023/ubc-vancouver.