In our dry lab work, we pursued three distinct projects, each contributing unique insights to our overall project. These projects included finding more efficient orthologues of DHCR7 (the last enzyme in the campesterol biosynthetic pathway) through enzyme screening, modeling the growth of Yarrowia lipolytica, and designing a hydrophobin. Our research project was dedicated to enabling campesterol biosynthesis within the oleaginous yeast, Yarrowia lipolytica, renowned for its potential in biotechnological applications. A pivotal step in this undertaking involved introducing a gene encoding delta-7-sterol reductase1, a component not naturally found in Yarrowia. To identify the most suitable enzyme source, we conducted a comprehensive screening process that included molecular docking and molecular dynamics studies with the enzyme substrate. Simultaneously, we collaborated closely with our wet lab team to establish a growth model for Y. lipolytica, initially focusing on assessing the impact of glucose and oil-based media to its growth. Our approach involved optimizing experimental designs to pave the way for future investigations under diverse conditions. Additionally, we embarked on a protein design campaign with the objective of designing a hydrophobin characterized by a reduced isoelectric point. This innovation was directed towards enhancing emulsification efficiency in lower pH environments, aligning with Y. lipolytica's growth preferences and the practical characteristics of fermentation processes.
The final step in the campesterol biosynthesis pathways requires a functioning delta-7-sterol reductase, which is not naturally present in Yarrowia lipolytica. This enzyme plays a crucial role in reducing ergosta-5,7-dienol to campesterol1. Several organisms feature genomes containing the DHCR7 gene, which could potentially serve our project. Detailed knowledge and characterization of naturally occurring DHCR7 is lacking, therefore we tried to identify the most promising DHCR7 candidates through in silico screening.
Unlike traditional screening methods that typically search for the best molecule, our focus was on identifying the most suitable receptor – how a protein molecule is commonly referred to in docking studies. We compiled a list of ten organisms known to harbor a delta-7-sterol reductase. Our selection criteria were based on a combination of literature reviews1,2 and homology analyses (further details are available in our dry lab notebook). In our pursuit of identifying the most promising enzymes, we conducted molecular docking and molecular dynamics (MD) simulations of the selected docked conformations.
Except for Gallus gallus and Waddlia chondrophila, we were able to obtain Computed Structure Models (CSMs) from the Protein Data Bank (PDB) for all our target enzymes from other species. These models represent predictions of enzyme structures. Due to the lack of experimentally determined structures of DHCR7, we had to rely on AlphaFold models. Notably, all the obtained structures exhibited very high structural similarity. Unfortunately, structural data for the DHCR7 enzyme from Gallus gallus and Waddlia chondrophila had not been deposited in the PDB database.
As a result, we resorted to predicting these structures using AlphaFold3. Specifically, our team members Xiaodie, Yinfu, and Julia utilized AlphaFold Colab and ColabFold4 tools on Google Colab, subsequently comparing the results obtained from both tools. Default settings were used for both programs. Both AlphaFold Colab and ColabFold consistently produced highly reliable predictions with a notably high level of confidence. Given the quality of these results, we made the decision to forego the use of the full version of the local AlphaFold, as we believed it would not significantly enhance our outcomes.
All models had high pLDDT scores indicating high confidence in the predicted structure. The N-termini of all DHCR7 exhibited lower prediction confidence, in line with its predicted higher flexibility. There were instances where pLDDT values dropped below 70, coinciding with reductions in MSA (multiple sequence alignment) coverage and higher PAE values. These lower-accuracy patches likely correspond to the same region of the protein in both organisms.
The results obtained from both AlphaFold Colab and ColabFold were highly similar. This was somewhat expected since they use the same method but also different techniques for input feature generation. However, due to the slightly higher pLDDT values, we opted to retain the structures generated by AlphaFold Colab. For more comprehensive information regarding the structure prediction process, please refer to our dry lab enzyme screening notebook.
To color the models according to the AlphaFold pLDDT coloring scheme we used a PyMOL plugin5.
Due to the DHCR7 enzyme's transmembrane nature and the inherent challenges associated with determining the structures of such proteins, we found no experimental structural data publicly available. To address this limitation, we sought out proteins that were structurally similar to DHCR7. One such protein, 4QUV, identified as a delta(14)-sterol reductase within the same protein family as DHCR7, exhibited notable structural similarity to DHCR7. Consequently, we utilized the 4QUV structure as a "reference" to aid us in identifying the binding site of NADPH in DHCR7. When conducting the docking of ergosta-5.7-dienol, we used literature as a reference for the validation of the docked substrate’s confirmation1,6. We targeted the putative sterol binding site to guide our investigation.
For our molecular docking experiments, we followed a protocol developed by Professor J. Harvey (KU Leuven, Department of Chemistry). The key steps in molecular docking include preparation of the enzyme and ligand, docking the small molecule with the enzyme, and interpreting the results. Our choice of software was AutoDock Vina7, and we visualized the docking simulations using UCSF Chimera8. To validate the protocol's accuracy, we initiated the process by docking the experimentally determined structure of delta(14)-sterol reductase (PDB code: 4QUV) with its native ligand, NDP but without the nicotinamide-ribose moiety. We observed that the ligand successfully docked in the same location as in the crystal structure, confirming the reliability of our approach. During the docking of NDP with 4QUV, we encountered an issue related to the naming of one of the hydrogens. Fortunately, with the guidance of Professor Harvey, we were able to resolve this issue (additional details can be found in the dry lab enzyme screening notebook).
We started from NADPH and ergosta-5,7-dienol structures obtained from PubChem. Our docking process proceeded sequentially, with NADPH docking performed first. Subsequently, we exported the results to a PDB file and initiated another docking simulation, this time involving ergosta-5,7-dienol. The inclusion of NADPH in the docking process is essential as it serves as a cofactor for reducing ergosta-5,7-dienol to campesterol.
Under Prof. Froeyen's advice, each ligand was docked to each enzyme ten times to gain a measure of dock quality. Autodock's statistical nature does not guarantee the same result for each docking attempt and repetitions are therefore recommended. Since docking scores are not sufficient to guarantee correct docking, we also inspected the potential chemical interactions, such as H-bond and hydrophobic interactions, between ligand and enzymes.
Due to difficulties visualizing the surface of our models in Chimera, we adapted our strategy or conducting post-docking double checks. We employed PoseView 9 and LigPlot+10. These software tools generate two-dimensional diagrams illustrating the interactions between the enzyme and the ligand. Our analysis concentrated on interactions within the NADPH binding site and the putative sterol binding site1,6. After thoroughly assessing the docking results and considering the insights gleaned from these double checks, we selected a single docked conformation for each enzyme to undergo molecular dynamics (MD) simulations.
Organism | Docking score | Interactions found with PoseView | Interactions found with LigPlot+ |
---|---|---|---|
Rattus norvegicus | -6.3 | Hydrogen bond with aspartate from the sterol binding site | Hydrogen bond with aspartate from the sterol binding site |
Mus musculus | -5.4 | No interactions found | No hydrogen bonds found |
Homo sapiens | -9.1 | Hydrogen bond with tyrosine from the sterol binding site | Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site |
Danio rerio | -6 | Hydrogen bond with aspartate from the sterol binding site | Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site |
Xenopus laevis | -8.3 | Hydrogen bond with aspartate from the sterol binding site | Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site |
Xenopus tropicalis | -6.1 | No interactions found | No hydrogen bonds with the enzyme, 2 hydrogen bonds with NADPH |
Gallus gallus | -7.5 | Hydrogen bond with aspartate from the sterol binding site | Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site, one hydrogen bond with NADPH |
Waddlia chondrophila | -6 | Hydrogen bond with aspartate from the sterol binding site | Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site |
Arabidopsis thaliana | -6.5 | No interactions found | No hydrogen bonds found |
Bos taurus | -10.4 | No interactions found | No hydrogen bonds with the enzyme, 1 hydrogen bond with NADPH |
Docking scores are estimated binding energies (therefore the more negative the better) but it is clear from Table 1 that there is limited correlation between binding score and hydrogen bonding patterns, which we expect provide a good proxy for determining if interactions are specific.
In our molecular dynamics (MD) simulations, we used the Amber22 package11 to investigate the behavior of protein-ligand complexes previously identified through molecular docking and interaction checks. Under the guidance of prof. Matheus Froeyen, Julia from our team wrote an MD protocol that we followed to perform the simulations.
Our simulation workflow can be summarized as follows:
This comprehensive simulation and analysis approach allowed us to explore the dynamic behavior of protein-ligand complexes, assess their stability, and gain insights into key interactions that are essential for DHCR7 biological function.
The analysis of hydrogen bond distances held significant importance to us, as it offered insights into the specificity of ligand binding. Our primary focus was on the hydrogen bonds formed between the hydroxyl group of ergosta-5,7-dienol and residues in the putative sterol binding site, consisting of a tyrosine and aspartate. We systematically calculated the distances between atoms that were expected to participate in these hydrogen bonds throughout the entire simulation. This analysis, combined with other relevant information, contributed to our conclusions drawn from the enzyme screening.
Organism | Hydrogen bonds' behavior throughout the simulation |
---|---|
Rattus norvegicus | The tyrosine doesn't seem to form a hydrogen bond with the hydroxyl group at all. There could be a hydrogen bond with aspartate but only for a short moment. These results are inconclusive. |
Mus musculus | As expected after molecular docking, the atoms that should form the hydrogen bonds are too far from each other and they don't form a hydrogen bond. |
Homo sapiens | Due to some problems during simulations, we were not able to obtain results from this organism. |
Danio rerio | Results from molecular docking were very promising but the ergosta-5.7-dienol shifted away from the putative sterol binding site after equilibration and never came close enough to form hydrogen bonds. The residues from the binding sites also moved away from the substrate and they don’t form a hydrogen bond between themselves. |
Xenopus laevis | The hydrogen bonds between ergosta-5,.7-dienol and the enzyme don't seem to hold up throughout the whole simulation (only one lasts for the last 20 thousand frames) |
Xenopus tropicalis | As expected after docking, the substrate didn't form any hydrogen bonds with the binding site residues. |
Gallus gallus | The hydroxyl group of ergosta-5.7-dienol seems to be too far away from the binding site atoms to form a stable hydrogen bond. |
Waddlia chondrophila | The tyrosine doesn't seem to be as far away from the ergosta-5.7-dienol as aspartate, therefore there might be a hydrogen bond, but only for a short period of time. |
Arabidopsis thaliana | The atoms that were supposed to form hydrogen bonds were too far to form a specific interaction. This was expected after docking since the hydroxyl group of the substrate docked away from the residues from the binding site. The fact that instead of aspartate there is a glutamic acid in the putative sterol binding site could also contribute to this result. |
Bos taurus | Even though the docking score for ergosta-5.7-dienol was really good, the molecule didn’t dock close enough to the putative sterol binding site. This could be caused by the strange docking of NADPH which wasn’t resembling the reference structure too closely. |
The results obtained from molecular docking and MD simulations deviated from our initial expectations. In many of the analyzed organisms, the atoms that were anticipated to form hydrogen bonds did not come into close proximity after simulation, even when they appeared to establish hydrogen bonds during the initial docking phase. Of particular surprise is the case of Danio rerio, as the DHCR7 enzyme from this organism had previously demonstrated the highest campesterol yield in the literature2 The docking results indicated strong ligand binding and hydrogen bond formation, yet after conducting MD simulations, the hydroxyl group of ergosta-5.7-dienol and the binding site residues moved apart. This trend was observed in several organisms.
These unexpected outcomes may be attributed to less-than-ideal docking of NADPH. In comparison to the structure of delta(14)-sterol reductase from Methylomicrobium alcaliphilum 20Z6 particularly the positioning of the native ligand, none of the NADPH structures in all organisms anchored as deeply as the one in delta(14)-sterol reductase. Additionally, determining the orientation of the nicotinamide-ribose moiety during docking proved challenging, given its absence in the experimental structure. Consequently, it is possible that suboptimal NADPH docking conformations were selected, potentially leading to the poor docking results observed with ergosta-5.7-dienol.
After analyzing the results from all of the stages of our enzyme screening, we selected few promising DHCR7 enzymes that we could introduce to Yarrowia lipolytica in order to transform ergosta-5.7-dienol to campesterol. In our study, DHCR7 enzymes from Xenopus laevis and Waddlia chondrophila exhibit the best behavior with our substrate, ergosta-5.7-dienol. In order to fully analyze and understand the enzyme's possibilities, they would need to be experimentally tested. We also believe that DHCR7 from Danio rerio should have exhibited hydrogen bonds with the substrate, since the previous literature2 shows the best campesterol yield with this enzyme. Therefore, it should also be tested along with the other selected two.
Our whole project relies on an oleaginous yeast Yarrowia lipolytica. Since it's not a model organism, at the moments there isn't much data on it, even though it's becoming more and more popular in some research fields. We decided to build a growth model of Yarrowia with the help of our wet lab team that measured the growth. In the long run we would want to study the growth under multiple parameters such as various media, pH and temperature values. Considering the limited amount of time that we have to develop the project, we settled on comparing the impact of medium (glucose, oil) on the growth of Yarrowia lipolytica. To plan further experiments including pH, temperature and more media we used optimal design.
To model the growth, we used packages biogrowth15, growthrates16 in R and the library numpy, panda, matplotlib, scipy, and csv in Python (Anaconda Navigator).
Our wet lab team performed growth experiments on Yarrowia lipolytica at 30°C temperature and 200 rpm. The growth medium consisted of YP supplied with glucose (5% w/v) and oil (3.7% v/v) as carbon sources, respectively. The growth started at an initial OD600 (Optical Density at 600 nm) of 0.1 and was measured every 2 hours for 60 hours 17. To our surprise, Yarrowia appeared to have faster growth based on the opticial density in oil versus glucose.
Due to certain problems with the glucose concentration in the medium we needed to repeat the growth experiment and collect more data. To our surprise, Yarrowia turned out to grow better on oil medium compared to glucose medium.
To initiate the modeling of Yarrowia lipolytica growth, it is crucial to establish the relationship between the OD (Optical Density) value and cell density. Previous literature18, has suggested a linear relationship. Hence, building a model on cell density is equivalent to building a model on OD value19. In our study, we apply a logarithmic transformation to the OD. The logarithmic transformation is monotonic and effectively reduces both the overall variance and skewness of the measurement variable.
Based on the data obtained from our wet lab experiments, following data cleaning and log transformation, the growth rate can be visually represented through the following graph. The horizontal axis denotes time in hours, while the vertical axis represents the logarithm of cell density.
In the modeling of growth rates, it is crucial to estimate the maximum growth rate. In our dry lab, we employ several methods and then compare the results from these three approaches.
The model was initially introduced by Baranyi20. Our starting point is a growth model well-known from population dynamics. We examine the first-order ordinary differential equation:
where x is the cell density, t represents time and μ is the specific growth rate. Here, the μ is supposed to be a monotone decreasing function. In this model, the μ is set as
Besides, a new variable, , representing the physiological state of the cells is introduced. The adjustment function is defined as . It can be considered as a capacity-type quantity expressing the proportion of the potential specific growth rate that is utilized by the cells. The process of adjustment (lag period) is characterized by the gradual increase of from a low value towards 1. A more stable transformation is defined as .
After plugging α(t) into the ODE, we have our Baranyi model:
By solving this ODE, we have:
Where and
Growth with oil medium:
For the growth curve fitted by the Baranyi model for Yarrowia growing with oil, the maximum growth rate (µmax) is 0.081. The parameter for the maximum yield (ymax) is 4.060, and the initial cell density is 0.599. Additionally, we can compute the doubling time using the formula .
Growth with glucose medium:
Below is the growth curve fitted by the Baranyi model for glucose. The maximum growth rate is 0.169, while the parameter for the maximum yield is 3.374, and the initial cell density is 0.631.
Glucose vs oil:
Parameter | Glucose medium | Oil medium |
---|---|---|
Initial cell density | 0.599 | 0.631 |
Maximum growth rate | 0.169 | 0.081 |
Maximum yield | 3.374 | 4.060 |
When comparing the two carbon sources, it's notable that the maximum growth rate of oil is two times smaller than for glucose. Besides, the maximum yield for oil is larger than that for glucose.This suggest that our strain of Yarrowia lipolytica grows better on oil medium (which was also seen during growth experiments).
For fitting the exponential growth model, we utilized the method in the "growthrates" package19. The heuristic linear method operates as follows:
Given the density plot and the presence of 30 observations, we found it reasonable to set the width of the window 'h' as 5. Additionally, we considered part of the window fitting for the overall linear fit to be 0.95.
The graph represents the growth rate with glucose as the carbon source, where the maximum growth rate is 0.12, and the lag time is 3.05 hours. The y-axis corresponds to the log of the OD value.
The following graph illustrates the growth rate with oil as the carbon source, revealing a maximum growth rate of 0.119 and a lag time of 0.567 hours.
In this case, spline fitting is performed on log-transformed data, assuming exponential growth at the time point of the maximum of its first derivative, and this process also involves cross-validation.
The results of replicate 1 with the same dilution show that the maximum growth rate with glucose is 0.154, while with oil, it is 0.251.
The polynomial model is the most common one as well as the simplest one when researchers want to perform a curve fitting. In our project, we also tried several polynomial curve-fitting tests to find the optimal model. Only results from the fourth-order polynomial model are shown here. In total we fitted six orders of the model but decided to include here only one of them. The results from fitting the remaining models can be found in our dry lab notebook.
In this part, we used the library numpy, panda, matplotlib, scipy, and csv in Python (Anaconda Navigator).
Before applying the practical model, all the data from the experiments were stored as csv file with 2 columns. One is the time, and the other is the OD value. The format of the number is kept as 0.00 and the different columns were separated by semicolon.
Assumption: “unmeasurable” is equal to 99.
Objective function with fourth-order system
Discussion
From the above results, we could see the 4th order polynomial model fits our data well. The models with higher order showed over-fitting and the linear model cannot describe the growth of Yarrowia lipolytica (see the dry lab notebook for results). In this case, the model can be written in the mathematical way in the following equation.
Group | a | b | c | d | f |
---|---|---|---|---|---|
Glucose-10-Day | -0.8143581005399544 | 0.1546859026553308 | -0.004671823577116975 | 4.155998845229985e-05 | 0.9350957284285495 |
Glucose-100-Day | 0.6787062619569382 | 0.022938337335029934 | -0.0004868143806393447 | 3.0919705812010946e-06 | -1.6382038964754417 |
Oil-10-Day | 3.186276833939559 | -0.2222899617929109 | 0.011076814743971054 | -0.0001485068888501073 | -4.758860961991871 |
Oil-100-Day | 4.002293735995875 | -0.15441160369910434 | 0.004197843586585877 | -4.1418456690800325e-05 | -6.769108894062777 |
Glucose-10-Night | 1.0264171381633578 | 0.013498782238027363 | -0.0036076808022966755 | 8.06001443461879e-05 | 10.192369861942092 |
Glucose-100-Night | -0.196572547099106 | 0.14584600736652165 | -0.006206491088681432 | 8.151455938376628e-05 | 16.059754872289098 |
Oil-10-Night | 3.0185006626659026 | -0.552146841438496 | 0.04284554531381868 | -0.0008603731617413679 | 13.509429349429215 |
Oil-100-Night | -2.9076572646127232 | 0.06493650948214547 | 0.019916378820435177 | -0.0005946122207640681 | 29.831892178768666 |
From the table, it is clear to see that coefficients between oil and glucose are significantly different and the microorganism has a faster trend to grow on oil than glucose. However, because the microorganism might need some time to adopt to the new environment, in the group Oil-100-Night, the decline trend happened first.
Conclusions
Upon comparing the four methods, the maximum growth rate estimated by the heuristic linear method and the ODE model appear similar. However, the result obtained through the smoothing spline method differs from the other two methods. One possible reason for this discrepancy is that the linear model and ODE model rely on parameters, while the spline method is fully nonparametric, making it more sensitive to data points with significant gaps.
After obtaining the growth curve of Yarrowia lipolytica, determining the conditions that yield the maximum output becomes crucial. For future experiments, we have chosen a split-plot D-optimal design. D-optimal designs aim to maximize the determinant of the information matrix, which quantifies the precision and efficiency of parameter estimates.
In the production process, parameters such as temperature, pH, and the choice of medium significantly influence the yield, and we aim to quantify the impact of each parameter. The temperature and pH range from 20 to 30 and 2 to 5, respectively. We are comparing four different mediums: glucose, oil, oil with Tween, and oil with biosurfactant. Additionally, we have designated the medium as a hard-to-change factor, which serves as the whole-plot factor, while the other factors are considered subplot factors. When analyzing the data we have collected, it is essential to account for the influence of the whole-plot factor.
The model we use for analyzing data from a split-plot experiment with b whole plots of k runs is
where Yij is the response measured at the j-th run or sub-plot in the i-th whole plot(medium), xij is a vector that contains the levels of all the experimental factors at the j-th run in the i-th whole plot, f^'(xij) is that vector's model expansion, and β contains the intercept and all the factor effects that are in the model. The term γi represents the i-th whole-plot effect and εij is the residual error associated with the j-th run in whole plot i. Each whole plot corresponds to an independent setting of the hard-to-change factors, and each run or sub-plot corresponds to one independent setting of the easy-to-change factors.
Run | Whole Plots | Temperature | pH | Medium |
---|---|---|---|---|
1 | 1 | 20 | 5 | Oil with tween |
2 | 1 | 20 | 2 | Oil with tween |
3 | 1 | 30 | 5 | Oil with tween |
4 | 1 | 30 | 2 | Oil with tween |
5 | 2 | 30 | 2 | Oil with produced biosurfactant |
6 | 2 | 20 | 5 | Oil with produced biosurfactant |
7 | 2 | 30 | 5 | Oil with produced biosurfactant |
8 | 2 | 20 | 2 | Oil with produced biosurfactant |
9 | 3 | 20 | 2 | Oil with tween |
10 | 3 | 20 | 5 | Oil with tween |
11 | 3 | 30 | 2 | Oil with tween |
12 | 3 | 30 | 5 | Oil with tween |
13 | 4 | 30 | 5 | Oil |
14 | 4 | 30 | 2 | Oil |
15 | 4 | 20 | 2 | Oil |
16 | 4 | 20 | 5 | Oil |
17 | 5 | 20 | 2 | Glucose |
18 | 5 | 20 | 5 | Glucose |
19 | 5 | 30 | 5 | Glucose |
20 | 5 | 30 | 2 | Glucose |
21 | 6 | 30 | 2 | Oil with produced biosurfactant |
22 | 6 | 20 | 2 | Oil with produced biosurfactant |
23 | 6 | 30 | 5 | Oil with produced biosurfactant |
24 | 6 | 20 | 5 | Oil with produced biosurfactant |
The primary objective of our protein design endeavor is to create a custom hydrophobin with a reduced isoelectric point (pI). This design aims to enhance emulsification efficiency under lower pH conditions. Hydrophobins are most effective at emulsification in alkaline pH environments, typically around 9. However, the microorganism Yarrowia lipolytica, with which we are working, thrives in lower pH ranges. Furthermore, in practical fermentation processes, especially those without a controlled pH environment, the pH tends to decrease rapidly.
The foundational knowledge about hydrophobins was graciously provided by Jeroen Vereman, while insights into Yarrowia lipolytica were shared by Vasileios Vangalis and Seppe Dockx from Professor Kevin Verstrepen's laboratory.
The strategic concept for designing a hydrophobin with a lower pI was formulated by our Principal Investigator, Vitor Pinheiro. This innovative approach underscores our commitment to addressing the specific needs of our research and application context.
As a starting point, we developed a Python script that generates all possible combinations of the original sequence, wherein lysine and arginine are substituted with either glutamate or aspartate. This theoretical substitution aims to lower the isoelectric point of the hydrophobin. Our script utilizes BioPython20 to calculate the pI of the protein. To begin, we selected the HFBI sequence, a well-known hydrophobin comprising 75 amino acids (including 2 lysines and 1 arginine), with the corresponding PDB code being 2FZ6. The outcome of this script yielded 26 modified sequences, with pI values ranging from 4.05 to 4.22 within the default pH interval investigated using the pi() method from BioPython. When the minimum pH of the interval is adjusted downward, the calculated pI can drop even further, reaching as low as 3.17, which represents the lowest pI.
The sequence with the lowest pI of 3.17 exhibited mutations at positions K32D, R35D, and K50D. To ensure a fair comparison in our subsequent analysis, we utilized four sequences, including this one.
Predicted structures of:
The figure below shows superposition of the original HFBI and HFBI with 3 random leucines substituted with aspartate. We expected the mutations to highly impact the stability of the protein and result in either a very different structure or unstructured regions but we ended up with a very similar structure with some (but not very substantial) differences (e.g., in the β-sheets region). This points to one of the limitations of AlphaFold3. In particular, it does not predict the impact of mutations on the protein structure or the stability of the protein.
In order to fully explore and confirm the encountered mutations, we predicted a stucture of the HFBI with all seven leucines substituted with aspartate. Figure below shows this structure aligned to the original HFBI structure. Obtained structure is still very similar to the original one and does not display any vast differences. This result further confirms that AlphaFold is not a good tool (at least yet) to study the impact of mutations on the protein stability and structure.
As a result of these conclusions, we decided to no longer use AlphaFold for designing a hydrophobin with lower isoelectric point and we are looking for other strategies to tackle this idea.
PredictSNP22 is a tool that predicts whether a given mutation is disease-related, indicating its potential impact on the stability of the protein. We submitted four jobs to PredictSNP, each involving a modification of the original HFBI sequence, all of which corresponded to the four sequences previously submitted to AlphaFold.
Note: The algorithm is trained on human protein data, while hydrophobins are fungal proteins. Therefore, we can only assume its accuracy for hydrophobins, but it is not necessarily guaranteed.
Mutated HFBI with lowest pI:
Most of the tools integrated into PredictSNP suggest that the mutations are likely to be neutral. The expected accuracy of PredictSNP is approximately 75% for each mutation. The MAPP prediction raises some concerns as its confidence level is approximately 78%. Further exploration of the sequence and the impact of these mutations is necessary. These initial results are promising because there is a possibility that these mutations may not significantly affect the protein's stability, which aligns with our research objectives.
Mutation | PredictSNP prediction | MAPP prediction | PhD-SNP prediction | PolyPhen-1 prediction | PolyPhen-2 prediction | SIFT prediction | SNAP prediction |
---|---|---|---|---|---|---|---|
K32D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL |
R45D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | DELETERIOUS | NEUTRAL | DELETERIOUS |
K50D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL |
Original sequence with 3 random leucines substituted with aspartate:
All three mutations appear to be deleterious, indicating that they indeed have an impact on the stability of the protein. This result aligns with our expectations, as these mutations were intentionally designed to affect the protein's structure.
Mutation | PredictSNP prediction | MAPP prediction | PhD-SNP prediction | PolyPhen-1 prediction | PolyPhen-2 prediction | SIFT prediction | SNAP prediction |
---|---|---|---|---|---|---|---|
L24D | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L67D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L68D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
Sequence with 3 leucines substituted further mutated towards lower pI:
This sequence contains mutations that reduce the pI and involve the substitution of leucines with aspartate, both of which are expected to disrupt protein stability. This result consolidates the findings from the previous two sequences.
Mutation | PredictSNP prediction | MAPP prediction | PhD-SNP prediction | PolyPhen-1 prediction | PolyPhen-2 prediction | SIFT prediction | SNAP prediction |
---|---|---|---|---|---|---|---|
L24D | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
K32D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL |
R45D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | DELETERIOUS | NEUTRAL | DELETERIOUS |
K50D | NEUTRAL | DELETERIOUS | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL | NEUTRAL |
L67D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L68D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
All leucines substituted:
In this sequence, all leucines have been replaced with aspartate. We made this choice to challenge the limitations of AlphaFold. As anticipated, these mutations are expected to influence the protein's structure, a prediction supported by PredictSNP but not detected by AlphaFold. Upon comparing the prediction confidence from all the tools, it is reasonable to assume that all of these mutations would indeed affect the protein's stability.
Mutation | PredictSNP prediction | MAPP prediction | PhD-SNP prediction | PolyPhen-1 prediction | PolyPhen-2 prediction | SIFT prediction | SNAP prediction |
---|---|---|---|---|---|---|---|
L12D | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L24D | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L26D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L29D | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L56D | NEUTRAL | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | NEUTRAL | NEUTRAL |
L67D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
L68D | DELETERIOUS | DELETERIOUS | NEUTRAL | DELETERIOUS | DELETERIOUS | DELETERIOUS | DELETERIOUS |
The analysis we have conducted so far is preliminary. Further analysis and exploration are necessary to gain greater confidence in our designed hydrophobin. Our next steps would involve utilizing FoldX23, pyFoldX24 in particular, to thoroughly assess the stability of the mutated HFBI. By applying a force field to our hydrophobin, we could investigate the impact of mutations on both the stability and dynamics of the protein.
In the earlier stages, we addressed one of the limitations of AlphaFold, which is its inability to predict the effect of single nucleotide polymorphisms on protein structure. In September 2023, Google DeepMind, the creators of AlphaFold, introduced AlphaMissense25. This tool, built upon AlphaFold, is designed to predict whether a given mutation is disease-related, potentially disrupting protein stability. Regrettably, we haven't had the opportunity to test it ourselves, but it presents an exciting avenue for our future analysis.