Abstract

In our dry lab work, we pursued three distinct projects, each contributing unique insights to our overall project. These projects included finding more efficient orthologues of DHCR7 (the last enzyme in the campesterol biosynthetic pathway) through enzyme screening, modeling the growth of Yarrowia lipolytica, and designing a hydrophobin. Our research project was dedicated to enabling campesterol biosynthesis within the oleaginous yeast, Yarrowia lipolytica, renowned for its potential in biotechnological applications. A pivotal step in this undertaking involved introducing a gene encoding delta-7-sterol reductase¹, a component not naturally found in Yarrowia. To identify the most suitable enzyme source, we conducted a comprehensive screening process that included molecular docking and molecular dynamics studies with the enzyme substrate. Simultaneously, we collaborated closely with our wet lab team to establish a growth model for Y. lipolytica, initially focusing on assessing the impact of glucose and oil-based media to its growth. Our approach involved optimizing experimental designs to pave the way for future investigations under diverse conditions. Additionally, we embarked on a protein design campaign with the objective of designing a hydrophobin characterized by a reduced isoelectric point. This innovation was directed towards enhancing emulsification efficiency in lower pH environments, aligning with Y. lipolytica's growth preferences and the practical characteristics of fermentation processes.

Enzyme Screening

Motivation

The final step in the campesterol biosynthesis pathways requires a functioning delta-7-sterol reductase, which is not naturally present in Yarrowia lipolytica. This enzyme plays a crucial role in reducing ergosta-5,7-dienol to campesterol¹. Several organisms feature genomes containing the DHCR7 gene, which could potentially serve our project. Detailed knowledge and characterization of naturally occurring DHCR7 is lacking, therefore we tried to identify the most promising DHCR7 candidates through in silico screening.

Unlike traditional screening methods that typically search for the best molecule, our focus was on identifying the most suitable receptor – how a protein molecule is commonly referred to in docking studies. We compiled a list of ten organisms known to harbor a delta-7-sterol reductase. Our selection criteria were based on a combination of literature reviews^1,2 and homology analyses (further details are available in our dry lab notebook). In our pursuit of identifying the most promising enzymes, we conducted molecular docking and molecular dynamics (MD) simulations of the selected docked conformations.

Structure Prediction

Except for Gallus gallus and Waddlia chondrophila, we were able to obtain Computed Structure Models (CSMs) from the Protein Data Bank (PDB) for all our target enzymes from other species. These models represent predictions of enzyme structures. Due to the lack of experimentally determined structures of DHCR7, we had to rely on AlphaFold models. Notably, all the obtained structures exhibited very high structural similarity. Unfortunately, structural data for the DHCR7 enzyme from Gallus gallus and Waddlia chondrophila had not been deposited in the PDB database.

As a result, we resorted to predicting these structures using AlphaFold³. Specifically, our team members Xiaodie, Yinfu, and Julia utilized AlphaFold Colab and ColabFold⁴ tools on Google Colab, subsequently comparing the results obtained from both tools. Default settings were used for both programs. Both AlphaFold Colab and ColabFold consistently produced highly reliable predictions with a notably high level of confidence. Given the quality of these results, we made the decision to forego the use of the full version of the local AlphaFold, as we believed it would not significantly enhance our outcomes.

All models had high pLDDT scores indicating high confidence in the predicted structure. The N-termini of all DHCR7 exhibited lower prediction confidence, in line with its predicted higher flexibility. There were instances where pLDDT values dropped below 70, coinciding with reductions in MSA (multiple sequence alignment) coverage and higher PAE values. These lower-accuracy patches likely correspond to the same region of the protein in both organisms.

Structures of the DHCR7 enzyme in Gallus gallus and Waddlia chondrophila

Figure 1. Structures of the DHCR7 enzyme in Gallus gallus (A) and Waddlia chondrophila (B) predicted with AlphaFold Colab. The structures are colored according to the prediction confidence. The structures are colored according to the prediction confidence: pLDDT > 90 (dark blue), 90 > pLDDT > 70 (light blue), 70 > pLDDT > 50 (yellow), pLDDT < 50 (orange).

The results obtained from both AlphaFold Colab and ColabFold were highly similar. This was somewhat expected since they use the same method but also different techniques for input feature generation. However, due to the slightly higher pLDDT values, we opted to retain the structures generated by AlphaFold Colab. For more comprehensive information regarding the structure prediction process, please refer to our dry lab enzyme screening notebook.

To color the models according to the AlphaFold pLDDT coloring scheme we used a PyMOL plugin⁵.

Molecular Docking

Due to the DHCR7 enzyme's transmembrane nature and the inherent challenges associated with determining the structures of such proteins, we found no experimental structural data publicly available. To address this limitation, we sought out proteins that were structurally similar to DHCR7. One such protein, 4QUV, identified as a delta(14)-sterol reductase within the same protein family as DHCR7, exhibited notable structural similarity to DHCR7. Consequently, we utilized the 4QUV structure as a "reference" to aid us in identifying the binding site of NADPH in DHCR7. When conducting the docking of ergosta-5.7-dienol, we used literature as a reference for the validation of the docked substrate’s confirmation^1,6. We targeted the putative sterol binding site to guide our investigation.

For our molecular docking experiments, we followed a protocol developed by Professor J. Harvey (KU Leuven, Department of Chemistry). The key steps in molecular docking include preparation of the enzyme and ligand, docking the small molecule with the enzyme, and interpreting the results. Our choice of software was AutoDock Vina⁷, and we visualized the docking simulations using UCSF Chimera⁸. To validate the protocol's accuracy, we initiated the process by docking the experimentally determined structure of delta(14)-sterol reductase (PDB code: 4QUV) with its native ligand, NDP but without the nicotinamide-ribose moiety. We observed that the ligand successfully docked in the same location as in the crystal structure, confirming the reliability of our approach. During the docking of NDP with 4QUV, we encountered an issue related to the naming of one of the hydrogens. Fortunately, with the guidance of Professor Harvey, we were able to resolve this issue (additional details can be found in the dry lab enzyme screening notebook).

We started from NADPH and ergosta-5,7-dienol structures obtained from PubChem. Our docking process proceeded sequentially, with NADPH docking performed first. Subsequently, we exported the results to a PDB file and initiated another docking simulation, this time involving ergosta-5,7-dienol. The inclusion of NADPH in the docking process is essential as it serves as a cofactor for reducing ergosta-5,7-dienol to campesterol.

Under Prof. Froeyen's advice, each ligand was docked to each enzyme ten times to gain a measure of dock quality. Autodock's statistical nature does not guarantee the same result for each docking attempt and repetitions are therefore recommended. Since docking scores are not sufficient to guarantee correct docking, we also inspected the potential chemical interactions, such as H-bond and hydrophobic interactions, between ligand and enzymes.

Due to difficulties visualizing the surface of our models in Chimera, we adapted our strategy or conducting post-docking double checks. We employed PoseView ⁹ and LigPlot+¹⁰. These software tools generate two-dimensional diagrams illustrating the interactions between the enzyme and the ligand. Our analysis concentrated on interactions within the NADPH binding site and the putative sterol binding site^1,6. After thoroughly assessing the docking results and considering the insights gleaned from these double checks, we selected a single docked conformation for each enzyme to undergo molecular dynamics (MD) simulations.

Organism	Docking score	Interactions found with PoseView	Interactions found with LigPlot+
Rattus norvegicus	-6.3	Hydrogen bond with aspartate from the sterol binding site	Hydrogen bond with aspartate from the sterol binding site
Mus musculus	-5.4	No interactions found	No hydrogen bonds found
Homo sapiens	-9.1	Hydrogen bond with tyrosine from the sterol binding site	Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site
Danio rerio	-6	Hydrogen bond with aspartate from the sterol binding site	Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site
Xenopus laevis	-8.3	Hydrogen bond with aspartate from the sterol binding site	Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site
Xenopus tropicalis	-6.1	No interactions found	No hydrogen bonds with the enzyme, 2 hydrogen bonds with NADPH
Gallus gallus	-7.5	Hydrogen bond with aspartate from the sterol binding site	Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site, one hydrogen bond with NADPH
Waddlia chondrophila	-6	Hydrogen bond with aspartate from the sterol binding site	Hydrogen bonds with both the tyrosine and aspartate from the sterol binding site
Arabidopsis thaliana	-6.5	No interactions found	No hydrogen bonds found
Bos taurus	-10.4	No interactions found	No hydrogen bonds with the enzyme, 1 hydrogen bond with NADPH

Table 1. A summary of the results from ergosta-5,7-dienol docking. The scores are provided by AutoDock Vina. These results specifically highlight hydrogen bonds and do not include hydrophobic contacts, as our primary aim was to observe specific binding..

Docking scores are estimated binding energies (therefore the more negative the better) but it is clear from Table 1 that there is limited correlation between binding score and hydrogen bonding patterns, which we expect provide a good proxy for determining if interactions are specific.

Molecular Dynamics

In our molecular dynamics (MD) simulations, we used the Amber22 package¹¹ to investigate the behavior of protein-ligand complexes previously identified through molecular docking and interaction checks. Under the guidance of prof. Matheus Froeyen, Julia from our team wrote an MD protocol that we followed to perform the simulations.

Our simulation workflow can be summarized as follows:

Preparation of the Small Molecule:

We initially obtained the ligand from the PubChem database in SDF format, which was subsequently converted to the PDB format.
Using antechamber2 and parmchk2¹², we generated the force field¹³ parameters for the small molecule.

Preparation of the Docked Complex:

We edited the PDB file of the docked complex to assure the compatibility with AMBER software.
To ensure consistency in atom names between the docked ligand and the prepped ligand, we adjusted the atom names in the docked conformation to match those of the prepped ligand.

Generating the coordinates and topology:

LEAP was used to generate AMBER topology and coordinate files for the structure, based on a specified force field.

Molecular Dynamics Simulation Phases:

Minimization: We initiated the simulations with energy minimization to relax the structure.
Heating: The system was gradually heated to the desired temperature.
Density Adjustment: To attain the desired system density, we employed a density adjustment phase.
Equilibration: In this phase, we equilibrated the system over an extended period to ensure stability. No restraints were applied.
Production: The production phase comprised several segments (in our case five), each running for 4 ns, totaling a 20-ns MD simulation. It is essential to monitor the Root Mean Square Deviation (RMSD) throughout this phase to evaluate system stability.

Analysis:

RMSD & RMSF: Using the cpptraj¹⁴ program, we calculated RMSD values to monitor system stability after each production stage. RMSD is used to quantify the difference or similarity between two sets of coordinates, in this context it reflects the similarity between structures. Additionally, RMSF (Root Mean Square Fluctuation) values provided insights into protein flexibility.
Hydrogen Bond Distance Analysis: We monitored the distances between atoms forming hydrogen bonds in the complex. These distances were adjusted as needed in the analysis script based on results from PoseView and LigPlot+.

This comprehensive simulation and analysis approach allowed us to explore the dynamic behavior of protein-ligand complexes, assess their stability, and gain insights into key interactions that are essential for DHCR7 biological function.

The analysis of hydrogen bond distances held significant importance to us, as it offered insights into the specificity of ligand binding. Our primary focus was on the hydrogen bonds formed between the hydroxyl group of ergosta-5,7-dienol and residues in the putative sterol binding site, consisting of a tyrosine and aspartate. We systematically calculated the distances between atoms that were expected to participate in these hydrogen bonds throughout the entire simulation. This analysis, combined with other relevant information, contributed to our conclusions drawn from the enzyme screening.

Organism	Hydrogen bonds' behavior throughout the simulation
Rattus norvegicus	The tyrosine doesn't seem to form a hydrogen bond with the hydroxyl group at all. There could be a hydrogen bond with aspartate but only for a short moment. These results are inconclusive.
Mus musculus	As expected after molecular docking, the atoms that should form the hydrogen bonds are too far from each other and they don't form a hydrogen bond.
Homo sapiens	Due to some problems during simulations, we were not able to obtain results from this organism.
Danio rerio	Results from molecular docking were very promising but the ergosta-5.7-dienol shifted away from the putative sterol binding site after equilibration and never came close enough to form hydrogen bonds. The residues from the binding sites also moved away from the substrate and they don’t form a hydrogen bond between themselves.
Xenopus laevis	The hydrogen bonds between ergosta-5,.7-dienol and the enzyme don't seem to hold up throughout the whole simulation (only one lasts for the last 20 thousand frames)
Xenopus tropicalis	As expected after docking, the substrate didn't form any hydrogen bonds with the binding site residues.
Gallus gallus	The hydroxyl group of ergosta-5.7-dienol seems to be too far away from the binding site atoms to form a stable hydrogen bond.
Waddlia chondrophila	The tyrosine doesn't seem to be as far away from the ergosta-5.7-dienol as aspartate, therefore there might be a hydrogen bond, but only for a short period of time.
Arabidopsis thaliana	The atoms that were supposed to form hydrogen bonds were too far to form a specific interaction. This was expected after docking since the hydroxyl group of the substrate docked away from the residues from the binding site. The fact that instead of aspartate there is a glutamic acid in the putative sterol binding site could also contribute to this result.
Bos taurus	Even though the docking score for ergosta-5.7-dienol was really good, the molecule didn’t dock close enough to the putative sterol binding site. This could be caused by the strange docking of NADPH which wasn’t resembling the reference structure too closely.

Table 2. Summary of the MD results with a focus on the hydrogen bonds between the hydroxyl group of ergosta-5.7-dienol and the putative sterol binding site of the DHCR7 enzyme.

Discussion and Conclusions

The results obtained from molecular docking and MD simulations deviated from our initial expectations. In many of the analyzed organisms, the atoms that were anticipated to form hydrogen bonds did not come into close proximity after simulation, even when they appeared to establish hydrogen bonds during the initial docking phase. Of particular surprise is the case of Danio rerio, as the DHCR7 enzyme from this organism had previously demonstrated the highest campesterol yield in the literature² The docking results indicated strong ligand binding and hydrogen bond formation, yet after conducting MD simulations, the hydroxyl group of ergosta-5.7-dienol and the binding site residues moved apart. This trend was observed in several organisms.

These unexpected outcomes may be attributed to less-than-ideal docking of NADPH. In comparison to the structure of delta(14)-sterol reductase from Methylomicrobium alcaliphilum 20Z⁶ particularly the positioning of the native ligand, none of the NADPH structures in all organisms anchored as deeply as the one in delta(14)-sterol reductase. Additionally, determining the orientation of the nicotinamide-ribose moiety during docking proved challenging, given its absence in the experimental structure. Consequently, it is possible that suboptimal NADPH docking conformations were selected, potentially leading to the poor docking results observed with ergosta-5.7-dienol.

After analyzing the results from all of the stages of our enzyme screening, we selected few promising DHCR7 enzymes that we could introduce to Yarrowia lipolytica in order to transform ergosta-5.7-dienol to campesterol. In our study, DHCR7 enzymes from Xenopus laevis and Waddlia chondrophila exhibit the best behavior with our substrate, ergosta-5.7-dienol. In order to fully analyze and understand the enzyme's possibilities, they would need to be experimentally tested. We also believe that DHCR7 from Danio rerio should have exhibited hydrogen bonds with the substrate, since the previous literature² shows the best campesterol yield with this enzyme. Therefore, it should also be tested along with the other selected two.

Growth Model

Motivation

Our whole project relies on an oleaginous yeast Yarrowia lipolytica. Since it's not a model organism, at the moments there isn't much data on it, even though it's becoming more and more popular in some research fields. We decided to build a growth model of Yarrowia with the help of our wet lab team that measured the growth. In the long run we would want to study the growth under multiple parameters such as various media, pH and temperature values. Considering the limited amount of time that we have to develop the project, we settled on comparing the impact of medium (glucose, oil) on the growth of Yarrowia lipolytica. To plan further experiments including pH, temperature and more media we used optimal design.

To model the growth, we used packages biogrowth¹⁵, growthrates¹⁶ in R and the library numpy, panda, matplotlib, scipy, and csv in Python (Anaconda Navigator).

Basic Growth Data: Oil vs. Glucose Medium

Our wet lab team performed growth experiments on Yarrowia lipolytica at 30°C temperature and 200 rpm. The growth medium consisted of YP supplied with glucose (5% w/v) and oil (3.7% v/v) as carbon sources, respectively. The growth started at an initial OD₆₀₀ (Optical Density at 600 nm) of 0.1 and was measured every 2 hours for 60 hours ¹⁷. To our surprise, Yarrowia appeared to have faster growth based on the opticial density in oil versus glucose.

Due to certain problems with the glucose concentration in the medium we needed to repeat the growth experiment and collect more data. To our surprise, Yarrowia turned out to grow better on oil medium compared to glucose medium.

Primary Growth Model

To initiate the modeling of Yarrowia lipolytica growth, it is crucial to establish the relationship between the OD (Optical Density) value and cell density. Previous literature¹⁸, has suggested a linear relationship. Hence, building a model on cell density is equivalent to building a model on OD value¹⁹. In our study, we apply a logarithmic transformation to the OD. The logarithmic transformation is monotonic and effectively reduces both the overall variance and skewness of the measurement variable.

Based on the data obtained from our wet lab experiments, following data cleaning and log transformation, the growth rate can be visually represented through the following graph. The horizontal axis denotes time in hours, while the vertical axis represents the logarithm of cell density.

Figure 2. Optical density (OD) measurements of Y. lipolytica on a logarithmic scale as a function of time. Each row represents a different carbon source used in the growth of the yeast (sunflower oil or glucose at concentrations of 3.7% and 5%, respectively). The OD values were measured at two different dilution factors (1:10 and 1:100), as highlighted by the 2 separate columns in the panel.

In the modeling of growth rates, it is crucial to estimate the maximum growth rate. In our dry lab, we employ several methods and then compare the results from these three approaches.

Baranyi model

The model was initially introduced by Baranyi²⁰. Our starting point is a growth model well-known from population dynamics. We examine the first-order ordinary differential equation:

The first-order ordinary differential equation

where x is the cell density, t represents time and μ is the specific growth rate. Here, the μ is supposed to be a monotone decreasing function. In this model, the μ is set as

Besides, a new variable, , representing the physiological state of the cells is introduced. The adjustment function is defined as . It can be considered as a capacity-type quantity expressing the proportion of the potential specific growth rate that is utilized by the cells. The process of adjustment (lag period) is characterized by the gradual increase of alpha from a low value towards 1. A more stable transformation is defined as .

After plugging α(t) into the ODE, we have our Baranyi model:

By solving this ODE, we have:

Where The Baranyi model solved, variables and

Growth with oil medium:

For the growth curve fitted by the Baranyi model for Yarrowia growing with oil, the maximum growth rate (µ_max) is 0.081. The parameter for the maximum yield (y_max) is 4.060, and the initial cell density is 0.599. Additionally, we can compute the doubling time using the formula .

Figure 3. Baranyi model for the oil group.

Growth with glucose medium:

Below is the growth curve fitted by the Baranyi model for glucose. The maximum growth rate is 0.169, while the parameter for the maximum yield is 3.374, and the initial cell density is 0.631.

Figure 4. Baranyi model for the glucose group.

Glucose vs oil:

Parameter	Glucose medium	Oil medium
Initial cell density	0.599	0.631
Maximum growth rate	0.169	0.081
Maximum yield	3.374	4.060

Table 3. Maximum growth rates and maximum yields computed with the Baranyi model for Yarrowia lipolytica growth in the glucose and oil medium.

When comparing the two carbon sources, it's notable that the maximum growth rate of oil is two times smaller than for glucose. Besides, the maximum yield for oil is larger than that for glucose.This suggest that our strain of Yarrowia lipolytica grows better on oil medium (which was also seen during growth experiments).

Exponential Growth Model with a Heuristic Linear Method

For fitting the exponential growth model, we utilized the method in the "growthrates" package¹⁹. The heuristic linear method operates as follows:

Firstly, it fits linear regressions to all subsets of 'h' consecutive data points and searches for the highest rate of exponential growth.
Secondly, it identifies the subset with the highest slope and includes data points from adjacent subsets that have a slope of at least 'quota'.
Lastly, it fits a new linear model to the extended data window identified in step 2.

Given the density plot and the presence of 30 observations, we found it reasonable to set the width of the window 'h' as 5. Additionally, we considered part of the window fitting for the overall linear fit to be 0.95.

The graph represents the growth rate with glucose as the carbon source, where the maximum growth rate is 0.12, and the lag time is 3.05 hours. The y-axis corresponds to the log of the OD value.

Figure 5. Linear model for the glucose group.

The following graph illustrates the growth rate with oil as the carbon source, revealing a maximum growth rate of 0.119 and a lag time of 0.567 hours.

Figure 6. Linear model for the oil group.

Exponential Growth Model with Smoothing Spline

In this case, spline fitting is performed on log-transformed data, assuming exponential growth at the time point of the maximum of its first derivative, and this process also involves cross-validation.

The results of replicate 1 with the same dilution show that the maximum growth rate with glucose is 0.154, while with oil, it is 0.251.

Figure 7. Nonparametric model for all replicates.

Polynomial model

The polynomial model is the most common one as well as the simplest one when researchers want to perform a curve fitting. In our project, we also tried several polynomial curve-fitting tests to find the optimal model. Only results from the fourth-order polynomial model are shown here. In total we fitted six orders of the model but decided to include here only one of them. The results from fitting the remaining models can be found in our dry lab notebook.

In this part, we used the library numpy, panda, matplotlib, scipy, and csv in Python (Anaconda Navigator).

Before applying the practical model, all the data from the experiments were stored as csv file with 2 columns. One is the time, and the other is the OD value. The format of the number is kept as 0.00 and the different columns were separated by semicolon.

Assumption: “unmeasurable” is equal to 99.

Figure 8. Scatter plot of the day group.

Figure 9. Scatter plot of the night group.

Objective function with fourth-order system

Figure 10. Fourth-order curve-fitting of the day group.

Figure 11. Fourth-order curve-fitting of the night group.

Discussion

From the above results, we could see the 4th order polynomial model fits our data well. The models with higher order showed over-fitting and the linear model cannot describe the growth of Yarrowia lipolytica (see the dry lab notebook for results). In this case, the model can be written in the mathematical way in the following equation.

Group	a	b	c	d	f
Glucose-10-Day	-0.8143581005399544	0.1546859026553308	-0.004671823577116975	4.155998845229985e-05	0.9350957284285495
Glucose-100-Day	0.6787062619569382	0.022938337335029934	-0.0004868143806393447	3.0919705812010946e-06	-1.6382038964754417
Oil-10-Day	3.186276833939559	-0.2222899617929109	0.011076814743971054	-0.0001485068888501073	-4.758860961991871
Oil-100-Day	4.002293735995875	-0.15441160369910434	0.004197843586585877	-4.1418456690800325e-05	-6.769108894062777
Glucose-10-Night	1.0264171381633578	0.013498782238027363	-0.0036076808022966755	8.06001443461879e-05	10.192369861942092
Glucose-100-Night	-0.196572547099106	0.14584600736652165	-0.006206491088681432	8.151455938376628e-05	16.059754872289098
Oil-10-Night	3.0185006626659026	-0.552146841438496	0.04284554531381868	-0.0008603731617413679	13.509429349429215
Oil-100-Night	-2.9076572646127232	0.06493650948214547	0.019916378820435177	-0.0005946122207640681	29.831892178768666

Table 4. Coefficients for each group of data.

From the table, it is clear to see that coefficients between oil and glucose are significantly different and the microorganism has a faster trend to grow on oil than glucose. However, because the microorganism might need some time to adopt to the new environment, in the group Oil-100-Night, the decline trend happened first.

Conclusions

Upon comparing the four methods, the maximum growth rate estimated by the heuristic linear method and the ODE model appear similar. However, the result obtained through the smoothing spline method differs from the other two methods. One possible reason for this discrepancy is that the linear model and ODE model rely on parameters, while the spline method is fully nonparametric, making it more sensitive to data points with significant gaps.

Optimal Design of Further Experiments

After obtaining the growth curve of Yarrowia lipolytica, determining the conditions that yield the maximum output becomes crucial. For future experiments, we have chosen a split-plot D-optimal design. D-optimal designs aim to maximize the determinant of the information matrix, which quantifies the precision and efficiency of parameter estimates.

In the production process, parameters such as temperature, pH, and the choice of medium significantly influence the yield, and we aim to quantify the impact of each parameter. The temperature and pH range from 20 to 30 and 2 to 5, respectively. We are comparing four different mediums: glucose, oil, oil with Tween, and oil with biosurfactant. Additionally, we have designated the medium as a hard-to-change factor, which serves as the whole-plot factor, while the other factors are considered subplot factors. When analyzing the data we have collected, it is essential to account for the influence of the whole-plot factor.

The model we use for analyzing data from a split-plot experiment with b whole plots of k runs is

where Y_ij is the response measured at the j-th run or sub-plot in the i-th whole plot(medium), x_ij is a vector that contains the levels of all the experimental factors at the j-th run in the i-th whole plot, f^'(x_ij) is that vector's model expansion, and β contains the intercept and all the factor effects that are in the model. The term γ_i represents the i-th whole-plot effect and ε_ij is the residual error associated with the j-th run in whole plot i. Each whole plot corresponds to an independent setting of the hard-to-change factors, and each run or sub-plot corresponds to one independent setting of the easy-to-change factors.

Run	Whole Plots	Temperature	pH	Medium
1	1	20	5	Oil with tween
2	1	20	2	Oil with tween
3	1	30	5	Oil with tween
4	1	30	2	Oil with tween
5	2	30	2	Oil with produced biosurfactant
6	2	20	5	Oil with produced biosurfactant
7	2	30	5	Oil with produced biosurfactant
8	2	20	2	Oil with produced biosurfactant
9	3	20	2	Oil with tween
10	3	20	5	Oil with tween
11	3	30	2	Oil with tween
12	3	30	5	Oil with tween
13	4	30	5	Oil
14	4	30	2	Oil
15	4	20	2	Oil
16	4	20	5	Oil
17	5	20	2	Glucose
18	5	20	5	Glucose
19	5	30	5	Glucose
20	5	30	2	Glucose
21	6	30	2	Oil with produced biosurfactant
22	6	20	2	Oil with produced biosurfactant
23	6	30	5	Oil with produced biosurfactant
24	6	20	5	Oil with produced biosurfactant

Table 5. Design of further growth experiments including different pH, temperature values and two more media.

Hydrophobin Design

Motivation

The primary objective of our protein design endeavor is to create a custom hydrophobin with a reduced isoelectric point (pI). This design aims to enhance emulsification efficiency under lower pH conditions. Hydrophobins are most effective at emulsification in alkaline pH environments, typically around 9. However, the microorganism Yarrowia lipolytica, with which we are working, thrives in lower pH ranges. Furthermore, in practical fermentation processes, especially those without a controlled pH environment, the pH tends to decrease rapidly.

The foundational knowledge about hydrophobins was graciously provided by Jeroen Vereman, while insights into Yarrowia lipolytica were shared by Vasileios Vangalis and Seppe Dockx from Professor Kevin Verstrepen's laboratory.

The strategic concept for designing a hydrophobin with a lower pI was formulated by our Principal Investigator, Vitor Pinheiro. This innovative approach underscores our commitment to addressing the specific needs of our research and application context.

Mutating the Sequence

As a starting point, we developed a Python script that generates all possible combinations of the original sequence, wherein lysine and arginine are substituted with either glutamate or aspartate. This theoretical substitution aims to lower the isoelectric point of the hydrophobin. Our script utilizes BioPython²⁰ to calculate the pI of the protein. To begin, we selected the HFBI sequence, a well-known hydrophobin comprising 75 amino acids (including 2 lysines and 1 arginine), with the corresponding PDB code being 2FZ6. The outcome of this script yielded 26 modified sequences, with pI values ranging from 4.05 to 4.22 within the default pH interval investigated using the pi() method from BioPython. When the minimum pH of the interval is adjusted downward, the calculated pI can drop even further, reaching as low as 3.17, which represents the lowest pI.

The sequence with the lowest pI of 3.17 exhibited mutations at positions K32D, R35D, and K50D. To ensure a fair comparison in our subsequent analysis, we utilized four sequences, including this one.

Prediction of the Impact of the Mutations on the Structure with AlphaFold

Predicted structures of:

Sequence with lowest pI after mutating (lysine and arginine are substituted with either glutamate or aspartate) the original HFBI sequence
Original sequence with 3 random leucines substituted with aspartate (mutations: L24D, L67D, L68D)
Sequence with 3 leucines substituted with aspartate (L24D, L67D, L68D) further mutated to have the lowest pI
Original sequence with all 7 leucines substituted with aspartate

The figure below shows superposition of the original HFBI and HFBI with 3 random leucines substituted with aspartate. We expected the mutations to highly impact the stability of the protein and result in either a very different structure or unstructured regions but we ended up with a very similar structure with some (but not very substantial) differences (e.g., in the β-sheets region). This points to one of the limitations of AlphaFold³. In particular, it does not predict the impact of mutations on the protein structure or the stability of the protein.

Superposition of HFBI and HFBI with three random leucines substituted with aspartate.

Figure 12. Superposition of HFBI (green) and HFBI with three random leucines substituted with aspartate (yellow).

In order to fully explore and confirm the encountered mutations, we predicted a stucture of the HFBI with all seven leucines substituted with aspartate. Figure below shows this structure aligned to the original HFBI structure. Obtained structure is still very similar to the original one and does not display any vast differences. This result further confirms that AlphaFold is not a good tool (at least yet) to study the impact of mutations on the protein stability and structure.

Superposition of HFBI and HFBI with all seven leucines substituted with aspartate.

Figure 13. Superposition of HFBI (green) and HFBI with all seven leucines substituted with aspartate (grey).

As a result of these conclusions, we decided to no longer use AlphaFold for designing a hydrophobin with lower isoelectric point and we are looking for other strategies to tackle this idea.

Prediction of Disease-related Mutations with PredictSNP

PredictSNP²² is a tool that predicts whether a given mutation is disease-related, indicating its potential impact on the stability of the protein. We submitted four jobs to PredictSNP, each involving a modification of the original HFBI sequence, all of which corresponded to the four sequences previously submitted to AlphaFold.

Note: The algorithm is trained on human protein data, while hydrophobins are fungal proteins. Therefore, we can only assume its accuracy for hydrophobins, but it is not necessarily guaranteed.

Mutated HFBI with lowest pI:

Most of the tools integrated into PredictSNP suggest that the mutations are likely to be neutral. The expected accuracy of PredictSNP is approximately 75% for each mutation. The MAPP prediction raises some concerns as its confidence level is approximately 78%. Further exploration of the sequence and the impact of these mutations is necessary. These initial results are promising because there is a possibility that these mutations may not significantly affect the protein's stability, which aligns with our research objectives.

Mutation	PredictSNP prediction	MAPP prediction	PhD-SNP prediction	PolyPhen-1 prediction	PolyPhen-2 prediction	SIFT prediction	SNAP prediction
K32D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL
R45D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	DELETERIOUS	NEUTRAL	DELETERIOUS
K50D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL

Table 6. PredictSNP result for the mutated HFBI with lowered pI.

Original sequence with 3 random leucines substituted with aspartate:

All three mutations appear to be deleterious, indicating that they indeed have an impact on the stability of the protein. This result aligns with our expectations, as these mutations were intentionally designed to affect the protein's structure.

Mutation	PredictSNP prediction	MAPP prediction	PhD-SNP prediction	PolyPhen-1 prediction	PolyPhen-2 prediction	SIFT prediction	SNAP prediction
L24D	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L67D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L68D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS

Table 7. PredictSNP result for the HFBI with three random leucines substituted with aspartate.

Sequence with 3 leucines substituted further mutated towards lower pI:

This sequence contains mutations that reduce the pI and involve the substitution of leucines with aspartate, both of which are expected to disrupt protein stability. This result consolidates the findings from the previous two sequences.

Mutation	PredictSNP prediction	MAPP prediction	PhD-SNP prediction	PolyPhen-1 prediction	PolyPhen-2 prediction	SIFT prediction	SNAP prediction
L24D	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
K32D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL
R45D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	DELETERIOUS	NEUTRAL	DELETERIOUS
K50D	NEUTRAL	DELETERIOUS	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL	NEUTRAL
L67D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L68D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS

Table 8. PredictSNP result for the HFBI with three random leucines substituted with aspartate and further mutated to get lower pI.

All leucines substituted:

In this sequence, all leucines have been replaced with aspartate. We made this choice to challenge the limitations of AlphaFold. As anticipated, these mutations are expected to influence the protein's structure, a prediction supported by PredictSNP but not detected by AlphaFold. Upon comparing the prediction confidence from all the tools, it is reasonable to assume that all of these mutations would indeed affect the protein's stability.

Mutation	PredictSNP prediction	MAPP prediction	PhD-SNP prediction	PolyPhen-1 prediction	PolyPhen-2 prediction	SIFT prediction	SNAP prediction
L12D	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L24D	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L26D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L29D	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L56D	NEUTRAL	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	NEUTRAL	NEUTRAL
L67D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS
L68D	DELETERIOUS	DELETERIOUS	NEUTRAL	DELETERIOUS	DELETERIOUS	DELETERIOUS	DELETERIOUS

Table 9. PredictSNP result for the HFBI with all leucines substituted with aspartate

Next Steps

The analysis we have conducted so far is preliminary. Further analysis and exploration are necessary to gain greater confidence in our designed hydrophobin. Our next steps would involve utilizing FoldX²³, pyFoldX²⁴ in particular, to thoroughly assess the stability of the mutated HFBI. By applying a force field to our hydrophobin, we could investigate the impact of mutations on both the stability and dynamics of the protein.

In the earlier stages, we addressed one of the limitations of AlphaFold, which is its inability to predict the effect of single nucleotide polymorphisms on protein structure. In September 2023, Google DeepMind, the creators of AlphaFold, introduced AlphaMissense²⁵. This tool, built upon AlphaFold, is designed to predict whether a given mutation is disease-related, potentially disrupting protein stability. Regrettably, we haven't had the opportunity to test it ourselves, but it presents an exciting avenue for our future analysis.

Du, H. X. et al. Engineering Yarrowia lipolytica for campesterol overproduction. PLoS One 11, (2016).
Zhang, Y. et al. Improved campesterol production in engineered Yarrowia lipolytica strains. Biotechnol Lett 39, (2017).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, (2021).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, (2022).
Cbalbin-Bio. GitHub - cbalbin-bio/Pymol-Color-Alphafold: PYMOL extension to Color AlphaFold Structures by Confidence (PLDDT). GitHub https://github.com/cbalbin-bio/pymol-color-alphafold (2022).
Li, X., Roberti, R. & Blobel, G. Structure of an integral membrane sterol reductase from Methylomicrobium alcaliphilum. Nature 517, 104-107 (2015).
Trott, O. & Olson, A. J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem (2009) doi:10.1002/jcc.21334.
Pettersen, E. F. et al. UCSF Chimera - A visualization system for exploratory research and analysis. J Comput Chem 25, (2004).
Stierand, K., Maaß, P. C. & Rarey, M. Molecular complexes at a glance: Automated generation of two-dimensional complex diagrams. in Bioinformatics vol. 22 (2006).
Laskowski, R. A. & Swindells, M. B. LigPlot+: Multiple ligand-protein interaction diagrams for drug discovery. J Chem Inf Model 51, (2011).
D.A. Case, H.M. Aktulga, K. Belfon, I.Y. Ben-Shalom, J.T. Berryman, S.R. Brozell, D.S. Cerutti, T.E. Cheatham, III, G.A. Cisneros, V.W.D. Cruzeiro, T.A. Darden, N. Forouzesh, G. Giambaşu, T. Giese, M.K. Gilson, H. Gohlke, A.W. Goetz, J. Harris, S. Izadi, S.A. Izmailov, K. Kasavajhala, M.C. Kaymak, E. King, A. Kovalenko, T. Kurtzman, T.S. Lee, P. Li, C. Lin, J. Liu, T. Luchko, R. Luo, M. Machado, V. Man, M. Manathunga, K.M. Merz, Y. Miao, O. Mikhailovskii, G. Monard, H. Nguyen, K.A. O’Hearn, A. Onufriev, F. Pan, S. Pantano, R. Qi, A. Rahnamoun, D.R. Roe, A. Roitberg, C. Sagui, S. Schott-Verdugo, A. Shajan, J. Shen, C.L. Simmerling, N.R. Skrynnikov, J. Smith, J. Swails, R.C. Walker, J. Wang, J. Wang, H. Wei, X. Wu, Y. Wu, Y. Xiong, Y. Xue, D.M. York, S. Zhao, Q. Zhu, and P.A. Kollman (2023), Amber 2023, University of California, San Francisco.
Wang, J., Wang, W., Kollman, P. A. & Case, D. A. Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model 25, (2006).
Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general Amber force field. J Comput Chem 25, (2004).
Roe, D. R. & Cheatham, T. E. PTRAJ and CPPTRAJ: Software for processing and analysis of molecular dynamics trajectory data. J Chem Theory Comput 9, (2013).
Garre, A., Koomen, J., den Besten, H. M. W. & Zwietering, M. H. Modeling Population Growth in R with the biogrowth Package. J Stat Softw 107, 1-51 (2023).
Petzoldt, T., Kneis, D. & Seiler, C. Maximum Growth Rates Estimation with-package growthrates.
Tip Biosystems. Understanding OD600 and measuring cell growth.
Lu, L., Yang, G., Zhu, B. & Pan, K. A comparative study on three quantitating methods of microalgal biomass. Indian Journal of Geo-Marine Sciences 46, (2017).
Hall, B. G., Acar, H., Nandipati, A. & Barlow, M. Growth rates made easy. Mol Biol Evol 31, (2014).
Baranyi, J. & Roberts, T. A. A dynamic approach to predicting bacterial growth in food. Int J Food Microbiol 23, (1994).
Cock, P. J. A. et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, (2009).
Bendl, J. et al. PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations. PLoS Comput Biol <10>10, (2014).
Schymkowitz, J. et al. The FoldX web server: An online force field. Nucleic Acids Res 33, (2005).
Radusky, L. G. & Serrano, L. PyFoldX: enabling biomolecular analysis and engineering along structural ensembles. Bioinformatics 38, (2022).
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (1979) 381, (2023).

Abstract

Enzyme Screening

Motivation

Structure Prediction

Molecular Docking

Molecular Dynamics

Discussion and Conclusions

Growth Model

Motivation

Basic Growth Data: Oil vs. Glucose Medium

Primary Growth Model

Optimal Design of Further Experiments

Hydrophobin Design

Motivation

Mutating the Sequence

Prediction of the Impact of the Mutations on the Structure with AlphaFold

Prediction of Disease-related Mutations with PredictSNP

Next Steps

References