Modeling | UniMuenster

Overview

Synthetic biology without modeling is like gin without tonic. During our project to improve the binding affinity of the egg yolk protein Vitellogenin of Apis mellifera to zymosan, several opportunities for modeling presented themselves. We relied on 3D structures predicted by AlphaFold for all of them. During the yeast surface display we performed, some questions came up, which prompted us to model and investigate the 3D structures of our constructs (Aga2p - Vitellogenin fragment - cMyc tag). To lay the groundwork for future docking simulations, which would allow computationally screening different mutations of Vitellogenin by getting estimates for the binding affinities, we tried to predict where exactly PAMPs bind to the Vitellogenin. As a rational design approach, we constructed chimera proteins by swapping domains from solitary bee species with (hopefully) improved zymosan binding ability into the Apis mellifera Vitellogenin. We used modeling approaches to make sure the new proteins are stable. Lastly, in case the attempts to increase the binding affinity of Vitellogenin do not yield satisfactory results, we developed an alternative approach using a moth protein, which is known to have good zymosan binding, connecting it to Vitellogenin with a linker.

Introduction

Protein engineering has become an indispensable tool in the field of biotechnology, enabling the creation and optimization of a wide range of valuable proteins, including antibody drugs, fluorescent proteins, and enzymes [1]. Among the various approaches in protein engineering, directed evolution has gained immense popularity. This technique revolves around the generation and screening of a highly diverse library of protein variants, aiming to select those with improved functionalities [2]. However, the success of directed evolution is linked to the size and diversity of the libraries used [1]. To overcome the limitations in library size, strategies such as focusing library diversity on specific residues guided by molecular structures, computational models, or phylogenetic data have emerged [3].

The integration of rational design with directed evolution holds the promise of predicting desired mutations in advance [2]. Advancements in our understanding of protein structure, function, and dynamics, coupled with the development of computational tools, including machine learning-based approaches, have paved the way for the creation of more efficient and focused protein libraries [4]. However, a significant challenge in the field of computational protein engineering lies in handling large, unlabeled sequencing data. In the pursuit of improving protein properties, the BeeVAX project aims to enhance the binding affinity of Apis mellifera Vitellogenin to the fungal pathogen-associated molecular pattern (PAMP) zymosan. Recent studies suggest that Vitellogenin exhibits a stronger binding affinity for peptidoglycan, a bacterial PAMP, compared to zymosan [3]. To address this challenge, a combination of rational design and computational algorithms was employed to identify potential binding pockets within Vitellogenin. Notably, this function of Vitellogenin as a PAMP binding protein is a relatively recent discovery, and Vitellogenin itself is a sizable protein comprising approximately 1770 amino acids [4]. Consequently, computational protein modeling faces significant challenges in understanding and manipulating such a complex protein structure.

Since no experimental 3D structure is available for Western honey bee Vitellogenin (or any insect Vitellogenin for that matter), we relied on the sequencebased structure prediction tool AlphaFold (AF) [5] for our 3D structures. It has been shown that AF structure predictions are well suited for the study of structural characteristics, like binding site prediction, on the condition that the confidence scores provided by the AF model are taken into account [6]. For the Western honey bee, a predicted Vitellogenin structure is available in the AlphaFold database.

Yeast Surface Display Constructs

In preparation for yeast surface display (YSD), we faced the challenge of the Vitellogenin protein's considerable size. Given the impracticality of displaying the entire protein on the yeast surface, we opted to divide it into smaller, more manageable segments. Our decision was influenced by both previous studies [9] and the results obtained from the P2Rank binding site predictions, which indicated the potential significance of the α-helical and DUF1943 domains in pathogen recognition.

To assess the binding affinity of these domains, we designed plasmids for YSD that contained the whole Vitellogenin and the split segments corresponding to the α-helical and DUF1943 domains. This approach allowed us to test their individual binding properties using flow cytometry with fluorescence-activated cell sorting (FACS). Additionally, we constructed a truncated version (Vg_truncated) of the plasmid that included the von Willebrand Factor (vWF) domain. This decision was motivated by previous research demonstrating PAMP recognition involving the DUF1943 and vWF domains in Vitellogenin from fish and a coral species [9].

Figure 1: Comparing of DUF1943 domain of the YSD construct and the DUF1943 domain inside the Vitellogenin

Our first FACS run yielded some unexpected results that we weren’t sure how to interpret. Especially the low affinity of α-helical was unexpected, as it was previously suggested to be the main contributor to PAMP affinity. This prompted us to investigate our surface display constructs by predicting their structure and comparing them to the full Vitellogenin. This analysis aimed to uncover any structural variations or anomalies that might account for the observed differences in binding affinity. We used ChimeraX to align the structures at the α-helical and the DUF1943 domain, respectively, and calculated the rmsd (root mean squared deviation) value in ångström (Å). The results of this comparative analysis are presented in the tables below.

Table 1: Comparison of the RMSD values (in Å) of the DUF1943 (duf) domain residues of Vitellogenin (Vg), truncated, alpha+duf and duf.

RMSD duf	Vg	truncated	alpha+duf	alpha	duf
Vg	0	1.135	2.022	-	2.868
truncated	1.135	0	2.319	-	2.87
alpha+duf	2.022	2.319	0	-	2.78
alpha	-	-	-	0	-
duf	2.868	2.87	2.78	-	0

Table 2: Comparison of the RMSD values (in Å) of the α-helical (alpha) domain residues of Vitellogenin (Vg), truncated, alpha+duf and duf.

RMSD alpha	Vg	truncated	alpha+duf	alpha	duf
Vg	0	0.448	0.556	1.087	-
truncated	0.448	0	0.245	0.968	-
alpha+duf	0.556	0.245	0	1	-
alpha	1.087	0.968	1	0	-
duf	-	-	-	-	0

We see that for the DUF1943, the structure prediction by AlphaFold without the other domains differs significantly from the structure embedded inside the whole Vitellogenin. In fact, even for the DUF1943 domains from the solitary bees (see below), the rmsd between the domains never exceeds 1.5 while only having a sequence identity of around 50%. This suggests that the DUF1943 requires the adjacent protein parts to be present in order to be folded correctly, and may explain the peculiar FACS results for the DUF1943 alone. The α-helical domain, on the other hand, shows no significant deviation in conformation (the overall accuracy of AlphaFold has been evaluated to be around 1.5 Å during CASP14 [5]), so misfolding doesn’t seem to be the reason for its low binding affinity.

Pocket Prediction with P2Rank

Since we wanted to improve the binding affinity of Vitellogenin for zymosan, our ultimate goal was using molecular docking to virtually screen different mutations. Those candidates could then be evaluated inexpensively by our binding assay. Docking simulations usually have some underlying model of binding energy and then try to find the best orientation of the molecules relative to each other. Unfortunately, accurate docking requires knowing where the ligand actually binds, which is unknown in the case of Vitellogenin. That is why we first focused on the binding site prediction. Our hope was to combine our in silico results with the yeast surface display results from the wet lab by sequencing the improved variants and localizing their relevant mutations, but unfortunately there wasn’t enough time for the cell sorting.

We chose P2Rank [7] since it was specifically developed with ease of use in mind, and we wanted to make our approach easy to replicate for other teams looking to do protein engineering. Usually, experimental 3D structures are used to do further function predictions for proteins, but it has been shown that AlphaFold structures can be just as suitable [6]. This is the case when the confidence score predicted by AlphaFold (called pLDDT with 0 being the worst and 100 the best possible score) at the pockets is high (≥90). Because we suspect many teams are relying on AlphaFold models for their protein engineering approaches, we created a command line script that calculates the average AlphaFold confidence score for every pocket predicted by P2Rank and adds it as a column to the created file. This allows the user to easily evaluate how reliable the results are in absence of an experimental 3D structure.

For visualization and further processing of protein structures, we chose ChimeraX over Pymol because of better availability for academic use. But while the P2Rank tool outputs a script for visualizing the pockets in Pymol, no such functionality existed for ChimeraX. So we decided to build a tool that generates a visualization script for ChimeraX from the P2Rank results. The visualization script opens the PDB file in ChimeraX and colors all the predicted pockets. It is available in our software repository.

Initially, P2Rank yields 42 candidates for potential binding sites. While Vitellogenin is a large protein with many functions, this is clearly not satisfactory. Ignoring pockets with a predicted probability of less than 10%, 14 pockets remain. Using the AlphaFold confidence scores as well as findings from biological studies involving the analysis of samples from bees across 15 different countries and 121 protein variants allowed us to narrow it down to 4 promising pockets (see the engineering).

Table 3: The table shows the probability given by P2Rank for the 14 selected pocket candidates. The mean_plddt is the average AlphaFold confidence score for the involved residues. It represents the combination of P2Rank and AlphaFold which is required for further narrowing down the pockets

Name	Rank	Probability	Mean pLDDT
Pocket 1	1	0.98	90.93
Pocket 2	2	0.94	84.99
Pocket 3	3	0.92	95.61
Pocket 4	4	0.76	92.38
Pocket 5	5	0.65	76.17
Pocket 6	6	0.47	73.23
Pocket 7	7	0.27	84.66
Pocket 8	8	0.21	85.54
Pocket 9	9	0.18	83.89
Pocket 10	10	0.17	89.52
Pocket 11	11	0.16	81.08
Pocket 12	12	0.13	94.65
Pocket 13	13	0.12	95.63
Pocket 14	14	0.11	95.59

The picture shows the protein structure of Vitellogenin which is a huge coil of alpha helices and beta- sheets. The vWF domain can be estimated prominent outside of the protein. The 42 pocket candidates are spread in the main body of the protein without any structure.

Figure 2: Vitellogenin structure in ChimeraX, assembled with AlphaFold, indicating 42 predicted pocket candidates highlighted in different colors.

Figure 3: Vitellogenin structure in ChimeraX, assembled with AlphaFold, indicating four predicted pocket candidates, highlighted in different colors. The α-helical domain is presented in turquoise and the DUF1943 domain is presented in dark blue. Pocket 1 is yellow, pocket 3 is orange, pocket 4 is hot pink, and lastly, pocket 10 is red.

Had we obtained sequencing results from our cell sorting, our plan would have been to use the location of mutations to narrow it down further, but unfortunately, we did not get to that point. We suggest using those 4 pockets to do computational docking or employing site directed mutagenesis on them to gain further functional understanding.

Solitary Bees

Building upon our earlier predictions of the binding pocket formed by the α-helical and DUF1943 domains within Vitellogenin (as described in the P2Rank section), we thought about a new approach. The Vitellogenin of solitary bee species may possess distinct immune-related features. The temperature of beehives is highly regulated and varies between 31 °C and 36 °C. Per contra, the optimal temperature for the spore germination and colony growth of most entomopathogenic fungi varies from 20 °C to 25 °C, also temperatures higher than 30° C seem to inhibit the growth [10]. However, in the case of solitary bees that do not form hives, we assumed that there might be a heightened need for immune priming against fungal pathogens. So we decided to swap their relevant domains (α-helical and DUF1943, see measurement) with the ones in Apis mellifera’s Vitellogenin. Swapping the α-helical, the DUF1943, and both domains together for the 6 solitary species we chose (they can be seen in table 4) yielded 18 different chimeras in total. We created structure predictions for the solitary bee Vitellogenins and all the chimeras, and compiled statistics to see how much they differ from the original Apis mellifera Vitellogenin.

For the in silico creation of chimeric proteins, we used AlphaFold [5]. In predicting 3D structures of the Vitellogenins from the solitary bee species, we made intriguing findings. Although there were differences in amino acid sequences, the 3D structures appeared highly conserved. We performed sequence swaps between the α-helical and DUF1943 domains, followed by recalculating the 3D structures. The 3D structures of the chimeric proteins show comparable confidence to that of the native Western honey bee Vitellogenin.

Table 4: Sequence Identity and RMSD (in Å) values between Apis mellifera and different bee species. Sequence identity was calculated in SnapGene. The RMSD values were calculated by comparing the domains.

Bee	alpha identity	duf identity	alpha RMSD	duf RMSD	alpha+duf RMSD
Apis mellifera	100%	100%	0	0	0
Dufourea novaeangliae	67%	48%	0.57	1.17	0.95
Frieseomelitta varia	64%	50%	1.24	1.16	1.28
Habropoda laboriosa	66%	51%	1.16	-	-
Heterotrigona itama	63%	49%	1.03	1.08	1.08
Megachile rotundata	69%	46%	0.48	1.46	1.1
Melipona quadrifasciata	60%	49%	-	0.77	-

For the missing values, a structural alignment wasn’t possible due to differing sequence lengths. We see that AlphaFold predicts the different Vitellogenins to be highly structurally conserved, so we suspect they are well suited for the creation of chimeric proteins.

To further corroborate the stability of these chimeric proteins, we employed molecular dynamics simulations. For this purpose, we used GROMACS, a tool designed for conducting molecular dynamics simulations of biomolecules, including proteins, nucleic acids, and lipids [11]. We simulated the Vitellogenin chimeras in water for around 50 ns and observed the rmsd values converging to 0.6-1.2 Å.

RMSD results

RMSD results: All plots show the all-atom RMSD (in Å) between the starting protein structure and the current one, where the time from start is given in ps. Each plot shows the molecular dynamics simulation of Apis mellifera Vitellogenin, where either the α-helical or the DUF1943 or both domains have been swapped with the domain from one of the solitary bee species.

RMSD 1: α-helical domain of Dufourea novaeangliae inside Apis mellifera Vitellogenin

RMSD 2: α-helical domain of Frieseomelitta varia inside Apis mellifera Vitellogenin

RMSD 3: α-helical domain of Habropoda laboriosa inside Apis mellifera Vitellogenin

RMSD 4: α-helical domain of Heterotrigona itama inside Apis mellifera Vitellogenin

RMSD 5: α-helical domain of Megachile rotundata inside Apis mellifera Vitellogenin

RMSD 6: α-helical domain of Melipona quadrifasciata inside Apis mellifera Vitellogenin

RMSD 7: DUF1943 domain of Dufourea novaeangliae inside Apis mellifera Vitellogenin

RMSD 8: DUF1943 domain of Frieseomelitta varia inside Apis mellifera Vitellogenin

RMSD 9: DUF1943 domain of Habropoda laboriosa inside Apis mellifera Vitellogenin

RMSD 10: DUF1943 domain of Heterotrigona itama inside Apis mellifera Vitellogenin

RMSD 11: DUF1943 domain of Megachile rotundata inside Apis mellifera Vitellogenin

RMSD 12: α-helical and DUF1943 domain of Dufourea novaeangliae inside Apis mellifera Vitellogenin

RMSD 13: α-helical and DUF1943 domain of Frieseomelitta varia inside Apis mellifera Vitellogenin

RMSD 14: α-helical and DUF1943 domain of Heterotrigona itama inside Apis mellifera Vitellogenin

RMSD 15: α-helical and DUF1943 domain of Megachile rotundata inside Apis mellifera Vitellogenin

RMSD 15: α-helical and DUF1943 domain of Melipona quadrifasciata inside Apis mellifera Vitellogenin

Since the accuracy of AlphaFold has been evaluated to be around 1.5 Å, we consider those to be satisfying results indicating the stability of our chimeric proteins. The next step would be to use our binding assay to evaluate their zymosan binding affinity, but unfortunately, we lacked the time for that.

CTL14 Modeling

As an alternative approach to Vitellogenin optimization, we discussed the linkage between the Apis mellifera Vitellogenin and the moth Helicoverpa armigera protein C-type lectin 14 (CTL14). Vitellogenin exhibits a binding preference for Gram-positive bacteria [3]. This is why we aimed to specifically enhance the Vitellogenin protein's affinity for fungal PAMPs through rational design. CTL14 is a protein, which is naturally able to bind fungi PAMPs [12]. By connecting both proteins, we want to combine the fungal PAMP binding abilities of CTL14 with the Western honey bee Vitellogenin.

We simulated the linkage of the two proteins via a selection of potential linkers using AlphaFold [8, 5]. We tested three different linkers, two rigid and a flexible one. Rigid linkers often have an α-helical conformation or are rich in the amino acid proline. This ensures a stable connection and a fixed distance between linked proteins. In contrast, flexible linkers allow the proteins to move and interact with each other. Flexible linkers are rich in small or hydrophilic amino acids, such as glycine and serine [13, 14].

To be used as a flexible linker, we simulated an amino acid with (GGGGS)_n sequence. In the simulation with (GGGGS)₁, the structures remained separated, and the His-Tag was accessible for purification. Further simulations showed that when increasing the length of the flexible linker, the protein structures started to rotate into each other (Fig. 4).

The first rigid linkers simulated had the amino acid sequence (EAAAK)_n. In these simulations, Vitellogenin and CTL14 are firmly separated. However, the His-Tag added to the protein would rotate into the Vitellogenin domain. This would be an obstacle to separating and cleaning the combined proteins, as the His-Tag would be poorly accessible in this rigid linker configuration (Fig. 5).

The second rigid linker with amino acid sequence A(EAAAK)₂A showed that the structures of the proteins remained separated and that the His-Tag remained accessible without rotating into either of the structures (Fig. 6). Furthermore, research into the subject of linkers showcased that this linker is beneficial for biological activity of fusion proteins [13]. According to the simulation and the current research, this linker is optimal for our expression and the following binding assay.

Figure 4: Illustration of the fusion protein consisting of the Apis mellifera Vitellogenin (blue) and Helicoverpa armigera CTL14 (orange). Proteins were linked by a flexible linker with the amino acid sequence (GGGGS)_n (pink). A C-terminal 10x His-Tag (red) was added for purification of the protein. A) Fusion protein with the linker (GGGGS)₂. B) Fusion protein with the linker (GGGGS)₃. C) Fusion protein with the linker (GGGGS)₄. D) Fusion protein with the linker (GGGGS)₅.

Figure 5: Illustration of the fusion protein consisting of the Apis mellifera Vitellogenin (blue) and Helicoverpa armigera CTL14 (orange). Proteins were linked by a rigid linker with the amino acid sequence (EAAAK)_n. (pink). A C-terminal 10x His-Tag (red) was added for purification of the protein. A) Fusion protein with the linker (EAAAK)₁. B) Fusion protein with the linker (EAAAK)₂.

Figure 6: Illustration of the fusion protein consisting of the Apis mellifera Vitellogenin (blue) and Helicoverpa armigera CTL14 (orange). Proteins were linked by a rigid linker with the amino acid sequence A(EAAAK)₂A (pink). A C-terminal 10x His-Tag was added for purification of the protein. The illustration shows the fusion protein from different angles.

References

[1] M. Baker, “Protein engineering: navigating between chance and reason,” Nature Publishing Group, 2011, doi: 10.1038/nmeth.1654.
[2] Y. Wang, P. Xue, M. Cao, T. Yu, S. T. Lane, and H. Zhao, “Directed Evolution: Methodologies and Applications,” 2021, doi: 10.1021/acs.chemrev.1c00260.
[3] H. Salmela, G. V Amdam, and D. Freitak, “Transfer of Immunity from Mother to Offspring Is Mediated via Egg-Yolk Protein Vitellogenin,” 2015, doi: 10.1371/journal.ppat.1005015.
[4] G. Harwood, G. Amdam, and D. Freitak, “The role of Vitellogenin in the transfer of immune elicitors from gut to hypopharyngeal glands in honey bees (Apis mellifera),” J Insect Physiol, vol. 112, pp. 90–100, Jan. 2019, doi: 10.1016/J.JINSPHYS.2018.12.006.
[5] J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, p. 583, 2021, doi: 10.1038/s41586-021-03819-2.
[6] M. Akdel et al., “A structural biology community assessment of AlphaFold2 applications Check for updates,” Nature Structural & Molecular Biology |, vol. 29, 2022, doi: 10.1038/s41594-022-00849-w.
[7] R. Krivák and D. Hoksza, “P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure,” J Cheminform, vol. 10, no. 1, pp. 1–12, Dec. 2018, doi: 10.1186/S13321-018-0285-8/TABLES/4.
[8] M. Varadi et al., “NAR Breakthrough Article AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models,” Nucleic Acids Res, vol. 50, 2021, doi: 10.1093/nar/gkab1061.
[9] V. Leipart et al., “Identification of 121 variants of honey bee Vitellogenin protein sequences with structural differences at functional sites,” 2022, doi: 10.1002/pro.4369.
[10] S. Gisder et al., “Entomopathogenic Fungi for Pests and Predators Control in Beekeeping,” 2022, doi: 10.3390/vetsci9020095.
[11] M. J. Abraham et al., “ScienceDirect GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers”, doi: 10.1016/j.softx.2015.06.001.
[12] Y. Cheng, Z. Lin, J.-M. Wang, L.-S. Xing, G.-H. Xiong, and Z. Zou, “CTL14, a recognition receptor induced in late stage larvae, modulates anti-fungal immunity in cotton bollworm Helicoverpa armigera,” Developmental and comparative immunology, vol. 84, pp. 142–152, 2018, doi: 10.1016/j.dci.2018.02.010.
[13] X. Chen, J. L. Zaro, and W.-C. Shen, “Fusion protein linkers: property, design and functionality,” Advanced drug delivery reviews, vol. 65, no. 10, pp. 1357–1369, 2013, doi: 10.1016/j.addr.2012.09.039.
[14] V. P. Reddy Chichili, V. Kumar, and J. Sivaraman, “Linkers in the structural biology of protein-protein interactions,” Protein science : a publication of the Protein Society, vol. 22, no. 2, pp. 153–167, 2013, doi: 10.1002/pro.2206.

Results

Measurement

Contribution