Synthetic biology without modeling is like gin without tonic. During our project to improve the binding affinity of the egg yolk protein Vitellogenin of Apis mellifera to zymosan, several opportunities for modeling presented themselves. We relied on 3D structures predicted by AlphaFold for all of them. During the yeast surface display we performed, some questions came up, which prompted us to model and investigate the 3D structures of our constructs (Aga2p - Vitellogenin fragment - cMyc tag). To lay the groundwork for future docking simulations, which would allow computationally screening different mutations of Vitellogenin by getting estimates for the binding affinities, we tried to predict where exactly PAMPs bind to the Vitellogenin. As a rational design approach, we constructed chimera proteins by swapping domains from solitary bee species with (hopefully) improved zymosan binding ability into the Apis mellifera Vitellogenin. We used modeling approaches to make sure the new proteins are stable. Lastly, in case the attempts to increase the binding affinity of Vitellogenin do not yield satisfactory results, we developed an alternative approach using a moth protein, which is known to have good zymosan binding, connecting it to Vitellogenin with a linker.
Protein engineering has become an indispensable tool in the field of biotechnology, enabling the creation and optimization of a wide range of valuable proteins, including antibody drugs, fluorescent proteins, and enzymes [1]. Among the various approaches in protein engineering, directed evolution has gained immense popularity. This technique revolves around the generation and screening of a highly diverse library of protein variants, aiming to select those with improved functionalities [2]. However, the success of directed evolution is linked to the size and diversity of the libraries used [1]. To overcome the limitations in library size, strategies such as focusing library diversity on specific residues guided by molecular structures, computational models, or phylogenetic data have emerged [3].
The integration of rational design with directed evolution holds the promise of predicting desired mutations in advance [2]. Advancements in our understanding of protein structure, function, and dynamics, coupled with the development of computational tools, including machine learning-based approaches, have paved the way for the creation of more efficient and focused protein libraries [4]. However, a significant challenge in the field of computational protein engineering lies in handling large, unlabeled sequencing data. In the pursuit of improving protein properties, the BeeVAX project aims to enhance the binding affinity of Apis mellifera Vitellogenin to the fungal pathogen-associated molecular pattern (PAMP) zymosan. Recent studies suggest that Vitellogenin exhibits a stronger binding affinity for peptidoglycan, a bacterial PAMP, compared to zymosan [3]. To address this challenge, a combination of rational design and computational algorithms was employed to identify potential binding pockets within Vitellogenin. Notably, this function of Vitellogenin as a PAMP binding protein is a relatively recent discovery, and Vitellogenin itself is a sizable protein comprising approximately 1770 amino acids [4]. Consequently, computational protein modeling faces significant challenges in understanding and manipulating such a complex protein structure.
Since no experimental 3D structure is available for Western honey bee Vitellogenin (or any insect Vitellogenin for that matter), we relied on the sequencebased structure prediction tool AlphaFold (AF) [5] for our 3D structures. It has been shown that AF structure predictions are well suited for the study of structural characteristics, like binding site prediction, on the condition that the confidence scores provided by the AF model are taken into account [6]. For the Western honey bee, a predicted Vitellogenin structure is available in the AlphaFold database.
In preparation for yeast surface display (YSD), we faced the challenge of the Vitellogenin protein's considerable size. Given the impracticality of displaying the entire protein on the yeast surface, we opted to divide it into smaller, more manageable segments. Our decision was influenced by both previous studies [9] and the results obtained from the P2Rank binding site predictions, which indicated the potential significance of the α-helical and DUF1943 domains in pathogen recognition.
To assess the binding affinity of these domains, we designed plasmids for YSD that contained the whole Vitellogenin and the split segments corresponding to the α-helical and DUF1943 domains. This approach allowed us to test their individual binding properties using flow cytometry with fluorescence-activated cell sorting (FACS). Additionally, we constructed a truncated version (Vg_truncated) of the plasmid that included the von Willebrand Factor (vWF) domain. This decision was motivated by previous research demonstrating PAMP recognition involving the DUF1943 and vWF domains in Vitellogenin from fish and a coral species [9].
Our first FACS run yielded some unexpected results that we weren’t sure how to interpret. Especially the low affinity of α-helical was unexpected, as it was previously suggested to be the main contributor to PAMP affinity. This prompted us to investigate our surface display constructs by predicting their structure and comparing them to the full Vitellogenin. This analysis aimed to uncover any structural variations or anomalies that might account for the observed differences in binding affinity. We used ChimeraX to align the structures at the α-helical and the DUF1943 domain, respectively, and calculated the rmsd (root mean squared deviation) value in ångström (Å). The results of this comparative analysis are presented in the tables below.
RMSD duf | Vg | truncated | alpha+duf | alpha | duf |
---|---|---|---|---|---|
Vg | 0 | 1.135 | 2.022 | - | 2.868 |
truncated | 1.135 | 0 | 2.319 | - | 2.87 |
alpha+duf | 2.022 | 2.319 | 0 | - | 2.78 |
alpha | - | - | - | 0 | - |
duf | 2.868 | 2.87 | 2.78 | - | 0 |
RMSD alpha | Vg | truncated | alpha+duf | alpha | duf |
---|---|---|---|---|---|
Vg | 0 | 0.448 | 0.556 | 1.087 | - |
truncated | 0.448 | 0 | 0.245 | 0.968 | - |
alpha+duf | 0.556 | 0.245 | 0 | 1 | - |
alpha | 1.087 | 0.968 | 1 | 0 | - |
duf | - | - | - | - | 0 |
We see that for the DUF1943, the structure prediction by AlphaFold without the other domains differs significantly from the structure embedded inside the whole Vitellogenin. In fact, even for the DUF1943 domains from the solitary bees (see below), the rmsd between the domains never exceeds 1.5 while only having a sequence identity of around 50%. This suggests that the DUF1943 requires the adjacent protein parts to be present in order to be folded correctly, and may explain the peculiar FACS results for the DUF1943 alone. The α-helical domain, on the other hand, shows no significant deviation in conformation (the overall accuracy of AlphaFold has been evaluated to be around 1.5 Å during CASP14 [5]), so misfolding doesn’t seem to be the reason for its low binding affinity.
Since we wanted to improve the binding affinity of Vitellogenin for zymosan, our ultimate goal was using molecular docking to virtually screen different mutations. Those candidates could then be evaluated inexpensively by our binding assay. Docking simulations usually have some underlying model of binding energy and then try to find the best orientation of the molecules relative to each other. Unfortunately, accurate docking requires knowing where the ligand actually binds, which is unknown in the case of Vitellogenin. That is why we first focused on the binding site prediction. Our hope was to combine our in silico results with the yeast surface display results from the wet lab by sequencing the improved variants and localizing their relevant mutations, but unfortunately there wasn’t enough time for the cell sorting.
We chose P2Rank [7] since it was specifically developed with ease of use in mind, and we wanted to make our approach easy to replicate for other teams looking to do protein engineering. Usually, experimental 3D structures are used to do further function predictions for proteins, but it has been shown that AlphaFold structures can be just as suitable [6]. This is the case when the confidence score predicted by AlphaFold (called pLDDT with 0 being the worst and 100 the best possible score) at the pockets is high (≥90). Because we suspect many teams are relying on AlphaFold models for their protein engineering approaches, we created a command line script that calculates the average AlphaFold confidence score for every pocket predicted by P2Rank and adds it as a column to the created file. This allows the user to easily evaluate how reliable the results are in absence of an experimental 3D structure.
For visualization and further processing of protein structures, we chose ChimeraX over Pymol because of better availability for academic use. But while the P2Rank tool outputs a script for visualizing the pockets in Pymol, no such functionality existed for ChimeraX. So we decided to build a tool that generates a visualization script for ChimeraX from the P2Rank results. The visualization script opens the PDB file in ChimeraX and colors all the predicted pockets. It is available in our software repository.
Initially, P2Rank yields 42 candidates for potential binding sites. While Vitellogenin is a large protein with many functions, this is clearly not satisfactory. Ignoring pockets with a predicted probability of less than 10%, 14 pockets remain. Using the AlphaFold confidence scores as well as findings from biological studies involving the analysis of samples from bees across 15 different countries and 121 protein variants allowed us to narrow it down to 4 promising pockets (see the engineering).
Name | Rank | Probability | Mean pLDDT |
---|---|---|---|
Pocket 1 | 1 | 0.98 | 90.93 |
Pocket 2 | 2 | 0.94 | 84.99 |
Pocket 3 | 3 | 0.92 | 95.61 |
Pocket 4 | 4 | 0.76 | 92.38 |
Pocket 5 | 5 | 0.65 | 76.17 |
Pocket 6 | 6 | 0.47 | 73.23 |
Pocket 7 | 7 | 0.27 | 84.66 |
Pocket 8 | 8 | 0.21 | 85.54 |
Pocket 9 | 9 | 0.18 | 83.89 |
Pocket 10 | 10 | 0.17 | 89.52 |
Pocket 11 | 11 | 0.16 | 81.08 |
Pocket 12 | 12 | 0.13 | 94.65 |
Pocket 13 | 13 | 0.12 | 95.63 |
Pocket 14 | 14 | 0.11 | 95.59 |
Had we obtained sequencing results from our cell sorting, our plan would have been to use the location of mutations to narrow it down further, but unfortunately, we did not get to that point. We suggest using those 4 pockets to do computational docking or employing site directed mutagenesis on them to gain further functional understanding.
Building upon our earlier predictions of the binding pocket formed by the α-helical and DUF1943 domains within Vitellogenin (as described in the P2Rank section), we thought about a new approach. The Vitellogenin of solitary bee species may possess distinct immune-related features. The temperature of beehives is highly regulated and varies between 31 °C and 36 °C. Per contra, the optimal temperature for the spore germination and colony growth of most entomopathogenic fungi varies from 20 °C to 25 °C, also temperatures higher than 30° C seem to inhibit the growth [10]. However, in the case of solitary bees that do not form hives, we assumed that there might be a heightened need for immune priming against fungal pathogens. So we decided to swap their relevant domains (α-helical and DUF1943, see measurement) with the ones in Apis mellifera’s Vitellogenin. Swapping the α-helical, the DUF1943, and both domains together for the 6 solitary species we chose (they can be seen in table 4) yielded 18 different chimeras in total. We created structure predictions for the solitary bee Vitellogenins and all the chimeras, and compiled statistics to see how much they differ from the original Apis mellifera Vitellogenin.
For the in silico creation of chimeric proteins, we used AlphaFold [5]. In predicting 3D structures of the Vitellogenins from the solitary bee species, we made intriguing findings. Although there were differences in amino acid sequences, the 3D structures appeared highly conserved. We performed sequence swaps between the α-helical and DUF1943 domains, followed by recalculating the 3D structures. The 3D structures of the chimeric proteins show comparable confidence to that of the native Western honey bee Vitellogenin.
Bee | alpha identity | duf identity | alpha RMSD | duf RMSD | alpha+duf RMSD |
---|---|---|---|---|---|
Apis mellifera | 100% | 100% | 0 | 0 | 0 |
Dufourea novaeangliae | 67% | 48% | 0.57 | 1.17 | 0.95 |
Frieseomelitta varia | 64% | 50% | 1.24 | 1.16 | 1.28 |
Habropoda laboriosa | 66% | 51% | 1.16 | - | - |
Heterotrigona itama | 63% | 49% | 1.03 | 1.08 | 1.08 |
Megachile rotundata | 69% | 46% | 0.48 | 1.46 | 1.1 |
Melipona quadrifasciata | 60% | 49% | - | 0.77 | - |
For the missing values, a structural alignment wasn’t possible due to differing sequence lengths. We see that AlphaFold predicts the different Vitellogenins to be highly structurally conserved, so we suspect they are well suited for the creation of chimeric proteins.
To further corroborate the stability of these chimeric proteins, we employed molecular dynamics simulations. For this purpose, we used GROMACS, a tool designed for conducting molecular dynamics simulations of biomolecules, including proteins, nucleic acids, and lipids [11]. We simulated the Vitellogenin chimeras in water for around 50 ns and observed the rmsd values converging to 0.6-1.2 Å.
Since the accuracy of AlphaFold has been evaluated to be around 1.5 Å, we consider those to be satisfying results indicating the stability of our chimeric proteins. The next step would be to use our binding assay to evaluate their zymosan binding affinity, but unfortunately, we lacked the time for that.
As an alternative approach to Vitellogenin optimization, we discussed the linkage between the Apis mellifera Vitellogenin and the moth Helicoverpa armigera protein C-type lectin 14 (CTL14). Vitellogenin exhibits a binding preference for Gram-positive bacteria [3]. This is why we aimed to specifically enhance the Vitellogenin protein's affinity for fungal PAMPs through rational design. CTL14 is a protein, which is naturally able to bind fungi PAMPs [12]. By connecting both proteins, we want to combine the fungal PAMP binding abilities of CTL14 with the Western honey bee Vitellogenin.
We simulated the linkage of the two proteins via a selection of potential linkers using AlphaFold [8, 5]. We tested three different linkers, two rigid and a flexible one. Rigid linkers often have an α-helical conformation or are rich in the amino acid proline. This ensures a stable connection and a fixed distance between linked proteins. In contrast, flexible linkers allow the proteins to move and interact with each other. Flexible linkers are rich in small or hydrophilic amino acids, such as glycine and serine [13, 14].
To be used as a flexible linker, we simulated an amino acid with (GGGGS)n sequence. In the simulation with (GGGGS)1, the structures remained separated, and the His-Tag was accessible for purification. Further simulations showed that when increasing the length of the flexible linker, the protein structures started to rotate into each other (Fig. 4).
The first rigid linkers simulated had the amino acid sequence (EAAAK)n. In these simulations, Vitellogenin and CTL14 are firmly separated. However, the His-Tag added to the protein would rotate into the Vitellogenin domain. This would be an obstacle to separating and cleaning the combined proteins, as the His-Tag would be poorly accessible in this rigid linker configuration (Fig. 5).
The second rigid linker with amino acid sequence A(EAAAK)2A showed that the structures of the proteins remained separated and that the His-Tag remained accessible without rotating into either of the structures (Fig. 6). Furthermore, research into the subject of linkers showcased that this linker is beneficial for biological activity of fusion proteins [13]. According to the simulation and the current research, this linker is optimal for our expression and the following binding assay.