Model | PekingHSC - iGEM 2023

Previous experiments by Professor Zhou's team showed that the oncolytic virus binds to transferrin receptor 1 (TfR1) on the surface of tumor cells through the NA protein on the viral surface, and that this interaction mediates viral internalization. Our initial modeling therefore focuses on these two proteins.

Our physical simulations primarily include protein structure prediction, protein-protein docking, quantum chemistry calculations, and molecular dynamics simulations.

Structure of Human Transferrin Receptor

The transferrin receptor (TfR) is involved in a dynamic process of endocytosis and subsequent reemergence at the cell surface. During this process, the receptor internalizes iron-loaded transferrin (Tf) through clathrin-mediated endocytosis. Once inside the cell, iron is released from Tf in the endosome, allowing the TfR to recycle apotransferrin back to the cell surface for further use.

Fig. 1. Structure of TfR1 on plasma membrane.

Human TfR1 is a homodimeric type II transmembrane protein consisting of two 90-kD subunits. Each subunit comprises a short NH2-terminal cytoplasmic region (residues 1 to 67) containing the internalization motif YTRF, followed by a single transmembrane pass (residues 68 to 88). The large extracellular portion (ectodomain, residues 89 to 760) of each subunit contains a binding site for the 80-kD Tf molecule.

The TfR1 dimer has a globular extracellular structure, separated from the membrane by a stalk of about 30 Å, which includes residues immediately following the transmembrane pass and an O-linked glycan at Thr104, along with two intermolecular disulfide bonds (formed by Cys89 and Cys98). Notably, the intermolecular disulfide bonds are not required for dimerization.

Fig. 2. Individual TfR domains.

In the ribbon diagrams, we can observe three domains of the protein structure: domain I, which is the protease-like domain (labeled as A); domain II, known as the apical domain (labeled as B); and domain III, referred to as the helical domain (labeled as C). The secondary structure elements are appropriately labeled on the diagrams. In the text, we identify these elements first based on their respective domain numbers and then based on their linear order within each domain^[1].

Structure of Neuraminidase

The viral NA (neuraminidase) assembles as a tetramer of four identical polypeptides and constitutes approximately 10-20% of the total glycoproteins on the virion surface when embedded in the viral envelope. An average-sized virion of 120 nm typically possesses around 40-50 NA spikes and 300-400 HA spikes^[2][3]. Each of the four NA monomers consists of approximately 470 amino acids and folds into four distinct structural domains: the cytoplasmic tail, the transmembrane region, the stalk, and the catalytic head.

Fig. 3. NA exists as a tetramer of four identical monomers.

The NA active site comprises two parts. The inner shell consists of eight highly conserved residues (Arg118, Asp151, Arg152, Arg224, Glu276, Arg292, Arg371, and Tyr406) that directly interact with sialic acid. The outer shell consists of ten framework residues (Glu119, Arg156, Trp178, Ser179, Asp198, Ile222, Glu227, Glu277, Asn294, and Glu425) that do not interact with sialic acid but play a crucial structural role^[4]. Among the inner shell residues, three arginine residues (Arg118, Arg292, and Arg371) interact with the carboxylate of the sialic acid substrate. Arg152 binds to the acetamido group on the sugar ring, while Glu276 interacts with the 8- and 9-hydroxyl groups on the glycerol side chain. The active site is highly conserved in both spatial orientation and sequence, making it an excellent target for inhibitor drugs.

Fig. 4. The catalytic sites responsible for cleaving the sugar residues.

Abstract

Scientific Problems

Proteins play a crucial role in life, and understanding their structure is essential to comprehend their functions. Although significant efforts have led to the determination of structures for about 100,000 unique proteins, this is just a fraction of the vast number of known protein sequences. The process of determining protein structures is time-consuming, causing a bottleneck in achieving comprehensive structural coverage. To overcome this limitation, accurate computational methods are necessary to predict a protein's three-dimensional structure based solely on its amino acid sequence, a longstanding problem known as the "protein folding problem."

Solution ideas

In recent years, with the continuous development of computer hardware and algorithms, artificial intelligence, especially neural networks, has made significant advancements. In addressing the challenging task of protein structure prediction from sequences, researchers have started to explore the use of neural networks as a potential solution. One of the most remarkable advancements in this field is AlphaFold2.

Expected results

Investigating structural changes due to amino acid insertion and deletion in the stalk region of the NA monomer.
Exploring the feasibility of the project at the structural level.
Evaluating the importance of homologous sequences in protein structure prediction.

Technical Principles

AlphaFold 2 is a deep learning system developed by DeepMind, an artificial intelligence research lab under Alphabet Inc. It is designed to predict the 3D structure of proteins, a central task in biology and bioinformatics. First presented at CASP14 in 2020, AlphaFold 2 uses attention-based neural network architectures to predict the folded structure of a protein from its amino acid sequence^[5].

Fig. 5. Architectural details of AlphaFold2.

1. Trunk of the Network (Evoformer):

The inputs to AlphaFold are the primary amino acid sequence of a protein and aligned sequences of homologous proteins (Multiple Sequence Alignments or MSAs).
The trunk of the network processes these inputs through repeated layers of a novel neural network block called the Evoformer. The Evoformer incorporates attention-based and non-attention-based components to exchange information within the MSA and pair representations.
The MSA representation is initialized with the raw MSA and is continuously refined as the processing proceeds. The Evoformer allows direct reasoning about the spatial and evolutionary relationships of the protein's residues.
The trunk produces two arrays: an Nseq x Nres array representing the processed MSA and an Nres x Nres array representing residue pairs.
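
To make the two representations concrete, the following NumPy sketch shows only their shapes and one simplified way information can flow from the MSA representation into the pair representation (an outer product averaged over sequences). The channel sizes, the random inputs, and the helper `outer_product_mean` as written here are illustrative assumptions. This is not AlphaFold2's actual code, which also uses learned projections and attention layers.

```python
import numpy as np

# Illustrative sizes only (assumptions, not AlphaFold2's real settings).
N_SEQ, N_RES = 8, 50      # sequences in the MSA, residues in the target
C_MSA, C_PAIR = 4, 16     # channels per MSA entry / per residue pair

rng = np.random.default_rng(0)
msa_repr = rng.normal(size=(N_SEQ, N_RES, C_MSA))   # Nseq x Nres array
pair_repr = np.zeros((N_RES, N_RES, C_PAIR))        # Nres x Nres array


def outer_product_mean(msa: np.ndarray) -> np.ndarray:
    """Average, over sequences, the outer product of the feature vectors
    at residues i and j. Returns an (Nres, Nres, C*C) array."""
    n_seq, n_res, c = msa.shape
    outer = np.einsum("sic,sjd->ijcd", msa, msa) / n_seq
    return outer.reshape(n_res, n_res, c * c)


# Simplified update: add MSA-derived pair features to the pair representation.
pair_repr = pair_repr + outer_product_mean(msa_repr)

print(msa_repr.shape)   # (8, 50, 4)
print(pair_repr.shape)  # (50, 50, 16)
```

The point of the sketch is that every residue pair (i, j) receives features computed from how columns i and j of the alignment vary together, which is how evolutionary information can inform spatial relationships.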

2. Structure Module:

The structure module takes the processed MSA and residue pair representations as inputs and introduces an explicit 3D structure in the form of rotations and translations for each residue of the protein (global rigid body frames).
Key innovations in this section include breaking the chain structure to allow simultaneous local refinement of all parts of the structure, a novel equivariant transformer to reason about unrepresented side-chain atoms, and a loss term that emphasizes the orientational correctness of the residues.
The network reinforces the notion of iterative refinement by repeatedly applying the final loss to outputs and feeding the outputs recursively into the same modules (recycling), contributing significantly to accuracy with minor extra training time.
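
The idea of one rigid-body frame per residue can be sketched as follows: each residue carries a rotation and a translation that place idealized local backbone atoms into global coordinates. The local atom coordinates, the helper names `rotation_about_z` and `apply_frames`, and the toy frames below are assumptions for illustration, not values or functions taken from AlphaFold2.

```python
import numpy as np

# Approximate local backbone geometry (N, CA, C) in angstroms, with CA at the
# origin. These numbers are illustrative placeholders.
LOCAL_BACKBONE = np.array([
    [-0.5, 1.4, 0.0],   # N
    [0.0, 0.0, 0.0],    # CA
    [1.5, 0.0, 0.0],    # C
])


def rotation_about_z(angle_rad: float) -> np.ndarray:
    """Return a 3x3 rotation matrix for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def apply_frames(rotations, translations, local_atoms):
    """Map local atoms into global coordinates: x_global = R_i @ x_local + t_i.

    rotations: (Nres, 3, 3), translations: (Nres, 3), local_atoms: (Natoms, 3).
    Returns an (Nres, Natoms, 3) array.
    """
    rotated = np.einsum("rij,aj->rai", rotations, local_atoms)
    return rotated + translations[:, None, :]


n_res = 4
# Toy frames: each residue is rotated a bit further and shifted 3.8 A along x.
rotations = np.stack([rotation_about_z(0.3 * i) for i in range(n_res)])
translations = np.array([[3.8 * i, 0.0, 0.0] for i in range(n_res)])

coords = apply_frames(rotations, translations, LOCAL_BACKBONE)
print(coords.shape)  # (4, 3, 3)
```

Because every residue has its own frame, all frames can be updated at the same time, which is what allows simultaneous local refinement of all parts of the structure.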

The above is only a brief introduction to AlphaFold2. For more details, please refer to the supplementary materials of the AlphaFold2 paper.^[4]

Results and Discussion

Predicted Structure

In our project, amino acids were experimentally inserted into the stalk region of the NA protein on the virus surface. To explore how insertions and deletions change the NA structure, we used AlphaFold2 to predict the structures of NA-Insert-14, NA-Insert-28, NA-Delete-9, and NA-Delete-24 (the number indicates how many amino acids were inserted or deleted). The predictions were run on the Google Colab platform.
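
As an illustration of how such variant sequences can be prepared for prediction, the sketch below builds insertion and deletion variants from a parent sequence and formats them as FASTA. The parent sequence, the stalk position, and the inserted segment are made-up placeholders, not the sequences used in our experiments; only the variant names follow the text above.

```python
def insert_segment(seq: str, position: int, segment: str) -> str:
    """Insert `segment` so that it starts at 0-based index `position`."""
    return seq[:position] + segment + seq[position:]


def delete_segment(seq: str, position: int, length: int) -> str:
    """Delete `length` residues starting at 0-based index `position`."""
    return seq[:position] + seq[position + length:]


def to_fasta(records: dict) -> str:
    """Format {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Placeholder values (assumptions for illustration only).
parent = "ACDEFGHIKLMNPQRSTVWY" * 4   # fake 80-residue sequence
stalk_pos = 40                        # hypothetical position in the stalk
segment = "GGSGGSGGSGGSGG"            # hypothetical 14-residue insert

variants = {
    "NA-Insert-14": insert_segment(parent, stalk_pos, segment),
    "NA-Insert-28": insert_segment(parent, stalk_pos, segment * 2),
    "NA-Delete-9": delete_segment(parent, stalk_pos, 9),
    "NA-Delete-24": delete_segment(parent, stalk_pos, 24),
}

# Length change of each variant relative to the parent sequence.
for name, seq in variants.items():
    print(name, len(seq) - len(parent))

fasta_text = to_fasta(variants)
```

Each FASTA record can then be submitted as a separate prediction job.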

pLDDT (predicted Local Distance Difference Test) is a per-residue confidence score on a scale from 0 to 100. It estimates how well the local environment of each residue in the predicted structure would agree with the true structure, as measured by the lDDT metric. Higher pLDDT scores indicate higher confidence in the predicted local structure.
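
AlphaFold2 writes the per-residue pLDDT into the B-factor column of its output PDB files, so the confidence of a region such as an inserted segment can be summarized with a few lines of Python. The sketch below generates and parses a small toy PDB fragment; the coordinates, pLDDT values, residue range, and helper names are invented for illustration.

```python
def atom_line(serial: int, resseq: int, plddt: float) -> str:
    """Build one fixed-column PDB ATOM record for a CA atom (toy data)."""
    return (
        f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def read_plddt(pdb_text: str) -> dict:
    """Return {residue number: pLDDT} from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def mean_plddt(scores: dict, first: int, last: int) -> float:
    """Mean pLDDT over residues first..last (inclusive)."""
    values = [v for r, v in scores.items() if first <= r <= last]
    return sum(values) / len(values)


# Toy values: residues 4-6 play the role of a low-confidence inserted segment.
toy_plddt = [90, 92, 88, 40, 35, 45, 91, 89, 93, 90]
pdb_text = "\n".join(atom_line(i, i, p) for i, p in enumerate(toy_plddt, start=1))

scores = read_plddt(pdb_text)
print(mean_plddt(scores, 4, 6))    # 40.0
print(mean_plddt(scores, 1, 10))   # mean over the whole toy chain
```

Comparing the mean over the modified segment with the mean over the whole chain gives a quick numerical counterpart to the pLDDT plots.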

PAE (Predicted Aligned Error) estimates, for each pair of residues, the expected position error (in Å) of one residue when the predicted and true structures are aligned on the other residue. Lower PAE values indicate higher confidence in the relative positions of the two residues, and therefore in the relative placement of the regions they belong to.
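
A PAE matrix is usually read block by block: low values inside a region suggest a confidently placed unit, while high values between two regions suggest that their relative placement is uncertain. The sketch below summarizes a toy PAE matrix in this way. The matrix values, the region boundaries, and the helper `mean_block_pae` are invented for illustration.

```python
import numpy as np


def mean_block_pae(pae: np.ndarray, rows: slice, cols: slice) -> float:
    """Mean PAE over the block pae[rows, cols]."""
    return float(pae[rows, cols].mean())


n_res = 10
head = slice(0, 6)       # hypothetical well-predicted region
insert = slice(6, 10)    # hypothetical inserted segment

# Toy matrix in angstroms: confident within the head, less so within the
# insert, and uncertain between the two regions.
pae = np.full((n_res, n_res), 25.0)
pae[head, head] = 3.0
pae[insert, insert] = 12.0
np.fill_diagonal(pae, 0.0)

print(mean_block_pae(pae, head, head))      # within the head
print(mean_block_pae(pae, insert, insert))  # within the insert
print(mean_block_pae(pae, head, insert))    # between head and insert
```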

An MSA contains multiple homologous protein sequences aligned together. The per-residue count is the number of aligned sequences that cover each position. Higher counts mean more homologous sequences are available at that position, and such positions tend to lie in better conserved regions. Regions with very low per-residue counts are more likely to be variable or gapped.
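
The per-residue count can be computed directly from an alignment by counting the non-gap characters in each column. The sketch below does this for a small made-up alignment; the sequences and the function name `per_residue_counts` are illustrative assumptions.

```python
def per_residue_counts(msa: list) -> list:
    """Count, for each column, how many sequences have a residue (not a gap)."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [sum(seq[i] != "-" for seq in msa) for i in range(length)]


# Toy alignment: the first row is the query, "-" marks a gap.
toy_msa = [
    "MKTAYIAKQR",
    "MKTA--AKQR",
    "MRTA--AKHR",
    "MKSAY-AKQ-",
]

counts = per_residue_counts(toy_msa)
print(counts)  # [4, 4, 4, 4, 2, 1, 4, 4, 4, 3]
```

In this toy case, columns 5 and 6 have the lowest counts, which mirrors what happens at an inserted segment that has few or no homologous sequences.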

For Insert-14, the predicted structure (the portion highlighted in blue is the predicted insertion segment), MSA, pLDDT, and PAE are as follows:

Fig. 6-1 The predicted structure of NA with the insertion of 14 amino acids was determined using AlphaFold2.

Fig. 6-2 The multiple sequence alignment of NA with the insertion of 14 amino acids.

Fig. 6-3 The predicted Local Distance Difference Test and predicted aligned error of NA with the insertion of 14 amino acids. The pLDDT score is lower in the region where the amino acids are inserted, indicating that the prediction for this region is less reliable.

For Insert-28, the predicted structure (the portion highlighted in blue is the predicted insertion segment), MSA, pLDDT, and PAE are as follows:

Fig. 7-1 The predicted structure of NA with the insertion of 28 amino acids was determined using AlphaFold2.

Fig. 7-2 The multiple sequence alignment of NA with the insertion of 28 amino acids.

Fig. 7-3 The predicted Local Distance Difference Test and predicted aligned error of NA with the insertion of 28 amino acids. Similarly, the pLDDT score is lower in the region where the amino acids are inserted, indicating that the prediction for this region is less reliable.

For Delete-9, the predicted structure (the portion highlighted in red marks the position of the deleted segment), MSA, pLDDT, and PAE are as follows:

Fig. 8-1 The predicted structure of NA with the deletion of 9 amino acids was determined using AlphaFold2.

Fig. 8-2 The multiple sequence alignment of NA with the deletion of 9 amino acids.

Fig. 8-3 The predicted Local Distance Difference Test and predicted aligned error of NA with the deletion of 9 amino acids.

For Delete-24, the predicted structure (the portion highlighted in red marks the position of the deleted segment), MSA, pLDDT, and PAE are as follows:

Fig. 9-1 The predicted structure of NA with the deletion of 24 amino acids was determined using AlphaFold2.

Fig. 9-2 The multiple sequence alignment of NA with the deletion of 24 amino acids.

Fig. 9-3 The predicted Local Distance Difference Test and predicted aligned error of NA with the deletion of 24 amino acids.

Prediction Convergence

AlphaFold2 reaches its final state quickly for simple proteins but needs longer to converge for more complex ones. To check whether our predictions had converged, we compared the structures of SI14 and SI28 (the 14- and 28-residue insertion variants) predicted with 20 and 40 epochs. For each variant, the two structures were nearly identical, so we consider the predictions converged.

For the insertion of 14 amino acid residues, we used AlphaFold2 to predict the structure with 20 and 40 epochs, and then compared the two structures using PyMOL.

Fig. 10 The alignment result of NA with the insertion of 14 amino acids.

For the insertion of 28 amino acid residues, we used AlphaFold2 to predict the structure with 20 and 40 epochs, and then compared the two structures using PyMOL.

Fig. 11 The alignment result of NA with the insertion of 28 amino acids.

The root mean square deviation (RMSD) of atomic positions is a measure of the average distance between corresponding atoms (typically backbone atoms) in superimposed proteins. A lower RMSD means the two structures are more similar.
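
For reference, the RMSD after optimal superposition can be computed with the Kabsch algorithm, sketched below in NumPy. This is a generic illustration on random toy coordinates, not the procedure PyMOL runs internally; PyMOL's `align` command additionally performs a sequence alignment and iterative outlier rejection. The function name `kabsch_rmsd` and the test data are assumptions.

```python
import numpy as np


def kabsch_rmsd(P, Q) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


# Toy check: Q is P rotated about z and shifted, so the RMSD should be ~0.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))
angle = 0.7
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])

rmsd_same = kabsch_rmsd(P, Q)
rmsd_noisy = kabsch_rmsd(P, Q + rng.normal(scale=0.1, size=Q.shape))
print(rmsd_same < 1e-8, rmsd_noisy > rmsd_same)
```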

The RMSD values for SI14 and SI28 are 0.073 and 0.010, respectively. For each variant, the structures predicted with 20 and 40 epochs show no significant differences, indicating that the predicted protein structures have converged.