MODELLING

scFv Structure Modelling

We make all data available for analysis in our GitHub repository.

A bit on Interleukin-8

Interleukin-8 (IL-8) is a chemokine that signals by binding chemokine receptors, particularly CXCR1. It is a small globular protein whose receptor-binding sites lie primarily in its N-loop and 40s loop regions. These regions contain specific charged and hydrophobic residues that are essential for receptor recognition. When IL-8 encounters CXCR1, these binding sites engage complementary sites on the receptor's N-terminal domain, such as ND-CXCR1(1–38). The resulting stable protein-protein complex triggers signalling that initiates cellular responses such as chemotaxis and immune cell activation, which in turn support the immune response and the regulation of inflammation.

This section lays out how we predicted the structure of the anti-IL-8 antibody in scFv format. The amino acid sequence of the antibody has the following parts.

A signal peptide sequence to ensure that the antibody is secreted from the cell.
MPLLLLLPLLWAGALA
The variable heavy chain (VH) of the antibody.
EVQLLESGGGLVQPGGSLRLSCAASGFTFSYYGMGWVRQAPGKGLEWVSGISYSGSGTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARDYVGNLDYWGQGTLVTVSS
A linker sequence to connect the VH and VL chains.
GGGGSGGGGSGGGGS
The variable light chain (VL) of the antibody.
DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSDTPSTFGQGTKLEIK
A 3xFLAG and a 6xHis-tag sequence to facilitate purification of the antibody.
RTDYKDHDGDYKDHDIDYKDDDDKAAALPETGGHHHHHH

Therefore, the full amino acid sequence that we will work with is

MPLLLLLPLLWAGALAEVQLLESGGGLVQPGGSLRLSCAASGFTFSYYGMGWVRQAPGKGLEWVSGISYSGSGTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARDYVGNLDYWGQGTLVTVSSGGGGSGGGGSGGGGSDIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSDTPSTFGQGTKLEIKRTDYKDHDGDYKDHDIDYKDDDDKAAALPETGGHHHHHH

Note that we keep the signal peptide and the tags attached during the modelling process, because we do not know whether (and how) they interact with Interleukin-8 during docking.
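The parts listed above can be joined programmatically, which also gives the residue range of each region for later use when reading the confidence plots. The following is a minimal sketch: the sequences are copied from the list above, while the function name assemble and the region labels are our own illustrative choices.

```python
# Parts of the construct, copied from the list above (N- to C-terminus).
PARTS = [
    ("signal peptide", "MPLLLLLPLLWAGALA"),
    ("VH", "EVQLLESGGGLVQPGGSLRLSCAASGFTFSYYGMGWVRQAPGKGLEWVSGISYSGSGTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARDYVGNLDYWGQGTLVTVSS"),
    ("linker", "GGGGSGGGGSGGGGS"),
    ("VL", "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSDTPSTFGQGTKLEIK"),
    ("tags", "RTDYKDHDGDYKDHDIDYKDDDDKAAALPETGGHHHHHH"),
]


def assemble(parts):
    """Concatenate the parts in order.

    Returns the full sequence and a dict mapping each part name to its
    (start, end) residue range, 1-based and inclusive.
    """
    full = ""
    regions = {}
    for name, seq in parts:
        start = len(full) + 1
        full += seq
        regions[name] = (start, len(full))
    return full, regions


full_sequence, regions = assemble(PARTS)
for name, (start, end) in regions.items():
    print(f"{name}: residues {start}-{end} ({end - start + 1} aa)")
print(f"total length: {len(full_sequence)} aa")
```

The printed ranges make it easy to check which region a low-confidence stretch in a pLDDT or PAE plot falls in.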

AlphaFold2

AlphaFold2 is a deep learning system that predicts protein structures from amino acid sequences. To predict the structure of the antibody we used ColabFold, an open-source distribution of AlphaFold2, through its AlphaFold2_mmseqs2 notebook. This notebook differs from full AlphaFold2 and from the AlphaFold2 Colab in that it uses MMseqs2 (Many-against-Many sequence searching) for homology detection and MSA pairing, in place of AlphaFold2's own search tools.

We ran ColabFold under two schemes: one without templates, and one with PDB70 as the template database. In each scheme we also relaxed the top-ranked structure with AMBER.
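We ran both schemes in the notebook, but the same two runs could in principle be reproduced with a local ColabFold installation. The sketch below only builds the command lines and does not execute them. It assumes the colabfold_batch command-line tool with its --templates and --amber options; the file and directory names are hypothetical, and the options should be checked against colabfold_batch --help for the installed version.

```python
import shlex


def colabfold_command(fasta_path, out_dir, use_templates):
    """Build (but do not run) a colabfold_batch command line.

    Assumption: a local ColabFold install that provides colabfold_batch
    with the --amber (relaxation) and --templates options.
    """
    cmd = ["colabfold_batch", "--amber"]
    if use_templates:
        cmd.append("--templates")
    cmd += [fasta_path, out_dir]
    return cmd


# Hypothetical input file and output directories for the two schemes.
runs = {
    "with_templates": colabfold_command("scfv.fasta", "out_templates", True),
    "without_templates": colabfold_command("scfv.fasta", "out_no_templates", False),
}

for name, cmd in runs.items():
    print(f"{name}: {shlex.join(cmd)}")
```

Keeping the two schemes in separate output directories makes it straightforward to compare their confidence scores afterwards.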

Figure: ColabFold prediction with PDB70 templates.

Figure: ColabFold prediction without templates.

The two schemes give very similar average predicted aligned errors and very similar predicted lDDT scores. In both, prediction confidence is poor near the two ends, where the signal peptide and the tags are attached, and in the middle, where the linker sits.

Figure panels: With PDB70 (left), Without Templates (right).

Interpretation of AlphaFold Results

Predicted Local Distance Difference Test (pLDDT)

AlphaFold produces a per-residue estimate of its confidence on a scale from 0 to 100. This confidence measure is called pLDDT and corresponds to the model’s predicted score on the lDDT-Cα metric. It is stored in the B-factor fields of the mmCIF and PDB files available for download (although, unlike a B-factor, higher pLDDT is better). pLDDT is also used to colour-code the residues of the model in the 3D structure viewer. The following rules of thumb provide guidance on the expected reliability of a given region:

Regions with pLDDT > 90 are expected to be modelled to high accuracy. These should be suitable for any application that benefits from high accuracy (e.g. characterising binding sites).
Regions with pLDDT between 70 and 90 are expected to be modelled well (a generally good backbone prediction).
Regions with pLDDT between 50 and 70 are low confidence and should be treated with caution.
The 3D coordinates of regions with pLDDT < 50 often have a ribbon-like appearance and should not be interpreted. The AlphaFold authors report that pLDDT < 50 is a reasonably strong predictor of disorder, i.e. it suggests such a region is either unstructured in physiological conditions or only structured as part of a complex.
Structured domains with many inter-residue contacts are likely to be more reliable than extended linkers or isolated long helices.
Unphysical bond lengths and clashes do not usually appear in confident regions. Any part of a structure with several of these should be disregarded.
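Because pLDDT is stored in the B-factor field, the rules of thumb above can be applied directly to a predicted PDB file. The following is a minimal sketch using only the fixed-width PDB column layout. The function names and the band labels are our own, and the toy records (built by the helper make_atom_line) are illustrative values, not results from our runs.

```python
def make_atom_line(serial, atom_name, res_name, res_num, b_factor):
    """Build one fixed-width PDB ATOM record (toy coordinates, chain A)."""
    return (
        f"ATOM  {serial:5d} {atom_name:<4s} {res_name:>3s} A{res_num:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}"
    )


def read_ca_plddt(pdb_text):
    """Return [(residue_number, pLDDT)] read from the C-alpha ATOM records."""
    values = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number,
        # and 61-66 the B-factor (here: pLDDT).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append((int(line[22:26]), float(line[60:66])))
    return values


def confidence_band(plddt):
    """Map a pLDDT value to the rule-of-thumb bands listed above."""
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


# Toy input: four residues with made-up pLDDT values.
toy_pdb = "\n".join([
    make_atom_line(1, "N", "MET", 1, 95.0),
    make_atom_line(2, "CA", "MET", 1, 95.0),
    make_atom_line(3, "CA", "PRO", 2, 82.5),
    make_atom_line(4, "CA", "LEU", 3, 61.0),
    make_atom_line(5, "CA", "LEU", 4, 35.2),
])

for res_num, plddt in read_ca_plddt(toy_pdb):
    print(res_num, plddt, confidence_band(plddt))
```

For a real prediction, the text of the downloaded PDB file would be passed to read_ca_plddt in place of toy_pdb.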

The pLDDT per position is also given as a plot for the five models made in every run and gives a simpler overview:

The plot shows the pLDDT score at each position for each of the five predicted models. Most positions score above 75-80, so the local structure is predicted with good confidence over most of the sequence. The models differ slightly in their scores, and Model 1 has the highest average pLDDT. We therefore take Model 1 as the best predicted structure.
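Choosing the model with the highest average pLDDT can be written down in a few lines. This is a sketch with made-up scores: the model names and values below are placeholders, and in practice the per-residue scores would come from the prediction output.

```python
from statistics import mean


def rank_models(plddt_by_model):
    """Return model names sorted by mean pLDDT, highest first."""
    return sorted(
        plddt_by_model,
        key=lambda name: mean(plddt_by_model[name]),
        reverse=True,
    )


# Illustrative per-residue pLDDT values (not our actual results).
toy_scores = {
    "model_2": [85.0, 90.0, 75.0, 40.0],
    "model_1": [88.0, 91.0, 79.0, 45.0],
    "model_3": [80.0, 86.0, 70.0, 38.0],
}

ranking = rank_models(toy_scores)
for name in ranking:
    print(f"{name}: mean pLDDT {mean(toy_scores[name]):.2f}")
```

A mean over the whole construct is pulled down by the flexible ends and the linker, so one could also restrict the average to the VH and VL residues before ranking.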

Predicted Aligned Error (PAE)

Our scFv has two domains, the VH and the VL. To judge how confidently their relative positions are predicted, we use the Predicted Aligned Error (PAE) plot provided by AlphaFold, which is a 2D plot over pairs of residues.

The colour at (x, y) corresponds to the expected distance error in residue x’s position when the prediction and the true structure are aligned on residue y. Dark blue is good (low error), red is bad (high error). For example, aligning on residue 150:

We’re confident in the relative position of residue 100
We’re not confident in the relative position of residue 200

The two low-error squares correspond to the two domains.
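The two-square pattern can also be summarised numerically by averaging the PAE inside each domain and between the domains. The following is a minimal sketch on a toy 6 x 6 matrix with two three-residue "domains"; the values are invented for illustration. For the real construct, the index ranges would be the VH and VL residue ranges.

```python
def mean_pae_block(pae, rows, cols):
    """Mean of pae[i][j] over the given row and column index ranges (0-based)."""
    total = sum(pae[i][j] for i in rows for j in cols)
    return total / (len(rows) * len(cols))


# Toy PAE matrix: low error (2.0) within each domain, high error (20.0) between.
n = 6
toy_pae = [
    [2.0 if (i < 3) == (j < 3) else 20.0 for j in range(n)]
    for i in range(n)
]

domain_a = range(0, 3)
domain_b = range(3, 6)

within_a = mean_pae_block(toy_pae, domain_a, domain_a)
within_b = mean_pae_block(toy_pae, domain_b, domain_b)
between_ab = mean_pae_block(toy_pae, domain_a, domain_b)
between_ba = mean_pae_block(toy_pae, domain_b, domain_a)

print("within A:", within_a)
print("within B:", within_b)
print("between A and B:", between_ab, between_ba)
```

Low within-domain means combined with high between-domain means indicate that each domain is predicted confidently on its own while the relative placement of the two domains is uncertain.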

AlphaFold's per-residue confidence score (pLDDT, between 0 and 100) is also shown as a colour scale. Some regions with low pLDDT may be unstructured in isolation.

Dark blue: very high pLDDT (most confident)
Light blue: confident (moderate pLDDT)
White to red: low pLDDT
Red: very low pLDDT

AlphaFold Database

The AlphaFold Database holds about 365,000 predicted structures for proteins from 21 model organisms.
For the organisms currently covered, predicted structures are available for the sequences in the UniProt reference proteome that are between 16 and 2700 amino acids long and contain only standard amino acids.

The predictions are distributed as mmCIF files that use the model archive extension. Each file carries a molecular description, the taxonomy ID, overall quality measures, and per-residue quality.

Expected impact on structural bioinformatics:

Predicting complexes between macromolecules.
Studying intrinsically disordered proteins and the structures of protein-protein and protein-nucleic acid complexes.
Providing information on protein dynamics, i.e. relevant conformational states.
Ligand predictions.
Accelerating structural biology.
Supporting structural studies and their use in working out the reaction mechanisms of biomolecules.

Multiple sequence alignment (MSA)

AlphaFold builds its prediction from a multiple sequence alignment (MSA) of sequences related to the input, and it can return several candidate structures for one sequence. One of the diagnostic outputs is therefore the MSA plot. When modelling a complex, the alignments for the individual chains can be combined: rows can be paired (matched, for example, by organism) or left unpaired. Which choice gives the better result depends on the sequences and on how well the software predicts each of them.

In the unpaired case, the plot shows how many sequences cover each position, and from this alignment the software produced one model with minimal error in the relative positions. In the paired case, most models are better than in the unpaired case, but their quality decreases.

Thus, modelling several proteins together as one larger assembly can help us determine the structure, and the relative inter-atomic distances, of that assembly more reliably.

MSAs also help us assess the structural accuracy of larger proteins. For example, suppose we want the PAE plot for a protein P made of 2 copies of protein A, 1 copy of B and 2 copies of C. We can build an MSA for each component, generate several candidate structures of P with their PAE plots, and choose the best one. This simplifies the analysis and increases the information we obtain about the protein.
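The "sequence counts and their positions" read from the MSA plot can be reproduced from an alignment by counting, for every column, how many sequences have a residue rather than a gap. This is a minimal sketch on a toy alignment; the aligned rows are invented, with "-" marking a gap, and the function name is our own.

```python
def coverage_per_position(msa):
    """Number of sequences with a residue (not a gap) at each alignment column."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned rows must have the same length")
    return [sum(row[col] != "-" for row in msa) for col in range(length)]


# Toy alignment: first row is the query, the others are aligned homologues.
toy_msa = [
    "EVQLLES",
    "EVQ-LES",
    "-VQL--S",
    "EV--LES",
]

coverage = coverage_per_position(toy_msa)
for position, count in enumerate(coverage, start=1):
    print(position, count)
```

Positions covered by few sequences give the network less evolutionary information, which is one common reason for low confidence in that part of the model.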
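For a multi-chain input such as the example P above, ColabFold accepts the chain sequences joined by ":" in a single query. The sketch below builds such a query for 2 copies of A, 1 of B and 2 of C. The three short sequences are placeholders invented for illustration, and the function name is our own.

```python
def complex_query(chains, copies):
    """Join chain sequences into one ':'-separated query.

    chains maps a chain name to its sequence; copies is an ordered list of
    (chain name, number of copies).
    """
    parts = []
    for name, count in copies:
        parts.extend([chains[name]] * count)
    return ":".join(parts)


# Placeholder sequences, not real proteins.
toy_chains = {
    "A": "MKTAYIAK",
    "B": "GSHMLEDP",
    "C": "MADEEKLP",
}

query = complex_query(toy_chains, [("A", 2), ("B", 1), ("C", 2)])
print(query)
```

Each candidate structure returned for such a query comes with its own PAE plot, which is what allows the candidates to be compared and the best one chosen.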

Citations

O-IL8-15 Biological Probe, in Structural Genomics Consortium: thesgc.org/biological-probes/il-8
Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., & Steinegger, M. (2022). ColabFold: Making Protein folding accessible to all. Nature Methods.
Mirdita, M., Steinegger, M., & Söding, J. (2019). MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics, 35(16), 2856–2858.
Mirdita, M., Driesch, L., Galiez, C., Martin, M., Söding, J., & Steinegger, M. (2017). Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res., 45(D1), D170–D176.
Mitchell, A., Almeida, A., Beracochea, M., Boland, M., Burgin, J., Cochrane, G., Crusoe, M., Kale, V., Potter, S., Richardson, L., Sakharova, E., Scheremetjew, M., Korobeynikov, A., Shlemov, A., Kunyavskaya, O., Lapidus, A., & Finn, R. (2019). MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res.
Steinegger, M., Meier, M., Mirdita, M., Vöhringer, H., Haunsberger, S., & Söding, J. (2019). HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinform., 20(1), 473.
Berman, H., Henrick, K., & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Structural Biology, 10(12), 980.
Eastman, P., Swails, J., Chodera, J., McGibbon, R., Zhao, Y., Beauchamp, K., Wang, L.P., Simmonett, A., Harrigan, M., Stern, C., Wiewiora, R., Brooks, B., & Pande, V. (2017). OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLOS Comput. Biol., 13(7).