EZBinder

Highlights

We developed a one-click, automated binder design pipeline that generates .pdb structures that bind to a given .pdb structure or RCSB record. Everything is controlled through a single configuration file, minimizing the barrier to entry for binder design and allowing someone with little to no molecular biology or computer science experience to design a custom binder for any protein.

We found that our computational team was spending a lot of time on the menial tasks involved in taking a protein binder design from start to finish, such as editing many text files in many locations and running remote processing jobs on our university’s HPC cluster. We were left with the choice of organizing our work and taking an inordinate amount of time, or relying on memory (and risking human error) to progress rapidly through the design pipeline. We realized that most of the tools we were using had easy-to-use interfaces, but they were all in different places and programs. EZBinder puts a handpicked set of the most useful settings from all three programs (RFdiffusion, ProteinMPNN, and ESMFold) in a single configuration file, with each setting documented in practical terms for the end user: iGEM teams or anyone who wants to make binders without becoming a protein biochemist or a computer scientist. This isn’t the best way to design binders, and there are inefficiencies with this approach, but EZBinder only needs one click to run and it will generate viable binders.

Binders produced using this protocol score well in silico: our best mCherry binder candidate has a pLDDT of 93.6123, a mean PAE of 2.0887, and a pTM of 0.8860. See below for the distribution of generated structures across each ESMFold metric, compared to the experimentally verified LAM antibodies in RCSB.

graph of pLDDT from EZBinder mCherry binders

pLDDT scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

graph of PTM scores from EZBinder mCherry binders

PTM scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

graph of PAE scores from EZBinder mCherry binders

PAE scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

These results are encouraging for the stability of the generated proteins, but docking with RosettaDock shows that the existing nanobodies still outperform the first round of binders generated with EZBinder. For our computational results we also tested partial diffusion, rescuing old designs, and recycling designs through our protocol. These designs did outperform the existing nanobodies, but the features are not yet implemented in EZBinder.

Docking funnel of lowest pLDDT mCherry Binders generated by EZBinder, complex energy

Docking funnel of lowest pLDDT mCherry binders generated by EZBinder, truncated to the region between 1200 and -1200. This plot uses the total complexed energy of the protein in REU (Rosetta Energy Units).

Docking funnel of lowest pLDDT mCherry binders generated by EZBinder, truncated to the region between 1200 and -1200. This plot uses the interface energy of the protein in REU (Rosetta Energy Units).

What does it do?

EZBinder is a Python script that automates every step of a common binder design pipeline: RFdiffusion → ProteinMPNN → ESMFold

EZBinder writes SLURM batch scripts to be run on a Linux-based remote server with working installs of RFdiffusion, ProteinMPNN, and ESMFold. We won’t cover how to install those programs on your own server, but we’ve uploaded copies of the Python virtual environments we used to our wiki for future iGEM teams to use. Based on the parameters in the configuration file, EZBinder will:

Fetch a pdb file from RCSB using an RCSB identifier (users can provide their own pdb files if they prefer)
Prepare the structure for RFdiffusion by removing waters and other contaminants using the pdbtools library
Relax the protein using PyRosetta FastRelax (optional, but highly recommended)
Generate the appropriate SLURM scripts to run RFdiffusion, ProteinMPNN, and ESMFold remotely
Upload the scripts to an EZBinder remote working directory
Run bash commands to submit jobs for RFdiffusion, ProteinMPNN, and ESMFold, each one using the previous output as input
Archive the entire working directory into one easily downloadable file. By default, both scores and final pdb files are stored in ESM_outs.
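
The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EZBinder's own code: EZBinder uses pdbtools for cleanup, whereas this sketch filters PDB records by hand. The helper names `fetch_pdb` and `clean_pdb` are hypothetical, and the URL pattern is the public RCSB file download endpoint.

```python
import urllib.request

# Public RCSB download endpoint for PDB-format files.
RCSB_URL = "https://files.rcsb.org/download/{code}.pdb"


def fetch_pdb(code: str) -> str:
    """Download a PDB entry as text (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(RCSB_URL.format(code=code.upper())) as response:
        return response.read().decode("utf-8")


def clean_pdb(pdb_text: str, chain: str) -> str:
    """Keep only the ATOM records of one chain.

    Waters and other hetero atoms are stored as HETATM records,
    so selecting ATOM lines drops them.
    """
    kept = []
    for line in pdb_text.splitlines():
        # In the PDB format, column 22 (index 21) holds the chain ID.
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

Calling `clean_pdb(fetch_pdb(code), "A")` with a 4 character RCSB code would return chain A of that entry without waters or ligands. Unlike the pdbtools defaults, this sketch does not renumber atoms or rename the chain.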

results

How do I use it?

Prerequisites:

Remote Machine: slurm, dl_binder_design, ESMFold, RFdiffusion
Local Machine: python, PyRosetta (optional)

Download everything in this directory to a single folder
Run pip install -r requirements.txt to install the required Python packages (if you choose to skip the FastRelax step, you can comment out the lines that include pyrosetta in `main.py` and `helpers.py`)
Fill in the configuration file pyproject.toml. Parameters that must be adjusted are marked with #CHANGEME. Each setting is described under "What do all those settings on the configuration do?"
Run python main.py
EZBinder will prompt you for any information it doesn’t have in its configuration file

What is RFdiffusion?

RFdiffusion (RoseTTAFold diffusion) is a program built by the Baker Lab at the University of Washington that generates protein backbones using a diffusion model. This same type of model powers image generation AIs like Stable Diffusion or DALL-E. RFdiffusion can do quite a lot, but the capability we’re interested in is its ability to generate plausible binder structures given a protein structure and, optionally, hotspot residues. For our purposes, this means two things.

RFdiffusion performs better with hotspot residues
RFdiffusion outputs a protein backbone with no sequence

Both of these considerations influenced our design decisions for EZBinder.

1) In order to minimize complexity and increase readability, we decided to omit the option to run RFdiffusion without specifying hotspot residues. Our concern was that protein design novices would ignore the optional parameter and waste both time and computational resources on a suboptimal binder candidate. We understand that picking potential binding residues can be tricky, so EZBinder will automatically select interface contacts as hotspot residues if the RCSB structure is a complex and the hotspots parameter is left blank. For those who would rather specify their own hotspots, we suggest looking for regions with hydrophobic residues exposed to the surface, ideally with few glycans. These can be found in RCSB itself by following the guide in their FAQ.

2) Getting a protein backbone means that the structure is plausible, but a cursory examination will show that it is all glycines. This won’t fold into any helpful conformation, and will be degraded by pretty much any chassis through the natural mechanisms for disposing of disordered proteins. In order to get a sequence we can test in vivo, we need to generate a protein sequence from this structure. The Baker Lab suggests using ProteinMPNN for this task, and we’ve seen good in silico results with this as well. Future iGEM teams might consider alternatives to generate sequences from backbones, such as the ESM inverse folding model (ESM-IF1).
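
The automatic hotspot selection described in this section can be illustrated with a simple distance check. EZBinder itself relies on a PyMOL script (interfaceresidues.py); the sketch below is a stand-in under our own assumptions: it only looks at C-alpha atoms, the 8 Å cutoff is an arbitrary illustrative value, and the function names are hypothetical.

```python
import math


def parse_ca_atoms(pdb_text):
    """Yield (chain, residue_number, (x, y, z)) for every C-alpha ATOM record."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), xyz


def interface_hotspots(pdb_text, target_chain, binder_chain, cutoff=8.0):
    """Return target residues whose C-alpha lies within `cutoff` of any binder C-alpha."""
    atoms = list(parse_ca_atoms(pdb_text))
    target = [(num, xyz) for chain, num, xyz in atoms if chain == target_chain]
    binder = [xyz for chain, _, xyz in atoms if chain == binder_chain]
    hits = sorted(
        num
        for num, xyz in target
        if any(math.dist(xyz, other) <= cutoff for other in binder)
    )
    # Label each residue with its chain ID, e.g. "A10".
    return [f"{target_chain}{num}" for num in hits]
```

A real interface definition would normally use all heavy atoms or a change in solvent-accessible surface area. The C-alpha shortcut keeps the example short at the cost of missing contacts made by long side chains.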

For more information, check out the RFdiffusion repository on github or the RFdiffusion paper

What is ProteinMPNN?

ProteinMPNN is a deep learning program from the Baker Lab that predicts a sequence given a protein backbone. Since RFdiffusion returns a backbone with no sequence, after a binder design is run through both programs, it will have a plausible backbone and plausible sequence. We decided to access ProteinMPNN through the dl_binder_design suite, which we found to be easier to install than ProteinMPNN directly. Although we haven’t tested this, it should be possible to evaluate binders with AlphaFold as well as ESMFold with slight modifications to EZBinder. Using dl_binder_design simplifies the installation process and allows for the possibility of future feature expansion.

For more information, check out the ProteinMPNN paper and github repository as well as the dl_binder_design paper and github repository

What is ESMFold?

ESMFold is a program from Facebook (now Meta) Research, part of the ESM (Evolutionary Scale Modeling) family, that uses a transformer protein language model to predict a protein’s structure given a sequence. Unlike AlphaFold, it doesn’t need an MSA (Multiple Sequence Alignment), and it runs nearly an order of magnitude faster with comparable accuracy. We chose ESMFold for the performance boost, which allows us to measure separate metrics for the complex and the monomer binder while still saving time relative to an AlphaFold-based approach. Future iGEM teams could use the ESM inverse folding model (ESM-IF1) to complement ProteinMPNN for protein sequence prediction from backbone, with minimal modifications to EZBinder.

EZBinder uses ESMFold to score the sequences produced by ProteinMPNN. Each sequence is a binder/target complex. EZBinder folds the complex structure, scores the complex, then folds the monomer independently and scores that as well. Each ESMFold run produces a .csv file containing selected ESMFold metrics.
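
As a rough picture of the inputs for this scoring step, the sketch below writes one complex record and one monomer record per design into FASTA text. It is an assumption about the layout rather than the format EZBinder's own script uses: the `_complex` and `_monomer` suffixes and the function name are hypothetical, and the ":" chain separator follows the multimer convention described in the ESM repository.

```python
def build_esm_fasta(designs: dict, target_seq: str) -> str:
    """Return FASTA text with a complex and a monomer record for each binder.

    `designs` maps a design name to its binder sequence.
    """
    records = []
    for name, binder_seq in designs.items():
        # Complex: binder and target chains joined with ":".
        records.append(f">{name}_complex\n{binder_seq}:{target_seq}")
        # Monomer: the binder on its own.
        records.append(f">{name}_monomer\n{binder_seq}")
    return "\n".join(records) + "\n"
```

Folding both records for every design is what makes it possible to compare the binder's confidence scores alone with its scores in the presence of the target.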

For more information, check out the ESMFold paper and github repository

What are all those files?

esm.py – This is a script we built that runs ESMFold on the remote machine. It takes a directory full of pdb files with sequences created by ProteinMPNN, converts the entire directory into a fasta file of binder/target complexes, and folds every binder/complex in the fasta as well as each binder independently. All the ESMFold metrics for each folding run are stored in a csv file for later analysis, and the folded pdb files are left in the same directory as the output csv. This script can be run from the command line independently, but users don’t have to interact with it during normal operation of EZBinder.

interfaceresidues.py – This is a PyMOL script we used last year. It finds the residues at the interface of two chains in a pdb file, and we use it to automatically create hotspot residues from pdbs of complexed targets.
helpers.py – In order to keep main.py as readable as possible, we’ve refactored the more complicated EZBinder logic into helper functions and classes.
main.py – The main EZBinder executable. This is what users should run. If any parameters are missing from pyproject.toml, it will prompt the user for the missing information. It handles preparing the pdb, creating SLURM scripts, submitting those scripts, and compressing all of the files generated into a single archive.
pyproject.toml – This is the configuration file that controls EZBinder. Users only need to adjust the parameters in here, specifically the parameters marked with #CHANGEME. EZBinder performs some very basic sanity checks on inputs from the configuration file, but this should not be relied on. Make sure to match the actual parameters with the examples for best results.

What do all those settings on the configuration do?

EZBinder is built with accessibility as our guiding principle. No one who uses this program should have to tweak the logic; they should only need to fill in values in the config file. The simplest case (which creates more stable binders than existing nanobodies) doesn’t require any prior knowledge of protein design. For more advanced users, we’ve exposed several helpful settings.

Basic configuration:

`[dirs]`

No need to adjust any parameters here!

`[cleanup]`

pdb_code: The default setting is “false”, which will prompt you for a 4 character RCSB code. Change this to an RCSB code to avoid the prompt.
target_chain: The chain you want to extract from the structure given to EZBinder. Replace “A” with the chain ID of your choice.
binder_chain: The chain you want to extract interface residues from (assuming a complex structure). If you manually specify hotspots this setting has no effect.

`[LocalRelax]`

No need to adjust any parameters here!

`[HPC]`

remotedir: Put the directory you want EZBinder to work in on the remote machine.
Replicates: The number of parallel RFdiffusion → ProteinMPNN → ESMFold runs to execute.
conda: The path to miniconda’s conda.sh. This will likely be something similar to: /shared/miniconda3/etc/profile.d/conda.sh

`[HPC.creds]`

hostname: This is the name of the server you use when you ssh in manually.
username: The username you use to log in to your account on the server.

`[HPC.pass]`

password: The password you use to log in to your account on the server. If you leave this blank, the script will prompt you for your password.

`[HPC.hpcdirs]`

No need to adjust parameters here!

`[HPC.slurm]`

account: The account name to charge SLURM jobs to
partition: The SLURM partition you want to use. We suggest a partition with a powerful CUDA-capable GPU (our tests used A100-equipped machines).

`[HPC.RFD]`

venv: The name of the python virtual environment used for RFdiffusion.
inference_path: The path to /RFdiffusion/scripts/run_inference.py
checkpoint_path: The path to: /RFdiffusion/models/Complex_base_ckpt.pt
contigmap_contigs: The default value is: "[B1-223/0 60-120]". Replace 223 with the length of your actual target, and 60-120 with the minimum-maximum length of the binder you want to design. For more information read the RFdiffusion github.
hotspots: Leave this blank if you want to automatically generate hotspots from the interface residues between target_chain and binder_chain. If you want to manually specify hotspots, add the residue indices here. Good choices of hotspot residues are exposed and hydrophobic (ideally an active site). For more information read the RFdiffusion github.
num_designs: The number of designs you want to generate per RFdiffusion replicate. In our initial tests we found that RFdiffusion was able to create about 0.25 backbones per minute (roughly one backbone every four minutes) on a 10GB slice of an A100.
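
To make the contig and hotspot formats concrete, here is a small sketch that builds both strings from plain numbers. The function names are hypothetical, and the defaults assume the target has been renamed to chain B, as in the default contig above. The hotspot string follows the bracketed, chain-prefixed form that RFdiffusion documents for its hotspot option; whether EZBinder's hotspots field expects exactly this form is an assumption.

```python
def build_contig(target_len: int, min_len: int, max_len: int, chain: str = "B") -> str:
    """Build a contig string: the full target chain, a chain break, then the binder length range."""
    if not 0 < min_len <= max_len:
        raise ValueError("binder length range must satisfy 0 < min_len <= max_len")
    return f"[{chain}1-{target_len}/0 {min_len}-{max_len}]"


def build_hotspots(residues, chain: str = "B") -> str:
    """Format residue indices as a bracketed, chain-prefixed list such as [B30,B33]."""
    return "[" + ",".join(f"{chain}{number}" for number in residues) + "]"
```

With a 223-residue target and a 60-120 residue binder, `build_contig(223, 60, 120)` reproduces the default value shown above.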

`[HPC.MPNN]`

script_path: The path to: /dl_binder_design/mpnn_fr/dl_interface_design.py

`[HPC.ESM]`

venv: The name of the python virtual environment used for ESMFold.
constraint: A SLURM constraint. In our testing, the ESMFold model wouldn’t fit in the default 10GB of VRAM.

Advanced Configuration:

`[dirs]`

WORKINGDIR: Path to the local working directory, if unspecified the script will default to the directory it’s currently in.

`[cleanup]`

skip_clean: Leave this blank to run the cleanup step. If you want to skip the cleaning process, put a path to a pdb file here, and EZBinder will move through the pipeline with that structure.
cleanup_args: This is a list of python dicts that govern interaction with pdbtools. Example: {opfile = "pdb_delhetatm.py", args = ""}. The args argument will pass those arguments to the script opfile. Default behavior is to delete hetero atoms, sort the pdb lines by index, renumber atoms, extract chain substrate_chain and rename it to chain B. The output of each script is piped to the script below it in the list.
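
The piping behavior of cleanup_args can be pictured as building one shell pipeline from the list. The sketch below is an assumption about how that could be done, not EZBinder's implementation: only pdb_delhetatm.py is named in this documentation, the other script names and arguments are our guess at pdbtools scripts matching the default behavior described above, and `build_pipeline` is a hypothetical helper.

```python
import shlex

# Hypothetical cleanup list mirroring the described default behavior.
cleanup_args = [
    {"opfile": "pdb_delhetatm.py", "args": ""},     # delete hetero atoms
    {"opfile": "pdb_sort.py", "args": ""},          # sort the pdb lines
    {"opfile": "pdb_reatom.py", "args": ""},        # renumber atoms
    {"opfile": "pdb_selchain.py", "args": "-A"},    # extract one chain
    {"opfile": "pdb_rplchain.py", "args": "-A:B"},  # rename it to chain B
]


def build_pipeline(steps, infile: str, outfile: str) -> str:
    """Chain the scripts with pipes: the first reads infile, the last writes outfile."""
    commands = []
    for index, step in enumerate(steps):
        parts = ["python", step["opfile"]] + shlex.split(step["args"])
        if index == 0:
            parts.append(infile)  # later steps read from standard input
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return " | ".join(commands) + " > " + shlex.quote(outfile)
```

Keeping each step as a small dictionary means a user can reorder, remove, or add cleanup steps in the configuration file without touching any Python code.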

`[LocalRelax]`

skip_relax: Leave blank to use PyRosetta FastRelax. Replace with a path to a relaxed pdb file to skip the FastRelax step.

`[HPC]`

No additional parameters

`[HPC.creds]`

No additional parameters

`[HPC.pass]`

No additional parameters

`[HPC.hpcdirs]`

No additional parameters

`[HPC.slurm]`

No additional parameters

`[HPC.RFD]`

guiding_potentials: Our host lab has found that these specific weights bias RFdiffusion towards better binders, adjust at your own risk. More information on adjusting potentials in the RFdiffusion repo
noise_scale_ca: Our host lab has found that these specific weights bias RFdiffusion towards better binders, adjust at your own risk. More information on adjusting potentials in the RFdiffusion repo
noise_scale_frame: Our host lab has found that these specific weights bias RFdiffusion towards better binders, adjust at your own risk. More information on adjusting potentials in the RFdiffusion repo
bashtime: The delay to wait before running RFdiffusion, in a format the bash sleep command accepts (for example 30s or 10m). See: https://linux.die.net/man/3/sleep

`[HPC.MPNN]`

relax_cycles: EZBinder default is zero. We didn’t see improvements that justified the significant computational cost. We’ve exposed the configuration option, but haven’t tested other settings.
seqs_per_struct: Number of sequences to generate per backbone structure.
debug: Include debug output.
bashtime: The delay, in bash-readable time, to wait before running ProteinMPNN. This should be a few minutes after RFdiffusion has successfully executed.

`[HPC.ESM]`

bashtime: The delay, in bash-readable time, to wait before running ESMFold. This should be a few minutes after ProteinMPNN has successfully executed.

`[HPC.zipup]`

archivename: Name of the archive of the working directory.
bashtime: The delay, in bash-readable time, to wait before compressing the working directory into archivename.tar.gz. This should be a few minutes after ESMFold has been successfully executed.
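
To show how settings like account, partition, conda, venv, and bashtime could come together, here is a minimal sketch that renders one SLURM batch script as text. It is not the script EZBinder generates: the function name, the single-GPU request, and the placement of the sleep delay inside the job are all assumptions made for illustration.

```python
def render_slurm_script(job_name, account, partition, bashtime, conda_sh, venv, command):
    """Return the text of a SLURM batch script that waits, then runs one pipeline stage."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",      # [HPC.slurm] account
        f"#SBATCH --partition={partition}",  # [HPC.slurm] partition
        "#SBATCH --gres=gpu:1",              # assumption: one GPU per job
        f"sleep {bashtime}",                 # e.g. "30m"; GNU sleep accepts s, m, h, d suffixes
        f"source {conda_sh}",                # [HPC] conda: path to conda.sh
        f"conda activate {venv}",            # stage-specific environment name
        command,
    ]
    return "\n".join(lines) + "\n"
```

A fixed delay is simple to configure but depends on a good estimate of how long the previous stage takes, which is why each bashtime should be set a few minutes past the expected finish of the stage before it.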

How do I evaluate my binders?

EZBinder puts the ESMFold evaluation metrics in a csv under /working_directory/ESM_outs/#/esmscores.csv.

The most predictive metric is mean_plddt, and this is also the easiest to interpret. We suggest that any binder with a pLDDT greater than 90, PAE < 10, and PTM > 0.7 is a good candidate for docking and/or experimental validation.
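
These thresholds are easy to apply programmatically. The sketch below filters the rows of a scores csv and ranks the survivors by pLDDT. It assumes the column names mean_plddt, meanpae, and ptm as listed in this section and a pLDDT reported on a 0-100 scale; the function name is hypothetical.

```python
import csv
import io


def good_candidates(csv_text, min_plddt=90.0, max_pae=10.0, min_ptm=0.7):
    """Return rows passing all three thresholds, best pLDDT first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    keep = [
        row
        for row in rows
        if float(row["mean_plddt"]) > min_plddt
        and float(row["meanpae"]) < max_pae
        and float(row["ptm"]) > min_ptm
    ]
    return sorted(keep, key=lambda row: float(row["mean_plddt"]), reverse=True)
```

To use it on a real run, read the esmscores.csv file into a string and pass that string to `good_candidates`. Each returned row is a dictionary keyed by column name.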

Design: this is the name of the ProteinMPNN sequence that ESMFold used as an input. This is also the name of the output pdb.
input_sequence: the amino acid sequence used.
mean_plddt: The mean of the predicted per-residue scores on the lDDT-Cα metric. Higher is better; a good rule of thumb is that a pLDDT greater than 90 is very high accuracy, 70-90 is high accuracy, and 50-70 is low accuracy. More information on pLDDT metrics here.
meanpae: PAE is “Predicted Aligned Error”. This is usually used to evaluate protein structure on a per-residue-pair basis, but we found that when compressed into a single value it is still correlated with pLDDT. Lower values are best; proteins with PAE values greater than 10 are unlikely to work experimentally.
substratepae: PAE for the target in a complex prediction, 0 for monomer binders
binderpae: PAE for the binder in a complex prediction, 0 for monomer binders
ptm: Predicted TM-score, ESMFold’s estimate of how closely the overall predicted fold matches the true structure, on a scale from 0 to 1. A value > 0.5 is good confidence, and > 0.7 is high confidence; a higher score is generally better. More information on pTM can be found on the ESM Metagenomic Atlas and the ESMFold Paper.
seq_hash: The sha1sum of the amino acid sequence of a protein. EZBinder names binders uniquely within runs, but in order to keep a unique value when comparing runs a fully unique value is required. EZBinder doesn’t produce this value directly, but we’ve included it in our sample data for 5 EZBinder runs.
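
Since EZBinder doesn’t produce seq_hash directly, a value like it can be computed in one line with the Python standard library. This sketch assumes the hash is taken over the bare amino acid string with surrounding whitespace removed; our sample data may have been hashed slightly differently (for example with a trailing newline), so treat the function as illustrative.

```python
import hashlib


def seq_hash(sequence: str) -> str:
    """Return the SHA-1 hex digest of an amino acid sequence, ignoring surrounding whitespace."""
    return hashlib.sha1(sequence.strip().encode("ascii")).hexdigest()
```

Because the digest depends only on the sequence, identical binders from different runs get the same seq_hash even when their run-specific design names differ.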