Software

EZBinder - A one-click solution to de novo protein binder design that anyone can use.

EZBinder

Highlights

We developed a one-click, automated binder design pipeline that generates .pdb structures designed to bind a given .pdb structure or RCSB record. Everything is controlled through a single configuration file, which lowers the barrier to entry for binder design and allows someone with little to no molecular biology or computer science experience to design a custom binder for any protein.

We found that our computational team was spending a lot of time on the menial tasks that accompany the design of a protein binder from start to finish, such as editing many text files in many locations and running remote processing jobs on our university’s HPC cluster. We were left with the choice of organizing our work and taking an inordinate amount of time, or relying on memory and risking human error to progress rapidly through the design pipeline. We realized that most of the tools we were using had easy-to-use, user-facing interfaces, but they were all in different places and programs. EZBinder puts a handpicked set of the most useful settings from RFdiffusion, ProteinMPNN, and ESMFold in a single configuration file, with documentation that defines each setting in practical terms for the end user: iGEM teams or anyone who wants to make binders without becoming a protein biochemist or a computer scientist. This isn’t the best way to design binders, and there are inefficiencies with this approach, but EZBinder only needs one click to run and it will generate viable binders.

Binders produced using this protocol have strong results in silico: our best mCherry binder candidate has a pLDDT of 93.6123, a mean PAE of 2.0887, and a pTM of 0.8860. See below for the distribution of generated structures for each ESMFold metric, compared to the experimentally verified LAM antibodies in RCSB.

graph of pLDDT from EZBinder mCherry binders

pLDDT scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

graph of pTM scores from EZBinder mCherry binders

pTM scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

graph of PAE scores from EZBinder mCherry binders

PAE scores from an EZBinder run targeting mCherry. Histogram bars are colored by ESMFold confidence metrics, and ESMFold scores for experimentally verified mCherry nanobodies are shown. The highest pLDDT structure generated by this run is indicated in bold.

These results are encouraging for the stability of the generated proteins, but docking with RosettaDock shows that the existing nanobodies still outperform the first round of binders generated with EZBinder. For our computational results we also tested partial diffusion, rescuing old designs, and recycling designs through our protocol, and these did outperform the existing nanobodies, but these features are not yet implemented in EZBinder.

Docking funnel of lowest pLDDT mCherry Binders generated by EZBinder, complex energy

Docking funnel of lowest pLDDT mCherry binders generated by EZBinder, truncated to the region between 1200 and -1200. This plot uses the total complexed energy of the protein in REU (Rosetta Energy Units).


Docking funnel of lowest pLDDT mCherry binders generated by EZBinder, truncated to the region between 1200 and -1200. This plot uses the interface energy of the protein in REU (Rosetta Energy Units).

What does it do?

EZBinder is a Python script that automates every step of a common binder design pipeline. It fully automates a workflow that runs binder designs through RFdiffusion, then ProteinMPNN, then ESMFold.

EZBinder writes SLURM batch scripts to be run on a Linux-based remote server with working installs of RFdiffusion, ProteinMPNN, and ESMFold. We won’t cover how to install those programs on your own server, but we’ve uploaded copies of the Python virtual environments we used to our wiki for future iGEM teams to use. Based on the parameters in the configuration file, EZBinder will:

  1. Fetch a pdb file from RCSB using an RCSB identifier (users can provide their own pdb files if they prefer)
  2. Prepare the structure for RFdiffusion by removing waters and other contaminants using the pdbtools library
  3. Relax the protein using PyRosetta FastRelax (optional, but highly recommended)
  4. Generate the appropriate SLURM scripts to run RFdiffusion, ProteinMPNN, and ESMFold remotely
  5. Upload the scripts to an EZBinder remote working directory
  6. Run bash commands to submit jobs for RFdiffusion, ProteinMPNN, and ESMFold, each one using the previous output as input
  7. Archive the entire working directory into one easily downloadable file. By default, both scores and final pdb files are stored in ESM_outs.

Check the results section for a computational evaluation of binders produced using this pipeline.
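
The job-chaining part of the steps above can be sketched as follows. This is a minimal sketch and not EZBinder's actual code: the function names, script names, and resource values are hypothetical, and it assumes the cluster accepts the standard sbatch options --parsable and --dependency=afterok, so that each stage starts only after the previous one finishes successfully.

```python
from typing import List


def sbatch_script(job_name: str, command: str, time: str = "04:00:00", gpus: int = 1) -> str:
    """Return the text of a minimal SLURM batch script that runs one command.

    The resource values are placeholders; real clusters usually also need
    a partition or account line.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --output={job_name}_%j.log",
        "set -euo pipefail",
        command,
        "",
    ])


def chained_submit_commands(script_paths: List[str]) -> List[str]:
    """Build shell lines that submit each script after the previous one succeeds."""
    lines = []
    for i, path in enumerate(script_paths):
        # Every job except the first waits for the job submitted just before it.
        dep = f" --dependency=afterok:$JOB{i - 1}" if i > 0 else ""
        # --parsable makes sbatch print just the job ID so the shell can capture it.
        lines.append(f"JOB{i}=$(sbatch --parsable{dep} {path})")
    return lines


# Hypothetical script names for the three stages.
submit_lines = chained_submit_commands(["rfd.sh", "mpnn.sh", "esm.sh"])
print("\n".join(submit_lines))
```

With afterok dependencies, a failed RFdiffusion job leaves the later jobs unrun instead of letting them start on missing input.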

    How do I use it?

    1. Prerequisites:
      • Remote Machine: SLURM, dl_binder_design, ESMFold, RFdiffusion
      • Local Machine: Python, PyRosetta (optional)
    2. Download everything in this directory to a single folder
    3. Run pip install -r requirements.txt to install the required Python packages (if you choose to skip the FastRelax step, you can comment out the lines that include pyrosetta in main.py and helpers.py)
    4. Fill in the configuration file pyproject.toml. Parameters that must be adjusted are marked with #CHANGEME. For details, see “What do all those settings on the configuration do?” below.
    5. Run python main.py
    6. EZBinder will prompt you for any information it doesn’t have in its configuration file
    7. Check back on the server and download the zipped archive
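
The prompting behaviour in step 6 can be pictured with a short sketch. This is an illustration under stated assumptions, not the logic in main.py: the key names (username, hostname, partition, time_limit) are hypothetical, only the section names [HPC.creds] and [HPC.slurm] come from the configuration file, and the parser is the standard-library tomllib available in Python 3.11 and later.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib


def missing_settings(config: dict, prefix: str = "") -> list:
    """Return dotted names of settings whose value is an empty string or list."""
    missing = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested TOML tables are searched recursively.
            missing.extend(missing_settings(value, dotted))
        elif value == "" or value == []:
            missing.append(dotted)
    return missing


# Hypothetical configuration fragment; the key names are assumptions.
EXAMPLE_CONFIG = """
[HPC.creds]
username = ""
hostname = "hpc.example.edu"

[HPC.slurm]
partition = ""
time_limit = "04:00:00"
"""

to_prompt = missing_settings(tomllib.loads(EXAMPLE_CONFIG))
print(to_prompt)  # the settings a script would need to ask the user for
```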

    What is RFdiffusion?

    RFdiffusion (RoseTTAFold diffusion) is a program built by the Baker Lab at the University of Washington that generates protein backbones using a diffusion model. This same type of model powers image generation AIs like Stable Diffusion or DALL-E. RFdiffusion can do quite a lot, but the capability we’re interested in is its ability to generate plausible binder structures given a protein structure and, optionally, hotspot residues. For our purposes, this means two things.

    1. RFdiffusion performs better with hotspot residues
    2. RFdiffusion outputs a protein backbone with no sequence

    Both of these considerations influenced our design decisions for EZBinder.

    1) To minimize complexity and improve readability, we omitted the option to run RFdiffusion without specifying hotspot residues. Our concern was that protein design novices would ignore the optional parameter and waste both time and computational resources on a suboptimal binder candidate. We understand that picking potential binding residues can be tricky, so EZBinder will automatically select interface contacts as hotspot residues if the RCSB structure is a complex and the hotspots parameter is left blank. For those who would rather specify their own hotspots, we suggest looking for regions with hydrophobic residues exposed on the surface, ideally with few glycans. These can be found in RCSB itself by following the guide in their FAQ.

    2) RFdiffusion’s output is a plausible backbone, but a cursory examination will show that every residue is a glycine. Such a chain won’t fold into any helpful conformation, and it will be degraded by pretty much any chassis through the natural mechanisms for disposing of disordered proteins. To get a sequence we can test in vivo, we need to generate a protein sequence from this structure. The Baker Lab suggests using ProteinMPNN for this task, and we’ve seen good in silico results with it as well. Future iGEM teams might consider alternatives for generating sequences from backbones, such as the inverse folding model from the ESM family (ESM-IF1).
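
To make the hotspot requirement concrete, here is a minimal sketch of how a binder-design command for RFdiffusion could be assembled. It is not taken from EZBinder: the function name, file paths, and residue numbers are hypothetical, and the override names (inference.input_pdb, contigmap.contigs, ppi.hotspot_res, and so on) follow the binder example in the RFdiffusion README as we understand it, so check them against your installed version.

```python
import shlex
from typing import List, Tuple


def rfdiffusion_command(
    input_pdb: str,
    output_prefix: str,
    target_contig: str,
    binder_length: Tuple[int, int],
    hotspots: List[str],
    num_designs: int = 10,
) -> List[str]:
    """Assemble an RFdiffusion binder-design command as an argument list."""
    if not hotspots:
        # Mirrors the design choice described above: hotspots are mandatory.
        raise ValueError("at least one hotspot residue is required")
    shortest, longest = binder_length
    # "/0 " marks a chain break between the target and the new binder chain.
    contigs = f"[{target_contig}/0 {shortest}-{longest}]"
    hotspot_arg = "[" + ",".join(hotspots) + "]"
    return [
        "python", "scripts/run_inference.py",  # path inside the RFdiffusion repo (assumed)
        f"inference.input_pdb={input_pdb}",
        f"inference.output_prefix={output_prefix}",
        f"inference.num_designs={num_designs}",
        f"contigmap.contigs={contigs}",
        f"ppi.hotspot_res={hotspot_arg}",
    ]


# Hypothetical target: chain A residues 1-150, binder of 70-100 residues.
command = rfdiffusion_command(
    "target.pdb", "outputs/binder", "A1-150", (70, 100), ["A59", "A83", "A91"]
)
print(shlex.join(command))  # shell-quoted form, ready to paste into a batch script
```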

    For more information, check out the RFdiffusion repository on GitHub or the RFdiffusion paper.

    What is ProteinMPNN?

    ProteinMPNN is a deep learning program from the Baker Lab that predicts a sequence given a protein backbone. Since RFdiffusion returns a backbone with no sequence, after a binder design is run through both programs, it will have a plausible backbone and plausible sequence. We decided to access ProteinMPNN through the dl_binder_design suite, which we found to be easier to install than ProteinMPNN directly. Although we haven’t tested this, it should be possible to evaluate binders with AlphaFold as well as ESMFold with slight modifications to EZBinder. Using dl_binder_design simplifies the installation process and allows for the possibility of future feature expansion.

    For more information, check out the ProteinMPNN paper and GitHub repository as well as the dl_binder_design paper and GitHub repository.

    What is ESMFold?

    ESMFold (from ESM, Evolutionary Scale Modeling) is a program from Facebook (now Meta) Research that uses a transformer protein language model to predict a protein’s structure from its sequence. Unlike AlphaFold, it doesn’t need an MSA (Multiple Sequence Alignment), and it runs nearly an order of magnitude faster with accuracy that is close to AlphaFold’s on many proteins. We chose ESMFold for this speed, which lets us measure separate metrics for the complex and the monomer binder while still saving time relative to an AlphaFold-based approach. Future iGEM teams could use ESM inverse folding (ESM-IF1) to complement ProteinMPNN for protein sequence prediction from a backbone, with minimal modifications to EZBinder.

    EZBinder uses ESMFold to score the sequences produced by ProteinMPNN. Each sequence is a binder/target complex. EZBinder folds the complex and scores it, then folds the binder on its own as a monomer and scores that as well. Each ESMFold run produces a .csv file containing selected ESMFold metrics.
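
The complex-then-monomer scoring described above implies two FASTA records per design. The sketch below shows one way to build them. It is an assumption-laden illustration, not the code in esm.py: the function names and sequences are made up, and the ":" chain separator is our understanding of how ESM's own folding script marks chain breaks for multimers, which should be verified for your install.

```python
from typing import List, Tuple


def complex_and_monomer_records(
    name: str, binder_seq: str, target_seq: str, chain_sep: str = ":"
) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs for the complex and the binder alone."""
    return [
        # Complex: binder and target joined with the chain separator.
        (f"{name}_complex", f"{binder_seq}{chain_sep}{target_seq}"),
        # Monomer: the binder folded independently of the target.
        (f"{name}_monomer", binder_seq),
    ]


def to_fasta(records: List[Tuple[str, str]]) -> str:
    """Serialize records as FASTA text, one sequence per line."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Made-up short sequences, for illustration only.
records = complex_and_monomer_records("design_0", "MKTAYIAK", "GSHMVSKGEE")
fasta_text = to_fasta(records)
print(fasta_text)
```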

    For more information, check out the ESMFold paper and GitHub repository.

    What are all those files?

    1. esm.py – This is a script we built that runs ESMFold on the remote machine. It takes a directory full of pdb files with sequences created by ProteinMPNN, converts the entire directory into a fasta file of binder/target complexes, and folds every binder/target complex in the fasta as well as each binder independently. All the ESMFold metrics for each folding run are stored in a csv file for later analysis, and the folded pdb files are left in the same directory as the output csv. This script can be run from the command line independently, but users don’t have to interact with it during normal operation of EZBinder.
    2. interfaceresidues.py – This is a PyMOL script we used last year; it finds the residues at the interface of two chains in a pdb file. We use it to automatically create hotspot residues from pdbs of complexed targets.
    3. helpers.py – To keep main.py as readable as possible, we’ve refactored the more complicated EZBinder logic into helper functions and classes.
    4. main.py – The main EZBinder executable. This is what users should run. If any parameters are missing from pyproject.toml, it will prompt the user for the missing information. It handles preparing the pdb, creating SLURM scripts, submitting those scripts, and compressing all of the generated files into a single archive.
    5. pyproject.toml – This is the configuration file that controls EZBinder. Users only need to adjust the parameters in here, specifically the parameters marked with #CHANGEME. EZBinder performs some very basic sanity checks on inputs from the configuration file, but these should not be relied on. Make sure to match the actual parameters with the examples for best results.
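
The idea behind automatic hotspot selection from a complexed structure can be sketched without PyMOL. The code below is not interfaceresidues.py: it swaps in a plain distance cutoff over atoms, the 5 Å default is an assumption, and the function names and the three-atom example structure are made up for illustration.

```python
import math
from typing import List, Tuple

Atom = Tuple[str, int, Tuple[float, float, float]]


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Read chain ID, residue number, and coordinates from ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB is a fixed-column format, so fields are sliced by position.
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[21], int(line[22:26]), xyz))
    return atoms


def interface_residues(pdb_text: str, target_chain: str, partner_chain: str,
                       cutoff: float = 5.0) -> List[str]:
    """List target residues with any atom within cutoff of the partner chain."""
    atoms = parse_atoms(pdb_text)
    partner = [xyz for chain, _, xyz in atoms if chain == partner_chain]
    hits = set()
    for chain, resnum, xyz in atoms:
        if chain == target_chain and any(math.dist(xyz, p) <= cutoff for p in partner):
            hits.add(resnum)
    # Formatted as chain + residue number, e.g. "A10".
    return [f"{target_chain}{n}" for n in sorted(hits)]


def _ca_line(serial: int, chain: str, resnum: int, x: float, y: float, z: float) -> str:
    """Format one alpha-carbon ATOM record (helper for the toy example only)."""
    return f"ATOM  {serial:>5}  CA  ALA {chain}{resnum:>4}    {x:>8.3f}{y:>8.3f}{z:>8.3f}"


# Toy structure: A10 sits 3 Å from chain B, A11 sits 17 Å away.
example_pdb = "\n".join([
    _ca_line(1, "A", 10, 0.0, 0.0, 0.0),
    _ca_line(2, "A", 11, 20.0, 0.0, 0.0),
    _ca_line(3, "B", 1, 3.0, 0.0, 0.0),
])
hotspots = interface_residues(example_pdb, "A", "B")
print(hotspots)
```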

    What do all those settings on the configuration do?

    EZBinder is built with accessibility as our guiding principle. No one who uses this program should have to tweak the logic; they should only need to fill in values in the config file. The simplest case (which in our mCherry run produced binders predicted to be more stable than existing nanobodies) doesn’t require any prior knowledge of protein design. For more advanced users, we’ve exposed several helpful settings.

    Basic configuration:

    [dirs]

    No need to adjust any parameters here!

    [cleanup]

    [LocalRelax]

    No need to adjust any parameters here!

    [HPC]

    [HPC.creds]

    [HPC.pass]

    [HPC.hpcdirs]

    No need to adjust parameters here!

    [HPC.slurm]

    [HPC.RFD]

    [HPC.MPNN]

    [HPC.ESM]

    Advanced Configuration:

    [dirs]

    [cleanup]

    [LocalRelax]

    [HPC]

    No additional parameters

    [HPC.creds]

    No additional parameters

    [HPC.pass]

    No additional parameters

    [HPC.hpcdirs]

    No additional parameters

    [HPC.slurm]

    No additional parameters

    [HPC.RFD]

    [HPC.MPNN]

    [HPC.ESM]

    [HPC.zipup]

    How do I evaluate my binders?

    EZBinder puts the ESMFold evaluation metrics in a csv under /working_directory/ESM_outs/#/esmscores.csv.

    The most predictive metric is mean_plddt, and this is also the easiest to interpret. We suggest that any binder with a pLDDT greater than 90, a PAE below 10, and a pTM above 0.7 is a good candidate for docking and/or experimental validation.
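
Applying these thresholds to the score file can be sketched in a few lines. This is an illustration under stated assumptions: mean_plddt is the only column name taken from the text above, while name, mean_pae, and ptm are hypothetical and should be replaced with the headers in your own esmscores.csv. The first example row uses the scores of our best mCherry candidate; the second row is made up.

```python
import csv
import io
from typing import List


def good_candidates(csv_text: str, min_plddt: float = 90.0,
                    max_pae: float = 10.0, min_ptm: float = 0.7) -> List[str]:
    """Return names of designs that pass all three suggested thresholds."""
    keep = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        passes = (
            float(row["mean_plddt"]) > min_plddt
            and float(row["mean_pae"]) < max_pae   # hypothetical column name
            and float(row["ptm"]) > min_ptm        # hypothetical column name
        )
        if passes:
            keep.append(row["name"])               # hypothetical column name
    return keep


EXAMPLE_SCORES = """name,mean_plddt,mean_pae,ptm
binder_best,93.6123,2.0887,0.8860
binder_made_up,71.5,14.2,0.55
"""

shortlist = good_candidates(EXAMPLE_SCORES)
print(shortlist)
```

For a real run, read the file contents with open(path).read() and pass the text to good_candidates.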