In the early stages of our project, we realized that our mission was straightforward: to conduct extensive trials in order to arrive at the most suitable peptide, the one with the highest binding affinity, for each of our selected volatile organic compounds (VOCs). However, rather than resorting to conventional and resource-intensive laboratory techniques such as phage display, we chose to embark on a journey into the realm of computational research.
This journey entailed the creation of a comprehensive computational pipeline, encompassing critical stages such as peptide generation, peptide and ligand preparation, evaluation of search space, and finally, molecular docking. To streamline and automate these processes, we developed a set of Python scripts.
Our choice to embrace in silico methods over phage display underscores our commitment to efficiency, cost-effectiveness, and innovative research practices. By bypassing extensive laboratory work, we not only conserve valuable time and resources but also contribute to the evolution of computational biology as a powerful tool in molecular research.
Our dedication to efficiency extends further. We seamlessly transformed these individual scripts into a versatile Python library, designed with user-friendliness as a priority. This library empowers fellow scientists to effortlessly conduct their own in silico phage display experiments.
Beyond computational aspects, we compiled an extensive database, housing over 290,000 peptides tested for their binding affinity to the four selected VOCs. This dataset stands as a valuable resource for future research and exploration.
Furthermore, our software endeavors encompassed the development of image processing and analysis algorithms, along with machine learning techniques. These components converged to create an innovative application intended to aid in the diagnosis of specific medical conditions.
In our software-focused journey, we bridge the gap between experimental and computational biology. We provide a robust platform for peptide discovery and analysis, driven by our commitment to scientific progress and our deliberate choice to forgo the challenges of traditional phage display techniques.
In our quest to find the right genetic modifications of the coat protein of our M13 bacteriophage, we ran into a daunting challenge. The task was to find the best candidate amino acid sequence: the right peptide, following two simple rules:
• We must maximize the affinity of our receptors for their corresponding ligands.
• We must minimize the affinity of our receptors for any other ligand in the sample.
Simple enough, right? Just perform docking. Well, that’s what we thought. Not quite. Twenty amino acids can occupy each position of the sequence, so the search space of 3-peptides contains 20^3 = 8,000 candidates. Manageable, but we didn’t get satisfactory results. The next search space, for 4-peptides, contains 20^4 = 160,000 candidates. This was a problem.
But surely we could find someone to lend us their cluster, right? Well, unfortunately not. We had to adapt. To improve. To overcome. Besides, we are engineers, and that’s what we do. And that’s what we did: in the process, we developed an automated Python docking library with integrated search space algorithms to significantly improve our chances. It was a success. Let’s now delve into PanSimPy, our biopanning simulation library, and its instructions.
PanSimPy is a versatile Python library designed for researchers and scientists working in the field of biopanning and molecular docking. It simplifies the process of peptide generation, conversion between different molecular file types, ligand and receptor preparation for docking, and docking of peptides to ligands using either AutoDock Vina or a strategic algorithm.
The library's capabilities are grounded in both scientific insights and practical utility, making it an indispensable asset for those delving into biopanning and molecular docking studies. Whether you're a seasoned researcher seeking innovative tools to expedite your work or a newcomer looking to harness the power of molecular docking, PanSimPy is your gateway to unlocking the potential of biopanning and molecular docking.
PanSimPy offers the following key features:
- Peptide generation: Easily generate peptides of specific lengths for biopanning experiments. This feature allows you to explore a wide range of peptide candidates for a given target, making it an essential tool for screening potential binding sequences.
- File conversion: Convert between different molecular file formats, such as MOL files, PDB files, or SMILES, effortlessly. This flexibility ensures seamless integration with other molecular modeling tools and databases.
- Ligand and receptor preparation: Streamline the preparation of ligands and receptors for molecular docking. PanSimPy automates the necessary steps to ensure proper input files for docking simulations, saving you valuable time and effort.
- Grid generation: Automatically generate docking grids for efficient molecular docking. The grid generation process is optimized for accuracy and speed, making it suitable for large-scale docking experiments.
- Docking: Perform molecular docking of peptides to ligands using two different methods: brute-force docking with AutoDock Vina, or a genetic-algorithm search of the peptide space (described below).
Before installing PanSimPy, you need to ensure that your environment is properly configured. The following sections outline the steps required to set up your environment on different operating systems.
To make use of PanSimPy on Windows, we recommend using the Windows Subsystem for Linux (WSL). To install WSL, follow the instructions on the Microsoft website.
Once you have WSL installed, you can follow the instructions for Linux to install PanSimPy.
To use PanSimPy on Linux, you need to install the following dependencies:
- Python 3 and pip
- AutoDock Vina
- PyMOL
- Open Babel
- RDKit and NumPy (via pip)
To install these dependencies on Ubuntu, you can use the following commands:
sudo apt update
sudo apt install python3 python3-pip autodock-vina pymol openbabel
pip install rdkit numpy
To install PanSimPy on macOS, follow these steps:
To install Homebrew, follow the instructions on the Homebrew website.
To install Pip, follow the instructions on the Pip website.
To install AutoDock Vina, follow the instructions on the AutoDock Vina website.
brew install open-babel rdkit pymol
pip install rdkit numpy
Clone the PanSimPy repository to your local machine:
git clone https://gitlab.igem.org/2023/software-tools/athens.git
Change your working directory to the cloned repository (the clone command above creates a directory named athens):
cd athens
You can install PanSimPy using the following command:
pip install .
This will install the package and make it available for import in your Python environment.
To verify that PanSimPy was successfully installed, you can run the following command:
python -c "import pansimpy; print(pansimpy.__version__)"
If the installation was successful, you should see the version number of the installed package printed to the console.
The following sections provide a detailed overview of PanSimPy's features and how you can use them in your research.
import pansimpy

# The 20 standard amino acids as one-letter codes.
amino_acids = ["A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
peptide_length = 5

# Generate a random peptide of the requested length from the given alphabet.
random_peptide = pansimpy.generate_random_peptide(peptide_length, amino_acids)
print("Random Peptide:", random_peptide)
import pansimpy

# Docking configuration: the prepared ligand, the grid box centre and
# dimensions, Vina's search exhaustiveness, and how many peptides to dock.
ligand_file = "ligand.pdbqt"
ligand_center = (10.0, 15.0, 20.0)
grid_size_x = 20.0
grid_size_y = 20.0
grid_size_z = 20.0
exhaustiveness = 32
no_of_peptides_to_search = 3

# Dock peptides from "receptor_directory" against the ligand with AutoDock Vina.
pansimpy.run_docking("vina", "receptor_directory", ligand_file, ligand_center, grid_size_x, grid_size_y, grid_size_z, exhaustiveness, no_of_peptides_to_search)
import pansimpy

# Same configuration as above, plus the number of worker threads.
no_of_threads = 4
receptor_directory = "receptors"
ligand_file = "ligand.pdbqt"
ligand_center = (10.0, 15.0, 20.0)
grid_size_x = 20.0
grid_size_y = 20.0
grid_size_z = 20.0
exhaustiveness = 32
no_of_peptides_to_search = 10

# Run the docking workflow across multiple threads in parallel.
pansimpy.run_docking_parallel(no_of_threads, "vina", receptor_directory, ligand_file, ligand_center, grid_size_x, grid_size_y, grid_size_z, exhaustiveness, no_of_peptides_to_search)
import pansimpy

# Genetic algorithm configuration: the peptide length, the docking grid, the
# amino acid alphabet, the number of generations to evolve, and the Vina score
# below which a peptide counts as a hit.
output_folder = "peptide_optimization"
peptide_length = 7
ligand_file = "ligand.pdbqt"
ligand_center = (10.0, 15.0, 20.0)
grid_size_x = 20.0
grid_size_y = 20.0
grid_size_z = 20.0
amino_acids = ["A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
exhaustiveness = 32
num_generations = 10
vina_score_threshold = -4

# Evolve peptides towards stronger binders and return the best one found.
best_peptide = pansimpy.genetic_algorithm_peptide_optimization(output_folder, peptide_length, ligand_file, ligand_center, grid_size_x, grid_size_y, grid_size_z, amino_acids, exhaustiveness, num_generations, vina_score_threshold)
print("Best Peptide:", best_peptide)
import pansimpy

# The full workflow combines every step above: peptide generation, ligand and
# receptor preparation, genetic optimization, and parallel docking, writing
# all results to output_folder.
output_folder = "full_docking_workflow"
peptide_length = 7
ligand_file = "ligand.pdbqt"
ligand_center = (10.0, 15.0, 20.0)
grid_size_x = 20.0
grid_size_y = 20.0
grid_size_z = 20.0
amino_acids = ["A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
exhaustiveness = 32
num_generations = 10
vina_score_threshold = -4
no_of_threads = 4
no_of_peptides_to_search = 10

pansimpy.full_docking_workflow(output_folder, peptide_length, ligand_file, ligand_center, grid_size_x, grid_size_y, grid_size_z, amino_acids, exhaustiveness, num_generations, vina_score_threshold, no_of_threads, no_of_peptides_to_search)
The idea of using genetic algorithms to navigate our vast peptide search space was game-changing. Brute-force docking wasn’t yielding significant results, and then we incorporated genetic algorithms. It was a breakthrough. Below, we touch on the topic briefly and encourage fellow iGEM teams to strategically explore their own problems’ search spaces:
Genetic algorithms (GAs) are optimization and search techniques inspired by the process of natural selection. Drawing parallels to biological evolution, GAs work by initializing a population of potential solutions, then iteratively selecting, crossing over, and mutating individuals based on a fitness function that evaluates their quality. Over successive generations, this evolutionary process promotes the proliferation of higher-quality solutions and gradually converges towards an optimal or near-optimal solution to a given problem. The adaptability and robustness of GAs make them effective for a wide range of complex problems where the search space is large or not well-understood.
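PanSimPy’s own implementation lives in the library itself; the following is a minimal, self-contained sketch of the technique, with a toy fitness function standing in for a docking score (in PanSimPy, a candidate’s fitness would be its Vina binding affinity). Every name and parameter here is illustrative.

import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def fitness(peptide):
    # Toy stand-in for a docking score: reward tryptophan content.
    return peptide.count("W")

def evolve(peptide_length=5, population_size=20, generations=10, mutation_rate=0.1):
    # Initialize a random population of candidate peptides.
    population = ["".join(random.choices(AMINO_ACIDS, k=peptide_length))
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        # Crossover: splice two parents at a random cut point.
        children = []
        while len(children) < population_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, peptide_length)
            children.append(a[:cut] + b[cut:])
        # Mutation: occasionally swap a residue for a random one.
        children = ["".join(c if random.random() > mutation_rate
                            else random.choice(AMINO_ACIDS) for c in child)
                    for child in children]
        population = parents + children
    return max(population, key=fitness)

print("Best peptide:", evolve())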
The creation of the app was of utmost importance to conclude the SCENTIPD project. We built the prototype making sure it was extremely easy to use. Our app, containing an advanced image processing and deep learning pipeline, can serve as a valuable point of reference for future iGEM teams in need of such an MVP.
The APIs in our app implement the logic of its functionalities, which are the following (an illustrative sketch of one controller follows the list):
@image_processing.route('/scentipd/image_processing', methods=['POST'])
This controller processes the image and extracts the image’s attributes, storing them in the database. It also applies the deep learning model to the image and classifies it, extracting the diagnosis.
@patient_handler.route('/api/patients', methods=['POST'])
This controller handles the sign-up process of a user and stores their data in the Database.
@patient_handler.route('/api/patients/login', methods=['POST'])
This controller handles user log-in.
@patient_handler.route('/api/patients/
This controller extracts and stores the email of the logged-in user so that their past results, stored in the database, can be retrieved when they open the ‘Results’ page.
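For illustration, here is a stripped-down sketch of what a controller like the first one might look like. The two helper functions are stand-ins for our image processing and deep learning steps, and nothing here is our production code:

from flask import Flask, Blueprint, request, jsonify

app = Flask(__name__)
image_processing = Blueprint('image_processing', __name__)

def extract_attributes(image_file):
    # Stand-in for the image processing pipeline described below.
    return {'dominant_colors': []}

def classify(attributes):
    # Stand-in for the deep learning model's prediction.
    return 'healthy'

@image_processing.route('/scentipd/image_processing', methods=['POST'])
def process_image():
    image = request.files['image']
    attributes = extract_attributes(image)
    prediction = classify(attributes)
    # In the real app, the attributes and prediction are stored in MongoDB
    # and associated with the logged-in patient.
    return jsonify({'imageAnalysis': attributes, 'prediction': prediction})

app.register_blueprint(image_processing)

if __name__ == '__main__':
    app.run()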
Our database is NoSQL (MongoDB). Our app’s database schema is the following (a short pymongo usage sketch follows the schema):
ImageSchema
- id
- patientId
- imageAnalysis
- prediction
- date
PatientSchema
- _id
- fullName
- age
- gender
- images
- username
- password
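As a usage sketch, this is how documents matching the two schemas might be created with pymongo; the field values, database name, and connection string are placeholders:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['scentipd']

# Create a patient document following PatientSchema.
patient_id = db.patients.insert_one({
    'fullName': 'Jane Doe',
    'age': 62,
    'gender': 'F',
    'images': [],
    'username': 'jdoe',
    'password': '<hashed password>',   # always store a hash, never plaintext
}).inserted_id

# Create an image document following ImageSchema, linked to the patient.
db.images.insert_one({
    'patientId': patient_id,
    'imageAnalysis': {'dominant_colors': []},
    'prediction': 'healthy',
    'date': datetime.utcnow(),
})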
Let’s now follow the steps of our image processing algorithm (a condensed OpenCV sketch follows the list):
1. Load the image.
2. Filter pixels where the RGB values don't differ by more than 15 from each other (|R1-B1| < 15, |R1-G1| < 15, |B1-G1| < 15).
3. Apply the filter to the image, making the filtered pixels black.
4. Convert the filtered image to grayscale.
5. Perform morphological operations to enhance the rectangles.
6. Find contours in the eroded (edited) image.
7. Keep track of the bounding rectangles.
8. Handle noise outside of the rectangles, i.e., shapes that haven’t been filtered out, by iterating over the contours:
   a) Calculate the area of the contour.
   b) Check if the contour area exceeds the threshold (1000 pixels):
      i) Find the convex hull of the contour.
      ii) Find the bounding rectangle of the hull.
      iii) Shrink the current rectangle by a buffer (e.g., 10 pixels).
      iv) Append the shrunken rectangle to the list of bounding rectangles.
9. Convert the filtered image back to PIL Image format.
10. Create a new image to hold the side-by-side rectangles.
11. Iterate over the rectangles and copy them to the new image:
    a) Crop the rectangle from the filtered image.
    b) Paste the rectangle onto the new image.
12. Handle noise inside of the rectangles:
    a) Convert the PIL image to an OpenCV image.
    b) Create the mask for black pixels.
    c) Perform Telea inpainting with radius 3.
    d) Convert the inpainted image to PIL format.
13. Adjust the size to match the model’s input.
14. Display the inpainted image.
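A condensed sketch of the core steps using OpenCV: the thresholds match the description above (15 for the RGB filter, 1000 pixels for the contour area, a 10-pixel buffer, Telea inpainting of radius 3), while the kernel size and file name are assumptions.

import cv2
import numpy as np

img = cv2.imread('kit.jpg')                        # step 1: load (OpenCV uses BGR)
img_i = img.astype(np.int16)
b, g, r = img_i[..., 0], img_i[..., 1], img_i[..., 2]

# Steps 2-3: black out near-gray pixels, where all channel differences are < 15.
gray_mask = (np.abs(r - b) < 15) & (np.abs(r - g) < 15) & (np.abs(b - g) < 15)
filtered = img.copy()
filtered[gray_mask] = 0

# Steps 4-6: grayscale, morphological operations, contour detection.
grayscale = cv2.cvtColor(filtered, cv2.COLOR_BGR2GRAY)
kernel = np.ones((5, 5), np.uint8)                  # kernel size is an assumption
closed = cv2.morphologyEx(grayscale, cv2.MORPH_CLOSE, kernel)
eroded = cv2.erode(closed, kernel)
contours, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Steps 7-8: keep bounding rectangles of large contours, shrunk by a buffer.
rectangles = []
for contour in contours:
    if cv2.contourArea(contour) > 1000:
        hull = cv2.convexHull(contour)
        x, y, w, h = cv2.boundingRect(hull)
        buffer = 10
        if w > 2 * buffer and h > 2 * buffer:       # skip rectangles too small to shrink
            rectangles.append((x + buffer, y + buffer, w - 2 * buffer, h - 2 * buffer))

# Step 12: inpaint the remaining black pixels inside each rectangle (Telea, radius 3).
for (x, y, w, h) in rectangles:
    crop = filtered[y:y + h, x:x + w]
    mask = cv2.inRange(crop, (0, 0, 0), (0, 0, 0))  # mask of pure-black pixels
    filtered[y:y + h, x:x + w] = cv2.inpaint(crop, mask, 3, cv2.INPAINT_TELEA)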
But why do we use deep learning to extract the diagnosis from the colorimetric results?
Let C1 = (R1, B1, G1) and C2 = (R2, B2, G2). We define ΔColor to be the vector difference of the two, ΔC = (R1-R2, B1-B2, G1-G2), in order to encapsulate their proximity in each RGB dimension individually. Scalar distance metrics like the Euclidean distance would not work here, because they fail to distinguish the differences along each dimension.
The RGB channels were chosen arbitrarily. To differentiate the phage colorimetric results of PD patients from those of healthy participants, it is important to study the color boundaries of healthy and PD results, so other color channels/coordinate systems should be examined as well, such as YIQ or CMYK. The conversion formulas across different channels will facilitate this study of the colorimetric boundary. To our knowledge, no colorimetric analysis of phage responses to PD-related VOCs has been published, so the colorimetric boundaries might not be well captured by any of the traditional color channels, and linear or non-linear transformations might need to be applied to the color channels/coordinate systems. This, coupled with the fact that our colorimetric results consist of 4 lanes of independent color gradients, makes it very hard for the boundary to be simple enough to calculate manually, or for the necessary linear or non-linear dimensionality transformation functions to be found by hand.
For these reasons, we propose a deep learning pipeline to uncover those hidden transformations and the boundaries between PD patients and healthy participants. Deep learning models are tailored to the boundary and dimensionality problems we encountered, because they are composed of many layers of nonlinear functions that aim to make the samples linearly separable. When trained correctly, they are adept at identifying intricate patterns in vast amounts of data and can automatically learn the ideal transformations for the problem at hand.
Yet, without real-world data, we can’t build a robust model. For completeness, we present a simple illustrative CNN below.
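As an illustration only, a minimal Keras CNN for the binary PD/healthy decision could look like the following; every architectural choice and hyperparameter here is an assumption, not our final model:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_simple_cnn(input_shape=(224, 224, 3)):
    # Two convolution/pooling stages followed by a small dense head;
    # sigmoid output for the binary PD / healthy decision.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

model = build_simple_cnn()
model.summary()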
Because clinical data will be limited and it will initially be tough to train our deep learning models effectively, we have developed a synthetic image pipeline that simulates the colorimetric results of the kit, so that we will be able to expand our datasets after the initial clinical studies. A toy sketch of the idea follows.
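The pipeline itself lives in our repository; as a rough illustration of the idea, the snippet below draws four lanes of independent vertical color gradients (matching the kit layout described above) and adds noise. All dimensions and noise levels are assumptions.

import numpy as np
from PIL import Image

def synthetic_kit_image(height=224, width=224, lanes=4, noise_std=8):
    rng = np.random.default_rng()
    img = np.zeros((height, width, 3), dtype=np.float32)
    lane_width = width // lanes
    for lane in range(lanes):
        # Each lane is an independent vertical gradient between two random colors.
        top, bottom = rng.uniform(0, 255, 3), rng.uniform(0, 255, 3)
        t = np.linspace(0, 1, height)[:, None]       # interpolation factor per row
        gradient = (1 - t) * top + t * bottom        # shape: (height, 3)
        img[:, lane * lane_width:(lane + 1) * lane_width] = gradient[:, None, :]
    img += rng.normal(0, noise_std, img.shape)       # sensor-like noise
    return Image.fromarray(np.clip(img, 0, 255).astype(np.uint8))

synthetic_kit_image().save('synthetic_kit.png')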
PepDB is a powerful tool designed by Team Athens in 2023. This database, named peptides_database.db, is specifically designed to store and manage critical information about peptides and their binding affinities for four different VOCs (Volatile Organic Compounds): Hippuric Acid, Perillaldehyde, Octadecanal, and Icosane. This information is invaluable for researchers and scientists studying various fields, including molecular biology, chemistry, and drug development.
To use the PepDB tool, you need to download the database file (peptides_database.db) and place it in a directory of your choice. You can then use the database file to interact with the data using the SQLite command-line shell, a GUI tool, or a Python script.
Another option is to download the CSV files (hippuric_acid.csv, perillaldehyde.csv, octadecanal.csv, and icosane.csv) and use them as a reference for creating your own database. This method is useful if you want to customize the database schema or add more data to it.
The database schema is structured to provide a comprehensive storage solution for peptide data. It consists of a single table named peptides with several essential columns (a code sketch for recreating the table follows the list):
- id (INTEGER, PRIMARY KEY, AUTOINCREMENT): This column serves as a unique identifier for each record in the table. It ensures that every peptide entry can be distinguished easily.
- peptide (TEXT, UNIQUE): The peptide column stores the unique sequences of peptides. This column is marked as UNIQUE, meaning that no two peptides with the same sequence can exist in the database.
- binding_affinity_hippuric_acid (REAL): This column holds the binding affinity data for the VOC named ‘hippuric_acid.’ It records the strength of the interaction between the peptide and hippuric acid, providing valuable insights into their compatibility.
- binding_affinity_perillaldehyde (REAL): Similar to the ‘hippuric_acid’ column, this column stores the binding affinity information for ‘perillaldehyde,’ another VOC. It helps researchers assess the peptide’s interaction with perillaldehyde.
- binding_affinity_octadecanal (REAL): The ‘octadecanal’ column is dedicated to recording the binding affinity between peptides and octadecanal, offering crucial data for understanding their relationship.
- binding_affinity_icosane (REAL): Lastly, the ‘icosane’ column tracks the binding affinity between peptides and icosane, providing insights into their compatibility with this specific VOC.
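If you want to recreate the table yourself (for example, when building a custom database from the CSV files), the schema above corresponds to a single CREATE TABLE statement; here is a sketch using Python’s sqlite3 module (the file name is yours to choose):

import sqlite3

connection = sqlite3.connect('my_peptides.db')
# Create the peptides table following the schema described above.
connection.execute("""
    CREATE TABLE IF NOT EXISTS peptides (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        peptide TEXT UNIQUE,
        binding_affinity_hippuric_acid REAL,
        binding_affinity_perillaldehyde REAL,
        binding_affinity_octadecanal REAL,
        binding_affinity_icosane REAL
    )
""")
connection.commit()
connection.close()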
Team Athens has designed the PepDB tool to be user-friendly and flexible, offering multiple methods for interacting with the database:
1. SQLite Command-Line Shell: If you prefer a command-line interface, you can use the SQLite command-line shell. Simply navigate to the directory containing the database file (peptides_database.db) and run sqlite3 peptides_database.db to start the shell. This allows you to execute SQL queries and interact with the data interactively.
2. SQLite GUI Tools: For those who prefer graphical interfaces, there are excellent tools available, such as "DB Browser for SQLite" or "DBeaver". These tools provide a user-friendly environment for managing and querying the database. You can use them to explore the database schema, run SQL queries, and view data in a convenient tabular format.
3. Python Scripting: To interact with the database using Python, the sqlite3 library is your ally. This method allows you to automate database operations and integrate them into your research or application.
sqlite3 peptides_database.db
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> SELECT * FROM peptides WHERE peptide = 'WWW';
290841|WWW||-3.555||
sqlite> .quit
import sqlite3

# Open a connection to the database file and create a cursor.
connection = sqlite3.connect('peptides_database.db')
cursor = connection.cursor()

# Fetch every record for the peptide 'WWW'.
cursor.execute("SELECT * FROM peptides WHERE peptide = 'WWW'")
results = cursor.fetchall()
print(results)

# Close the connection when done (see the note below).
connection.close()
It's important to note that when no binding affinity data is available for a particular peptide from a specific source, the corresponding column contains NULL. In Python, this is represented as None. Researchers should be aware of these null values when analyzing the data.
For optimal performance and resource management, remember to close the database connection after use. Failing to do so can lead to resource leaks, which may affect the stability of your application or research environment.
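One way to guarantee the connection is closed even if a query raises an exception, using only the standard library:

import sqlite3
from contextlib import closing

# closing() ensures close() is called when the block exits, even on error.
with closing(sqlite3.connect('peptides_database.db')) as connection:
    with closing(connection.cursor()) as cursor:
        cursor.execute("SELECT * FROM peptides WHERE peptide = ?", ('WWW',))
        print(cursor.fetchall())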
This Jupyter notebook, Color_Extraction.ipynb, demonstrates how to extract the dominant colors from an image using OpenCV and K-Means clustering. The notebook includes code for loading an image from a file, converting the image to the HSV color space, applying a color mask to extract the desired colors, and then computing the dominant colors using K-Means clustering. The resulting colors are printed out, along with the original image and the masked image. The notebook also includes code for displaying the colors as a color palette.
To run this notebook, you will need to have the following dependencies installed:
- OpenCV
- NumPy
- Matplotlib
The notebook assumes that you have an image file called kit.jpg in the notebook’s directory. To use it, place kit.jpg (or your own image) in that directory and run the cells in order. You can specify the path to your image file in the img_path variable, and modify the color mask and K-Means clustering parameters as needed.
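The full code is in the notebook; its core looks roughly like the following, where the saturation bounds of the mask and the cluster count are assumptions:

import cv2
import numpy as np

img = cv2.imread('kit.jpg')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Mask out low-saturation (near-gray) pixels; the bounds are assumptions.
mask = cv2.inRange(hsv, (0, 40, 40), (180, 255, 255))
pixels = img[mask > 0].reshape(-1, 3).astype(np.float32)

# K-Means on the remaining pixels; the cluster centers are the dominant colors.
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, 4, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)
print("Dominant colors (BGR):", centers.astype(int))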
This Jupyter notebook, Diagnosis_Alg.ipynb, demonstrates how to train a deep learning model to diagnose medical images using TensorFlow and the ResNet50 architecture. The notebook includes code for loading and preprocessing image data, defining and training a deep learning model, and evaluating the model’s performance.
To run this notebook, you will need to have the following dependencies installed:
- TensorFlow
- scikit-learn
- NumPy
- PIL
The notebook assumes that you have a dataset of medical images in a directory called data. The images should be organized into subdirectories based on their class labels, with each subdirectory containing images of a single class. The notebook uses the ImageDataGenerator class from TensorFlow to load and preprocess the image data.
The notebook uses the ResNet50 architecture, a pre-trained convolutional neural network, as the base model for the diagnosis algorithm. The notebook loads the pre-trained ResNet50 model without the top (classification) layer, freezes the layers in the base model so they won’t be trained, and then defines a new model that starts with the base model and ends with a Dense classification layer. The new model includes several additional layers, including a Flatten layer, BatchNormalization layers, and Dropout layers, to improve the model’s performance.
The notebook splits the data into a training set and a test set using the train_test_split function from scikit-learn. The notebook then compiles the model using the Adam optimizer and binary cross-entropy loss, and trains the model for 10 epochs using the fit method. The notebook saves the trained model to a file called model.h5.
The notebook evaluates the performance of the trained model using the evaluate method, which calculates the loss and accuracy of the model on the test set. The notebook prints the test accuracy to the console.
To use this notebook, download and extract kit_colors.zip into the notebook’s directory and run the cells in order. You can specify the path to your image data in the data_dir variable, and modify the model architecture and training parameters as needed.
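As a condensed sketch of the pipeline the notebook describes (frozen ResNet50 base, Flatten, BatchNormalization, Dropout, and a Dense sigmoid head, trained with Adam and binary cross-entropy); the layer width and dropout rate are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Pre-trained base without the top classification layer, frozen as described.
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.BatchNormalization(),
    layers.Dense(128, activation='relu'),     # layer width is an assumption
    layers.Dropout(0.5),                      # dropout rate is an assumption
    layers.Dense(1, activation='sigmoid'),    # binary diagnosis output
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=10)   # 10 epochs, per the notebook
# model.save('model.h5')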
We welcome contributions from the community that can help improve PanSimPy and make it even more useful for researchers and scientists in the field of biopanning and molecular docking. If you're interested in contributing, please follow these steps to get started:
1. Fork the Repository: Start by forking the PanSimPy repository to your GitLab account.
2. Clone the Forked Repository: Clone your fork to your local machine (replace the URL below with your fork’s URL):
git clone https://gitlab.igem.org/2023/software-tools/athens.git
3. Create a New Branch: Create a new branch for your contribution. Make sure to choose a descriptive name that reflects the nature of your contribution.
git checkout -b feature/new-feature
4. Make Changes: Make the necessary changes and improvements to the codebase.
5. Test Your Changes: Before submitting a merge request, ensure that your changes are tested and do not introduce any regressions.
6. Commit and Push: Commit your changes and push them to your forked repository.
git commit -m "Add your commit message here"
git push origin feature/new-feature
7. Submit a Merge Request: Go to the PanSimPy repository on GitLab and click on the “New merge request” button. Choose your branch and provide a clear description of your changes.
8. Code Review: Your merge request will be reviewed by the maintainers. Be prepared to address any feedback or suggestions.
The Software Tools have been developed by a dedicated team of four individuals who poured their expertise and efforts into every aspect of the project. We extend our heartfelt thanks and recognition to these core members for their substantial contributions:
The Dry Lab members of the iGEM Athens 2023 team are responsible for the development of PanSimPy. Their collective efforts have resulted in a powerful tool that can be used by researchers and scientists to streamline their biopanning and molecular docking studies. Their hands-on usage of PanSimPy has been instrumental in the development of the tool, and has played a crucial role in the creation of PepDB.
Furthermore, we extend our appreciation to the authors of the following pivotal papers that have significantly influenced our work:
- Eberhardt, J., Santos-Martins, D., Tillack, A.F., Forli, S. (2021). “AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings.” Journal of Chemical Information and Modeling.
- Trott, O., & Olson, A. J. (2010). “AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading.” Journal of Computational Chemistry, 31(2), 455-461.
We also acknowledge the importance of the following software tools that have been instrumental in our development process:
- Open Babel: N M O’Boyle, M Banck, C A James, C Morley, T Vandermeersch, and G R Hutchison. “Open Babel: An open chemical toolbox.” Journal of Cheminformatics, 3, 33 (2011). DOI:10.1186/1758-2946-3-33
- PyMOL: Schrödinger, LLC, ‘The PyMOL Molecular Graphics System, Version 1.8’. Nov-2015.
- RDKit: Open-source cheminformatics. https://www.rdkit.org
The functionalities and insights provided by these papers and software tools have played a crucial role in the creation of PanSimPy.
Should you be interested in becoming part of PanSimPy as well as PepDB and contributing to their continued growth, please refer to the “Contributing” section above for more information on how you can participate.