Preliminary Design

We found that the protein optimization and expression workflow is spread across many complex platforms, with no one-stop service and little support for beginners, so we set out to integrate these platforms into a single tool. Our initial idea was as follows:

  • Step 1: The user inputs a protein sequence.
  • Step 2: The backend optimizes the protein sequence and makes it available for download.
  • Step 3: The optimization sites are displayed to the user.
  • Step 4: The predicted structure of the optimized protein sequence is shown to the user.
  • Step 5: The user chooses the vector and insertion site.
  • Step 6: Following the homologous recombination method of vector construction, primers are designed for both the optimized sequence and the vector to obtain a complete expression system.

The advantages of this design compared to other software are:

  1. Higher integration and user-friendliness. Compared with single-purpose bioinformatics tools, this design covers the full process of protein sequence optimization, structure display, and final expression.
  2. More valuable information is provided to users. The platform not only optimizes the protein sequence but also displays the optimization results and predicts the protein structure. This lets users apply basic biochemical knowledge to reason about the properties of the optimized protein.

Build

Model Selection

PROSS

For protein optimization, we first considered focusing on a single optimization goal. Many natural and designed proteins are only marginally stable, which limits their usefulness in both research and applications. We therefore turned to PROSS, an automated structure- and sequence-based design method for improving protein stability and heterologous expression levels. PROSS uses a position-specific substitution matrix (PSSM) to select “allowed” mutations. It then calculates the energy difference between the original sequence and each sequence carrying an “allowed” mutation, and generates a new sequence that is expected to be more stable and to have a higher heterologous expression level.
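
The two filtering stages described above can be illustrated with a toy example. This is only a sketch of the idea, not the PROSS implementation: the PSSM scores, the energy changes, the thresholds, and the function names `allowed_mutations` and `stabilizing` are all invented for illustration.

```python
# Toy illustration of "PSSM filter, then energy filter".
# All scores, thresholds and names are hypothetical, not taken from PROSS.

def allowed_mutations(sequence, pssm, threshold=0):
    """Return {position: [amino acids]} scoring >= threshold in the PSSM,
    excluding the wild-type residue. Positions are 1-based."""
    allowed = {}
    for pos, (wild_type, scores) in enumerate(zip(sequence, pssm), start=1):
        candidates = sorted(
            aa for aa, score in scores.items()
            if aa != wild_type and score >= threshold
        )
        if candidates:
            allowed[pos] = candidates
    return allowed


def stabilizing(allowed, delta_energy, cutoff=-0.5):
    """Keep allowed mutations whose energy change (mutant minus original)
    is at or below the cutoff, i.e. predicted to be stabilizing."""
    kept = []
    for pos, amino_acids in allowed.items():
        for aa in amino_acids:
            # Mutations without an energy value are treated as neutral (0.0).
            if delta_energy.get((pos, aa), 0.0) <= cutoff:
                kept.append((pos, aa))
    return kept


toy_sequence = "MKV"
toy_pssm = [
    {"M": 5, "L": 1, "K": -3},
    {"K": 4, "R": 2, "E": -1},
    {"V": 3, "I": 0, "D": -4},
]
toy_delta_energy = {(1, "L"): 0.3, (2, "R"): -1.2, (3, "I"): -0.6}

allowed = allowed_mutations(toy_sequence, toy_pssm)
print(allowed)                                  # {1: ['L'], 2: ['R'], 3: ['I']}
print(stabilizing(allowed, toy_delta_energy))   # [(2, 'R'), (3, 'I')]
```

In this toy case the PSSM step removes substitutions that are rarely seen at a position, and the energy step then discards the remaining candidate that is predicted to destabilize the protein.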

SWISS-MODEL

During expression, proteins may run into problems such as misfolding, so predicted structural information is valuable to users. A guiding principle in structural biology is that proteins with similar amino acid sequences are likely to have similar three-dimensional structures. In general, the higher the sequence similarity, the higher the structural similarity.

SWISS-MODEL applies this principle to predict the three-dimensional structure of proteins whose sequences are known but whose structures are not.

It uses as templates proteins of known structure whose sequences are highly similar (highly homologous) to the target, which makes relatively accurate predictions of the unknown structure possible.
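
To make the notion of "highly similar in sequence" concrete, the sketch below computes percent identity between pre-aligned sequences and picks the closest template. It only illustrates the principle: SWISS-MODEL has its own template search and ranking, and the sequences, template names, and function names here are hypothetical.

```python
# Percent identity between two pre-aligned sequences ('-' marks a gap).
# Sequences and template names are invented for illustration.

def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def best_template(target, templates):
    """Return the name of the template with the highest identity to the target."""
    return max(templates, key=lambda name: percent_identity(target, templates[name]))


target = "MKVLAT-G"
templates = {
    "tmplA": "MKVLSTAG",
    "tmplB": "MRILQ-AG",
}

for name, seq in templates.items():
    print(name, round(percent_identity(target, seq), 1))
print(best_template(target, templates))  # tmplA
```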

BLAST & Primer3

When constructing a protein expression system by homologous recombination, the technique commonly used in the lab is Gibson Assembly. We use Primer3 to design primers for both the vector and the protein-coding sequence, and the resulting fragments are then assembled.
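
The core of primer design for homologous recombination is that each insert primer carries a tail matching the vector sequence next to the insertion site. The sketch below shows only that step, under simplifying assumptions: it does not call Primer3, it uses fixed overlap and annealing lengths instead of choosing them by melting temperature, and the function name and sequences are hypothetical.

```python
# Minimal sketch of insert primers with vector-homologous tails.
# Fixed lengths stand in for the Tm-based selection a primer design tool would do.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def gibson_insert_primers(vector, site, insert, overlap=20, anneal=20):
    """Design forward/reverse primers for inserting `insert` at index `site`.

    Each primer is a vector-derived homology tail (length `overlap`)
    followed by a region that anneals to the insert (length `anneal`).
    """
    upstream = vector[:site][-overlap:]    # vector bases just before the site
    downstream = vector[site:][:overlap]   # vector bases just after the site
    forward = upstream + insert[:anneal]
    reverse = reverse_complement(insert[-anneal:] + downstream)
    return forward, reverse


# Short toy sequences so the result is easy to check by eye.
toy_vector = "AAAACCCCGGGGTTTT"
toy_insert = "ATGGCTAGCTAA"
forward, reverse = gibson_insert_primers(toy_vector, 8, toy_insert, overlap=4, anneal=6)
print(forward)  # CCCCATGGCT
print(reverse)  # CCCCTTAGCT
```

Primers that linearize the vector at the insertion site would be designed in the same way, facing outward from the site.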

BLAST (Basic Local Alignment Search Tool) is an algorithm used for sequence alignment. In this context, BLAST is employed to inspect whether there are any mismatches in the protein and vector sequences at sites other than the termini, which could result in assembly failures.

Tools

Back-end

Tencent Cloud
Tencent Cloud is Tencent's cloud platform, providing developers and enterprises with cloud services, cloud data, cloud operations, and other integrated one-stop solutions. We deploy our back-end applications on Tencent Cloud Lighthouse, which is easier to use and closer to the application.

Front-end

Flask
Flask is a web microframework written in Python. It builds on the Jinja2 template engine and the Werkzeug WSGI library, allowing us to implement a website or web service quickly in Python.
Nginx
Nginx is an open source reverse proxy server for HTTP, HTTPS, SMTP, POP3, and IMAP protocols, as well as a load balancer, HTTP cache, and a web server (origin server). The nginx project started with a strong focus on high concurrency, high performance and low memory usage. It is licensed under the 2-clause BSD-like license and it runs on Linux, BSD variants, Mac OS X, Solaris, AIX, HP-UX, as well as on other *nix flavors. It also has a proof of concept port for Microsoft Windows. We deploy our Flask project with Nginx.

Test

  1. During testing, PROSS performed well at enhancing protein stability, but the difficulty of running it on our server kept us from using it to full effect. In addition, PROSS requires a structure file in PDB format as input, preferably an X-ray crystal structure, and such files are not friendly to inexperienced users. We wanted a model that was more general and friendlier to novices.
  2. When wet-lab students used our prot-DAG codon optimization, they compared its output with actual experimental results and found that single-codon optimization did not perform well.
  3. After switching to EVmutation, which is based on unsupervised learning, initial testing revealed a sequence-length mismatch in the protein data passed to EVmutation. The students who took part in the test also reported that preparing the source data for mutation-site prediction required adding protein sequences manually one by one, which made the tool considerably more laborious to use.
  4. Once EVmutation was producing correct results, the test students said that the tabular output was not very helpful for analyzing mutation sites intuitively, whereas a heat map drawn from the delta_E table gave intuitive and effective guidance for choosing mutation sites. They also recommended Jalview, a 2D visualization tool, whose embeddable JalviewJS version might be useful to us.
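
Drawing a heat map from a delta_E table mainly requires pivoting long-format rows into a grid of substitutions by positions. The sketch below shows that pivot in plain Python. The record layout `(position, substitution, delta_E)`, the values, and the function name are assumptions made for illustration and do not reproduce the actual EVmutation output format.

```python
# Pivot long-format (position, substitution, delta_E) records into a grid
# with one row per substitution and one column per position.
# Record layout and values are hypothetical.

def delta_e_matrix(records, alphabet):
    """Return (positions, matrix) where matrix[i][j] is the delta_E of
    substituting alphabet[i] at positions[j], or None if not predicted."""
    positions = sorted({pos for pos, _, _ in records})
    column = {pos: j for j, pos in enumerate(positions)}
    row = {aa: i for i, aa in enumerate(alphabet)}
    matrix = [[None] * len(positions) for _ in alphabet]
    for pos, substitution, delta_e in records:
        matrix[row[substitution]][column[pos]] = delta_e
    return positions, matrix


toy_records = [
    (10, "A", -1.5),
    (10, "G", 0.2),
    (11, "A", -0.3),
    (11, "G", -2.0),
]
positions, matrix = delta_e_matrix(toy_records, "AG")
print(positions)  # [10, 11]
print(matrix)     # [[-1.5, -0.3], [0.2, -2.0]]
```

A grid in this shape can be handed to a plotting library to render the heat map, with positions on one axis and substitutions on the other.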

Learn

Iterations

  1. Iterations in protein optimization model: enhance versatility, adapt to more scenarios, and facilitate easy invocation:
    PROSS is difficult to deploy directly on a server, which leads to a subpar integrated user experience. Moreover, the scope of PROSS is limited to enhancing protein stability. To make our tool more general, we explored machine-learning models for protein optimization. After trialing multiple models, we selected EVmutation, a protein optimization model based on unsupervised learning. Compared with the other models we tried, it was more efficient and produced more consistent results.
    For codon optimization, we collaborated with the experimental team. Recent research indicates that optimizing over two consecutive codons yields better results than optimizing single codons, so we integrated a di-codon-based codon optimization algorithm developed jointly by the software and experimental teams.
  2. Iterations in Integration: Enhance integration and increase convenience:
    We've incorporated a UniProt protein query module, so that multiple protein results can be fed directly into EVmutation. This eliminates the need to manually search for and compile proteins from the same family as datasets for EVmutation.
    In our expression system, we've added data for several commonly used plasmids, relieving a significant portion of users from the hassle of searching for and entering vector sequences themselves.
  3. Iterations in user-friendliness: Enhance output information for a deeper user understanding:
    We've incorporated Jalview to provide a 2D visualization of mutation sites, allowing users to identify mutation locations more easily. We've also visualized the saturation mutagenesis predictions from EVmutation as a heat map. This presentation gives users a clearer view of whether a specific mutation at a particular site is likely to improve the protein.
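
The difference between single-codon and di-codon optimization mentioned in item 1 can be sketched with a small dynamic program: instead of picking the best codon for each amino acid independently, it picks the codon path that maximizes the total score of adjacent codon pairs. This is an illustration under stated assumptions, not our team's algorithm. The codon table is deliberately partial, the pair scores are invented, and the function name is hypothetical.

```python
# Dynamic-programming sketch of codon-pair (di-codon) optimization.
# Pair scores and the partial codon table are hypothetical.

def optimize_dicodon(protein, codon_table, pair_score, default=0):
    """Return (dna, score) maximizing the summed scores of adjacent codon pairs.

    Assumes `protein` is non-empty and every residue is in `codon_table`.
    """
    # best[c] = (score of the best codon path ending in codon c, that path)
    best = {codon: (0, [codon]) for codon in codon_table[protein[0]]}
    for amino_acid in protein[1:]:
        step = {}
        for codon in codon_table[amino_acid]:
            score, path = max(
                (
                    (prev_score + pair_score.get((prev, codon), default), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[codon] = (score, path + [codon])
        best = step
    score, path = max(best.values(), key=lambda item: item[0])
    return "".join(path), score


toy_codons = {"M": ["ATG"], "K": ["AAA", "AAG"], "L": ["CTG", "TTA"]}
toy_pair_scores = {
    ("ATG", "AAA"): 2,
    ("ATG", "AAG"): 5,
    ("AAA", "CTG"): 9,
    ("AAG", "CTG"): 1,
    ("AAG", "TTA"): 3,
}
print(optimize_dicodon("MKL", toy_codons, toy_pair_scores))  # ('ATGAAACTG', 11)
```

In this toy case, choosing the locally best pair first (ATG then AAG, score 5) leads to a total of only 8, whereas considering the whole path reaches 11.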

Improvement of build

  1. Build a user-friendly GUI:

    Our interface design aims to minimize cognitive load for the user. Through a clear and straightforward layout, consistent color coding, and intuitive icons, users can quickly become familiar with the platform on their first visit. We also provide detailed hints and documentation so that users do not feel lost or confused while using it.

  2. Suitable for a wide range of users, from beginners to seasoned researchers:

    Recognizing the diversity of user backgrounds, skills, and experience, we rebuilt our interface to offer a high degree of customization. Whether beginner or expert, every user can adjust the tool's parameters and display according to their needs and preferences.

Reference

[1] Weinstein JJ, Goldenzweig A, Hoch SY, Fleishman SJ. PROSS 2: a new server for the design of stable and highly expressed protein variants. Bioinformatics. 2021 Jan;37(1):123-125. doi: 10.1093/bioinformatics/btaa1071.

[2] Goldenzweig A, Goldsmith M, Hill SE, Gertman O, Laurino P, Ashani Y, Dym O, Unger T, Albeck S, Prilusky J, Lieberman RL, Aharoni A, Silman I, Sussman JL, Tawfik DS, Fleishman SJ. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression and Stability. Mol Cell. 2016 Jul 21;63(2):337-346. doi: 10.1016/j.molcel.2016.06.012. Epub 2016 Jul 14. Erratum in: Mol Cell. 2018 Apr 19;70(2):380. PMID: 27425410; PMCID: PMC4961223.

[3] Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018 Jul 2;46(W1):W296-W303. doi: 10.1093/nar/gky427. PMID: 29788355; PMCID: PMC6030848.

[4] Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG. Primer3--new capabilities and interfaces. Nucleic Acids Res. 2012 Aug;40(15):e115. doi: 10.1093/nar/gks596. Epub 2012 Jun 22. PMID: 22730293; PMCID: PMC3424584.

[5] Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, Marks DS. Mutation effects predicted from sequence co-variation. Nat Biotechnol. 2017 Feb;35(2):128-135. doi: 10.1038/nbt.3769. Epub 2017 Jan 16. PMID: 28092658; PMCID: PMC5383098.

[6] Chung BK, Lee DY. Computational codon optimization of synthetic gene for protein expression. BMC Syst Biol. 2012 Oct 20;6:134. doi: 10.1186/1752-0509-6-134. PMID: 23083100; PMCID: PMC3495653.