General Overview

Protease-mediated degradation is widely used in synthetic biology to make circuits more controllable and to optimise their design. To solve the colour-mixing issue in our test kits and shorten the time they take to give a result, our team used protease-mediated degradation with SspB, an adaptor that tethers ssrA-tagged substrates to the ClpXP protease [1]. ssrA tags fall into different categories, each with a different effect on the degradation rate of the targeted protein.


Figure 1: Circuit design for cycle 1

Background

According to [2], ssrA tags can be divided into three categories:
  • 1. Unstable (in the absence of SspB)
    • a. The targeted substrate is degraded rapidly.
  • 2. Stable (in both the presence and absence of SspB)
    • a. The targeted substrate is not degraded in either situation.
  • 3. SspB-dependent
    • a. The targeted substrate is not degraded in the absence of SspB but is degraded in its presence. In the paper, the targeted substrate, GFP-ssrA, was degraded completely within an hour.

As we are aiming for more conditional degradation, we are interested in the third group (SspB-dependent tags). We collected the data provided in the paper and analysed the sequences and their structures.

Methodology

Bioinformatics

      To analyse the ssrA tag sequences, we started with a bioinformatics approach and examined the SspB-dependent category using a Position-Specific Scoring Matrix (PSSM). A PSSM assigns a score to each amino acid at each position in a protein sequence, which helps reveal conserved patterns or motifs associated with functional elements.


Figure 2: Heatmap of PSSM

      We converted the matrix to a probability matrix and then calculated log-odds scores, assuming that every amino acid is equally likely at a given position. A log-odds score is the logarithm of the likelihood of an event relative to its likelihood under a null model; here it compares the observed and expected frequencies of amino acids [3]. The results are shown below:


Figure 3: Top 10 sequences with highest scores

      The results suggest that a PSSM may not be a suitable method for us. The prediction of the already characterised unstable tag (-ADAS) may be due to the lack of data, which leaves some amino acids with a frequency of 0 at specific positions. To build a library despite the limited data, we plan to first expand our dataset by directed evolution of the ssrA tag, and then try machine learning methods based on Bayes' formula to predict SspB-dependent tags.
      However, due to limited time, we were not able to carry out directed evolution during the iGEM period. Rather than testing tags at random in wet lab experiments, we still followed the predictions above, setting aside the first tag (-ADAS).
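The PSSM scoring described in this section can be sketched as follows. This is a minimal illustration rather than the team's actual script: the function names, the `pseudocount` parameter and the three-residue toy sequences are assumptions made for the example, and the toy sequences are not the tags from [2]. It uses the same equal-frequency null model (1/20 per amino acid) and shows how a residue that was never observed at a position gives a score of negative infinity unless a pseudocount is added.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=0.0):
    """Return one dict per position mapping amino acid -> log2-odds score."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # equal-frequency null model
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # A frequency of 0 has no finite logarithm.
            column[aa] = math.log2(freq / background) if freq > 0 else float("-inf")
        pssm.append(column)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific log-odds scores of a candidate sequence."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Hypothetical toy alignment, NOT the characterised tags from [2].
toy_tags = ["LAA", "LVA", "DAS", "LAA"]
strict = build_pssm(toy_tags)                     # zero frequencies kept
smoothed = build_pssm(toy_tags, pseudocount=1.0)  # zero frequencies smoothed

print(score(strict, "LAA"), score(strict, "GAA"))
print(score(smoothed, "LAA"), score(smoothed, "GAA"))
```

With so few training sequences, most candidates contain at least one unseen residue, so the unsmoothed matrix can only rank sequences built from residues it has already seen. This is one way the zero-frequency problem noted above can bias the ranking.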

3D Modeling

      ssrA tag-mediated degradation depends mainly on three components: the ATP-dependent protease ClpXP, the SspB adaptor, and the degradation tag attached to the target protein.


      Data on characterised degradation tag sequences for this system are scarce: only 14 tag sequences have been classified by other researchers. We therefore considered using 3-dimensional protein modelling tools to predict candidate sequences whose degradation performance the wet lab could then characterise. PyMOL is a widely used open-source molecular visualisation tool that provides researchers with a platform to visualise and model protein and DNA structures. It is known that the ClpX subunit interacts with residues 9-11 at the C-terminus of the tag. Hence, to reduce the variation and complexity of the experiment, we restricted the mutational analysis in PyMOL to positions 9-11 of the tag.


      Using the 14 characterised tags we collected from academic journals, we assessed one of the dominant intermolecular forces between protein molecules: hydrogen bonding. Mutagenesis is a built-in function in PyMOL that allows users to mutate amino acid residues and observe the resulting changes. We first imported the structure with the identification code 6WRF, the ClpX-ClpP complex bound to GFP-ssrA (-ALAA-3'), from the Protein Data Bank. For every mutation we made to reproduce one of the 14 characterised sequences, we examined the hydrogen-bond features of the SspB-ClpX and ssrA tag-ClpX interactions: the amino acid positions involved, the polar bond donor and acceptor, and the number of hydrogen bonds (Figure 4).


Figure 4: PyMOL 3D Modelling

      Regrettably, no clear relationship emerged and the results appeared arbitrary, indicating that PyMOL modelling with hydrogen-bond analysis alone is unlikely to be a sound in-silico experimentation tool.

Experiment

      It is crucial to test every construct to understand the stability and degradation kinetics of each degradation tag, and to provide data for the bioinformatics database. A fluorescence assay is one way to investigate the properties of the degradation tags. In addition, fluorescence microscopy and western blotting could be performed to obtain more accurate results.
For a more detailed plan for carrying out the fluorescence assay, see Experiments: Circuit Design

Future Work

      The solutions to today’s problems cannot be found with yesterday’s mindset. In the era of artificial intelligence, machine learning offers an innovative way to investigate protein-protein interactions (PPI). Because we lacked the time to carry out the full process of model training and development for accurate sequence classification, we propose a plan for future work, particularly for when a significant amount of data has been collected.


      Protein-protein interaction hinges on a wide variety of factors. Traditionally, sites of protein interaction are identified experimentally by evaluating the 3-dimensional structure of the protein through X-ray crystallography and NMR spectroscopy [4]. The problem is that thousands of combinations are possible. In our case, even if only the amino acids at positions 9-11 at the C-terminus of the degradation tag are altered, the number of possible combinations is 20^(number of positions to be mutated) = 20^3 = 8000, not to mention further modifications to the remaining amino acids in the tag. Consequently, we chose a statistical classification method, specifically the Naive Bayes Classifier (NBC), to classify degradation tag sequences into the three categories of interest: 1) stable, 2) unstable, 3) SspB-dependent.
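The size of this search space can be checked by enumerating the variants directly. The sketch below assumes, for illustration only, that the starting sequence is the 11-residue wild-type E. coli ssrA tag (AANDENYALAA) and that positions are counted from 1; `enumerate_variants` is a hypothetical helper name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BASE_TAG = "AANDENYALAA"         # assumed starting sequence (wild-type ssrA tag)
MUTATED_POSITIONS = (9, 10, 11)  # 1-based positions to vary


def enumerate_variants(base_tag, positions):
    """Yield every tag in which the given 1-based positions take any residue."""
    for residues in product(AMINO_ACIDS, repeat=len(positions)):
        tag = list(base_tag)
        for pos, aa in zip(positions, residues):
            tag[pos - 1] = aa
        yield "".join(tag)


variants = list(enumerate_variants(BASE_TAG, MUTATED_POSITIONS))
print(len(variants))  # 20 ** 3 = 8000
```

Each additional mutated position multiplies the count by 20, which is why exhaustive wet lab characterisation quickly becomes impractical.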


Naive Bayes Classifier

      The Naive Bayes Classifier is a supervised learning algorithm based on Bayes’ Theorem with the ‘naive’ assumption. In essence, Bayes’ Theorem is a mathematical formula for calculating conditional probabilities. It states that the posterior probability [P(B|E)] can be calculated if the prior probability [P(B)], the marginal probability [P(E)] and the likelihood [P(E|B)] of an event are known. Bayes’ Theorem is given as:

P(B|E) = P(E|B)P(B) / P(E)

, where B is the belief and E is the evidence. When there are multiple beliefs, that is, Bi with i = 1, 2, ..., m, and multiple pieces of evidence, that is, E = {e1, e2, e3, ..., en}, the general form of Bayes’ theorem becomes:

P(Bi|E) = P(Bi)P(E|Bi) / (P(E|B1)P(B1) + P(E|B2)P(B2) + P(E|B3)P(B3) + · · · + P(E|Bm)P(Bm)).
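As a worked illustration of the general form, the short sketch below computes the posterior of each belief from its prior and likelihood. The three class names follow the categories above, but the numerical priors and likelihoods are invented purely for the example and are not measured values.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem to several beliefs.

    priors[b] is P(b); likelihoods[b] is P(E | b) for the observed evidence E.
    """
    # Denominator: P(E) = sum over all beliefs of P(E | b) * P(b)
    evidence = sum(priors[b] * likelihoods[b] for b in priors)
    return {b: priors[b] * likelihoods[b] / evidence for b in priors}


# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"stable": 0.3, "unstable": 0.3, "SspB-dependent": 0.4}
likelihoods = {"stable": 0.1, "unstable": 0.2, "SspB-dependent": 0.6}

post = posterior(priors, likelihoods)
for belief, p in post.items():
    print(f"{belief}: {p:.3f}")
```

The posteriors always sum to 1 because the denominator is the sum of all the numerators.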

      Bayes’ theorem by itself makes no simplifying assumption about the evidence, so in general each input variable may depend on all the others, which complicates the calculation. Naive Bayes therefore adds the assumption that each piece of evidence is conditionally independent of the others given the belief and contributes equally to it. By calculating the posterior probability from multiple pieces of evidence using the formula below, classification tasks can be performed.


Naive Bayes Classifier General Equation
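For reference, the standard textbook form of the naive Bayes posterior, written with the belief and evidence symbols used above, is given here. The index k runs over all beliefs, and the predicted class is the belief with the largest posterior.

```latex
P(B_i \mid e_1, e_2, \ldots, e_n)
  = \frac{P(B_i) \prod_{j=1}^{n} P(e_j \mid B_i)}
         {\sum_{k} P(B_k) \prod_{j=1}^{n} P(e_j \mid B_k)},
\qquad
\hat{B} = \arg\max_{i} \; P(B_i) \prod_{j=1}^{n} P(e_j \mid B_i)
```

The denominator is the same for every belief, so it can be dropped when only the most probable class is needed.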

Collection of Training Set

      To make predictions using a Naive Bayes Classifier, several preliminary steps are required. First, a sufficiently large training dataset should be prepared. Because the degradation performance of different degradation tag sequences has scarcely been characterised, extensive wet lab experimentation is needed, with the structures of the protein complexes precisely documented and annotated in an open database (e.g. the Protein Data Bank [PDB]). A wide, diverse and filtered collection of training and testing data covering different degradation tag sequences ensures that the method can be applied generally.


Feature Selection

      Second, the features (evidence) describing the degradation mechanism should be identified and defined, denoted by E=(e1,e2,e3,...en). For each feature, the categories should be quantified. Tag-induced degradation involves interactions between the tagged substrate and the protease (or, in our case, its recognition subunit, ClpX). SspB is also reported to act as a specificity-enhancing factor by 1) preferentially stimulating degradation of ssrA-tagged proteins, and 2) directing this specific class of substrates to ClpXP rather than to other protease complexes. Interactions between SspB and ClpX will therefore also be a crucial factor in determining the class to which a degradation tag sequence belongs. Hence, we propose structural information (binding kinetics) as the input features for the Naive Bayes Classifier.


      As an example of how to determine suitable features for the NBC: it is known that the binding of ClpX to just a few C-terminal residues of the degradation tag is sufficient to generate the force needed to denature the substrate. When the degradation tag sequence changes, the overall spatial structure is likely to change as well, affecting the binding pocket and the binding affinity of the interacting amino acids. Hence, the binding affinity at the corresponding well-established binding pocket and interacting residues can be calculated as one of the features (Figure 5). Other structural features could include interatomic distances, solvent accessibility and secondary structure information.


Figure 5: Recognition Determinants
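Once features such as these have been binned into categories, a naive Bayes classifier over them could look roughly like the sketch below. It is a minimal, standard-library illustration under stated assumptions: the class name `CategoricalNaiveBayes`, the two features (a binding-affinity bin and a hydrogen-bond-count bin), the Laplace smoothing parameter `alpha` and every training row are hypothetical and do not come from measured data.

```python
import math
from collections import Counter


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace smoothing (alpha > 0)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # value_counts[label][j][value] = times value was seen for feature j
        self.value_counts = {
            c: [Counter() for _ in range(n_features)] for c in self.classes
        }
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[label][j][value] += 1
                self.values[j].add(value)
        return self

    def log_scores(self, row):
        """Return log P(class) + sum_j log P(feature_j | class) for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            log_p = math.log(self.class_counts[c] / total)
            for j, value in enumerate(row):
                num = self.value_counts[c][j][value] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[j])
                log_p += math.log(num / den)
            scores[c] = log_p
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical training rows: (binding-affinity bin, hydrogen-bond-count bin).
rows = [("high", "many"), ("high", "many"), ("low", "few"),
        ("low", "few"), ("medium", "few"), ("medium", "many")]
labels = ["unstable", "unstable", "stable",
          "stable", "SspB-dependent", "SspB-dependent"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("medium", "few")))
```

Working in log space avoids numerical underflow when many features are multiplied, and the smoothing term keeps an unseen feature value from forcing a class probability to zero.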

Evaluation of Algorithm using K-fold cross validation

      After the model has been fitted to the training data, its predictive performance should be evaluated. We propose k-fold cross-validation for this purpose (Figure 6). In essence, a randomly shuffled dataset is divided into k non-overlapping folds of approximately equal size, and the model is trained k times. In each iteration, one fold is held out as the validation set and the classifier is trained on the remaining k-1 folds. The error is calculated in each iteration, for example as 1 - F1, where the F1 score is determined from the true positives (TP), false positives (FP) and false negatives (FN), and the values are then averaged over the k iterations. Classification performance can also be evaluated with other measures, including sensitivity, specificity, precision and accuracy, some of which additionally draw on the true negatives (TN). Once the performance results have been analysed, the model can be tuned further by refining feature selection, model selection and parameter tuning.


Figure 6: k-fold cross validation
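The cross-validation procedure can be sketched with the standard library alone. In this illustration the helper names, the use of a macro-averaged F1 score across the three classes, and the majority-class placeholder model are all assumptions made for the example; any classifier exposing the same `fit_predict` interface could be substituted, and the labels are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices and split them into k non-overlapping, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def macro_f1(y_true, y_pred):
    """Average the per-class F1 = 2TP / (2TP + FP + FN) over classes in y_true."""
    scores = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(rows, labels, k, fit_predict, seed=0):
    """Return the mean error (1 - macro F1) over k folds.

    fit_predict(train_rows, train_labels, test_rows) must return predictions.
    """
    errors = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        preds = fit_predict([rows[i] for i in train],
                            [labels[i] for i in train],
                            [rows[i] for i in held_out])
        errors.append(1.0 - macro_f1([labels[i] for i in held_out], preds))
    return sum(errors) / len(errors)


def majority_baseline(train_rows, train_labels, test_rows):
    """Placeholder model: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * len(test_rows)


# Hypothetical dataset: 12 samples, 4 per class; the features are ignored here.
rows = list(range(12))
labels = ["stable"] * 4 + ["unstable"] * 4 + ["SspB-dependent"] * 4
error = cross_validate(rows, labels, k=3, fit_predict=majority_baseline)
print(f"mean cross-validated error: {error:.3f}")
```

A baseline like this gives a reference error that any trained classifier should beat before further tuning is worthwhile.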

      Once a sufficiently accurate model has been developed and evaluated on the final test data, the desired degradation tag sequences can be fed to it to predict their categories.



      Building a predictive machine learning model is highly likely to require concerted effort from both wet lab and dry lab. Wet lab experiments provide valuable, authentic data for in-silico simulation, while dry lab work turns the experimental results into computational analyses that are less expensive, faster and more systematic. The framework above also extends to analyses beyond the construction of a degradation tag library, and it reflects our emphasis on the potential of machine learning for studying biological systems, especially in synthetic biology.