Software
PrAtZy
Developing more robust in-silico methods for generating DNA/RNA nanostructures for detection is challenging, particularly because there is no standardized software with modules for designing novel aptamers and DNAzymes. The task is made harder by the difficulty of telling potential aptamer/DNAzyme sequences apart from non-functional ones within a library, mainly because their properties change significantly after folding into functional 3D structures. To address these issues, the PrAtZy suite offers a single platform for all protein, aptamer, and DNAzyme-related computational needs. The package relies on computationally feasible methods, which speeds up the search for new aptamers and DNAzymes.
Modules in PrAtZy include:
- KAMI_2.0 (Kwick Aptamer Motif Identification 2.0; uses deep learning against a library of known protein-aptamer interactions)
- KAMI_1.1 (Uses a user-supplied protein library and protein vectorization to determine the protein bound by a given aptamer. Builds on KAMI, presented by iGEM IISER-Mohali 2022)
- UdBAS (Unique determined base for Apta selection; uses KAMI_2.0 to determine a library of sequences to start SELEX with. This can be used experimentally or in silico)
- Strucgen (Determines mutations in primary sequences that preserve secondary structure)
- Bolt (Checks the heat stability of DNA/RNA)
KAMI_2.0
Kwick Aptamer Motif Identification 2.0
(Protein to Aptamer)
Generating aptamers purely in silico is a daunting task. Multiple steps must be optimised to produce an aptamer that is both specific and sensitive. Ideally, the number of SELEX cycles should be minimized while the binding affinity of the aptamer is maximized, which saves both the time and the effort needed to generate an aptamer. KAMI uses a neural network model to categorize proteins against a known reference. This reference is used to calculate the probabilistic secondary structure of the aptamer for that protein. The probabilistic secondary structure can then be used iteratively to generate a list of potential secondary structures for that protein, from which successive (progressive) docking studies can be used to arrive at the actual aptamer.
The module used to categorize the proteins is called BART, which is a pre-trained prot2vec Python library. It categorizes proteins in the same way that word2vec processes text: the module breaks a protein into small tuples and considers the triplet distribution in the polypeptide. This is then used to assign an (x, y) score after the model has been trained in "context", which in the case of proteins is the set of protein families pre-classified by various methods of alignment and 3D structural homology. The trained model can then classify new sets of proteins.
The breakthrough arises from two main insights. First, we use a database of proteins with known aptamers. Clustering these proteins creates a natural basis space for classifying a new protein. The protein of interest (POI, henceforth) is classified against this database, and the distance between the POI and each database protein is noted. Second, we represent the Vienna (dot-bracket) structure of an aptamer using the letters H and U: every '.' is replaced by U and every '(' or ')' by H. This creates a new string representation of the secondary structure of the aptamer.
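The H/U re-encoding of a Vienna structure is simple enough to show directly. The following is a minimal sketch; the function name `vienna_to_hu` is chosen for illustration and is not taken from the PrAtZy code.

```python
def vienna_to_hu(structure: str) -> str:
    """Re-encode a dot-bracket string: '.' becomes U, '(' and ')' become H."""
    mapping = {".": "U", "(": "H", ")": "H"}
    try:
        return "".join(mapping[ch] for ch in structure)
    except KeyError as err:
        # Anything other than '.', '(' or ')' is rejected rather than guessed at.
        raise ValueError(f"unexpected character in structure: {err.args[0]!r}") from None


print(vienna_to_hu("((..((...))..))"))  # HHUUHHUUUHHUUHH
```

Note that the encoding deliberately discards which base pairs with which; it only keeps whether a position is paired.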
The user notes the n closest neighbour proteins reported by KAMI_2.0. The associated aptamer secondary structures are aligned, summed position by position (H is given a value of +1, U a value of -1, and an alignment gap '-' a value of 0) and rescaled to values between 0 and 1. The resulting vector has the length of the longest aptamer sequence and encodes a probabilistic secondary structure: each entry represents the probability that the base at that position is paired. The closer a value is to 0, the higher the probability that the base is unpaired; conversely, values closer to 1 mark bases that are most probably paired. Multiple secondary structures can be generated from this probability distribution using energy decomposition methods that are available online. The only additional boundary condition these structures need to follow is that every paired region must be symmetrical around a region of unpaired bases. The structures generated from these probabilistic secondary structures can be plugged into Strucgen and analysed for their properties.
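The summing and rescaling step can be sketched as follows. This is only an illustration under two assumptions that the text does not spell out: the H/U strings are already aligned and padded with '-' to a common length, and the rescaling is a linear map from the range [-n, n] onto [0, 1]. The function name is hypothetical.

```python
# Per-position scores as described in the text: H = +1, U = -1, gap = 0.
SCORES = {"H": 1, "U": -1, "-": 0}


def pairing_probabilities(aligned):
    """Turn aligned H/U strings into one per-position value in [0, 1]."""
    if not aligned:
        raise ValueError("need at least one aligned structure")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("structures must be aligned to the same length")
    n = len(aligned)
    probabilities = []
    for column in zip(*aligned):
        total = sum(SCORES[ch] for ch in column)   # lies in [-n, n]
        probabilities.append((total + n) / (2 * n))  # assumed linear rescaling
    return probabilities


neighbours = ["HHUUHH", "HUUUUH", "HH--HH"]
print(pairing_probabilities(neighbours))
```

With this choice, a position that is H in every neighbour maps to 1, a position that is U in every neighbour maps to 0, and a column made only of gaps maps to 0.5.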
KAMI_1.1
Kwick Aptamer Motif Identification 1.1
(Aptamer to Protein)
The rise of aptamer-based cell identification technology has led to several experimentally verified aptamers generated by whole-cell SELEX. The major drawback is that the substrate bound by the aptamer is not known, which prevents the aptamer from being improved using in-silico methods. We solve this problem by leveraging the fact that, due to the stochastic nature of whole-cell SELEX, the binding of the aptamer is not all-or-none (as seen with antibodies): binding to proteins similar to the target is reduced, but not zero. This was verified by iGEM IISER_Mohali-2022 by docking all unique outer membrane proteins of Salmonella typhi against a Salmonella-detection aptamer (4).
Ideally, protein identification could be done non-probabilistically by docking every candidate protein against the aptamer sequence, but this is computationally prohibitive even for simple systems. KAMI optimized this process by uniquely assigning each protein to a cluster. This allowed the user to dock the central element of each cluster and obtain an average value for the binding energy of that protein cluster to the aptamer of interest. The cluster with the lowest ensemble energy is selected and divided further, until the target protein is reached. This method reduces the number of docking tests needed to identify the protein bound by the aptamer by a factor of 1000.
KAMI_1.1 uses improved protein vectorization techniques. While keeping the basics of the PseAAC module introduced in KAMI, we have added hydrophobicity and charge measures to the vector produced. The module for generating proteins has also been changed to include a better HMM with an updated transition matrix.
Consider a polypeptide chain that can be represented as a vector P.
The entries of the vector are amino acids, and the vector has a dimension equal to the number of amino acids, L, in the polypeptide. We can represent this as:
$$P = [R_1, R_2, \ldots, R_L]$$
The task now is to convert this protein vector into a format that can be read and understood meaningfully by a computer. It is also helpful to obtain vectors of a common dimension, so that polypeptide chains of different lengths can be compared. We use the PseAAC converter to do just that. In the standard PseAAC form, this can be summarized as:
$$X = [x_1, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}], \qquad x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 20 < u \le 20 + \lambda \end{cases}$$
Where,
$$\theta_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}$$
and
$$J_{i,i+k} = \frac{1}{T} \sum_{t=1}^{T} \left[ \Phi_t(R_{i+k}) - \Phi_t(R_i) \right]^2$$
Such that $\Phi_t(R)$ is the t-th characteristic of the amino acid R, and T is the total number of characteristics that we consider. Here $f_u$ is the frequency of amino acid u in the chain, w is a weight factor and $\lambda$ is the number of correlation tiers.
The factor $J_{i,i+k}$ is a single number that captures the variation of the characteristics (as many as wanted) between the amino acids at positions i and i+k (the k-th tuple). We call this the k-tier correlation factor. So for every tier of correlation in a polypeptide, we get a single numerical descriptor for every pair of amino acids at a distance k. This is akin to the resolution, or level, of correlation that we want to draw from the polypeptide. The factor $\theta_k$ takes into account all the correlation terms of a single tier and scales them to a weight value. This can be thought of as compressing the individual amino-acid correlations into a single numerical descriptor. This factor is then rescaled against the sum of all the levels of correlation drawn from the protein to get the final entry in the PseAAC vector.
Notice that the dimension of the final output vector depends only on the number of tiers of correlation drawn from the polypeptide. Thus polypeptides of different lengths are reduced to vectors of the same dimension, which can then be compared.
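To make the tier correlation idea concrete, here is a minimal sketch that computes one correlation value per tier for a sequence and then rescales the values against their sum. It assumes the usual PseAAC convention of averaging squared differences of each characteristic over all residue pairs at distance k. The property table in the usage lines holds made-up numbers for illustration only, not a real hydrophobicity scale. The sketch covers only the correlation part of the vector; the amino-acid frequencies and weight factor of the full PseAAC vector are left out.

```python
def tier_correlations(sequence, properties, n_tiers):
    """Return one averaged correlation value per tier k = 1..n_tiers.

    `properties` is a list of dicts, each mapping an amino acid letter to
    one numerical characteristic (hypothetical values in the example below).
    """
    length = len(sequence)
    if not properties:
        raise ValueError("need at least one characteristic")
    if n_tiers >= length:
        raise ValueError("n_tiers must be smaller than the sequence length")
    n_props = len(properties)
    tiers = []
    for k in range(1, n_tiers + 1):
        total = 0.0
        for i in range(length - k):
            a, b = sequence[i], sequence[i + k]
            # Mean squared difference over all characteristics for this pair.
            total += sum((phi[b] - phi[a]) ** 2 for phi in properties) / n_props
        tiers.append(total / (length - k))
    return tiers


def rescale(tiers):
    """Divide each tier value by the sum over all tiers (simplified rescaling)."""
    total = sum(tiers)
    return [t / total for t in tiers] if total else [0.0] * len(tiers)


toy_scale = {"A": 0.0, "K": 1.0, "L": 2.0}  # made-up values, illustration only
print(tier_correlations("AKLA", [toy_scale], n_tiers=2))  # [2.0, 2.5]
print(rescale(tier_correlations("AKLAKKLLA", [toy_scale], n_tiers=2)))
```

Both calls return vectors of length 2 even though the input chains have different lengths, which is the property that makes the vectors comparable.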
Working
- Either generate a list of polypeptide chains with our generator or enter your own list of protein sequences in FASTA format
- Run the KAMI_1.1 engine
- The cluster results are shown on the screen, and the cluster centers can be downloaded
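The cluster-and-dock search that KAMI uses to narrow down the target protein can be sketched as a loop: cluster the protein vectors, dock only the central element of each cluster, keep the cluster with the lowest energy, and repeat. The sketch below is not the KAMI implementation. The clustering step is a simple stand-in (farthest-point seeding with nearest-seed assignment), and `dock` is a hypothetical caller-supplied function that returns a binding energy for a protein index.

```python
import math


def central_element(cluster, vectors):
    """Return the member of `cluster` closest to the cluster centroid."""
    dim = len(vectors[cluster[0]])
    centroid = [sum(vectors[m][d] for m in cluster) / len(cluster) for d in range(dim)]
    return min(cluster, key=lambda m: math.dist(vectors[m], centroid))


def split(members, vectors, k):
    """Stand-in clustering: farthest-point seeding, then nearest-seed assignment."""
    seeds = [members[0]]
    while len(seeds) < k:
        seeds.append(max(members, key=lambda m: min(
            math.dist(vectors[m], vectors[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for m in members:
        nearest = min(clusters, key=lambda s: math.dist(vectors[m], vectors[s]))
        clusters[nearest].append(m)
    return [c for c in clusters.values() if c]


def find_target(vectors, dock, k=3):
    """Return (index of best protein, number of docking calls made)."""
    if not vectors or k < 2:
        raise ValueError("need at least one vector and k >= 2")
    members = list(range(len(vectors)))
    calls = 0
    while len(members) > k:
        clusters = split(members, vectors, k)
        if len(clusters) < 2:  # cannot be divided further
            break
        scored = []
        for cluster in clusters:
            scored.append((dock(central_element(cluster, vectors)), cluster))
            calls += 1
        # Keep only the cluster whose central element binds best (lowest energy).
        members = min(scored, key=lambda pair: pair[0])[1]
    energies = {m: dock(m) for m in members}
    calls += len(members)
    return min(energies, key=energies.get), calls
```

The saving comes from docking one representative per cluster instead of every protein. The search is greedy, so it relies on similar vectors having similar binding energies; where that assumption fails, the true target can sit in a discarded cluster.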
UdBAS
Developing aptamers takes 15-30 rounds of SELEX. This is a time-consuming process because of the mutation by error-prone PCR (EP-PCR) and the selection by affinity SELEX steps (1). One way to bypass this is to use modified base pairs for aptamer selection, but these are costlier to produce and cannot be amplified by PCR. Furthermore, using modified DNA base pairs adds another variable to the folding landscape, which needs optimization after aptamer synthesis (2). One potential way of shortening the search time for aptamer development is to start from good libraries. Several libraries have been developed and have had varying success in shortening aptamer development time. In general, there is now a sense of a universal grammar that all aptamer sequences share: low mfe (minimum free energy, henceforth), complex structure and bulge-associated regions (3).
We use these criteria in the reverse direction to what has previously been done in several in-silico aptamer screening protocols. Instead of starting with a random sequence, we take a protein of interest and run KAMI_2.0 on it. This locates the protein against a background of selected proteins, all of which are known to be associated with DNA/RNA aptamers and have been collected and presented in a free-to-use format by our team (apgen database). The protein of interest is thereby grouped with similar proteins. The aptamers associated with these proteins are called the base generators. They are then subjected to mutations, and their mfe is scanned against a known reference to generate a better library.
UdBAS is a library generator that can be used in several ways. The grouped aptamers can be ordered and subjected to EP-PCR to produce aptamers experimentally. Alternatively, the library can be put through a general SELEX workflow in silico to generate aptamers purely computationally.
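The mutate-and-filter step behind the library can be sketched as below. This is an illustration, not the UdBAS code: the function names, the per-base mutation rate and the number of variants are all assumptions, and the energy function is supplied by the caller. With the ViennaRNA Python bindings installed, an RNA energy function could be `lambda s: RNA.fold(s)[1]`; the usage lines use a toy GC-count score instead so that the sketch runs without any folding package.

```python
import random


def mutate(sequence, rate, rng, alphabet="ACGT"):
    """Substitute each base with probability `rate` by a different base."""
    out = []
    for base in sequence:
        if rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


def build_library(base_generators, energy_fn, reference_mfe,
                  n_variants=50, rate=0.05, seed=0):
    """Keep mutated variants whose energy is at or below the reference value."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    library = set()
    for parent in base_generators:
        for _ in range(n_variants):
            child = mutate(parent, rate, rng)
            if energy_fn(child) <= reference_mfe:
                library.add(child)
    return sorted(library)


# Toy stand-in for an mfe calculation (NOT a real energy model).
toy_energy = lambda s: -float(s.count("G") + s.count("C"))
library = build_library(["GGGCCCAAATTT"], toy_energy, reference_mfe=-6.0)
print(len(library), "candidate sequences kept")
```

Passing the energy function in as a parameter keeps the filter independent of the folding package, so the same loop can serve RNA and DNA base generators.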
Strucgen
Many aptamers show low binding to the target sequence due to energy barriers and instability at higher temperatures. It may also be the case that a low-affinity aptamer could become a very good aptamer if its primary structure is changed while its secondary structure is kept the same. The same holds for DNAzymes, where different target sequences require different primary sequences that all share the same secondary structure. Strucgen is a module that simplifies this generation. It accepts two inputs, a secondary structure and its associated primary structure, and returns multiple sequences that have the same secondary structure but a different primary structure. The module also produces a bar graph of the mfe of the predicted structures, which lets the user decide which sequence is best for continuing experiments. Strucgen works on the basis of mutation and folding. It uses the Vienna package to fold RNA and the seqfold package to fold DNA. After predicting the sequences, the program verifies its output by refolding each predicted sequence and matching the result to the secondary structure of the original sequence.
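One way to propose sequences that are likely to keep a given secondary structure is to mutate unpaired positions freely and to replace paired positions only with complementary pairs. The sketch below illustrates that idea under stated assumptions: it handles RNA with Watson-Crick pairs only, the function names are hypothetical, and it is not taken from the Strucgen source. It only generates candidates, so each one still needs the refolding check described above.

```python
import random

WATSON_CRICK = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]


def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def candidate_variant(sequence, structure, rate, rng):
    """Mutate a sequence while keeping every paired position complementary."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    pairs = pair_table(structure)
    out = list(sequence)
    for i, ch in enumerate(structure):
        if rng.random() >= rate:
            continue
        if ch == ".":
            out[i] = rng.choice("ACGU")          # unpaired: any base
        elif ch == "(":
            five, three = rng.choice(WATSON_CRICK)
            out[i], out[pairs[i]] = five, three  # paired: swap both partners
    return "".join(out)


rng = random.Random(42)
print(candidate_variant("GGGAAACCC", "(((...)))", rate=0.5, rng=rng))
```

For DNA, the same sketch would use T in place of U. Keeping pairs complementary does not guarantee that the minimum free energy fold is unchanged, which is why the refolding step remains necessary.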
Bolt
It is important to know the heat stability of different DNA/RNA-based detection tools. This matters not only because heat is a permanent denaturant for RNA, but also because the ensemble frequency of the active molecule depends on temperature. Bolt calculates the bond breakage and the rise in the mfe of the ensemble. It derives its name from the Boltzmann distribution, which the energy of the system follows as the temperature increases. The module takes minimum and maximum temperature values for the simulation, refolds the DNA/RNA at each temperature, and counts the number of bonds present in the most stable structure. It also records the value of the mfe. Finally, it outputs a graph that plots the number of bonds and the energy together, to give a sense of the total number of mfe structures in the solution.
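The temperature scan can be sketched as a loop that refolds the sequence at each temperature and records the number of base pairs and the mfe. This is an illustration rather than the Bolt source: the step size, the function names and the callback design are assumptions. The folding itself is delegated to a caller-supplied function; `vienna_fold_at` shows one possible RNA implementation and needs the ViennaRNA Python bindings, which are imported only when that function is called.

```python
import math


def count_pairs(structure):
    """Number of base pairs in a dot-bracket string (one '(' per pair)."""
    return structure.count("(")


def thermal_scan(sequence, t_min, t_max, step, fold_at):
    """Return (temperature, base pairs, mfe) for each temperature in the range.

    `fold_at(sequence, temperature)` must return (dot_bracket, mfe).
    """
    if step <= 0 or t_max < t_min:
        raise ValueError("need step > 0 and t_max >= t_min")
    n_steps = int(math.floor((t_max - t_min) / step + 1e-9))
    results = []
    for i in range(n_steps + 1):
        temperature = t_min + i * step
        structure, mfe = fold_at(sequence, temperature)
        results.append((temperature, count_pairs(structure), mfe))
    return results


def vienna_fold_at(sequence, temperature_c):
    """Possible RNA folding callback using ViennaRNA (installed separately)."""
    import RNA
    md = RNA.md()
    md.temperature = temperature_c  # temperature in degrees Celsius
    fc = RNA.fold_compound(sequence, md)
    structure, mfe = fc.mfe()
    return structure, mfe
```

A call such as `thermal_scan(seq, 20, 90, 5, vienna_fold_at)` would then give the data for the combined plot of bond count and energy against temperature.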
References
- Lauridsen, L. H., Shamaileh, H. A., Edwards, S. L., Taran, E., & Veedu, R. N. (2012). Rapid One-Step Selection Method for Generating Nucleic Acid Aptamers: Development of a DNA Aptamer against α-Bungarotoxin. PLOS ONE, 7(7), e41702. https://doi.org/10.1371/journal.pone.0041702
- Kohlberger, M., & Gadermaier, G. (2022). SELEX: Critical factors and optimization strategies for successful aptamer selection. Biotechnology and Applied Biochemistry, 69(5), 1771-1792. https://doi.org/10.1002/bab.2244
- Wilson, D. S., & Keefe, A. D. (2001). Random mutagenesis by PCR. Current Protocols in Molecular Biology, Chapter 8, Unit 8.3. https://doi.org/10.1002/0471142727.mb0803s51 (PMID: 18265275)
- NeuraSyn - IISER Mohali 2022 iGEM: https://2022.igem.wiki/iiser-mohali/