Overview

The Xanthomonas genome contains a few genes that are essential for biofilm formation and survival. As described in the biomodel section, our CRISPR machinery is designed to silence these genes, and the method is specific to the targeted gene only. From an evolutionary point of view, however, those genes can mutate and eventually render our bio-control agent ineffective.
To address this potential problem, we used machine learning tools to look for mutations that might arise in those genes and their protein products in the near future.

Objective

Use an ancestral-based machine learning method, which helps predict future domain-specific mutations, to identify the possible near-future mutations of the XopN protein in Xanthomonas oryzae.

Model

Mutation events are inherently stochastic. Building a machine learning model to predict them therefore requires a careful choice of features that capture and quantify the mutation process. We followed the model framework introduced by Sangeet and colleagues (Sangeet et al., 2022).
We chose this model for its demonstrated capacity to handle the random nature of mutations.

Feature selection:

The features in our model were taken directly from the aforementioned study, where they were shown empirically to capture the stochasticity of mutation events. The following features were used to train the model:

1. Amino Acid Pair Predictability

This feature exploits the variability in amino acid pairing within a protein sequence. Distinct amino acids tend to show specific co-occurrence patterns, so when one amino acid mutates, the pair it belongs to changes and the frequency of that pair changes with it. Using the prevalence of the pairs associated with a particular amino acid, we can compute both the observed (actual) frequency and the expected (predicted) frequency of each pair. Comparing the two gives insight into how mutations alter the protein sequence. The pair predictability for amino acids $i$ and $j$ is:

$PP_{ij}=\frac{\text{No. of } i}{\text{Length of sequence}}\times \frac{\text{No. of } j}{\text{Length of sequence}-1}\times \left(\text{Length of sequence}-1\right)$
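As a rough illustration of the pair predictability formula, the sketch below computes the expected value from residue counts and compares it with an observed count. It assumes that a "pair" means two adjacent residues in the sequence, which the text does not state explicitly; the function names and the toy sequence are hypothetical and are not taken from the original study.

```python
from collections import Counter


def expected_pair_frequency(seq: str, i: str, j: str) -> float:
    """Expected value of the pair (i, j) from composition alone, per the PP_ij formula."""
    length = len(seq)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = Counter(seq)
    # (No. of i / L) * (No. of j / (L - 1)) * (L - 1)
    return (counts[i] / length) * (counts[j] / (length - 1)) * (length - 1)


def observed_pair_frequency(seq: str, i: str, j: str) -> int:
    """Number of positions where residue i is immediately followed by residue j.

    Assumption: pairs are adjacent residues; the source does not define this.
    """
    return sum(1 for a, b in zip(seq, seq[1:]) if a == i and b == j)


# Hypothetical toy sequence, for illustration only.
toy_sequence = "MKLMKV"
expected_mk = expected_pair_frequency(toy_sequence, "M", "K")
observed_mk = observed_pair_frequency(toy_sequence, "M", "K")
```

A large gap between the observed and expected values for a pair would flag that pair as over- or under-represented relative to composition alone.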

2. Future Count of Amino Acid

This feature is based on the substitution probabilities implied by the RNA codon table. It captures the mutational likelihood of each amino acid, that is, the probability that a given amino acid mutates into another amino acid. Each amino acid is encoded by one or more codons, and a change in a nucleotide base of a codon can change the amino acid it encodes. The probabilities of such mutational events for amino acids are detailed in a referenced table. For instance, consider Methionine (M), encoded by the codon "AUG". Mutations at the first position of "AUG" yield the codons "CUG", "GUG", and "UUG", which translate to Leucine, Valine, and Leucine respectively. Mutations at the second position yield Lysine, Threonine, and Arginine, and all three mutations at the third position yield Isoleucine. Counting all nine single-nucleotide substitutions gives the final mutational probability relation for Methionine:

$\text{Mutational Probability of Methionine} = \frac{1}{9}R+ \frac{3}{9}I+ \frac{2}{9}L+ \frac{1}{9}K+ \frac{1}{9}T+ \frac{1}{9}V$

Fig 1: Mutational Probability of Amino acids inferred from the RNA codon table
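The Methionine relation can be reproduced by enumerating every single-nucleotide substitution of a codon and translating the result with the standard genetic code. The sketch below does this under the assumption that all nine substitutions are equally likely; the function name is hypothetical, and how the original study treats stop codons or amino acids with several codons is not covered here.

```python
from collections import Counter
from fractions import Fraction

BASES = "UCAG"
# Standard genetic code listed in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_mutation_profile(codon: str) -> dict:
    """Probability of each translated outcome after one single-base substitution.

    Assumption: each of the nine possible substitutions is equally likely.
    """
    outcomes = Counter()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                outcomes[CODON_TABLE[mutant]] += 1
    return {aa: Fraction(n, 9) for aa, n in outcomes.items()}


# Methionine is encoded only by AUG, so its profile comes from that one codon.
met_profile = codon_mutation_profile("AUG")
```

Using `Fraction` keeps the probabilities exact, so the result can be compared directly with the ninths in the relation above.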

3. Entropy of Amino Acid

This feature quantifies the impact of mutations through the entropy of individual residues. It rests on the principle of residue conservation over evolutionary time. If a protein sequence is not altered by mutations over successive generations, the amino acid at a given position stays conserved and the entropy of that residue is zero. When mutations are introduced, the conserved pattern is disrupted and the entropy increases. The feature therefore measures the evolutionary stability and variability of residues within a protein sequence. The entropy of each residue is calculated using Shannon's entropy, where $p_i$ is the frequency of symbol $i$ at that position:

$H= - \sum_{i\in \{A,T,G,C\}} p_i\log(p_i)$
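A minimal sketch of the entropy calculation for one position is given below. It takes the symbols observed at that position across a set of aligned sequences; the function name, the choice of logarithm base 2, and the way the column is assembled are assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def column_entropy(column, base: float = 2.0) -> float:
    """Shannon entropy of the symbols observed at one aligned position.

    `column` is any iterable of symbols (for example one character per
    sequence). The logarithm base is an assumption; base 2 gives bits.
    """
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column must contain at least one symbol")
    # sum of p * log(1 / p) is the same as -sum of p * log(p)
    return sum((n / total) * math.log(total / n, base) for n in counts.values())


conserved = column_entropy("AAAA")  # fully conserved position
variable = column_entropy("AATT")   # two symbols, equally frequent
```

A fully conserved position gives zero entropy, and the value grows as more distinct symbols appear at that position.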

Model Architecture:

We used a feedforward backpropagation neural network with a 3–8–8–1 architecture, that is, four layers. The first layer takes the three input features. Two hidden layers of eight neurons each then perform feature transformation and representation learning. The final layer has a single neuron, which outputs the prediction of our target variable.
For the activation functions, we used hyperbolic-tangent functions for the first three layers, which introduce non-linearity into the model and allow complex feature mapping. For the output layer we used the sigmoid activation function. This restricts the network's output to values between zero and one, so that it can be interpreted as the probability of a mutation event occurring.

Fig 2: pipeline for the model architecture
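To make the 3–8–8–1 layout concrete, the sketch below runs a single forward pass through a network of that shape using only the Python standard library. The weights are random placeholders rather than the trained model, the input layer is treated as a pass-through with tanh applied at the two hidden layers (an assumption about how "the first three layers" is meant), and all names are hypothetical. Training by backpropagation is not shown.

```python
import math
import random

LAYER_SIZES = [3, 8, 8, 1]  # input, two hidden layers, output


def init_network(seed: int = 0):
    """Create placeholder (untrained) weights and biases for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, features):
    """Forward pass: tanh on hidden layers, sigmoid on the output neuron."""
    if len(features) != LAYER_SIZES[0]:
        raise ValueError("expected exactly three input features")
    activations = list(features)
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        pre = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        activation = sigmoid if idx == last else math.tanh
        activations = [activation(z) for z in pre]
    return activations[0]


network = init_network()
# Hypothetical feature values: pair predictability, future count, entropy.
probability = forward(network, [0.6, 0.2, 1.0])
```

Because the last activation is a sigmoid, the returned value always lies strictly between zero and one, which is what allows it to be read as a mutation probability.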

Results

Fig 3: Model prediction for the potential mutational residues of XopN

To validate the model, we used the trained model to make predictions on a validation sequence. The model predicts a total of 57 positions that could mutate in a future version of XopN. Of these 57 predicted positions, 9 correspond to residues that have actually mutated in the validation sequence. The dotted line at 0.15 (user-defined) is the cutoff value: residues with a mutational probability above 0.15 are considered more likely to mutate in future generations, while residues below the cutoff are considered less likely to mutate. The following predicted positions coincided with actual mutations: 31, 88, 91, 106, 156, 290, 494, 524, 531, and 581.
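The validation step amounts to applying the cutoff to the per-residue probabilities and intersecting the selected positions with those that actually changed in the validation sequence. The sketch below shows that logic on made-up numbers; the positions, probabilities, and function names are hypothetical and do not reproduce the XopN results.

```python
def predicted_sites(probabilities: dict, cutoff: float = 0.15) -> set:
    """Positions whose predicted mutational probability exceeds the cutoff."""
    return {pos for pos, p in probabilities.items() if p > cutoff}


def confirmed_sites(predicted: set, observed: set) -> list:
    """Predicted positions that also mutated in the validation sequence."""
    return sorted(predicted & observed)


# Hypothetical toy data, for illustration only (position -> probability).
toy_probabilities = {5: 0.02, 12: 0.21, 27: 0.18, 40: 0.30, 63: 0.15}
toy_observed = {12, 27, 90}

toy_predicted = predicted_sites(toy_probabilities)
toy_confirmed = confirmed_sites(toy_predicted, toy_observed)
```

The comparison is strict, so a residue sitting exactly at the cutoff is not selected; whether the original analysis treats that boundary the same way is an assumption.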

Conclusions

The results show that the model has learned to assign higher mutational probability to the mutational sites. The overall predicted probability of mutation is still quite low; we expect that training on a much larger data set than our 63,000-datapoint training set would improve this. From these results, we conclude with reasonable confidence that, given a current sequence of the XopN gene, the model can indicate which mutations are possible in the sequence.

Machine Learning Model for Mutational Response in Xanthomonas
