Our artificial intelligence (AI) model was constructed and trained using the pre-training and fine-tuning paradigm, with the aim of providing subsequent iGEM teams and other biological researchers with a new, highly accurate AI-for-biology paradigm that is well suited to small sample sizes. The final results demonstrated the effectiveness of this pre-training and fine-tuning paradigm, particularly highlighting the advantages of our AI model on small-sample data, while remaining on par with carefully constructed AI models on large-sample data.
We developed two models: an AI model and a base mutation model. The AI model aims to predict expression intensity based on core promoter sequences. With this AI model, we can predict expression intensity for a wide range of core promoters without the need for wet lab experiments, allowing us to identify high expression promoter sequences. The base mutation model simulates the occurrence of base mutations in core promoters during continuous cultivation and predicts the resulting promoter sequences after mutation. Combined with the AI model, we can predict the expression intensity of mutated promoters, simulate the evolution process of promoters, and select high expression promoter sequences resistant to mutations.
In our artificial intelligence (AI) model, we treat DNA sequences as a biological language and process them much as natural language is processed, with short DNA subsequences playing the role of words. Our project further demonstrates the effectiveness of this approach: AI models can perform deep learning on DNA sequences, extracting features that are difficult for humans to identify, and apply them to downstream tasks such as predicting expression intensity from promoter sequences.
In the base mutation model, we observed that many randomly synthesized sequences in the original dataset lacked well-defined core promoter structures, making it inappropriate to apply existing mutation hotspots and preferences. After reviewing the literature, we found that random mutations could be used. Therefore, we hypothesize that the promoter sequences have no specific mutation hotspots or preferences and only consider mutations inside the core promoter sequence.
Our artificial intelligence (AI) model is essentially a regression model, and as such, we selected the Pearson correlation coefficient as the evaluation metric to assess goodness of fit. This coefficient is commonly used to measure the linear relationship between two sets of variables. It ranges from -1.0 to 1.0; in terms of absolute value, 0.8-1.0 indicates a very strong correlation, 0.6-0.8 a strong correlation, 0.4-0.6 a moderate correlation, 0.2-0.4 a weak correlation, and 0.0-0.2 a very weak or no correlation.
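To make this concrete, here is a minimal sketch of how the Pearson correlation coefficient between predicted and measured expression values can be computed; the arrays shown are illustrative placeholders, not our actual data.

```python
import numpy as np
from scipy.stats import pearsonr  # SciPy's standard Pearson implementation

# Illustrative values; in practice these hold the model's predicted expression
# levels and the measured expression levels for the test set.
predicted = np.array([2.1, 5.3, 7.8, 3.2, 9.0])
measured = np.array([2.4, 5.0, 8.1, 2.9, 8.7])

r, p_value = pearsonr(predicted, measured)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```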
For evaluating resistance to mutations, we used the total expression intensity accumulated over the mutation generations as our measure for each of the top 1000 promoter sequences. Since the initial expression intensity is similar for all sequences, the larger the sum of predicted expression intensities across the 100 generations of mutated sequences, the more mutation-resistant and more highly expressed the sequence is considered to be.
It refers to the use of pre-trained models that have been trained on a large and diverse dataset. These pre-trained models can then be fine-tuned on specific datasets for transfer learning to specific tasks.
It refers to the process of adjusting the crucial parameters of a model to make the model's output approximate the measured quantities. This is achieved by quantifying the deviation between the model's output and the measured quantities using a loss function, and then updating the parameters of the Neural Network using an optimizer to minimize the loss. By repeating this process, the model's output can, to some extent, represent the measured quantities, i.e., make predictions.
A tokenizer, or word segmenter, is used to divide a given sequence into shorter subsequences of length k (k-mers). These k-mers are then converted into numerical values based on a predefined mapping table, allowing the computer to extract feature values from them.
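For illustration, here is a minimal sketch of k-mer tokenization in the style described above; the function names and example sequence are ours and not taken from any specific library. We use an overlapping sliding window, as DNABERT-style tokenizers do.

```python
from itertools import product

def kmer_tokenize(sequence, k=4):
    """Split a DNA sequence into overlapping k-mers (sliding window of size k)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocab(k=4):
    """Mapping table from every possible k-mer to an integer id."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

vocab = build_vocab(k=4)
tokens = kmer_tokenize("ACGTACGGT", k=4)
token_ids = [vocab[t] for t in tokens]
print(tokens)     # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGG', 'CGGT']
print(token_ids)  # the numeric values fed to the neural network
```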
The optimizer updates the model's parameters in the correct direction and by an appropriate magnitude during the backpropagation process of deep learning, allowing the updated parameters to continuously approach the global minimum of the objective (loss) function.
When applying the gradient descent algorithm, the gradient term in the weight-update rule is multiplied by a coefficient called the learning rate α. If the learning rate is too small, convergence will be slow. On the other hand, if the learning rate is too large, the cost function may oscillate and the iterations may move too quickly, causing gradient descent to overshoot the global minimum or even diverge. As shown in the figure below, lower loss corresponds to better results.
The loss function measures the degree of deviation between the predicted and actual values. A smaller loss function indicates a better prediction performance.
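To make the roles of the loss function, optimizer, and learning rate concrete, here is a minimal toy sketch (not our training code) of gradient descent on a mean-squared-error loss with a one-parameter model, showing how the learning rate α changes convergence behaviour.

```python
import numpy as np

def mse_loss(w, x, y):
    """Mean squared error of a one-parameter model y_hat = w * x."""
    return np.mean((w * x - y) ** 2)

def gradient(w, x, y):
    """Derivative of the MSE loss with respect to w."""
    return np.mean(2 * x * (w * x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])          # true relationship: y = 2 * x

for lr in (0.01, 0.1, 1.0):            # too small, reasonable, too large
    w = 0.0
    for _ in range(20):
        w -= lr * gradient(w, x, y)    # update rule: w <- w - alpha * gradient
    print(f"lr={lr}: w={w:.3g}, loss={mse_loss(w, x, y):.3g}")
```

With the small learning rate the loss decreases only slowly, the moderate one converges to w ≈ 2, and the large one diverges, which is exactly the behaviour described above.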
1. The basic operation of our AI model is as follows: first, the sequence is split into tokens by the tokenizer and transformed into corresponding numbers; next, the neural network extracts feature values, which are compressed into a single number by a linear layer.
2. Training the artificial intelligence model: we collected data, constructed the model framework, and then performed fine-tuning. Subsequently, we trained the model from scratch using sample sizes of 300, 3,000, 30,000, 300,000, 3,000,000, and 6,000,000. We assessed the goodness of fit of our AI model using dedicated test datasets and found that the AI models we built using the pre-training and fine-tuning paradigm outperformed AI models built by others specifically for the original data, especially in the case of small sample sizes.
1. We extensively searched for raw data suitable for training artificial intelligence and finally found a dataset made publicly available in a Nature article [1]. The dataset consists of 30 million pairs of core promoter sequences and expression levels. The format is shown in the diagram below: the randomly synthesized core promoter sequence comes first, followed by the high-throughput measured expression intensity. Specifically, the expression value is the log2(RFP/YFP) of the dual-fluorescence expression driven by the promoter (further details in the wet lab section).
2. Initially, we chose DNABERT as the pre-trained model to load.
3. We wrote code for a linear layer added on top of the original BERT model, enabling the BERT pre-trained model to produce a single output (see the sketch after this list).
4. We determined the format of the raw data and wrote code to read and transform it into the appropriate format.
5. We conducted literature research and tentatively determined the important parameters, such as the k-mer length used for tokenization, the learning rate, the optimizer, and the loss function. These parameters were then tuned individually.
6. We determined the evaluation metrics for the regression model, including Mean Squared Error (MSE), R2, and Pearson correlation coefficient. We then developed code to calculate these evaluation metrics.
7. We wrote code to record the evaluation metrics calculated during the training process into an Excel spreadsheet for visualization purposes.
8. Training on a single GPU was too slow, so we employed multiple GPUs for parallel training and implemented it through code.
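To illustrate steps 2 and 3 above (loading a pre-trained model and adding a linear layer for a single output), here is a minimal sketch using the Hugging Face transformers library; the checkpoint name, class name, and example input are illustrative placeholders rather than our exact code.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PromoterRegressor(nn.Module):
    """A BERT encoder plus one linear layer that compresses the pooled
    sequence features into a single predicted expression value."""

    def __init__(self, pretrained_name):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_features = out.last_hidden_state[:, 0]   # [CLS] token features
        return self.regressor(cls_features).squeeze(-1)

# Placeholder for the actual DNABERT checkpoint we loaded.
DNABERT_CHECKPOINT = "path/to/dnabert-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(DNABERT_CHECKPOINT)
model = PromoterRegressor(DNABERT_CHECKPOINT)

# Space-separated overlapping k-mers, as produced by a k-mer tokenizer.
batch = tokenizer(["ACGT CGTA GTAC TACG"], return_tensors="pt", padding=True)
predicted_expression = model(batch["input_ids"], batch["attention_mask"])
```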
1. We reviewed the Nature paper [1] as the source of the data and determined their use of the Adam optimizer and loss function. We temporarily adopted them as our optimizer and loss function.
2. We conducted a search for the range of kmer and tentatively selected a value.
3. We chose a commonly used learning rate and ran the code to ensure functionality. Then, we gradually reduced the learning rate until the calculated Pearson correlation coefficient no longer showed "NUM" but displayed a normal numeric value. This helped us determine the maximum learning rate.
4. Using the determined maximum learning rate, we tried different k-mer lengths (3, 4, 5, and 6) and found that k-mer 4 gave higher training accuracy and reached the highest Pearson correlation coefficient most quickly.
5. Because the full dataset is very large, training on all of the data consumes a significant amount of time. In the initial stage we therefore used a training set consisting of only 10% of the data together with the SGD optimizer, which is well suited to random mini-batches of samples. We then trained more deeply, gradually decreasing the learning rate until the Pearson correlation coefficient no longer increased and converged to approximately 0.83.
6. We then attempted different optimizers: Adamax, a variant of Adam; RMSprop, which addresses the gradient explosion problem and is suitable for non-stationary objectives; and Adagrad, which automatically adjusts the learning rate for each parameter (see the sketch after this list). However, the Pearson correlation coefficient remained around 0.84.
7. During discussions with other iGEM teams, they pointed out the presence of conserved sequences at both ends of the original sequences. We speculated that these conserved sequences may have affected the deep learning feature extraction of the model. Subsequently, we removed the conserved sequences at both ends and reinitialized the training. However, we observed no significant impact, and the final performance remained at 0.84.
8. We determined that DNABERT achieved a final performance of 0.84. After consulting our instructor, we discovered several other pre-trained models that are also suitable for our project. Particularly, the recently developed DNABERT2 showed potentially better performance. Combining our instructor's advice and literature research, we identified three additional pre-trained models: DNABERT2, BioBERT, and RNAprob. Their highest Pearson correlation coefficients were found to be 0.71, 0.59, and 0.78. Based on these findings, we decided to proceed with DNABERT as our pre-trained model. We then trained the best pre-training and fine-tuning combinations with the entire dataset of 30 million instances. As a result, the highest Pearson correlation coefficient reached 0.85, indicating only a marginal improvement.
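As referenced in step 6, switching optimizers in PyTorch only requires changing a single line; here is a minimal sketch with illustrative learning rates, in which the stand-in model is a placeholder for our BERT-based regressor.

```python
import torch

# Stand-in for the BERT-plus-linear-layer regressor; any nn.Module works here.
model = torch.nn.Linear(10, 1)

optimizers = {
    "Adam":    torch.optim.Adam(model.parameters(), lr=1e-5),
    "Adamax":  torch.optim.Adamax(model.parameters(), lr=1e-5),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=1e-5),
    "Adagrad": torch.optim.Adagrad(model.parameters(), lr=1e-5),
}
optimizer = optimizers["Adamax"]   # one optimizer is selected per training run
```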
1. We used our trained model to predict the expression levels of the 30 million original sequences. Then, we plotted the predicted expression levels against the measured ones, as shown in the figure below; the x-axis represents the model's predicted expression levels, while the y-axis represents the actual measured expression levels. For more details, refer to the dry lab cycle of "engineering success." We observed that the errors were larger for the integer-valued expression measurements. Therefore, we decided to remove these integer values and extract a new small sample to restart the training of our model. This approach better fits the scenario of using small samples obtained without high-throughput measurements.
2. We extracted subsets of six million, three million, three hundred thousand, thirty thousand, three thousand, and three hundred data points from a pool of six million data points with integer values removed. Each subset was used to train the model from scratch, and the optimizer and learning rate were tuned separately for each subset. To minimize the impact of randomness in small-sample extraction, we fixed the random seed.
3. Then, we randomly selected corresponding numbers of data points from the original thirty million data points without removing integer values, and trained our model again from scratch. We observed that training the model on the data with integer values removed led to better outcomes.
4. We found that small-sample data is susceptible to overfitting. To address this issue, we applied weight decay as a regularization technique and tuned the learning rate and weight decay separately for each data group (see the sketch after this list), resulting in noticeable improvements.
5. Subsequently, we used the integer-removed data to extract corresponding numbers of data points for training the model described in the Nature article [1], utilizing the code provided by the authors and adhering to their recommended training parameters. Our models consistently outperformed these comparison models in accuracy.
6. We also observed that the Pearson correlation coefficient in that Nature article [1] reached 0.960. This was achieved using a dedicated test set of 60,000 independently measured data points; with its reduced errors, this test set yielded superior performance. To reproduce the finding, we trained models on datasets of varying sizes and used them to predict the sequences in this test set. By comparing the predicted expression levels with the measured ones, we successfully replicated the results reported in the Nature article [1]: our Pearson correlation coefficient was 0.959, a negligible deviation of 0.001. Notably, all our models outperformed those described in the aforementioned article [1], and the model trained with the complete dataset showed the highest goodness of fit, attaining a Pearson correlation coefficient of 0.963.
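As referenced in steps 2 and 4 above, here is a minimal sketch of fixing the random seed before extracting a small sample and of adding weight decay to the optimizer as regularization; the seed, sample size, and weight-decay value shown are illustrative, not our tuned settings.

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix the random number generators so that small-sample extraction
    and training are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

# Draw a fixed small sample (here 300 indices) from the integer-removed pool.
pool_size, sample_size = 6_000_000, 300
sample_indices = np.random.choice(pool_size, size=sample_size, replace=False)

# Weight decay (L2 regularization) is passed directly to the optimizer.
model = torch.nn.Linear(10, 1)   # stand-in for the BERT-based regressor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.01)
```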
In the base mutation model, we noticed that many of the randomly synthesized sequences in the original data lacked a discernible core promoter structure, making it inappropriate to apply existing mutation hotspots and preferences. After a literature search, we found that random mutations could be used instead. Consequently, we assumed that the promoter sequences possess no specific mutation hotspots or preferences, and we considered only mutations occurring inside the core promoter sequence. To explore this further, we subjected the top 1000 sequences by expression level to 100 generations of random mutations, with each generation undergoing 1 to 3 random substitutions to other bases.
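A minimal sketch of this random-mutation procedure: each generation applies 1 to 3 random base substitutions within the core promoter, for 100 generations, and the predicted expression of each generation is recorded. The predictor passed in here is a dummy stand-in for our trained AI model.

```python
import random

BASES = "ACGT"

def mutate(sequence, n_min=1, n_max=3):
    """Apply 1 to 3 random single-base substitutions inside the core promoter."""
    seq = list(sequence)
    for pos in random.sample(range(len(seq)), k=random.randint(n_min, n_max)):
        seq[pos] = random.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def simulate_evolution(sequence, predict, generations=100):
    """Mutate for `generations` rounds, predicting expression after each round.
    The summed predicted expression is the mutation-resistance score."""
    trajectory = []
    for _ in range(generations):
        sequence = mutate(sequence)
        trajectory.append(predict(sequence))
    return trajectory, sum(trajectory)

# Usage with a dummy predictor; in our pipeline, `predict` is the trained AI model.
trajectory, resistance_score = simulate_evolution("ACGTTGCA" * 10, predict=lambda s: 0.0)
```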
Subsequently, we generated visual representations of a simulated in vitro evolution experiment using the base mutation model. From the pool of high-expression promoter sequences, we chose two sequences that exhibited resistance to mutations, one sequence that did not, and two sequences that were highly susceptible to mutations. These five promoter sequences underwent 100 generations of simulated random mutation, and in each generation we used the artificial intelligence model to predict the corresponding expression levels. We then plotted the changes in expression intensity across the 100 generations of random mutations for these five promoter sequences. The base mutation model thus aids in understanding the evolutionary patterns of promoter sequences, facilitating further classification and identification of mutation-resistant, high-expression promoter sequences.
The red curve in the figure above represents the expression intensity curve of mutation-resistant promoters, while the blue curve depicts mutation-sensitive sequences, and the yellow curve indicates non-mutation-resistant sequences. It is evident that mutation-resistant, high-expression promoters are less susceptible to the impacts of mutations and consistently maintain elevated expression levels. Conversely, mutation-sensitive, high-expression sequences are significantly affected by mutations, leading to a steep decline in expression intensity once a mutation occurs, ultimately resulting in an overall state of low expression intensity. As for non-mutation-resistant, high-expression promoters, their expression intensity exhibits considerable variability throughout the 100 generations. At times, they display high expression intensity, while at other times, they demonstrate lower levels. By developing a base mutation model and conducting computer-based simulations to replicate the evolutionary progression of promoter sequences, we enhance our understanding of their evolutionary patterns. This, in turn, enables us to further categorize high-expression promoter sequences and identify mutation-resistant variants, facilitating subsequent wet lab experiments with increased accuracy.
As shown by wet lab experiments, our anti-mutant promoters exhibit a much more stable and constant expression rate than natural constitutive promoters, which indicates that they are truly anti-mutant and highly conserved, and demonstrates the success of our base mutation model.

Our AI model, constructed using the pre-training and fine-tuning paradigm, was trained on datasets of different sizes: 300, 3,000, 30,000, 300,000, 3,000,000, and 30,000,000. We also trained a comparative AI model, as described in the Nature article [1], using datasets of the same sizes as ours. The results show that our AI model consistently outperforms the comparative model in terms of goodness of fit. This advantage is particularly pronounced for small sample sizes, as depicted in the figure below. Generally, the comparative models require ten times more data to achieve a level of fit similar to our AI model's. This directly demonstrates the efficacy of our pre-training and fine-tuning paradigm in addressing the issue of limited data. For more details, please refer to the "Proofs" section.
In addition, our pre-training and fine-tuning paradigm has another advantage besides high accuracy and low data requirement, which is its low entry barrier. The experimental group was almost entirely composed of one student, who had no prior knowledge of artificial intelligence. The entire project, from self-learning to completion, was carried out by this student, fully demonstrating the low entry barrier and ease of use of our paradigm for students with limited background in artificial intelligence.
Wet lab experiments clearly support the success of our model: the extremely high-expression promoters generated by our AI model can drive expression more than 2 times that of natural constitutive promoters, as measured by transcription, protein density, and fluorescence intensity. Furthermore, this ability is retained when the downstream codons are changed. These thrilling results show that our AI model, Pymaker, is a great success!
Our AI Pymaker and base mutation models give us a brand-new understanding of the evolutionary pattern of yeast promoter sequences, and a deep insight into the highly complex mechanisms behind the interaction between cis- and trans-acting elements.
Our AI Pymaker and base mutation models give us the ability to predict expression intensity from core promoter sequences and to simulate the promoter evolution process, which has never been done so successfully before. Our success in experiments proves that our models are both theoretically and practically powerful.
Our AI Pymaker and base mutation models verify that the ‘pre-train + fine-tuning’ paradigm can effectively and practically address the issue of small dataset size.
Our AI Pymaker and base mutation models can be integrated to identify the sequence mutations that cause changes in expression levels. In other words, we aim to identify the sequences responsible for high promoter expression, that is, the functional hot spots in promoter sequences that remain unknown to this day, and to study the biological mechanisms underlying these high-expression sequences in conjunction with wet lab experiments in the future.
All these thrilling things can be successfully done using our AI Pymaker and base mutation models.
Furthermore, larger-scale and longer-term wet lab experiments can be conducted to measure a greater amount of data, collecting tens of thousands of low-throughput, high-precision data points. These data can then be fed back into our computational experiments, allowing us to further improve the goodness of fit of our AI model. Additionally, multiple generations of cultivation followed by sequencing can be performed, enabling evolutionary analysis to identify hotspots and mutation preferences in the promoter region. This information can be used to optimize the base mutation model and can also provide data for studying factors that influence expression intensity due to sequence mutations in regions other than the core promoter. This research can in turn help improve our understanding of promoter structure and enable the addition of new parameters to our base mutation model.
A major lesson learned from our experimental work is the crucial importance of data for training artificial intelligence models. Communication with other iGEM teams revealed that many teams are struggling with data-related issues, particularly data quality. In our own experiments, the low quality of the raw data and the high level of noise in the test set prevented accurate testing of the goodness of fit of the AI model. However, by using a high-precision small test set, we were able to successfully evaluate the goodness of fit of the AI model and replicate the experimental results described in the Nature article [1]. Therefore, finding high-quality data is of utmost importance, followed by the quantity of the data. The pre-training and fine-tuning paradigm we designed exhibits high accuracy and is well-suited for low-throughput, high-precision small sample sizes, effectively addressing the issue of insufficient sample quantity. Furthermore, our model performs just as well as meticulously constructed AI models on larger datasets. In the future, other iGEM teams and biological researchers can draw inspiration from our pre-training and fine-tuning paradigm.
Additionally, some non-specialized students in artificial intelligence may hesitate to engage in AI for biology tasks, considering it a domain reserved for students specializing in artificial intelligence or bioinformatics. Historically, this may have been the case due to the high entry barrier for constructing AI models. However, our pre-training and fine-tuning paradigm effectively lowers the entry barrier for AI in biology. We hope that our example will encourage other iGEM teams to actively embrace the revolutionary and efficient tool of artificial intelligence.
[1] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.