MODEL

Discover the beauty of AI × synthetic biology!

Overview

Our artificial intelligence (AI) model was constructed and trained using the pre-training and fine-tuning paradigm, with the aim of providing subsequent iGEM teams and other biological researchers with a new, highly accurate AI-for-biology paradigm that is better suited to small sample sizes. The final results demonstrate the effectiveness of this paradigm: our AI model shows a clear advantage on small sample data while remaining on par with carefully constructed, task-specific AI models on large sample data.

We developed two models: an AI model and a base mutation model. The AI model predicts expression intensity from core promoter sequences. With it, we can predict the expression intensity of a wide range of core promoters without wet lab experiments, allowing us to identify high-expression promoter sequences. The base mutation model simulates the base mutations that occur in core promoters during continuous cultivation and outputs the resulting mutated promoter sequences. Combined with the AI model, it lets us predict the expression intensity of mutated promoters, simulate the evolution of promoters, and select high-expression promoter sequences that are resistant to mutations.

Hypothesis

In our artificial intelligence (AI) model, we treat DNA sequences as a bio-language and process them in much the same way as natural language, treating DNA bases as words. Our project further demonstrates the effectiveness of this approach: AI models can perform deep learning on DNA sequences, extract features that are difficult for humans to identify, and apply them to downstream tasks such as predicting expression intensity from promoter sequences.

In the base mutation model, we observed that many of the randomly synthesized sequences in the original dataset lacked well-defined core promoter structures, making it inappropriate to apply known mutation hotspots and preferences. After reviewing the literature, we found that random mutations could be used instead. We therefore hypothesize that the promoter sequences have no specific mutation hotspots or preferences, and we only consider mutations inside the core promoter sequence.

Evaluation indicators

Our artificial intelligence (AI) model is essentially a regression model, so we selected the Pearson correlation coefficient as the evaluation metric to assess goodness of fit. This coefficient is commonly used to measure the linear relationship between two sets of variables. It ranges from -1.0 to 1.0; in terms of absolute value, 0.8-1.0 indicates a very strong correlation, 0.6-0.8 a strong correlation, 0.4-0.6 a moderate correlation, 0.2-0.4 a weak correlation, and 0.0-0.2 a very weak or no correlation.
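
As a concrete illustration, the minimal sketch below shows how this metric can be computed with SciPy; the example values are placeholders rather than our actual data.

import numpy as np
from scipy.stats import pearsonr

# Measured vs. predicted expression intensities (illustrative values only).
measured = np.array([3.1, 7.4, 5.2, 9.8, 4.6])
predicted = np.array([2.9, 7.0, 5.5, 10.1, 4.2])

r, p_value = pearsonr(measured, predicted)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")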

For evaluating resistance to mutations, we used the total expression intensity accumulated over the mutation generations as our measure for each of the top 1000 promoter sequences. Since the initial expression intensity is similar for all sequences, the larger the sum of predicted expression intensities across the 100 generations of mutated sequences, the more mutation-resistant and highly expressed the sequence is considered.
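
A minimal sketch of this resistance measure is given below; predict is a hypothetical wrapper around the AI model, and lineage stands for the list of mutated sequences produced from one starting promoter, one per generation.

def resistance_score(lineage, predict):
    # Sum the AI-predicted expression intensities over all mutation generations;
    # the larger the sum, the more mutation-resistant the original promoter.
    return sum(predict(seq) for seq in lineage)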

Background knowledge


Pre-train
Fine-tuning
Tokenizer
Optimizer
Learning rate
Loss function
Artificial Intelligence (AI) model

1. The basic operation of our AI model is as follows: first, the sequence is tokenized by the tokenizer and transformed into corresponding numbers; next, the neural network extracts feature values, which are then compressed into a single number through a linear layer.
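
To make this pipeline concrete, here is a minimal PyTorch-style sketch of the tokenize → numbers → feature extraction → linear layer steps; the base-level vocabulary, the LSTM encoder, and all dimensions are illustrative placeholders, not the actual Pymaker architecture.

import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(sequence):
    # Turn a DNA sequence into a tensor of token ids (one id per base).
    return torch.tensor([VOCAB[b] for b in sequence.upper()], dtype=torch.long)

class ExpressionRegressor(nn.Module):
    def __init__(self, vocab_size=4, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                   # ids -> vectors
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)    # feature extraction
        self.head = nn.Linear(hidden_dim, 1)                               # compress to one number

    def forward(self, ids):
        x = self.embed(ids)                  # (batch, length, embed_dim)
        _, (h, _) = self.encoder(x)          # final hidden state summarises the sequence
        return self.head(h[-1]).squeeze(-1)  # predicted expression intensity

model = ExpressionRegressor()
ids = tokenize("TATAAAAGGCATCGTA").unsqueeze(0)  # add a batch dimension
print(model(ids))                                # one scalar per sequence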

2. Training the artificial intelligence model: we collected data, constructed the model framework, and then performed fine-tuning. Subsequently, we trained the model on sample sizes of 300, 3,000, 30,000, 300,000, 3,000,000, and 6,000,000. We assessed the goodness of fit of our AI model using dedicated test datasets and found that the AI models built with the pre-training and fine-tuning paradigm outperformed AI models built by others specifically for the original data, especially in the case of small sample sizes.
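
The loop below is a hedged sketch of what fine-tuning on one of the small labelled datasets might look like, reusing the ExpressionRegressor sketch above as a stand-in for the pre-trained model; the optimizer, learning rate, epoch count, and dummy data are illustrative, not our exact settings.

import torch
import torch.nn as nn

# Stand-in for the pre-trained model; in practice the pre-trained weights would be loaded here.
model = ExpressionRegressor()

# Dummy fine-tuning data: 300 token-id sequences of length 80 and their measured expression.
ids = torch.randint(0, 4, (300, 80))
expression = torch.rand(300)

criterion = nn.MSELoss()                                   # regression loss on expression intensity
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    optimizer.zero_grad()
    prediction = model(ids)                                # forward pass
    loss = criterion(prediction, expression)
    loss.backward()                                        # back-propagation
    optimizer.step()                                       # parameter update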


Data collection and model framework construction
Fine-tuning
Training with small sample size
Base mutation model

In the base mutation model, we noticed that many of the sequences randomly synthesized in the original data lacked a discernible core promoter structure, making it inappropriate to apply known mutation hotspots and preferences. After a literature search, we found that random mutations could be used instead. Consequently, we assumed that the promoter sequences have no mutation hotspots or preferences, and we restricted mutations to the core promoter sequence. To explore this further, we subjected the top 1000 sequences by expression level to 100 generations of random mutations, with each generation undergoing 1 to 3 random mutations, each substituting a base with a different one.
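
A minimal sketch of this random-mutation procedure is shown below, assuming uniform randomness over positions and bases; the function names are illustrative.

import random

BASES = "ACGT"

def mutate_once(sequence):
    # Apply 1-3 random single-base substitutions within the core promoter sequence.
    seq = list(sequence)
    for pos in random.sample(range(len(seq)), k=random.randint(1, 3)):
        seq[pos] = random.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def simulate_lineage(sequence, generations=100):
    # Return the mutated sequence obtained at each of the 100 generations.
    lineage = []
    for _ in range(generations):
        sequence = mutate_once(sequence)
        lineage.append(sequence)
    return lineage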

Subsequently, we generated visual representations of the simulated evolution experiment for the base mutation model. From the pool of high-expression promoter sequences, we chose two sequences that were resistant to mutations, one that was not, and two that were highly susceptible to mutations. These five promoter sequences underwent 100 generations of simulated random mutation, and in each generation we used the AI model to predict the corresponding expression level. We then plotted the changes in expression intensity across the 100 generations of random mutations for these five promoter sequences. The base mutation model helped us understand the evolutionary patterns of promoter sequences and further classify and identify mutation-resistant, high-expression promoter sequences.
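
The expression-intensity trajectory plotted for each of the five promoters can be obtained with a loop like the sketch below, which combines mutate_once from the sketch above with a hypothetical predict wrapper around the AI model.

def evolve_and_predict(sequence, predict, generations=100):
    # Mutate the promoter generation by generation and record the AI-predicted
    # expression intensity after each round of mutation.
    intensities = []
    for _ in range(generations):
        sequence = mutate_once(sequence)
        intensities.append(predict(sequence))
    return intensities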

The red curve in the figure above represents the expression intensity of mutation-resistant promoters, the blue curve depicts mutation-sensitive sequences, and the yellow curve indicates non-mutation-resistant sequences. It is evident that mutation-resistant, high-expression promoters are less susceptible to the effects of mutations and consistently maintain elevated expression levels. Conversely, mutation-sensitive, high-expression sequences are strongly affected by mutations: once a mutation occurs, their expression intensity drops steeply, resulting in an overall state of low expression. Non-mutation-resistant, high-expression promoters show considerable variability across the 100 generations, at times displaying high expression intensity and at other times lower levels.

By building a base mutation model and running computer-based simulations of the evolution of promoter sequences, we improve our understanding of their evolutionary patterns. This in turn enables us to further categorize high-expression promoter sequences and identify mutation-resistant variants, guiding subsequent wet lab experiments with increased accuracy.

As shown by our wet lab experiments, the anti-mutant promoters maintain a much more stable and constant expression rate than natural constitutive promoters, which indicates that they are truly mutation-resistant and highly conserved and demonstrates the success of our base mutation model.

Achievements

Our AI model, constructed using the pre-training and fine-tuning paradigm, was trained on datasets of different sizes: 300, 3,000, 30,000, 300,000, 3,000,000, and 30,000,000. We also trained a comparative AI model, as described in the Nature article [1], using datasets of the same sizes as ours. The results show that our AI model consistently outperforms the comparative model in terms of goodness of fit. This advantage is particularly pronounced when dealing with small sample sizes, as depicted in the figure below. Generally, the comparative models require ten times more data to achieve a level of fit similar to our AI model. This directly demonstrates the efficacy of our pre-training and fine-tuning paradigm in addressing the issue of limited data. For more details, please refer to the "Proofs" section.

In addition to high accuracy and a low data requirement, our pre-training and fine-tuning paradigm has another advantage: a low entry barrier. The modelling work was carried out almost entirely by one student, who had no prior knowledge of artificial intelligence. The entire project, from self-learning to completion, was done by this student, fully demonstrating the low entry barrier and ease of use of our paradigm for students with limited background in artificial intelligence.

Wet lab experiments clearly support the success of our model: the extremely high-expression promoters generated by our AI model can drive expression more than twice that of natural constitutive promoters, as measured by transcription level, protein density, and fluorescence intensity. Furthermore, this ability is retained when the downstream codons change. These thrilling results show that our AI model, Pymaker, is a great success!

Success and Future prospects

Our AI Pymaker and base mutation models give us a brand new understanding of the evolutionary patterns of yeast promoter sequences, and a deep insight into the highly complex mechanisms behind the interaction between cis- and trans-acting elements.

Our AI Pymaker and base mutation models give us the ability to predict expression intensity from core promoter sequences and to simulate the promoter evolution process, which has never been done so successfully before. Our success in experiments proves that our models are theoretically and practically powerful.

Our AI Pymaker and base mutation models verify that the ‘pre-train + fine-tuning’ paradigm can effectively and practically address the issue of small dataset size.

Our AI Pymaker and base mutation models can be integrated to identify the sequence mutations that cause changes in expression levels. In other words, we aim to identify the sequences responsible for high promoter expression, namely the functional hotspots in promoter sequences that remain unknown to this day, and to study the biological mechanisms underlying these high-expression sequences in conjunction with wet lab experiments in the future.

All these thrilling things can be successfully done using our AI Pymaker and base mutation models.

Furthermore, larger-scale and longer-term wet lab experiments can be conducted to measure a greater amount of data, collecting tens of thousands of low-throughput, high-precision data points. These data can then be fed back into our computational experiments, allowing us to further improve the goodness of fit of our AI model. Additionally, multiple generations of cultivation followed by sequencing can be performed, enabling evolutionary analysis to identify hotspots and mutation preferences in the promoter region. This information can be used to optimize the base mutation model, and it can also provide data for studying how sequence mutations in regions other than the core promoter influence expression intensity. This research can in turn improve our understanding of promoter structure and enable the addition of new parameters to our base mutation model.

Lessons Learned and Assistance

A major lesson learned from our experimental work is the crucial importance of data for training artificial intelligence models. Communication with other iGEM teams revealed that many teams are struggling with data-related issues, particularly data quality. In our own experiments, the low quality of the raw data and the high level of noise in the test set prevented accurate testing of the goodness of fit of the AI model. However, by using a high-precision small test set, we were able to successfully evaluate the goodness of fit of the AI model and replicate the experimental results described in the Nature article [1]. Therefore, finding high-quality data is of utmost importance, followed by the quantity of the data. The pre-training and fine-tuning paradigm we designed exhibits high accuracy and is well-suited for low-throughput, high-precision small sample sizes, effectively addressing the issue of insufficient sample quantity. Furthermore, our model performs just as well as meticulously constructed AI models on larger datasets. In the future, other iGEM teams and biological researchers can draw inspiration from our pre-training and fine-tuning paradigm.

Additionally, some students who do not specialize in artificial intelligence may hesitate to engage in AI-for-biology tasks, considering them a domain reserved for students specializing in artificial intelligence or bioinformatics. Historically, this may have been the case due to the high entry barrier for constructing AI models. However, our pre-training and fine-tuning paradigm effectively lowers the entry barrier for AI in biology. We hope that our example will encourage other iGEM teams to actively embrace the revolutionary and efficient tool of artificial intelligence.

Reference

[1] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.