Project Description | NJU-China

Overview

Among the people we surveyed who are familiar with AI applications in synthetic biology, the general view is that artificial intelligence has bright prospects in the synthetic biology industry. When asked what specific conveniences or breakthroughs AI can bring to biology, respondents gave the highest score to improving efficiency and shortening the research cycle among the options we listed. As for the current challenges that limit the development of AI for synthetic biology, nearly 70% of them agreed that the field lacks uniform standards and specifications (e.g. data formats and sharing platforms). About 60% believe that the scarcity of AI expertise among biological researchers and the lack of synthetic biology data of sufficient quality and quantity are also major problems. As William A. V. Beardall and colleagues note [1], ‘synthetic biology has a natural synergy with deep learning.’ Synthetic biology can be used to generate large data sets to train models, while deep learning models trained on biological data can in turn inform design, for example by generating novel parts or suggesting optimal experiments to conduct. Given the outstanding performance of deep learning models in mining the nonlinear and complex relationships hidden in large-scale biological data, combining synthetic biology with deep learning holds great promise.

AI usage in biology

AI is already used in synthetic biology in many settings. For example, researchers have recently made significant progress in using deep learning to predict the function of, and to generate, biological 'parts' such as promoters, ribosome binding sites (RBSs), and 5' and 3' untranslated regions (UTRs) [8,9,10]. In addition, with the rapid progress of geometric deep learning, we can better understand not only protein structures (over 100,000 of which have been characterized) but also 3D RNA structures (only a handful of which have been characterized) [1]. Computer vision is another area where deep learning has enabled exceptional progress [11]. In the context of synthetic biology, imaging applications include automated detection of properties within an image, such as colony formation on a plate, and analysis of microscopy data. More excitingly, natural language processing (NLP), a branch of AI, can process text at a large scale, enabling topic organization across published articles [12]. For example, researchers have used GPT-4 to extract relevant bioprocess features from published papers and applied transfer learning to make predictions for novel microbial cell factories with limited data [13]. In conclusion, AI can substantially reduce the investment of manpower, time and money in data analysis and in genetic and protein engineering, which will drive the development of molecular design and personalised medicine.

Shortcomings

Combining deep learning with synthetic biology still faces many limitations, and data comes first. On the one hand, gathering the large quantity of data needed to apply artificial intelligence to biotechnology in practice requires a vast investment of human resources and money. On the other hand, according to our interviews with experts in synthetic biology, although applying AI to biology is exciting, the urgent demand for experimental data has not received enough attention from scientists. For example, as William A. V. Beardall and colleagues point out in their review [1], ‘nonfunctional sequences or those that result in very high expression are typically underrepresented.’

Solution

To push the boundary of deep learning in synthetic biology, our team has built Pymaker, an AI model based on the transfer learning paradigm. Pymaker learns the underlying connections between sequence and expression rate from very limited experimental yeast promoter data [2] and generates extremely high-expression promoter sequences that are difficult to obtain under natural conditions. Through our project, we not only show the promise of combining deep learning with synthetic biology, but also provide a possible way to build a well-performing AI model from limited experimental data. In addition, our project can enrich the database of extreme-expression promoter sequences and pave the way for the industrialization of biological synthesis.

Background

1. The eukaryotic yeast expression system

The eukaryotic yeast expression system has been widely used in synthetic biology because of its advantages. Compared with prokaryotic expression systems, the yeast expression system has post-transcriptional processing and post-translational modification systems and can also achieve true secretory expression. Compared with other eukaryotic expression systems, such as insect cell and mammalian cell expression systems, yeast fermentation is simple, fast and cheap, yeast is easy to manipulate genetically, and the purification process is also easy to operate [3]. In addition, other eukaryotic expression systems usually have more complex promoter regulation, which makes building and testing promoters more difficult. Another advantage is that yeast was the first eukaryote whose genome was completely sequenced, which means there is enough data for us to use.

2. Strong promoters

As the DNA sequence to which RNA polymerase binds together with transcription factors, the promoter regulates transcription initiation and determines whether, and under what conditions, a gene is expressed. The recognition and analysis of promoters are therefore the premise and foundation of research on expression regulation [4]. The core promoter is the minimal DNA sequence required for RNA polymerase to start transcription correctly. A good promoter is like a good engine: it takes gene expression from the "steam age" to the "electrical age", responding better to regulation and initiating various life activities. At present, promoter editing is applied in fields such as plant yield improvement, protein production, and drug production. Gene editing technology also relies on a variety of promoters to achieve tissue-specific expression.

3. The advantage of AI-based promoter design

Generative AI has a wide range of applications, including natural language processing, image generation, music composition, video generation, and more. Its core idea is to train a model on a large amount of data so that the model can generate new data or content. Generative AI offers the following advantages in promoter design:

Automated design

Promoters are DNA sequences that regulate gene expression. Designing new promoters requires considering multiple factors such as expression levels and specificity. Generative AI can automatically analyze the relationship between promoter sequences and expression features by learning from a large dataset of promoters, thereby generating new promoter sequences.

Large-scale search

Generative AI can generate a large number of candidate promoter sequences, enabling the screening of promoters with desired characteristics. This large-scale search process helps discover promoter sequences with potential activity and superior performance, expediting the promoter design and optimization process. In addition, machine learning can construct quantitative models from a limited database to analyze the data distribution characteristics of the designed promoter library, helping us better understand the underlying interaction principles [5].

Interpretability and optimization

Generative AI can generate promoter sequences and explain the characteristics and advantages of each generated sequence through the model's interpretability. This helps researchers understand the design principles and mechanisms of promoters and optimize the generation model to achieve better design outcomes.
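
The generate-and-screen idea behind large-scale search can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in: random_sequence replaces a trained generative model with uniform random sampling, toy_score is a placeholder (the fraction of A/T bases) and not a real expression predictor, and the sequence length and candidate counts are arbitrary.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    """Stand-in for a generative model: draw each base uniformly at random."""
    return "".join(rng.choice(BASES) for _ in range(length))

def toy_score(seq):
    """Placeholder score (fraction of A/T bases); NOT a real expression predictor."""
    return sum(base in "AT" for base in seq) / len(seq)

def screen(n_candidates, length, top_k, score_fn, seed=0):
    """Generate candidates and return the top_k (score, sequence) pairs, best first."""
    rng = random.Random(seed)
    # A set removes duplicate candidates before scoring.
    candidates = {random_sequence(length, rng) for _ in range(n_candidates)}
    ranked = sorted(((score_fn(s), s) for s in candidates), reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    for score, seq in screen(n_candidates=1000, length=80, top_k=3, score_fn=toy_score):
        print(f"{score:.2f} {seq}")
```

In a real pipeline, score_fn would be a trained sequence-to-expression model and the candidates would come from the generative model rather than from uniform sampling.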

Procedure

1. Building and modifying AI model

We developed Pymaker using the pre-training plus fine-tuning paradigm. To simulate small-sample conditions, we randomly drew subsets of 300, 3,000, 30,000, 300,000, 3,000,000, and 6,000,000 samples from the original data and used them to train both Pymaker and the AI model developed by us specifically for the original data [1]. The results showed that Pymaker fit the data better than the other AI models, particularly in small-sample scenarios.
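
The random subsampling used to simulate small-sample conditions might look like the following Python sketch. The function name nested_subsets is hypothetical, and drawing nested subsets (each smaller subset contained in the larger ones) is an assumption made here so that the effect of adding data can be compared cleanly; the text only states that the subsets were drawn at random. The demonstration uses a scaled-down pool of 10,000 items.

```python
import random

def nested_subsets(n_total, sizes, seed=0):
    """Shuffle the indices once, then take prefixes of the shuffled list.

    Sizes larger than n_total are skipped. Because every subset is a prefix
    of the same shuffle, smaller subsets are contained in larger ones.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return {size: indices[:size] for size in sizes if size <= n_total}

if __name__ == "__main__":
    subsets = nested_subsets(n_total=10_000, sizes=[300, 3_000, 30_000])
    for size, idx in subsets.items():
        print(size, "samples, first indices:", idx[:5])
```

Each index list would then select the training rows for one run, with the same held-out test set used for every subset size.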

2. Finding Mutation-Resistant Sequences

We constructed a base mutation model to simulate the in vitro evolution of core promoters. We performed 100 generations of random mutations and fed the mutated sequences into Pymaker to predict their expression intensities. We then summed the predicted expression intensities to identify the promoter sequence with the highest overall expression level, which we regard as a mutation-resistant, highly expressed promoter.
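
A minimal Python sketch of this kind of mutation scan is given below. It rests on several assumptions that the text does not spell out: one base substitution per generation, a single lineage per starting sequence, and a sum taken over all mutant generations. The predict argument stands in for Pymaker, and toy_predict is a placeholder, not the real model.

```python
import random

BASES = "ACGT"

def mutate(seq, rng):
    """Substitute one randomly chosen position with a different base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]

def cumulative_expression(seq, predict, generations=100, seed=0):
    """Sum the predicted expression over a chain of successive random mutants."""
    rng = random.Random(seed)
    total = 0.0
    current = seq
    for _ in range(generations):
        current = mutate(current, rng)
        total += predict(current)
    return total

def most_robust(candidates, predict, generations=100):
    """Return the candidate whose mutant lineage has the highest summed prediction."""
    return max(candidates, key=lambda s: cumulative_expression(s, predict, generations))

def toy_predict(seq):
    """Placeholder predictor (fraction of A/T bases); a real model would go here."""
    return sum(base in "AT" for base in seq) / len(seq)

if __name__ == "__main__":
    pool = ["ATATATATATATATATATAT", "GCGCGCGCGCGCGCGCGCGC", "ACGTACGTACGTACGTACGT"]
    print(most_robust(pool, toy_predict))
```

Averaging over several random lineages per starting sequence (different seeds) would make the ranking less sensitive to one particular mutation path.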

3. Testing generated promoter sequences

Our goal is to test how well our Pymaker-generated promoter sequences perform. For training and for setting up the control group, we use our dual-fluorescence reporter system. The data generated by our experiments also guide the fine-tuning of our AI model.
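
The text does not state how the two fluorescence channels are combined. A common convention for dual-reporter assays, assumed here and not confirmed by the source, is to divide the signal driven by the promoter under test by the signal of a constitutively expressed reference and to work with the log ratio. The Python sketch below follows that convention; the function names and the numbers in the example are purely illustrative.

```python
import math

def log_ratio(reporter, reference, floor=1e-9):
    """log2(reporter / reference); both signals are floored to avoid log of zero."""
    return math.log2(max(reporter, floor) / max(reference, floor))

def fold_change(test_pair, control_pair):
    """Normalized expression of a test promoter relative to a control promoter.

    Each pair is (reporter_signal, reference_signal) measured in the same sample.
    """
    return 2 ** (log_ratio(*test_pair) - log_ratio(*control_pair))

if __name__ == "__main__":
    # Illustrative values only, not measured data.
    print(fold_change(test_pair=(800.0, 100.0), control_pair=(200.0, 100.0)))
```

Normalizing by the reference channel is meant to cancel sample-to-sample differences such as cell size or plasmid copy number, so that the remaining variation reflects the promoter itself.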

4. LTB expression

To demonstrate the ability of our learning model, we engineered the promoter sequence in S. cerevisiae to produce E. coli heat-labile enterotoxin subunit B (LTB), which is one of the most popular oral vaccine adjuvants and intestine adsorption enhancers [7]. Our goal is to use our AI to tackle the low production yield of LTB. In this way, our project can facilitate its mass production, which will help popularize oral vaccines and strengthen our preparedness for emerging pandemics. Furthermore, our project shows the promising future of combining deep learning with synthetic biology and offers a way to overcome the limitation of scarce experimental data.

References

[1]Beardall, W. A. V., Stan, G. B., & Dunlop, M. J. (2022). Deep Learning Concepts and Applications for Synthetic Biology. GEN biotechnology, 1(4), 360–371. https://doi.org/10.1089/genbio.2022.0017

[2]Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6

[3]Garvey, M. (2022). Non-Mammalian Eukaryotic Expression Systems Yeast and Fungi in the Production of Biologics. Journal of Fungi, 8(11), 1179.

[4]Danino, Y. M., Even, D., Ideses, D., & Juven-Gershon, T. (2015). The core promoter: At the heart of gene expression. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1849(8), 1116-1131.

[5]Tang, H., Wu, Y., Deng, J., ... & Keasling, J. D. (2020). Promoter architecture and promoter engineering in Saccharomyces cerevisiae. Metabolites, 10(8), 320.

[6]Ji, Y., Zhou, Z., Liu, H., & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics (Oxford, England), 37(15), 2112–2120. https://doi.org/10.1093/bioinformatics/btab083

[7]So, KK., Le, N.M.T., Nguyen, NL. et al. Improving expression and assembly of difficult-to-express heterologous proteins in Saccharomyces cerevisiae by culturing at a sub-physiological temperature. Microb Cell Fact 22, 55 (2023). https://doi.org/10.1186/s12934-023-02065-7

[8]Kotopka BJ, Smolke CD. Model-driven generation of artificial yeast promoters. Nat Commun 2020;11(1):2113; doi: 10.1038/s41467-020-15977-4.

[9]Angenent-Mari NM, Garruss AS, Soenksen LR, et al. A deep learning approach to programmable RNA switches. Nat Commun 2020;11(1):5057; doi: 10.1038/s41467-020-18677-1.

[10]Gilliot P-A, Gorochowski TE. Sequencing enabling design and learning in synthetic biology. Curr Opin Chem Biol 2020;58:54–62; doi: 10.1016/j.cbpa.2020.06.002.

[11]Voulodimos A, Doulamis N, Doulamis A, et al. Deep learning for computer vision: A brief review. Comput Intell Neurosci 2018;2018:7068349; doi: 10.1155/2018/7068349.

[12]Zhu, J.-J.; Ren, Z. J. The evolution of research in resources, conservation & recycling revealed by Word2vec-enhanced data mining. Resources, Conservation and Recycling 2023, 190, 106876. DOI: https://doi.org/10.1016/j.resconrec.2023.106876.

[13]Xiao Z, Li W, Moon H, Roell GW, Chen Y, Tang YJ. Generative Artificial Intelligence GPT-4 Accelerates Knowledge Mining and Machine Learning for Synthetic Biology. ACS Synth Biol. 2023 Sep 8. doi: 10.1021/acssynbio.3c00310. Epub ahead of print. PMID: 37682043.