CONTRIBUTION

Introduction

From a more fundamental perspective, our project demonstrates a practical and efficient way to apply AI in synthetic biology, an approach that, as far as we know, has not been proposed before [1]. We provide a high-performing AI model that can generate promoter sequences with a specified expression rate, which will help accelerate the commercialization of synthetic biology. Furthermore, we showed that through our ‘pre-training + fine-tuning’ paradigm, synthetic biologists can make full use of the limited data generated in experiments to build high-performing AI models. Our paradigm is also biologist-friendly: advanced model-building skills are not required to obtain a high-performing AI model [2].

Through our project, we hope to forge a tighter bond between synthetic biology and AI, pushing forward the boundary of AI for biology.


Meanwhile, our artificial intelligence product Pymaker has shown its ability to tackle real challenges. Significant breakthroughs have recently been achieved in the biotechnological application of brewing yeast. One of them is the production of the heat-labile toxin B subunit (LTB) of Escherichia coli in brewing yeast. LTB is an important oral vaccine adjuvant widely used to prevent diseases such as cholera, traveler’s diarrhea, and E. coli infection. Compared with traditional LTB production methods, such as the E. coli expression system, LTB produced by brewing yeast is purer and does not contain residual host cell materials or endotoxins. LTB produced by brewing yeast is also safe and shows good immunogenicity in oral vaccines, effectively activating the immune system and inducing specific immune responses [3].

However, the limited expression rate of LTB in yeast has long remained a major problem. The promoter is an important element in the regulation of gene expression: it determines the expression level of a gene in the cell. Pymaker can predict the strength of different promoters, providing a basis for selecting suitable promoters to ensure efficient expression of LTB in brewing yeast. Ultimately, by combining the predictions with the gene expression system of brewing yeast, we successfully used the predicted promoters to express LTB in brewing yeast. This provides new methods and tools for the efficient and controllable production of LTB. Importantly, our project offers a more feasible and efficient approach to LTB production. Traditional production methods may require significant time and resources, whereas our project uses artificial intelligence to predict promoter strength, enabling rapid screening of suitable promoters and thereby improving the yield and purity of LTB. This is of great significance for meeting the demand for oral vaccine adjuvants.

Pymaker

1. Principle

2. Training data source

We searched extensively for raw data suitable for training our AI model and settled on a dataset made publicly available with a Nature article [4]. The dataset contains a total of 30 million pairs of core promoter sequences and expression levels. The format is shown in the diagram below: each record lists a random synthetic core promoter sequence first, followed by its expression intensity measured in a high-throughput assay. Specifically, the expression value is the log2(RFP/YFP) of the dual-fluorescence reporter driven by the promoter.
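As a minimal sketch of how such sequence/expression pairs could be loaded, the snippet below assumes one record per line, with the sequence and the expression value separated by a tab. The file layout, the function name `parse_pairs`, and the example records are illustrative assumptions and are not taken from the dataset itself.

```python
from typing import Iterable, List, Tuple

# Bases accepted in a sequence; "N" is allowed for ambiguous calls (assumption).
VALID_BASES = set("ACGTN")


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse 'sequence<TAB>expression' records, skipping malformed lines."""
    pairs = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # not a sequence/expression record
        seq, value = fields[0].upper(), fields[1]
        if not seq or not set(seq) <= VALID_BASES:
            continue  # unexpected characters in the sequence
        try:
            pairs.append((seq, float(value)))
        except ValueError:
            continue  # expression value is not numeric
    return pairs


# Made-up records, for illustration only.
example = ["ACGTTGCA\t11.0", "ttgacagc\t3.5", "bad line", "ACGX\t1.0"]
records = parse_pairs(example)
```

Skipping malformed records instead of raising keeps a single bad line from aborting a pass over a very large file; a stricter loader could count and report the skipped lines.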


3. Pearson coefficient

Our artificial intelligence (AI) model is essentially a regression model, so we selected the Pearson correlation coefficient as the evaluation metric to assess goodness of fit. This coefficient is commonly used to measure the linear relationship between two variables. It ranges from -1.0 to 1.0, and its sign gives the direction of the relationship. For its absolute value, 0.8-1.0 indicates a very strong correlation, 0.6-0.8 a strong correlation, 0.4-0.6 a moderate correlation, 0.2-0.4 a weak correlation, and 0.0-0.2 a very weak or no correlation.
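For illustration, the coefficient can be computed from its standard definition (the covariance divided by the product of the standard deviations) in a few lines. This is a generic sketch and not the project's evaluation code. The helper names `pearson_r` and `describe_strength` are hypothetical, and the labels apply the bands listed above to the absolute value of r.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sample is constant")
    return cov / math.sqrt(var_x * var_y)


def describe_strength(r: float) -> str:
    """Map |r| onto the qualitative bands used in the text."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or none"


# Toy values: predicted vs. measured expression (made up for illustration).
predicted = [1.0, 2.0, 3.0]
measured = [2.0, 4.0, 6.0]
r = pearson_r(predicted, measured)
```

Because the coefficient only captures linear association, a high value says that predictions track measurements, not that they match them in absolute scale.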


The figure illustrates the correlation between expression predicted by Pymaker and measured expression from the dataset.


Strong promoter sequence

1. Parts

Registry No.    Name      Full Name                           Type            URL Link
BBa_K4815000*   H1        PYPH1                               promoter        link
BBa_K4815001    H2        PYPH2                               promoter        link
BBa_K4815002    H3        PYPH3                               promoter        link
BBa_K4815003    H4        PYPH4                               promoter        link
BBa_K4815004    H5        PYPH5                               promoter        link
BBa_K4815005    H6        PYPH6                               promoter        link
BBa_K4815006    H7        PYPH7                               promoter        link
BBa_K4815007    L1        PYPL1                               promoter        link
BBa_K4815008    L2        PYPL2                               promoter        link
BBa_K4815009    L3        PYPL3                               promoter        link
BBa_K4815021    LTB-eGFP  LTB-eGFP                            fusion protein  link
BBa_K4815011    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815012    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815013    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815014    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815015    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815016    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815017    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815018    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815019    pDual     pDual-fluorescence reporter system  plasmid         link
BBa_K4815020    pDual     pDual-fluorescence reporter system  plasmid         link

* We use BBa_K4815000 (PYPH1) for the judging of the new basic part, as it outperforms all other parts.

2. Recombinant construction plasmid

As shown in the figure, each Pymaker-derived promoter (PYPH/PYPL) consists of two parts: the core promoter and the scaffold. The core promoter is an 80 bp sequence located at approximately -170 to -90 upstream of the codon (taken as the presumed transcription start site, TSS), which is where most transcription factor binding sites lie. The scaffold is a sequence conserved across all PYPH/PYPLs (marked ‘pT’ and ‘pA’ in the figure). We adopted this structure from previous research: it links the core promoter to the codon and provides BamHI and XhoI restriction sites, so that different core promoter sequences can easily be inserted into plasmids carrying the scaffold. The whole promoter sequence is then ligated into the plasmid backbone, where it drives the expression of YeGFP (details can be found in Part:BBa_K4815011 on parts.igem.org).
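The core-plus-scaffold layout can be sketched in code. The snippet below is a hypothetical illustration and not the team's cloning pipeline: it checks that a candidate core promoter is 80 bp long and free of internal BamHI (GGATCC) or XhoI (CTCGAG) recognition sites, which would interfere with insertion, and then joins it to two scaffold flanks. The function names and the flank sequences are placeholders, since the real scaffold sequence is not given here.

```python
CORE_LENGTH = 80  # core promoter length stated in the text

# Standard recognition sequences. Both are palindromic, so checking one
# strand also covers the reverse complement.
RESTRICTION_SITES = {"BamHI": "GGATCC", "XhoI": "CTCGAG"}


def check_core(core: str) -> str:
    """Return the upper-cased core promoter, or raise if it is unsuitable."""
    core = core.upper()
    if len(core) != CORE_LENGTH:
        raise ValueError(f"core must be {CORE_LENGTH} bp, got {len(core)}")
    if set(core) - set("ACGT"):
        raise ValueError("core contains characters other than A, C, G, T")
    for enzyme, site in RESTRICTION_SITES.items():
        if site in core:
            raise ValueError(f"core contains an internal {enzyme} site")
    return core


def assemble_promoter(upstream: str, core: str, downstream: str) -> str:
    """Join hypothetical scaffold flanks around a validated core promoter."""
    return upstream.upper() + check_core(core) + downstream.upper()


# Placeholder flanks and core, for illustration only (not real sequences).
full = assemble_promoter("aaaa", "ACGT" * 20, "tttt")
```

Rejecting cores with internal recognition sites up front avoids candidates that would be cut during digestion with the same enzymes used for insertion.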


3. Flow cytometry

We utilized flow cytometry to monitor the two fluorescence signals excited by different light channels and analyzed the corresponding data. We plotted the natural logarithm of the ratio of GFP to mCherry (ln(GFP/mCherry)) as a frequency distribution graph to showcase the relative expression strength of different promoters in yeast.
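A minimal sketch of this calculation, assuming per-cell GFP and mCherry intensities have already been exported as two equal-length lists: the function names, the bin width, the choice to drop events with a non-positive signal, and the example values are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def log_ratios(gfp: Sequence[float], mcherry: Sequence[float]) -> List[float]:
    """Per-cell ln(GFP/mCherry); events with a non-positive signal are dropped."""
    ratios = []
    for g, m in zip(gfp, mcherry):
        if g > 0 and m > 0:
            ratios.append(math.log(g / m))
    return ratios


def frequency_distribution(values: Sequence[float],
                           bin_width: float = 0.5) -> Dict[float, float]:
    """Fraction of cells per bin, keyed by the lower edge of each bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    return {round(k * bin_width, 10): c / total
            for k, c in sorted(counts.items())}


# Made-up intensities, for illustration only.
gfp = [200.0, 100.0, 400.0, 0.0]
mcherry = [100.0, 100.0, 100.0, 50.0]
ratios = log_ratios(gfp, mcherry)
dist = frequency_distribution(ratios)
```

Taking the ratio to mCherry normalizes each cell's GFP signal against the second reporter, so the distribution reflects promoter strength, not cell-to-cell differences in overall signal.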


LTB expression test

The target proteins were detected with specific primary antibodies (rabbit anti-GFP) and HRP-conjugated secondary antibodies. Western band intensities, which reflect the relative amount of target proteins in the samples, were determined using the ImageJ software.

In the figure above, the bands marked by the red box are the target LTB expressed in yeast.


References

[1] Beardall, W. A. V., Stan, G. B., & Dunlop, M. J. (2022). Deep Learning Concepts and Applications for Synthetic Biology. GEN Biotechnology, 1(4), 360–371. https://doi.org/10.1089/genbio.2022.0017

[2] Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618(7965), 616–624. https://doi.org/10.1038/s41586-023-06139-9

[3] So, K. K., Le, N. M. T., Nguyen, N. L., & Kim, D. H. (2023). Improving expression and assembly of difficult-to-express heterologous proteins in Saccharomyces cerevisiae by culturing at a sub-physiological temperature. Microb Cell Fact, 22(1), 55. https://doi.org/10.1186/s12934-023-02065-7

[4] Vaishnav, E. D., de Boer, C. G., Molinet, J., Yassour, M., Fan, L., Adiconis, X., Thompson, D. A., Levin, J. Z., Cubillos, F. A., & Regev, A. (2022). The evolution, evolvability and engineering of gene regulatory DNA. Nature, 603(7901), 455–463. https://doi.org/10.1038/s41586-022-04506-6