

AI for biology has become a major trend, especially in synthetic biology: genes are the language of biology, which makes them particularly well suited to transfer learning with popular large language models. However, through our interviews with relevant experts, we have learned that the limited amount of data is a significant challenge for AI for biology.

Dry Lab


We aim to design a pre-training + fine-tuning paradigm that improves the fitting accuracy of AI models trained on small datasets, addressing the problem that there is often too little data to train a good AI model.


We will experiment with different k-mers, optimizers, learning rates, and pre-trained models. We will then train AI models from scratch using datasets of varying sizes: 300, 3,000, 30,000, 300,000, 3,000,000, 6,000,000, and 30,000,000 samples. The details can be found in the dry lab cycle.


We will test every AI model trained at each sample size on a dedicated test dataset obtained by low-throughput measurement, which has lower noise. We will calculate the fitting accuracy and identify the AI model with the highest fitting accuracy for each sample size. As a baseline, we will also train the carefully designed model from a Nature article on the same sample sizes, test it on the same dedicated dataset, and calculate its fitting accuracy.
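Throughout this page, "fitting accuracy" refers to the Pearson correlation coefficient between predicted and measured expression levels. A minimal, dependency-free sketch of that metric (the toy data below are illustrative, not from our experiments):

```python
import math

def pearson_r(pred, meas):
    """Pearson correlation coefficient between predictions and measurements."""
    n = len(pred)
    mp = sum(pred) / n
    mm = sum(meas) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(pred, meas))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sm = math.sqrt(sum((m - mm) ** 2 for m in meas))
    return cov / (sp * sm)

# Perfectly correlated toy data give a coefficient close to 1.0
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In practice a library routine such as `scipy.stats.pearsonr` does the same computation.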


We found that all models trained with the pre-training + fine-tuning paradigm achieved higher fitting accuracy than models built from scratch, validating the effectiveness of our paradigm. However, after discussing with professors and other teams, we realized that our model overlooked certain factors, such as random mutations arising in promoter bases over repeated rounds of cultivation. Mutations in the promoter change the sequence and can lead to decreased expression levels.

Wet Lab


To address the aforementioned issue, we plan to design a base mutation model that simulates the process of base mutation. This will generate mutated promoter sequences, which will be input into the AI model to predict their expression levels. Finally, we will plot the expression intensity changes of each generation of mutated promoters, fitting the curve to illustrate the expression strength variation under multiple generations of random mutations.


As many synthetically generated core promoter sequences lack standard core promoter structures, it is not possible to apply existing core promoter mutation hotspots and preferences to our base mutation model. Therefore, we assume that mutations do not have any specific hotspots or preferences but occur purely randomly. We found that some studies constructing mutation models also used the same assumption. Based on this assumption, we constructed the base mutation model and combined it with the AI model to fit the expression intensity changes of 1000 highly expressed promoters under multiple generations of random mutations. We have also identified some mutation-resistant highly expressed promoters.
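Under the purely random assumption (no hotspots, no preferences), the base mutation model can be sketched as below. The per-base mutation rate and generation count here are hypothetical placeholders, not our calibrated values:

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Apply uniform random substitutions: each base mutates with
    probability `rate` to one of the three other bases (no hotspots)."""
    out = []
    for b in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in BASES if x != b]))
        else:
            out.append(b)
    return "".join(out)

def evolve(seq, generations, rate=0.01, seed=0):
    """Return the promoter sequence after each generation of random mutation."""
    rng = random.Random(seed)
    history = [seq]
    for _ in range(generations):
        seq = mutate(seq, rate, rng)
        history.append(seq)
    return history

# Illustrative 80 bp starting promoter and 50 generations of mutation
promoter = "".join(random.Random(1).choice(BASES) for _ in range(80))
lineage = evolve(promoter, generations=50)
print(len(lineage), len(lineage[-1]))  # → 51 80
```

Each generation's sequence would then be scored by the AI model to trace the expression-strength curve over generations.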


Wet lab experiments.


Wet lab experiments confirmed that the AI model predictions were relatively accurate. However, due to time constraints, we were unable to measure the expression levels after multiple generations of cultivation and mutation to verify their resistance to mutation. Given sufficient time, we could further improve the fitting accuracy of the AI model and the accuracy of the base mutation model by incorporating wet lab experimental data. For example, we could identify mutation hotspots and preferences for synthetically generated promoter sequences to optimize our base mutation model. We could also integrate the promoter and expression intensity data obtained from wet lab experiments into the original dataset to further train our AI model and enhance its fitting accuracy.

Dry Lab Cycle
Cycle 0. Fine-tuning


First, we load the DNABERT pre-trained model that has generally shown the best performance. We identify the important parameters and perform fine-tuning to achieve the best fitting accuracy for our AI model.


We identify the important hyperparameters: the tokenization length (k-mer), the optimizer, and the learning rate. We then test suitable combinations of these parameters.
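DNABERT represents a DNA sequence as overlapping k-mers, which is why the tokenization length k is one of the key hyperparameters. A minimal sketch of k-mer tokenization:

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (DNABERT-style)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ACGTAC", k=3))  # → ['ACG', 'CGT', 'GTA', 'TAC']
```

An 80 bp promoter thus yields 80 − k + 1 tokens, e.g. 75 tokens for k = 6.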


We train different variations of the AI model using different parameter combinations. At this stage we use 10% of the original data: three million data points randomly split 70:30 into training and test sets. The training set is used to train the AI model, and the test set to assess its fitting accuracy.
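A sketch of the random 70:30 split, assuming the data are held in memory as a list of records (the integer stand-ins below are illustrative):

```python
import random

def split_dataset(records, train_frac=0.7, seed=42):
    """Shuffle and split records into train/test subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(3_000_000))  # stand-in for 3M (sequence, expression) pairs
train, test = split_dataset(data)
print(len(train), len(test))  # → 2100000 900000
```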


After training the AI models, we find that the best Pearson correlation coefficient achieved is 0.84, significantly lower than that of the well-constructed AI model in the reference article. We plan to train the models on the entire dataset next. However, after discussing with our teacher, we learn that several other pre-trained models are also suitable for our project, especially the recently released DNABERT2, which may yield better results.

Cycle 1. Pre-trained Model


After considering our teacher's advice and conducting literature research, we have identified three additional pre-trained models: DNABERT2, BioBERT, and RNAprob. To improve the fitting accuracy of our model, we will load each of these models and perform fine-tuning.


We will load each of these pre-trained models, download the corresponding packages, and modify the fully connected layer: the modified code takes the output of the pre-trained model and passes it through a linear layer to produce a single numerical output. We will then run fine-tuning cycles to achieve the highest fitting accuracy for our AI model.
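The regression head can be sketched independently of any particular framework. The snippet below is an illustrative NumPy sketch, not our actual training code; the 768-dimensional pooled embedding is an assumption based on BERT-style models:

```python
import numpy as np

class RegressionHead:
    """Linear layer mapping a pooled transformer embedding (e.g. a
    768-dim [CLS] vector) to a single predicted expression level."""
    def __init__(self, hidden_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0, 0.02, size=(hidden_dim, 1))  # small random init
        self.b = np.zeros(1)

    def __call__(self, pooled):          # pooled: (batch, hidden_dim)
        return pooled @ self.w + self.b  # → (batch, 1) scalar predictions

head = RegressionHead()
fake_embeddings = np.zeros((4, 768))     # stand-in for pre-trained model output
print(head(fake_embeddings).shape)       # → (4, 1)
```

During fine-tuning, the pre-trained encoder and this head are trained jointly against the measured expression levels.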


We compare the highest fitting accuracies achieved by the four pre-trained models and observe that the AI model based on DNABERT performs best. We then train this best pre-trained model and fine-tuning combination on the entire dataset of 30 million samples. The Pearson correlation coefficient reaches a maximum of 0.85, only a slight improvement.

In addition, we plotted the expression levels predicted by the model trained on the 30 million raw data points against the measured expression levels, with predictions on the x-axis and measurements on the y-axis. The plot showed many horizontal lines: large numbers of sequences whose measured expression was exactly an integer received widely varying predictions. Apart from these integers, measured values typically carried about five decimal places, so obtaining exactly integer values at such precision is implausible. Moreover, roughly three-quarters of the data points were integers, which suggested an underlying issue with the data.


To ascertain the source of these integer values, we emailed the first author of the Nature article. The author explained that expression levels were measured across multiple culture vessels, and that a sequence whose expression was recorded as an integer had been detected in only a single vessel. Such integer data therefore carried significant error, being strongly affected by noise. For this reason, the authors used a separate high-precision, low-throughput test dataset to measure the model's goodness of fit, minimizing the interference of noise during testing.

Cycle 2. Dataset


To further investigate the impact of small sample sizes, we removed the integer data from the original dataset of 30 million points, reducing noise and simulating a low-throughput, high-precision scenario. We then randomly drew samples of 300, 3,000, 30,000, 300,000, 3,000,000, and 6,000,000 instances for training, and assessed each model's goodness of fit on the dedicated test dataset.
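A minimal sketch of this noise-reduction step, assuming the expression levels are held in a NumPy array (the values below are illustrative):

```python
import numpy as np

def drop_integer_measurements(expr):
    """Keep only measurements that are not whole numbers, removing the
    noisy single-vessel (integer) values."""
    expr = np.asarray(expr, dtype=float)
    return expr[expr != np.round(expr)]

expr = np.array([11.0, 2.13846, 7.0, 5.90218, 3.0])
print(drop_integer_measurements(expr))  # → [2.13846 5.90218]
```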


During random sample selection we observed that models trained on small samples tended to overfit. To mitigate this, we applied weight decay and performed fine-tuning, adjusting the optimizer and the learning rate; these modifications were crucial to achieving the best goodness of fit. We then evaluated each model on the specialized test dataset and visualized the results.
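Weight decay shrinks the weights toward zero at every update, discouraging the large weights typical of overfitting on small samples. A minimal sketch of one decoupled-weight-decay gradient step (illustrative values; in practice this is a built-in option of common optimizers such as AdamW):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=1e-4, weight_decay=0.01):
    """One gradient step with decoupled weight decay: shrink the weights
    toward zero in addition to following the loss gradient."""
    return w - lr * grad - lr * weight_decay * w

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
w_new = sgd_step_with_weight_decay(w, g, lr=0.1, weight_decay=0.1)
print(w_new)  # → [ 0.94 -2.03]
```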


To benchmark our model, we trained the meticulously developed AI model from the Nature article on the same preprocessed data. Our model outperformed the Nature model across all dataset sizes, with a particularly pronounced advantage on small samples. We also found that the models that scored highest on the randomly partitioned original test set also performed best on the specialized test dataset, confirming the validity of using a dedicated test dataset for evaluation.

Wet Lab Cycle
Cycle 0. Build Promoter Library


How can the 80 bp promoter sequences generated by Pymaker drive actual protein expression? First, we clarified the biological significance of the 80 bp length through literature review: it is an appropriate span for RNA polymerase binding. We therefore placed the sequence at positions -160 to -80 of the full promoter framework, upstream of the assumed transcription start site (TSS) and covering the initial binding region for RNA polymerase, where we expect it to play an important role. Beyond the AI-generated sequence, are any other components required to form a complete, functional promoter?

We conducted in-depth discussions and consultations with external expert Hu Yilin. Under his guidance, we embedded the sequence between the pT and pA sequences of an ADH1 promoter framework from which all possible cis-acting elements had been removed, forming a complete promoter.


Based on the aforementioned requirements and the subsequent need for a fluorescence reporting system, we constructed a complete fluorescent reporter vector that contains 20 AI-derived promoter sequences. To facilitate subsequent experimental procedures, we designed primers targeting the conserved pA and pT sequences at both ends for selective amplification and extraction of the promoter sequences. Additionally, we introduced XhoI and BamHI restriction enzyme recognition sites for future enzymatic digestion if needed.


We extracted the promoter sequences and verified them by agarose gel electrophoresis.

The figure shows that we successfully extracted the designed promoter sequences from the dual-fluorescence reporter plasmids.


To confirm that the promoter framework can drive downstream gene expression and that the complete promoter is functional, we conducted preliminary experiments: we used both an empty promoter framework and a framework containing the designed sequences to drive expression of a fluorescent gene. The results provided evidence that our promoter constructs are effective.


We observed the yeast smears under a fluorescence microscope.


We have obtained an effective framework that can be used to experimentally test the function of transcription factor cis elements.

Cycle 1. Establish Reporter System


We have obtained functional promoter sequences, and now we need to quantify their expression strength and compare them. We engaged in discussions with Professor Mali Jia from Westlake University. Prof. Jia specializes in functional genomics and has constructed gene libraries for human regulatory sequences, bringing extensive experience in experimentally characterizing the strength of regulatory sequences. Based on this collaboration and a thorough review of relevant literature, we explored the feasibility and ultimately decided to employ a dual-fluorescence reporter system for characterization (the model is shown in the figure below).

The advantage of this system is its ability to eliminate the influence of plasmid copy number and the growth status of the bacterial host, thereby providing a more direct measurement of the relative expression strength of the designed promoters.


We used a synthetic promoter to drive the expression of the YeGFP gene on the same plasmid, while the TEF1 promoter was used in the reverse orientation to drive the expression of the mCherry gene. Additionally, we incorporated a lactose-inducible switch to enhance safety.

The plasmid structure is as follows:


Spacers H1 through H7 denote Pymaker-generated high-expression promoters, and L1 through L3 denote low-expression ones.


We utilized flow cytometry to monitor the two fluorescence signals excited in different light channels and analyzed the corresponding data. We plotted the natural logarithm of the GFP-to-mCherry ratio, ln(GFP/mCherry), as a frequency distribution to show the relative expression strength of the different promoters in yeast.
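The ln(GFP/mCherry) statistic can be sketched as follows; the fluorescence values here are illustrative, not measured data:

```python
import math

def log_ratio(gfp, mcherry):
    """Per-cell relative expression: ln(GFP / mCherry) cancels plasmid
    copy number and growth-state effects shared by both channels."""
    return [math.log(g / m) for g, m in zip(gfp, mcherry)]

gfp     = [1200.0, 450.0, 900.0]   # illustrative per-cell GFP signals
mcherry = [600.0, 450.0, 300.0]    # matching mCherry reference signals
ratios = log_ratio(gfp, mcherry)
print([round(r, 3) for r in ratios])  # → [0.693, 0.0, 1.099]
```

These per-cell values are then binned into the frequency distribution shown on this page.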



We have established a high-throughput experimental system to monitor the strength of promoter sequences and applied it in practice.

Cycle 2. Constitute Express System


Although the dual-fluorescence reporter system lets us report the expression of the target gene, we still face challenges in quantifying the expression level of the actual product accurately and rapidly.

Firstly, due to the unique nature of the yeast cell wall, secreted proteins cannot be obtained directly; we must lyse the yeast strain and purify the protein of interest. For LTB (Lytic transglycosylase B), no readily available, affordable antibody exists. After reviewing the literature on LTB expression, we found that the main approach for expressing LTB in yeast is as a fusion protein. To facilitate purification and enable multi-level detection, we therefore designed a fusion strategy, fusing LTB with GFP.


Using a high-fidelity PCR system, we extracted the AI-optimized promoter sequence from the commercially synthesized complete dual-reporter system and introduced specific restriction enzyme sites. We then used double digestion and ligation to connect the promoter to the LTB-YeGFP plasmid framework.


We first conducted shake-flask cultivation and plasmid extraction.

We extracted total RNA from the yeast and performed quantitative real-time PCR (qRT-PCR) to measure expression from the plasmid. We also lysed the yeast cells to extract total protein, selectively purified the fusion protein using an EGFP antibody, and performed Western blot analysis.


Western blot using rabbit anti-GFP antibody shows that LTB-eGFP fusion protein is successfully expressed in yeasts.

We intended to use flow cytometry to record GFP intensity, but we failed to obtain data from the flow cytometer, possibly because excess cell debris clogged the instrument. The results we did obtain, however, are consistent with our Western blot results.


The strength of the promoter sequence is closely related to the expression of downstream genes. The absolute quantitative data obtained for LTB expression will be used to construct a new dataset. This data can be utilized to further optimize our AI model using our paradigm, enabling us to generate promoter sequences that specifically drive high expression of LTB.