Highly expressed sequences

The workflow proceeds in stages: from the raw data and a pre-trained model to a trained AI model, and from that model to highly expressed sequences.

First, we build an AI model on top of a pre-trained model. We then train it on the raw data so that it can predict expression strength from a core promoter sequence. Next, we select the highly expressed sequences from the original data and randomly mutate them to generate a diverse set of new sequences. Finally, we use the AI model with the highest goodness of fit to predict the expression levels of both the mutated and the original sequences, which lets us identify highly expressed promoter sequences. The table below depicts the overall procedure, from raw data and a pre-trained model to the selection of highly expressed sequences:
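
As a rough illustration of the select, mutate, and re-predict loop described above, the following Python sketch may help. It is not the project's code: `toy_predict` is a hypothetical stand-in for the trained model (it simply scores GC content), and `point_mutate`, `select_top`, and `design_round` are names invented for this example.

```python
import random

BASES = "ACGT"

def toy_predict(seq):
    """Hypothetical stand-in for the trained model: GC fraction as 'expression'."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def point_mutate(seq, rng):
    """Substitute one randomly chosen position with a different base."""
    i = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new_base + seq[i + 1:]

def select_top(seqs, predict, k):
    """Return the k sequences with the highest predicted expression."""
    return sorted(seqs, key=predict, reverse=True)[:k]

def design_round(raw_seqs, predict, k, n_mutants, rng):
    """Select top-k originals, mutate them, then re-rank originals plus mutants."""
    parents = select_top(raw_seqs, predict, k)
    mutants = [point_mutate(p, rng) for p in parents for _ in range(n_mutants)]
    candidates = list(dict.fromkeys(parents + mutants))  # de-duplicate, keep order
    return select_top(candidates, predict, k)

# Random placeholder sequences, not real promoter data.
rng = random.Random(0)
raw = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(50)]
best = design_round(raw, toy_predict, k=5, n_mutants=10, rng=rng)
```

Because the original sequences stay in the candidate pool, the final selection can never score worse than the starting set under the same predictor.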

Mutation-resistant highly expressed sequences

From highly expressed sequences to mutation-resistant highly expressed sequences:

We constructed a base mutation model that performs 100 generations of random mutation. In each generation, random mutations occur 1 to 3 times, with equal probability at every position within the core promoter. Each mutation always replaces the original base with one of the other three bases. This yields a mutated sequence for each generation.
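
The mutation model above could be sketched in Python roughly as follows. Several details are assumptions made for this example rather than facts from the text: the number of mutations per generation is drawn uniformly from {1, 2, 3}, the mutated positions within one generation are distinct, and mutations accumulate from one generation to the next. The function names are hypothetical.

```python
import random

BASES = "ACGT"

def mutate_generation(seq, rng):
    """Apply 1, 2, or 3 substitutions (equally likely) at uniformly chosen positions."""
    n_mut = rng.choice([1, 2, 3])
    # Assumption: positions hit within a single generation are distinct.
    positions = rng.sample(range(len(seq)), n_mut)
    chars = list(seq)
    for pos in positions:
        # A mutation always changes the base to one of the other three.
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)

def mutation_trajectory(seq, generations=100, seed=None):
    """Return one mutated sequence per generation.

    Assumption: each generation mutates the previous generation's sequence.
    """
    rng = random.Random(seed)
    trajectory = []
    current = seq
    for _ in range(generations):
        current = mutate_generation(current, rng)
        trajectory.append(current)
    return trajectory

# Placeholder 20-base sequence, not a real core promoter.
start = "ACGTACGTACGTACGTACGT"
trajectory = mutation_trajectory(start, generations=100, seed=1)
```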

Next, we passed 1000 selected highly expressed sequences through the mutation model for 100 generations of random mutation, then fed the mutated sequences from all 100 generations into an AI model to predict their expression levels. We used the sum of the predicted expression levels over the 100 generations as an indicator of mutation resistance. Finally, we selected a few core promoter sequences that were both mutation-resistant and highly expressed for wet-lab experiments. We also identified five representative promoter sequences and plotted their evolutionary trajectories.
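
The resistance indicator described above, the sum of predicted expression over all generations, can be written down in a few lines. In this minimal Python sketch, `toy_predict` is a hypothetical placeholder for the AI model, and the two short trajectories are made-up examples rather than real data.

```python
def toy_predict(seq):
    """Hypothetical stand-in for the AI model: GC fraction as 'expression'."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def mutation_resistance(trajectory, predict):
    """Sum of predicted expression levels over every generation of one lineage."""
    return sum(predict(s) for s in trajectory)

def rank_by_resistance(trajectories, predict, top_n):
    """Score each lineage and return the top_n names plus all scores."""
    scores = {name: mutation_resistance(traj, predict)
              for name, traj in trajectories.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores

# Made-up three-generation lineages for illustration only.
trajectories = {
    "stable": ["GGCC", "GGCA", "GGCC"],
    "fragile": ["GGCC", "ATAT", "AATT"],
}
top, scores = rank_by_resistance(trajectories, toy_predict, top_n=1)
```

A lineage that stays high across generations accumulates a larger sum than one that starts high and collapses, which is why the sum separates resistant from sensitive promoters.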

The red curve shows the expression intensity of mutation-resistant promoters, the blue curve shows mutation-sensitive sequences, and the yellow curve shows non-mutation-resistant sequences. Mutation-resistant, highly expressed promoters are not easily affected by mutations and consistently maintain a high level of expression. Mutation-sensitive, highly expressed sequences, by contrast, are strongly affected: their expression intensity drops rapidly once mutations occur and remains low overall. The expression intensity of non-mutation-resistant, highly expressed promoters fluctuates significantly throughout the 100 generations, alternating between high and low expression states.

High-expression validation

We compared the AI-generated mutation-resistant, high-expression promoter sequence with the natural promoter using flow cytometry. Compared with the natural promoter, whose expression intensity is less stable, the AI-predicted promoter shows the following characteristics:

1. High expression intensity

The average expression intensity of the promoter sequence designed by our AI model is more than twice that of the natural promoter under the same conditions. (The x-axis in the graph represents the natural logarithm of the expression intensity, and the y-axis represents the normalized frequency of yeast.)
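
To make the axes and the comparison of averages concrete, here is a small Python sketch of how such a plot and ratio could be computed. The intensity values are invented placeholders, not measurements, and normalizing the frequencies so that they sum to 1 is an assumption (the graph may use a different normalization). `log_histogram` and `fold_change` are hypothetical helper names.

```python
import math
from statistics import mean

def log_histogram(intensities, n_bins=10):
    """Bin ln(intensity); return bin edges and frequencies that sum to 1."""
    logs = [math.log(x) for x in intensities]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for v in logs:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, [c / len(logs) for c in counts]

def fold_change(designed, natural):
    """Ratio of mean intensities, designed promoter over natural promoter."""
    return mean(designed) / mean(natural)

# Placeholder per-cell intensities for illustration only.
natural = [100.0, 120.0, 90.0, 110.0]
designed = [250.0, 300.0, 220.0, 270.0]
edges, freqs = log_histogram(designed, n_bins=4)
ratio = fold_change(designed, natural)
```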

2. Stable intensity

Compared to the unstable expression intensity of the natural promoter in different yeast strains, the extremely strong expression intensity of the AI-designed high-expression promoter sequence is not influenced by individual yeast strains and remains stable over a wide range.

Application in practical scenarios

We used the obtained promoter sequence to drive expression of the mucosal vaccine adjuvant LTB in yeast, producing a fusion protein of LTB and GFP. The expression level of this fusion protein was quantified using flow cytometry, and expression was analyzed at both the transcriptional and translational levels. The results are as follows:

Using GAPDH as an internal control, we quantified the expression intensity of LTB-eGFP as Intensity[LTB-eGFP]/Intensity[GAPDH]. The figure below shows that expression driven by our Pymaker-generated promoter is significantly higher than expression driven by natural promoters (p = 0.016), by up to 3 times.
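
The normalization against the internal control can be illustrated with a short Python sketch. All numbers below are placeholders chosen for easy arithmetic, not the measured intensities, and `normalized_expression` is a hypothetical helper name. The statistical test behind the reported p-value is not specified in the text, so it is not reproduced here.

```python
from statistics import mean

def normalized_expression(target, control):
    """Per-sample ratio Intensity[LTB-eGFP] / Intensity[GAPDH]."""
    if len(target) != len(control):
        raise ValueError("target and control need one value per sample")
    return [t / c for t, c in zip(target, control)]

# Placeholder band intensities for three replicates each (illustration only).
designed = normalized_expression([600.0, 560.0, 640.0], [300.0, 280.0, 320.0])
natural = normalized_expression([310.0, 290.0, 300.0], [310.0, 290.0, 300.0])

# Fold difference between the two promoters after normalization.
ratio = mean(designed) / mean(natural)
```

Dividing by the GAPDH signal from the same sample removes differences in loading or total protein, so the remaining ratio reflects the promoter rather than the amount of material.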

We then measured gene expression levels using quantitative RT-PCR. The results indicate that our generated promoters drive much higher transcript accumulation than natural promoters. This strongly supports the conclusion that the generated promoters themselves are responsible for the extremely high expression.

The AI-designed high-expression promoter driving LTB expression outperforms the natural promoter. This demonstrates that our high-expression promoter's robust expression is not influenced by downstream gene sequences and can maintain its intensity. It confirms that the promoter element we designed can be applied in various scenarios as a more efficient, stable, and mutation-resistant alternative to natural promoters.