Model | WHU-China - iGEM 2023

Part 1: sgRNAs Design

Background and goals

On-target score

Specificity score (Lower Off-target probability)

Two stage generation strategy

Results

Part 2: Sampling and delay settings

Part 3: Analysis and correction of recorded data

References

To top

We nominate our team for Best Model. See Awards.

Part 1: sgRNAs Design

This part mainly concerns the software-aided design of gRNAs in multi-cascades editing. Given the specie category and editing system(Cas9 or Cas12 etc.), our method gives out a list of candidate sgRNAs. These sgRNAs have both high on-target scores and low off-target probabilities.

Background and goals

The quality of sgRNA has an important impact on CRISPR experiments, which requires sgRNA to have high on-target activity and low off-target rate. Among them, the on-target activity is related to the nature of the target sequence, the characteristics of the Cas9 enzyme and the homologous repair mechanism. The off-target rate here mainly considers incorrect editing of the genome. The greater the similarity between the target and the irrelevant genome, the higher the off-target rate.

Rencently, many tools have emerged that use deep neural networks to predict the target activity of sgRNA. The work of Hui Kown Kim^[1] et al. encodes sgRNA as a one-hot channel and introduces convolutional neural network to predict the editing activity of a given sequence. On this basis, many scholars (Guohui Chuai^[2] and Daqi Wang^[3] Xiang, X. ^[4] et al) continue to conduct in-depth research on how to predict the target activity of sgRNA and continuously improve the prediction performance of deep networks. Also, Li Xue's article^[5] explores the biological meaning of the parameters of the nervous system,especially the weight of convolution kernal.

However, directly generating a large number of non-interfering sgRNA-splicing sites is still a problem to be studied. In this project, both the editing site and sgRNA are completely designed by us, and we need to achieve the goals of high on-target activity and low off-target rate as much as possible.

On-target score

According to the research of Xiang, X. et al., we used CRISPRon^[4] to predict the on-target score of sgRNA. CRISPRon is a software tool that utilizes deep neural networks to predict sgRNA editing efficiency. It takes a 30nt sgRNA key sequence as input and predicts the relative target efficiency score through a convolutional neural network(CNN) and Gibbs free energy \(Delta G_B\). After training on the real experimental data set ^[6], this method can well evaluate the on-target score of given sgRNA sequences.

1. Input: 30nt sequence containing protospacer on sgRNA

The input region not only includes the 20nt protospacer (binding region), but also includes its upstream 4nt prefix, NGG sequence and downstream 3nt suffix sequence. This allows CRISPRon to fully consider the impact of the binding region and its upstream and downstream regions when making predictions.

Prefix (4nt) -- Target/N20 Protospacer(20nt) -- PAM (3nt, NGG) -- Suffix (3nt)

In our own project, the sequence is:

AGAT——N20 Protospacer——GGG——TTA
or
ACAA——N20 Protospacer——GGG——TTA

Therefore, what we want to design and optimize is only the middle N20 protospacer.

2. One-hot encoding

The input sequence is one-hot encoded into four channels, one for each ATCG. Sequence information is represented in the model as a 30*4 binary matrix.

3. Convoluted and Pooled

Then, the encoded sequene was fed into filters (sized 3, 5, and 7) acting directly. Output convolutions were pooled and flattened and for further computation in fully connected layers.

4. Computing Gibbs free energy

At the same time, CRISPRon uses CRISPRoff^[7] to compute Gibbs free energy \(Delta G_B\). Gibbs free energy is passed to the fully connected layer as additional feature information.

5. Fully connected layers

Finally, we can give a predicted on-target editing efficiency score for each candidate sgRNA sequence. We define \(e_i^{seq}\) describe information from the sgRNA sequence. Then our model \(P_{on}\) gets the on-target score \(p^{on}_{i}\) of such given sgRNA sequence \(i\): \(p_i^{on}=P_{on}(e_i^{seq})\).

Fig 1. Diagram of CRISPRon's network structure^[4]

Specificity score (Lower Off-target probability)

Since the editing recognition of the CRISPR system mainly relies on the binding of N20 protospacer to DNA, we believe that the closer the target sequence is to the genome, the higher the off-target rate. This similarity can be measured using sequence alignment methods, such as Basic Local Alignment Search Tool(BLAST)^[8]. BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance[8].

The genome information of K-12 MG1655 E. coli we used was obtained from NCBI GCF_000005845.2. The candidate 20nt protospacer will be mapped to the E. coli genome to obtain similarity in the form of bit score.

The higher the bit score given by BLAST, the more similar the target sequence is to the genome, and the higher the possibility of off-target. Therefore, we take the reciprocal of the bit score to obtain the specificity score. The higher the specificity score, the more accurate the editing.

Thus, we have the off_target score of a candidate sgRNA: \[p_i^{off}=\text{BLAST}(e_i^{seq})\]

Two stage generation strategy

After deriving the above measures of editing efficiency and off-target rate, we used a two-stage screening strategy to obtain the final candidate sgRNA sequences. First, we generated and selected a batch of N20 protospacers with strong specificity based on genomic information, and then predicted the editing efficiency of their composed sgRNAs. Finally, the sgRNA sequences with both high specificity and high editing efficiency were left and submitted to the wet lab group.

1. Generation of specific sequences

Intuitively, subsequences with strong specificity are easier to combine into long sequences with strong specificity. We split the 20nt protospacer sequence into four 5nt sequences to consider, and calculate their bit scores for \(4^5=1024\) sequences. Subsequently, we select p% of sequences with strong specificity and normalize them with the reciprocal of their bit scores (specificity scores) as the generation probability.

These highly specific 5nt subsequences will generate m (default set to 400) 10nt sequences according to the above probability. Then the same sequence alignment, screening, and generation processes were performed on the 10nt subsequences to obtain the final 20nt protospacers.

Furthermore, the final 20nt sequence will also be calculated by BLAST for bit socre, which will be used for joint evaluation with editing efficiency.

2. Consideration of target editing efficiency

The final m 20nt protospacer sequences will be combined with other parts of the sgRNA to calculate its on-target scores. Finally, each 30nt gRNA fragment will be accompanied by an on-target score and a specificity score. According to the synthesized score formula above, we calculate the final comprehensive score and sort all results according to the comprehensive score. Finally, the top 50 sgRNA sequences with comprehensive scores were selected and submitted to the wet lab group.

Results

Following the above method, we selected 50 good N20 sequences and calculated their on-target scores and specificity scores.These results are presented in the table below and are also available through our Software GitLab.

Among them, the calculation of on-target scores takes into account the case of two upstream and downstream sequences. Case1 means 'ACAA' + N20 + 'GGGTTA', while Case2 means 'AGAT' + N20 + 'GGGTTA'. At the same time, sequences that were not matched to the genome in the specificity scores were given the highest specificity score (ie, Specificity_scores=1).

N20
ATGTAACTAGTTAGTGAGAC
TTAGTGAGACCCTAGGTGTA
CCTAGTTAGTAGCTACCTAG
GTCCTTCTAGCCTAGTTAGT
CCTAGGTGTATCTAAAGTCT
CTAGTTTAGAGACTCACGAG
ACCTAAGTATTTAGTGAGAC
AGCTACCTAGGACTCACGAG
GACCCCCCTCCTAGTTTAGA
CCTAGGACCCCTAGTTTAGA
GACCCCCCTCGACCCCCCTC
GTCTATAGTAGTCTATAGTA
CCTCCCTACTTCTAAAGTCT
CCTCCCTACTACCTAAGTAT
GACCCCCCTCAGCTACCTAG
TAGACCTAGACCTAGTTAGT
TCTAAAGTCTCAAGAAGTAT
TCTAAAGTCTCCTAGGACCC
TCTAAAGTCTGTCCTTCTAG
CAAGAAGTATGACCCCCCTC
CCTAGGTGTACCTAGTTAGT
CTCCTCTACTTCTAAAGTCT
AGCTACCTAGCCTCCCTACT
CTAGCCTAGTCTAGTTACGA
TAGACCTAGATAGTTTCCAA
GACCCCCCTCAGCTACCTAG
TAGTTTCCAAGACCCCCCTC
AGCTACCTAGTTAGTGAGAC
GACTCACGAGGTCTATAGTA
CTAGAAGCTATCTAAAGTCT
ATGTAACTAGCTCCTCTACT
CCTAGGACCCCCTAGGACCC
CTAGTTACGAATAGTACTAG
TAGTTTCCAATTAGTGAGAC
CTAGTTACGAGACCCCCCTC
GACTCACGAGTAGACCTAGA
TAGACTAGAATAGACTAGAA
TCTAAAGTCTTAGACCTAGA
CTAGAGTCTCCTAGTTACGA
TAGACTAGAACTAGAAGCTA
GTCTATAGTACTAGAAGCTA
CTAGTTTAGAATGTAACTAG
ATAGTACTAGCCTCCCTACT
TTAGTGAGACTAGACCTAGA
TCTAAAGTCTGACCCCCCTC
ACTAGCTAGCACTAGCTAGC
TTAGTGAGACAGCTACCTAG
GACCCCCCTCCCTAGGACCC
ACTAGCTAGCCCTAGGTGTA
CCTAGGACCCATGTAACTAG

On-target Scores(Case1)
28.88
55.36
46.28
24.13
53.85
60.84
21.33
65.29
24.74
16.20
27.94
39.90
46.38
44.41
67.92
43.50
24.64
58.65
45.08
45.29
22.50
43.88
34.14
58.10
59.99
67.92
47.44
25.37
52.36
45.24
39.06
44.89
47.01
25.52
40.08
50.39
52.75
27.95
60.28
72.85
61.72
47.12
36.14
47.27
45.21
24.25
66.73
57.36
53.38
59.61

On-target Scores(Case2)
28.71
55.71
49.22
25.80
56.03
61.91
20.39
69.05
25.69
17.36
28.66
41.57
47.44
48.26
70.16
42.07
23.36
58.65
45.51
47.20
23.87
43.24
33.57
58.14
57.98
70.16
46.70
22.71
53.29
46.79
40.41
48.27
47.18
23.18
44.67
51.03
50.66
25.78
58.85
72.38
62.98
46.71
38.85
45.03
44.21
26.99
67.36
60.35
56.66
62.81

Specificity Scores
0.45
1.00
0.45
1.00
0.14
0.04
0.06
0.45
1.00
1.00
1.00
0.45
0.22
0.21
0.45
1.00
0.04
0.15
0.11
0.11
0.45
0.45
0.21
1.00
0.11
0.45
0.41
0.45
0.45
0.09
0.45
1.00
0.06
0.06
10.00
0.11
1.00
0.45
0.45
1.00
1.00
0.11
0.22
1.00
0.06
1.00
0.22
1.00
0.41
0.45

Fold

For more information, please visit our gitlab:On_target_evaluation.ipynb, BLAST_based Specificity Computation and N20_results.

Part 2: Sampling and delay settings

The original problem identified in our CRISPReporter model is not about the inaccuracy of the CRISPR-Cas9 system in recognizing multiple occurrences of a biological stimulus. In fact, the system is functioning as intended, and it accurately performs gene edits whenever the stimulus is present. The core issue here is the rate or frequency at which these gene edits are recorded.

Our system, which utilizes CRISPR-Cas9 for event logging, records each instance of a specific stimulus with a corresponding gene edit. Although this is technically accurate, it can be problematic when the stimulus persists over a given period. In such cases, multiple gene edits could occur in quick succession, which is not desirable for our specific application. We want to limit the frequency of these edits so that within specific time intervals only one gene edit is recorded.

To address this issue, we propose adding an oscillator to the system that governs the frequency of permissible edits. This oscillator would work in tandem with an AND gate to ensure that edits are spaced apart by predetermined time intervals.

The oscillator serves as a biological timer that resets after each gene editing event. Only when the oscillator completes a full cycle will the AND gate permit another round of gene editing. This ensures that, within each specified time period only one gene edit can occur, regardless of how many times the stimulus is detected.

Combined Action

The oscillator and AND gate work in synergy. The AND gate will only trigger the CRISPR-Cas9 system for a new round of editing if two conditions are met:

A stimulus is detected.
The timer is in the recording window.

The oscillator

By ensuring that both these conditions are met, the AND gate allows us to be more confident that each recorded edit corresponds to a separate, genuine biological event, rather than the lingering after-effects of a prior one.

Realization of the oscillator:

We use a classical Van der pol oscillator with the equation: \[ \begin{cases} \frac{dx}{dt} = y \\ \frac{dy}{dt} = \mu \left( 1 - x^2 \right) y - x \\ \end{cases} \] where \( \mu = 10 \).
The threshold for \( X \) is set at 2.2.
Solved into first order equation: \[ \begin{align*} \frac{dy}{dt} &= v \\ \frac{dv}{dt} &= \mu (1 - y^2) v - y \end{align*} \]

The AND gate

Further details of the AND GATE realization.

Fig 3. The function of concentration of X over time

Function formulation

Here we have a promoter concentration function that simulates our biological system after a given light stimulus.
For the Promoter Concentration \( X \) governed by decreasing light intensity: \[ \frac{dx}{dt} = k_1 \cdot \Phi(\text{Light}) - k_2 \cdot X \]
where \( k_1 = 1.0 \), \( k_2 = 0.2 \), and \( \Phi(\text{Light}) = e^{-0.1 \times t} \).

The details are as follows:

1. Generation Term

\[ \text{Generation Term} = k_1 \times \Phi(\text{Light}(t)) \]

This term represents the rate at which the promoter is generated in response to the light stimulus. The light-dependent promoter could be analogous to the blue light promoter EL222 in your system, which can be turned on or off easily. This allows for specific, time-sensitive control over gene expression, a crucial feature for monitoring biomarkers in fluctuating conditions like IBD.

2. Degradation Term

\[ \text{Degradation Term} = k_2 \times X \]

This term accounts for the natural degradation or dilution of the promoter over time. In an IBD context, this is akin to the transient nature of biomarkers like tetrathionate that might vary in concentration due to metabolic activity or other gut processes.

3. Threshold

The Threshold is the minimum concentration of the promoter required for reaction.

If we combine the charts, we can see that the period for the oscillator to be above the activation threshold, above zero in our setting, is longer than the period for the concentration of X to be higher than the required reaction threshold. With this oscillator combined with the AND gate, we can realize our goal: to record only one stimulus within a certain period of time.

Fig 4. Combine promoter concentration with oscillator

For drawing codes and more parameters for this section, please read Sampling_and_Delay.py.

Analysis and correction of recorded data

We posit a cultivation medium replete with bacteria, each harboring a multitude of plasmids denoted as \(m\) , where \(m\) is a large integer. We have designed a specialized CRISPR experiment wherein the plasmids are sequentially excised by the CRISPR system in response to specific stimuli. The excision of subsequent plasmids only proceeds once the preceding plasmid has been successfully excised. To facilitate this, we introduce a periodically expressed Cas9 enzyme to act as an internal circadian rhythm within the bacterial system.

We aim to establish an expression cycle with a duration exceeding that of a single stimulus, thus preventing the occurrence of two Cas9 protein expressions within one stimulus event. However, it is imperative to consider the potential for leakage phenomena, where excision may occur even in the absence of stimuli, albeit at an extremely low probability.

We hypothesize that the probability of plasmid excision is solely dependent on the magnitude of the stimulus, and is independent of the ordinal position of the plasmid within the sequence.

For instance, if the concentration of the stimulating agent during the \(i^{th}\) stimulus is 0.5mol/L, then the probability of excision for the first plasmid in a given bacterium is 0.8. Likewise, for the \(j^{th}\) plasmid in another bacterium, the excision probability is also 0.8. Considering that leakage can be interpreted as a stimulus of exceedingly low magnitude, and that the excision probability is a function of the stimulus concentration, we subsequently treat leakage as a form of stimulus. The probability of excision is estimated using the excision frequency.

This probability is estimated through the excision frequency, denoted as \(x_i\). Given that leakage can be conceptualized as an ultra-low-level stimulus, we incorporate it into this definition as well.

This is predicated on the notion that a single stimulus is insufficient to induce excision in all bacteria. It is evident that \(1\ge x_i \ge 0\). For example, during the first stimulus, only approximately 40% of bacteria may undergo first-level excision. In a subsequent second stimulus, 80% of the remaining 60% may undergo first-level excision again (attributed to the possibility of varying concentrations of the stimulus between events), while 80% within the initial 40% undergo second-level excision.

We introduce a variable \(p_{i}^{j}\) to denote the proportion of bacteria that have undergone at most \(j\) excisions by the \(i^{th} \) stimulus. We stipulate that \(p^{0}_{0} = 1 \), and \(i\) and \(j\) range from 0 to \(n\).

To elaborate, suppose that immediately following the initiation of the experiment, the first stimulus occurs, leading to the excision of the first plasmid in 80% of the bacteria. In this case, \(p_{1}^{1} = 0.8\) . During the subsequent circadian cycle, a second stimulus ensues, leading to excision in 50% of the bacteria. This includes 50% of the bacteria with previously excised first plasmids undergoing second-plasmid excision, as well as 50% of the first plasmids in 20% of the bacteria that did not undergo first-level excision.

Under these circumstances, a total of 0.4 proportion of the bacteria undergo both first and second-level excisions, reaching a maximum of two excisions. Additionally, 0.4 + 0.1 proportion of the bacteria undergo only the first-level excision, reaching a maximum of one excision, while a proportion of 0.1 of the bacteria do not undergo any excision. Consequently, we have \(p_2^{0} = 0.1,p_2^{1} = 0.5,p_2^{2} = 0.4\).

The parameter \(n\) is experimentally determined, signifying that \(n+1\) plasmids were assessed, resulting in \(p_{i}^{n=1} = 0\) and \(p_{i}^{n} \neq 0\). \(n\) can be estimated through the circadian cycle; if the experiment lasted \(T\) minutes, dividing \(T\) by the cycle duration gives the number of times Cas9 appears during the experiment, denoted as \(n\).Evidently, the \(n+1\)th plasmid will not undergo excision, as Cas9 only appears n times, allowing for a maximum of n plasmid excisions.

Leakage does not result in out-of-sequence excision.

We stipulate that \(i=0\) denotes the initial introduction of the plasmid system, where leakage or excision is not possible. Therefore, \(p_0 = [1,0,0\cdots 0]^{\mathrm{T}}\)

We define \(D\) as the matrix representing downward transitions, given as: \[ D=\left[\begin{array}{cccccc}0 & & & & \\1 & 0 & & & \\& 1 & \ddots & \\& & & 0 & \\& & & 1 & 0\end{array}\right] \]

Assuming that as long as the magnitude of the stimulus is constant, the probability for any level of excision remains uniform, we have: \[ p_{i+1} = (1-x_i)p_i + x_iDp_i = A_i p_i \] \[ p_n = \prod_{i=0}^{n-1} A_i p_0 \]

We have only \(p_0\) and \(p_n\) known, and it's important to note that for any \(i\), the 1-norm is 1.

Our current objective is to determine the value of \(x\) in order to ascertain the number of leakage events that have occurred over this period. Components with lower values are attributed to leakage, allowing us to deduce the actual number of stimuli.

We now aim to determine \(x_i, \forall i\) given known \(p_n,p_0\) such that: \[ p_n = (\prod_{i=0}^{n-1}A_i) p_0 \]

Regrettably, it is challenging to find a straightforward analytical solution for this equation, even though an analytical solution objectively exists. This complexity arises because \(p_n\) has n components, and \(\prod_{i=0}^{n-1}A_i\) forms an n-dimensional square matrix, resulting in n unknowns \(x_i\) .

Moreover, utilizing numerical methods for solution-finding also poses significant challenges due to the highly unstable nature of the solutions. Conventional numerical techniques generally consider a difference smaller than \(10^{-6} \) as negligible; however, even under such conditions, the numerically derived solution \(x^{*}\) can still significantly deviate from the true value \(x\) . This issue persists even when the threshold is adjusted to \(10^{-10}\) .

However, from an alternative perspective, if the excision frequency corresponding to leakage is substantially lower than that for a genuine stimulus, a single leakage event would not significantly alter \(p_{i}\) . Specifically, if \(x_{i} < 0.001\), then \(p_{i+1} \thickapprox p_{i}\) , which implies \(p_{i+1}^{i+1} \thickapprox 0\) . On the other hand, in the case of a genuine stimulus that has not been preceded by leakage, \(p_{i+1}^{i+1} \gg 0 \) should hold. Therefore, one can straightforwardly infer the number of leakage events by examining how many components of \(p_n\) are very close to zero.

However, it should be noted that this form of judgment is constrained by the requirement that there must be a substantial disparity between leakage and genuine stimuli. According to calculations, in the case of genuine stimuli, \(x_i >0.6\) , whereas for leakage, \(x_i < 10^{-6}\)

References

Kim, N., Kim, H.K., Lee, S. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat Biotechnol 38, 1328-1336 (2020). https://doi.org/10.1038/s41587-020-0537-9
Chuai, G., Ma, H., Yan, J. et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol 19, 80 (2018). https://doi.org/10.1186/s13059-018-1459-4
Wang, D., Zhang, C., Wang, B. et al. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat Commun 10, 4284 (2019). https://doi.org/10.1038/s41467-019-12281-8
Xiang, X., Corsi, G.I., Anthon, C. et al. Enhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning. Nat Commun 12, 3238 (2021). https://doi.org/10.1038/s41467-021-23576-0
Li Xue, Bin Tang, Wei Chen, and Jiesi Luo. Prediction of CRISPR sgRNA Activity Using a Deep Convolutional Neural Network. Journal of Chemical Information and Modeling 2019 59 (1), 615-624. DOI: 10.1021/acs.jcim.8b00368
SpCas9 activity prediction by DeepSpCas9, a deep learning-based model with high generalization performance. Kim, H.K. et al. Sci Adv 5, eaax9249 (2019).
CRISPR-Cas9 off-targeting assessment with nucleic acid duplex energy parameters. Alkan F, Wenzel A, Anthon C, Havgaard JH, Gorodkin J Genome Biol. 2018 Oct 26;19(1):177
blast.ncbi.nlm.nih.gov/Blast.cgi