We computed a distance matrix for the collected homologous sequences and extracted the distance values for only those sequences where the gRNA fits, using them as an evaluation metric. While the magnitude of the distance-score matters, its significance is less clear: specifically, whether a large distance-score truly indicates a shared common ancestor. Inspired by methods used to estimate the confidence of phylogenetic tree branching, we divided the distance matrix values into two groups: sequences where the gRNA fits and sequences where it does not. We then statistically tested whether the distance distributions of the two groups differ. If they differ significantly, it can be postulated that the difference stems from one group having evolved from a common ancestor while the other did not.

- The distance matrix was computed, and a score for each gRNA was derived by dividing the summed distances by the number of sequences that the gRNA covers.
- This was calculated separately for the two sets of sequences: those containing the gRNA sequence as a substring and those without.
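The grouping and scoring steps above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes that "fits" means an exact substring match, and that each sequence contributes a single distance value (here, a hypothetical distance to a reference sequence stored in `ref_distance`). The sequence IDs, sequences, gRNA, and distances are all made-up toy values.

```python
def split_by_grna(sequences, grna):
    """Split sequence IDs into those that contain the gRNA and those that do not.

    Assumption: a gRNA "fits" a sequence when it occurs as an exact substring.
    """
    fit = [sid for sid, seq in sequences.items() if grna in seq]
    no_fit = [sid for sid, seq in sequences.items() if grna not in seq]
    return fit, no_fit


def distance_score(distances):
    """Sum of the distances divided by the number of sequences in the group."""
    return sum(distances) / len(distances)


# Toy data (hypothetical): four sequences and one distance value per sequence.
sequences = {
    "s1": "AAACGTTTGG",
    "s2": "CCACGTTTAA",
    "s3": "AAAGGTTCGG",
    "s4": "TTAGCTTCAA",
}
ref_distance = {"s1": 0.00, "s2": 0.02, "s3": 0.10, "s4": 0.14}
grna = "ACGTTT"

fit, no_fit = split_by_grna(sequences, grna)
score_fit = distance_score([ref_distance[s] for s in fit])
score_no_fit = distance_score([ref_distance[s] for s in no_fit])
print(fit, score_fit)
print(no_fit, score_no_fit)
```

Keeping the split separate from the scoring makes it easy to reuse the same two groups for the statistical tests later on.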

The distance matrix was computed from an alignment file obtained through a BLAST search targeting the core glucosylase gene.
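As a rough illustration of how a distance matrix can be derived from aligned sequences, the sketch below uses the p-distance (the proportion of differing sites, skipping gap columns). The distance model actually used is not stated here, so the p-distance is an assumption, and the alignment is a made-up toy example held in memory rather than read from a file.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable (gap-free) columns")
    return differing / compared


def distance_matrix(alignment):
    """Build a symmetric nested-dict distance matrix from {id: aligned sequence}."""
    ids = list(alignment)
    return {a: {b: p_distance(alignment[a], alignment[b]) for b in ids} for a in ids}


# Toy alignment (hypothetical sequences of equal length).
alignment = {
    "seqA": "ATGCCGTA",
    "seqB": "ATGCCGTT",
    "seqC": "ATG-CGAA",
}
matrix = distance_matrix(alignment)
for row_id, row in matrix.items():
    print(row_id, [round(v, 3) for v in row.values()])
```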


- To ensure that the DNA distance score (X) is not influenced by the substring, it was normalized using sampling.
- D_i represents the distance of the i-th data point.
- N represents the sample size. For sequences where the gRNA fits, it corresponds to the number of sequences that fit. For those where the gRNA doesn't fit, it corresponds to the number of sequences that don't fit.
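The formula itself is not reproduced in the text. Going by the definitions above (a score X obtained from the distances D_i and normalized by the sample size N), it is presumably the mean distance within each group. The expression below is a reconstruction under that assumption:

```latex
X = \frac{1}{N} \sum_{i=1}^{N} D_i
```

For example, a group of N = 2 sequences with distances 0.00 and 0.02 would give X = 0.01.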

- We investigated whether there was a difference in the distribution of distances between sequences where the gRNA fits and those where it doesn't. The means were compared using a t-test, and the variances were compared using an F-test.

- An F-test was conducted to determine if there was equal variance in the distances between sequences with the substring and those without.

Contains Substring:

Mean: 0.001551310394346494

Standard Deviation: 0.0033599778116293237

Does Not Contain Substring:

Mean: 0.0060427413411937875

Standard Deviation: 0.008760222510708869

F-test for variance:

Statistic: 57.61857103415576

p-value: 8.626561392219624e-14

t-test for means:

Statistic: -3.386874224556088

p-value: 0.0014825895221861304

Contains Substring:

Mean: 0.00019916747993386802

Standard Deviation: 0.0005428100492883209

Does Not Contain Substring:

Mean: 0.019962965062449198

Standard Deviation: 0.01930261649221101

F-test for variance:

Statistic: 691.6574467531636

p-value: 3.1831150037514676e-108

t-test for means:

Statistic: -8.984205877104541

p-value: 1.284330533959997e-13

For many gRNA candidates, we successfully rejected the null hypothesis that "the distribution of the population that fits and the population that doesn't fit is indistinguishable from a random selection." One example is the average DNA distance-score for TCTTTAAGCGATAATTATAC. A difference in the distribution of distances between the populations that fit a given gRNA and those that don't suggests that the gRNA fits populations that have evolved from a common ancestor.

- F-test
  - The F-test is used to test whether two samples have equal variances. If the F value deviates significantly from 1, it suggests that the variances are not equal.
- t-test
  - The t-test is used to test whether the means of two samples are equal.
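The two tests can be sketched with the Python standard library as shown below. This is an illustration under assumptions, not the code that produced the reported numbers: the F statistic is taken as the plain ratio of sample variances, the t statistic is Welch's version (which does not assume equal variances), and the two input lists are made-up toy distances. The text does not say which variant of either test was used, so this sketch may not reproduce the statistics listed above.

```python
from math import sqrt
from statistics import mean, variance


def f_statistic(a, b):
    """Variance-ratio F statistic and its (numerator, denominator) degrees of freedom."""
    return variance(a) / variance(b), (len(a) - 1, len(b) - 1)


def welch_t_statistic(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy distances (hypothetical): sequences where the gRNA fits vs. does not fit.
fit_distances = [0.001, 0.002, 0.000, 0.003, 0.001]
no_fit_distances = [0.010, 0.004, 0.020, 0.008, 0.015, 0.002]

f_value, f_df = f_statistic(no_fit_distances, fit_distances)
t_value, t_df = welch_t_statistic(fit_distances, no_fit_distances)
print("F:", f_value, "df:", f_df)
print("t:", t_value, "df:", t_df)
```

An F value far from 1 points to unequal variances, and a negative t value here means the group where the gRNA fits has the smaller mean distance. P-values would then come from the F and t distributions with these degrees of freedom, for example via `scipy.stats.f.sf` and `scipy.stats.t.sf` if SciPy is available.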