Modeling

Design

We calculated a distance matrix for the collected homologous sequences, extracted the distance values for only those sequences where the gRNA fits, and used them as an evaluation metric. While the magnitude of the distance score is important, its significance remains an open question: specifically, whether a large distance score truly indicates a shared common ancestor. Inspired by methods used to estimate the confidence of phylogenetic tree branching, we divided the distance matrix values into two groups: sequences where the gRNA fits and sequences where it does not. We then statistically tested the difference in distributions between these two groups. If the distance distributions of the two groups differ significantly, it can be postulated that the difference stems from one group having evolved from a common ancestor while the other did not.

Procedure

Distance Matrix Calculation and Scoring:

The distance matrix was computed from an alignment file, which was gathered by conducting a BLAST search targeting the core glucosylase gene.
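
As a minimal sketch of this step, the snippet below builds a pairwise distance matrix from an already aligned set of sequences. The text does not state which distance model was used, so the simple p-distance (fraction of differing sites, skipping gap columns) is an assumption, and the names `p_distance`, `distance_matrix`, and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    ASSUMPTION: p-distance is used only for illustration; the original
    analysis may have used a different distance model.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Toy alignment (hypothetical data, not taken from the BLAST results).
toy = {
    "seqA": "ATGC-TTA",
    "seqB": "ATGCATTA",
    "seqC": "ATGGATTC",
}
matrix = distance_matrix(toy)
for pair, d in matrix.items():
    print(pair, round(d, 4))
```

Storing the matrix as a dictionary keyed by name pairs keeps the later grouping step simple, at the cost of holding only the upper triangle.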

From the distance matrix, the distance values for the sequences that the gRNA covers were summed and divided by the number of those sequences, giving a score for that gRNA.
This was calculated separately for both sets of sequences: those containing the substring and those without.
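
The grouping and scoring described above can be sketched as follows. Several details are assumptions made for illustration: a gRNA "fits" a sequence when it occurs as an exact substring, only pairs whose two members fall in the same group are counted, and the score is the sum of those distances divided by the number of sequences in the group. The function names and toy data are hypothetical.

```python
def split_distances(sequences, distances, grna):
    """Partition pairwise distances by gRNA presence.

    ASSUMPTION: a sequence "fits" if the gRNA is an exact substring of the
    ungapped sequence; pairs mixing both groups are ignored in this sketch.
    """
    fits = {
        name
        for name, seq in sequences.items()
        if grna.upper() in seq.replace("-", "").upper()
    }
    with_grna, without_grna = [], []
    for (a, b), d in distances.items():
        if a in fits and b in fits:
            with_grna.append(d)
        elif a not in fits and b not in fits:
            without_grna.append(d)
    return fits, with_grna, without_grna


def group_score(group_distances, n_sequences):
    """Sum of distances divided by the number of sequences in the group."""
    return sum(group_distances) / n_sequences if n_sequences else 0.0


# Toy data (hypothetical): s1 and s2 contain the gRNA "ACGT", s3 and s4 do not.
sequences = {
    "s1": "TTACGTAA",
    "s2": "GGACGTCC",
    "s3": "TTTTTTTT",
    "s4": "GGGGGGGG",
}
distances = {
    ("s1", "s2"): 0.02,
    ("s1", "s3"): 0.50,
    ("s1", "s4"): 0.60,
    ("s2", "s3"): 0.55,
    ("s2", "s4"): 0.65,
    ("s3", "s4"): 0.10,
}
fits, d_with, d_without = split_distances(sequences, distances, "ACGT")
score_with = group_score(d_with, len(fits))
score_without = group_score(d_without, len(sequences) - len(fits))
print(score_with, score_without)
```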

Normalization:

To ensure that the DNA distance score (X) is not influenced by the substring, it was normalized using sampling.
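
One form of the score that is consistent with the symbol definitions given here, assuming X is the plain average of the N distance values in a group (the exact normalization used in the original analysis is not stated), would be:

```latex
X = \frac{1}{N} \sum_{i=1}^{N} D_i
```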

D_i is the i-th distance value.
N is the sample size. For sequences where the gRNA fits, it is the number of sequences that fit; for sequences where the gRNA does not fit, it is the number of sequences that do not fit.

Statistical Testing:

We investigated whether there was a difference in the distribution of distances between sequences where the gRNA fits and those where it doesn't. The means were compared using a t-test, and the variances were compared using an F-test.
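
To illustrate the comparison of means, the sketch below computes a two-sample t statistic with the standard library only. The text does not say whether the pooled or the unequal-variance form was used; Welch's unequal-variance form is assumed here, and the function name and toy samples are hypothetical. Converting the statistic to a p-value requires the t distribution, which a statistics library would normally provide.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom.

    ASSUMPTION: unequal variances; the original analysis may have used the
    pooled-variance t-test instead.
    """
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Toy distance samples (hypothetical values).
contains = [0.001, 0.002, 0.003]
lacks = [0.010, 0.012, 0.014, 0.016]
t_stat, dof = welch_t(contains, lacks)
print(t_stat, dof)
```

With the "contains" group passed first, a negative statistic means that group has the smaller mean distance.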

Variance Equality Testing:

An F-test was conducted to determine if there was equal variance in the distances between sequences with the substring and those without.
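
A minimal sketch of a variance comparison follows, assuming the classical two-sample F-test in which the statistic is the ratio of the two sample variances. The text does not specify which F-type test produced the reported statistics, so this form is an assumption and need not reproduce those numbers; the function name and toy samples are hypothetical.

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return F = var(a) / var(b) and the (numerator, denominator) degrees of freedom.

    ASSUMPTION: classical variance-ratio F-test. The p-value would come from
    the F distribution with these degrees of freedom.
    """
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    if vb == 0:
        raise ValueError("denominator sample has zero variance")
    return va / vb, len(sample_a) - 1, len(sample_b) - 1


# Toy distance samples (hypothetical values).
lacks = [0.010, 0.012, 0.014, 0.016]
contains = [0.001, 0.002, 0.003]
f_stat, df_num, df_den = variance_ratio_f(lacks, contains)
print(f_stat, df_num, df_den)
```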

Results

The distribution of p-values from t-tests and F-tests

gRNA candidate: TCTTTAAGCGATAATTATAC

Contains Substring:
Mean: 0.001551310394346494
Standard Deviation: 0.0033599778116293237

Does Not Contain Substring:
Mean: 0.0060427413411937875
Standard Deviation: 0.008760222510708869

F-test for variance:
Statistic: 57.61857103415576
p-value: 8.626561392219624e-14

t-test for means:
Statistic: -3.386874224556088
p-value: 0.0014825895221861304

gRNA candidate: TGCTAAGGCTGATGATTCTT

Contains Substring:
Mean: 0.00019916747993386802
Standard Deviation: 0.0005428100492883209

Does Not Contain Substring:
Mean: 0.019962965062449198
Standard Deviation: 0.01930261649221101

F-test for variance:
Statistic: 691.6574467531636
p-value: 3.1831150037514676e-108

t-test for means:
Statistic: -8.984205877104541
p-value: 1.284330533959997e-13

Conclusion

For many gRNA candidates, we rejected the null hypothesis that "the distribution of the population that fits and the population that doesn't fit is indistinguishable from a random selection." For instance, for TCTTTAAGCGATAATTATAC the mean DNA distance score was about 0.0016 among sequences containing the gRNA versus about 0.0060 among those that do not (t-test p ≈ 0.0015). A difference in the distribution of distances between the populations that fit a given gRNA and those that do not suggests that the gRNA fits populations that have evolved from a common ancestor.

Glossary

F-test: The F-test is used to test whether two samples have equal variances. If the F value deviates significantly from 1, it suggests that the variances are not equal.
t-test: The t-test is used to test whether the means of two samples are equal.

iGEM_Gifu_2023