Summary

Our project's primary objective is the swift and accurate detection of pathogens carried by mosquitoes. What sets it apart is its ability to transmit this pathogen information directly to researchers and epidemiologists in real time. To reach this goal, we used the SHERLOCK CRISPR-Cas13 system.

Our approach is simple yet effective. We employ guideRNA sequences tailored to target the viral RNA of pathogens present in mosquitoes. These guideRNA sequences direct the Cas13 enzyme, which identifies and binds to the pathogen's RNA. This binding event triggers a cascade of enzymatic reactions that ultimately produces a fluorescent signal through the sensor molecule BIOTIN-FAM. This fluorescence serves as the readout that lets us detect and identify these pathogens.

However, one of the main challenges is the inherent mutability of viral RNA. Viruses mutate rapidly, which makes it difficult to target and detect them accurately.

Initially, we designed guideRNA and target sequences manually, using the available online tools. While this approach yielded some success, it quickly proved time-consuming and tedious, demanding meticulous precision and substantial effort.

Recognizing the need for a more efficient and accessible solution, we decided to develop specialized software. This software is intended to give researchers and students a user-friendly platform to generate consensus guideRNA sequences swiftly and accurately. By simplifying and automating this critical step, we aim to enable a broader community of scientists and students to contribute to pathogen detection and virology research.

More precisely, the software generates a reverse DNA primer tailored to the target sequence. Combined with the forward DNA primer designed for the enzyme employed in the SHERLOCK experiment, and used in conjunction with RT-PCR, it yields a guide RNA, as described in the Wet Lab section.

In essence, our project combines molecular biology and software development. We are committed to making pathogen detection more accessible and efficient while addressing the ever-present challenge of viral RNA mutability. Our vision is to facilitate greater understanding and control of mosquito-borne diseases and thereby contribute to global health efforts.

On the software side, our primary objective was to develop a tool that researchers can easily reuse, regardless of the target sequence they are working on. We aim to simplify complex processes so that the software is both functional and pleasant to use, and to make researchers' lives easier by automating and speeding up tasks that would otherwise take a lot of time.

Key Features

1/ Custom Consensus Sequence Calculation:

Enables the calculation of consensus sequences from multiple sequence alignments. This can be valuable in bioinformatics and molecular biology for identifying common nucleotides or amino acids in aligned sequences. Moreover, this consensus sequence can be used in the creation of Guide RNA.

2/ Clustering Capabilities:

Utilizes k-means clustering, a powerful unsupervised machine learning technique, to cluster sequences, helping group similar sequences together, aiding various biological and genetic analyses.

3/ Scalability:

Adaptable for different numbers of clusters and desired consensus sequences, enhancing flexibility to accommodate various research requirements. Furthermore, the code works with multiple alignment files independently of the number of sequences in each file.

4/ User Interaction:

Prompts user input, making the software interactive and adaptable to specific research requirements. Moreover, some plots are interactive, allowing the user greater freedom.

5/ Extensibility:

Depending on research requirements, users can extend and integrate the software with additional bioinformatics tools and pipelines for more comprehensive analyses.

Software Development

1) Understanding Phase:

This was the first step, in which the team defined the problem we were trying to solve: coping with the rapid changes in viral RNA and the resulting need for a constant supply of guideRNA sequences for the SHERLOCK system.

2) Defining Phase:

In this stage, the team carefully detailed what the users needed from the software. They set the rules for BLAST searches and sequence alignment, and decided how to measure the analysis of subsequences.

3) Ideation Phase:

This stage involved translating the gathered requirements into a structured plan. The software's architecture was designed to efficiently handle sequence input, analysis, and alignment while ensuring flexibility through user-defined parameters. This included running BLAST searches, applying user filters, aligning sequences, and analyzing subsequences. The team also thought about how to give users the option to choose the best consensus sequences and run additional BLAST searches, making the analysis process more flexible.

4) Prototyping and Testing Phases:

The final stages involved turning the design into actual code and testing it thoroughly with users. The goal was to create a robust, user-friendly tool that could efficiently handle sequence input, analysis, alignment, identification of consensus sequences, and additional searches. The tool was also designed to let users customize results to meet their specific goals. In addition, the software was built using appropriate programming languages and libraries, such as Python and BioPython, to handle biological data efficiently. As code development progressed, rigorous testing was conducted. This included unit testing to ensure individual components functioned correctly and integration testing to verify the interactions between different modules. This phase took the most time during development.

In summary, the software development process began with a comprehensive requirement analysis focused on user needs. This was followed by a design phase, in which the blueprint of the application was laid out. The coding and testing phases then implemented the design and rigorously evaluated its functionality to deliver valuable results for the experiments.

Software Workflow

1. Installation of Packages

The software begins by installing various Python packages, including Biopython, Plotly, scikit-learn, seaborn, numpy, and others. These packages are essential for different aspects of sequence analysis and visualization.

2. BLAST Search / 3. Filtering BLAST Results / 4. Sequence Output

The software initiates a BLAST search using the `NCBIWWW.qblast` function from Biopython. It searches against the "nt" (nucleotide) database with a specified sequence (e.g. "NC_004162.2"). The results are saved in an XML file.

The BLAST results are read from the XML file, and various filtering criteria are applied, such as e-value threshold, identity threshold, and length threshold. The filtered hits are stored in a list.

The software extracts sequences, accessions, titles, and descriptions from the filtered hits. The sequences are then saved to a FASTA file.

**Fig 2:** Performing a BLAST search, filtering the result and saving the resulting FASTA file.
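The search-and-filter steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `blastn` program, the threshold values, and the helper names (`run_blast`, `passes_filters`, `filter_hits`, `to_fasta`) are all assumptions. The Biopython import is done lazily inside `run_blast` so that the filtering logic can be tried offline with stand-in objects that mimic the `expect`, `identities`, `align_length`, and `sbjct` attributes of Biopython's HSP records.

```python
from types import SimpleNamespace


def run_blast(accession, out_path="blast_results.xml"):
    """Submit a query against 'nt' and save the XML (needs network and Biopython)."""
    from Bio.Blast import NCBIWWW  # lazy import: the filter below works without it

    handle = NCBIWWW.qblast("blastn", "nt", accession)  # 'blastn' is an assumption
    with open(out_path, "w") as out:
        out.write(handle.read())
    return out_path


def passes_filters(hsp, max_evalue=1e-5, min_identity=0.90, min_length=100):
    """Return True when an HSP meets all three thresholds (illustrative values)."""
    identity = hsp.identities / hsp.align_length
    return (
        hsp.expect <= max_evalue
        and identity >= min_identity
        and hsp.align_length >= min_length
    )


def filter_hits(alignments, **thresholds):
    """Keep (accession, title, subject sequence) for every HSP that passes."""
    kept = []
    for alignment in alignments:
        for hsp in alignment.hsps:
            if passes_filters(hsp, **thresholds):
                kept.append((alignment.accession, alignment.title, hsp.sbjct))
    return kept


def to_fasta(hits):
    """Format kept hits as FASTA text, stripping alignment gaps from sequences."""
    return "".join(
        f">{acc} {title}\n{seq.replace('-', '')}\n" for acc, title, seq in hits
    )


# Offline demonstration with made-up stand-ins for parsed BLAST records.
demo_alignments = [
    SimpleNamespace(accession="A1", title="good hit", hsps=[
        SimpleNamespace(expect=1e-50, identities=98, align_length=100,
                        sbjct="ACGT-ACGT")]),
    SimpleNamespace(accession="B2", title="weak hit", hsps=[
        SimpleNamespace(expect=0.5, identities=60, align_length=100,
                        sbjct="ACGTTTTT")]),
]
demo_hits = filter_hits(demo_alignments)
print(to_fasta(demo_hits))
```

With real results, the `alignments` argument would come from parsing the saved XML, for example with `NCBIXML.read(open(path)).alignments` from `Bio.Blast`.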

5. Multiple Sequence Alignment

The software performs multiple sequence alignment using Clustal. It uses `ClustalwCommandline` from Biopython to run the ClustalW tool. The user can choose to use MUSCLE instead, in which case the software uses `MuscleCommandline` from the same library. The aligned sequences are saved in an output file, either as a FASTA file or as a Clustal Alignment Format file.

**Fig 3:** Multiple Sequence Alignment with Clustal
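As a rough illustration of this step, the sketch below builds the aligner command line and runs it with `subprocess` instead of going through the Biopython wrappers named above; that substitution is ours. The executable names (`clustalw2`, `muscle`), the file names, and the helper names are assumptions, and the MUSCLE flags shown are those of MUSCLE 3.x.

```python
import subprocess


def build_alignment_command(tool, infile, outfile, output_format="CLUSTAL"):
    """Return the argument list for ClustalW 2.x or MUSCLE 3.x (assumed flag sets)."""
    if tool == "clustalw":
        return ["clustalw2", f"-INFILE={infile}", f"-OUTFILE={outfile}",
                f"-OUTPUT={output_format}"]
    if tool == "muscle":
        # MUSCLE 3.x writes FASTA by default; MUSCLE 5 uses different flags.
        return ["muscle", "-in", infile, "-out", outfile]
    raise ValueError(f"unsupported alignment tool: {tool!r}")


def run_alignment(tool, infile, outfile, output_format="CLUSTAL"):
    """Run the aligner; raises CalledProcessError if it exits with an error."""
    command = build_alignment_command(tool, infile, outfile, output_format)
    subprocess.run(command, check=True)
    return outfile


# Only the command is built here, so the example runs without the aligners installed.
print(build_alignment_command("clustalw", "hits.fasta", "aligned_sequences.aln"))
```

Separating command construction from execution makes the choice between the two tools easy to inspect before any long-running alignment starts.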

6. Subsequence Analysis

This code analyzes a multiple sequence alignment file by extracting subsequences of length 28 nucleotides from each position in the sequences, counting the number of occurrences of each subsequence, and storing the counts in a Pandas DataFrame. This could provide valuable information for the creation of a consensus sequence by identifying conserved regions, highlighting variations, providing frequency and positional information, and facilitating data exploration and visualization.

**Fig 5:** Extraction of subsequences of length 28 nucleotides
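The sliding-window count described above might look like the following sketch. It uses only the standard library; the function name `count_windows`, the choice to key counts by `(position, subsequence)`, and the decision to skip windows containing gap characters are assumptions. The toy sequences use a window of 4 for readability, whereas the software uses 28.

```python
from collections import Counter

WINDOW = 28  # subsequence length used by the software


def count_windows(aligned_sequences, window=WINDOW, skip_gaps=True):
    """Count each (start position, subsequence) pair across aligned sequences."""
    counts = Counter()
    for seq in aligned_sequences:
        seq = seq.upper()
        for start in range(len(seq) - window + 1):
            sub = seq[start:start + window]
            if skip_gaps and "-" in sub:
                continue  # assumption: windows spanning alignment gaps are ignored
            counts[(start, sub)] += 1
    return counts


# Toy alignment (made up) with a short window so the result is easy to read.
toy_alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]
window_counts = count_windows(toy_alignment, window=4)

# Rows in a shape that could be handed to pandas.DataFrame(rows).
rows = [
    {"position": pos, "subsequence": sub, "count": n}
    for (pos, sub), n in sorted(window_counts.items())
]
for row in rows:
    print(row)
```

Windows that occur in many sequences at the same position point to conserved regions, which is the information the consensus step relies on.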

7. Top Consensus Sequences

After calculating the nucleotide frequencies in the multiple alignment, the software identifies the top consensus sequences based on mean frequencies. Users can specify the number of consensus sequences they want to retrieve. The top sequences, mean frequencies, and their positions are displayed. In addition, the software calculates the similarity percentages between all pairs of sequences and displays an interactive similarity matrix, which can tell the researcher whether a clustering algorithm is needed.

**Fig 6-7-8:** Identification of the top consensus sequences based on mean frequencies.
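One plausible reading of "mean frequency" is sketched below: for each alignment column, take the most common nucleotide and its frequency, then score every window by the average of its column frequencies. This interpretation, the tie-breaking by position, and the names `column_consensus` and `top_consensus_windows` are assumptions rather than the project's confirmed method.

```python
from collections import Counter


def column_consensus(aligned_sequences):
    """Return (most common base, its count) for every alignment column."""
    return [Counter(column).most_common(1)[0] for column in zip(*aligned_sequences)]


def top_consensus_windows(aligned_sequences, window=28, top_n=3):
    """Rank consensus windows by the mean frequency of their consensus bases."""
    n_sequences = len(aligned_sequences)
    columns = column_consensus(aligned_sequences)
    scored = []
    for start in range(len(columns) - window + 1):
        chunk = columns[start:start + window]
        seq = "".join(base for base, _ in chunk)
        if "-" in seq:
            continue  # assumption: windows whose consensus contains a gap are skipped
        total = sum(count for _, count in chunk)  # integer sum keeps ties exact
        scored.append((total, start, seq))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score, then leftmost
    return [
        (seq, round(total / (n_sequences * window), 3), start)
        for total, start, seq in scored[:top_n]
    ]


# Toy alignment (made up); the software itself works with 28-nucleotide windows.
toy_alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]
for seq, mean_freq, start in top_consensus_windows(toy_alignment, window=4, top_n=2):
    print(f"{seq}  mean frequency {mean_freq}  start {start}")
```

Each result is a `(sequence, mean frequency, start position)` tuple, mirroring the three pieces of information the software displays.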

8. BLAST Small Sequences

Then, the software prompts the user to enter an organism name to filter out. It conducts a BLAST search using the first sequence from the list of the best consensus sequences, giving the user the option to choose whether the BLAST should target a single organism or all of them. The resulting filtered alignments are then printed. This step verifies that the consensus sequence does not match other sequences found in nature, which could lead to false positives.
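The organism filter in this step can be illustrated with a small sketch that separates hit titles mentioning the chosen organism from all others. Matching on the hit title text, the function name `split_hits_by_organism`, and the example titles are assumptions made for illustration.

```python
def split_hits_by_organism(hit_titles, organism):
    """Separate BLAST hit titles that mention `organism` from those that do not."""
    needle = organism.lower()
    expected, off_target = [], []
    for title in hit_titles:
        if needle in title.lower():
            expected.append(title)
        else:
            off_target.append(title)
    return expected, off_target


# Made-up hit titles for illustration only.
toy_titles = [
    "Example virus A isolate 1, complete genome",
    "Example virus A strain 2, partial sequence",
    "Unrelated organism B chromosome 3",
]
expected_hits, off_target_hits = split_hits_by_organism(toy_titles, "example virus a")
print("off-target hits:", off_target_hits)
```

Any title left in the off-target list marks a sequence from another organism that the consensus sequence also matches, which is the false-positive risk this step is meant to expose.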

9. Optimal Clustering Identification

If the sequences differ too much from one another, the software proposes an alternative based on k-means clustering. The code provides a way of determining the optimal number of clusters using the Elbow and Silhouette methods. Finally, the results are visualized through plots, allowing for effective sequence grouping based on similarity.
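The clustering idea can be sketched without external dependencies as follows. This is a hand-written Lloyd's algorithm with a deterministic farthest-first initialisation, used here in place of a library k-means implementation so that the example stays self-contained; the one-hot encoding of aligned sequences and all names are assumptions. Only the Elbow method (inertia as a function of k) is shown, not the Silhouette method.

```python
ALPHABET = "ACGT-"


def one_hot(seq):
    """Encode an aligned sequence as a flat 0/1 vector, five slots per column."""
    return [1.0 if base == symbol else 0.0
            for base in seq.upper() for symbol in ALPHABET]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm with farthest-first initialisation; returns (labels, inertia)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[label])
                  for p, label in zip(points, labels))
    return labels, inertia


# Toy alignment (made up): two clearly separated groups of three sequences.
toy_alignment = ["AAAAAAAA", "AAAAAAAC", "AAAAAACA",
                 "GGGGGGGG", "GGGGGGGT", "GGGGGGTG"]
points = [one_hot(seq) for seq in toy_alignment]
inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
cluster_labels, _ = kmeans(points, 2)
print("labels for k=2:", cluster_labels)
for k, inertia in inertias.items():
    print(f"k={k}  inertia={inertia:.2f}")
```

In the Elbow method, the value of k after which the inertia stops dropping sharply is taken as a reasonable number of clusters.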

10. Top Consensus Sequences Calculation for Each Cluster

As in step 7, the software calculates the top consensus sequences based on mean frequencies, this time separately for each cluster, and finally prints the requested number of consensus sequences for each cluster, as specified by the user.

**Fig 12:** Top consensus sequences based on mean frequencies for each cluster.
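A simplified sketch of the per-cluster step is shown below. It groups aligned sequences by their cluster label and computes one full-length majority consensus per group; the real software ranks windows by mean frequency within each cluster, so this is a reduced illustration. The names `consensus` and `consensus_per_cluster` and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def consensus(aligned_sequences):
    """Majority base at every alignment column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_sequences))


def consensus_per_cluster(aligned_sequences, labels):
    """Group sequences by cluster label and compute one consensus per group."""
    groups = defaultdict(list)
    for seq, label in zip(aligned_sequences, labels):
        groups[label].append(seq)
    return {label: consensus(members) for label, members in sorted(groups.items())}


# Toy alignment and labels (made up), e.g. as produced by a clustering step.
toy_alignment = ["AAAAAAAA", "AAAAAAAC", "AAAAAACA",
                 "GGGGGGGG", "GGGGGGGT", "GGGGGGTG"]
toy_labels = [0, 0, 0, 1, 1, 1]
cluster_consensus = consensus_per_cluster(toy_alignment, toy_labels)
for label, seq in cluster_consensus.items():
    print(f"cluster {label}: {seq}")
```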

11. Creation of the Reverse DNA Primer

In the final step of the process, the user is prompted to indicate whether they have worked with clusters and which enzyme they will be working with. The reverse complement sequence of the stem loop is then appended to all the consensus sequences that were created, whether or not they are part of a cluster. The user can choose between the stem loops of CasRx and Cas13a.

**Fig 13:** Analysis based on clusters and types of enzymes used: Cas13a or CasRx
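The primer construction described in this step can be sketched as below, following the text literally: the reverse complement of the enzyme's stem loop is appended to each consensus sequence. The stem-loop strings here are short placeholders, not the real Cas13a or CasRx sequences, and the names `reverse_complement`, `STEM_LOOPS`, and `build_reverse_primer` are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


# Placeholder stem loops: NOT the real Cas13a / CasRx direct-repeat sequences.
STEM_LOOPS = {"Cas13a": "GATTTAGAC", "CasRx": "CAAGTAAAC"}


def build_reverse_primer(consensus, enzyme, stem_loops=STEM_LOOPS):
    """Append the reverse complement of the enzyme's stem loop to a consensus."""
    return consensus + reverse_complement(stem_loops[enzyme])


# Works the same for consensus sequences from a single alignment or from clusters.
toy_consensus = ["ACGTACGT", "TTGCAAGC"]
primers = [build_reverse_primer(seq, "Cas13a") for seq in toy_consensus]
for primer in primers:
    print(primer)
```

Passing the stem loops as a dictionary keeps the enzyme choice a simple lookup, so further enzymes could be supported by adding an entry with the appropriate sequence.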

Future Direction

Envisioning the future direction of this project involves considering potential enhancements and expansions that can further empower researchers and students in the field of molecular biology and bioinformatics. Here are some future directions for the project:

1/ Integration of Advanced Algorithms:

Incorporate more advanced bioinformatics algorithms and tools to expand the software's analytical capabilities. This could include implementing machine learning models for predicting guideRNA efficacy, enhancing sequence alignment algorithms, or incorporating deep learning for pattern recognition within sequences.

2/ Customizable Workflows:

Implement a feature that enables users to design and save custom analysis workflows. Researchers could define a series of sequential or parallel analyses, apply filters, and save these workflows for future use, streamlining repetitive tasks.

3/ Community and Forum:

Establish an online community or forum where users can collaborate, seek assistance, and share knowledge. This community-driven approach can foster a supportive environment for users to exchange ideas and troubleshoot issues.

4/ Internationalization:

Offer multi-language support to make the software accessible to users worldwide. This can be especially beneficial for researchers and students in non-English-speaking regions.

5/ Feedback Mechanisms and Maintenance:

Implement user-friendly feedback mechanisms within the software to collect user suggestions and bug reports. Actively incorporate user feedback into future updates and improvements.

The future direction of this project should focus on continuous innovation, user-centric design, and scalability to meet the evolving needs of researchers and students in the dynamic field of molecular biology and bioinformatics.

Getting Started

System Requirements

Before you run the software, please ensure that the following prerequisite is in place:

Python interpreter: this software requires a Python interpreter to run. If you haven't installed Python on your computer, you can download it from Python's official website.

Alignment Options

The software offers flexibility when it comes to performing multiple sequence alignment.

Online alignment: you can choose to perform multiple sequence alignment online through our web interface.

Alignment (On Your Computer)

If you prefer to perform alignment on your local machine, please follow these steps:

Locate the relevant section in the code where alignment is performed.

Depending on your alignment format preference, use one of the following lines:

For Clustal format: `alignment = AlignIO.read("aligned_sequences.aln", "clustal")`

For FASTA format: `alignment = AlignIO.read("aligned_sequences.fasta", "fasta")`

Note on Performance

Please be aware that the execution time can vary depending on the size of your data and the performance of your computer. For large datasets and lower-end machines, alignment may take a significant amount of time.

GitLab Release

You will find our code on GitLab.

Software
CRAFT a’ GUIDE
