Building our biological tool required dedicated software. We set out to solve a technical problem for which we found no existing solution: the only available tool finds guide RNAs for a single gene, whereas our project needs guide RNAs capable of hybridizing to multiple genes of the same class. Our software designs guide RNAs for CRISPR-Cas9 (and its derivatives) across a whole gene database. We ensured that our code was compatible with new workflows. Furthermore, the software is open source and adaptable, which makes it useful to other scientists and iGEM teams, and it includes a user-friendly interface compatible with Windows, Mac, and Linux.
Our project aims to combat antibiotic resistance globally by using the CRISPR tool in combination with a cytidine deaminase. This approach lets us keep a bacterium functional while neutralizing its ability to resist antibiotics. Our first difficulty was finding guide RNAs compatible with this method: few options exist for obtaining guide RNAs that, through the action of the cytidine deaminase, induce nonsense mutations and loss of function in the resistance protein. We did identify an application that meets this challenge, but it can only be applied to one gene rather than to a whole gene database. Our second goal was to discover the ideal motif, but we soon realized this was an impossible task. We therefore looked for a program that could search a database for motifs capable of binding to several genes. In short, our program emerged from the need to find a versatile motif in a database, in this case a beta-lactamase database, to combat antibiotic resistance more effectively.
Our team has created software to analyze FASTA format gene databases and identify relevant patterns. The program is designed to search specifically for motifs chosen by the user according to pre-defined parameters. In addition, the user can filter these motifs by eliminating those also found in FASTA format genome databases. Ultimately, the user is presented with a list of motifs that will not interact with the organism's genome but will exclusively target the desired genes.
In addition, if the user wishes, the program can also isolate motifs that are guaranteed to create a nonsense mutation. Our program uses a search window between the 13th and 17th nucleotides after the PAM motif to identify the desired motifs. After collecting all the data, the program performs a search to rank the motifs according to their coverage rate in the gene database. In other words, it sorts motifs according to the number of genes they can bind. This approach finds the minimum number of motifs capable of binding to the maximum number of genes.
A command prompt, often referred to as 'cmd', is a text-based way of interacting with your computer. It is slightly less user-friendly than a graphical interface, but don't worry: we will provide all the instructions you need to use it and access our app!
1. First, you need to open it:
You will only need to use a few command lines:
2. Prerequisites:
- On Windows, type 'py -m pip install --upgrade pip' in your terminal.
- On Linux/MacOS, type 'pip install --upgrade pip' in your terminal.
Navigate to the directory of your choice using the terminal and execute the following command:
git clone https://gitlab.igem.org/2023/software-tools/insaenslyon1.git
Alternatively, you can download it from this link.
- On Windows, type 'py -m pip install -r dependencies.txt' in your terminal.
- On Linux/MacOS, type 'pip install -r dependencies.txt' in your terminal.
python3 app.py
You can now enter all the information needed to run the program. You should find yourself on a page titled with the name of the program, with several fields to fill in:
Click on start once you have filled in all the information to launch the program.
After clicking, you will be redirected to the download page. Just click on the blue link to start the download. If you left the page, don't worry: go to the "temp directory" and look for the directory corresponding to your job. It's important to clear the "temp directory" regularly, especially when working with large files.
To illustrate this, suppose you want to replicate the steps we performed to obtain the patterns used in our laboratory experiments. You can do this using the Git project you installed earlier.
By following these steps, you will reproduce the process that yielded the patterns we used in our laboratory experiments. You will then find a folder containing several files. The first two are the files you selected at the start; the other two contain the patterns extracted from these data.
We chose to write our software in Python. Even though this language is less efficient and slower than C, it is among the most commonly used for bioinformatics. For this reason, it seemed the most appropriate language to ensure accessibility and to enable researchers or future iGEM teams to use and improve our software.
The program follows a multi-step process to obtain the desired patterns. First, it performs a pattern search by browsing the file. It uses a regular expression to find patterns of the specified length, ending with the pattern chosen by the user.
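The pattern search described above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `find_motifs`, its parameters, and the use of `N` as an "any base" wildcard in the user's ending pattern are assumptions made for the example, and only the strand given as input is scanned.

```python
import re


def find_motifs(sequence, length, suffix):
    """Return {motif: [start positions]} for every window of `length`
    nucleotides that ends with `suffix` ('N' is treated as any base).

    Hypothetical helper: names and the 'N' convention are assumptions.
    """
    suffix_regex = suffix.upper().replace("N", "[ACGT]")
    prefix_len = length - len(suffix)
    if prefix_len < 0:
        raise ValueError("length must be at least len(suffix)")
    # Wrapping the pattern in a lookahead reports overlapping windows too,
    # which a plain pattern would skip.
    pattern = re.compile(r"(?=([ACGT]{%d}%s))" % (prefix_len, suffix_regex))
    hits = {}
    for match in pattern.finditer(sequence.upper()):
        hits.setdefault(match.group(1), []).append(match.start())
    return hits


# Toy sequence: two 5-nucleotide windows end with "NGG".
print(find_motifs("AAACCGGTTTAGG", 5, "NGG"))
```

On this toy input the sketch reports `ACCGG` at position 2 and `TTAGG` at position 8. In practice the length would be that of the guide plus its PAM, and the sequence would come from each record of the FASTA file.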
After obtaining the gene patterns, the program repeats this operation for the reference genome if the user has supplied it. Next, it performs a binary search to retain only those motifs that are exclusively present in the gene database and not in the genome database.
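The filtering step described above might look like the following sketch, which uses Python's standard `bisect` module for the binary search. The function name `keep_gene_specific` and the plain-list inputs are assumptions for illustration, not the project's actual interface.

```python
from bisect import bisect_left


def keep_gene_specific(gene_motifs, genome_motifs):
    """Keep the motifs found in the gene database but absent from the genome.

    Hypothetical helper: the genome motifs are sorted once, then each gene
    motif is looked up by binary search.
    """
    sorted_genome = sorted(set(genome_motifs))

    def in_genome(motif):
        i = bisect_left(sorted_genome, motif)
        return i < len(sorted_genome) and sorted_genome[i] == motif

    return [m for m in gene_motifs if not in_genome(m)]


print(keep_gene_specific(["AAA", "CCC", "GGG"], ["CCC", "TTT"]))
```

Here `CCC` is discarded because it also occurs in the genome list. A Python `set` would give the same result with less code; the sorted list is used only to mirror the binary search mentioned in the text.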
Once the motifs have been extracted, if the user has selected the mutation option, the program performs mutations for each gene-pattern pair to obtain the desired nucleotide sequence. The mutations are applied directly to the nucleotide sequence. Once the mutations have been made, the sequence is translated into proteins.
The program then counts the number of stop codons to determine whether the mutation has resulted in the appearance of a stop codon. In addition, it assesses for each motif whether the mutation will inevitably result in a stop codon or whether certain mutations will result in a different codon. All this information is stored in a file in CSV format.
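As an illustration of the mutation and stop-codon check, here is a minimal sketch under several assumptions that do not come from the original code: the gene is an in-frame coding sequence starting at index 0, the editing window is given as 0-based indices into that sequence, the deaminase turns C into T on the strand shown, and "inevitably" is read as "every non-empty subset of the editable C's adds a stop codon". All function names are hypothetical.

```python
from itertools import combinations, product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def deaminate(dna, start, end):
    """Convert every C in dna[start:end] to T (assumes full editing)."""
    dna = dna.upper()
    return dna[:start] + dna[start:end].replace("C", "T") + dna[end:]


def new_stop_codons(dna, start, end):
    """Number of stop codons gained when the whole window is edited."""
    return translate(deaminate(dna, start, end)).count("*") - translate(dna).count("*")


def stop_is_guaranteed(dna, start, end):
    """True if every non-empty subset of C->T edits in the window adds a stop."""
    dna = dna.upper()
    baseline = translate(dna).count("*")
    c_positions = [i for i in range(start, min(end, len(dna))) if dna[i] == "C"]
    if not c_positions:
        return False
    for size in range(1, len(c_positions) + 1):
        for subset in combinations(c_positions, size):
            edited = list(dna)
            for i in subset:
                edited[i] = "T"
            if translate("".join(edited)).count("*") <= baseline:
                return False
    return True


# CAA (Gln) becomes TAA (stop) when its C is edited.
print(new_stop_codons("ATGCAAGGT", 3, 6), stop_is_guaranteed("ATGCAAGGT", 3, 6))
```

In `ATGCAAGGT`, editing the window covering `CAA` always yields `TAA`, so the stop codon is guaranteed. In `ATGCCAGGT`, no combination of edits to `CCA` produces a stop codon, so the same check returns `False`. Because the window is only a few nucleotides wide, enumerating every subset of C's stays cheap.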
We then search for the minimum number of guide RNAs capable of binding to as many genes of our class as possible. To do so, we represent our list of genes as an array of 0s and 1s in which each index stands for a gene. The association between a gene and its index is stored in a dictionary for easy access. Each guide RNA motif gets its own gene array, with a 1 at the index of every gene it binds and a 0 elsewhere. Once the motifs are associated with their gene arrays, we filter them by checking whether one gene array includes another. To do so, we go through our list of motifs sorted by the number of genes they cover, starting with those with the most coverage.
In this loop, for each motif, we go through the list a second time, this time starting with the motifs that cover the fewest genes, and compare the gene arrays. If all the genes of the less covering motif are included in those of the more covering motif, the less covering motif is filtered out and will not be processed in the first loop. This method discards motifs with redundant coverage. It filters out many motifs, but some overlap remains: even if the gene array of a motif is not included in any single other array, a combination of other gene arrays can still include it. That is why we compute the importance of each motif.
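The gene arrays and the inclusion filter can be sketched as follows. This is an illustrative reconstruction from the description, not the project's code: the names `build_gene_arrays` and `drop_included`, and the dictionary mapping each motif to the genes it binds, are assumptions.

```python
def build_gene_arrays(motif_to_genes, genes):
    """Return ({motif: 0/1 list}, {gene: index}) for the given gene list."""
    index = {gene: i for i, gene in enumerate(genes)}
    arrays = {}
    for motif, bound_genes in motif_to_genes.items():
        row = [0] * len(genes)
        for gene in bound_genes:
            row[index[gene]] = 1
        arrays[motif] = row
    return arrays, index


def drop_included(arrays):
    """Remove motifs whose gene array is included in a larger motif's array."""
    # Outer loop: most covering motifs first.
    order = sorted(arrays, key=lambda m: sum(arrays[m]), reverse=True)
    removed = set()
    for big in order:
        if big in removed:
            continue
        # Inner loop: least covering motifs first.
        for small in reversed(order):
            if small == big or small in removed:
                continue
            if all(b >= s for b, s in zip(arrays[big], arrays[small])):
                removed.add(small)
    return [m for m in order if m not in removed]


genes = ["g1", "g2", "g3", "g4"]
motif_to_genes = {
    "m1": ["g1", "g2", "g3"],
    "m2": ["g1", "g2"],
    "m3": ["g3", "g4"],
    "m4": ["g4"],
}
arrays, gene_index = build_gene_arrays(motif_to_genes, genes)
print(drop_included(arrays))
```

In this toy example `m2` is dropped because `m1` covers all of its genes, and `m4` is dropped because `m3` does. `m1` and `m3` still overlap on `g3`, which is the residual overlap the importance computation is meant to handle.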
First, we add up all the gene arrays to obtain a full array in which every cell is at least 1; cells with a value greater than 1 indicate overlap. We then remove each motif in turn from the full array and count how many cells drop to zero. This count is the impact of the motif on gene coverage. If a motif's impact equals the number of genes it covers, the motif is essential, as no combination of other motifs can cover its genes. We then sort the motifs by impact and use the impact to search for the best combination of motifs. In the end, we obtain a list ordered by motif importance, where importance is the value that a motif adds to the overall coverage. The first motif in this list is therefore optimal if we want to use only one guide RNA, the second is optimal when used with the first to get the maximum coverage with two guide RNAs, and so on. This step outputs two CSV files. One contains the list of motifs sorted by importance, with the percentage of coverage reached when each motif is combined with the previous ones. The other lists the motifs with their associated genes, ordered by importance.
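A small sketch may help make the impact and the importance ordering concrete. The impact function follows the description directly. The ordering function is only one plausible reading of "the best combination": it assumes a greedy choice, where each step picks the motif that adds the most not-yet-covered genes, and it may differ from the project's actual search. The names `motif_impacts` and `rank_by_importance` are hypothetical.

```python
def motif_impacts(arrays):
    """For each motif, count the genes that only this motif covers."""
    if not arrays:
        return {}
    width = len(next(iter(arrays.values())))
    # Full array: how many motifs cover each gene.
    total = [sum(row[i] for row in arrays.values()) for i in range(width)]
    return {
        motif: sum(1 for t, v in zip(total, row) if v and t - v == 0)
        for motif, row in arrays.items()
    }


def rank_by_importance(arrays):
    """Greedy ordering (an assumption): returns [(motif, cumulative % coverage)]."""
    if not arrays:
        return []
    n_genes = len(next(iter(arrays.values())))
    covered = [0] * n_genes
    remaining = dict(arrays)
    ranking = []

    def gain(motif):
        # Genes this motif would add to the current coverage.
        return sum(1 for c, v in zip(covered, remaining[motif]) if v and not c)

    while remaining:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # nothing left adds coverage
        row = remaining.pop(best)
        covered = [1 if (c or v) else 0 for c, v in zip(covered, row)]
        ranking.append((best, 100.0 * sum(covered) / n_genes))
    return ranking


toy_arrays = {"m1": [1, 1, 1, 0], "m3": [0, 0, 1, 1]}
print(motif_impacts(toy_arrays))
print(rank_by_importance(toy_arrays))
```

With these two toy motifs, `m1` has an impact of 2 (it is the only one covering the first two genes) and `m3` an impact of 1. The greedy ordering places `m1` first with 75% coverage and reaches 100% once `m3` is added.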
In addition, we set out to develop an artificial intelligence-based program to predict the effectiveness of the guide RNAs we use. To this end, we relied on data from the paper we mentioned earlier. We used their training data in conjunction with the patterns generated by our program. We experimented with two types of regression, first a Lasso-type regression and then a regression based on the random forest method.
However, it is important to note that the training data was based on the E. coli genome, which differs considerably from the multiple species present in our beta-lactamase database.
We have provided a simplified notebook so that other teams can explore these methods with training and test data they generate themselves. We chose not to include this AI approach in our final software because our training dataset led to poor results. However, our functions can still extract the gene context needed for the AI approach, which future iGEM teams could implement.
Our goal was to create software that anyone can use to design guide RNAs. We succeeded, but there is still room for improvement, such as:
Reuter A, Hilpert C, Dedieu-Berne A, Lematre S, Gueguen E, Launay G, Bigot S, Lesterlin C. Targeted-antibacterial-plasmids (TAPs) combining conjugation and CRISPR/Cas systems achieve strain-specific antibacterial activity. Nucleic Acids Res. 2021 Apr 6;49(6):3584-3598. doi: 10.1093/nar/gkab126. PMID: 33660775; PMCID: PMC8034655.
Bioinformatics Lab. (2019). "Be-Designer: A Tool for Genome Editing Design." Accessed on October 11, 2023. http://www.rgenome.net/be-designer/