SOFTWARE

Restriction-Modification (R-M) systems are widespread in prokaryotes and present one of the most significant barriers to genetic tractability in non-model organisms. This barrier is evident in M. aeruginosa, which possesses an extensive and robust R-M system that remains largely uncharacterized [1]. Our team recognized that effectively engineering M. aeruginosa would require careful consideration of its R-M system, which is known to be strain-specific. This barrier was particularly frustrating in the context of a time-constrained iGEM project. We felt that the greatest contribution TABI could make to the iGEM community would be a generalizable solution to restriction site avoidance. We developed the Chameleon project to provide an automated platform that helps future iGEM teams engineer non-model species more effectively.

Our software project, Chameleon, is built around the Stealth program written by David L. Bernick. Chameleon enhances the usability of Stealth by streamlining the application of its resultant motif data. By releasing Chameleon, we hope to catalyze the adoption of this incredibly powerful tool. The Chameleon project provides a comprehensive, easy-to-use, automated way to apply the bioinformatic insight of Stealth in practice, and contributes to the field of synthetic biology by enabling the evasion of R-M systems in non-model species with desirable phenotypes, with the click of a button. (gitlab)

As cutting-edge genomic sequencing technologies mature, they become more accessible and less costly. The same is not yet true of specialized technologies that reveal methylomes, such as single-molecule real-time (SMRT) sequencing. Using methylomic data to predict R-M sites is an established approach; however, its high cost prevents adoption on the same scale as genomic sequencing.

In contrast to established methods for characterizing R-M components, Stealth performs a statistical comparison between the genomic counts of motifs within a specified size range (k-mers) and the counts expected under a genome-specific null model derived from a Markov chain of order k-2. Stealth outputs a list of motifs that are systematically underrepresented in the genome, which are expected to include novel R-M substrate motifs. In a 2022 paper, Bernick showed that synthetic DNA designed to avoid some of these underrepresented motifs in H. pylori yielded improvements in transformation efficiency of >10,000% [2].
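The comparison described above can be illustrated with a small sketch. It assumes the common maximal-order Markov estimate, in which the expected count of a k-mer is derived from the counts of its two (k-1)-mers and their shared (k-2)-mer, and a simple binomial approximation for the Z-score. The function names and the variance formula are illustrative assumptions and are not taken from the Stealth source.

```python
from collections import Counter
from math import sqrt


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq: str, motif: str) -> float:
    """Expected count of motif under a Markov model of order k-2.

    E(w) = N(w[:-1]) * N(w[1:]) / N(w[1:-1])
    """
    k = len(motif)
    if k < 3:
        raise ValueError("this sketch needs motifs of length 3 or more")
    shorter = count_kmers(seq, k - 1)
    core = count_kmers(seq, k - 2)[motif[1:-1]]
    if core == 0:
        return 0.0
    return shorter[motif[:-1]] * shorter[motif[1:]] / core


def z_score(seq: str, motif: str) -> float:
    """Approximate Z-score (assumption: binomial variance over all k-mer positions)."""
    n = len(seq) - len(motif) + 1
    observed = count_kmers(seq, len(motif))[motif]
    expected = expected_count(seq, motif)
    variance = expected * (1 - expected / n)
    if variance <= 0:
        return 0.0
    return (observed - expected) / sqrt(variance)
```

A strongly negative Z-score means the motif occurs far less often than its sub-motifs would predict, which is the signal Stealth reports.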

The significance of under-represented motifs

R-M systems are typically composed of a suite of restriction nucleases and cognate methyltransferases. The restriction component provides a defense strategy by selectively cleaving DNA at recognition sites, which are often short (4-6 base pair) reverse-complement palindromic motifs. The modification component, in turn, shields the host genome by enzymatically modifying potential cleavage sites within the host DNA. Meselson and Stahl (1958) demonstrated the semi-conservative model of bacterial double-stranded DNA genome replication [3]. This process yields chromosomal copies that are transiently hemi-methylated, because the nascent strand of each daughter chromosome has not yet undergone protective methylation by the R-M system. What ensues is a race between the R-M cleavage machinery and the protective DNA methylation processes. This molecular interplay imparts a selective pressure on the genome toward fewer endogenous R-M recognition motifs. Over time, these motifs become systematically underrepresented in the genomic sequence, which is what allows them to be identified.

Chameleon encapsulates a command line interface (CLI) tool in the form of a software pipeline titled Plasmid-Stealth (pstealth) and the series of importable modules used to construct it.

The Chameleon project is straightforward to install with the pip package installer. Make sure you have a Python 3.10+ distribution and pip on macOS/Linux, then run pip install chameleontools in the terminal.

For more detailed instructions on how to install Chameleon, visit our iGEM gitlab repository found here and follow the Installation steps detailed in the README.

The Plasmid-Stealth Pipeline

The main functionality of our project Chameleon is the Plasmid-Stealth [pstealth] pipeline found in the chameleontools package. pstealth is a pipeline built around Stealth that takes an annotated plasmid sequence and Stealth-optimizes it for transformation and expression in a target host.

pstealth is a command-line interface tool that can be called in a terminal with the following usage:

        pstealth --genome (-g) [genome file] --plasmid (-p) [plasmid file] --outfile (-o) [outfile | default: stdout] -[zPMmrski]
          Optional Args:
            --zScore (-z) [zscore cutoff value | default: -4]
            --pseudo (-P) [pseudo-count value | default: 0]
            --max (-M) [maximum motif size | default: 8]
            --min (-m) [minimum motif size | default: 1]
            --palindrome (-r) [Remove RC palindromes only | default: off]
            --silent (-s) [Hide report message | default: show]
            --keep (-k) [Adds annotations to consider in Mutable Regions | default: {'ORF','gene','CDS'}]
            --ignore (-i) [Adds annotations to ignore when defining Mutable Regions | default: {'source'}]
        
      

pstealth takes two required inputs: an annotated plasmid, --plasmid (-p), in GenBank format and the genome of the target host, --genome (-g), in either GenBank or FastA format.

For well-annotated genomes, opt for an annotated GenBank record to obtain the most accurate relative codon frequencies.

For both genome and plasmid inputs, the GenBank record must contain the nucleotide sequence of the input.

A run of pstealth outputs a plasmid GenBank record that is identical to the input plasmid except that its sequence is Stealth-optimized.

Usage Example

In this example, the pstealth pipeline was used to optimize an example plasmid for transformation and expression in E. coli. The -r flag was used to selectively remove only the reverse-complement palindromes identified by Stealth analysis.

Now for a comparison of the original plasmid and our Stealth-modified plasmid

Plasmid Preparation

Mutable regions are sections of a plasmid's coding sequence that can tolerate mutation without breaking functionality. The PlasmidParse class from the SeqParser module identifies mutable regions by finding CDS/ORF regions that do not overlap non-coding DNA features or other CDS/ORF regions, because mutations in such overlaps could render a plasmid nonfunctional. To preserve the translation of alternative start codons, the first codon of every CDS/ORF region is protected and treated as immutable. As a special case, overlapping CDS/ORF regions are treated as mutable when the two regions sit in the same reading frame.
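The rules above can be sketched with plain interval arithmetic. This is a simplified illustration and not the PlasmidParse implementation: it assumes forward-strand features given as half-open (start, end) coordinates, it treats every overlap as immutable (the same-frame special case is omitted), and the function name mutable_regions is hypothetical.

```python
def mutable_regions(cds_list, protected):
    """Return in-frame sub-intervals of each CDS that overlap nothing else.

    cds_list  -- list of (start, end) half-open CDS/ORF intervals
    protected -- list of (start, end) half-open non-coding feature intervals
    """
    regions = []
    for i, (start, end) in enumerate(cds_list):
        # Everything that must stay untouched: non-coding features and other CDSs.
        blockers = protected + [c for j, c in enumerate(cds_list) if j != i]
        # Skip the first codon so the start codon is never altered.
        free = [(start + 3, end)]
        for b_start, b_end in blockers:
            next_free = []
            for f_start, f_end in free:
                if b_end <= f_start or b_start >= f_end:
                    next_free.append((f_start, f_end))  # no overlap
                    continue
                if f_start < b_start:
                    next_free.append((f_start, b_start))  # piece before blocker
                if b_end < f_end:
                    next_free.append((b_end, f_end))  # piece after blocker
            free = next_free
        # Trim each piece to whole codons in the frame of the parent CDS.
        for f_start, f_end in free:
            f_start += (-(f_start - start)) % 3
            f_end -= (f_end - start) % 3
            if f_end - f_start >= 3:
                regions.append((f_start, f_end))
    return regions
```

For two overlapping CDSs at (0, 30) and (24, 60) and a protected feature at (50, 55), the sketch keeps (3, 24) from the first CDS and (30, 48) and (57, 60) from the second.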

The pstealth pipeline only accepts a GenBank record for the target plasmid, because PlasmidParse relies on annotations to define what is and is not mutable. This section covers how to prepare a plasmid for a successful run through the pstealth pipeline.

Annotations

First, make sure all important non-coding regions are properly annotated. These include, but are not limited to:

  • Origin of Replication
  • Origin of Transfer
  • Promoters
  • Terminators
  • Ribosome Binding Sites
Also annotate any CDS/ORFs important to the functionality of your plasmid, making sure each annotation uses the feature type 'CDS', 'ORF', 'gene', or something distinct and descriptive. These annotations must be accurate: each should begin at the start codon, end at the end of the stop codon, and point in the correct direction.

If you already have a plasmid whose CDS/ORFs are accurately annotated with feature types other than "CDS", "gene", or "ORF" (e.g., "antibiotic_resistance", "reporter_protein", etc.), you can specify these annotations with the --keep (-k) optional argument as follows:
          
            pstealth -p plasmid.gb -g genome.gb --keep reporter_protein
          
        
Furthermore, if there are annotations whose feature labels do not signal an actual function and can be ignored, the --ignore (-i) optional argument can be used to disregard sections with a specified annotation.
Example:
          
            pstealth -p plasmid.gb -g genome.gb --ignore misc_feature
          
        

Genome Input

The genome input for the pstealth pipeline accepts either a GenBank record or a FastA file, identified by the file extension (.gb/.gbk or .fa/.fasta).
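An extension check of this kind might look like the sketch below. The function name detect_format and the case-insensitive matching are assumptions made for illustration; the actual check lives inside the SeqParser module.

```python
from pathlib import Path

GENBANK_EXTENSIONS = {".gb", ".gbk"}
FASTA_EXTENSIONS = {".fa", ".fasta"}


def detect_format(filename: str) -> str:
    """Classify an input file as 'genbank' or 'fasta' from its extension."""
    suffix = Path(filename).suffix.lower()  # assumption: case is ignored
    if suffix in GENBANK_EXTENSIONS:
        return "genbank"
    if suffix in FASTA_EXTENSIONS:
        return "fasta"
    raise ValueError(f"unsupported genome file extension: {suffix!r}")
```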

Well-annotated genomes can be input using a GenBank record containing all CDS/gene annotations. By default, CDS/gene annotations are kept and used to build a relative codon usage frequency table that is then used to regenerate gene sequences later in the pipeline. When using a GenBank record as the genome input, it is important to include both the annotations and the sequence. If either is missing from the GenBank record, the pipeline will fail to run.
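To make the idea of a relative codon usage frequency table concrete, here is a minimal sketch that builds one from a list of CDS strings using the standard genetic code. It is an illustration under stated assumptions, not the CDSanalyzer implementation: the function name codon_frequencies is hypothetical, and codons containing ambiguous bases are simply skipped.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_frequencies(cds_list):
    """Return {amino_acid: {codon: relative frequency}} for the given CDSs."""
    usage = defaultdict(lambda: defaultdict(int))
    for cds in cds_list:
        cds = cds.upper()
        # Walk whole codons only; a trailing partial codon is ignored.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            aa = CODON_TABLE.get(codon)
            if aa is not None:  # skip codons with ambiguous bases
                usage[aa][codon] += 1
    return {
        aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
        for aa, codons in usage.items()
    }
```

The frequencies for each amino acid sum to 1, so the table can be used directly as sampling weights when a coding sequence is regenerated.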

If opting to use a FastA record, CDS/ORF regions are selected using a naive ORF finder included in the ORFfinder class from the ORFfinder module. ORFs are selected by looking for the ATG start codon and keeping the longest gene found within a reading frame as a gene candidate. Keep in mind that, because of how CDS/ORFs are gathered from a FastA file, the relative codon usage frequency statistics may not reflect the true usage statistics of the species.
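A naive ORF search of this kind can be sketched as follows. This is a simplified illustration, not the ORFfinder class: it scans only the three forward reading frames, it interprets "longest" as keeping the first ATG seen before each stop codon, and the function name find_orfs is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_length: int = 100):
    """Return (start, end) half-open coordinates of forward-strand ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_length:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Because no gene model is involved, such a search reports any sufficiently long ATG-to-stop stretch, which is one reason the resulting codon statistics can differ from those of a curated annotation.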

Output File

The output file of a successful pipeline run is defaulted to STDOUT if the --outfile (-o) option is not specified. The output file is simply a copy of the plasmid input file with a Stealth-optimized sequence. This preserves the annotations of the original plasmid.

Optional Arguments

The Z-score argument allows a user to determine the Z-score cutoff value that Stealth uses to determine if a motif is underrepresented. By default, this value is -4.0.

--zScore (-z) accepts a single valid integer if called.

The pseudo-count argument allows a user to set the pseudo-count of all k-mer motifs during Stealth analysis. This pseudo-count is added to all motif counts regardless of whether a given motif appears in the genome. By default the pseudo-count is 0.

--pseudo (-P) accepts a single non-negative integer if called.

The maximum motif size argument allows the user to specify the largest k-mer motif to be analyzed by Stealth. The motif search space grows as 4^k, where k is the maximum motif size. By default the maximum motif size is 8.

--max (-M) accepts a single positive integer in the range [2,9].
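As a quick illustration of why the maximum motif size is capped, the sketch below counts the candidate motifs for a given size range. The helper name motif_search_space is hypothetical.

```python
def motif_search_space(max_k: int, min_k: int = 1) -> int:
    """Number of distinct DNA k-mers with min_k <= k <= max_k."""
    return sum(4 ** k for k in range(min_k, max_k + 1))


for k in (6, 8, 9):
    print(k, 4 ** k, motif_search_space(k))
```

Raising the maximum from 8 to 9 adds 262,144 motifs of length 9 alone, three times the 87,380 motifs that make up the whole search space up to length 8.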

The minimum motif size argument allows the user to specify the smallest sized kmer to be reported by Stealth. Analysis still occurs but kmers shorter than the minimum motif size are omitted from being marked for avoidance.

--min (-m) accepts a single positive integer in the range [1,8].

The palindrome flag restricts removal to the Stealth-identified motifs that are reverse-complement palindromes. In living systems, many restriction enzymes that are a part of R-M systems recognize these reverse-complement palindromic sequences [2]. Selecting only reverse-complement palindromes for avoidance can be a more conservative strategy for Stealth-optimizing a plasmid. By default the palindrome flag is off.

--palindrome (-r) is a boolean flag that is toggled on when called.
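A motif is a reverse-complement palindrome when it reads the same as its own reverse complement, as the EcoRI site GAATTC does. The sketch below shows one way to test for this; the helper names are illustrative and are not the functions used inside chameleontools.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGTN."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_rc_palindrome(motif: str) -> bool:
    """True if the motif equals its own reverse complement."""
    return motif.upper() == reverse_complement(motif)


def palindromes_only(motifs):
    """Keep only the reverse-complement palindromic motifs, preserving order."""
    return [m for m in motifs if is_rc_palindrome(m)]
```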

The silent flag allows the user to disable the final report message for a pipeline run. By default the silent flag is off, showing the final report message.

--silent (-s) is a boolean flag that is toggled on when called.

The keep annotations argument allows a user to define additional feature annotations as valid mutable regions on a plasmid. Mutable regions are defined as CDS/ORF regions that can be modified to change the nucleotide sequence while preserving the coded amino acid sequence. Modifications are made by swapping in synonymous codons to avoid Stealth motifs in a coding sequence. By default, the feature annotations defined as mutable are {'CDS','ORF','gene'}.

--keep (-k) accepts one or more arguments as a space-separated list of case-sensitive strings.
Example

                
                  pstealth -g genome.gb -p plasmid.gb --keep cool_feature1 nice_feature2 weird_feature3
                
              

The ignore annotations argument allows a user to add feature annotations to ignore when parsing out mutable regions from a plasmid. Mutable regions are defined as CDS/ORF regions that can be modified to change the nucleotide sequence while preserving the coded amino acid sequence. Modifications are made by swapping in synonymous codons to avoid Stealth motifs in a coding sequence. By default, the feature annotations that are ignored are {'source'}.

--ignore (-i) accepts one or more arguments as a space-separated list of case-sensitive strings.
Example

                
                  pstealth -g genome.gb -p plasmid.gb --ignore useless_feature1 obsolete_feature2 irrelevent_feature3
                
              

chameleontools Module Documentation

Development of the Chameleon project and chameleontools started when we wrote software to process the Stealth output of M. aeruginosa and realized we could write a pipeline to automate what was a manual and time-consuming process. This software ultimately became what is now the StealthParser module.

As the software we were writing evolved into the Chameleon project, we realized the importance of each module serving a specialized purpose and have written and organized the chameleontools Python package to reflect that sentiment. chameleontools contains 7 distinct modules that can be imported into any Python script and freely used. Each importable module found within chameleontools was written to be flexible and serves a distinct function for uses far beyond just the pstealth pipeline.

To use any module in your own Python project, simply import them as follows.

        
          # Python3.x
          import chameleontools.ChromatoSeq # Motif removal from sequence
          import chameleontools.CodonAnalyzer # Codon Analysis of CDS
          import chameleontools.FastAreader # FastA file reader
          import chameleontools.ORFfinder # Very basic ORF finder
          import chameleontools.SeqParser # Plasmid and Genome input handling
          import chameleontools.Stealth # Stealth analysis
          import chameleontools.StealthParser # Stealth output parser
        
      
Each module can function independently of the others and shares the dependencies of the parent chameleontools package.
Below is detailed documentation of the contents of each module.

Documentation

ChromatoSeq is a module for generating stochastic motif-avoidant ORFs

Stochastically regenerates ORFs while avoiding input motifs, using a defined relative codon usage frequency table.

Parameters

motifs

A collection of motifs to avoid. Compatible with StealthV0 motifs.

Methods

motifInSeq(seq: str)

Returns a boolean indicating whether a sequence contains any of the motifs to avoid.

optimizeCodon(seq: str, start: int)

Returns an organized optimization of the codon sequence at a given start position in a DNA sequence.

optimizeSequence(seq: str)

Returns the most optimized generated version of input sequence avoiding motifs.
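The idea of removing motifs through synonymous codon changes can be sketched as below. This is not the ChromatoSeq algorithm: ChromatoSeq regenerates sequences stochastically from a codon frequency table, whereas this illustration uses a deterministic greedy swap so that its behavior is easy to follow. All names here are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(seq: str) -> str:
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def count_motifs(seq: str, motifs) -> int:
    """Count all (overlapping) occurrences of every motif in seq."""
    return sum(
        1 for m in motifs for i in range(len(seq) - len(m) + 1) if seq.startswith(m, i)
    )


def remove_motifs(seq: str, motifs) -> str:
    """Greedily swap synonymous codons to lower the motif count.

    The first codon is left untouched, mirroring the protected start codon.
    """
    seq = seq.upper()
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    for idx in range(1, len(codons)):
        best = count_motifs("".join(codons), motifs)
        if best == 0:
            break  # nothing left to remove
        for alt in SYNONYMS[CODON_TABLE[codons[idx]]]:
            trial = codons[:idx] + [alt] + codons[idx + 1:]
            score = count_motifs("".join(trial), motifs)
            if score < best:
                best, codons = score, trial
    return "".join(codons)
```

For the input ATGGAATTCTAA and the motif GAATTC, the glutamate codon GAA is replaced by GAG, which removes the motif while the encoded protein stays the same.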

Helper class, used to fetch final statistics on motifs present

Parameters

motifs

A collection of motifs to avoid. Compatible with StealthV0 motifs.

Methods

checkMotifs(seq: str)

Returns an integer count of all existing motifs in input sequence.

CodonAnalyzer is a module that contains two classes dedicated to performing codon analysis and optimizing coding sequences.

CDSanalyzer reads in a collection of CDS and counts relative codon usage for use in a relative codon frequency table.

Parameters

input

Can be a filename of a FastA file, a StealthGenome() object, or list of BioPython SeqRecord objects.

Methods

self.addCDS(cds: str)

Method to add a single CDS sequence to existing CDSanalyzer object

self.len()

Returns the length of all CDS sequences analyzed

self.getFrequency(amino_acid = None)

Returns the entire frequency table if no arguments are passed. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]

Accepts an optional single-letter amino acid code. If an amino acid is specified, returns the codon frequencies for that amino acid as a dictionary of type dict[codon(str) : frequency(float)].
If an invalid amino acid is passed, raises a custom InvalidAA error.

self.getUsage(amino_acid = None)

Returns the entire usage table if no arguments are passed. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : usage(int)]]

Accepts an optional single-letter amino acid code. If an amino acid is specified, returns the codon usage for that amino acid as a dictionary of type dict[codon(str) : usage(int)].
If an invalid amino acid is passed, raises a custom InvalidAA error.

CodonOptimizer takes codon usage data and regenerates sequences roughly fit to levels of tRNA availability in an organism.

Parameters

codon_frequency_table

Accepts a relative codon usage frequency table with the type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]. Codon frequency table can be generated through the CDSanalyzer class with the getFrequency() function

Methods

self.assembleSeed(amino_acid_sequence: str)

Accepts a single amino acid sequence string and returns a reverse-translated coding nucleotide sequence based on the relative codon usage frequency table used to initialize the object.

self.getFrequency()

Returns the relative codon usage frequency table used to initialize the object. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]
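Reverse translation driven by a frequency table, as assembleSeed is described above, can be sketched with weighted random sampling. This is an assumption about the general approach and not the CodonOptimizer source; the function name assemble_seed and the small frequency table in the usage lines are hypothetical.

```python
import random


def assemble_seed(protein: str, frequency_table: dict, rng=None) -> str:
    """Reverse translate a protein, sampling each codon by its relative frequency.

    frequency_table has the shape {amino_acid: {codon: frequency}}.
    """
    if rng is None:
        rng = random.Random()
    codons = []
    for aa in protein:
        options = frequency_table[aa]  # KeyError for an unknown amino acid
        codon = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        codons.append(codon)
    return "".join(codons)


# Hypothetical table: alanine uses GCT 70% of the time and GCC 30%.
table = {"M": {"ATG": 1.0}, "A": {"GCT": 0.7, "GCC": 0.3}, "*": {"TAA": 1.0}}
seed = assemble_seed("MAA*", table, random.Random(0))
print(seed)
```

Sampling in proportion to observed usage keeps the regenerated sequence close to the host's codon preferences while still leaving room to pick a different synonymous codon when a motif has to be avoided.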

Custom exception that inherits from KeyError. Raised when invalid Amino Acids are passed into the getUsage() and getFrequency() functions.

A simple FastA reader class written by David L. Bernick at UC Santa Cruz.

Initialize a FastAreader for an input file.

Parameters

infile

infile is the name of the file to be read by the FastA reader. If left to None, the FastA reader will read input from STDIN.

Methods

self.readFasta()

A generator function that yields a FastA header and its associated sequence in the form [header(str),sequence(str)]

Example usage:

                        
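                          # Python3
                          from chameleontools.FastAreader import FastAreader

                          # usage; "genome.fa" is a placeholder path to a FastA file
                          infile = "genome.fa"
                          reader = FastAreader(infile)

                          # readFasta() yields one (header, sequence) pair per record
                          for header, sequence in reader.readFasta():
                              print(header, sequence)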
                        
                      

ORF finder that reports all possible ORFs from a given genome sequence.

Naive ORF finder modified from ORFfinder Class written for Winter 2022 BME160 class taught by David L. Bernick

Parameters

genome

A string of nucleotides representing a genome sequence. ORFfinder will report all found ORFs in this sequence.

longest_gene

Optional argument. Boolean flag to report the longest gene. If False, ORFfinder will report all sub-ORFs found. Default = False.

min_gene_size

Optional argument. Integer that determines the minimum size of reported genes. By default, ORFfinder reports genes greater than 100 NT long.

starts

Optional argument. A list or set of valid start codons that ORFfinder will use to find ORFs. By default, only uses the 'ATG' start codon.

stops

Optional argument. A list or set of valid stop codons that ORFfinder will use to find ORFs. By default, uses the 'TAA','TAG','TGA' stop codons.

Methods

self.get_genes()

Returns a list of gene candidates found by ORFfinder. Genes are stored in a list of tuples in the form list[ tuple[start_pos(int), stop_pos(int), length(int), frame(int)] ].
Negative frames indicate an ORF on the reverse-complement strand.

Classes to process input genome or plasmid files for further Stealth optimizations

Class to properly set up a genome input for Stealth analysis.

Parameters

genome_infile

Accepts a filename or filepath to open and read in. File must be in GenBank or FastA format. Enforced by checking file extension (.fa || .fasta || .gb || .gbk)

Methods

self.getGenome()

Returns a list of genome sequences in BioPython Seq() objects.

self.getCDS()

Returns a list of CDS sequences in BioPython SeqRecord() objects.

Class to parse out mutable regions from a plasmid.

Parameters

plasmid_infile

Accepts a filename or filepath to open and read in. File must be in GenBank format. Enforced by checking file extension (.gb || .gbk).

Methods

self.getGenBank()

Returns a BioPython SeqRecord() object of the input plasmid.

self.getSeq()

Returns a BioPython Seq() object of the plasmid nucleotide sequence.

self.regions()

Returns an iterable list of BioPython SeqFeature.SimpleLocation() objects representing mutable regions of the plasmid. These regions are truncated slightly at the start and end so that they lie in frame with the parent CDS/ORF they originated from.

self.mutableCount()

Returns an integer representing the total nucleotide count spanning over all mutable regions.

self.unmutableCount()

Returns an integer representing the total nucleotide count spanning over all unmutable regions.

Unmutable regions are CDS/ORF regions that cannot be altered without possibly impacting plasmid function.

self.regionCount()

Returns an integer representing the total nucleotide count covered by CDS/ORF regions before parsing out mutable regions.

The Stealth sub-module contains the original Stealth as written by our PI David Bernick with minimal modifications. The original version of Stealth can be found here. To see the specific modifications made, see the README in the Stealth sub-module of our gitlab repository.

Stealth looks for under-represented Kmers in a genome file by using a statistical model to establish an expected count of a Kmer and comparing that to what is actually there.

Genome class to calculate under-represented motifs from.

Parameters

min

Optional argument. Integer that defines the minimum motif size to be analyzed. By default, 1.

max

Optional argument. Integer that defines the maximum motif size to be analyzed. Note that the search space of motifs grows as 4^max. By default, 8.

pseudo

Optional argument. Integer used to initialize all counts over the motif search space. By default, 1.

Methods

self.addSequence(seq)

Accepts a sequence in the form of a string to add to the parent Genome object. Added sequences are used to calculate motif statistics.

self.E(motif)

Accepts a motif in the form of a string and returns a float that represents the expected count.

self.pValue(motif)

Accepts a motif in the form of a string and returns a float that represents the P Value.

self.Zscore(motif)

Accepts a motif in the form of a string and returns a float that represents the Z-score.

self.Evalue(motif)

Accepts a motif in the form of a string and returns a float that represents the E-value.

Function to generate a reverse complement sequence. Accepts a string of nucleotide sequence and returns the reverse complement sequence. Accepts the nucleotides 'ACTGN'

Written by UCSC TABI to store Stealth analysis in a Python object.

Parameters

seq

String or list of strings representing nucleotide sequences to perform Stealth analysis on.

zscore

Float that sets the Z-score cutoff value that Stealth analysis will report.

pseudo

Integer used to initialize all counts over the motif search space.

kMax

Integer that defines the maximum motif size to be reported. Note that the search space of motifs grows as 4^kMax.

kMin

Optional argument. Integer that defines the minimum motif size to be reported. Note that all motifs of size kMax and below will be analyzed. By default, 1

Methods

self.getOutput()

Returns a list of Stealth motif reports in the form of type list[ tuple[ motif(str), zscore(float), rc_palindrome_flag(bool) ]]
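Since getOutput() returns plain tuples, downstream filtering needs nothing beyond standard Python. The sketch below works on a hand-written list in the documented (motif, zscore, rc_palindrome_flag) form; the example values are invented for illustration, and treating a Z-score at or below the cutoff as underrepresented is an assumption.

```python
def select_motifs(reports, zscore_cutoff=-4.0, palindromes_only=False):
    """Pick the motifs to avoid from (motif, zscore, rc_palindrome_flag) tuples."""
    return {
        motif
        for motif, zscore, is_palindrome in reports
        if zscore <= zscore_cutoff and (is_palindrome or not palindromes_only)
    }


# Hypothetical report values, not real Stealth output.
reports = [
    ("GAATTC", -6.2, True),
    ("GGATG", -4.5, False),
    ("ACGT", -1.0, True),
]
print(select_motifs(reports))
print(select_motifs(reports, palindromes_only=True))
```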

A single class that parses Stealth outputs

Inherits from the Python set type. Takes Stealth output data and processes it as a set of motifs.

Parameters

input

Accepts a filename of a Stealth output file or StealthV0 object.

References:

[1] Q. Yan and S. S. Fong, “Challenges and Advances for Genetic Engineering of Non-model Bacteria and Uses in Consolidated Bioprocessing,” Front. Microbiol., vol. 8, p. 2060, Oct. 2017, doi: 10.3389/fmicb.2017.02060.

[2] S. Hu, S. Giacopazzi, R. Giacopazzi, K. Karplus, D. Bernick, and K. Ottemann, “Altering under-represented DNA sequences elevates bacterial transformation efficiency.” University of California, Santa Cruz, Aug. 06, 2023.

[3] M. Meselson and F. W. Stahl, “The replication of DNA in Escherichia coli,” Proc. Natl. Acad. Sci., vol. 44, no. 7, pp. 671–682, Jul. 1958, doi: 10.1073/pnas.44.7.671.