Restriction-Modification (R-M) systems are widespread in prokaryotes and are one of the most significant barriers to genetic tractability in non-model organisms. This barrier is evident in M. aeruginosa, which possesses an extensive and robust R-M system that remains largely uncharacterized [1]. Our team recognized that effectively engineering M. aeruginosa would require careful consideration of its R-M system, which is known to be strain-specific. This barrier was particularly frustrating in the context of a time-constrained iGEM project. We felt that the greatest contribution TABI could make to the iGEM community would be a generalizable solution to restriction site avoidance. We developed the Chameleon project to provide an automated platform that helps future iGEM teams engineer non-model species more effectively.

Our software project, Chameleon, is built around the Stealth program written by David L. Bernick. Chameleon enhances the usability of Stealth by streamlining the application of its resulting motif data. By releasing the Chameleon project, we hope to catalyze the adoption of this powerful tool. Chameleon provides a comprehensive, easy-to-use, automated way to apply the bioinformatic insight of Stealth, and contributes to synthetic biology by enabling the evasion of R-M systems in non-model species with desirable phenotypes with the click of a button. (gitlab)

As cutting-edge genomic sequencing technologies mature, they become more accessible and less costly. The same is not yet true of specialized technologies such as single-molecule real-time (SMRT) sequencing, which reveal methylomes. Using methylomic data to predict R-M sites is an established approach; however, its high cost prevents adoption on the same scale as genomic sequencing.

In contrast to established methods for identifying R-M components, Stealth performs a statistical comparison between the genomic counts of motifs within a specified size range (k-mers) and the counts expected under a genome-specific null model derived from a Markov chain of order k-2. Stealth outputs a list of motifs that are systematically underrepresented in the genome, which are expected to include novel R-M substrate motifs. In a 2022 paper, Bernick showed that synthetic DNA designed to avoid some of these underrepresented motifs in H. pylori yielded improvements in transformation efficiency of >10,000% [2].
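As a rough illustration of the comparison described above, the sketch below estimates the expected count of a motif from the counts of its two overlapping (k-1)-mers and its central (k-2)-mer, which is the usual maximum-likelihood estimate under a Markov chain of order k-2. The function names, the Poisson-style z-score (dividing by the square root of the expected count), and the toy sequence are assumptions made for this example; Stealth's own statistics may be computed differently.

```python
from collections import Counter
from math import sqrt


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq: str, motif: str, pseudo: float = 0.0) -> float:
    """Expected count of motif under an order-(k-2) Markov model.

    E(w) = N(w[:-1]) * N(w[1:]) / N(w[1:-1]), with an optional
    pseudo-count added to every observed count (an assumption here).
    """
    k = len(motif)
    if k < 3:
        raise ValueError("this sketch needs motifs of length >= 3")
    longer = count_kmers(seq, k - 1)
    shorter = count_kmers(seq, k - 2)
    prefix = longer[motif[:-1]] + pseudo
    suffix = longer[motif[1:]] + pseudo
    core = shorter[motif[1:-1]] + pseudo
    return prefix * suffix / core if core else 0.0


def z_score(seq: str, motif: str, pseudo: float = 0.0) -> float:
    """Simplified z-score: (observed - expected) / sqrt(expected)."""
    observed = count_kmers(seq, len(motif))[motif] + pseudo
    expected = expected_count(seq, motif, pseudo)
    if expected == 0:
        return 0.0
    return (observed - expected) / sqrt(expected)


# Toy "genome" in which GAT and ATC are common but GATC never occurs.
toy_genome = "GATTATCGGATAATCC" * 10
print(expected_count(toy_genome, "GATC"), z_score(toy_genome, "GATC"))
```

A strongly negative z-score flags a motif that occurs far less often than its sub-words predict, which is the kind of motif Stealth reports as underrepresented.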

The significance of under-represented motifs

R-M systems are typically composed of a suite of restriction nucleases and cognate methyltransferases. The restriction component provides a defense strategy by selectively cleaving DNA at recognition sites, often characterized by short 4-6 base pair reverse-complement palindromic motifs. The modification component, on the other hand, shields the host genome by enzymatically modifying potential cleavage sites within the host DNA. Meselson and Stahl (1958) demonstrated the semi-conservative model of bacterial double-stranded DNA genome replication [3]. This process yields chromosomal copies that are transiently hemi-methylated, as the nascent strand of each daughter chromosome has not yet had the opportunity to undergo protective methylation by the R-M system. What ensues is a race between the R-M cleavage machinery and the protective DNA methylation processes. This molecular interplay imparts a selective pressure on the genome towards reduction of endogenous R-M recognition motifs. Over time, these motifs become depleted, which allows them to be identified as systematically underrepresented in the genomic sequence.

Chameleon encapsulates a command line interface (CLI) tool in the form of a software pipeline titled Plasmid-Stealth (pstealth) and the series of importable modules used to construct it.

The Chameleon project is installed with Python using the pip package installer. Make sure you have a valid Python 3.10+ distribution and pip on macOS/Linux, then run pip install chameleontools in the terminal.

For more detailed instructions on how to install Chameleon, visit our iGEM gitlab repository found here and follow the Installation steps detailed in the README.

The Plasmid-Stealth Pipeline

The main functionality of our project Chameleon is the Plasmid-Stealth [pstealth] pipeline found in the chameleontools package. pstealth is a pipeline built around Stealth that takes an annotated plasmid sequence and Stealth-optimizes it for transformation and expression in a target host.

pstealth is a command-line interface tool that can be called in a terminal with the following usage:

        pstealth --genome (-g) [genome file] --plasmid (-p) [plasmid file] --outfile (-o) [outfile | default: stdout] -[zPMmrs]
          Optional Args:
            --zScore (-z) -> [zscore cutoff value | default: -4]
            --pseudo (-P) [pseudo-count value | default: 0]
            --max (-M) [maximum motif size | default: 8]
            --min (-m) [minimum motif size | default: 1]
            --palindrome (-r) [Remove RC palindromes only | default: off]
            --silent (-s) [Hide report message | default: show]
            --keep (-k) [Adds annotations to consider in Mutable Regions | default = {'ORF','gene','CDS'}]
            --ignore (-i) [Adds annotations to ignore when defining Mutable Regions | default = {'source'}]

pstealth takes two required inputs: an annotated plasmid --plasmid (-p) in GenBank format, and the genome of the target host --genome (-g) in either GenBank or FastA format.

For well-annotated genomes, use an annotated GenBank record to obtain the most accurate relative codon frequencies.

For both genome and plasmid inputs, the GenBank record must contain the nucleotide sequence of the input.

A run of pstealth outputs a plasmid GenBank record that is almost identical to the input plasmid, except that the sequence is Stealth-optimized.

Usage Example

This example covers using the pstealth pipeline to optimize an example plasmid for transformation and expression in E. coli. The -r flag was used to selectively remove only the reverse-complement palindromes identified by Stealth analysis.

Now for a comparison of the original plasmid and our Stealth-modified plasmid

Plasmid Preparation

Mutable regions are sections of the coding sequence of a plasmid that can tolerate mutation without breaking functionality. The PlasmidParse class from the SeqParser module parses out mutable regions by finding CDS/ORF regions that do not overlap annotated non-coding DNA sequence or other CDS/ORF regions, since mutations in such overlaps could render a plasmid nonfunctional. To preserve the translation of alternative start codons, the first codon of every CDS/ORF region is protected and deemed immutable. As a special case, overlapping CDS/ORF regions are still treated as mutable when the two regions sit in the same reading frame.
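The selection rule above can be sketched with plain interval arithmetic. This is a simplified illustration, not the PlasmidParse implementation: it assumes forward-strand features given as half-open (start, end) tuples, the function name is hypothetical, and it leaves out the same-frame overlap special case.

```python
def mutable_regions(cds_features, other_features, plasmid_length):
    """Return in-frame (start, end) spans of each CDS that no other feature touches.

    cds_features and other_features are lists of half-open (start, end)
    intervals on the forward strand (an assumption of this sketch).
    """
    # Count how many annotated features cover each position.
    coverage = [0] * plasmid_length
    for start, end in list(cds_features) + list(other_features):
        for pos in range(start, end):
            coverage[pos] += 1

    regions = []
    for start, end in cds_features:
        run_start = None
        # Begin at start + 3 so the first codon is always protected.
        for pos in range(start + 3, end + 1):
            free = pos < end and coverage[pos] == 1
            if free and run_start is None:
                run_start = pos
            elif not free and run_start is not None:
                # Trim both ends so the span holds whole codons of this CDS.
                first = run_start + (start - run_start) % 3
                last = pos - (pos - start) % 3
                if last - first >= 3:
                    regions.append((first, last))
                run_start = None
    return regions


# A 30 nt CDS interrupted by a 4 nt non-coding feature at positions 10-13.
print(mutable_regions([(0, 30)], [(10, 14)], 40))
```

Trimming to codon boundaries keeps every reported span aligned with the reading frame of its parent CDS, so synonymous substitutions inside the span cannot shift the frame.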

The pstealth pipeline can only accept a GenBank record for the target plasmid, as PlasmidParse relies on annotations to define what is and is not mutable. This section covers how to prepare a plasmid for a successful run through the pstealth pipeline.

Annotations

First, make sure all important non-coding regions are properly annotated. These could include, but are not limited to:

Origin of Replication
Origin of Transfer
Promoters
Terminators
Ribosome Binding Sites

Next, annotate any CDS/ORFs important to the functionality of your plasmid, making sure each annotation is a feature of type 'CDS', 'ORF', or 'gene', or something distinct and descriptive. It is important that these annotations are accurate: each should begin at the start codon, end at the end of the stop codon, and run in the correct direction.

If you already have a plasmid with accurate CDS/ORF regions annotated as features other than "CDS", "gene", or "ORF" (e.g. "antibiotic_resistance", "reporter_protein", etc.), you can specify these annotations with the --keep (-k) optional argument as follows:

          
            pstealth -p plasmid.gb -g genome.gb --keep reporter_protein

Furthermore, if there are annotations with feature labels that do not signal an actual function and can be ignored, use the --ignore (-i) optional argument to disregard sections annotated with a specified annotation.
Example:

          
            pstealth -p plasmid.gb -g genome.gb --ignore misc_feature

Genome Input

The genome input for the pstealth pipeline accepts either a GenBank record or a FastA file, and determines the format by checking for the (.gb/.gbk) or (.fa/.fasta) file extensions.
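An extension check of this kind might look like the sketch below. The function name and the case-insensitive comparison are assumptions for illustration, not the pipeline's actual code; the returned strings are the format names Biopython's SeqIO uses.

```python
from pathlib import Path

GENBANK_EXTENSIONS = {".gb", ".gbk"}
FASTA_EXTENSIONS = {".fa", ".fasta"}


def detect_format(filename: str) -> str:
    """Map a filename to 'genbank' or 'fasta' based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in GENBANK_EXTENSIONS:
        return "genbank"
    if suffix in FASTA_EXTENSIONS:
        return "fasta"
    raise ValueError(f"unsupported genome file extension: {suffix!r}")


print(detect_format("genome.gbk"), detect_format("genome.fasta"))
```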

Well-annotated genomes can be input using a GenBank record containing all CDS/gene annotations. By default, CDS/gene annotations are kept and used to build a relative codon usage frequency table that is then used to regenerate gene sequences later in the pipeline. When using a GenBank record as the genome input, it is important to include both the annotations and the sequence. If either is missing from the GenBank record, the pipeline will fail to run.
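To make the idea of a relative codon usage frequency table concrete, here is a minimal sketch that tallies codons from a list of coding sequences and normalizes the counts per amino acid. It assumes the standard genetic code and a hypothetical function name; the nested dictionary mirrors the dict[amino_acid : dict[codon : frequency]] shape described for CDSanalyzer.getFrequency() in the module documentation, but it is not that class's implementation.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}


def codon_frequencies(cds_list):
    """Return {amino_acid: {codon: relative_frequency}} for the given CDSs."""
    usage = defaultdict(lambda: defaultdict(int))
    for cds in cds_list:
        cds = cds.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            amino_acid = CODON_TABLE.get(codon)
            if amino_acid is None:
                continue  # skip codons with ambiguous bases such as N
            usage[amino_acid][codon] += 1

    table = {}
    for amino_acid, counts in usage.items():
        total = sum(counts.values())
        table[amino_acid] = {codon: n / total for codon, n in counts.items()}
    return table


print(codon_frequencies(["ATGGCTGCTGCCTAA"]))
```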

If opting to use a FastA record, CDS/ORF regions are selected using a naive ORF finder provided by the ORFfinder class from the ORFfinder module. ORFs are selected by looking for the ATG start codon and keeping the longest gene found within a reading frame as a gene candidate. Keep in mind that, because of how CDS/ORFs are gathered from a FastA file, the relative codon usage frequency statistics may not reflect the true usage statistics of the species.
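The naive strategy described above can be illustrated as follows. This is a forward-strand-only sketch with a hypothetical function name, not the ORFfinder class itself: within each frame it opens an ORF at the first ATG after the previous stop codon, which yields the longest ORF ending at each stop.

```python
def longest_orfs(seq, min_len=100, start="ATG", stops=("TAA", "TAG", "TGA")):
    """Return (start, end, frame) for the longest ORF ending at each stop codon.

    Coordinates are 0-based half-open; frame is 1, 2 or 3 (forward strand only).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == start:
                orf_start = i  # first start codon since the last stop
            elif orf_start is not None and codon in stops:
                end = i + 3  # include the stop codon
                if end - orf_start >= min_len:
                    orfs.append((orf_start, end, frame + 1))
                orf_start = None
    return orfs


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(longest_orfs(toy, min_len=9))
```

A finder like this also reports spurious ORFs in non-coding sequence, which is why codon statistics derived from a FastA genome should be treated with caution.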

Output File

The output of a successful pipeline run defaults to STDOUT if the --outfile (-o) option is not specified. The output file is simply a copy of the plasmid input file with a Stealth-optimized sequence. This preserves the annotations of the original plasmid.

Optional Arguments

The Z-score argument allows a user to determine the Z-score cutoff value that Stealth uses to determine if a motif is underrepresented. By default, this value is -4.0.

--zScore (-z) accepts a single valid integer if called.

The pseudo-count argument allows a user to set the pseudo-count of all kmer motifs during Stealth analysis. This pseudo-count is added to all motif counts regardless of whether a given motif appears in the genome. By default the pseudo-count is 0.

--pseudo (-P) accepts a single non-negative integer if called.

The maximum motif size argument allows the user to specify the largest kmer motif to be analyzed by Stealth. The motif search space grows exponentially: there are 4^k possible motifs of length k, so each increase of the maximum motif size by one quadruples the number of motifs of the largest size. By default the maximum motif size is 8.

--max (-M) accepts a single positive integer in the range [2,9].
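To give a feel for how quickly the search space grows with the maximum motif size, the short calculation below counts every possible DNA motif between a minimum and maximum length. The helper name is hypothetical.

```python
def motif_space(k_max: int, k_min: int = 1) -> int:
    """Number of distinct DNA motifs with length between k_min and k_max."""
    return sum(4 ** k for k in range(k_min, k_max + 1))


# Search space for each permitted value of --max.
for k_max in range(2, 10):
    print(k_max, motif_space(k_max))
```

Raising the maximum from 8 to 9 adds 4^9 = 262,144 motifs, three times as many as all shorter motifs combined.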

The minimum motif size argument allows the user to specify the smallest sized kmer to be reported by Stealth. Analysis still occurs but kmers shorter than the minimum motif size are omitted from being marked for avoidance.

--min (-m) accepts a single positive integer in the range [1,8].

The palindrome flag allows the user to remove only those Stealth-identified motifs that are reverse-complement palindromes. In living systems, many restriction enzymes that are a part of R-M systems recognize these reverse-complement palindromic sequences [2]. Selecting only reverse-complement palindromes for avoidance can be a more conservative strategy for Stealth-optimizing a plasmid. By default the palindrome flag is off.

--palindrome (-r) is a boolean flag that is toggled on when called.
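A motif is a reverse-complement palindrome when it reads the same as its own reverse complement. The sketch below shows one way to test this; the helper functions are stand-ins written for this example rather than the package's own code.

```python
# Translation table for complementing DNA, including the ambiguity code N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_rc_palindrome(motif: str) -> bool:
    """True if the motif equals its own reverse complement."""
    return motif.upper() == reverse_complement(motif)


for motif in ("GAATTC", "GATC", "GATTC"):
    print(motif, is_rc_palindrome(motif))
```

GAATTC, the EcoRI recognition site, is a classic reverse-complement palindrome; motifs of odd length can never be one, because the middle base would have to be its own complement.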

The silent flag allows the user to disable the final report message for a pipeline run. By default the silent flag is off, showing the final report message.

--silent (-s) is a boolean flag that is toggled on when called.

The keep annotations argument allows a user to define additional feature annotations as valid mutable regions on a plasmid. Mutable regions are defined as CDS/ORF regions that can be modified to change the nucleotide sequence while preserving the coded amino acid sequence. Modifications are made by swapping out synonymous codons to avoid Stealth motifs in a coding sequence. By default the feature annotations defined as mutable are {'CDS','ORF','gene'}.

--keep (-k) accepts one or more arguments as a space-separated list of case-sensitive strings.
Example

                
                  pstealth -g genome.gb -p plasmid.gb --keep cool_feature1 nice_feature2 weird_feature3

The ignore annotations argument allows a user to add additional feature annotations to ignore when parsing out mutable regions from a plasmid. Mutable regions are defined as CDS/ORF regions that can be modified to change the nucleotide sequence while preserving the coded amino acid sequence. Modifications are made by swapping out synonymous codons to avoid Stealth motifs in a coding sequence. By default the feature annotations that are ignored are {'source'}.

--ignore (-i) accepts one or more arguments as a space-separated list of case-sensitive strings.
Example

                
                  pstealth -g genome.gb -p plasmid.gb --ignore useless_feature1 obsolete_feature2 irrelevent_feature3

`chameleontools` Module Documentation

Development of the Chameleon project and chameleontools started when we wrote software to process the Stealth output for M. aeruginosa and realized we could write a pipeline to automate what was a manual and time-consuming process. This software ultimately became what is now the StealthParser module.

As the software we were writing evolved into the Chameleon project, we realized the importance of each module serving a specialized purpose and have written and organized the chameleontools Python package to reflect that sentiment. chameleontools contains 7 distinct modules that can be imported into any Python script and freely used. Each importable module found within chameleontools was written to be flexible and serves a distinct function for uses far beyond just the pstealth pipeline.

To use any module in your own Python project, import it as follows.

        
          # Python3.x
          import chameleontools.ChromatoSeq # Motif removal from sequence
          import chameleontools.CodonAnalyzer # Codon Analysis of CDS
          import chameleontools.FastAreader # FastA file reader
          import chameleontools.ORFfinder # Very basic ORF finder
          import chameleontools.SeqParser # Plasmid and Genome input handling
          import chameleontools.Stealth # Stealth analysis
          import chameleontools.StealthParser # Stealth output parser

Each module can function independently of the others and shares the dependencies of the parent chameleontools package.
Below is detailed documentation of the contents of each module.

Documentation

ChromatoSeq is a module for generating stochastic motif-avoidant ORFs

CLASS patternConstrainer(motifs, frequency)

Stochastically regenerates ORFs while avoiding input motifs, using a defined relative codon usage frequency table.

Parameters

`motifs`

A collection of motifs to avoid. Compatible with StealthV0 motifs.

Methods

`motifInSeq(seq: str)`

Returns a boolean indicating whether a sequence contains any of the motifs to avoid.

`optimizeCodon(seq: str, start: int)`

Returns an organized optimization of the codon sequence at a given start position in a DNA sequence.

`optimizeSequence(seq: str)`

Returns the most optimized generated version of input sequence avoiding motifs.
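The general idea of regenerating a coding sequence until it avoids a set of motifs can be sketched as below. This is an illustration under simplifying assumptions rather than the patternConstrainer algorithm: synonymous codons are drawn uniformly instead of from a codon usage table, only the given strand is scanned, the input is assumed to be upper-case, and every function name is hypothetical.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence with the standard code."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def has_motif(seq: str, motifs) -> bool:
    """True if any motif occurs in seq."""
    return any(motif in seq for motif in motifs)


def avoid_motifs(cds: str, motifs, rng: random.Random, max_tries: int = 1000) -> str:
    """Resample synonymous codons until no motif remains or tries run out.

    The first codon is kept as-is. The result may still contain motifs
    if no motif-free candidate is found within max_tries.
    """
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    candidate = cds
    for _ in range(max_tries):
        if not has_motif(candidate, motifs):
            return candidate
        candidate = codons[0] + "".join(
            rng.choice(SYNONYMS[CODON_TABLE[codon]]) for codon in codons[1:]
        )
    return candidate


original = "ATGGAATTCTAA"  # encodes M-E-F-stop and contains GAATTC
recoded = avoid_motifs(original, {"GAATTC"}, random.Random(0))
print(recoded, translate(recoded))
```

Because only synonymous codons are swapped, the translated protein is unchanged while the nucleotide sequence no longer contains the avoided motif.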

CLASS MotifChecker(motifs)

Helper class used to fetch final statistics on the motifs present.

Parameters

`motifs`

A collection of motifs to avoid. Compatible with StealthV0 motifs.

Methods

`checkMotifs(seq: str)`

Returns an integer count of all existing motifs in input sequence.

CodonAnalyzer is a module that contains two classes dedicated to performing codon analysis and optimizing coding sequences.

CDSanalyzer reads in a collection of CDS and counts relative codon usage for use in a relative codon frequency table.

Parameters

`input`

Can be a filename of a FastA file, a StealthGenome() object, or list of BioPython SeqRecord objects.

Methods

`self.addCDS(cds: str)`

Method to add a single CDS sequence to existing CDSanalyzer object

`self.len()`

Returns the length of all CDS sequences analyzed

`self.getFrequency(amino_acid = None)`

Returns the entire frequency table if no arguments are passed. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]

Accepts an optional single-letter amino acid code. If an amino acid is specified, returns the codon frequency of that amino acid. Table is returned in a dictionary of type dict[codon(str) : frequency(float)]
If an invalid amino acid is passed, raises a custom InvalidAA error

`self.getUsage(amino_acid = None)`

Returns the entire usage table if no arguments are passed. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : usage(int)]]

Accepts an optional single-letter amino acid code. If an amino acid is specified, returns the codon usage of that amino acid. Table is returned in a dictionary of type dict[codon(str) : usage(int)]
If an invalid amino acid is passed, raises a custom InvalidAA error

CodonOptimizer takes codon usage data and regenerates sequences roughly fit to the levels of tRNA availability in an organism.

Parameters

`codon_frequency_table`

Accepts a relative codon usage frequency table with the type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]. Codon frequency table can be generated through the CDSanalyzer class with the getFrequency() function

Methods

`self.assembleSeed(amino_acid_sequence: str)`

Accepts a single amino acid sequence string and returns a reverse-translated coding nucleotide sequence based on the relative codon usage frequency table used to initialize the object.
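Reverse translation against a frequency table can be pictured as drawing one codon per residue, weighted by relative frequency. The sketch below shows that idea with a toy table whose values are made up for the example; the function name is hypothetical and this is not the assembleSeed implementation.

```python
import random


def reverse_translate(protein: str, frequency_table: dict, rng: random.Random) -> str:
    """Pick one codon per amino acid, weighted by its relative frequency.

    frequency_table has the shape {amino_acid: {codon: frequency}}.
    An amino acid missing from the table raises KeyError.
    """
    codons = []
    for amino_acid in protein:
        choices = frequency_table[amino_acid]
        picked = rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
        codons.append(picked)
    return "".join(codons)


# Toy frequency table with illustrative values only.
toy_table = {
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.7, "GCC": 0.3},
    "*": {"TAA": 1.0},
}
print(reverse_translate("MAA*", toy_table, random.Random(1)))
```

Weighted sampling means frequently used codons dominate the output while rare codons still appear occasionally, so the regenerated sequence roughly follows the host's codon usage.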

`self.getFrequency()`

Returns the relative codon usage frequency table used to initialize the object. Table is returned in a dictionary of type dict[amino_acid(str) : dict[codon(str) : frequency(float)]]

EXCEPTION InvalidAA(KeyError)

Custom exception that inherits from KeyError. Raised when invalid amino acids are passed into the getUsage() and getFrequency() functions.

FastAreader

A simple FastA reader class written by David L. Bernick at UC Santa Cruz.

CLASS FastAreader(infile=None)

Initializes a FastAreader for an input file.

Parameters

`infile`

infile is the name of the file to be read by the FastA reader. If left to None, the FastA reader will read input from STDIN.

Methods

`self.readFasta()`

A generator function that yields a FastA header and its associated sequence in the form [header(str),sequence(str)]

Example usage:

                        
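                          # Python3
                          from chameleontools.FastAreader import FastAreader

                          # Path to a FastA file (placeholder name); pass None to read from STDIN
                          infile = "genome.fa"
                          reader = FastAreader(infile)

                          # readFasta() yields one header and its sequence per record
                          for header, sequence in reader.readFasta():
                              print(header, sequence)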

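ORFfinder

ORF finder that reports all possible ORFs from a given genome sequence.

CLASS ORFfinder(genome, longest_gene=False, min_gene_size=100, starts={'ATG'}, stops={'TAA','TAG','TGA'})

Naive ORF finder modified from the ORFfinder class written for the Winter 2022 BME160 class taught by David L. Bernick.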

Parameters

`genome`

A string of nucleotides representing a genome sequence. ORFfinder will report all found ORFs in this sequence.

`longest_gene`

Optional argument. Boolean flag to report the longest gene. If False, ORFfinder will report all sub-ORFs found. Default = False.

`min_gene_size`

Optional argument. Integer that determines the minimum size of reported genes. By default, ORFfinder reports genes greater than 100 NT long.

`starts`

Optional argument. A list or set of valid start codons that ORFfinder will use to find ORFs. By default, only uses the 'ATG' start codon.

`stops`

Optional argument. A list or set of valid stop codons that ORFfinder will use to find ORFs. By default, uses the 'TAA','TAG','TGA' stop codons.

Methods

`self.get_genes()`

Returns a list of gene candidates found by ORFfinder. Genes are stored in a list of tuples in the form list[ tuple[start_pos(int), stop_pos(int), length(int), frame(int)] ].
Negative frames indicate an ORF on the reverse-complement strand.

SeqParser

Classes to process input genome or plasmid files for further Stealth optimizations.

CLASS StealthGenome(genome_infile)

Class to properly set up a genome input for Stealth analysis.

Parameters

`genome_infile`

Accepts a filename or filepath to open and read in. File must be in GenBank or FastA format. Enforced by checking file extension (.fa || .fasta || .gb || .gbk)

Methods

`self.getGenome()`

Returns a list of genome sequences in BioPython Seq() objects.

`self.getCDS()`

Returns a list of CDS sequences in BioPython SeqRecord() objects.

CLASS PlasmidParse(plasmid_infile)

Class to parse out mutable regions from a plasmid.

Parameters

`plasmid_infile`

Accepts a filename or filepath to open and read in. File must be in GenBank format. Enforced by checking file extension (.gb || .gbk).

Methods

`self.getGenBank()`

Returns a BioPython SeqRecord() object of the input plasmid.

`self.getSeq()`

Returns a BioPython Seq() object of the plasmid nucleotide sequence.

`self.regions()`

Returns an iterable list of BioPython SeqFeature.SimpleLocation() objects representing mutable regions of the plasmid. These regions are truncated slightly at the start and end to lie in frame with the parent CDS/ORF they originated from.

`self.mutableCount()`

Returns an integer representing the total nucleotide count spanning over all mutable regions.

`self.unmutableCount()`

Returns an integer representing the total nucleotide count spanning over all unmutable regions.

Unmutable regions are CDS/ORF regions that cannot be altered without possibly impacting plasmid function.

`self.regionCount()`

Returns an integer representing the total nucleotide count covered by CDS/ORF regions before parsing out mutable regions.

The Stealth sub-module contains the original Stealth as written by our PI David Bernick, with minimal modifications. The original version of Stealth can be found here. To see the specific modifications made, see the README located in the Stealth sub-module in our GitLab repository.

Stealth looks for under-represented Kmers in a genome file by using a statistical model to establish an expected count of a Kmer and comparing that to what is actually there.

CLASS Genome(min=1, max=8, pseudo=1)

Genome class from which under-represented motifs are calculated.

Parameters

`min`

Optional argument. Integer that defines the minimum motif size to be analyzed. By default, 1.

`max`

Optional argument. Integer that defines the maximum motif size to be analyzed. Note that search space of motifs grows by a factor of 4^max. By default, 8.

`pseudo`

Optional argument. Integer that is added to initialize all counts over the motif search space. By default, 1.

Methods

`self.addSequence(seq)`

Accepts a sequence in the form of a string to add to the parent Genome object. Added sequences are used to calculate motif statistics.

`self.E(motif)`

Accepts a motif in the form of a string and returns a float that represents the expected count.

`self.pValue(motif)`

Accepts a motif in the form of a string and returns a float that represents the P Value.

`self.Zscore(motif)`

Accepts a motif in the form of a string and returns a float that represents the Z-score.

`self.Evalue(motif)`

Accepts a motif in the form of a string and returns a float that represents the E-value.

FUNCTION reverse_complement(seq)

Function to generate a reverse complement sequence. Accepts a string of nucleotide sequence and returns the reverse complement sequence. Accepts the nucleotides 'ACTGN'.

CLASS StealthV0(seq, zscore, pseudo, kMax, kMin= 1)

Written by UCSC TABI to store Stealth analysis in a Python object.

Parameters

`seq`

String or list of strings representing nucleotide sequences to perform Stealth analysis on.

`zscore`

Float to determine the z-score cutoff value that Stealth analysis will report.

`pseudo`

Integer that is added to initialize all counts over the motif search space.

`kMax`

Integer that defines the maximum motif size to be reported. Note that search space of motifs grows by a factor of 4^kMax.

`kMin`

Optional argument. Integer that defines the minimum motif size to be reported. Note that all motifs of size kMax and below will be analyzed. By default, 1.

Methods

`self.getOutput()`

Returns a list of Stealth motif reports in the form of type list[ tuple[ motif(str), zscore(float), rc_palindrome_flag(bool) ]]
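Given reports in the tuple form described above, downstream filtering is straightforward. The sketch below uses made-up report values and a hypothetical helper to select motifs at or below a z-score cutoff, optionally keeping only reverse-complement palindromes; whether the real cutoff comparison is inclusive is an assumption.

```python
def select_motifs(reports, cutoff=-4.0, palindromes_only=False):
    """Return the set of motifs whose z-score is at or below the cutoff.

    reports is a list of (motif, zscore, rc_palindrome_flag) tuples.
    """
    return {
        motif
        for motif, zscore, is_palindrome in reports
        if zscore <= cutoff and (is_palindrome or not palindromes_only)
    }


# Illustrative values only, not real Stealth output.
example_reports = [
    ("GATC", -6.2, True),
    ("GGCC", -4.5, True),
    ("ACCTG", -5.1, False),
    ("AATT", -1.3, True),
]
print(select_motifs(example_reports))
print(select_motifs(example_reports, palindromes_only=True))
```

Restricting the selection to palindromes mirrors the more conservative strategy offered by the --palindrome (-r) flag of pstealth.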

StealthParser

A single class that parses Stealth outputs.

Inherits from the Python type set. Takes Stealth output data and processes it as a set of motifs.

Parameters

`input`

Accepts a filename of a Stealth output file or StealthV0 object.

[1] Q. Yan and S. S. Fong, “Challenges and Advances for Genetic Engineering of Non-model Bacteria and Uses in Consolidated Bioprocessing,” Front. Microbiol., vol. 8, p. 2060, Oct. 2017, doi: 10.3389/fmicb.2017.02060.

[2] S. Hu, S. Giacopazzi, R. Giacopazzi, K. Karplus, D. Bernick, and K. Ottemann, “Altering under-represented DNA sequences elevates bacterial transformation efficiency.” University of California, Santa Cruz, Aug. 06, 2023.

[3] M. Meselson and F. W. Stahl, “The replication of DNA in Escherichia coli,” Proc. Natl. Acad. Sci., vol. 44, no. 7, pp. 671–682, Jul. 1958, doi: 10.1073/pnas.44.7.671.

Software

An Overview of Stealth Developed by David Bernick

The Plasmid-Stealth Pipeline

Usage Example

Plasmid Preparation

Genome Input

Output File

Optional Arguments

--zScore (-z)

--pseudo (-P)

--max (-M)

--min (-m)

--palindrome (-r)

--silent (-s)

--keep (-k)

--ignore (-i)

chameleontools Module Documentation

Documentation

ChromatoSeq

CLASS patternConstrainer(motifs, frequency)

motifs

motifInSeq(seq: str)

optimizeCodon(seq: str, start: int)

optimizeSequence(seq: str)

CLASS MotifChecker(motifs)

motifs

checkMotifs(seq: str)

CodonAnalyzer

CLASS CDSanalyzer(input)

input

self.addCDS(cds: str)

self.len()

self.getFrequency(amino_acid = None)

self.getUsage(amino_acid = None)

CLASS CodonOptimizer(codon_frequency_table)

codon_frequency_table

self.assembleSeed(amino_acid_sequence: str)

self.getFrequency()

EXCEPTION InvalidAA(KeyError)

FastAreader

CLASS FastAreader(infile=None)

infile

self.readFasta()

ORFfinder

CLASS ORFfinder(genome, longest_gene=False, min_gene_size=100, starts={'ATG'}, stops={'TAA','TAG','TGA'})

genome

longest_gene

min_gene_size

starts

stops

self.get_genes()

SeqParser

CLASS StealthGenome(genome_infile)

genome_infile

self.getGenome()

self.getCDS()

CLASS PlasmidParse(plasmid_infile)

plasmid_infile

self.getGenBank()

self.getSeq()

self.regions()

self.mutableCount()

self.unmutableCount()

self.regionCount()

Stealth

CLASS Genome(min=1, max=8, pseudo=1)

min

max

pseudo

self.addSequence(seq)

self.E(motif)

self.pValue(motif)

self.Zscore(motif)

self.Evalue(motif)

FUNCTION reverse_complement(seq)

CLASS StealthV0(seq, zscore, pseudo, kMax, kMin= 1)

seq

zscore

pseudo

kMin
