• ### State Complexity of Overlap Assembly(1710.06000)

Dec. 11, 2018 cs.FL
The \emph{state complexity} of a regular language $L_m$ is the number $m$ of states in a minimal deterministic finite automaton (DFA) accepting $L_m$. The state complexity of a regularity-preserving binary operation on regular languages is defined as the maximal state complexity of the result of the operation where the two operands range over all languages of state complexities $\le m$ and $\le n$, respectively. We find a tight upper bound on the state complexity of the binary operation \emph{overlap assembly} on regular languages. This operation was introduced by Csuhaj-Varj\'u, Petre, and Vaszil to model the process of self-assembly of two linear DNA strands into a longer DNA strand, provided that their ends "overlap". We prove that the state complexity of the overlap assembly of languages $L_m$ and $L_n$, where $m\ge 2$ and $n\ge1$, is at most $2 (m-1) 3^{n-1} + 2^n$. Moreover, for $m \ge 2$ and $n \ge 3$ there exist languages $L_m$ and $L_n$ over an alphabet of size $n$ whose overlap assembly meets the upper bound and this bound cannot be met with smaller alphabets. Finally, we prove that $m+n$ is a tight upper bound on the overlap assembly of unary languages, and that there are binary languages whose overlap assembly has exponential state complexity at least $m(2^{n-1}-2)+2$.
• ### An investigation into inter- and intragenomic variations of graphic genomic signatures(1503.00162)

March 10, 2015 q-bio.GN
We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences originating from the genomes of six organisms, each belonging to one of the kingdoms of life: H. sapiens, S. cerevisiae, A. thaliana, P. falciparum, E. coli, and P. furiosus. We also provide preliminary evidence of this method's applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over one hundred and fifty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps that visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Our analysis confirms that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our analysis of the performance of the assessed distances uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance (introduced here) are best able to classify genomic sequences.
• ### Transducer Descriptions of DNA Code Properties and Undecidability of Antimorphic Problems(1503.00035)

Feb. 27, 2015 cs.CC, cs.FL
This work concerns formal descriptions of DNA code properties, and builds on previous work on transducer descriptions of classic code properties and on trajectory descriptions of DNA code properties. This line of research allows us to give a property as input to an algorithm, in addition to any regular language, which can then answer questions about the language and the property. Here we define DNA code properties via transducers and show that this method is strictly more expressive than that of trajectories, without sacrificing the efficiency of deciding the satisfaction question. We also show that the maximality question can be undecidable. Our undecidability results hold not only for the fixed DNA involution but also for any fixed antimorphic permutation. Moreover, we also show the undecidability of the antimorphic version of the Post Corresponding Problem, for any fixed antimorphic permutation.
• ### Mapping the Space of Genomic Signatures(1406.4105)

Oct. 9, 2014 q-bio.GN
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to $k$ (herein $k=9$) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence homology and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information.
• ### An efficient algorithm for computing the edit distance of a regular language via input-altering transducers(1406.1041)

June 4, 2014 cs.FL
We revisit the problem of computing the edit distance of a regular language given via an NFA. This problem relates to the inherent maximal error-detecting capability of the language in question. We present an efficient algorithm for solving this problem which executes in time $O(r^2n^2d)$, where $r$ is the cardinality of the alphabet involved, $n$ is the number of transitions in the given NFA, and $d$ is the computed edit distance. We have implemented the algorithm and present here performance tests. The correctness of the algorithm is based on the result (also presented here) that the particular error-detection property related to our problem can be defined via an input-altering transducer.
• ### Binary pattern tile set synthesis is NP-hard(1404.0967)

April 3, 2014 cs.CC
In the field of algorithmic self-assembly, a long-standing unproven conjecture has been that of the NP-hardness of binary pattern tile set synthesis (2-PATS). The $k$-PATS problem is that of designing a tile assembly system with the smallest number of tile types which will self-assemble an input pattern of $k$ colors. Of both theoretical and practical significance, $k$-PATS has been studied in a series of papers which have shown $k$-PATS to be NP-hard for $k = 60$, $k = 29$, and then $k = 11$. In this paper, we close the fundamental conjecture that 2-PATS is NP-hard, concluding this line of study. While most of our proof relies on standard mathematical proof techniques, one crucial lemma makes use of a computer-assisted proof, which is a relatively novel but increasingly utilized paradigm for deriving proofs for complex mathematical problems. This tool is especially powerful for attacking combinatorial problems, as exemplified by the proof of the four color theorem by Appel and Haken (simplified later by Robertson, Sanders, Seymour, and Thomas) or the recent important advance on the Erd\H{o}s discrepancy problem by Konev and Lisitsa using computer programs. We utilize a massively parallel algorithm and thus turn an otherwise intractable portion of our proof into a program which requires approximately a year of computation time, bringing the use of computer-assisted proofs to a new scale. We fully detail the algorithm employed by our code, and make the code freely available online.
• ### Map of Life: Measuring and Visualizing Species' Relatedness with "Molecular Distance Maps"(1307.3755)

July 14, 2013 cs.CV, q-bio.QM, q-bio.PE, q-bio.GN
We propose a novel combination of methods that (i) portrays quantitative characteristics of a DNA sequence as an image, (ii) computes distances between these images, and (iii) uses these distances to output a map wherein each sequence is a point in a common Euclidean space. In the resulting "Molecular Distance Map" each point signifies a DNA sequence, and the geometric distance between any two points reflects the degree of relatedness between the corresponding sequences and species. Molecular Distance Maps present compelling visual representations of relationships between species and could be used for taxonomic clarifications, for species identification, and for studies of evolutionary history. One of the advantages of this method is its general applicability since, as sequence alignment is not required, the DNA sequences chosen for comparison can be completely different regions in different genomes. In fact, this method can be used to compare any two DNA sequences. For example, in our dataset of 3,176 mitochondrial DNA sequences, it correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and it finds that the sequence most different from it belongs to a cucumber. Furthermore, our method can be used to compare real sequences to artificial, computer-generated, DNA sequences. For example, it is used to determine that the distances between a Homo sapiens sapiens mtDNA and artificial sequences of the same length and same trinucleotide frequencies can be larger than the distance between the same human mtDNA and the mtDNA of a fruit-fly. We demonstrate this method's promising potential for taxonomical clarifications by applying it to a diverse variety of cases that have been historically controversial, such as the genus Polypterus, the family Tarsiidae, and the vast (super)kingdom Protista.
• ### 3-color Bounded Patterned Self-assembly(1306.3257)

June 13, 2013 cs.CC
Patterned self-assembly tile set synthesis PATS is the problem of finding a minimal tile set which uniquely self-assembles into a given pattern. Czeizler and Popa proved the NP-completeness of PATS and Seki showed that the PATS problem is already NP-complete for patterns with 60 colors. In search for the minimal number of colors such that PATS remains NP-complete, we introduce multiple bound PATS (mbPATS) where we allow bounds for the numbers of tile types of each color. We show that mbPATS is NP-complete for patterns with just three colors and, as a byproduct of this result, we also obtain a novel proof for the NP-completeness of PATS which is more concise than the previous proofs.
• ### Hypergraph Automata: A Theoretical Model for Patterned Self-assembly(1302.2840)

Feb. 12, 2013 math.CO, cs.DM, cs.FL
Patterned self-assembly is a process whereby coloured tiles self-assemble to build a rectangular coloured pattern. We propose self-assembly (SA) hypergraph automata as an automata-theoretic model for patterned self-assembly. We investigate the computational power of SA-hypergraph automata and show that for every recognizable picture language, there exists an SA-hypergraph automaton that accepts this language. Conversely, we prove that for any restricted SA-hypergraph automaton, there exists a Wang Tile System, a model for recognizable picture languages, that accepts the same language. The advantage of SA-hypergraph automata over Wang automata, acceptors for the class of recognizable picture languages, is that they do not rely on an a priori defined scanning strategy
• ### Deciding Whether a Regular Language is Generated by a Splicing System(1112.4897)

Aug. 30, 2012 cs.FL
Splicing as a binary word/language operation is inspired by the DNA recombination under the action of restriction enzymes and ligases, and was first introduced by Tom Head in 1987. Shortly thereafter, it was proven that the languages generated by (finite) splicing systems form a proper subclass of the class of regular languages. However, the question of whether or not one can decide if a given regular language is generated by a splicing system remained open. In this paper we give a positive answer to this question. Namely, we prove that, if a language is generated by a splicing system, then it is also generated by a splicing system whose size is a function of the size of the syntactic monoid of the input language, and which can be effectively constructed.
• ### Iterated Hairpin Completions of Non-crossing Words(1110.0760)

Oct. 4, 2011 cs.FL
Iterated hairpin completion is an operation on formal languages that is inspired by the hairpin formation in DNA biochemistry. Iterated hairpin completion of a word (or more precisely a singleton language) is always a context-sensitive language and for some words it is known to be non-context-free. However, it is unknown whether regularity of iterated hairpin completion of a given word is decidable. Also the question whether iterated hairpin completion of a word can be context-free but not regular was asked in literature. In this paper we investigate iterated hairpin completions of non-crossing words and, within this setting, we are able to answer both questions. For non-crossing words we prove that the regularity of iterated hairpin completions is decidable and that if iterated hairpin completion of a non-crossing word is not regular, then it is not context-free either.
• ### On the regularity of iterated hairpin completion of a single word(1104.2385)

April 13, 2011 cs.FL
Hairpin completion is an abstract operation modeling a DNA bio-operation which receives as input a DNA strand $w = x\alpha y \calpha$, and outputs $w' = x \alpha y \bar{\alpha} \bar{x}$, where $\bar{x}$ denotes the Watson-Crick complement of $x$. In this paper, we focus on the problem of finding conditions under which the iterated hairpin completion of a given word is regular. According to the numbers of words $\alpha$ and $\calpha$ that initiate hairpin completion and how they are scattered, we classify the set of all words $w$. For some basic classes of words $w$ containing small numbers of occurrences of $\alpha$ and $\calpha$, we prove that the iterated hairpin completion of $w$ is regular. For other classes with higher numbers of occurrences of $\alpha$ and $\calpha$, we prove a necessary and sufficient condition for the iterated hairpin completion of a word in these classes to be regular.
• ### Polyominoes Simulating Arbitrary-Neighborhood Zippers and Tilings(1002.3769)

April 11, 2011 cs.CC
This paper provides a bridge between the classical tiling theory and the complex neighborhood self-assembling situations that exist in practice. The neighborhood of a position in the plane is the set of coordinates which are considered adjacent to it. This includes classical neighborhoods of size four, as well as arbitrarily complex neighborhoods. A generalized tile system consists of a set of tiles, a neighborhood, and a relation which dictates which are the "admissible" neighboring tiles of a given tile. Thus, in correctly formed assemblies, tiles are assigned positions of the plane in accordance to this relation. We prove that any validly tiled path defined in a given but arbitrary neighborhood (a zipper) can be simulated by a simple "ribbon" of microtiles. A ribbon is a special kind of polyomino, consisting of a non-self-crossing sequence of tiles on the plane, in which successive tiles stick along their adjacent edge. Finally, we extend this construction to the case of traditional tilings, proving that we can simulate arbitrary-neighborhood tilings by simple-neighborhood tilings, while preserving some of their essential properties.
• ### The Power of Nondeterminism in Self-Assembly(1006.2897)

Nov. 25, 2010 cs.DS, cs.CC
We investigate the role of nondeterminism in Winfree's abstract Tile Assembly Model (aTAM), which was conceived to model artificial molecular self-assembling systems constructed from DNA. Of particular practical importance is to find tile systems that minimize resources such as the number of distinct tile types, each of which corresponds to a set of DNA strands that must be custom-synthesized in actual molecular implementations of the aTAM. We seek to identify to what extent the use of nondeterminism in tile systems affects the resources required by such molecular shape-building algorithms. We first show a "molecular computability theoretic" result: there is an infinite shape S that is uniquely assembled by a tile system but not by any deterministic tile system. We then show an analogous phenomenon in the finitary "molecular complexity theoretic" case: there is a finite shape S that is uniquely assembled by a tile system with c tile types, but every deterministic tile system that uniquely assembles S has more than c tile types. In fact we extend the technique to derive a stronger (classical complexity theoretic) result, showing that the problem of finding the minimum number of tile types that uniquely assemble a given finite shape is Sigma-P-2-complete. In contrast, the problem of finding the minimum number of deterministic tile types that uniquely assemble a shape was shown to be NP-complete by Adleman, Cheng, Goel, Huang, Kempe, Moisset de Espan\'es, and Rothemund (Combinatorial Optimization Problems in Self-Assembly, STOC 2002). The conclusion is that nondeterminism confers extra power to assemble a shape from a small tile system, but unless the polynomial hierarchy collapses, it is computationally more difficult to exploit this power by finding the size of the smallest tile system, compared to finding the size of the smallest deterministic tile system.
• ### State Complexity of Catenation Combined with Star and Reversal(1008.1648)

Aug. 10, 2010 cs.FL
This paper is a continuation of our research work on state complexity of combined operations. Motivated by applications, we study the state complexities of two particular combined operations: catenation combined with star and catenation combined with reversal. We show that the state complexities of both of these combined operations are considerably less than the compositions of the state complexities of their individual participating operations.
• ### Ciliate Gene Unscrambling with Fewer Templates(1008.1654)

Aug. 10, 2010 cs.FL
One of the theoretical models proposed for the mechanism of gene unscrambling in some species of ciliates is the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg which has been generalized by Daley and McQuillan from a formal language theory perspective. In this paper, we propose a refinement of this model that generates regular languages using the iterated TGR system with a finite initial language and a finite set of templates, using fewer templates and a smaller alphabet compared to that of the Daley-McQuillan model. To achieve Turing completeness using only finite components, i.e., a finite initial language and a finite set of templates, we also propose an extension of the contextual template-guided recombination system (CTGR system) by Daley and McQuillan, by adding an extra control called permitting contexts on the usage of templates.
• ### State Complexity of Two Combined Operations: Reversal-Catenation and Star-Catenation(1006.4646)

June 23, 2010 cs.FL
In this paper, we show that, due to the structural properties of the resulting automaton obtained from a prior operation, the state complexity of a combined operation may not be equal but close to the mathematical composition of the state complexities of its component operations. In particular, we provide two witness combined operations: reversal combined with catenation and star combined with catenation.
• ### Scalable, Time-Responsive, Digital, Energy-Efficient Molecular Circuits using DNA Strand Displacement(1003.3275)

March 19, 2010 cs.OH
• ### Triangular Self-Assembly(1002.4996)

Feb. 26, 2010 cs.DM
We discuss the self-assembly system of triangular tiles instead of square tiles, in particular right triangular tiles and equilateral triangular tiles. We show that the triangular tile assembly system, either deterministic or non-deterministic, has the same power to the square tile assembly system in computation, which is Turing universal. By providing counter-examples, we show that the triangular tile assembly system and the square tile assembly system are not comparable in general. More precisely, there exists square tile assembly system S such that no triangular tile assembly system is a division of S and produces the same shape; there exists triangular tile assembly system T such that no square tile assembly system produces the same compatible shape with border glues. We also discuss the assembly of triangles by triangular tiles and obtain results similar to the assembly of squares, that is to assemble a triangular of size O(N^2), the minimal number of tiles required is in O(log N/log log N).
• ### Properties of Pseudo-Primitive Words and their Applications(1002.4084)

Feb. 22, 2010 cs.CC
A pseudo-primitive word with respect to an antimorphic involution \theta is a word which cannot be written as a catenation of occurrences of a strictly shorter word t and \theta(t). Properties of pseudo-primitive words are investigated in this paper. These properties link pseudo-primitive words with essential notions in combinatorics on words such as primitive words, (pseudo)-palindromes, and (pseudo)-commutativity. Their applications include an improved solution to the extended Lyndon-Sch\"utzenberger equation u_1 u_2 ... u_l = v_1 ... v_n w_1 ... w_m, where u_1, ..., u_l \in {u, \theta(u)}, v_1, ..., v_n \in {v, \theta(v)}, and w_1, ..., w_m \in {w, \theata(w)} for some words u, v, w, integers l, n, m \ge 2, and an antimorphic involution \theta. We prove that for l \ge 4, n,m \ge 3, this equation implies that u, v, w can be expressed in terms of a common word t and its image \theta(t). Moreover, several cases of this equation where l = 3 are examined.
• ### Negative Interactions in Irreversible Self-Assembly(1002.2746)

Feb. 14, 2010 cs.DS, cs.CC
This paper explores the use of negative (i.e., repulsive) interaction the abstract Tile Assembly Model defined by Winfree. Winfree postulated negative interactions to be physically plausible in his Ph.D. thesis, and Reif, Sahu, and Yin explored their power in the context of reversible attachment operations. We explore the power of negative interactions with irreversible attachments, and we achieve two main results. Our first result is an impossibility theorem: after t steps of assembly, Omega(t) tiles will be forever bound to an assembly, unable to detach. Thus negative glue strengths do not afford unlimited power to reuse tiles. Our second result is a positive one: we construct a set of tiles that can simulate a Turing machine with space bound s and time bound t, while ensuring that no intermediate assembly grows larger than O(s), rather than O(s * t) as required by the standard Turing machine simulation with tiles.
• ### Pseudo-Power Avoidance(0911.2233)

Nov. 11, 2009 cs.DS, cs.FL
Repetition avoidance has been studied since Thue's work. In this paper, we considered another type of repetition, which is called pseudo-power. This concept is inspired by Watson-Crick complementarity in DNA sequence and is defined over an antimorphic involution $\phi$. We first classify the alphabet $\Sigma$ and the antimorphic involution $\phi$, under which there exists sufficiently long pseudo-$k$th-power-free words. Then we present algorithms to test whether a finite word $w$ is pseudo-$k$th-power-free.