
The \emph{state complexity} of a regular language $L_m$ is the number $m$ of
states in a minimal deterministic finite automaton (DFA) accepting $L_m$. The
state complexity of a regularitypreserving binary operation on regular
languages is defined as the maximal state complexity of the result of the
operation where the two operands range over all languages of state complexities
$\le m$ and $\le n$, respectively. We find a tight upper bound on the state
complexity of the binary operation \emph{overlap assembly} on regular
languages. This operation was introduced by CsuhajVarj\'u, Petre, and Vaszil
to model the process of selfassembly of two linear DNA strands into a longer
DNA strand, provided that their ends "overlap". We prove that the state
complexity of the overlap assembly of languages $L_m$ and $L_n$, where $m\ge 2$
and $n\ge1$, is at most $2 (m1) 3^{n1} + 2^n$. Moreover, for $m \ge 2$ and $n
\ge 3$ there exist languages $L_m$ and $L_n$ over an alphabet of size $n$ whose
overlap assembly meets the upper bound and this bound cannot be met with
smaller alphabets. Finally, we prove that $m+n$ is a tight upper bound on the
overlap assembly of unary languages, and that there are binary languages whose
overlap assembly has exponential state complexity at least $m(2^{n1}2)+2$.

We provide, on an extensive dataset and using several different distances,
confirmation of the hypothesis that CGR patterns are preserved along a genomic
DNA sequence, and are different for DNA sequences originating from genomes of
different species. This finding lends support to the theory that CGRs of
genomic sequences can act as graphic genomic signatures. In particular, we
compare the CGR patterns of over five hundred different 150,000 bp genomic
sequences originating from the genomes of six organisms, each belonging to one
of the kingdoms of life: H. sapiens, S. cerevisiae, A. thaliana, P. falciparum,
E. coli, and P. furiosus. We also provide preliminary evidence of this method's
applicability to closely related species by comparing H. sapiens (chromosome
21) sequences and over one hundred and fifty genomic sequences, also 150,000 bp
long, from P. troglodytes (Animalia; chromosome Y), for a total length of more
than 101 million basepairs analyzed. We compute pairwise distances between CGRs
of these genomic sequences using six different distances, and construct
Molecular Distance Maps that visualize all sequences as points in a
twodimensional or threedimensional space, to simultaneously display their
interrelationships. Our analysis confirms that CGR patterns of DNA sequences
from the same genome are in general quantitatively similar, while being
different for DNA sequences from genomes of different species. Our analysis of
the performance of the assessed distances uses three different quality measures
and suggests that several distances outperform the Euclidean distance, which
has so far been almost exclusively used for such studies. In particular we show
that, for this dataset, DSSIM (Structural Dissimilarity Index) and the
descriptor distance (introduced here) are best able to classify genomic
sequences.

This work concerns formal descriptions of DNA code properties, and builds on
previous work on transducer descriptions of classic code properties and on
trajectory descriptions of DNA code properties. This line of research allows us
to give a property as input to an algorithm, in addition to any regular
language, which can then answer questions about the language and the property.
Here we define DNA code properties via transducers and show that this method is
strictly more expressive than that of trajectories, without sacrificing the
efficiency of deciding the satisfaction question. We also show that the
maximality question can be undecidable. Our undecidability results hold not
only for the fixed DNA involution but also for any fixed antimorphic
permutation. Moreover, we also show the undecidability of the antimorphic
version of the Post Corresponding Problem, for any fixed antimorphic
permutation.

We propose a computational method to measure and visualize interrelationships
among any number of DNA sequences allowing, for example, the examination of
hundreds or thousands of complete mitochondrial genomes. An "image distance" is
computed for each pair of graphical representations of DNA sequences, and the
distances are visualized as a Molecular Distance Map: Each point on the map
represents a DNA sequence, and the spatial proximity between any two points
reflects the degree of structural similarity between the corresponding
sequences. The graphical representation of DNA sequences utilized, Chaos Game
Representation (CGR), is genome and speciesspecific and can thus act as a
genomic signature. Consequently, Molecular Distance Maps could inform species
identification, taxonomic classifications and, to a certain extent,
evolutionary history. The image distance employed, Structural Dissimilarity
Index (DSSIM), implicitly compares the occurrences of oligomers of length up to
$k$ (herein $k=9$) in DNA sequences. We computed DSSIM distances for more than
5 million pairs of complete mitochondrial genomes, and used MultiDimensional
Scaling (MDS) to obtain Molecular Distance Maps that visually display the
sequence relatedness in various subsets, at different taxonomic levels. This
generalpurpose method does not require DNA sequence homology and can thus be
used to compare similar or vastly different DNA sequences, genomic or
computergenerated, of the same or different lengths. We illustrate potential
uses of this approach by applying it to several taxonomic subsets: phylum
Vertebrata, (super)kingdom Protista, classes AmphibiaInsectaMammalia, class
Amphibia, and order Primates. This analysis of an extensive dataset confirms
that the oligomer composition of full mtDNA sequences can be a source of
taxonomic information.

We revisit the problem of computing the edit distance of a regular language
given via an NFA. This problem relates to the inherent maximal errordetecting
capability of the language in question. We present an efficient algorithm for
solving this problem which executes in time $O(r^2n^2d)$, where $r$ is the
cardinality of the alphabet involved, $n$ is the number of transitions in the
given NFA, and $d$ is the computed edit distance. We have implemented the
algorithm and present here performance tests. The correctness of the algorithm
is based on the result (also presented here) that the particular
errordetection property related to our problem can be defined via an
inputaltering transducer.

In the field of algorithmic selfassembly, a longstanding unproven
conjecture has been that of the NPhardness of binary pattern tile set
synthesis (2PATS). The $k$PATS problem is that of designing a tile assembly
system with the smallest number of tile types which will selfassemble an input
pattern of $k$ colors. Of both theoretical and practical significance, $k$PATS
has been studied in a series of papers which have shown $k$PATS to be NPhard
for $k = 60$, $k = 29$, and then $k = 11$. In this paper, we close the
fundamental conjecture that 2PATS is NPhard, concluding this line of study.
While most of our proof relies on standard mathematical proof techniques, one
crucial lemma makes use of a computerassisted proof, which is a relatively
novel but increasingly utilized paradigm for deriving proofs for complex
mathematical problems. This tool is especially powerful for attacking
combinatorial problems, as exemplified by the proof of the four color theorem
by Appel and Haken (simplified later by Robertson, Sanders, Seymour, and
Thomas) or the recent important advance on the Erd\H{o}s discrepancy problem by
Konev and Lisitsa using computer programs. We utilize a massively parallel
algorithm and thus turn an otherwise intractable portion of our proof into a
program which requires approximately a year of computation time, bringing the
use of computerassisted proofs to a new scale. We fully detail the algorithm
employed by our code, and make the code freely available online.

We propose a novel combination of methods that (i) portrays quantitative
characteristics of a DNA sequence as an image, (ii) computes distances between
these images, and (iii) uses these distances to output a map wherein each
sequence is a point in a common Euclidean space. In the resulting "Molecular
Distance Map" each point signifies a DNA sequence, and the geometric distance
between any two points reflects the degree of relatedness between the
corresponding sequences and species.
Molecular Distance Maps present compelling visual representations of
relationships between species and could be used for taxonomic clarifications,
for species identification, and for studies of evolutionary history. One of the
advantages of this method is its general applicability since, as sequence
alignment is not required, the DNA sequences chosen for comparison can be
completely different regions in different genomes. In fact, this method can be
used to compare any two DNA sequences. For example, in our dataset of 3,176
mitochondrial DNA sequences, it correctly finds the mtDNA sequences most
closely related to that of the anatomically modern human (the Neanderthal, the
Denisovan, and the chimp), and it finds that the sequence most different from
it belongs to a cucumber. Furthermore, our method can be used to compare real
sequences to artificial, computergenerated, DNA sequences. For example, it is
used to determine that the distances between a Homo sapiens sapiens mtDNA and
artificial sequences of the same length and same trinucleotide frequencies can
be larger than the distance between the same human mtDNA and the mtDNA of a
fruitfly.
We demonstrate this method's promising potential for taxonomical
clarifications by applying it to a diverse variety of cases that have been
historically controversial, such as the genus Polypterus, the family Tarsiidae,
and the vast (super)kingdom Protista.

Patterned selfassembly tile set synthesis PATS is the problem of finding a
minimal tile set which uniquely selfassembles into a given pattern. Czeizler
and Popa proved the NPcompleteness of PATS and Seki showed that the PATS
problem is already NPcomplete for patterns with 60 colors. In search for the
minimal number of colors such that PATS remains NPcomplete, we introduce
multiple bound PATS (mbPATS) where we allow bounds for the numbers of tile
types of each color. We show that mbPATS is NPcomplete for patterns with just
three colors and, as a byproduct of this result, we also obtain a novel proof
for the NPcompleteness of PATS which is more concise than the previous proofs.

Patterned selfassembly is a process whereby coloured tiles selfassemble to
build a rectangular coloured pattern. We propose selfassembly (SA) hypergraph
automata as an automatatheoretic model for patterned selfassembly. We
investigate the computational power of SAhypergraph automata and show that for
every recognizable picture language, there exists an SAhypergraph automaton
that accepts this language. Conversely, we prove that for any restricted
SAhypergraph automaton, there exists a Wang Tile System, a model for
recognizable picture languages, that accepts the same language. The advantage
of SAhypergraph automata over Wang automata, acceptors for the class of
recognizable picture languages, is that they do not rely on an a priori defined
scanning strategy

Splicing as a binary word/language operation is inspired by the DNA
recombination under the action of restriction enzymes and ligases, and was
first introduced by Tom Head in 1987. Shortly thereafter, it was proven that
the languages generated by (finite) splicing systems form a proper subclass of
the class of regular languages. However, the question of whether or not one can
decide if a given regular language is generated by a splicing system remained
open. In this paper we give a positive answer to this question. Namely, we
prove that, if a language is generated by a splicing system, then it is also
generated by a splicing system whose size is a function of the size of the
syntactic monoid of the input language, and which can be effectively
constructed.

Iterated hairpin completion is an operation on formal languages that is
inspired by the hairpin formation in DNA biochemistry. Iterated hairpin
completion of a word (or more precisely a singleton language) is always a
contextsensitive language and for some words it is known to be
noncontextfree. However, it is unknown whether regularity of iterated hairpin
completion of a given word is decidable. Also the question whether iterated
hairpin completion of a word can be contextfree but not regular was asked in
literature. In this paper we investigate iterated hairpin completions of
noncrossing words and, within this setting, we are able to answer both
questions. For noncrossing words we prove that the regularity of iterated
hairpin completions is decidable and that if iterated hairpin completion of a
noncrossing word is not regular, then it is not contextfree either.

Hairpin completion is an abstract operation modeling a DNA biooperation
which receives as input a DNA strand $w = x\alpha y \calpha$, and outputs $w' =
x \alpha y \bar{\alpha} \bar{x}$, where $\bar{x}$ denotes the WatsonCrick
complement of $x$. In this paper, we focus on the problem of finding conditions
under which the iterated hairpin completion of a given word is regular.
According to the numbers of words $\alpha$ and $\calpha$ that initiate hairpin
completion and how they are scattered, we classify the set of all words $w$.
For some basic classes of words $w$ containing small numbers of occurrences of
$\alpha$ and $\calpha$, we prove that the iterated hairpin completion of $w$ is
regular. For other classes with higher numbers of occurrences of $\alpha$ and
$\calpha$, we prove a necessary and sufficient condition for the iterated
hairpin completion of a word in these classes to be regular.

This paper provides a bridge between the classical tiling theory and the
complex neighborhood selfassembling situations that exist in practice. The
neighborhood of a position in the plane is the set of coordinates which are
considered adjacent to it. This includes classical neighborhoods of size four,
as well as arbitrarily complex neighborhoods. A generalized tile system
consists of a set of tiles, a neighborhood, and a relation which dictates which
are the "admissible" neighboring tiles of a given tile. Thus, in correctly
formed assemblies, tiles are assigned positions of the plane in accordance to
this relation. We prove that any validly tiled path defined in a given but
arbitrary neighborhood (a zipper) can be simulated by a simple "ribbon" of
microtiles. A ribbon is a special kind of polyomino, consisting of a
nonselfcrossing sequence of tiles on the plane, in which successive tiles
stick along their adjacent edge. Finally, we extend this construction to the
case of traditional tilings, proving that we can simulate
arbitraryneighborhood tilings by simpleneighborhood tilings, while preserving
some of their essential properties.

We investigate the role of nondeterminism in Winfree's abstract Tile Assembly
Model (aTAM), which was conceived to model artificial molecular selfassembling
systems constructed from DNA. Of particular practical importance is to find
tile systems that minimize resources such as the number of distinct tile types,
each of which corresponds to a set of DNA strands that must be
customsynthesized in actual molecular implementations of the aTAM. We seek to
identify to what extent the use of nondeterminism in tile systems affects the
resources required by such molecular shapebuilding algorithms.
We first show a "molecular computability theoretic" result: there is an
infinite shape S that is uniquely assembled by a tile system but not by any
deterministic tile system. We then show an analogous phenomenon in the finitary
"molecular complexity theoretic" case: there is a finite shape S that is
uniquely assembled by a tile system with c tile types, but every deterministic
tile system that uniquely assembles S has more than c tile types. In fact we
extend the technique to derive a stronger (classical complexity theoretic)
result, showing that the problem of finding the minimum number of tile types
that uniquely assemble a given finite shape is SigmaP2complete. In contrast,
the problem of finding the minimum number of deterministic tile types that
uniquely assemble a shape was shown to be NPcomplete by Adleman, Cheng, Goel,
Huang, Kempe, Moisset de Espan\'es, and Rothemund (Combinatorial Optimization
Problems in SelfAssembly, STOC 2002).
The conclusion is that nondeterminism confers extra power to assemble a shape
from a small tile system, but unless the polynomial hierarchy collapses, it is
computationally more difficult to exploit this power by finding the size of the
smallest tile system, compared to finding the size of the smallest
deterministic tile system.

This paper is a continuation of our research work on state complexity of
combined operations. Motivated by applications, we study the state complexities
of two particular combined operations: catenation combined with star and
catenation combined with reversal. We show that the state complexities of both
of these combined operations are considerably less than the compositions of the
state complexities of their individual participating operations.

One of the theoretical models proposed for the mechanism of gene unscrambling
in some species of ciliates is the templateguided recombination (TGR) system
by Prescott, Ehrenfeucht and Rozenberg which has been generalized by Daley and
McQuillan from a formal language theory perspective. In this paper, we propose
a refinement of this model that generates regular languages using the iterated
TGR system with a finite initial language and a finite set of templates, using
fewer templates and a smaller alphabet compared to that of the DaleyMcQuillan
model. To achieve Turing completeness using only finite components, i.e., a
finite initial language and a finite set of templates, we also propose an
extension of the contextual templateguided recombination system (CTGR system)
by Daley and McQuillan, by adding an extra control called permitting contexts
on the usage of templates.

In this paper, we show that, due to the structural properties of the
resulting automaton obtained from a prior operation, the state complexity of a
combined operation may not be equal but close to the mathematical composition
of the state complexities of its component operations. In particular, we
provide two witness combined operations: reversal combined with catenation and
star combined with catenation.

We propose a novel theoretical biomolecular design to implement any Boolean
circuit using the mechanism of DNA strand displacement. The design is scalable:
all species of DNA strands can in principle be mixed and prepared in a single
test tube, rather than requiring separate purification of each species, which
is a barrier to largescale synthesis. The design is timeresponsive: the
concentration of output species changes in response to the concentration of
input species, so that timevarying inputs may be continuously processed. The
design is digital: Boolean values of wires in the circuit are represented as
high or low concentrations of certain species, and we show how to construct a
singleinput, singleoutput signal restoration gate that amplifies the
difference between high and low, which can be distributed to each wire in the
circuit to overcome signal degradation. This means we can achieve a digital
abstraction of the analog values of concentrations. Finally, the design is
energyefficient: if input species are specified ideally (meaning absolutely 0
concentration of unwanted species), then output species converge to their ideal
concentrations at steadystate, and the system at steadystate is in (dynamic)
equilibrium, meaning that no energy is consumed by irreversible reactions until
the input again changes.
Drawbacks of our design include the following. If input is provided
nonideally (small positive concentration of unwanted species), then energy
must be continually expended to maintain correct output concentrations even at
steadystate. In addition, our fuel species  those species that are
permanently consumed in irreversible reactions  are not "generic"; each gate
in the circuit is powered by its own specific type of fuel species. Hence
different circuits must be powered by different types of fuel. Finally, we
require input to be given according to the dualrail convention, so that an
input of 0 is specified not only by the absence of a certain species, but by
the presence of another. That is, we do not construct a "true NOT gate" that
sets its output to high concentration if and only if its input's concentration
is low. It remains an open problem to design scalable, timeresponsive,
digital, energyefficient molecular circuits that additionally solve one of
these problems, or to prove that some subset of their resolutions are mutually
incompatible.

We discuss the selfassembly system of triangular tiles instead of square
tiles, in particular right triangular tiles and equilateral triangular tiles.
We show that the triangular tile assembly system, either deterministic or
nondeterministic, has the same power to the square tile assembly system in
computation, which is Turing universal. By providing counterexamples, we show
that the triangular tile assembly system and the square tile assembly system
are not comparable in general. More precisely, there exists square tile
assembly system S such that no triangular tile assembly system is a division of
S and produces the same shape; there exists triangular tile assembly system T
such that no square tile assembly system produces the same compatible shape
with border glues. We also discuss the assembly of triangles by triangular
tiles and obtain results similar to the assembly of squares, that is to
assemble a triangular of size O(N^2), the minimal number of tiles required is
in O(log N/log log N).

A pseudoprimitive word with respect to an antimorphic involution \theta is a
word which cannot be written as a catenation of occurrences of a strictly
shorter word t and \theta(t). Properties of pseudoprimitive words are
investigated in this paper. These properties link pseudoprimitive words with
essential notions in combinatorics on words such as primitive words,
(pseudo)palindromes, and (pseudo)commutativity. Their applications include an
improved solution to the extended LyndonSch\"utzenberger equation u_1 u_2 ...
u_l = v_1 ... v_n w_1 ... w_m, where u_1, ..., u_l \in {u, \theta(u)}, v_1,
..., v_n \in {v, \theta(v)}, and w_1, ..., w_m \in {w, \theata(w)} for some
words u, v, w, integers l, n, m \ge 2, and an antimorphic involution \theta. We
prove that for l \ge 4, n,m \ge 3, this equation implies that u, v, w can be
expressed in terms of a common word t and its image \theta(t). Moreover,
several cases of this equation where l = 3 are examined.

This paper explores the use of negative (i.e., repulsive) interaction the
abstract Tile Assembly Model defined by Winfree. Winfree postulated negative
interactions to be physically plausible in his Ph.D. thesis, and Reif, Sahu,
and Yin explored their power in the context of reversible attachment
operations. We explore the power of negative interactions with irreversible
attachments, and we achieve two main results. Our first result is an
impossibility theorem: after t steps of assembly, Omega(t) tiles will be
forever bound to an assembly, unable to detach. Thus negative glue strengths do
not afford unlimited power to reuse tiles. Our second result is a positive one:
we construct a set of tiles that can simulate a Turing machine with space bound
s and time bound t, while ensuring that no intermediate assembly grows larger
than O(s), rather than O(s * t) as required by the standard Turing machine
simulation with tiles.

Repetition avoidance has been studied since Thue's work. In this paper, we
considered another type of repetition, which is called pseudopower. This
concept is inspired by WatsonCrick complementarity in DNA sequence and is
defined over an antimorphic involution $\phi$. We first classify the alphabet
$\Sigma$ and the antimorphic involution $\phi$, under which there exists
sufficiently long pseudo$k$thpowerfree words. Then we present algorithms to
test whether a finite word $w$ is pseudo$k$thpowerfree.