
In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are
given two strings $X$ and $Y$ of total length $n$, and we are asked to find a
pair of maximallength factors, one of $X$ and the other of $Y$, such that
their Hamming distance is at most $k$. Thankachan et al. show that this problem
can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for
constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that
the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem
for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We
use difference covers to reduce the Long LCF$_k$ problem to a task involving
$m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved
in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a lineartime algorithm
for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary
$\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.

Although realworld text datasets, such as DNA sequences, are far from being
uniformly random, averagecase string searching algorithms perform
significantly better than worstcase ones in most applications of interest. In
this paper, we study the problem of computing the longest prefix of each suffix
of a given string of length $n$ over a constantsized alphabet that occurs
elsewhere in the string with $k$errors. This problem has already been studied
under the Hamming distance model. Our first result is an improvement upon the
stateoftheart averagecase time complexity for nonconstant $k$ and using
only linear space under the Hamming distance model. Notably, we show that our
technique can be extended to the edit distance model with the same time and
space complexities. Specifically, our algorithms run in $\mathcal{O}(n \log^k n
\log \log n)$ time on average using $\mathcal{O}(n)$ space. We show that our
technique is applicable to several algorithmic problems in computational
biology and elsewhere.

In the Range Minimum Query (RMQ) problem, we are given an array $A$ of $n$
numbers and we are asked to answer queries of the following type: for indices
$i$ and $j$ between $0$ and $n1$, query $\text{RMQ}_A(i,j)$ returns the index
of a minimum element in the subarray $A[i..j]$. Answering a small batch of RMQs
is a core computational task in many realworld applications, in particular due
to the connection with the Lowest Common Ancestor (LCA) problem. With small
batch, we mean that the number $q$ of queries is $o(n)$ and we have them all at
hand. It is therefore not relevant to build an $\Omega(n)$sized data structure
or spend $\Omega(n)$ time to build a more succinct one. It is wellknown, among
practitioners and elsewhere, that these data structures for online querying
carry high constants in their preprocessing and querying time. We would thus
like to answer this batch efficiently in practice. With efficiently in
practice, we mean that we (ultimately) want to spend $n + \mathcal{O}(q)$ time
and $\mathcal{O}(q)$ space. We write $n$ to stress that the number of
operations per entry of $A$ should be a very small constant. Here we show how
existing algorithms can be easily modified to satisfy these conditions. The
presented experimental results highlight the practicality of this new scheme.
The most significant improvement obtained is for answering a small batch of LCA
queries. A library implementation of the presented algorithms is made
available.

In the kmappability problem, we are given a string x of length n and
integers m and k, and we are asked to count, for each lengthm factor y of x,
the number of other factors of length m of x that are at Hamming distance at
most k from y. We focus here on the version of the problem where k = 1. The
fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and
space O(n). We present two algorithms that require worstcase time O(mn) and
O(n log^2 n), respectively, and space O(n), thus greatly improving the state of
the art. Moreover, we present an algorithm that requires averagecase time and
space O(n) for integer alphabets if m = {\Omega}(log n/ log {\sigma}), where
{\sigma} is the alphabet size.

The observed frequency of the longest proper prefix, the longest proper
suffix, and the longest infix of a word $w$ in a given sequence $x$ can be used
for classifying $w$ as avoided or overabundant. The definitions used for the
expectation and deviation of $w$ in this statistical model were described and
biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have
very recently introduced a timeoptimal algorithm for computing all avoided
words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017).
In this article, we extend this study by presenting an $\mathcal{O}(n)$time
and $\mathcal{O}(n)$space algorithm for computing all overabundant words in a
sequence $x$ of length $n$ over an integer alphabet. Our main result is based
on a new nontrivial combinatorial property of the suffix tree $\mathcal{T}$ of
$x$: the number of distinct factors of $x$ whose longest infix is the label of
an explicit node of $\mathcal{T}$ is no more than $3n4$. We further show that
the presented algorithm is timeoptimal by proving that $\mathcal{O}(n)$ is a
tight upper bound for the number of overabundant words. Finally, we present
experimental results, using both synthetic and real data, which justify the
effectiveness and efficiency of our approach in practical terms.

Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal \deltapalindromes (i.e.
palindromes with \delta errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + \delta)) and space O(n g).

We present a generalization of a known fact from combinatorics on words
related to periodicity into quasiperiodicity. A string is called periodic if it
has a period which is at most half of its length. A string $w$ is called
quasiperiodic if it has a nontrivial cover, that is, there exists a string $c$
that is shorter than $w$ and such that every position in $w$ is inside one of
the occurrences of $c$ in $w$. It is a folklore fact that two strings that
differ at exactly one position cannot be both periodic. Here we prove a more
general fact that two strings that differ at exactly one position cannot be
both quasiperiodic. Along the way we obtain new insights into combinatorics of
quasiperiodicities.

Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Lineartime algorithms computing runs and
preprocessing for constanttime LCE queries have been known for over a decade.
However, these algorithms assume a linearlysortable integer alphabet. A recent
breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an
$O(n (\log n)^{2/3})$time algorithm for answering $O(n)$ LCE queries. This
result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to $O(n
\log \log n)$ time. In this work we note a special \emph{noncrossing} property
of LCE queries asked in the runs computation. We show that any $n$ such
noncrossing queries can be answered online in $O(n \alpha(n))$ time, which
yields an $O(n \alpha(n))$time algorithm for computing runs.

The deviation of the observed frequency of a word $w$ from its expected
frequency in a given sequence $x$ is used to determine whether or not the word
is avoided. This concept is particularly useful in DNA linguistic analysis. The
value of the standard deviation of $w$, denoted by $std(w)$, effectively
characterises the extent of a word by its edge contrast in the context in which
it occurs. A word $w$ of length $k>2$ is a $\rho$avoided word in $x$ if
$std(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word
may be completely absent from $x$. Hence computing all such words na\"{\i}vely
can be a very timeconsuming procedure, in particular for large $k$. In this
article, we propose an $O(n)$time and $O(n)$space algorithm to compute all
$\rho$avoided words of length $k$ in a given sequence $x$ of length $n$ over a
fixedsized alphabet. We also present a timeoptimal $O(\sigma n)$time and
$O(\sigma n)$space algorithm to compute all $\rho$avoided words (of any
length) in a sequence of length $n$ over an alphabet of size $\sigma$.
Furthermore, we provide a tight asymptotic upper bound for the number of
$\rho$avoided words and the expected length of the longest one. We make
available an opensource implementation of our algorithm. Experimental results,
using both real and synthetic data, show the efficiency of our implementation.

Approximate string matching is the problem of finding all factors of a text t
of length n that are at a distance at most k from a pattern x of length m.
Approximate circular string matching is the problem of finding all factors of t
that are at a distance at most k from x or from any of its rotations. In this
article, we present a new algorithm for approximate circular string matching
under the edit distance model with optimal averagecase search time O(n(k + log
m)/m). Optimal averagecase search time can also be achieved by the algorithms
for multiple approximate string matching (Fredriksson and Navarro, 2004) using
x and its rotations as the set of multiple patterns. Here we reduce the
preprocessing time and space requirements compared to that approach.

DNA sequencing is the process of determining the exact order of the
nucleotide bases of an individual's genome in order to catalogue sequence
variation and understand its biological implications. Wholegenome sequencing
techniques produce masses of data in the form of short sequences known as
reads. Assembling these reads into a whole genome constitutes a major
algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs
constructed from reads for this purpose. A critical step of these algorithms is
to detect typical motif structures in the graph caused by sequencing errors and
genome repeats, and filter them out; one such complex subgraph class is a
socalled superbubble. In this paper, we propose an O(n+m)time algorithm to
detect all superbubbles in a directed acyclic graph with n nodes and m
(directed) edges, improving the bestknown O(m log m)time algorithm by Sung et
al.

A degenerate symbol x* over an alphabet A is a nonempty subset of A, and a
sequence of such symbols is a degenerate string. A degenerate string is said to
be conservative if its number of nonsolid symbols is upperbounded by a fixed
positive constant k. We consider here the matching problem of conservative
degenerate strings and present the first lineartime algorithm that can find,
for given degenerate strings P* and T* of total length n containing k nonsolid
symbols in total, the occurrences of P* in T* in O(nk) time.

In this paper we consider the problem of computing the longest common abelian
factor (LCAF) between two given strings. We present a simple $O(\sigma~ n^2)$
time algorithm, where $n$ is the length of the strings and $\sigma$ is the
alphabet size, and a subquadratic running time solution for the binary string
case, both having linear space requirement. Furthermore, we present a modified
algorithm applying some interesting tricks and experimentally show that the
resulting algorithm runs faster.

We consider the problem of computing a shortest solid cover of an
indeterminate string. An indeterminate string may contain nonsolid symbols,
each of which specifies a subset of the alphabet that could be present at the
corresponding position. We also consider covering partial words, which are a
special case of indeterminate strings where each nonsolid symbol is a don't
care symbol. We prove that indeterminate string covering problem and partial
word covering problem are NPcomplete for binary alphabet and show that both
problems are fixedparameter tractable with respect to $k$, the number of
nonsolid symbols. For the indeterminate string covering problem we obtain a
$2^{O(k \log k)} + n k^{O(1)}$time algorithm. For the partial word covering
problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$time algorithm. We
prove that, unless the Exponential Time Hypothesis is false, no
$2^{o(\sqrt{k})} n^{O(1)}$time solution exists for either problem, which shows
that our algorithm for this case is close to optimal. We also present an
algorithm for both problems which is feasible in practice.

For a partial word $w$ the longest common compatible prefix of two positions
$i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k1]\uparrow
w[j,j+k1]$, where $\uparrow$ is the compatibility relation of partial words
(it is not an equivalence relation). The LCCP problem is to preprocess a
partial word in such a way that any query $lccp(i,j)$ about this word can be
answered in $O(1)$ time. It is a natural generalization of the longest common
prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time
and $O(1)$ query time solution exists.
Recently an efficient algorithm for this problem has been given by F.
BlanchetSadri and J. Lazarow (LATA 2013). The preprocessing time was
$O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was
designed for partial words over a constant alphabet and was quite involved.
We present a simple solution to this problem with slightly better runtime
that works for any linearlysortable alphabet. Our preprocessing is in time
$O(n\mu+n)$, where $\mu$ is the number of blocks of holes in $w$. Our algorithm
uses ideas from alignment algorithms and dynamic programming.

In this paper, we revisit the much studied problem of Pattern Matching with
Swaps (Swap Matching problem, for short). We first present a graphtheoretic
model, which opens a new and so far unexplored avenue to solve the problem.
Then, using the model, we devise two efficient algorithms to solve the swap
matching problem. The resulting algorithms are adaptations of the classic
shiftand algorithm. For patterns having length similar to the wordsize of the
target machine, both the algorithms run in linear time considering a fixed
alphabet.

We consider an index data structure for similar strings. The generalized
suffix tree can be a solution for this. The generalized suffix tree of two
strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and
$B$. It has $A+B$ leaves and can be constructed in $O(A+B)$ time.
However, if the two strings are similar, the generalized suffix tree is not
efficient because it does not exploit the similarity which is usually
represented as an alignment of $A$ and $B$.
In this paper we propose a space/timeefficient suffix tree of alignment
which wisely exploits the similarity in an alignment. Our suffix tree for an
alignment of $A$ and $B$ has $A + l_d + l_1$ leaves where $l_d$ is the sum of
the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the
lengths of some common parts of $A$ and $B$. We did not compromise the pattern
search to reduce the space. Our suffix tree can be searched for a pattern $P$
in $O(P+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and
$B$. We also present an efficient algorithm to construct the suffix tree of
alignment. When the suffix tree is constructed from scratch, the algorithm
requires $O(A + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths
of other common substrings of $A$ and $B$. When the suffix tree of $A$ is
already given, it requires $O(l_d + l_1 + l_2)$ time.

Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted
to Theor. Comp. Sci.) introduced orderpreserving pattern matching. In this
problem we are looking for consecutive substrings of the text that have the
same "shape" as a given pattern. These results include a lineartime
orderpreserving pattern matching algorithm for polynomiallybounded alphabet
and an extension of this result to pattern matching with multiple patterns. We
make one step forward in the analysis and give an
$O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix
trees in the orderpreserving setting. We show a number of applications of
orderpreserving suffix trees to identify patterns and repetitions in time
series.

We introduce a new string matching problem called orderpreserving matching
on numeric strings where a pattern matches a text if the text contains a
substring whose relative orders coincide with those of the pattern.
Orderpreserving matching is applicable to many scenarios such as stock price
analysis and musical melody matching in which the order relations should be
matched instead of the strings themselves. Solving orderpreserving matching
has to do with representations of order relations of a numeric string. We
define prefix representation and nearest neighbor representation, which lead to
efficient algorithms for orderpreserving matching. We present efficient
algorithms for single and multiple pattern cases. For the single pattern case,
we give an O(n log m) time algorithm and optimize it further to obtain O(n + m
log m) time. For the multiple pattern case, we give an O(n log m) time
algorithm.

Abelian periodicity of strings has been studied extensively over the last
years. In 2006 Constantinescu and Ilie defined the abelian period of a string
and several algorithms for the computation of all abelian periods of a string
were given. In contrast to the classical period of a word, its abelian version
is more flexible, factors of the word are considered the same under any
internal permutation of their letters. We show two O(y^2) algorithms for the
computation of all abelian periods of a string y. The first one maps each
letter to a suitable number such that each factor of the string can be
identified by the unique sum of the numbers corresponding to its letters and
hence abelian periods can be identified easily. The other one maps each letter
to a prime number such that each factor of the string can be identified by the
unique product of the numbers corresponding to its letters and so abelian
periods can be identified easily. We also define weak abelian periods on
strings and give an O(ylog(y)) algorithm for their computation, together
with some other algorithms for more basic problems.

The notion of the cover is a generalization of a period of a string, and
there are linear time algorithms for finding the shortest cover. The seed is a
more complicated generalization of periodicity, it is a cover of a superstring
of a given string, and the shortest seed problem is of much higher algorithmic
difficulty. The problem is not well understood, no linear time algorithm is
known. In the paper we give linear time algorithms for some of its versions 
computing shortest leftseed array, longest leftseed array and checking for
seeds of a given length. The algorithm for the last problem is used to compute
the seed array of a string (i.e., the shortest seeds for all the prefixes of
the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm
computing efficiently the shortest seeds. As a byproduct we obtain an
$O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at
least $m$ and finding the corresponding seed. We also correct some important
details missing in the previously known shortestseed algorithm (Iliopoulos et
al., 1996).