
A seed in a word is a relaxed version of a period in which the occurrences of
the repeating subword may overlap. We show a lineartime algorithm computing a
linearsize representation of all the seeds of a word (the number of seeds
might be quadratic). In particular, one can easily derive the shortest seed and
the number of seeds from our representation. Thus, we solve an open problem
stated in the survey by Smyth (2000) and improve upon a previous O(n log n)
algorithm by Iliopoulos, Moore, and Park (1996). Our approach is based on
combinatorial relations between seeds and subword complexity (used here for the
first time in context of seeds). In the previous papers, the compact
representation of seeds consisted of two independent parts operating on the
suffix tree of the word and the suffix tree of the reverse of the word,
respectively. Our second contribution is a simpler representation of all seeds
which avoids dealing with the reversed word.
A preliminary version of this work, with a much more complex algorithm
constructing the earlier representation of seeds, was presented at the 23rd
Annual ACMSIAM Symposium of Discrete Algorithms (SODA 2012).

In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are
given two strings $X$ and $Y$ of total length $n$, and we are asked to find a
pair of maximallength factors, one of $X$ and the other of $Y$, such that
their Hamming distance is at most $k$. Thankachan et al. show that this problem
can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for
constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that
the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem
for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We
use difference covers to reduce the Long LCF$_k$ problem to a task involving
$m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved
in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a lineartime algorithm
for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary
$\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.

The orderpreserving model (opmodel, in short) was introduced quite recently
but has already attracted significant attention because of its applications in
data analysis. We introduce several types of periods in this setting
(opperiods). Then we give algorithms to compute these periods in time $O(n)$,
$O(n\log\log n)$, $O(n \log^2 \log n/\log \log \log n)$, $O(n\log n)$ depending
on the type of periodicity. In the most general variant the number of different
periods can be as big as $\Omega(n^2)$, and a compact representation is needed.
Our algorithms require novel combinatorial insight into the properties of such
periods.

We investigate the function $L(h,p,q)$, called here the threshold function,
related to periodicity of partial words (words with holes). The value
$L(h,p,q)$ is defined as the minimum length threshold which guarantees that a
natural extension of the periodicity lemma is valid for partial words with $h$
holes and (strong) periods $p,q$. We show how to evaluate the threshold
function in $O(\log p + \log q)$ time, which is an improvement upon the best
previously known $O(p+q)$time algorithm. In a series of papers, the formulae
for the threshold function, in terms of $p$ and $q$, were provided for each
fixed $h \le 7$. We demystify the generic structure of such formulae, and for
each value $h$ we express the threshold function in terms of a piecewiselinear
function with $O(h)$ pieces.

Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Lineartime algorithms computing runs and
preprocessing for constanttime LCE queries have been known for over a decade.
However, these algorithms assume a linearlysortable integer alphabet. A recent
breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an
$O(n (\log n)^{2/3})$time algorithm for answering $O(n)$ LCE queries. This
result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to $O(n
\log \log n)$ time. In this work we note a special \emph{noncrossing} property
of LCE queries asked in the runs computation. We show that any $n$ such
noncrossing queries can be answered online in $O(n \alpha(n))$ time, which
yields an $O(n \alpha(n))$time algorithm for computing runs.

The combinatorics of squares in a word depends on how the equivalence of
halves of the square is defined. We consider Abelian squares, parameterized
squares, and orderpreserving squares. The word $uv$ is an Abelian
(parameterized, orderpreserving) square if $u$ and $v$ are equivalent in the
Abelian (parameterized, orderpreserving) sense. The maximum number of ordinary
squares in a word is known to be asymptotically linear, but the exact bound is
still investigated. We present several results on the maximum number of
distinct squares for nonstandard subword equivalence relations. Let
$\mathit{SQ}_{\mathrm{Abel}}(n,\sigma)$ and
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma)$ denote the maximum number of Abelian
squares in a word of length $n$ over an alphabet of size $\sigma$, which are
distinct as words and which are nonequivalent in the Abelian sense,
respectively. For $\sigma\ge 2$ we prove that
$\mathit{SQ}_{\mathrm{Abel}}(n,\sigma)=\Theta(n^2)$,
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma)=\Omega(n^{3/2})$ and
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma) = O(n^{11/6})$. We also give linear
bounds for parameterized and orderpreserving squares for alphabets of constant
size: $\mathit{SQ}_{\mathrm{param}}(n,O(1))=\Theta(n)$,
$\mathit{SQ}_{\mathrm{op}}(n,O(1))=\Theta(n)$. The upper bounds have quadratic
dependence on the alphabet size for orderpreserving squares and exponential
dependence for parameterized squares.
As a side result we construct infinite words over the smallest alphabet which
avoid nontrivial orderpreserving squares and nontrivial parameterized cubes
(nontrivial parameterized squares cannot be avoided in an infinite word).

Longest common extension queries (often called longest common prefix queries)
constitute a fundamental building block in multiple string algorithms, for
example computing runs and approximate pattern matching. We show that a
sequence of $q$ LCE queries for a string of size $n$ over a general ordered
alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only
$O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general
ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$
symbol comparisons. Our results improve upon a solution by Kosolobov
(Information Processing Letters, 2016), who gave an algorithm with $O(n
\log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We
make a significant progress towards resolving this conjecture. Our techniques
extend to the case of general unordered alphabets, when the time increases to
$O(q\log n + n\log^*n)$. The main tools are difference covers and the
disjointsets data structure.

We study a variation of the classical Shortest Common Superstring (SCS)
problem in which a shortest superstring of a finite set of strings $S$ is
sought containing as a factor every string of $S$ or its reversal. We call this
problem Shortest Common Superstring with Reversals (SCSR). This problem has
been introduced by Jiang et al., who designed a greedylike algorithm with
length approximation ratio $4$. In this paper, we show that a natural
adaptation of the classical greedy algorithm for SCS has (optimal) compression
ratio $\frac12$, i.e., the sum of the overlaps in the output string is at least
half the sum of the overlaps in an optimal solution. We also provide a
lineartime implementation of our algorithm.

We consider the problem of computing a shortest solid cover of an
indeterminate string. An indeterminate string may contain nonsolid symbols,
each of which specifies a subset of the alphabet that could be present at the
corresponding position. We also consider covering partial words, which are a
special case of indeterminate strings where each nonsolid symbol is a don't
care symbol. We prove that indeterminate string covering problem and partial
word covering problem are NPcomplete for binary alphabet and show that both
problems are fixedparameter tractable with respect to $k$, the number of
nonsolid symbols. For the indeterminate string covering problem we obtain a
$2^{O(k \log k)} + n k^{O(1)}$time algorithm. For the partial word covering
problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$time algorithm. We
prove that, unless the Exponential Time Hypothesis is false, no
$2^{o(\sqrt{k})} n^{O(1)}$time solution exists for either problem, which shows
that our algorithm for this case is close to optimal. We also present an
algorithm for both problems which is feasible in practice.

We consider several types of internal queries: questions about subwords of a
text. As the main tool we develop an optimal data structure for the problem
called here internal pattern matching. This data structure provides
constanttime answers to queries about occurrences of one subword $x$ in
another subword $y$ of a given text, assuming that $y=\mathcal{O}(x)$,
which allows for a constantspace representation of all occurrences. This
problem can be viewed as a natural extension of the wellstudied pattern
matching problem. The data structure has linear size and admits a lineartime
construction algorithm.
Using the solution to the internal pattern matching problem, we obtain very
efficient data structures answering queries about: primitivity of subwords,
periods of subwords, general substring compression, and cyclic equivalence of
two subwords. All these results improve upon the best previously known
counterparts. The linear construction time of our data structure also allows to
improve the algorithm for finding $\delta$subrepetitions in a text (a more
general version of maximal repetitions, also called runs). For any fixed
$\delta$ we obtain the first lineartime algorithm, which matches the linear
time complexity of the algorithm computing runs. Our data structure has already
been used as a part of the efficient solutions for subword suffix rank &
selection, as well as substring compression using BurrowsWheeler transform
composed with runlength encoding.

In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$
integer sequences, each of length $l$, and we are to find an integer sequence
$x$ of length $l$ (called a consensus sequence), such that the maximum
Manhattan distance of $x$ from each of the input sequences is minimized. For
binary sequences Manhattan distance coincides with Hamming distance, hence in
this case the string consensus problem (also called string center problem or
closest string problem) is a special case of MSC. Our main result is a
practically efficient $O(l)$time algorithm solving MSC for $k\le 5$ sequences.
Practicality of our algorithms has been verified experimentally. It improves
upon the quadratic algorithm by Amir et al.\ (SPIRE 2012) for string consensus
problem for $k=5$ binary strings. Similarly as in Amir's algorithm we use a
columnbased framework. We replace the implied general integer linear
programming by its easy special cases, due to combinatorial properties of the
MSC for $k\le 5$. We also show that for a general parameter $k$ any instance
can be reduced in linear time to a kernel of size $k!$, so the problem is
fixedparameter tractable. Nevertheless, for $k\ge 4$ this is still too large
for any naive solution to be feasible in practice.

A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies
within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus
generalizes the idea of a repetition, that is, a word composed of exact
concatenations of $u$. In this article we introduce a new notion of
$\alpha$partial cover, which can be viewed as a relaxed variant of cover, that
is, a factor covering at least $\alpha$ positions in $w$. We develop a data
structure of $O(n)$ size (where $n=w$) that can be constructed in $O(n\log
n)$ time which we apply to compute all shortest $\alpha$partial covers for a
given $\alpha$. We also employ it for an $O(n\log n)$time algorithm computing
a shortest $\alpha$partial cover for each $\alpha=1,2,\ldots,n$.

For a partial word $w$ the longest common compatible prefix of two positions
$i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k1]\uparrow
w[j,j+k1]$, where $\uparrow$ is the compatibility relation of partial words
(it is not an equivalence relation). The LCCP problem is to preprocess a
partial word in such a way that any query $lccp(i,j)$ about this word can be
answered in $O(1)$ time. It is a natural generalization of the longest common
prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time
and $O(1)$ query time solution exists.
Recently an efficient algorithm for this problem has been given by F.
BlanchetSadri and J. Lazarow (LATA 2013). The preprocessing time was
$O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was
designed for partial words over a constant alphabet and was quite involved.
We present a simple solution to this problem with slightly better runtime
that works for any linearlysortable alphabet. Our preprocessing is in time
$O(n\mu+n)$, where $\mu$ is the number of blocks of holes in $w$. Our algorithm
uses ideas from alignment algorithms and dynamic programming.

Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted
to Theor. Comp. Sci.) introduced orderpreserving pattern matching. In this
problem we are looking for consecutive substrings of the text that have the
same "shape" as a given pattern. These results include a lineartime
orderpreserving pattern matching algorithm for polynomiallybounded alphabet
and an extension of this result to pattern matching with multiple patterns. We
make one step forward in the analysis and give an
$O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix
trees in the orderpreserving setting. We show a number of applications of
orderpreserving suffix trees to identify patterns and repetitions in time
series.

We derive a simple efficient algorithm for Abelian periods knowing all
Abelian squares in a string. An efficient algorithm for the latter problem was
given by Cummings and Smyth in 1997. By the way we show an alternative
algorithm for Abelian squares. We also obtain a linear time algorithm finding
all `long' Abelian periods. The aim of the paper is a (new) reduction of the
problem of all Abelian periods to that of (already solved) all Abelian squares
which provides new insight into both connected problems.

The notion of the cover is a generalization of a period of a string, and
there are linear time algorithms for finding the shortest cover. The seed is a
more complicated generalization of periodicity, it is a cover of a superstring
of a given string, and the shortest seed problem is of much higher algorithmic
difficulty. The problem is not well understood, no linear time algorithm is
known. In the paper we give linear time algorithms for some of its versions 
computing shortest leftseed array, longest leftseed array and checking for
seeds of a given length. The algorithm for the last problem is used to compute
the seed array of a string (i.e., the shortest seeds for all the prefixes of
the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm
computing efficiently the shortest seeds. As a byproduct we obtain an
$O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at
least $m$ and finding the corresponding seed. We also correct some important
details missing in the previously known shortestseed algorithm (Iliopoulos et
al., 1996).

A run is an inclusion maximal occurrence in a string (as a subinterval) of a
repetition $v$ with a period $p$ such that $2p \le v$. The exponent of a run
is defined as $v/p$ and is $\ge 2$. We show new bounds on the maximal sum of
exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is
better than the best previously known proven bound of $5.6n$ by Crochemore &
Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary
words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the
maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$

We investigate the problem of the maximum number of cubic subwords (of the
form $www$) in a given word. We also consider square subwords (of the form
$ww$). The problem of the maximum number of squares in a word is not well
understood. Several new results related to this problem are produced in the
paper. We consider two simple problems related to the maximum number of
subwords which are squares or which are highly repetitive; then we provide a
nontrivial estimation for the number of cubes. We show that the maximum number
of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in
a word of length $n$ is exactly $\lfloor \frac{n}{2}\rfloor  1$, and the
maximum number of subwords of the form $x^k$, for $k\ge 3$, is exactly $n2$.
In particular, the maximum number of cubes in a word is not greater than $n2$
either. Using very technical properties of occurrences of cubes, we improve
this bound significantly. We show that the maximum number of cubes in a word of
length $n$ is between $(1/2)n$ and $(4/5)n$. (In particular, we improve the
lower bound from the conference version of the paper.)

A run is a maximal occurrence of a repetition $v$ with a period $p$ such that
$2p \le v$. The maximal number of runs in a string of length $n$ was studied
by several authors and it is known to be between $0.944 n$ and $1.029 n$. We
investigate highly periodic runs, in which the shortest period $p$ satisfies
$3p \le v$. We show the upper bound $0.5n$ on the maximal number of such runs
in a string of length $n$ and construct a sequence of words for which we obtain
the lower bound $0.406 n$.