
A seed in a word is a relaxed version of a period in which the occurrences of
the repeating subword may overlap. We show a lineartime algorithm computing a
linearsize representation of all the seeds of a word (the number of seeds
might be quadratic). In particular, one can easily derive the shortest seed and
the number of seeds from our representation. Thus, we solve an open problem
stated in the survey by Smyth (2000) and improve upon a previous O(n log n)
algorithm by Iliopoulos, Moore, and Park (1996). Our approach is based on
combinatorial relations between seeds and subword complexity (used here for the
first time in context of seeds). In the previous papers, the compact
representation of seeds consisted of two independent parts operating on the
suffix tree of the word and the suffix tree of the reverse of the word,
respectively. Our second contribution is a simpler representation of all seeds
which avoids dealing with the reversed word.
A preliminary version of this work, with a much more complex algorithm
constructing the earlier representation of seeds, was presented at the 23rd
Annual ACMSIAM Symposium of Discrete Algorithms (SODA 2012).

A border of a string is a nonempty prefix of the string that is also a
suffix of the string, and a string is unbordered if it has no border other than
itself. Loptev, Kucherov, and Starikovskaya [CPM 2015] conjectured the
following: If we pick a string of length $n$ from a fixed nonunary alphabet
uniformly at random, then the expected maximum length of its unbordered factors
is $n  O(1)$. We confirm this conjecture by proving that the expected value
is, in fact, ${n  \Theta(\sigma^{1})}$, where $\sigma$ is the size of the
alphabet. This immediately implies that we can find such a maximal unbordered
factor in linear time on average. However, we go further and show that the
optimum averagecase running time is in $\Omega (\sqrt{n}) \cap O (\sqrt{n
\log_\sigma n})$ due to analogous bounds by Czumaj and G\k{a}sieniec [CPM 2000]
for the problem of computing the shortest period of a uniformly random string.

We consider the Abelian longest common factor problem in two scenarios: when
input strings are uncompressed and are of size $n$, and when the input strings
are runlength encoded and their compressed representations have size at most
$m$. The alphabet size is denoted by $\sigma$. For the uncompressed problem, we
show an $o(n^2)$time and $\Oh(n)$space algorithm in the case of
$\sigma=\Oh(1)$, making a nontrivial use of tabulation. For the RLEcompressed
problem, we show two algorithms: one working in $\Oh(m^2\sigma^2 \log^3 m)$
time and $\Oh(m (\sigma^2+\log^2 m))$ space, which employs line sweep, and one
that works in $\Oh(m^3)$ time and $\Oh(m)$ space that applies in a careful way
a slidingwindowbased approach. The latter improves upon the previously known
$\Oh(nm^2)$time and $\Oh(m^4)$time algorithms that were recently developed by
Sugimoto et al.\ (IWOCA 2017) and Grabowski (SPIRE 2017), respectively.

We consider the streaming complexity of a fundamental task in approximate
pattern matching: the $k$mismatch problem. It asks to compute Hamming
distances between a pattern of length $n$ and all length$n$ substrings of a
text for which the Hamming distance does not exceed a given threshold $k$. In
our problem formulation, we report not only the Hamming distance but also, on
demand, the full \emph{mismatch information}, that is the list of mismatched
pairs of symbols and their indices. The twin challenges of streaming pattern
matching derive from the need both to achieve small working space and also to
guarantee that every arriving input symbol is processed quickly.
We present a streaming algorithm for the $k$mismatch problem which uses
$O(k\log{n}\log\frac{n}{k})$ bits of space and spends \ourcomplexity time on
each symbol of the input stream, which consists of the pattern followed by the
text. The running time almost matches the classic offline solution and the
space usage is within a logarithmic factor of optimal.
Our new algorithm therefore effectively resolves and also extends an open
problem first posed in FOCS'09. En route to this solution, we also give a
deterministic $O( k (\log \frac{n}{k} + \log \Sigma) )$bit encoding of all
the alignments with Hamming distance at most $k$ of a length$n$ pattern within
a text of length $O(n)$. This secondary result provides an optimal solution to
a natural communication complexity problem which may be of independent
interest.

In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are
given two strings $X$ and $Y$ of total length $n$, and we are asked to find a
pair of maximallength factors, one of $X$ and the other of $Y$, such that
their Hamming distance is at most $k$. Thankachan et al. show that this problem
can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for
constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that
the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem
for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We
use difference covers to reduce the Long LCF$_k$ problem to a task involving
$m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved
in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a lineartime algorithm
for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary
$\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.

The orderpreserving model (opmodel, in short) was introduced quite recently
but has already attracted significant attention because of its applications in
data analysis. We introduce several types of periods in this setting
(opperiods). Then we give algorithms to compute these periods in time $O(n)$,
$O(n\log\log n)$, $O(n \log^2 \log n/\log \log \log n)$, $O(n\log n)$ depending
on the type of periodicity. In the most general variant the number of different
periods can be as big as $\Omega(n^2)$, and a compact representation is needed.
Our algorithms require novel combinatorial insight into the properties of such
periods.

We investigate the function $L(h,p,q)$, called here the threshold function,
related to periodicity of partial words (words with holes). The value
$L(h,p,q)$ is defined as the minimum length threshold which guarantees that a
natural extension of the periodicity lemma is valid for partial words with $h$
holes and (strong) periods $p,q$. We show how to evaluate the threshold
function in $O(\log p + \log q)$ time, which is an improvement upon the best
previously known $O(p+q)$time algorithm. In a series of papers, the formulae
for the threshold function, in terms of $p$ and $q$, were provided for each
fixed $h \le 7$. We demystify the generic structure of such formulae, and for
each value $h$ we express the threshold function in terms of a piecewiselinear
function with $O(h)$ pieces.

In the longest common substring problem we are given two strings of length
$n$ and must find a substring of maximal length that occurs in both strings. It
is wellknown that the problem can be solved in linear time, but the solution
is not robust and can vary greatly when the input strings are changed even by
one letter. To circumvent this, Leimeister and Morgenstern introduced the
problem of the longest common substring with $k$ mismatches. Lately, this
problem has received a lot of attention in the literature. In this paper we
first show a conditional lower bound based on the SETH hypothesis implying that
there is little hope to improve existing solutions. We then introduce a new but
closely related problem of the longest common substring with approximately $k$
mismatches and use computational geometry techniques to show that it admits a
solution with strongly subquadratic running time. We also apply these results
to obtain a strongly subquadratic approximation algorithm for the longest
common substring with $k$ mismatches problem and show conditional hardness of
improving its approximation ratio.

In a \emph{weighted sequence}, for every position of the sequence and every
letter of the alphabet a probability of occurrence of this letter at this
position is specified. Weighted sequences are commonly used to represent
imprecise or uncertain data, for example, in molecular biology where they are
known under the name of PositionWeight Matrices. Given a probability threshold
$\frac1z$, we say that a string $P$ of length $m$ occurs in a weighted sequence
$X$ at position $i$ if the product of probabilities of the letters of $P$ at
positions $i,\ldots,i+m1$ in $X$ is at least $\frac1z$. In this article, we
consider an \emph{indexing} variant of the problem, in which we are to
preprocess a weighted sequence to answer multiple pattern matching queries. We
present an $O(nz)$time construction of an $O(nz)$sized index for a weighted
sequence of length $n$ over a constantsized alphabet that answers pattern
matching queries in optimal, $O(m+Occ)$ time, where $Occ$ is the number of
occurrences reported. The cornerstone of our data structure is a novel
construction of a family of $\lfloor z \rfloor$ special strings that carries
the information about all the strings that occur in the weighted sequence with
a sufficient probability. We obtain a weighted index with the same complexities
as in the most efficient previously known index by Barton et al. (CPM 2016),
but our construction is significantly simpler. The most complex algorithmic
tool required in the basic form of our index is the suffix tree which we use to
develop a new, more straightforward index for the socalled property matching
problem. We provide an implementation of our data structure. Our construction
allows us also to obtain a significant improvement over the complexities of the
approximate variant of the weighted index presented by Biswas et al. (EDBT
2016) and an improvement of the space complexity of their general index.

We investigate two closely related LZ78based compression schemes: LZMW (an
old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.).
Both LZD and LZMW naturally produce a grammar for a string of length $n$; we
show that the size of this grammar can be larger than the size of the smallest
grammar by a factor $\Omega(n^{\frac{1}3})$ but is always within a factor
$O((\frac{n}{\log n})^{\frac{2}{3}})$. In addition, we show that the standard
algorithms using $\Theta(z)$ working space to construct the LZD and LZMW
parsings, where $z$ is the size of the parsing, work in $\Omega(n^{\frac{5}4})$
time in the worst case. We then describe a new Las Vegas LZD/LZMW parsing
algorithm that uses $O (z \log n)$ space and $O(n + z \log^2 n)$ time w.h.p..

We show that a problem of deleting a minimum number of vertices from a graph
to obtain a graph embeddable on a surface of a given Euler genus is solvable in
time $2^{C_g \cdot k^2 \log k} n^{O(1)}$, where $k$ is the size of the deletion
set, $C_g$ is a constant depending on the Euler genus $g$ of the target
surface, and $n$ is the size of the input graph. On the way to this result, we
develop an algorithm solving the problem in question in time $2^{O((t+g) \log
(t+g))} n$, given a tree decomposition of the input graph of width $t$. The
results generalize previous algorithms for the surface being a sphere by Marx
and Schlotter [Algorithmica 2012], Kawarabayashi [FOCS 2009], and Jansen,
Lokshtanov, and Saurabh [SODA 2014].

Strip packing is a classical packing problem, where the goal is to pack a set
of rectangular objects into a strip of a given width, while minimizing the
total height of the packing. The problem has multiple applications, e.g. in
scheduling and stockcutting, and has been studied extensively.
When the dimensions of objects are allowed to be exponential in the total
input size, it is known that the problem cannot be approximated within a factor
better than $3/2$, unless $\mathrm{P}=\mathrm{NP}$. However, there was no
corresponding lower bound for polynomially bounded input data. In fact,
Nadiradze and Wiese [SODA 2016] have recently proposed a $(1.4 + \epsilon)$
approximation algorithm for this variant, thus showing that strip packing with
polynomially bounded data can be approximated better than when exponentially
large values in the input data are allowed. Their result has subsequently been
improved to a $(4/3 + \epsilon)$ approximation by two independent research
groups [FSTTCS 2016, arXiv:1610.04430]. This raises a question whether strip
packing with polynomially bounded input data admits a quasipolynomial time
approximation scheme, as is the case for related twodimensional packing
problems like maximum independent set of rectangles or twodimensional
knapsack.
In this paper we answer this question in negative by proving that it is
NPhard to approximate strip packing within a factor better than $12/11$, even
when admitting only polynomially bounded input data. In particular, this shows
that the strip packing problem admits no quasipolynomial time approximation
scheme, unless $\mathrm{NP} \subseteq \mathrm{DTIME}(2^{\mathrm{polylog}(n)})$.

Suffix tree (and the closely related suffix array) are fundamental structures
capturing all substrings of a given text essentially by storing all its
suffixes in the lexicographical order. In some applications, we work with a
subset of $b$ interesting suffixes, which are stored in the socalled sparse
suffix tree. Because the size of this structure is $\Theta(b)$, it is natural
to seek a construction algorithm using only $O(b)$ words of space assuming
readonly random access to the text. We design a lineartime Monte Carlo
algorithm for this problem, hence resolving an open question explicitly stated
by Bille et al. [TALG 2016]. The best previously known algorithm by I et al.
[STACS 2014] works in $O(n\log b)$ time. Our solution proceeds in $n/b$ rounds;
in the $r$th round, we consider all suffixes starting at positions congruent
to $r$ modulo $n/b$. By maintaining rolling hashes, we lexicographically sort
all interesting suffixes starting at such positions, and then we merge them
with the already considered suffixes. For efficient merging, we also need to
answer LCE queries in small space. By plugging in the structure of Bille et al.
[CPM 2015] we obtain $O(n+b\log b)$ time complexity. We improve this structure,
which implies a lineartime sparse suffix tree construction algorithm. We
complement our Monte Carlo algorithm with a deterministic verification
procedure. The verification takes $O(n\sqrt{\log b})$ time, which improves upon
the bound of $O(n\log b)$ obtained by I et al. [STACS 2014]. This is obtained
by first observing that the pruning done inside the previous solution has a
rather clean description using the notion of graph spanners with small
multiplicative stretch. Then, we are able to decrease the verification time by
applying difference covers twice. Combined with the Monte Carlo algorithm, this
gives us an $O(n\sqrt{\log b})$time and $O(b)$space Las Vegas algorithm.

We study pattern matching problems on two major representations of uncertain
sequences used in molecular biology: weighted sequences (also known as position
weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple
version, in which only the pattern or only the text is uncertain, we obtain
efficient algorithms with theoreticallyprovable running times using a
variation of the lookahead scoring technique. We also consider a general
variant of the pattern matching problems in which both the pattern and the text
are uncertain. Central to our solution is a special case where the sequences
have equal length, called the consensus problem. We propose algorithms for the
consensus problem parameterized by the number of strings that match one of the
sequences. As our basic approach, a careful adaptation of the classic
meetinthemiddle algorithm for the knapsack problem is used. On the lower
bound side, we prove that our dependence on the parameter is optimal up to
lowerorder terms conditioned on the optimality of the original algorithm for
the knapsack problem.

Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Lineartime algorithms computing runs and
preprocessing for constanttime LCE queries have been known for over a decade.
However, these algorithms assume a linearlysortable integer alphabet. A recent
breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an
$O(n (\log n)^{2/3})$time algorithm for answering $O(n)$ LCE queries. This
result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to $O(n
\log \log n)$ time. In this work we note a special \emph{noncrossing} property
of LCE queries asked in the runs computation. We show that any $n$ such
noncrossing queries can be answered online in $O(n \alpha(n))$ time, which
yields an $O(n \alpha(n))$time algorithm for computing runs.

The combinatorics of squares in a word depends on how the equivalence of
halves of the square is defined. We consider Abelian squares, parameterized
squares, and orderpreserving squares. The word $uv$ is an Abelian
(parameterized, orderpreserving) square if $u$ and $v$ are equivalent in the
Abelian (parameterized, orderpreserving) sense. The maximum number of ordinary
squares in a word is known to be asymptotically linear, but the exact bound is
still investigated. We present several results on the maximum number of
distinct squares for nonstandard subword equivalence relations. Let
$\mathit{SQ}_{\mathrm{Abel}}(n,\sigma)$ and
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma)$ denote the maximum number of Abelian
squares in a word of length $n$ over an alphabet of size $\sigma$, which are
distinct as words and which are nonequivalent in the Abelian sense,
respectively. For $\sigma\ge 2$ we prove that
$\mathit{SQ}_{\mathrm{Abel}}(n,\sigma)=\Theta(n^2)$,
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma)=\Omega(n^{3/2})$ and
$\mathit{SQ}'_{\mathrm{Abel}}(n,\sigma) = O(n^{11/6})$. We also give linear
bounds for parameterized and orderpreserving squares for alphabets of constant
size: $\mathit{SQ}_{\mathrm{param}}(n,O(1))=\Theta(n)$,
$\mathit{SQ}_{\mathrm{op}}(n,O(1))=\Theta(n)$. The upper bounds have quadratic
dependence on the alphabet size for orderpreserving squares and exponential
dependence for parameterized squares.
As a side result we construct infinite words over the smallest alphabet which
avoid nontrivial orderpreserving squares and nontrivial parameterized cubes
(nontrivial parameterized squares cannot be avoided in an infinite word).

In this paper we study the fundamental problem of maintaining a dynamic
collection of strings under the following operations: concat  concatenates two
strings, split  splits a string into two at a given position, compare  finds
the lexicographical order (less, equal, greater) between two strings, LCP 
calculates the longest common prefix of two strings. We present an efficient
data structure for this problem, where an update requires only $O(\log n)$
worstcase time with high probability, with $n$ being the total length of all
strings in the collection, and a query takes constant worstcase time. On the
lower bound side, we prove that even if the only possible query is checking
equality of two strings, either updates or queries take amortized $\Omega(\log
n)$ time; hence our implementation is optimal.
Such operations can be used as a basic building block to solve other string
problems. We provide two examples. First, we can augment our data structure to
provide pattern matching queries that may locate occurrences of a specified
pattern $p$ in the strings in our collection in optimal $O(p)$ time, at the
expense of increasing update time to $O(\log^2 n)$. Second, we show how to
maintain a history of an edited text, processing updates in $O(\log t \log \log
t)$ time, where $t$ is the number of edits, and how to support pattern matching
queries against the whole history in $O(p \log t \log \log t)$ time.
Finally, we note that our data structure can be applied to test dynamic tree
isomorphism and to compare strings generated by dynamic straightline grammars.

Longest common extension queries (often called longest common prefix queries)
constitute a fundamental building block in multiple string algorithms, for
example computing runs and approximate pattern matching. We show that a
sequence of $q$ LCE queries for a string of size $n$ over a general ordered
alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only
$O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general
ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$
symbol comparisons. Our results improve upon a solution by Kosolobov
(Information Processing Letters, 2016), who gave an algorithm with $O(n
\log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We
make a significant progress towards resolving this conjecture. Our techniques
extend to the case of general unordered alphabets, when the time increases to
$O(q\log n + n\log^*n)$. The main tools are difference covers and the
disjointsets data structure.

The problem of finding factors of a text string which are identical or
similar to a given pattern string is a central problem in computer science. A
generalised version of this problem consists in implementing an index over the
text to support efficient online pattern queries. We study this problem in the
case where the text is weighted: for every position of the text and every
letter of the alphabet a probability of occurrence of this letter at this
position is given. Sequences of this type, also called position weight
matrices, are commonly used to represent imprecise or uncertain data. A
weighted sequence may represent many different strings, each with probability
of occurrence equal to the product of probabilities of its letters at
subsequent positions. Given a probability threshold $1/z$, we say that a
pattern string $P$ matches a weighted text at position $i$ if the product of
probabilities of the letters of $P$ at positions $i,\ldots,i+P1$ in the text
is at least $1/z$. In this article, we present an $O(nz)$time construction of
an $O(nz)$sized index that can answer pattern matching queries in a weighted
text in optimal time improving upon the state of the art by a factor of $z \log
z$. Other applications of this data structure include an $O(nz)$time
construction of the weighted prefix table and an $O(nz)$time computation of
all covers of a weighted sequence, which improve upon the state of the art by
the same factor.

For a text given in advance, the substring minimal suffix queries ask to
determine the lexicographically minimal nonempty suffix of a substring
specified by the location of its occurrence in the text. We develop a data
structure answering such queries optimally: in constant time after lineartime
preprocessing. This improves upon the results of Babenko et al. (CPM 2014),
whose tradeoff solution is characterized by $\Theta(n\log n)$ product of these
time complexities. Next, we extend our queries to support concatenations of
$O(1)$ substrings, for which the construction and query time is preserved. We
apply these generalized queries to compute lexicographically minimal and
maximal rotations of a given substring in constant time after lineartime
preprocessing.
Our data structures mainly rely on properties of Lyndon words and Lyndon
factorizations. We combine them with further algorithmic and combinatorial
tools, such as fusion trees and the notion of order isomorphism of strings.

Given a word $w$ and a Parikh vector $\mathcal{P}$, an abelian run of period
$\mathcal{P}$ in $w$ is a maximal occurrence of a substring of $w$ having
abelian period $\mathcal{P}$. Our main result is an online algorithm that,
given a word $w$ of length $n$ over an alphabet of cardinality $\sigma$ and a
Parikh vector $\mathcal{P}$, returns all the abelian runs of period
$\mathcal{P}$ in $w$ in time $O(n)$ and space $O(\sigma+p)$, where $p$ is the
norm of $\mathcal{P}$, i.e., the sum of its components. We also present an
online algorithm that computes all the abelian runs with periods of norm $p$ in
$w$ in time $O(np)$, for any given norm $p$. Finally, we give an $O(n^2)$time
offline randomized algorithm for computing all the abelian runs of $w$. Its
deterministic counterpart runs in $O(n^2\log\sigma)$ time.

We study a variation of the classical Shortest Common Superstring (SCS)
problem in which a shortest superstring of a finite set of strings $S$ is
sought containing as a factor every string of $S$ or its reversal. We call this
problem Shortest Common Superstring with Reversals (SCSR). This problem has
been introduced by Jiang et al., who designed a greedylike algorithm with
length approximation ratio $4$. In this paper, we show that a natural
adaptation of the classical greedy algorithm for SCS has (optimal) compression
ratio $\frac12$, i.e., the sum of the overlaps in the output string is at least
half the sum of the overlaps in an optimal solution. We also provide a
lineartime implementation of our algorithm.

We give efficient algorithms for ranking Lyndon words of length n over an
alphabet of size {\sigma}. The rank of a Lyndon word is its position in the
sequence of lexicographically ordered Lyndon words of the same length. The
outputs are integers of exponential size, and complexity of arithmetic
operations on such large integers cannot be ignored. Our model of computations
is the wordRAM, in which basic arithmetic operations on (large) numbers of
size at most {\sigma}^n take O(n) time. Our algorithm for ranking Lyndon words
makes O(n^2) arithmetic operations (this would imply directly cubic time on
wordRAM). However, using an algebraic approach we are able to reduce the total
time complexity on the wordRAM to O(n^2 log {\sigma}). We also present an
O(n^3 log^2 {\sigma})time algorithm that generates the Lyndon word of a given
length and rank in lexicographic order. Finally we use the connections between
Lyndon words and lexicographically minimal de Bruijn sequences (theorem of
Fredricksen and Maiorana) to develop the first polynomialtime algorithm for
decoding minimal de Bruijn sequence of any rank n (it determines the position
of an arbitrary word of length n within the de Bruijn sequence).

We generalize KarpRabin string matching to handle multiple patterns in
$\mathcal{O}(n \log n + m)$ time and $\mathcal{O}(s)$ space, where $n$ is the
length of the text and $m$ is the total length of the $s$ patterns, returning
correct answers with high probability. As a prime application of our algorithm,
we show how to approximate the LZ77 parse of a string of length $n$. If the
optimal parse consists of $z$ phrases, using only $\mathcal{O}(z)$ working
space we can return a parse consisting of at most $(1+\varepsilon)z$ phrases in
$\mathcal{O}(\varepsilon^{1}n\log n)$ time, for any $\varepsilon\in (0,1]$. As
previous quasilineartime algorithms for LZ77 use $\Omega(n/\textrm{polylog
}n)$ space, but $z$ can be exponentially small in $n$, these improvements in
space are substantial.

We present an improved wavelet tree construction algorithm and discuss its
applications to a number of rank/select problems for integer keys and strings.
Given a string of length n over an alphabet of size $\sigma\leq n$, our
method builds the wavelet tree in $O(n \log \sigma/ \sqrt{\log{n}})$ time,
improving upon the stateoftheart algorithm by a factor of $\sqrt{\log n}$.
As a consequence, given an array of n integers we can construct in $O(n
\sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and
capable of answering rank/select queries for the subranges of the array in
$O(\log n / \log \log n)$ time. This is a $\log \log n$factor improvement in
query time compared to Chan and P\u{a}tra\c{s}cu and a $\sqrt{\log n}$factor
improvement in construction time compared to Brodal et al.
Next, we switch to stringological context and propose a novel notion of
wavelet suffix trees. For a string w of length n, this data structure occupies
$O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously
captures the combinatorial structure of substrings of w while enabling
efficient topdown traversal and binary search. In particular, with a wavelet
suffix tree we are able to answer in $O(\log x)$ time the following two
natural analogues of rank/select queries for suffixes of substrings: for
substrings x and y of w count the number of suffixes of x that are
lexicographically smaller than y, and for a substring x of w and an integer k,
find the kth lexicographically smallest suffix of x.
We further show that wavelet suffix trees allow to compute a
runlengthencoded BurrowsWheeler transform of a substring x of w in $O(s \log
x)$ time, where s denotes the length of the resulting runlength encoding.
This answers a question by Cormode and Muthukrishnan, who considered an
analogous problem for LempelZiv compression.