• ### A Linear Time Algorithm for Seeds Computation(1107.2422)

March 13, 2019 cs.DS
A seed in a word is a relaxed version of a period in which the occurrences of the repeating subword may overlap. We show a linear-time algorithm computing a linear-size representation of all the seeds of a word (the number of seeds might be quadratic). In particular, one can easily derive the shortest seed and the number of seeds from our representation. Thus, we solve an open problem stated in the survey by Smyth (2000) and improve upon a previous O(n log n) algorithm by Iliopoulos, Moore, and Park (1996). Our approach is based on combinatorial relations between seeds and subword complexity (used here for the first time in context of seeds). In the previous papers, the compact representation of seeds consisted of two independent parts operating on the suffix tree of the word and the suffix tree of the reverse of the word, respectively. Our second contribution is a simpler representation of all seeds which avoids dealing with the reversed word. A preliminary version of this work, with a much more complex algorithm constructing the earlier representation of seeds, was presented at the 23rd Annual ACM-SIAM Symposium of Discrete Algorithms (SODA 2012).
• ### A Note on the Longest Common Compatible Prefix Problem for Partial Words(1312.2381)

Dec. 9, 2013 cs.DS, cs.FL
For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]\uparrow w[j,j+k-1]$, where $\uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$ time. It is a natural generalization of the longest common prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time and $O(1)$ query time solution exists. Recently an efficient algorithm for this problem has been given by F. Blanchet-Sadri and J. Lazarow (LATA 2013). The preprocessing time was $O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was designed for partial words over a constant alphabet and was quite involved. We present a simple solution to this problem with slightly better runtime that works for any linearly-sortable alphabet. Our preprocessing is in time $O(n\mu+n)$, where $\mu$ is the number of blocks of holes in $w$. Our algorithm uses ideas from alignment algorithms and dynamic programming.
• ### Order-Preserving Suffix Trees and Their Algorithmic Applications(1303.6872)

March 27, 2013 cs.DS
Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.
• ### A Note on Efficient Computation of All Abelian Periods in a String(1208.3313)

Aug. 16, 2012 cs.DM, cs.DS
We derive a simple efficient algorithm for Abelian periods knowing all Abelian squares in a string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. By the way we show an alternative algorithm for Abelian squares. We also obtain a linear time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of all Abelian periods to that of (already solved) all Abelian squares which provides new insight into both connected problems.
• ### Efficient Seeds Computation Revisited(1104.3153)

April 15, 2011 cs.DS
The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the paper we give linear time algorithms for some of its versions --- computing shortest left-seed array, longest left-seed array and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).
• ### On the maximal sum of exponents of runs in a string(1003.4866)

March 25, 2010 cs.DM, cs.FL
A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$
• ### On the maximal number of cubic subwords in a string(0911.1370)

Nov. 6, 2009 cs.DM, cs.FL
We investigate the problem of the maximum number of cubic subwords (of the form $www$) in a given word. We also consider square subwords (of the form $ww$). The problem of the maximum number of squares in a word is not well understood. Several new results related to this problem are produced in the paper. We consider two simple problems related to the maximum number of subwords which are squares or which are highly repetitive; then we provide a nontrivial estimation for the number of cubes. We show that the maximum number of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in a word of length $n$ is exactly $\lfloor \frac{n}{2}\rfloor - 1$, and the maximum number of subwords of the form $x^k$, for $k\ge 3$, is exactly $n-2$. In particular, the maximum number of cubes in a word is not greater than $n-2$ either. Using very technical properties of occurrences of cubes, we improve this bound significantly. We show that the maximum number of cubes in a word of length $n$ is between $(1/2)n$ and $(4/5)n$. (In particular, we improve the lower bound from the conference version of the paper.)
• ### On the maximal number of highly periodic runs in a string(0907.2157)

July 13, 2009 cs.DM, cs.DS
A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a string of length $n$ and construct a sequence of words for which we obtain the lower bound $0.406 n$.