• Existential length universality(1702.03961)

Feb. 14, 2019 cs.FL
We study the following natural variation on the classical universality problem: given a language $L(M)$ represented by $M$ (e.g., a DFA/RE/NFA/PDA), does there exist an integer $\ell \geq 0$ such that $\Sigma^\ell \subseteq L(M)$? In the case of an NFA, we show that this problem is NEXPTIME-complete, and the smallest such $\ell$ can be doubly exponential in the number of states. This particular case was formulated as an open problem in 2009, and our solution uses a novel and involved construction. In the case of a PDA, we show that it is recursively unsolvable, while the smallest such $\ell$ is not bounded by any computable function of the number of states. In the case of a DFA, we show that the problem is NP-complete, and $e^{\sqrt{n \log n} (1+o(1))}$ is an asymptotically tight upper bound for the smallest such $\ell$, where $n$ is the number of states. Finally, we prove that in all these cases, the problem becomes computationally easier when the length $\ell$ is also given in binary in the input: it is polynomially solvable for a DFA, PSPACE-complete for an NFA, and co-NEXPTIME-complete for a PDA.
• Edit Distance between Unrooted Trees in Cubic Time(1804.10186)

April 26, 2018 cs.DS
Edit distance between trees is a natural generalization of the classical edit distance between strings, in which the allowed elementary operations are contraction, uncontraction and relabeling of an edge. Demaine et al. [ACM Trans. on Algorithms, 6(1), 2009] showed how to compute the edit distance between rooted trees on $n$ nodes in $\mathcal{O}(n^{3})$ time. However, generalizing their method to unrooted trees seems quite problematic, and the most efficient known solution remains to be the previous $\mathcal{O}(n^{3}\log n)$ time algorithm by Klein [ESA 1998]. Given the lack of progress on improving this complexity, it might appear that unrooted trees are simply more difficult than rooted trees. We show that this is, in fact, not the case, and edit distance between unrooted trees on $n$ nodes can be computed in $\mathcal{O}(n^{3})$ time. A significantly faster solution is unlikely to exist, as Bringmann et al. [SODA 2018] proved that the complexity of computing the edit distance between rooted trees cannot be decreased to $\mathcal{O}(n^{3-\epsilon})$ unless some popular conjecture fails, and the lower bound easily extends to unrooted trees. We also show that for two unrooted trees of size $m$ and $n$, where $m\le n$, our algorithm can be modified to run in $\mathcal{O}(nm^2(1+\log\frac nm))$. This, again, matches the complexity achieved by Demaine et al. for rooted trees, who also showed that this is optimal if we restrict ourselves to the so-called decomposition algorithms.
• Voronoi diagrams on planar graphs, and computing the diameter in deterministic $\tilde{O}(n^{5/3})$ time(1704.02793)

Feb. 25, 2018 cs.CG, cs.DS
We present an explicit and efficient construction of additively weighted Voronoi diagrams on planar graphs. Let $G$ be a planar graph with $n$ vertices and $b$ sites that lie on a constant number of faces. We show how to preprocess $G$ in $\tilde O(nb^2)$ time so that one can compute any additively weighted Voronoi diagram for these sites in $\tilde O(b)$ time. We use this construction to compute the diameter of a directed planar graph with real arc lengths in $\tilde{O}(n^{5/3})$ time. This improves the recent breakthrough result of Cabello (SODA'17), both by improving the running time (from $\tilde{O}(n^{11/6})$), and by providing a deterministic algorithm. It is in fact the first truly subquadratic deterministic algorithm for this problem. Our use of Voronoi diagrams to compute the diameter follows that of Cabello, but he used abstract Voronoi diagrams, which makes his diameter algorithm more involved, more expensive, and randomized. As in Cabello's work, our algorithm can also compute the Wiener index of a planar graph (i.e., the sum of all pairwise distances) within the same bound. Our construction of Voronoi diagrams for planar graphs is of independent interest. It has already been used to obtain fast exact distance oracles for planar graphs [Cohen-Addad et al., FOCS'17].
• A Faster FPTAS for #Knapsack(1802.05791)

Feb. 15, 2018 cs.DS
Given a set $W = \{w_1,\ldots, w_n\}$ of non-negative integer weights and an integer $C$, the #Knapsack problem asks to count the number of distinct subsets of $W$ whose total weight is at most $C$. In the more general integer version of the problem, the subsets are multisets. That is, we are also given a set $\{u_1,\ldots, u_n\}$ and we are allowed to take up to $u_i$ items of weight $w_i$. We present a deterministic FPTAS for #Knapsack running in $O(n^{2.5}\varepsilon^{-1.5}\log(n \varepsilon^{-1})\log (n \varepsilon))$ time. The previous best deterministic algorithm [FOCS 2011] runs in $O(n^3 \varepsilon^{-1} \log(n\varepsilon^{-1}))$ time (see also [ESA 2014] for a logarithmic factor improvement). The previous best randomized algorithm [STOC 2003] runs in $O(n^{2.5} \sqrt{\log (n\varepsilon^{-1}) } + \varepsilon^{-2} n^2 )$ time. Therefore, in the natural setting of constant $\varepsilon$, we close the gap between the $\tilde O(n^{2.5})$ randomized algorithm and the $\tilde O(n^3)$ deterministic algorithm. For the integer version with $U = \max_i \{u_i\}$, we present a deterministic FPTAS running in $O(n^{2.5}\varepsilon^{-1.5}\log(n\varepsilon^{-1} \log U)\log (n \varepsilon) \log^2 U)$ time. The previous best deterministic algorithm [APPROX 2016] runs in $O(n^3\varepsilon^{-1}\log(n \varepsilon^{-1} \log U) \log^2 U)$ time.
• Slowing Down Top Trees for Better Worst-Case Bounds(1801.01059)

Jan. 3, 2018 cs.DS
We consider the top tree compression scheme introduced by Bille et al. [ICALP 2013] and construct an infinite family of trees on $n$ nodes labeled from an alphabet of size $\sigma$, for which the size of the top DAG is $\Theta(\frac{n}{\log_\sigma n}\log\log_\sigma n)$. Our construction matches a previously known upper bound and exhibits a weakness of this scheme, as the information-theoretic lower bound is $\Omega(\frac{n}{\log_\sigma n})$. This settles an open problem stated by Lohrey et al. [arXiv 2017], who designed a more involved version achieving the lower bound. We show that this can be also guaranteed by a very minor modification of the original scheme: informally, one only needs to ensure that different parts of the tree are not compressed too quickly. Arguably, our version is more uniform, and in particular, the compression procedure is oblivious to the value of $\sigma$.
• Longest Gapped Repeats and Palindromes(1511.07180)

Oct. 11, 2017 cs.DS
A gapped repeat (respectively, palindrome) occurring in a word $w$ is a factor $uvu$ (respectively, $u^Rvu$) of $w$. In such a repeat (palindrome) $u$ is called the arm of the repeat (respectively, palindrome), while $v$ is called the gap. We show how to compute efficiently, for every position $i$ of the word $w$, the longest gapped repeat and palindrome occurring at that position, provided that the length of the gap is subject to various types of restrictions. That is, that for each position $i$ we compute the longest prefix $u$ of $w[i..n]$ such that $uv$ (respectively, $u^Rv$) is a suffix of $w[1..i-1]$ (defining thus a gapped repeat $uvu$ -- respectively, palindrome $u^Rvu$), and the length of $v$ is subject to the aforementioned restrictions.
• Better Tradeoffs for Exact Distance Oracles in Planar Graphs(1708.01386)

Aug. 4, 2017 cs.DS
We present an $O(n^{1.5})$-space distance oracle for directed planar graphs that answers distance queries in $O(\log n)$ time. Our oracle both significantly simplifies and significantly improves the recent oracle of Cohen-Addad, Dahlgaard and Wulff-Nilsen [FOCS 2017], which uses $O(n^{5/3})$-space and answers queries in $O(\log n)$ time. We achieve this by designing an elegant and efficient point location data structure for Voronoi diagrams on planar graphs. We further show a smooth tradeoff between space and query-time. For any $S\in [n,n^2]$, we show an oracle of size $S$ that answers queries in $\tilde O(\max\{1,n^{1.5}/S\})$ time. This new tradeoff is currently the best (up to polylogarithmic factors) for the entire range of $S$ and improves by polynomial factors over all the previously known tradeoffs for the range $S \in [n,n^{5/3}]$.
• Distinct Squares in Circular Words(1708.00639)

Aug. 2, 2017 cs.FL
A circular word, or a necklace, is an equivalence class under conjugation of a word. A fundamental question concerning regularities in standard words is bounding the number of distinct squares in a word of length $n$. The famous conjecture attributed to Fraenkel and Simpson is that there are at most $n$ such distinct squares, yet the best known upper bound is $1.84n$ by Deza et al. [Discr. Appl. Math. 180, 52-69 (2015)]. We consider a natural generalization of this question to circular words: how many distinct squares can there be in all cyclic rotations of a word of length $n$? We prove an upper bound of $3.14n$. This is complemented with an infinite family of words implying a lower bound of $1.25n$.
• Better Labeling Schemes for Nearest Common Ancestors through Minor-Universal Trees(1707.06011)

July 19, 2017 cs.DS
A labeling scheme for nearest common ancestors assigns a distinct binary string, called the label, to every node of a tree, so that given the labels of two nodes (and no further information about the topology of the tree) we can compute the label of their nearest common ancestor. The goal is to make the labels as short as possible. Alstrup, Gavoille, Kaplan, and Rauhe [Theor. Comput. Syst. 37(3):441-456 2004] showed that $O(\log n)$-bit labels are enough. More recently, Alstrup, Halvorsen, and Larsen [SODA 2014] refined this to only $2.772\log n$, and provided a lower bound of $1.008\log n$. We connect designing a labeling scheme for nearest common ancestors to the existence of a tree, called a minor-universal tree, that contains every tree on $n$ nodes as a topological minor. Even though it is not clear if a labeling scheme must be based on such a notion, we argue that the existing schemes can be reformulated as such, and it allows us to obtain clean and good bounds on the length of the labels. As the main upper bound, we show that $2.318\log n$-bit labels are enough. Surprisingly, the notion of a minor-universal tree for binary trees on $n$ nodes has been already used in a different context by Hrubes et al. [CCC 2010], and Young, Chu, and Wong [J. ACM 46(3):416-435, 1999] introduced a closely related notion of a universal tree. On the lower bound side, we show that the size of any minor-universal tree for trees on $n$ nodes is $\Omega(n^{2.174})$. This highlights a natural limitation for all approaches based on such trees. Our lower bound technique also implies that the size of any universal tree in the sense of Young et al. is $\Omega(n^{2.185})$, thus dramatically improves their lower bound of $\Omega(n\log n)$. We complement the existential results with a generic transformation that decreases the query time to constant in any scheme based on a minor-universal tree.
• A Faster Construction of Greedy Consensus Trees(1705.10548)

July 4, 2017 cs.DS
A consensus tree is a phylogenetic tree that captures the similarity between a set of conflicting phylogenetic trees. The problem of computing a consensus tree is a major step in phylogenetic tree reconstruction. It also finds applications in predicting a species tree from a set of gene trees. This paper focuses on two of the most well-known and widely used oconsensus tree methods: the greedy consensus tree and the frequency difference consensus tree. Given $k$ conflicting trees each with $n$ leaves, the previous fastest algorithms for these problems were $O(k n^2)$ for the greedy consensus tree [J. ACM 2016] and $\tilde O(\min \{ k n^2, k^2n\})$ for the frequency difference consensus tree [ACM TCBB 2016]. We improve these running times to $\tilde O(k n^{1.5})$ and $\tilde O(k n)$ respectively.
• Dispersion on Trees(1706.09185)

June 28, 2017 cs.DS
In the $k$-dispersion problem, we need to select $k$ nodes of a given graph so as to maximize the minimum distance between any two chosen nodes. This can be seen as a generalization of the independent set problem, where the goal is to select nodes so that the minimum distance is larger than 1. We design an optimal $O(n)$ time algorithm for the dispersion problem on trees consisting of $n$ nodes, thus improving the previous $O(n\log n)$ time solution from 1997. We also consider the weighted case, where the goal is to choose a set of nodes of total weight at least $W$. We present an $O(n\log^2n)$ algorithm improving the previous $O(n\log^4 n)$ solution. Our solution builds on the search version (where we know the minimum distance $\lambda$ between the chosen nodes) for which we present tight $\Theta(n\log n)$ upper and lower bounds.
• A Family of Approximation Algorithms for the Maximum Duo-Preservation String Mapping Problem(1702.02405)

May 30, 2017 cs.DS
In the Maximum Duo-Preservation String Mapping problem we are given two strings and wish to map the letters of the former to the letters of the latter so as to maximise the number of duos. A duo is a pair of consecutive letters that is mapped to a pair of consecutive letters in the same order. This is complementary to the well-studied Minimum Common String Partition problem, where the goal is to partition the former string into blocks that can be permuted and concatenated to obtain the latter string. Maximum Duo-Preservation String Mapping is APX-hard. After a series of improvements, Brubach [WABI 2016] showed a polynomial-time $3.25$-approximation algorithm. Our main contribution is that for any $\epsilon>0$ there exists a polynomial-time $(2+\epsilon)$-approximation algorithm. Similarly to a previous solution by Boria et al. [CPM 2016], our algorithm uses the local search technique. However, this is used only after a certain preliminary greedy procedure, which gives us more structure and makes a more general local search possible. We complement this with a specialised version of the algorithm that achieves $2.67$-approximation in quadratic time.
• Optimal Distance Labeling Schemes for Trees(1608.00212)

May 14, 2017 cs.DS
Labeling schemes seek to assign a short label to each node in a network, so that a function on two nodes can be computed by examining their labels alone. For the particular case of trees, optimal bounds (up to low order terms) were recently obtained for adjacency labeling [FOCS'15], nearest common ancestor labeling [SODA'14], and ancestry labeling [SICOMP'06]. In this paper we obtain optimal bounds for distance labeling. We present labels of size $1/4\log^2n+o(\log^2n)$, matching (up to low order terms) the recent $1/4\log^2n-O(\log n)$ lower bound [ICALP'16]. Prior to our work, all distance labeling schemes for trees could be reinterpreted as universal trees. A tree $T$ is said to be universal if any tree on $n$ nodes can be found as a subtree of $T$. A universal tree with $|T|$ nodes implies a distance labeling scheme with label size $\log |T|$. In 1981, Chung et al. proved that any distance labeling scheme based on universal trees requires labels of size $1/2\log^2 n-\log n\cdot\log\log n+O(\log n)$. Our scheme is the first to break this lower bound, showing a separation between distance labeling and universal trees. The $\Theta(\log^2 n)$ barrier has led researchers to consider distances bounded by $k$. The size of such labels was improved from $\log n+O(k\sqrt{\log n})$ [WADS'01] to $\log n+O(k^2(\log(k\log n))$ [SODA'03] and then to $\log n+O(k\log(k\log(n/k)))$ [PODC'07]. We show how to construct labels whose size is $\min\{\log n+O(k\log((\log n)/k)),O(\log n\cdot\log(k/\log n))\}$. We complement this with almost tight lower bounds of $\log n+\Omega(k\log(\log n/(k\log k)))$ and $\Omega(\log n\cdot\log(k/\log n))$. Finally, we consider $(1+\varepsilon)$-approximate distances. We show that the labeling scheme of [ICALP'16] can be easily modified to obtain an $O(\log(1/\varepsilon)\cdot\log n)$ upper bound and we prove a $\Omega(\log(1/\varepsilon)\cdot\log n)$ lower bound.
• Optimal trade-offs for pattern matching with $k$ mismatches(1704.01311)

April 5, 2017 cs.DS
Given a pattern of length $m$ and a text of length $n$, the goal in $k$-mismatch pattern matching is to compute, for every $m$-substring of the text, the exact Hamming distance to the pattern or report that it exceeds $k$. This can be solved in either $\widetilde{O}(n \sqrt{k})$ time as shown by Amir et al. [J. Algorithms 2004] or $\widetilde{O}((m + k^2) \cdot n/m)$ time due to a result of Clifford et al. [SODA 2016]. We provide a smooth time trade-off between these two bounds by designing an algorithm working in time $\widetilde{O}( (m + k \sqrt{m}) \cdot n/m)$. We complement this with a matching conditional lower bound, showing that a significantly faster combinatorial algorithm is not possible, unless the combinatorial matrix multiplication conjecture fails.
• Tree Edit Distance Cannot be Computed in Strongly Subcubic Time (unless APSP can)(1703.08940)

March 27, 2017 cs.DS
The edit distance between two rooted ordered trees with $n$ nodes labeled from an alphabet~$\Sigma$ is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. Tree edit distance is a well known generalization of string edit distance. The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a similar dynamic programming solution as string edit distance. In this paper we show that a truly subcubic $O(n^{3-\varepsilon})$ time algorithm for tree edit distance is unlikely: For $|\Sigma| = \Omega(n)$, a truly subcubic algorithm for tree edit distance implies a truly subcubic algorithm for the all pairs shortest paths problem. For $|\Sigma| = O(1)$, a truly subcubic algorithm for tree edit distance implies an $O(n^{k-\varepsilon})$ algorithm for finding a maximum weight $k$-clique. Thus, while in terms of upper bounds string edit distance and tree edit distance are highly related, in terms of lower bounds string edit distance exhibits the hardness of the strong exponential time hypothesis [Backurs, Indyk STOC'15] whereas tree edit distance exhibits the hardness of all pairs shortest paths. Our result provides a matching conditional lower bound for one of the last remaining classic dynamic programming problems.
• A note on distance labeling in planar graphs(1611.06529)

Nov. 20, 2016 cs.DS
A distance labeling scheme is an assignments of labels, that is binary strings, to all nodes of a graph, so that the distance between any two nodes can be computed from their labels and the labels are as short as possible. A major open problem is to determine the complexity of distance labeling in unweighted and undirected planar graphs. It is known that, in such a graph on $n$ nodes, some labels must consist of $\Omega(n^{1/3})$ bits, but the best known labeling scheme uses labels of length $O(\sqrt{n}\log n)$ [Gavoille, Peleg, P\'erennes, and Raz, J. Algorithms, 2004]. We show that, in fact, labels of length $O(\sqrt{n})$ are enough.
• Tight Tradeoffs for Real-Time Approximation of Longest Palindromes in Streams(1610.03125)

Oct. 10, 2016 cs.DS
We consider computing a longest palindrome in the streaming model, where the symbols arrive one-by-one and we do not have random access to the input. While computing the answer exactly using sublinear space is not possible in such a setting, one can still hope for a good approximation guarantee. Our contribution is twofold. First, we provide lower bounds on the space requirements for randomized approximation algorithms processing inputs of length $n$. We rule out Las Vegas algorithms, as they cannot achieve sublinear space complexity. For Monte Carlo algorithms, we prove a lower bounds of $\Omega( M \log\min\{|\Sigma|,M\})$ bits of memory; here $M=n/E$ for approximating the answer with additive error $E$, and $M= \frac{\log n}{\log (1+\varepsilon)}$ for approximating the answer with multiplicative error $(1 + \varepsilon)$. Second, we design three real-time algorithms for this problem. Our Monte Carlo approximation algorithms for both additive and multiplicative versions of the problem use $O(M)$ words of memory. Thus the obtained lower bounds are asymptotically tight up to a logarithmic factor. The third algorithm is deterministic and finds a longest palindrome exactly if it is short. This algorithm can be run in parallel with a Monte Carlo algorithm to obtain better results in practice. Overall, both the time and space complexity of finding a longest palindrome in a stream are essentially settled.
• Sparse Suffix Tree Construction in Optimal Time and Space(1608.00865)

Aug. 2, 2016 cs.DS
Suffix tree (and the closely related suffix array) are fundamental structures capturing all substrings of a given text essentially by storing all its suffixes in the lexicographical order. In some applications, we work with a subset of $b$ interesting suffixes, which are stored in the so-called sparse suffix tree. Because the size of this structure is $\Theta(b)$, it is natural to seek a construction algorithm using only $O(b)$ words of space assuming read-only random access to the text. We design a linear-time Monte Carlo algorithm for this problem, hence resolving an open question explicitly stated by Bille et al. [TALG 2016]. The best previously known algorithm by I et al. [STACS 2014] works in $O(n\log b)$ time. Our solution proceeds in $n/b$ rounds; in the $r$-th round, we consider all suffixes starting at positions congruent to $r$ modulo $n/b$. By maintaining rolling hashes, we lexicographically sort all interesting suffixes starting at such positions, and then we merge them with the already considered suffixes. For efficient merging, we also need to answer LCE queries in small space. By plugging in the structure of Bille et al. [CPM 2015] we obtain $O(n+b\log b)$ time complexity. We improve this structure, which implies a linear-time sparse suffix tree construction algorithm. We complement our Monte Carlo algorithm with a deterministic verification procedure. The verification takes $O(n\sqrt{\log b})$ time, which improves upon the bound of $O(n\log b)$ obtained by I et al. [STACS 2014]. This is obtained by first observing that the pruning done inside the previous solution has a rather clean description using the notion of graph spanners with small multiplicative stretch. Then, we are able to decrease the verification time by applying difference covers twice. Combined with the Monte Carlo algorithm, this gives us an $O(n\sqrt{\log b})$-time and $O(b)$-space Las Vegas algorithm.
• Sublinear-Space Distance Labeling using Hubs(1507.06240)

July 20, 2016 cs.DS
A distance labeling scheme is an assignment of bit-labels to the vertices of an undirected, unweighted graph such that the distance between any pair of vertices can be decoded solely from their labels. We propose a series of new labeling schemes within the framework of so-called hub labeling (HL, also known as landmark labeling or 2-hop-cover labeling), in which each node $u$ stores its distance to all nodes from an appropriately chosen set of hubs $S(u) \subseteq V$. For a queried pair of nodes $(u,v)$, the length of a shortest $u-v$-path passing through a hub node from $S(u)\cap S(v)$ is then used as an upper bound on the distance between $u$ and $v$. We present a hub labeling which allows us to decode exact distances in sparse graphs using labels of size sublinear in the number of nodes. For graphs with at most $n$ nodes and average degree $\Delta$, the tradeoff between label bit size $L$ and query decoding time $T$ for our approach is given by $L = O(n \log \log_\Delta T / \log_\Delta T)$, for any $T \leq n$. Our simple approach is thus the first sublinear-space distance labeling for sparse graphs that simultaneously admits small decoding time (for constant $\Delta$, we can achieve any $T=\omega(1)$ while maintaining $L=o(n)$), and it also provides an improvement in terms of label size with respect to previous slower approaches. By using similar techniques, we then present a $2$-additive labeling scheme for general graphs, i.e., one in which the decoder provides a 2-additive-approximation of the distance between any pair of nodes. We achieve almost the same label size-time tradeoff $L = O(n \log^2 \log T / \log T)$, for any $T \leq n$. To our knowledge, this is the first additive scheme with constant absolute error to use labels of sublinear size. The corresponding decoding time is then small (any $T=\omega(1)$ is sufficient).
• Improved Bounds for Shortest Paths in Dense Distance Graphs(1602.07013)

July 15, 2016 cs.DS
We study the problem of computing shortest paths in so-called dense distance graphs. Every planar graph $G$ on $n$ vertices can be partitioned into a set of $O(n/r)$ edge-disjoint regions (called an $r$-division) with $O(r)$ vertices each, such that each region has $O(\sqrt{r})$ vertices (called boundary vertices) in common with other regions. A dense distance graph of a region is a complete graph containing all-pairs distances between its boundary nodes. A dense distance graph of an $r$-division is the union of the $O(n/r)$ dense distance graphs of the individual pieces. Since the introduction of dense distance graphs by Fakcharoenphol and Rao, computing single-source shortest paths in dense distance graphs has found numerous applications in fundamental planar graph algorithms. Fakcharoenphol and Rao proposed an algorithm (later called FR-Dijkstra) for computing single-source shortest paths in a dense distance graph in $O\left(\frac{n}{\sqrt{r}}\log{n}\log{r}\right)$ time. We show an $O\left(\frac{n}{\sqrt{r}}\left(\frac{\log^2{r}}{\log^2\log{r}}+\log{n}\log^{\epsilon}{r}\right)\right)$ time algorithm for this problem, which is the first improvement to date over FR-Dijkstra for the important case when $r$ is polynomial in $n$. In this case, our algorithm is faster by a factor of $O(\log^2{\log{n}})$ and implies improved upper bounds for such planar graph problems as multiple-source multiple-sink maximum flow, single-source all-sinks maximum flow, and (dynamic) exact distance oracles.
• Randomized algorithms for finding a majority element(1603.01583)

April 28, 2016 cs.DS
Given $n$ colored balls, we want to detect if more than $\lfloor n/2\rfloor$ of them have the same color, and if so find one ball with such majority color. We are only allowed to choose two balls and compare their colors, and the goal is to minimize the total number of such operations. A well-known exercise is to show how to find such a ball with only $2n$ comparisons while using only a logarithmic number of bits for bookkeeping. The resulting algorithm is called the Boyer--Moore majority vote algorithm. It is known that any deterministic method needs $\lceil 3n/2\rceil-2$ comparisons in the worst case, and this is tight. However, it is not clear what is the required number of comparisons if we allow randomization. We construct a randomized algorithm which always correctly finds a ball of the majority color (or detects that there is none) using, with high probability, only $7n/6+o(n)$ comparisons. We also prove that the expected number of comparisons used by any such randomized method is at least $1.019n$.
• Optimal Dynamic Strings(1511.02612)

April 8, 2016 cs.DS
In this paper we study the fundamental problem of maintaining a dynamic collection of strings under the following operations: concat - concatenates two strings, split - splits a string into two at a given position, compare - finds the lexicographical order (less, equal, greater) between two strings, LCP - calculates the longest common prefix of two strings. We present an efficient data structure for this problem, where an update requires only $O(\log n)$ worst-case time with high probability, with $n$ being the total length of all strings in the collection, and a query takes constant worst-case time. On the lower bound side, we prove that even if the only possible query is checking equality of two strings, either updates or queries take amortized $\Omega(\log n)$ time; hence our implementation is optimal. Such operations can be used as a basic building block to solve other string problems. We provide two examples. First, we can augment our data structure to provide pattern matching queries that may locate occurrences of a specified pattern $p$ in the strings in our collection in optimal $O(|p|)$ time, at the expense of increasing update time to $O(\log^2 n)$. Second, we show how to maintain a history of an edited text, processing updates in $O(\log t \log \log t)$ time, where $t$ is the number of edits, and how to support pattern matching queries against the whole history in $O(|p| \log t \log \log t)$ time. Finally, we note that our data structure can be applied to test dynamic tree isomorphism and to compare strings generated by dynamic straight-line grammars.
• Faster Longest Common Extension Queries in Strings over General Alphabets(1602.00447)

April 7, 2016 cs.DS
Longest common extension queries (often called longest common prefix queries) constitute a fundamental building block in multiple string algorithms, for example computing runs and approximate pattern matching. We show that a sequence of $q$ LCE queries for a string of size $n$ over a general ordered alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only $O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$ symbol comparisons. Our results improve upon a solution by Kosolobov (Information Processing Letters, 2016), who gave an algorithm with $O(n \log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We make a significant progress towards resolving this conjecture. Our techniques extend to the case of general unordered alphabets, when the time increases to $O(q\log n + n\log^*n)$. The main tools are difference covers and the disjoint-sets data structure.
• Tight tradeoffs for approximating palindromes in streams(1410.6433)

April 7, 2016 cs.DS
We consider computing the longest palindrome in a text of length $n$ in the streaming model, where the characters arrive one-by-one, and we do not have random access to the input. While computing the answer exactly using sublinear memory is not possible in such a setting, one can still hope for a good approximation guarantee. We focus on the two most natural variants, where we aim for either additive or multiplicative approximation of the length of the longest palindrome. We first show that there is no point in considering Las Vegas algorithms in such a setting, as they cannot achieve sublinear space complexity. For Monte Carlo algorithms, we provide a lowerbound of $\Omega(\frac{n}{E})$ bits for approximating the answer with additive error $E$, and $\Omega(\frac{\log n}{\log(1+\varepsilon)})$ bits for approximating the answer with multiplicative error $(1+\varepsilon)$ for the binary alphabet. Then, we construct a generic Monte Carlo algorithm, which by choosing the parameters appropriately achieves space complexity matching up to a logarithmic factor for both variants. This substantially improves the previous results by Berenbrink et al. (STACS 2014) and essentially settles the space complexity.
• Efficiently Finding All Maximal $\alpha$-gapped Repeats(1509.09237)

Sept. 30, 2015 cs.DS
For $\alpha\geq 1$, an $\alpha$-gapped repeat in a word $w$ is a factor $uvu$ of $w$ such that $|uv|\leq \alpha |u|$; the two factors $u$ in such a repeat are called arms, while the factor $v$ is called gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In this paper we show that the number of maximal $\alpha$-gapped repeats that may occur in a word is upper bounded by $18\alpha n$. This allows us to construct an algorithm finding all the maximal $\alpha$-gapped repeats of a word in $O(\alpha n)$; this is optimal, in the worst case, as there are words that have $\Theta(\alpha n)$ maximal $\alpha$-gapped repeats. Our techniques can be extended to get comparable results in the case of $\alpha$-gapped palindromes, i.e., factors $uvu^\mathrm{T}$ with $|uv|\leq \alpha |u|$.