
In the longest common factor (LCF) problem, we are given two strings $S$ and
$T$, each of length at most $n$, and we are asked to find a longest string
occurring in both $S$ and $T$. This is a classical and wellstudied problem in
computer science. The LCF length for two strings can vary greatly even when a
single character is changed. A data structure that can be built in
$\tilde{\mathcal{O}}(n)$ (The $\tilde{\mathcal{O}}$ notation suppresses
$\log^{\mathcal{O}(1)} n$ factors.) time and can return an LCF of the two
strings after a single edit operation (that is reverted afterwards) in
$\tilde{\mathcal{O}}(1)$ time was very recently proposed as a first step
towards the study of the fully dynamic LCF problem. In the fully dynamic
version, edit operations are allowed in any of the two strings, and we are to
report an LCF after each such operation. We present the first algorithm that
requires strongly sublinear time per edit operation. In particular, we show how
to return an LCF in $\tilde{\mathcal{O}}(n^{3/4})$ time after each operation
using $\tilde{\mathcal{O}}(n)$ space. We also present an algorithm with
$\tilde{\mathcal{O}}(\sqrt{n})$ query time for the restricted case where edits
are allowed only in one of the two strings and faster algorithms for several
restricted variants of dynamic and internal LCF problems (here `internal' means
that we are to answer queries about LCF on multiple factors of a given text).

In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are
given two strings $X$ and $Y$ of total length $n$, and we are asked to find a
pair of maximallength factors, one of $X$ and the other of $Y$, such that
their Hamming distance is at most $k$. Thankachan et al. show that this problem
can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for
constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that
the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem
for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We
use difference covers to reduce the Long LCF$_k$ problem to a task involving
$m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved
in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a lineartime algorithm
for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary
$\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.

Although realworld text datasets, such as DNA sequences, are far from being
uniformly random, averagecase string searching algorithms perform
significantly better than worstcase ones in most applications of interest. In
this paper, we study the problem of computing the longest prefix of each suffix
of a given string of length $n$ over a constantsized alphabet that occurs
elsewhere in the string with $k$errors. This problem has already been studied
under the Hamming distance model. Our first result is an improvement upon the
stateoftheart averagecase time complexity for nonconstant $k$ and using
only linear space under the Hamming distance model. Notably, we show that our
technique can be extended to the edit distance model with the same time and
space complexities. Specifically, our algorithms run in $\mathcal{O}(n \log^k n
\log \log n)$ time on average using $\mathcal{O}(n)$ space. We show that our
technique is applicable to several algorithmic problems in computational
biology and elsewhere.

In the Range Minimum Query (RMQ) problem, we are given an array $A$ of $n$
numbers and we are asked to answer queries of the following type: for indices
$i$ and $j$ between $0$ and $n1$, query $\text{RMQ}_A(i,j)$ returns the index
of a minimum element in the subarray $A[i..j]$. Answering a small batch of RMQs
is a core computational task in many realworld applications, in particular due
to the connection with the Lowest Common Ancestor (LCA) problem. With small
batch, we mean that the number $q$ of queries is $o(n)$ and we have them all at
hand. It is therefore not relevant to build an $\Omega(n)$sized data structure
or spend $\Omega(n)$ time to build a more succinct one. It is wellknown, among
practitioners and elsewhere, that these data structures for online querying
carry high constants in their preprocessing and querying time. We would thus
like to answer this batch efficiently in practice. With efficiently in
practice, we mean that we (ultimately) want to spend $n + \mathcal{O}(q)$ time
and $\mathcal{O}(q)$ space. We write $n$ to stress that the number of
operations per entry of $A$ should be a very small constant. Here we show how
existing algorithms can be easily modified to satisfy these conditions. The
presented experimental results highlight the practicality of this new scheme.
The most significant improvement obtained is for answering a small batch of LCA
queries. A library implementation of the presented algorithms is made
available.

In the kmappability problem, we are given a string x of length n and
integers m and k, and we are asked to count, for each lengthm factor y of x,
the number of other factors of length m of x that are at Hamming distance at
most k from y. We focus here on the version of the problem where k = 1. The
fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and
space O(n). We present two algorithms that require worstcase time O(mn) and
O(n log^2 n), respectively, and space O(n), thus greatly improving the state of
the art. Moreover, we present an algorithm that requires averagecase time and
space O(n) for integer alphabets if m = {\Omega}(log n/ log {\sigma}), where
{\sigma} is the alphabet size.

The observed frequency of the longest proper prefix, the longest proper
suffix, and the longest infix of a word $w$ in a given sequence $x$ can be used
for classifying $w$ as avoided or overabundant. The definitions used for the
expectation and deviation of $w$ in this statistical model were described and
biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have
very recently introduced a timeoptimal algorithm for computing all avoided
words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017).
In this article, we extend this study by presenting an $\mathcal{O}(n)$time
and $\mathcal{O}(n)$space algorithm for computing all overabundant words in a
sequence $x$ of length $n$ over an integer alphabet. Our main result is based
on a new nontrivial combinatorial property of the suffix tree $\mathcal{T}$ of
$x$: the number of distinct factors of $x$ whose longest infix is the label of
an explicit node of $\mathcal{T}$ is no more than $3n4$. We further show that
the presented algorithm is timeoptimal by proving that $\mathcal{O}(n)$ is a
tight upper bound for the number of overabundant words. Finally, we present
experimental results, using both synthetic and real data, which justify the
effectiveness and efficiency of our approach in practical terms.

Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal \deltapalindromes (i.e.
palindromes with \delta errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + \delta)) and space O(n g).

The deviation of the observed frequency of a word $w$ from its expected
frequency in a given sequence $x$ is used to determine whether or not the word
is avoided. This concept is particularly useful in DNA linguistic analysis. The
value of the standard deviation of $w$, denoted by $std(w)$, effectively
characterises the extent of a word by its edge contrast in the context in which
it occurs. A word $w$ of length $k>2$ is a $\rho$avoided word in $x$ if
$std(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word
may be completely absent from $x$. Hence computing all such words na\"{\i}vely
can be a very timeconsuming procedure, in particular for large $k$. In this
article, we propose an $O(n)$time and $O(n)$space algorithm to compute all
$\rho$avoided words of length $k$ in a given sequence $x$ of length $n$ over a
fixedsized alphabet. We also present a timeoptimal $O(\sigma n)$time and
$O(\sigma n)$space algorithm to compute all $\rho$avoided words (of any
length) in a sequence of length $n$ over an alphabet of size $\sigma$.
Furthermore, we provide a tight asymptotic upper bound for the number of
$\rho$avoided words and the expected length of the longest one. We make
available an opensource implementation of our algorithm. Experimental results,
using both real and synthetic data, show the efficiency of our implementation.