• ### Estimation of the True Evolutionary Distance under the Fragile Breakage Model(1510.08002)

May 25, 2017 math.PR, q-bio.QM, q-bio.PE, q-bio.GN
The ability to estimate the evolutionary distance between extant genomes plays a crucial role in many phylogenomic studies. Often such estimation is based on the parsimony assumption, implying that the distance between two genomes can be estimated as the rearrangement distance equal the minimal number of genome rearrangements required to transform one genome into the other. However, in reality the parsimony assumption may not always hold, emphasizing the need for estimation that does not rely on the rearrangement distance. The distance that accounts for the actual (rather than minimal) number of rearrangements between two genomes is often referred to as the true evolutionary distance. While there exists a method for the true evolutionary distance estimation, it however assumes that genomes can be broken by rearrangements equally likely at any position in the course of evolution. This assumption, known as the random breakage model, has recently been refuted in favor of the more rigorous fragile breakage model postulating that only certain "fragile" genomic regions are prone to rearrangements. We propose a new method for estimating the true evolutionary distance between two genomes under the fragile breakage model. We evaluate the proposed method on simulated genomes, which show its high accuracy. We further apply the proposed method for estimation of evolutionary distances within a set of five yeast genomes and a set of two fish genomes.
• ### Generalized Hultman Numbers and Cycle Structures of Breakpoint Graphs(1503.05285)

Feb. 11, 2017 math.CO, q-bio.GN
Genome rearrangements can be modeled as $k$-breaks, which break a genome at k positions and glue the resulting fragments in a new order. In particular, reversals, translocations, fusions, and fissions are modeled as $2$-breaks, and transpositions are modeled as $3$-breaks. While $k$-break rearrangements for $k>3$ have not been observed in evolution, they are used in cancer genomics to model chromothripsis, a catastrophic event of multiple breakages happening simultaneously in a genome. It is known that the $k$-break distance between two genomes (i.e., the minimum number of $k$-breaks required to transform one genome into the other) can be computed in terms of cycle lengths in the breakpoint graph of these genomes. In the current work, we address the combinatorial problem of enumerating genomes at a given $k$-break distance from a fixed unichromosomal genome. More generally, we enumerate genome pairs, whose breakpoint graph has a given distribution of cycle lengths. We further show how our enumeration can be used for uniform sampling of random genomes at a given $k$-break distance, and describe its connection to various combinatorial objects such as Bell polynomials.
• ### Exact tests for stochastic block models(1612.06040)

Dec. 19, 2016 math.ST, stat.TH, stat.ME
We develop a finite-sample goodness-of-fit test for \emph{latent-variable} block models for networks and test it on simulated and real data sets. The main building block for the latent block assignment model test is the exact test for the model with observed blocks assignment. The latter is implemented using algebraic statistics. While we focus on three variants of the stochastic block model, the methodology extends to any mixture of log-linear models on discrete data.
• ### Combinatorial Scoring of Phylogenetic Networks(1602.02841)

Aug. 8, 2016 math.CO, q-bio.PE, cs.DS
Construction of phylogenetic trees and networks for extant species from their characters represents one of the key problems in phylogenomics. While solution to this problem is not always uniquely defined and there exist multiple methods for tree/network construction, it becomes important to measure how well the constructed networks capture the given character relationship across the species. In the current study, we propose a novel method for measuring the specificity of a given phylogenetic network in terms of the total number of distributions of character states at the leaves that the network may impose. While for binary phylogenetic trees, this number has an exact formula and depends only on the number of leaves and character states but not on the tree topology, the situation is much more complicated for non-binary trees or networks. Nevertheless, we develop an algorithm for combinatorial enumeration of such distributions, which is applicable for arbitrary trees and networks under some reasonable assumptions.
• ### Singular Values Distribution of Squares of Elliptic Random Matrices and Type B Narayana Polynomials(1501.04615)

April 8, 2016 math.CO, math.PR
We consider Gaussian elliptic random matrices $X$ of a size $N \times N$ with parameter $\rho$, i.e., matrices whose pairs of entries $(X_{ij}, X_{ji})$ are mutually independent Gaussian vectors, $E X_{ij} = 0$, $E X^2_{ij} = 1$ and $E X_{ij} X_{ji} = \rho$. We are interested in the asymptotic distribution of eigenvalues of the matrix $W =\frac{1}{N^2} X^2 X^{*2}$. We have shown that this distribution is defined by its moments and we provide a recurrent relation for these moments. We have proven that the (symmetrized) asymptotic distribution is determined by its free cumulants, which are Narayana polynomials of type B: $$c_{2n} = \sum_{k=0}^n \binom{n}{k}^2 \rho^{2k}.$$
• ### A Computational Method for the Rate Estimation of Evolutionary Transpositions(1501.07546)

Jan. 29, 2015 math.CO, math.PR, q-bio.PE, q-bio.GN
Genome rearrangements are evolutionary events that shuffle genomic architectures. Most frequent genome rearrangements are reversals, translocations, fusions, and fissions. While there are some more complex genome rearrangements such as transpositions, they are rarely observed and believed to constitute only a small fraction of genome rearrangements happening in the course of evolution. The analysis of transpositions is further obfuscated by intractability of the underlying computational problems. We propose a computational method for estimating the rate of transpositions in evolutionary scenarios between genomes. We applied our method to a set of mammalian genomes and estimated the transpositions rate in mammalian evolution to be around 0.26.
• ### Hultman numbers, polygon gluings and matrix integrals(1111.3061)

Nov. 13, 2011 math.CO, math.PR
The Hultman numbers enumerate permutations whose cycle graph has a given number of alternating cycles (they are relevant to the Bafna-Pevzner approach to genome comparison and genome rearrangements). We give two new interpretations of the Hultman numbers: in terms of polygon gluings and as integrals over the space of complex matrices, and derive some properties of their generating functions.
• ### On the asymptotic distribution of singular values of products of large rectangular random matrices(1012.2586)

April 26, 2011 math.PR
We consider products of independent large random rectangular matrices with independent entries. The limit distribution of the expected empirical distribution of singular values of such products is computed. The distribution function is described by its Stieltjes transform, which satisfies some algebraic equation. In the particular case of square matrices we get a well-known distribution which moments are Fuss-Catalan numbers.
• ### On the asymptotic distribution of the singular values of powers of random matrices(1012.2743)

Dec. 13, 2010 math.PR
We consider powers of random matrices with independent entries. Let $X_{ij}, i,j\ge 1$, be independent complex random variables with $\E X_{ij}=0$ and $\E |X_{ij}|^2=1$ and let $\mathbf X$ denote an $n\times n$ matrix with $[\mathbf X]_{ij}=X_{ij}$, for $1\le i, j\le n$. Denote by $s_1^{(m)}\ge...\ge s_n^{(m)}$ the singular values of the random matrix $\mathbf W:={n^{-\frac m2}} \mathbf X^m$ and define the empirical distribution of the squared singular values by $$\mathcal F_n^{(m)}(x)=\frac1n\sum_{k=1}^nI_{\{{s_k^{(m)}}^2\le x\}},$$ where $I_{\{B\}}$ denotes the indicator of an event $B$. We prove that under a Lindeberg condition for the fourth moment that the expected spectral distribution $F_n^{(m)}(x)=\E \mathcal F_n^{(m)}(x)$ converges to the distribution function $G^{(m)}(x)$ defined by its moments $$\alpha_k(m):=\int_{\mathbb R}x^k\,d\,G(x)=\frac {1}{mk+1}\binom{km+k}{k}.$$
• ### Asymptotic distribution of singular values of powers of random matrices(1002.4442)

Feb. 24, 2010 math.PR
Let $x$ be a complex random variable such that ${\E {x}=0}$, ${\E |x|^2=1}$, ${\E |x|^{4} < \infty}$. Let $x_{ij}$, $i,j \in \{1,2,...\}$ be independet copies of $x$. Let ${\Xb=(N^{-1/2}x_{ij})}$, $1\leq i,j \leq N$ be a random matrix. Writing $\Xb^*$ for the adjoint matrix of $\Xb$, consider the product $\Xb^m{\Xb^*}^m$ with some $m \in \{1,2,...\}$. The matrix $\Xb^m{\Xb^*}^m$ is Hermitian positive semi-definite. Let $\lambda_1,\lambda_2,...,\lambda_N$ be eigenvalues of $\Xb^m{\Xb^*}^m$ (or squared singular values of the matrix $\Xb^m$). In this paper we find the asymptotic distribution function $G^{(m)}(x)=\lim_{N\to\infty}\E{F_N^{(m)}(x)}$ of the empirical distribution function ${F_N^{(m)}(x)} = N^{-1} \sum_{k=1}^N {\mathbb{I}{\{\lambda_k \leq x\}}},$ where $\mathbb{I} \{A\}$ stands for the indicator function of event $A$. The moments of $G^{(m)}$ satisfy $M^{(m)}_p=\int_{\mathbb{R}}{x^p dG^{(m)}(x)}=\frac{1}{mp+1}\binom{mp+p}{p}.$ In Free Probability Theory $M^{(m)}_p$ are known as Fuss--Catalan numbers. With $m=1$ our result turns to a well known result of Marchenko--Pastur 1967.