• We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an $\varepsilon$-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question:Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions, that may be applicable to many other problems.
  • We investigate the problem of learning Bayesian networks in a robust model where an $\epsilon$-fraction of the samples are adversarially corrupted. In this work, we study the fully observable discrete case where the structure of the network is given. Even in this basic setting, previous learning algorithms either run in exponential time or lose dimension-dependent factors in their error guarantees. We provide the first computationally efficient robust learning algorithm for this problem with dimension-independent error guarantees. Our algorithm has near-optimal sample complexity, runs in polynomial time, and achieves error that scales nearly-linearly with the fraction of adversarially corrupted samples. Finally, we show on both synthetic and semi-synthetic data that our algorithm performs well in practice.
  • Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.
  • In high dimensions, most machine learning methods are brittle to even a small fraction of structured outliers. To address this, we introduce a new meta-algorithm that can take in a base learner such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. Our method, Sever, possesses strong theoretical guarantees yet is also highly scalable -- beyond running the base learner itself, it only requires computing the top singular vector of a certain $n \times d$ matrix. We apply Sever on a drug design dataset and a spam classification dataset, and find that in both cases it has substantially greater robustness than several baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$ test error, compared to $13.4\%-20.5\%$ for the baselines, and $3\%$ error on the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$ corruptions, we achieved $1.42$ mean-squared error test error, compared to $1.51$-$2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.
  • We study the problem of learning multivariate log-concave densities with respect to a global loss function. We obtain the first upper bound on the sample complexity of the maximum likelihood estimator (MLE) for a log-concave density on $\mathbb{R}^d$, for all $d \geq 4$. Prior to this work, no finite sample upper bound was known for this estimator in more than $3$ dimensions. In more detail, we prove that for any $d \geq 1$ and $\epsilon>0$, given $\tilde{O}_d((1/\epsilon)^{(d+3)/2})$ samples drawn from an unknown log-concave density $f_0$ on $\mathbb{R}^d$, the MLE outputs a hypothesis $h$ that with high probability is $\epsilon$-close to $f_0$, in squared Hellinger loss. A sample complexity lower bound of $\Omega_d((1/\epsilon)^{(d+1)/2})$ was previously known for any learning algorithm that achieves this guarantee. We thus establish that the sample complexity of the log-concave MLE is near-optimal, up to an $\tilde{O}(1/\epsilon)$ factor.
  • We study the problem of generalized uniformity testing \cite{BC17} of a discrete probability distribution: Given samples from a probability distribution $p$ over an {\em unknown} discrete domain $\mathbf{\Omega}$, we want to distinguish, with probability at least $2/3$, between the case that $p$ is uniform on some {\em subset} of $\mathbf{\Omega}$ versus $\epsilon$-far, in total variation distance, from any such uniform distribution. We establish tight bounds on the sample complexity of generalized uniformity testing. In more detail, we present a computationally efficient tester whose sample complexity is optimal, up to constant factors, and a matching information-theoretic lower bound. Specifically, we show that the sample complexity of generalized uniformity testing is $\Theta\left(1/(\epsilon^{4/3}\|p\|_3) + 1/(\epsilon^{2} \|p\|_2) \right)$.
  • We study the general problem of testing whether an unknown distribution belongs to a specified family of distributions. More specifically, given a distribution family $\mathcal{P}$ and sample access to an unknown discrete distribution $\mathbf{P}$, we want to distinguish (with high probability) between the case that $\mathbf{P} \in \mathcal{P}$ and the case that $\mathbf{P}$ is $\epsilon$-far, in total variation distance, from every distribution in $\mathcal{P}$. This is the prototypical hypothesis testing problem that has received significant attention in statistics and, more recently, in theoretical computer science. The sample complexity of this general inference task depends on the underlying family $\mathcal{P}$. The gold standard in distribution property testing is to design sample-optimal and computationally efficient algorithms for this task. The main contribution of this work is a simple and general testing technique that is applicable to all distribution families whose Fourier spectrum satisfies a certain approximate sparsity property. To the best of our knowledge, ours is the first use of the Fourier transform in the context of distribution testing. We apply our Fourier-based framework to obtain near sample-optimal and computationally efficient testers for the following fundamental distribution families: Sums of Independent Integer Random Variables (SIIRVs), Poisson Multinomial Distributions (PMDs), and Discrete Log-Concave Distributions. For the first two, ours are the first non-trivial testers in the literature, vastly generalizing previous work on testing Poisson Binomial Distributions. For the third, our tester improves on prior work in both sample and time complexity.
  • We study the efficient learnability of geometric concept classes - specifically, low-degree polynomial threshold functions (PTFs) and intersections of halfspaces - when a fraction of the data is adversarially corrupted. We give the first polynomial-time PAC learning algorithms for these concept classes with dimension-independent error guarantees in the presence of nasty noise under the Gaussian distribution. In the nasty noise model, an omniscient adversary can arbitrarily corrupt a small fraction of both the unlabeled data points and their labels. This model generalizes well-studied noise models, including the malicious noise model and the agnostic (adversarial label noise) model. Prior to our work, the only concept class for which efficient malicious learning algorithms were known was the class of origin-centered halfspaces. Specifically, our robust learning algorithm for low-degree PTFs succeeds under a number of tame distributions -- including the Gaussian distribution and, more generally, any log-concave distribution with (approximately) known low-degree moments. For LTFs under the Gaussian distribution, we give a polynomial-time algorithm that achieves error $O(\epsilon)$, where $\epsilon$ is the noise rate. At the core of our PAC learning results is an efficient algorithm to approximate the low-degree Chow-parameters of any bounded function in the presence of nasty noise. To achieve this, we employ an iterative spectral method for outlier detection and removal, inspired by recent work in robust unsupervised learning. Our aforementioned algorithm succeeds for a range of distributions satisfying mild concentration bounds and moment assumptions. The correctness of our robust learning algorithm for intersections of halfspaces makes essential use of a novel robust inverse independence lemma that may be of broader interest.
  • We study the problem of estimating multivariate log-concave probability density functions. We prove the first sample complexity upper bound for learning log-concave densities on $\mathbb{R}^d$, for all $d \geq 1$. Prior to our work, no upper bound on the sample complexity of this learning problem was known for the case of $d>3$. In more detail, we give an estimator that, for any $d \ge 1$ and $\epsilon>0$, draws $\tilde{O}_d \left( (1/\epsilon)^{(d+5)/2} \right)$ samples from an unknown target log-concave density on $\mathbb{R}^d$, and outputs a hypothesis that (with high probability) is $\epsilon$-close to the target, in total variation distance. Our upper bound on the sample complexity comes close to the known lower bound of $\Omega_d \left( (1/\epsilon)^{(d+1)/2} \right)$ for this problem.
  • We describe a general technique that yields the first {\em Statistical Query lower bounds} for a range of fundamental high-dimensional learning problems involving Gaussian distributions. Our main results are for the problems of (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown Gaussian distribution. For each of these problems, we show a {\em super-polynomial gap} between the (information-theoretic) sample complexity and the computational complexity of {\em any} Statistical Query algorithm for the problem. Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in~\cite{DiakonikolasKKLMS16} is essentially best possible among all polynomial-time SQ algorithms. Our SQ lower bounds are attained via a unified moment-matching technique that is useful in other contexts and may be of broader interest. Our technique yields nearly-tight lower bounds for a number of related unsupervised estimation problems. Specifically, for the problems of (3) robust covariance estimation in spectral norm, and (4) robust sparse mean estimation, we establish a quadratic {\em statistical--computational tradeoff} for SQ algorithms, matching known upper bounds. Finally, our technique can be used to obtain tight sample complexity lower bounds for high-dimensional {\em testing} problems. Specifically, for the classical problem of robustly {\em testing} an unknown mean (known covariance) Gaussian, our technique implies an information-theoretic sample lower bound that scales {\em linearly} in the dimension. Our sample lower bound matches the sample complexity of the corresponding robust {\em learning} problem and separates the sample complexity of robust testing from standard (non-robust) testing.
  • We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise -- where an $\varepsilon$-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error $O(\varepsilon)$ in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$ and $1/\epsilon$. When both the mean and covariance are unknown, the running time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings.
  • This work initiates a systematic investigation of testing {\em high-dimensional} structured distributions by focusing on testing {\em Bayesian networks} -- the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other non-descendant nodes once its parents are fixed. Specifically, we study the properties of identity testing and closeness testing of Bayesian networks. Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. For a wide range of parameter settings, our testing algorithms have sample complexity {\em sublinear} in the dimension and are sample-optimal, up to constant factors.
  • We investigate the complexity of computing approximate Nash equilibria in anonymous games. Our main algorithmic result is the following: For any $n$-player anonymous game with a bounded number of strategies and any constant $\delta>0$, an $O(1/n^{1-\delta})$-approximate Nash equilibrium can be computed in polynomial time. Complementing this positive result, we show that if there exists any constant $\delta>0$ such that an $O(1/n^{1+\delta})$-approximate equilibrium can be computed in polynomial time, then there is a fully polynomial-time approximation scheme for this problem. We also present a faster algorithm that, for any $n$-player $k$-strategy anonymous game, runs in time $\tilde O((n+k) k n^k)$ and computes an $\tilde O(n^{-1/3} k^{11/3})$-approximate equilibrium. This algorithm follows from the existence of simple approximate equilibria of anonymous games, where each player plays one strategy with probability $1-\delta$, for some small $\delta$, and plays uniformly at random with probability $\delta$. Our approach exploits the connection between Nash equilibria in anonymous games and Poisson multinomial distributions (PMDs). Specifically, we prove a new probabilistic lemma establishing the following: Two PMDs, with large variance in each direction, whose first few moments are approximately matching are close in total variation distance. Our structural result strengthens previous work by providing a smooth tradeoff between the variance bound and the number of matching moments.
  • An $(n, k)$-Poisson Multinomial Distribution (PMD) is a random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent random vectors supported on the set of standard basis vectors in $\mathbb{R}^k.$ In this paper, we obtain a refined structural understanding of PMDs by analyzing their Fourier transform. As our core structural result, we prove that the Fourier transform of PMDs is {\em approximately sparse}, i.e., roughly speaking, its $L_1$-norm is small outside a small set. By building on this result, we obtain the following applications: {\bf Learning Theory.} We design the first computationally efficient learning algorithm for PMDs with respect to the total variation distance. Our algorithm learns an arbitrary $(n, k)$-PMD within variation distance $\epsilon$ using a near-optimal sample size of $\widetilde{O}_k(1/\epsilon^2),$ and runs in time $\widetilde{O}_k(1/\epsilon^2) \cdot \log n.$ Previously, no algorithm with a $\mathrm{poly}(1/\epsilon)$ runtime was known, even for $k=3.$ {\bf Game Theory.} We give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games. For normalized anonymous games with $n$ players and $k$ strategies, our algorithm computes a well-supported $\epsilon$-Nash equilibrium in time $n^{O(k^3)} \cdot (k/\epsilon)^{O(k^3\log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}.$ The best previous algorithm for this problem had running time $n^{(f(k)/\epsilon)^k},$ where $f(k) = \Omega(k^{k^2})$, for any $k>2.$ {\bf Statistics.} We prove a multivariate central limit theorem (CLT) that relates an arbitrary PMD to a discretized multivariate Gaussian with the same mean and covariance, in total variation distance. Our new CLT strengthens the CLT of Valiant and Valiant by completely removing the dependence on $n$ in the error bound.
  • We study the {\em robust proper learning} of univariate log-concave distributions (over continuous and discrete domains). Given a set of samples drawn from an unknown target distribution, we want to compute a log-concave hypothesis distribution that is as close as possible to the target, in total variation distance. In this work, we give the first computationally efficient algorithm for this learning problem. Our algorithm achieves the information-theoretically optimal sample size (up to a constant factor), runs in polynomial time, and is robust to model misspecification with nearly-optimal error guarantees. Specifically, we give an algorithm that, on input $n=O(1/\eps^{5/2})$ samples from an unknown distribution $f$, runs in time $\widetilde{O}(n^{8/5})$, and outputs a log-concave hypothesis $h$ that (with high probability) satisfies $\dtv(h, f) = O(\opt)+\eps$, where $\opt$ is the minimum total variation distance between $f$ and the class of log-concave distributions. Our approach to the robust proper learning problem is quite flexible and may be applicable to many other univariate distribution families.
  • We give polynomial time algorithms for quantitative (and qualitative) reachability analysis for Branching Markov Decision Processes (BMDPs). Specifically, given a BMDP, and given an initial population, where the objective of the controller is to maximize (or minimize) the probability of eventually reaching a population that contains an object of a desired (or undesired) type, we give algorithms for approximating the supremum (infimum) reachability probability, within desired precision epsilon > 0, in time polynomial in the encoding size of the BMDP and in log(1/epsilon). We furthermore give P-time algorithms for computing epsilon-optimal strategies for both maximization and minimization of reachability probabilities. We also give P-time algorithms for all associated qualitative analysis problems, namely: deciding whether the optimal (supremum or infimum) reachability probabilities are 0 or 1. Prior to this paper, approximation of optimal reachability probabilities for BMDPs was not even known to be decidable. Our algorithms exploit the following basic fact: we show that for any BMDP, its maximum (minimum) non-reachability probabilities are given by the greatest fixed point (GFP) solution g* in [0,1]^n of a corresponding monotone max (min) Probabilistic Polynomial System of equations (max/min-PPS), x=P(x), which are the Bellman optimality equations for a BMDP with non-reachability objectives. We show how to compute the GFP of max/min PPSs to desired precision in P-time. We also study more general Branching Simple Stochastic Games (BSSGs) with (non-)reachability objectives. We show that: (1) the value of these games is captured by the GFP of a corresponding max-minPPS; (2) the quantitative problem of approximating the value is in TFNP; and (3) the qualitative problems associated with the value are all solvable in P-time.
  • We study the structure and learnability of sums of independent integer random variables (SIIRVs). For $k \in \mathbb{Z}_{+}$, a $k$-SIIRV of order $n \in \mathbb{Z}_{+}$ is the probability distribution of the sum of $n$ independent random variables each supported on $\{0, 1, \dots, k-1\}$. We denote by ${\cal S}_{n,k}$ the set of all $k$-SIIRVs of order $n$. In this paper, we tightly characterize the sample and computational complexity of learning $k$-SIIRVs. More precisely, we design a computationally efficient algorithm that uses $\widetilde{O}(k/\epsilon^2)$ samples, and learns an arbitrary $k$-SIIRV within error $\epsilon,$ in total variation distance. Moreover, we show that the {\em optimal} sample complexity of this learning problem is $\Theta((k/\epsilon^2)\sqrt{\log(1/\epsilon)}).$ Our algorithm proceeds by learning the Fourier transform of the target $k$-SIIRV in its effective support. Its correctness relies on the {\em approximate sparsity} of the Fourier transform of $k$-SIIRVs -- a structural property that we establish, roughly stating that the Fourier transform of $k$-SIIRVs has small magnitude outside a small set. Along the way we prove several new structural results about $k$-SIIRVs. As one of our main structural contributions, we give an efficient algorithm to construct a sparse {\em proper} $\epsilon$-cover for ${\cal S}_{n,k},$ in total variation distance. We also obtain a novel geometric characterization of the space of $k$-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of $\epsilon$-covers for ${\cal S}_{n,k}$, and is the key ingredient in our tight sample complexity lower bound. Our approach of exploiting the sparsity of the Fourier transform in distribution learning is general, and has recently found additional applications.
  • We give an algorithm for properly learning Poisson binomial distributions. A Poisson binomial distribution (PBD) of order $n$ is the discrete probability distribution of the sum of $n$ mutually independent Bernoulli random variables. Given $\widetilde{O}(1/\epsilon^2)$ samples from an unknown PBD $\mathbf{p}$, our algorithm runs in time $(1/\epsilon)^{O(\log \log (1/\epsilon))}$, and outputs a hypothesis PBD that is $\epsilon$-close to $\mathbf{p}$ in total variation distance. The previously best known running time for properly learning PBDs was $(1/\epsilon)^{O(\log(1/\epsilon))}$. As one of our main contributions, we provide a novel structural characterization of PBDs. We prove that, for all $\epsilon >0,$ there exists an explicit collection $\cal{M}$ of $(1/\epsilon)^{O(\log \log (1/\epsilon))}$ vectors of multiplicities, such that for any PBD $\mathbf{p}$ there exists a PBD $\mathbf{q}$ with $O(\log(1/\epsilon))$ distinct parameters whose multiplicities are given by some element of ${\cal M}$, such that $\mathbf{q}$ is $\epsilon$-close to $\mathbf{p}$. Our proof combines tools from Fourier analysis and algebraic geometry. Our approach to the proper learning problem is as follows: Starting with an accurate non-proper hypothesis, we fit a PBD to this hypothesis. More specifically, we essentially start with the hypothesis computed by the computationally efficient non-proper learning algorithm in our recent work~\cite{DKS15}. Our aforementioned structural characterization allows us to reduce the corresponding fitting problem to a collection of $(1/\epsilon)^{O(\log \log(1/\epsilon))}$ systems of low-degree polynomial inequalities. We show that each such system can be solved in time $(1/\epsilon)^{O(\log \log(1/\epsilon))}$, which yields the overall running time of our algorithm.
  • The following two decision problems capture the complexity of comparing integers or rationals that are succinctly represented in product-of-exponentials notation, or equivalently, via arithmetic circuits using only multiplication and division gates, and integer inputs: Input instance: four lists of positive integers: a_1, ...., a_n ; b_1,...., b_n ; c_1,....,c_m ; d_1, ...., d_m ; where each of the integers is represented in binary. Problem 1 (equality testing): Decide whether a_1^{b_1} a_2^{b_2} .... a_n^{b_n} = c_1^{d_1} c_2^{d_2} .... c_m^{d_m} . Problem 2 (inequality testing): Decide whether a_1^{b_1} a_2^{b_2} ... a_n^{b_n} >= c_1^{d_1} c_2^{d_2} .... c_m^{d_m} . Problem 1 is easily decidable in polynomial time using a simple iterative algorithm. Problem 2 is much harder. We observe that the complexity of Problem 2 is intimately connected to deep conjectures and results in number theory. In particular, if a refined form of the ABC conjecture formulated by Baker in 1998 holds, or if the older Lang-Waldschmidt conjecture (formulated in 1978) on linear forms in logarithms holds, then Problem 2 is decidable in P-time (in the standard Turing model of computation). Moreover, it follows from the best available quantitative bounds on linear forms in logarithms, e.g., by Baker and W\"{u}stholz (1993) or Matveev (2000), that if m and n are fixed universal constants then Problem 2 is decidable in P-time (without relying on any conjectures). This latter fact was observed earlier by Shub (1993). We describe one application: P-time maximum probability parsing for arbitrary stochastic context-free grammars (where \epsilon-rules are allowed).
  • A central computational problem for analyzing and model checking various classes of infinite-state recursive probabilistic systems (including quasi-birth-death processes, multi-type branching processes, stochastic context-free grammars, probabilistic pushdown automata and recursive Markov chains) is the computation of {\em termination probabilities}, and computing these probabilities in turn boils down to computing the {\em least fixed point} (LFP) solution of a corresponding {\em monotone polynomial system} (MPS) of equations, denoted x=P(x). It was shown by Etessami & Yannakakis that a decomposed variant of Newton's method converges monotonically to the LFP solution for any MPS that has a non-negative solution. Subsequently, Esparza, Kiefer, & Luttenberger obtained upper bounds on the convergence rate of Newton's method for certain classes of MPSs. More recently, better upper bounds have been obtained for special classes of MPSs. However, prior to this paper, for arbitrary (not necessarily strongly-connected) MPSs, no upper bounds at all were known on the convergence rate of Newton's method as a function of the encoding size |P| of the input MPS, x=P(x). In this paper we provide worst-case upper bounds, as a function of both the input encoding size |P|, and epsilon > 0, on the number of iterations required for decomposed Newton's method (even with rounding) to converge within additive error epsilon > 0 of q^*, for any MPS with LFP solution q^*. Our upper bounds are essentially optimal in terms of several important parameters. Using our upper bounds, and building on prior work, we obtain the first P-time algorithm (in the standard Turing model of computation) for quantitative model checking, to within desired precision, of discrete-time QBDs and (equivalently) probabilistic 1-counter automata, with respect to any (fixed) omega-regular or LTL property.
  • We study the problem of computing the probability that a given stochastic context-free grammar (SCFG), G, generates a string in a given regular language L(D) (given by a DFA, D). This basic problem has a number of applications in statistical natural language processing, and it is also a key necessary step towards quantitative \omega-regular model checking of stochastic context-free processes (equivalently, 1-exit recursive Markov chains, or stateless probabilistic pushdown processes). We show that the probability that G generates a string in L(D) can be computed to within arbitrary desired precision in polynomial time (in the standard Turing model of computation), under a rather mild assumption about the SCFG, G, and with no extra assumption about D. We show that this assumption is satisfied for SCFG's whose rule probabilities are learned via the well-known inside-outside (EM) algorithm for maximum-likelihood estimation (a standard method for constructing SCFGs in statistical NLP and biological sequence analysis). Thus, for these SCFGs the algorithm always runs in P-time.
  • We show that one can approximate the least fixed point solution for a multivariate system of monotone probabilistic polynomial equations in time polynomial in both the encoding size of the system of equations and in log(1/\epsilon), where \epsilon > 0 is the desired additive error bound of the solution. (The model of computation is the standard Turing machine model.) We use this result to resolve several open problems regarding the computational complexity of computing key quantities associated with some classic and heavily studied stochastic processes, including multi-type branching processes and stochastic context-free grammars.
  • We show that one can approximate the least fixed point solution for a multivariate system of monotone probabilistic max(min) polynomial equations, referred to as maxPPSs (and minPPSs, respectively), in time polynomial in both the encoding size of the system of equations and in log(1/epsilon), where epsilon > 0 is the desired additive error bound of the solution. (The model of computation is the standard Turing machine model.) We establish this result using a generalization of Newton's method which applies to maxPPSs and minPPSs, even though the underlying functions are only piecewise-differentiable. This generalizes our recent work which provided a P-time algorithm for purely probabilistic PPSs. These equations form the Bellman optimality equations for several important classes of infinite-state Markov Decision Processes (MDPs). Thus, as a corollary, we obtain the first polynomial time algorithms for computing to within arbitrary desired precision the optimal value vector for several classes of infinite-state MDPs which arise as extensions of classic, and heavily studied, purely stochastic processes. These include both the problem of maximizing and mininizing the termination (extinction) probability of multi-type branching MDPs, stochastic context-free MDPs, and 1-exit Recursive MDPs. Furthermore, we also show that we can compute in P-time an epsilon-optimal policy for both maximizing and minimizing branching, context-free, and 1-exit-Recursive MDPs, for any given desired epsilon > 0. This is despite the fact that actually computing optimal strategies is Sqrt-Sum-hard and PosSLP-hard in this setting. We also derive, as an easy consequence of these results, an FNP upper bound on the complexity of computing the value (within arbitrary desired precision) of branching simple stochastic games (BSSGs).