• We develop a machine learning (ML) framework to populate large dark matter-only simulations with baryonic galaxies. Our ML framework takes input halo properties including halo mass, environment, spin, and recent growth history, and outputs central galaxy and halo baryonic properties including stellar mass ($M_*$), star formation rate (SFR), metallicity ($Z$), neutral ($\rm HI$) and molecular ($\rm H_2$) hydrogen mass. We apply this to the MUFASA cosmological hydrodynamic simulation, and show that it recovers the mean trends of output quantities with halo mass highly accurately, including following the sharp drop in SFR and gas in quenched massive galaxies. However, the scatter around the mean relations is under-predicted. Examining galaxies individually, at $z=0$ the stellar mass and metallicity are accurately recovered ($\sigma\lesssim 0.2$~dex), but SFR and $\rm HI$ show larger scatter ($\sigma\gtrsim 0.3$~dex); these values improve somewhat at $z=1,2$. Remarkably, ML quantitatively recovers second parameter trends in galaxy properties, e.g. that galaxies with higher gas content and lower metallicity have higher SFR at a given $M_*$. Testing various ML algorithms, we find that none perform significantly better than the others, nor does ensembling improve performance, likely because none of the algorithms reproduce the large observed scatter around the mean properties. For the random forest algorithm, we find that halo mass and nearby ($\sim 200$~kpc) environment are the most important predictive variables followed by growth history, while halo spin and $\sim$Mpc scale environment are not important. Finally we study the impact of additionally inputting key baryonic properties $M_*$, SFR and $Z$, as would be available e.g. from an equilibrium model, and show that particularly providing the SFR enables $\rm HI$ to be recovered substantially more accurately.
  • The ability to identify sentiment in text, referred to as sentiment analysis, is one which is natural to adult humans. This task is, however, not one which a computer can perform by default. Identifying sentiments in an automated, algorithmic manner will be a useful capability for business and research in their search to understand what consumers think about their products or services and to understand human sociology. Here we propose two new Genetic Algorithms (GAs) for the task of automated text sentiment analysis. The GAs learn whether words occurring in a text corpus are either sentiment or amplifier words, and their corresponding magnitude. Sentiment words, such as 'horrible', add linearly to the final sentiment. Amplifier words in contrast, which are typically adjectives/adverbs like 'very', multiply the sentiment of the following word. This increases, decreases or negates the sentiment of the following word. The sentiment of the full text is then the sum of these terms. This approach grows both a sentiment and amplifier dictionary which can be reused for other purposes and fed into other machine learning algorithms. We report the results of multiple experiments conducted on large Amazon data sets. The results reveal that our proposed approach was able to outperform several public and/or commercial sentiment analysis algorithms.
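The scoring rule described in this abstract (sentiment words add linearly, amplifier words multiply the following word) can be sketched as below. This is only an illustration of the scoring step, not the authors' genetic algorithm: the dictionaries, their values, and the function name `score_text` are hypothetical, and the handling of chained amplifiers and of neutral words after an amplifier is an assumption.

```python
def score_text(tokens, sentiment, amplifier):
    """Sum sentiment values, scaling each by the amplifiers directly before it."""
    total = 0.0
    factor = 1.0  # product of amplifiers waiting for the next word
    for tok in tokens:
        if tok in amplifier:
            # Assumption: consecutive amplifiers chain multiplicatively.
            factor *= amplifier[tok]
        elif tok in sentiment:
            total += factor * sentiment[tok]
            factor = 1.0
        else:
            # Assumption: a neutral word (sentiment 0) absorbs the pending amplifier.
            factor = 1.0
    return total


# Hypothetical dictionaries; in the abstract these values are learned by the GA.
SENTIMENT = {"horrible": -2.0, "good": 1.0}
AMPLIFIER = {"very": 2.0, "not": -1.0}

print(score_text("very horrible".split(), SENTIMENT, AMPLIFIER))
print(score_text("not very good".split(), SENTIMENT, AMPLIFIER))
```

In this sketch a negative amplifier such as "not" negates the following word, while a value between 0 and 1 would decrease it, matching the three roles the abstract lists.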
  • Deep neural networks continue to show improved performance with increasing depth, an encouraging trend that implies an explosion in the possible permutations of network architectures and hyperparameters for which there is little intuitive guidance. To address this increasing complexity, we propose Evolutionary DEep Networks (EDEN), a computationally efficient neuro-evolutionary algorithm which interfaces to any deep neural network platform, such as TensorFlow. We show that EDEN evolves simple yet successful architectures built from embedding, 1D and 2D convolutional, max pooling and fully connected layers along with their hyperparameters. Evaluation of EDEN across seven image and sentiment classification datasets shows that it reliably finds good networks -- and in three cases achieves state-of-the-art results -- even on a single GPU, in just 6-24 hours. Our study provides a first attempt at applying neuro-evolution to the creation of 1D convolutional networks for sentiment analysis including the optimisation of the embedding layer.
  • Can textual data be compressed intelligently without losing accuracy in evaluating sentiment? In this study, we propose a novel evolutionary compression algorithm, PARSEC (PARts-of-Speech for sEntiment Compression), which makes use of Parts-of-Speech tags to compress text in a way that sacrifices minimal classification accuracy when used in conjunction with sentiment analysis algorithms. An analysis of PARSEC with eight commercial and non-commercial sentiment analysis algorithms on twelve English sentiment data sets reveals that accurate compression is possible with (0%, 1.3%, 3.3%) loss in sentiment classification accuracy for (20%, 50%, 75%) data compression with PARSEC using LingPipe, the most accurate of the sentiment algorithms. Other sentiment analysis algorithms are more severely affected by compression. We conclude that significant compression of text data is possible for sentiment analysis depending on the accuracy demands of the specific application and the specific sentiment analysis algorithm used.
  • We discuss the ground-breaking science that will be possible with a wide area survey, using the MeerKAT telescope, known as MeerKLASS (MeerKAT Large Area Synoptic Survey). The current specifications of MeerKAT make it a great fit for science applications that require large survey speeds but not necessarily high angular resolutions. In particular, for cosmology, a large survey over $\sim 4,000 \, {\rm deg}^2$ for $\sim 4,000$ hours will potentially provide the first ever measurements of the baryon acoustic oscillations using the 21cm intensity mapping technique, with enough accuracy to impose constraints on the nature of dark energy. The combination with multi-wavelength data will give unique additional information, such as exquisite constraints on primordial non-Gaussianity using the multi-tracer technique, as well as a better handle on foregrounds and systematics. Such a wide survey with MeerKAT is also a great match for HI galaxy studies, providing unrivalled statistics in the pre-SKA era for galaxies resolved in the HI emission line beyond local structures at z > 0.01. It will also produce a large continuum galaxy sample down to a depth of about 5\,$\mu$Jy in L-band, which is quite unique over such large areas and will allow studies of the large-scale structure of the Universe out to high redshifts, complementing the galaxy HI survey to form a transformational multi-wavelength approach to study galaxy dynamics and evolution. Finally, the same survey will supply unique information for a range of other science applications, including a large statistical investigation of galaxy clusters as well as produce a rotation measure map across a huge swathe of the sky. The MeerKLASS survey will be a crucial step on the road to using SKA1-MID for cosmological applications and other commensal surveys, as described in the top priority SKA key science projects (abridged).
  • Regression or classification? This is perhaps the most basic question faced when tackling a new supervised learning problem. We present an Evolutionary Deep Learning (EDL) algorithm that automatically solves this by identifying the question type with high accuracy, along with a proposed deep architecture. Typically, a significant amount of human insight and preparation is required prior to executing machine learning algorithms. For example, when creating deep neural networks, the number of parameters must be selected in advance and furthermore, a lot of these choices are made based upon pre-existing knowledge of the data such as the use of a categorical cross entropy loss function. Humans are able to study a dataset and decide whether it represents a classification or a regression problem, and consequently make decisions which will be applied to the execution of the neural network. We propose the Automated Problem Identification (API) algorithm, which uses an evolutionary algorithm interface to TensorFlow to manipulate a deep neural network to decide if a dataset represents a classification or a regression problem. We test API on 16 different classification, regression and sentiment analysis datasets with up to 10,000 features and up to 17,000 unique target values. API achieves an average accuracy of $96.3\%$ in identifying the problem type without hardcoding any insights about the general characteristics of regression or classification problems. For example, API successfully identifies classification problems even with 1000 target values. Furthermore, the algorithm recommends which loss function to use and also recommends a neural network architecture. Our work is therefore a step towards fully automated machine learning.
  • Supernova cosmology without spectra will be the bread and butter mode for future surveys such as LSST. This lack of supernova spectra results in uncertainty in the redshifts which, if ignored, leads to significantly biased estimates of cosmological parameters. Here we present a hierarchical Bayesian formalism -- zBEAMS -- that fully addresses this problem by marginalising over the unknown or contaminated supernova redshifts to produce unbiased cosmological estimates that are competitive with entirely spectroscopic data. zBEAMS provides a unified treatment of both photometric redshifts and host galaxy misidentification (occurring due to chance galaxy alignments or faint hosts), effectively correcting the inevitable contamination in the Hubble diagram. Like its predecessor BEAMS, our formalism also takes care of non-Ia supernova contamination by marginalising over the unknown supernova type. We demonstrate the effectiveness of this technique with simulations of supernovae with photometric redshifts and host galaxy misidentification. A novel feature of the photometric redshift case is the important role played by the redshift distribution of the supernovae.
  • We outline a new method to compute the Bayes Factor for model selection which bypasses the Bayesian Evidence. Our method combines multiple models into a single, nested, Supermodel using one or more hyperparameters. Since the models are now nested, the Bayes Factors between the models can be efficiently computed using the Savage-Dickey Density Ratio (SDDR). In this way model selection becomes a problem of parameter estimation. We consider two ways of constructing the supermodel in detail: one based on combined models, and a second based on combined likelihoods. We report on these two approaches for a Gaussian linear model for which the Bayesian evidence can be calculated analytically and a toy nonlinear problem. Unlike the combined model approach, where a standard Markov Chain Monte Carlo (MCMC) sampler struggles, the combined-likelihood approach fares much better in providing a reliable estimate of the log-Bayes Factor. This scheme potentially opens the way to computationally efficient ways to compute Bayes Factors in high dimensions that exploit the good scaling properties of MCMC, as compared to methods such as nested sampling that fail for high dimensions.
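The Savage-Dickey Density Ratio mentioned in this abstract can be checked on a toy Gaussian problem where the evidence is analytic. The sketch below is not the supermodel construction itself: it assumes a single nested pair (a mean fixed at zero versus a mean with a zero-centred Gaussian prior), and all function and parameter names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sddr_bayes_factor(ybar, n, sigma2, tau2):
    """B01 from the posterior and prior densities of mu, both evaluated at mu = 0."""
    # Conjugate update for mu ~ N(0, tau2) given the mean ybar of n points with variance sigma2.
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    post_mean = post_var * n * ybar / sigma2
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau2)


def evidence_bayes_factor(ybar, n, sigma2, tau2):
    """B01 as a direct ratio of the two analytic evidences for ybar."""
    evidence_0 = normal_pdf(ybar, 0.0, sigma2 / n)          # mu fixed at 0
    evidence_1 = normal_pdf(ybar, 0.0, sigma2 / n + tau2)   # mu marginalised over its prior
    return evidence_0 / evidence_1


print(sddr_bayes_factor(0.3, 20, 1.0, 4.0))
print(evidence_bayes_factor(0.3, 20, 1.0, 4.0))
```

Both functions return the same number, which is the point of the SDDR: for nested models the Bayes Factor needs only the posterior of the extra parameter, something an MCMC chain already provides.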
  • Classifying transients based on multi-band light curves is a challenging but crucial problem in the era of GAIA and LSST since the sheer volume of transients will make spectroscopic classification unfeasible. Here we present a nonparametric classifier that uses the transient's light curve measurements to predict its class given training data. It implements two novel components: the first is the use of the BAGIDIS wavelet methodology - a characterization of functional data using hierarchical wavelet coefficients. The second novelty is the introduction of a ranked probability classifier on the wavelet coefficients that handles both the heteroscedasticity of the data and the potential non-representativity of the training set. The ranked classifier is simple and quick to implement while a major advantage of the BAGIDIS wavelets is that they are translation invariant, hence they do not need the light curves to be aligned to extract features. Further, BAGIDIS is nonparametric so it can be used for blind searches for new objects. We demonstrate the effectiveness of our ranked wavelet classifier against the well-tested Supernova Photometric Classification Challenge dataset in which the challenge is to correctly classify light curves as Type Ia or non-Ia supernovae. We train our ranked probability classifier on the spectroscopically-confirmed subsample (which is not representative) and show that it gives good results for all supernovae with observed light curve timespans greater than 100 days (roughly 55% of the dataset). For such data, we obtain a Ia efficiency of 80.5% and a purity of 82.4% yielding a highly competitive score of 0.49 whilst implementing a truly "model-blind" approach to supernova classification. Consequently this approach may be particularly suitable for the classification of astronomical transients in the era of large synoptic sky surveys.
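The quoted score of 0.49 can be reproduced from the efficiency and purity under one assumption not stated in the abstract: that the score is the challenge's figure of merit, taken here to be the efficiency multiplied by a pseudo-purity in which each false positive carries a penalty weight of 3. The function names below are hypothetical.

```python
def pseudo_purity(purity, false_weight=3.0):
    """Purity recomputed with each false positive counted false_weight times."""
    false_per_true = (1.0 - purity) / purity  # N_false / N_true among selected objects
    return 1.0 / (1.0 + false_weight * false_per_true)


def figure_of_merit(efficiency, purity, false_weight=3.0):
    """Efficiency times pseudo-purity (assumed form of the challenge score)."""
    return efficiency * pseudo_purity(purity, false_weight)


print(round(figure_of_merit(0.805, 0.824), 2))  # prints 0.49
```

With a penalty weight of 1 the same function returns efficiency times purity (about 0.66), so the lower quoted score reflects the extra cost assigned to contamination.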
  • There are two redshifts in cosmology: $z_{obs}$, the observed redshift computed via spectral lines, and the model redshift, $z$, defined by the effective FLRW scale factor. In general these do not coincide. We place observational constraints on the allowed distortions of $z$ away from $z_{obs}$ - a possibility we dub redshift remapping. Remapping is degenerate with cosmic dynamics for either $d_L(z)$ or $H(z)$ observations alone: for example, the simple remapping $z = \alpha_1 z_{obs} +\alpha_2 z_{obs}^2$ allows a decelerating Einstein de Sitter universe to fit the observed supernova Hubble diagram as successfully as $\Lambda$CDM, highlighting that supernova data alone cannot prove that the universe is accelerating. We show however, that redshift remapping leads to apparent violations of cosmic distance duality that can be used to detect its presence even when neither a specific theory of gravity nor the Copernican Principle are assumed. Combining current data sets favours acceleration but does not yet rule out redshift remapping as an alternative to dark energy. Future surveys, however, will provide exquisite constraints on remapping and any models -- such as backreaction -- that predict it.
  • Radio interferometers suffer from the problem of missing information in their data, due to the gaps between the antennas. This results in artifacts, such as bright rings around sources, in the images obtained. Multiple deconvolution algorithms have been proposed to solve this problem and produce cleaner radio images. However, these algorithms are unable to correctly estimate uncertainties in derived scientific parameters or to always include the effects of instrumental errors. We propose an alternative technique called Bayesian Inference for Radio Observations (BIRO) which uses a Bayesian statistical framework to determine the scientific parameters and instrumental errors simultaneously directly from the raw data, without making an image. We use a simple simulation of Westerbork Synthesis Radio Telescope data including pointing errors and beam parameters as instrumental effects, to demonstrate the use of BIRO.
  • New telescopes like the Square Kilometre Array (SKA) will push into a new sensitivity regime and expose systematics, such as direction-dependent effects, that could previously be ignored. Current methods for handling such systematics rely on alternating best estimates of instrumental calibration and models of the underlying sky, which can lead to inadequate uncertainty estimates and biased results because any correlations between parameters are ignored. These deconvolution algorithms produce a single image that is assumed to be a true representation of the sky, when in fact it is just one realization of an infinite ensemble of images compatible with the noise in the data. In contrast, here we report a Bayesian formalism that simultaneously infers both systematics and science. Our technique, Bayesian Inference for Radio Observations (BIRO), determines all parameters directly from the raw data, bypassing image-making entirely, by sampling from the joint posterior probability distribution. This enables it to derive both correlations and accurate uncertainties, making use of the flexible software MEQTREES to model the sky and telescope simultaneously. We demonstrate BIRO with two simulated Westerbork Synthesis Radio Telescope data sets. In the first, we perform joint estimates of 103 scientific (flux densities of sources) and instrumental (pointing errors, beamwidth and noise) parameters. In the second example, we perform source separation with BIRO. Using the Bayesian evidence, we can accurately select between a single point source, two point sources and an extended Gaussian source, allowing for 'super-resolution' on scales much smaller than the synthesized beam.
  • For future surveys, spectroscopic follow-up for all supernovae will be extremely difficult. However, one can use light curve fitters to obtain the probability that an object is a Type Ia. One may consider applying a probability cut to the data, but we show that the resulting non-Ia contamination can lead to biases in the estimation of cosmological parameters. A different method, which allows the use of the full dataset and results in unbiased cosmological parameter estimation, is Bayesian Estimation Applied to Multiple Species (BEAMS). BEAMS is a Bayesian approach to the problem which includes the uncertainty in the types in the evaluation of the posterior. Here we outline the theory of BEAMS and demonstrate its effectiveness using both simulated datasets and SDSS-II data. We also show that it is possible to use BEAMS if the data are correlated, by introducing a numerical marginalisation over the types of the objects. This is largely a pedagogical introduction to BEAMS with references to the main BEAMS papers.
  • New supernova surveys such as the Dark Energy Survey, Pan-STARRS and the LSST will produce an unprecedented number of photometric supernova candidates, most with no spectroscopic data. Avoiding biases in cosmological parameters due to the resulting inevitable contamination from non-Ia supernovae can be achieved with the BEAMS formalism, allowing for fully photometric supernova cosmology studies. Here we extend BEAMS to deal with the case in which the supernovae are correlated by systematic uncertainties. The analytical form of the full BEAMS posterior requires evaluating 2^N terms, where N is the number of supernova candidates. This `exponential catastrophe' is computationally unfeasible even for N of order 100. We circumvent the exponential catastrophe by marginalising numerically instead of analytically over the possible supernova types: we augment the cosmological parameters with nuisance parameters describing the covariance matrix and the types of all the supernovae, \tau_i, that we include in our MCMC analysis. We show that this method deals well even with large, unknown systematic uncertainties without a major increase in computational time, whereas ignoring the correlations can lead to significant biases and incorrect credible contours. We then compare the numerical marginalisation technique with a perturbative expansion of the posterior based on the insight that future surveys will have exquisite light curves and hence the probability that a given candidate is a Type Ia will be close to unity or zero, for most objects. Although this perturbative approach changes computation of the posterior from a 2^N problem into an N^2 or N^3 one, we show that it leads to biases in general through a small number of misclassifications, implying that numerical marginalisation is superior.
  • Using a sample of 608 Type Ia supernovae from the SDSS-II and BOSS surveys, combined with a sample of foreground galaxies from SDSS-II, we estimate the weak lensing convergence for each supernova line-of-sight. We find that the correlation between this measurement and the Hubble residuals is consistent with the prediction from lensing (at a significance of 1.7sigma). Strong correlations are also found between the residuals and supernova nuisance parameters after a linear correction is applied. When these other correlations are taken into account, the lensing signal is detected at 1.4sigma. We show for the first time that distance estimates from supernovae can be improved when lensing is incorporated by including a new parameter in the SALT2 methodology for determining distance moduli. The recovered value of the new parameter is consistent with the lensing prediction. Using CMB data from WMAP7, H0 data from HST and SDSS BAO measurements, we find the best-fit value of the new lensing parameter and show that the central values and uncertainties on Omega_m and w are unaffected. The lensing of supernovae, while only seen at marginal significance in this low redshift sample, will be of vital importance for the next generation of surveys, such as DES and LSST, which will be systematics dominated.
  • We introduce Bayesian Estimation Applied to Multiple Species (BEAMS), an algorithm designed to deal with parameter estimation when using contaminated data. We present the algorithm and demonstrate how it works with the help of a Gaussian simulation. We then apply it to supernova data from the Sloan Digital Sky Survey (SDSS), showing how the resulting confidence contours of the cosmological parameters shrink significantly.
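A minimal sketch of the idea on a Gaussian toy problem follows, assuming uncorrelated data and the mixture form of the likelihood, in which each point contributes P_i times the likelihood of the wanted population plus (1 - P_i) times that of the contaminant. It is not the authors' simulation: the contaminant mean is held fixed, the type probabilities are assumed to be calibrated, and every name and number is hypothetical.

```python
import math
import random


def gaussian(x, mean, sigma=1.0):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def simulate(n, mu_a, mu_b, seed=1):
    """Draw (value, p) pairs, where p is the calibrated probability of being type A."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        p = rng.random()
        is_a = rng.random() < p  # the true type follows the quoted probability
        data.append((rng.gauss(mu_a if is_a else mu_b, 1.0), p))
    return data


def beams_log_like(mu_a, data, mu_b):
    """Log-likelihood with each point weighted by its type probability."""
    return sum(math.log(p * gaussian(d, mu_a) + (1.0 - p) * gaussian(d, mu_b))
               for d, p in data)


def beams_estimate(data, mu_b):
    """Grid search for the type-A mean that maximises the mixture likelihood."""
    grid = [i / 100.0 for i in range(-100, 201)]
    return max(grid, key=lambda mu: beams_log_like(mu, data, mu_b))


def cut_estimate(data, threshold=0.5):
    """Naive alternative: keep points with p above a threshold and average them."""
    kept = [d for d, p in data if p > threshold]
    return sum(kept) / len(kept)


data = simulate(1000, mu_a=0.0, mu_b=3.0)
print(beams_estimate(data, mu_b=3.0), cut_estimate(data))
```

In this toy setup the probability cut is pulled towards the contaminant mean because about a quarter of the retained points are contaminants, whereas the mixture likelihood uses every point and stays close to the true value of zero.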
  • The Fisher Matrix is the backbone of modern cosmological forecasting. We describe the Fisher4Cast software: a general-purpose, easy-to-use, Fisher Matrix framework. It is open source, rigorously designed and tested and includes a Graphical User Interface (GUI) with automated LaTeX file creation capability and point-and-click Fisher ellipse generation. Fisher4Cast was designed for ease of extension and, although written in Matlab, is easily portable to open-source alternatives such as Octave and Scilab. Here we use Fisher4Cast to present new 3-D and 4-D visualisations of the forecasting landscape and to investigate the effects of growth and curvature on future cosmological surveys. Early releases have been available at http://www.cosmology.org.za since May 2008 with 750 downloads in the first year. Version 2.2 is made public with this paper and includes a Quick Start guide and the code used to produce the figures in this paper, in the hope that it will be useful to the cosmology and wider scientific communities.
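For readers unfamiliar with the technique, a Fisher Matrix forecast can be illustrated on a two-parameter toy model. The sketch below is written in Python rather than Matlab and does not use or reproduce the Fisher4Cast interface: it assumes a straight-line model y = a + b x with independent Gaussian errors, and all function names are hypothetical.

```python
import math


def fisher_matrix(xs, sigma):
    """Fisher matrix for y = a + b*x with independent Gaussian errors of width sigma."""
    f = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        grad = (1.0, x)  # derivatives of the model with respect to (a, b)
        for i in range(2):
            for j in range(2):
                f[i][j] += grad[i] * grad[j] / sigma ** 2
    return f


def marginal_errors(f):
    """Forecast 1-sigma errors with the other parameter marginalised (from F inverse)."""
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    cov = [[f[1][1] / det, -f[0][1] / det],
           [-f[1][0] / det, f[0][0] / det]]
    return math.sqrt(cov[0][0]), math.sqrt(cov[1][1])


def conditional_errors(f):
    """Forecast 1-sigma errors with the other parameter held fixed."""
    return 1.0 / math.sqrt(f[0][0]), 1.0 / math.sqrt(f[1][1])


F = fisher_matrix([0.0, 1.0, 2.0], sigma=1.0)
print(F)
print(marginal_errors(F))
print(conditional_errors(F))
```

The marginalised errors are larger than the conditional ones because the intercept and slope are correlated; the Fisher ellipses mentioned in the abstract are a graphical view of the same inverse matrix.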
  • We show that in cases of marginal detections (~ 3\sigma), such as that of Baryonic Acoustic Oscillations (BAO) in cosmology, the often-used Gaussian approximation to the full likelihood is very poor, especially beyond ~3\sigma. This can radically alter confidence intervals on parameters and implies that one cannot naively extrapolate 1\sigma-errorbars to 3\sigma and beyond. We propose a simple fitting formula which corrects for this effect in posterior probabilities arising from marginal detections. Alternatively the full likelihood should be used for parameter estimation rather than the Gaussian approximation of just a mean and an error.
  • Future photometric supernova surveys will produce vastly more candidates than can be followed up spectroscopically, highlighting the need for effective classification methods based on lightcurves alone. Here we introduce boosting and kernel density estimation techniques which have minimal astrophysical input, and compare their performance on 20,000 simulated Dark Energy Survey lightcurves. We demonstrate that these methods are comparable to the best template fitting methods currently used, and in particular do not require the redshift of the host galaxy or candidate. However both methods require a training sample that is representative of the full population, so typical spectroscopic supernova subsamples will lead to poor performance. To enable the full potential of such blind methods, we recommend that representative training samples should be used and so specific attention should be given to their creation in the design phase of future photometric surveys.
  • Baryon Acoustic Oscillations (BAO) are frozen relics left over from the pre-decoupling universe. They are the standard rulers of choice for 21st century cosmology, providing distance estimates that are, for the first time, firmly rooted in well-understood, linear physics. This review synthesises current understanding regarding all aspects of BAO cosmology, from the theoretical and statistical to the observational, and includes a map of the future landscape of BAO surveys, both spectroscopic and photometric.
  • We extend our study of the optimization of large baryon acoustic oscillation (BAO) surveys to return the best constraints on the dark energy, building on Paper I of this series (Parkinson et al. 2007). The survey galaxies are assumed to be pre-selected active, star-forming galaxies observed by their line emission with a constant number density across the redshift bin. Star-forming galaxies have a redshift desert in the region 1.6 < z < 2, and so this redshift range was excluded from the analysis. We use the Seo & Eisenstein (2007) fitting formula for the accuracies of the BAO measurements, using only the information for the oscillatory part of the power spectrum as distance and expansion rate rulers. We go beyond our earlier analysis by examining the effect of including curvature on the optimal survey configuration and updating the expected `prior' constraints from Planck and SDSS. We once again find that the optimal survey strategy involves minimizing the exposure time and maximizing the survey area (within the instrumental constraints), and that all time should be spent observing in the low-redshift range (z<1.6) rather than beyond the redshift desert, z>2. We find that when assuming a flat universe the optimal survey makes measurements in the redshift range 0.1 < z <0.7, but that including curvature as a nuisance parameter requires us to push the maximum redshift to 1.35, to remove the degeneracy between curvature and evolving dark energy. The inclusion of expected other data sets (such as WiggleZ, BOSS and a stage III SN-Ia survey) removes the necessity of measurements below redshift 0.9, and pushes the maximum redshift up to 1.5. We discuss considerations in determining the best survey strategy in light of uncertainty in the true underlying cosmological model.
  • This is the Users' Manual for the Fisher Matrix software Fisher4Cast and covers installation, GUI help, command line basics, code flow and data structure, as well as cosmological applications and extensions. Finally we discuss the extensive tests performed on the software.
  • We highlight the unexpected impact of nucleosynthesis and other early universe constraints on the detectability of tracking quintessence dynamics at late times, showing that such dynamics may well be invisible until the unveiling of the Stage-IV dark energy experiments (DUNE, JDEM, LSST, SKA). Nucleosynthesis forces |w'(0)| < 0.2 for the models we consider and strongly limits potential deviations from LCDM. Surprisingly, the standard CPL parametrisation, w(z) = w_0 + w_a z/(1+z), cannot match the nucleosynthesis bound for minimally coupled tracking scalar fields. Given that such models are arguably the best-motivated alternatives to a cosmological constant these results may significantly impact future cosmological survey design and imply that dark energy may well be dynamical even if we do not detect any dynamics in the next decade.
  • The recent discovery of apparent cosmic acceleration has highlighted the depth of our ignorance of the fundamental properties of nature. It is commonly assumed that the explanation for acceleration must come from a new form of energy dominating the cosmos - dark energy - or a modification of Einstein's theory of Relativity. It is often overlooked, however, that a currently viable alternative explanation of the data is radial inhomogeneity which alters the Hubble diagram without any acceleration. This explanation is often ignored for two reasons: radial inhomogeneity significantly complicates analysis and predictions, and so the full details have not been investigated; and it is a philosophically highly controversial idea, revoking as it does the long-held Copernican Principle. To date, there has not been a general way of determining the validity of the Copernican Principle -- that we live at a typical position in the universe -- significantly weakening the foundations of cosmology as a scientific endeavour. Here we present an observational test for the Copernican assumption which can be automatically implemented while we search for dark energy in the coming decade. Our test is entirely independent of any model for dark energy or theory of gravity and thereby represents a model-independent test of the Copernican Principle.
  • We show that the assumption of a flat universe induces critically large errors in reconstructing the dark energy equation of state at z>~0.9 even if the true cosmic curvature is very small, O(1%) or less. The spuriously reconstructed w(z) shows a range of unusual behaviour, including crossing of the phantom divide and mimicking of standard tracking quintessence models. For 1% curvature and LCDM, the error in w grows rapidly above z~0.9 reaching (50%,100%) by redshifts of (2.5,2.9) respectively, due to the long cosmological lever arm. Interestingly, the w(z) reconstructed from distance data and Hubble rate measurements have opposite trends due to the asymmetric influence of the curved geodesics. These results show that including curvature as a free parameter is imperative in any future analyses attempting to pin down the dynamics of dark energy, especially at moderate or high redshifts.