
We develop a machine learning (ML) framework to populate large dark
matter-only simulations with baryonic galaxies. Our ML framework takes input
halo properties including halo mass, environment, spin, and recent growth
history, and outputs central galaxy and halo baryonic properties including
stellar mass ($M_*$), star formation rate (SFR), metallicity ($Z$), neutral
($\rm HI$) and molecular ($\rm H_2$) hydrogen mass. We apply this to the MUFASA
cosmological hydrodynamic simulation, and show that it recovers the mean trends
of output quantities with halo mass highly accurately, including following the
sharp drop in SFR and gas in quenched massive galaxies. However, the scatter
around the mean relations is underpredicted. Examining galaxies individually,
at $z=0$ the stellar mass and metallicity are accurately recovered
($\sigma\lesssim 0.2$~dex), but SFR and $\rm HI$ show larger scatter
($\sigma\gtrsim 0.3$~dex); these values improve somewhat at $z=1,2$.
Remarkably, ML quantitatively recovers second parameter trends in galaxy
properties, e.g. that galaxies with higher gas content and lower metallicity
have higher SFR at a given $M_*$. Testing various ML algorithms, we find that
none perform significantly better than the others, nor does ensembling improve
performance, likely because none of the algorithms reproduce the large observed
scatter around the mean properties. For the random forest algorithm, we find
that halo mass and nearby ($\sim 200$~kpc) environment are the most important
predictive variables followed by growth history, while halo spin and $\sim$Mpc
scale environment are not important. Finally we study the impact of
additionally inputting key baryonic properties $M_*$, SFR and $Z$, as would be
available e.g. from an equilibrium model, and show that particularly providing
the SFR enables $\rm HI$ to be recovered substantially more accurately.

The ability to identify sentiment in text, referred to as sentiment analysis,
is one which is natural to adult humans. This task is, however, not one which a
computer can perform by default. Identifying sentiments in an automated,
algorithmic manner will be a useful capability for business and research in
their search to understand what consumers think about their products or
services and to understand human sociology. Here we propose two new Genetic
Algorithms (GAs) for the task of automated text sentiment analysis. The GAs
learn whether words occurring in a text corpus are either sentiment or
amplifier words, and their corresponding magnitude. Sentiment words, such as
'horrible', add linearly to the final sentiment. In contrast, amplifier words,
which are typically adjectives/adverbs like 'very', multiply the sentiment of
the following word, thereby increasing, decreasing or negating it. The
sentiment of the full text is then the sum of these terms.
This approach grows both a sentiment and amplifier dictionary which can be
reused for other purposes and fed into other machine learning algorithms. We
report the results of multiple experiments conducted on large Amazon data sets.
The results reveal that our proposed approach was able to outperform several
public and/or commercial sentiment analysis algorithms.
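
The scoring rule described above (sentiment words add, amplifier words multiply the following word) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary values are hypothetical, the GA that learns them is omitted, and letting consecutive amplifiers stack is an assumption.

```python
from typing import Dict, List


def score_text(tokens: List[str],
               sentiment: Dict[str, float],
               amplifier: Dict[str, float]) -> float:
    """Sum sentiment values, each scaled by the amplifiers directly before it."""
    total = 0.0
    multiplier = 1.0
    for tok in tokens:
        if tok in amplifier:
            # Assumption: consecutive amplifiers stack multiplicatively.
            multiplier *= amplifier[tok]
        elif tok in sentiment:
            total += multiplier * sentiment[tok]
            multiplier = 1.0
        else:
            # A neutral word breaks the amplifier chain.
            multiplier = 1.0
    return total


# Hypothetical dictionaries; in the paper these would be learned by the GA.
sentiment_words = {"horrible": -2.0, "good": 1.0}
amplifier_words = {"very": 2.0, "not": -1.0}

score_a = score_text(["very", "horrible"], sentiment_words, amplifier_words)
score_b = score_text(["not", "good"], sentiment_words, amplifier_words)
score_c = score_text(["good", "but", "very", "horrible"],
                     sentiment_words, amplifier_words)
```

An amplifier with a negative value, such as the hypothetical 'not' above, negates the following sentiment word, and one with magnitude below one would dampen it.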

Deep neural networks continue to show improved performance with increasing
depth, an encouraging trend that implies an explosion in the possible
permutations of network architectures and hyperparameters for which there is
little intuitive guidance. To address this increasing complexity, we propose
Evolutionary DEep Networks (EDEN), a computationally efficient
neuroevolutionary algorithm which interfaces to any deep neural network
platform, such as TensorFlow. We show that EDEN evolves simple yet successful
architectures built from embedding, 1D and 2D convolutional, max pooling and
fully connected layers along with their hyperparameters. Evaluation of EDEN
across seven image and sentiment classification datasets shows that it reliably
finds good networks, and in three cases achieves state-of-the-art results,
even on a single GPU, in just 6-24 hours. Our study provides a first attempt at
applying neuroevolution to the creation of 1D convolutional networks for
sentiment analysis including the optimisation of the embedding layer.

Can textual data be compressed intelligently without losing accuracy in
evaluating sentiment? In this study, we propose a novel evolutionary
compression algorithm, PARSEC (PARts-of-Speech for sEntiment Compression),
which makes use of Parts-of-Speech tags to compress text in a way that
sacrifices minimal classification accuracy when used in conjunction with
sentiment analysis algorithms. An analysis of PARSEC with eight commercial and
non-commercial sentiment analysis algorithms on twelve English sentiment data
sets reveals that accurate compression is possible with (0%, 1.3%, 3.3%) loss
in sentiment classification accuracy for (20%, 50%, 75%) data compression with
PARSEC using LingPipe, the most accurate of the sentiment algorithms. Other
sentiment analysis algorithms are more severely affected by compression. We
conclude that significant compression of text data is possible for sentiment
analysis depending on the accuracy demands of the specific application and the
specific sentiment analysis algorithm used.

We discuss the groundbreaking science that will be possible with a wide area
survey, using the MeerKAT telescope, known as MeerKLASS (MeerKAT Large Area
Synoptic Survey). The current specifications of MeerKAT make it a great fit for
science applications that require large survey speeds but not necessarily high
angular resolutions. In particular, for cosmology, a large survey over $\sim
4,000 \, {\rm deg}^2$ for $\sim 4,000$ hours will potentially provide the first
ever measurements of the baryon acoustic oscillations using the 21cm intensity
mapping technique, with enough accuracy to impose constraints on the nature of
dark energy. The combination with multiwavelength data will give unique
additional information, such as exquisite constraints on primordial
non-Gaussianity using the multi-tracer technique, as well as a better handle on
foregrounds and systematics. Such a wide survey with MeerKAT is also a great
match for HI galaxy studies, providing unrivalled statistics in the pre-SKA era
for galaxies resolved in the HI emission line beyond local structures at z >
0.01. It will also produce a large continuum galaxy sample down to a depth of
about 5\,$\mu$Jy in L-band, which is unique over such large areas and
will allow studies of the large-scale structure of the Universe out to high
redshifts, complementing the galaxy HI survey to form a transformational
multiwavelength approach to study galaxy dynamics and evolution. Finally, the
same survey will supply unique information for a range of other science
applications, including a large statistical investigation of galaxy clusters, and
will produce a rotation measure map across a huge swathe of the sky. The
MeerKLASS survey will be a crucial step on the road to using SKA1-MID for
cosmological applications and other commensal surveys, as described in the top
priority SKA key science projects (abridged).

Regression or classification? This is perhaps the most basic question faced
when tackling a new supervised learning problem. We present an Evolutionary
Deep Learning (EDL) algorithm that automatically solves this by identifying the
question type with high accuracy, along with a proposed deep architecture.
Typically, a significant amount of human insight and preparation is required
prior to executing machine learning algorithms. For example, when creating deep
neural networks, the number of parameters must be selected in advance, and
many of these choices, such as the use of a categorical cross-entropy loss
function, are based upon pre-existing knowledge of the data.
Humans are able to study a dataset and decide whether it represents a
classification or a regression problem, and consequently make decisions which
will be applied to the execution of the neural network. We propose the
Automated Problem Identification (API) algorithm, which uses an evolutionary
algorithm interface to TensorFlow to manipulate a deep neural network to decide
if a dataset represents a classification or a regression problem. We test API
on 16 different classification, regression and sentiment analysis datasets with
up to 10,000 features and up to 17,000 unique target values. API achieves an
average accuracy of $96.3\%$ in identifying the problem type without hardcoding
any insights about the general characteristics of regression or classification
problems. For example, API successfully identifies classification problems even
with 1000 target values. Furthermore, the algorithm recommends which loss
function to use and also recommends a neural network architecture. Our work is
therefore a step towards fully automated machine learning.

Supernova cosmology without spectra will be the bread-and-butter mode for
future surveys such as LSST. This lack of supernova spectra results in
uncertainty in the redshifts which, if ignored, leads to significantly biased
estimates of cosmological parameters. Here we present a hierarchical Bayesian
formalism, zBEAMS, that fully addresses this problem by marginalising over
the unknown or contaminated supernova redshifts to produce unbiased
cosmological estimates that are competitive with entirely spectroscopic data.
zBEAMS provides a unified treatment of both photometric redshifts and host
galaxy misidentification (occurring due to chance galaxy alignments or faint
hosts), effectively correcting the inevitable contamination in the Hubble
diagram. Like its predecessor BEAMS, our formalism also takes care of non-Ia
supernova contamination by marginalising over the unknown supernova type. We
demonstrate the effectiveness of this technique with simulations of supernovae
with photometric redshifts and host galaxy misidentification. A novel feature
of the photometric redshift case is the important role played by the redshift
distribution of the supernovae.

We outline a new method to compute the Bayes Factor for model selection which
bypasses the Bayesian Evidence. Our method combines multiple models into a
single, nested, Supermodel using one or more hyperparameters. Since the models
are now nested the Bayes Factors between the models can be efficiently computed
using the Savage-Dickey Density Ratio (SDDR). In this way model selection
becomes a problem of parameter estimation. We consider two ways of constructing
the supermodel in detail: one based on combined models, and a second based on
combined likelihoods. We report on these two approaches for a Gaussian linear
model for which the Bayesian evidence can be calculated analytically and a toy
nonlinear problem. Unlike the combined model approach, where a standard Markov
Chain Monte Carlo (MCMC) struggles, the combined-likelihood approach fares
much better in providing a reliable estimate of the log-Bayes Factor. This
scheme potentially opens the way to computationally efficient ways to compute
Bayes Factors in high dimensions that exploit the good scaling properties of
MCMC, as compared to methods such as nested sampling that fail for high
dimensions.
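
As a minimal sketch of the Savage-Dickey Density Ratio itself (not of the supermodel construction), consider a toy nested pair that is an assumption of this example: data with known variance `sigma2`, a Gaussian prior of variance `tau2` on the mean, and the simpler model fixing the mean to zero. Here both the SDDR and the direct evidence ratio are analytic, so they can be compared.

```python
import math


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sddr_bayes_factor(xbar: float, n: int, sigma2: float, tau2: float) -> float:
    """B_01 = posterior(mu=0) / prior(mu=0) for the toy Gaussian-mean model."""
    v = sigma2 / n                      # variance of the sample mean
    post_var = 1.0 / (1.0 / v + 1.0 / tau2)
    post_mean = post_var * xbar / v
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau2)


def evidence_ratio(xbar: float, n: int, sigma2: float, tau2: float) -> float:
    """B_01 from the analytic evidences, which depend on the data via xbar only."""
    v = sigma2 / n
    return normal_pdf(xbar, 0.0, v) / normal_pdf(xbar, 0.0, v + tau2)


# Hypothetical numbers chosen only for illustration.
b_sddr = sddr_bayes_factor(xbar=0.3, n=20, sigma2=1.0, tau2=4.0)
b_direct = evidence_ratio(xbar=0.3, n=20, sigma2=1.0, tau2=4.0)
```

In a realistic problem the posterior density at the nested point would be estimated from MCMC samples of the hyperparameter rather than evaluated analytically.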

Classifying transients based on multi-band light curves is a challenging but
crucial problem in the era of GAIA and LSST since the sheer volume of
transients will make spectroscopic classification unfeasible. Here we present a
nonparametric classifier that uses the transient's light curve measurements to
predict its class given training data. It implements two novel components: the
first is the use of the BAGIDIS wavelet methodology, a characterization of
functional data using hierarchical wavelet coefficients. The second novelty is
the introduction of a ranked probability classifier on the wavelet coefficients
that handles both the heteroscedasticity of the data and the
potential non-representativity of the training set. The ranked classifier is
simple and quick to implement while a major advantage of the BAGIDIS wavelets
is that they are translation invariant, hence they do not need the light curves
to be aligned to extract features. Further, BAGIDIS is nonparametric so it can
be used for blind searches for new objects. We demonstrate the effectiveness of
our ranked wavelet classifier against the well-tested Supernova Photometric
Classification Challenge dataset in which the challenge is to correctly
classify light curves as Type Ia or non-Ia supernovae. We train our ranked
probability classifier on the spectroscopically-confirmed subsample (which is
not representative) and show that it gives good results for all supernovae with
observed light curve timespans greater than 100 days (roughly 55% of the
dataset). For such data, we obtain a Ia efficiency of 80.5% and a purity of
82.4% yielding a highly competitive score of 0.49 whilst implementing a truly
"model-blind" approach to supernova classification. Consequently this approach
may be particularly suitable for the classification of astronomical transients
in the era of large synoptic sky surveys.
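
The quoted efficiency, purity and score can be related with a short sketch. It assumes the score is the challenge figure of merit in the form efficiency times a pseudo-purity in which false positives are penalised by a weight of 3. That form is an assumption here, although it does reproduce the quoted 0.49. The counts below are hypothetical and chosen only to match the quoted percentages.

```python
def efficiency(n_true_ia: int, n_total_ia: int) -> float:
    """Fraction of all real Type Ia that were classified as Ia."""
    return n_true_ia / n_total_ia


def purity(n_true_ia: int, n_false_ia: int, false_weight: float = 1.0) -> float:
    """Fraction of objects classified as Ia that really are Ia.

    With false_weight > 1 this becomes a pseudo-purity that penalises
    false positives more heavily.
    """
    return n_true_ia / (n_true_ia + false_weight * n_false_ia)


def figure_of_merit(n_true_ia: int, n_false_ia: int, n_total_ia: int,
                    false_weight: float = 3.0) -> float:
    """Efficiency times pseudo-purity (assumed form of the challenge score)."""
    return efficiency(n_true_ia, n_total_ia) * purity(n_true_ia, n_false_ia, false_weight)


# Hypothetical counts: 1000 real Ia, 805 recovered, 172 non-Ia misclassified as Ia.
eff = efficiency(805, 1000)
pur = purity(805, 172)
fom = figure_of_merit(805, 172, 1000)
```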

There are two redshifts in cosmology: $z_{obs}$, the observed redshift
computed via spectral lines, and the model redshift, $z$, defined by the
effective FLRW scale factor. In general these do not coincide. We place
observational constraints on the allowed distortions of $z$ away from $z_{obs}$,
a possibility we dub redshift remapping. Remapping is degenerate with cosmic
dynamics for either $d_L(z)$ or $H(z)$ observations alone: for example, the
simple remapping $z = \alpha_1 z_{obs} +\alpha_2 z_{obs}^2$ allows a
decelerating Einstein-de Sitter universe to fit the observed supernova Hubble
diagram as successfully as $\Lambda$CDM, highlighting that supernova data alone
cannot prove that the universe is accelerating. We show, however, that redshift
remapping leads to apparent violations of cosmic distance duality that can be
used to detect its presence even when neither a specific theory of gravity nor
the Copernican Principle are assumed. Combining current data sets favours
acceleration but does not yet rule out redshift remapping as an alternative to
dark energy. Future surveys, however, will provide exquisite constraints on
remapping and any models, such as backreaction, that predict it.

Radio interferometers suffer from the problem of missing information in their
data, due to the gaps between the antennas. This results in artifacts, such as
bright rings around sources, in the images obtained. Multiple deconvolution
algorithms have been proposed to solve this problem and produce cleaner radio
images. However, these algorithms are unable to correctly estimate
uncertainties in derived scientific parameters or to always include the effects
of instrumental errors. We propose an alternative technique called Bayesian
Inference for Radio Observations (BIRO) which uses a Bayesian statistical
framework to determine the scientific parameters and instrumental errors
simultaneously and directly from the raw data, without making an image. We use a
simple simulation of Westerbork Synthesis Radio Telescope data including
pointing errors and beam parameters as instrumental effects, to demonstrate the
use of BIRO.

New telescopes like the Square Kilometre Array (SKA) will push into a new
sensitivity regime and expose systematics, such as direction-dependent effects,
that could previously be ignored. Current methods for handling such systematics
rely on alternating best estimates of instrumental calibration and models of
the underlying sky, which can lead to inadequate uncertainty estimates and
biased results because any correlations between parameters are ignored. These
deconvolution algorithms produce a single image that is assumed to be a true
representation of the sky, when in fact it is just one realization of an
infinite ensemble of images compatible with the noise in the data. In contrast,
here we report a Bayesian formalism that simultaneously infers both systematics
and science. Our technique, Bayesian Inference for Radio Observations (BIRO),
determines all parameters directly from the raw data, bypassing image-making
entirely, by sampling from the joint posterior probability distribution. This
enables it to derive both correlations and accurate uncertainties, making use
of the flexible software MEQTREES to model the sky and telescope
simultaneously. We demonstrate BIRO with two simulated Westerbork
Synthesis Radio Telescope data sets. In the first, we perform joint estimates
of 103 scientific (flux densities of sources) and instrumental (pointing
errors, beamwidth and noise) parameters. In the second example, we perform
source separation with BIRO. Using the Bayesian evidence, we can accurately
select between a single point source, two point sources and an extended
Gaussian source, allowing for 'super-resolution' on scales much smaller than
the synthesized beam.

For future surveys, spectroscopic follow-up for all supernovae will be
extremely difficult. However, one can use light curve fitters to obtain the
probability that an object is a Type Ia. One may consider applying a
probability cut to the data, but we show that the resulting non-Ia
contamination can lead to biases in the estimation of cosmological parameters.
A different method, which allows the use of the full dataset and results in
unbiased cosmological parameter estimation, is Bayesian Estimation Applied to
Multiple Species (BEAMS). BEAMS is a Bayesian approach to the problem which
includes the uncertainty in the types in the evaluation of the posterior. Here
we outline the theory of BEAMS and demonstrate its effectiveness using both
simulated datasets and SDSS-II data. We also show that it is possible to use
BEAMS if the data are correlated, by introducing a numerical marginalisation
over the types of the objects. This is largely a pedagogical introduction to
BEAMS with references to the main BEAMS papers.
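
To make the idea of including type uncertainty in the posterior concrete, here is a minimal sketch of a BEAMS-style likelihood for uncorrelated data. It assumes a toy setup that is not taken from the text: each datum is drawn from one of two Gaussian populations ("Ia" and "non-Ia"), and `p_ia[i]` is the probability from a light curve fitter that object `i` is a Type Ia. All names and numbers are illustrative.

```python
import math
from typing import Sequence


def normal_pdf(x: float, mean: float, sigma: float) -> float:
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def beams_log_likelihood(data: Sequence[float], p_ia: Sequence[float],
                         mu_ia: float, sigma_ia: float,
                         mu_non: float, sigma_non: float) -> float:
    """Log-likelihood in which each datum is a probability-weighted mixture
    of the Ia and non-Ia populations, so no object is discarded by a cut."""
    total = 0.0
    for d, p in zip(data, p_ia):
        mixture = (p * normal_pdf(d, mu_ia, sigma_ia)
                   + (1.0 - p) * normal_pdf(d, mu_non, sigma_non))
        total += math.log(mixture)
    return total


# Hypothetical data: two likely Ia near 0 and one likely contaminant near 3.
data = [0.1, -0.2, 2.9]
probs = [0.9, 0.9, 0.1]
ll_true = beams_log_likelihood(data, probs, 0.0, 0.5, 3.0, 0.5)
ll_biased = beams_log_likelihood(data, probs, 1.0, 0.5, 3.0, 0.5)
```

In this toy case the likelihood prefers the Ia mean near 0 over a value pulled towards the contaminant, which is the behaviour the mixture is meant to provide.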

New supernova surveys such as the Dark Energy Survey, Pan-STARRS and the LSST
will produce an unprecedented number of photometric supernova candidates, most
with no spectroscopic data. Avoiding biases in cosmological parameters due to
the resulting inevitable contamination from non-Ia supernovae can be achieved
with the BEAMS formalism, allowing for fully photometric supernova cosmology
studies. Here we extend BEAMS to deal with the case in which the supernovae are
correlated by systematic uncertainties. The analytical form of the full BEAMS
posterior requires evaluating 2^N terms, where N is the number of supernova
candidates. This `exponential catastrophe' is computationally unfeasible even
for N of order 100. We circumvent the exponential catastrophe by marginalising
numerically instead of analytically over the possible supernova types: we
augment the cosmological parameters with nuisance parameters describing the
covariance matrix and the types of all the supernovae, $\tau_i$, that we include
in our MCMC analysis. We show that this method deals well even with large,
unknown systematic uncertainties without a major increase in computational
time, whereas ignoring the correlations can lead to significant biases and
incorrect credible contours. We then compare the numerical marginalisation
technique with a perturbative expansion of the posterior based on the insight
that future surveys will have exquisite light curves and hence the probability
that a given candidate is a Type Ia will be close to unity or zero, for most
objects. Although this perturbative approach changes computation of the
posterior from a 2^N problem into an N^2 or N^3 one, we show that it leads to
biases in general through a small number of misclassifications, implying that
numerical marginalisation is superior.
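
The origin of the 2^N terms can be illustrated with a small sketch. It sums a toy likelihood over every possible assignment of types and, for the special case of uncorrelated data, checks that this equals a product of N two-term factors. The two-Gaussian populations and all values are assumptions made for illustration. With correlated data no such factorisation exists, which is why the sum becomes intractable.

```python
import itertools
import math
from typing import Sequence


def normal_pdf(x: float, mean: float, sigma: float) -> float:
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def brute_force_likelihood(data: Sequence[float], p_ia: Sequence[float],
                           mu_ia: float, mu_non: float, sigma: float):
    """Sum over all 2^N type vectors; returns (likelihood, number of terms)."""
    total = 0.0
    n_terms = 0
    for types in itertools.product((0, 1), repeat=len(data)):
        term = 1.0
        for d, p, t in zip(data, p_ia, types):
            if t == 1:   # object assumed to be a Type Ia
                term *= p * normal_pdf(d, mu_ia, sigma)
            else:        # object assumed to be a non-Ia
                term *= (1.0 - p) * normal_pdf(d, mu_non, sigma)
        total += term
        n_terms += 1
    return total, n_terms


def factorised_likelihood(data: Sequence[float], p_ia: Sequence[float],
                          mu_ia: float, mu_non: float, sigma: float) -> float:
    """Product of N two-term factors, valid only for uncorrelated data."""
    result = 1.0
    for d, p in zip(data, p_ia):
        result *= (p * normal_pdf(d, mu_ia, sigma)
                   + (1.0 - p) * normal_pdf(d, mu_non, sigma))
    return result


# Hypothetical data for N = 8 candidates.
data = [0.1, -0.3, 0.2, 2.8, 0.0, 3.1, -0.1, 0.4]
probs = [0.9, 0.8, 0.95, 0.2, 0.9, 0.1, 0.85, 0.7]
brute, n_terms = brute_force_likelihood(data, probs, 0.0, 3.0, 0.5)
fact = factorised_likelihood(data, probs, 0.0, 3.0, 0.5)
```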

Using a sample of 608 Type Ia supernovae from the SDSS-II and BOSS surveys,
combined with a sample of foreground galaxies from SDSS-II, we estimate the
weak lensing convergence for each supernova line-of-sight. We find that the
correlation between this measurement and the Hubble residuals is consistent
with the prediction from lensing (at a significance of 1.7sigma). Strong
correlations are also found between the residuals and supernova nuisance
parameters after a linear correction is applied. When these other correlations
are taken into account, the lensing signal is detected at 1.4sigma. We show for
the first time that distance estimates from supernovae can be improved when
lensing is incorporated by including a new parameter in the SALT2 methodology
for determining distance moduli. The recovered value of the new parameter is
consistent with the lensing prediction. Using CMB data from WMAP7, H0 data from
HST and SDSS BAO measurements, we find the best-fit value of the new lensing
parameter and show that the central values and uncertainties on Omega_m and w
are unaffected. The lensing of supernovae, while only seen at marginal
significance in this low redshift sample, will be of vital importance for the
next generation of surveys, such as DES and LSST, which will be systematics
dominated.

We introduce Bayesian Estimation Applied to Multiple Species (BEAMS), an
algorithm designed to deal with parameter estimation when using contaminated
data. We present the algorithm and demonstrate how it works with the help of a
Gaussian simulation. We then apply it to supernova data from the Sloan Digital
Sky Survey (SDSS), showing how the resulting confidence contours of the
cosmological parameters shrink significantly.

The Fisher Matrix is the backbone of modern cosmological forecasting. We
describe the Fisher4Cast software: a general-purpose, easy-to-use, Fisher
Matrix framework. It is open source, rigorously designed and tested and
includes a Graphical User Interface (GUI) with automated LaTeX file creation
capability and point-and-click Fisher ellipse generation. Fisher4Cast was
designed for ease of extension and, although written in Matlab, is easily
portable to open-source alternatives such as Octave and Scilab. Here we use
Fisher4Cast to present new 3D and 4D visualisations of the forecasting
landscape and to investigate the effects of growth and curvature on future
cosmological surveys. Early releases have been available at
http://www.cosmology.org.za since May 2008 with 750 downloads in the first
year. Version 2.2 is made public with this paper and includes a Quick Start
guide and the code used to produce the figures in this paper, in the hope that
it will be useful to the cosmology and wider scientific communities.
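
For readers unfamiliar with the technique, the following is a minimal sketch of a Fisher Matrix forecast. It is not the Fisher4Cast API, which is Matlab-based. It assumes independent Gaussian errors that do not depend on the parameters, so that F_ij is the sum over data points of (df/dtheta_i)(df/dtheta_j)/sigma^2, and uses a hypothetical straight-line model in place of a cosmological observable.

```python
import math
from typing import Callable, List, Sequence


def fisher_matrix(model: Callable[[float, Sequence[float]], float],
                  params: Sequence[float], xs: Sequence[float],
                  sigmas: Sequence[float], step: float = 1e-4) -> List[List[float]]:
    """Fisher matrix from central finite-difference derivatives of the model."""
    n = len(params)
    derivs = []
    for i in range(n):
        up, down = list(params), list(params)
        up[i] += step
        down[i] -= step
        derivs.append([(model(x, up) - model(x, down)) / (2.0 * step) for x in xs])
    return [[sum(derivs[i][k] * derivs[j][k] / sigmas[k] ** 2
                 for k in range(len(xs)))
             for j in range(n)] for i in range(n)]


def marginalised_errors_2x2(f: List[List[float]]):
    """Square roots of the diagonal of the inverse of a 2x2 Fisher matrix."""
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    return math.sqrt(f[1][1] / det), math.sqrt(f[0][0] / det)


def line(x: float, theta: Sequence[float]) -> float:
    """Hypothetical observable: intercept theta[0] plus slope theta[1] times x."""
    return theta[0] + theta[1] * x


# Three measurements with unit errors at x = 0, 1, 2.
F = fisher_matrix(line, [1.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 1.0])
err_intercept, err_slope = marginalised_errors_2x2(F)
```

The marginalised errors come from the inverse Fisher matrix, whereas `1 / sqrt(F[i][i])` would give the error on one parameter with the other held fixed.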

We show that in cases of marginal detections (~ 3\sigma), such as that of
Baryonic Acoustic Oscillations (BAO) in cosmology, the often-used Gaussian
approximation to the full likelihood is very poor, especially beyond ~3\sigma.
This can radically alter confidence intervals on parameters and implies that
one cannot naively extrapolate 1\sigma error bars to 3\sigma and beyond. We
propose a simple fitting formula which corrects for this effect in posterior
probabilities arising from marginal detections. Alternatively the full
likelihood should be used for parameter estimation rather than the Gaussian
approximation of just a mean and an error.

Future photometric supernova surveys will produce vastly more candidates than
can be followed up spectroscopically, highlighting the need for effective
classification methods based on lightcurves alone. Here we introduce boosting
and kernel density estimation techniques which have minimal astrophysical
input, and compare their performance on 20,000 simulated Dark Energy Survey
lightcurves. We demonstrate that these methods are comparable to the best
template fitting methods currently used, and in particular do not require the
redshift of the host galaxy or candidate. However both methods require a
training sample that is representative of the full population, so typical
spectroscopic supernova subsamples will lead to poor performance. To enable the
full potential of such blind methods, we recommend that representative training
samples should be used and so specific attention should be given to their
creation in the design phase of future photometric surveys.

Baryon Acoustic Oscillations (BAO) are frozen relics left over from the
pre-decoupling universe. They are the standard rulers of choice for 21st
century cosmology, providing distance estimates that are, for the first time,
firmly rooted in well-understood, linear physics. This review synthesises
current understanding regarding all aspects of BAO cosmology, from the
theoretical and statistical to the observational, and includes a map of the
future landscape of BAO surveys, both spectroscopic and photometric.

We extend our study of the optimization of large baryon acoustic oscillation
(BAO) surveys to return the best constraints on the dark energy, building on
Paper I of this series (Parkinson et al. 2007). The survey galaxies are assumed
to be pre-selected active, star-forming galaxies observed by their line
emission with a constant number density across the redshift bin. Star-forming
galaxies have a redshift desert in the region 1.6 < z < 2, and so this redshift
range was excluded from the analysis. We use the Seo & Eisenstein (2007)
fitting formula for the accuracies of the BAO measurements, using only the
information for the oscillatory part of the power spectrum as distance and
expansion rate rulers. We go beyond our earlier analysis by examining the
effect of including curvature on the optimal survey configuration and updating
the expected `prior' constraints from Planck and SDSS. We once again find that
the optimal survey strategy involves minimizing the exposure time and
maximizing the survey area (within the instrumental constraints), and that all
time should be spent observing in the low-redshift range (z<1.6) rather than
beyond the redshift desert, z>2. We find that when assuming a flat universe the
optimal survey makes measurements in the redshift range 0.1 < z <0.7, but that
including curvature as a nuisance parameter requires us to push the maximum
redshift to 1.35, to remove the degeneracy between curvature and evolving dark
energy. The inclusion of expected other data sets (such as WiggleZ, BOSS and a
stage III SNIa survey) removes the necessity of measurements below redshift
0.9, and pushes the maximum redshift up to 1.5. We discuss considerations in
determining the best survey strategy in light of uncertainty in the true
underlying cosmological model.

This is the Users' Manual for the Fisher Matrix software Fisher4Cast and
covers installation, GUI help, command line basics, code flow and data
structure, as well as cosmological applications and extensions. Finally we
discuss the extensive tests performed on the software.

We highlight the unexpected impact of nucleosynthesis and other early
universe constraints on the detectability of tracking quintessence dynamics at
late times, showing that such dynamics may well be invisible until the
unveiling of the Stage-IV dark energy experiments (DUNE, JDEM, LSST, SKA).
Nucleosynthesis forces w'(0) < 0.2 for the models we consider and strongly
limits potential deviations from LCDM. Surprisingly, the standard CPL
parametrisation, w(z) = w_0 + w_a z/(1+z), cannot match the nucleosynthesis
bound for minimally coupled tracking scalar fields. Given that such models are
arguably the best-motivated alternatives to a cosmological constant, these
results may significantly impact future cosmological survey design and imply
that dark energy may well be dynamical even if we do not detect any dynamics in
the next decade.

The recent discovery of apparent cosmic acceleration has highlighted the
depth of our ignorance of the fundamental properties of nature. It is commonly
assumed that the explanation for acceleration must come from a new form of
energy dominating the cosmos (dark energy) or a modification of Einstein's
theory of Relativity. It is often overlooked, however, that a currently viable
alternative explanation of the data is radial inhomogeneity which alters the
Hubble diagram without any acceleration. This explanation is often ignored for
two reasons: radial inhomogeneity significantly complicates analysis and
predictions, and so the full details have not been investigated; and it is a
philosophically highly controversial idea, revoking as it does the long-held
Copernican Principle. To date, there has not been a general way of determining
the validity of the Copernican Principle (that we live at a typical position
in the universe), significantly weakening the foundations of cosmology as a
scientific endeavour. Here we present an observational test for the Copernican
assumption which can be automatically implemented while we search for dark
energy in the coming decade. Our test is entirely independent of any model for
dark energy or theory of gravity and thereby represents a model-independent
test of the Copernican Principle.

We show that the assumption of a flat universe induces critically large
errors in reconstructing the dark energy equation of state at z>~0.9 even if
the true cosmic curvature is very small, O(1%) or less. The spuriously
reconstructed w(z) shows a range of unusual behaviour, including crossing of
the phantom divide and mimicking of standard tracking quintessence models. For
1% curvature and LCDM, the error in w grows rapidly above z~0.9 reaching
(50%,100%) by redshifts of (2.5,2.9) respectively, due to the long cosmological
lever arm. Interestingly, the w(z) reconstructed from distance data and Hubble
rate measurements have opposite trends due to the asymmetric influence of the
curved geodesics. These results show that including curvature as a free
parameter is imperative in any future analyses attempting to pin down the
dynamics of dark energy, especially at moderate or high redshifts.