-
The explosion of data in recent years has generated an increasing need for
new analysis techniques in order to extract knowledge from massive datasets.
Machine learning has proved particularly useful for this task. Fully
automated methods have recently gained great popularity, even though those
methods often lack physical interpretability. In contrast, feature-based
approaches can provide both well-performing models and understandable
causalities with respect to the correlations found between features and
physical processes. Efficient feature selection is an essential tool to boost
the performance of machine learning models. In this work, we propose a forward
selection method in order to compute, evaluate, and characterize better
performing features for regression and classification problems. Given the
importance of photometric redshift estimation, we adopt it as our case study.
We synthetically created 4,520 features by combining magnitudes, errors, radii,
and ellipticities of quasars, taken from the SDSS. We apply a forward selection
process, a recursive method in which a huge number of feature sets is tested
through a kNN algorithm, leading to a tree of feature sets. The branches of the
tree are then used to perform experiments with the random forest, in order to
validate the best set with an alternative model. We demonstrate that the sets
of features determined with our approach improve the performances of the
regression models significantly when compared to the performance of the classic
features from the literature. The features found are unexpected, being very
different from the classic ones. Therefore, a method to
interpret some of them in a physical context is presented. The
methodology described here is very general and can be used to improve the
performance of machine learning models for any regression or classification
task.
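
The forward selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: it follows a single greedy branch rather than a full tree of feature sets, uses a toy kNN regressor written from scratch, and runs on synthetic data. The function names (`knn_predict`, `knn_rmse`, `forward_selection`), the train/validation split, and the stopping rule are all assumptions made for illustration.

```python
import math
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    nearest = sorted(
        (math.dist(row, x), target) for row, target in zip(train_X, train_y)
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


def knn_rmse(X, y, columns, n_train, k=3):
    """Validation RMSE of a kNN regressor restricted to the given columns."""

    def project(row):
        return [row[c] for c in columns]

    train_X = [project(r) for r in X[:n_train]]
    train_y = y[:n_train]
    errors = [
        (knn_predict(train_X, train_y, project(r), k) - t) ** 2
        for r, t in zip(X[n_train:], y[n_train:])
    ]
    return math.sqrt(sum(errors) / len(errors))


def forward_selection(X, y, n_train, max_features=3, k=3):
    """Greedily add the feature that lowers the validation RMSE the most."""
    selected, best_score = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scores = {c: knn_rmse(X, y, selected + [c], n_train, k) for c in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_score:
            break  # no candidate improves the model: stop growing the set
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


# Toy data (hypothetical): the target depends only on column 0,
# while columns 1-3 are pure noise.
rng = random.Random(0)
X = [[rng.random() for _ in range(4)] for _ in range(200)]
y = [2.0 * row[0] for row in X]
selected, score = forward_selection(X, y, n_train=150)
print(selected, round(score, 4))
```

Keeping the best few candidates at each step, instead of only the single best one, would turn this single branch into a tree of feature sets of the kind described above.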
-
Astronomy has entered the big data era and Machine Learning based methods
have found widespread use in a large variety of astronomical applications. This
is demonstrated by the recent huge increase in the number of publications
making use of this new approach. The usage of machine learning methods, however,
is still far from trivial and many problems still need to be solved. Using the
evaluation of photometric redshifts as a case study, we outline the main
problems and some ongoing efforts to solve them.
-
Within scientific and real life problems, classification is a typical case of
extremely complex tasks in data-driven scenarios, especially if approached with
traditional techniques. Machine Learning supervised and unsupervised paradigms,
providing self-adaptive and semi-automatic methods, are able to navigate into
large volumes of data characterized by a multi-dimensional parameter space,
thus representing an ideal method to disentangle classes of objects in a
reliable and efficient way. In Astrophysics, the identification of candidate
Globular Clusters through deep, wide-field, single band images is one such
case where self-adaptive methods have demonstrated high performance and
reliability. Here we experimented with some variants of the known Neural Gas model,
exploring both supervised and unsupervised paradigms of Machine Learning for
the classification of Globular Clusters. The main scope of this work was to verify
the possibility of improving the computational efficiency of the methods used to solve
complex data-driven problems, by exploiting parallel programming with the GPU
framework. By using the astrophysical playground, the goal was to
scientifically validate this kind of model for further applications extended
to other contexts.
-
In Astrophysics, the identification of candidate Globular Clusters through
deep, wide-field, single band HST images is a typical data analytics problem,
where methods based on Machine Learning have shown high efficiency and
reliability, demonstrating the capability to improve on traditional
approaches. Here we experimented with some variants of the known Neural Gas model,
exploring both supervised and unsupervised paradigms of Machine Learning, on
the classification of Globular Clusters extracted from the NGC1399 HST data.
The main focus of this work was to use a well-tested playground to scientifically
validate this kind of model for further extended experiments in astrophysics,
using other standard Machine Learning methods (for instance Random Forest
and Multi Layer Perceptron neural network) for a comparison of performance in
terms of purity and completeness.
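
As a point of reference, purity and completeness can be computed as in the sketch below. The abstract does not define the two metrics, so the usual conventions are assumed here: purity as the fraction of predicted members that are true members (precision), and completeness as the fraction of true members that are recovered (recall). The function name and the toy labels are hypothetical.

```python
def purity_completeness(true_labels, predicted_labels, positive=1):
    """Return (purity, completeness) for the class labelled `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    purity = tp / (tp + fp) if tp + fp else 0.0
    completeness = tp / (tp + fn) if tp + fn else 0.0
    return purity, completeness


# Toy labels (made up): 1 = Globular Cluster, 0 = other source.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]
purity, completeness = purity_completeness(true_labels, predicted_labels)
print(purity, completeness)  # 0.75 0.75
```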
-
Photometric redshifts (photo-z's) provide an alternative way to estimate the
distances of large samples of galaxies and are therefore crucial to a large
variety of cosmological problems. Among the various methods proposed over the
years, supervised machine learning (ML) methods capable of interpolating the
knowledge gained by means of spectroscopic data have proven to be very
effective. METAPHOR (Machine-learning Estimation Tool for Accurate PHOtometric
Redshifts) is a novel method designed to provide a reliable PDF (Probability
Density Function) of the error distribution of photometric redshifts predicted
by ML methods. The method is implemented as a modular workflow, whose internal
engine for photo-z estimation makes use of the MLPQNA neural network (Multi
Layer Perceptron with Quasi Newton learning rule), with the possibility to
easily replace the specific machine learning model chosen to predict photo-z's.
After a short description of the software, we present a summary of results on
public galaxy data (Sloan Digital Sky Survey - Data Release 9) and a comparison
with a completely different method based on Spectral Energy Distribution (SED)
template fitting.
-
We present METAPHOR (Machine-learning Estimation Tool for Accurate
PHOtometric Redshifts), a method able to provide a reliable PDF for photometric
galaxy redshifts estimated through empirical techniques. METAPHOR is a modular
workflow, mainly based on the MLPQNA neural network as internal engine to
derive photometric galaxy redshifts, but giving the possibility to easily
replace MLPQNA with any other method to predict photo-z's and their PDF. We
present here the results of a validation test of the workflow on the
galaxies from SDSS-DR9, also showing the universality of the method by
replacing MLPQNA with KNN and Random Forest models. The validation test also
includes a comparison with the PDFs derived from a traditional SED template
fitting method (Le Phare).
-
In the current data-driven science era, data analysis
techniques must evolve quickly to cope with data whose size has
increased up to the Petabyte scale. In particular, since modern astrophysics is
based on multi-wavelength data organized into large catalogues, it is crucial
that astronomical catalogue cross-matching methods, whose performance depends
strongly on catalogue size, ensure efficiency, reliability and scalability.
Furthermore, multi-band data are archived and reduced in different ways, so
that the resulting catalogues may differ from each other in format, resolution,
data structure, etc., thus requiring the highest generality of cross-matching
features. We present $C^{3}$ (Command-line Catalogue Cross-match), a
multi-platform application designed to efficiently cross-match massive
catalogues from modern surveys. Conceived as a stand-alone command-line process
or a module within a generic data reduction/analysis pipeline, it provides
maximum flexibility, in terms of portability, configuration, coordinates and
cross-matching types, ensuring high performance capabilities by using a
multi-core parallel processing paradigm and a sky partitioning algorithm.
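
To make the role of sky partitioning concrete, here is a minimal positional cross-match sketch. It is not the $C^{3}$ algorithm, whose partitioning scheme is not described in this abstract: the declination-zone binning, the function names, and the toy coordinates are assumptions for illustration, and the sketch is single-threaded.

```python
import math
from collections import defaultdict


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_deg):
    """For each (ra, dec) in cat_a, find the nearest cat_b source within radius.

    cat_b is partitioned into declination zones of height radius_deg, so
    only three zones have to be scanned for each cat_a source.
    """
    zones = defaultdict(list)
    for j, (_, dec) in enumerate(cat_b):
        zones[math.floor((dec + 90.0) / radius_deg)].append(j)

    matches = []
    for i, (ra, dec) in enumerate(cat_a):
        zone = math.floor((dec + 90.0) / radius_deg)
        best = None
        for z in (zone - 1, zone, zone + 1):
            for j in zones.get(z, ()):
                sep = angular_separation(ra, dec, *cat_b[j])
                if sep <= radius_deg and (best is None or sep < best[1]):
                    best = (j, sep)
        if best is not None:
            matches.append((i, best[0], best[1]))
    return matches


# Toy catalogues (made-up coordinates, in degrees) and a 2 arcsec radius.
cat_a = [(10.0, 20.0), (150.0, -30.0), (200.0, 45.0)]
cat_b = [(10.0002, 20.0001), (150.1, -30.0), (200.0, 45.0002)]
matches = cross_match(cat_a, cat_b, radius_deg=2.0 / 3600.0)
print([(i, j) for i, j, _ in matches])  # [(0, 0), (2, 2)]
```

Because the angular separation between two sources is never smaller than their difference in declination, any counterpart within the radius must lie in the same zone or in one of the two adjacent zones. Independent zones also make it natural to distribute the work over several cores.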
-
In modern galaxy surveys, photometric redshifts play a central role in a
broad range of studies, from gravitational lensing and dark matter distribution
to galaxy evolution. Using a dataset of about 25,000 galaxies from the second
data release of the Kilo Degree Survey (KiDS) we obtain photometric redshifts
with five different methods: (i) Random forest, (ii) Multi Layer Perceptron
with Quasi Newton Algorithm, (iii) Multi Layer Perceptron with an optimization
network based on the Levenberg-Marquardt learning rule, (iv) the Bayesian
Photometric Redshift model (or BPZ) and (v) a classical SED template fitting
procedure (Le Phare). We show how SED fitting techniques can provide useful
information on the galaxy spectral type, which can be used to improve the
capability of machine learning methods by constraining systematic errors and
reducing the occurrence of catastrophic outliers. We use this classification to
train specialized regression estimators, demonstrating that such a hybrid
approach, involving SED fitting and machine learning in a single collaborative
framework, is capable of improving the overall prediction accuracy of photometric
redshifts.
-
Euclid is a Europe-led cosmology space mission dedicated to a visible and
near infrared survey of the entire extra-galactic sky. Its purpose is to deepen
our knowledge of the dark content of our Universe. After an overview of the
Euclid mission and science, this contribution describes how the community is
getting organized to face the data analysis challenges, both in software
development and in operational data processing matters. It ends with a more
specific account of some of the main contributions of the Swiss Science Data
Center (SDC-CH).
-
Photometric redshifts (photo-z's) are fundamental in galaxy surveys to
address different topics, from gravitational lensing and dark matter
distribution to galaxy evolution. The Kilo Degree Survey (KiDS), i.e. the ESO
public survey on the VLT Survey Telescope (VST), provides the unprecedented
opportunity to exploit a large galaxy dataset with an exceptional image quality
and depth in the optical wavebands. Using a KiDS subset of about 25,000
galaxies with measured spectroscopic redshifts, we have derived photo-z's using
i) three different empirical methods based on supervised machine learning, ii)
the Bayesian Photometric Redshift model (or BPZ), and iii) a classical SED
template fitting procedure (Le Phare). We confirm that, in the regions of the
photometric parameter space properly sampled by the spectroscopic templates,
machine learning methods provide better redshift estimates, with a lower
scatter and a smaller fraction of outliers. SED fitting techniques, however,
provide useful information on the galaxy spectral type which can be effectively
used to constrain systematic errors and to better characterize potential
catastrophic outliers. Such classification is then used to specialize the
training of regression machine learning models, demonstrating that a hybrid
approach, involving SED fitting and machine learning in a single collaborative
framework, can be effectively used to improve the accuracy of photo-z
estimates.
-
Modern Astrophysics is based on multi-wavelength data organized into large
and heterogeneous catalogues. Hence, the need for efficient, reliable and
scalable catalogue cross-matching methods plays a crucial role in the era of
the petabyte scale. Furthermore, multi-band data often have very different
angular resolution, requiring the highest generality of cross-matching
features, mainly in terms of region shape and resolution. In this work we
present $C^{3}$ (Command-line Catalogue Cross-match), a multi-platform
application designed to efficiently cross-match massive catalogues. It is based
on a multi-core parallel processing paradigm and conceived to be executed as a
stand-alone command-line process or integrated within any generic data
reduction/analysis pipeline, providing the maximum flexibility to the end-user,
in terms of portability, parameter configuration, catalogue formats, angular
resolution, region shapes, coordinate units and cross-matching types. Using
real data, extracted from public surveys, we discuss the cross-matching
capabilities and computing time efficiency also through a direct comparison
with some publicly available tools, chosen among the most used within the
community, and representative of different interface paradigms. We verified
that the $C^{3}$ tool has excellent capabilities to perform an efficient and
reliable cross-matching between large datasets. Although the elliptical
cross-match and the parametric handling of angular orientation and offset are
known concepts in the astrophysical context, their availability in the
presented command-line tool makes $C^{3}$ competitive in the context of public
astronomical tools.
-
The most valuable asset of a space mission like Euclid is its data. Due to
their huge volume, automatic quality control becomes a crucial aspect over
the entire lifetime of the experiment. Here we focus on the design strategy for
the Science Ground Segment (SGS) Data Quality Common Tools (DQCT), which has
the main role to provide software solutions to gather, evaluate, and record
quality information about the raw and derived data products from a primarily
scientific perspective. The SGS DQCT will provide a quantitative basis for
evaluating the application of reduction and calibration reference data, as well
as diagnostic tools for quality parameters, flags, trend analysis diagrams and
any other metadata parameter produced by the pipeline. In a large programme
like Euclid, it is prohibitively expensive to process large amounts of data at
the pixel level just for the purpose of quality evaluation. Thus, all measures
of quality at the pixel level are implemented in the individual pipeline
stages, and passed along as metadata in the production. In this sense most of
the tasks related to science data quality are delegated to the pipeline stages,
even though the responsibility for science data quality is managed at a higher
level. The DQCT subsystem of the SGS is currently under development, but its
path to full realization will likely be different from that of other
subsystems, primarily because of the high level of parallelism and the
wide pipeline processing redundancy (for instance, the mechanism of a double
Science Data Center for each processing function). The data quality tools
must therefore not only be widely spread over all pipeline segments and data
levels, but also minimize the potential diversity of solutions
implemented for similar functions, ensuring maximum coherency and
standardization for quality evaluation and reporting in the SGS.
-
The emerging need for efficient, reliable and scalable astronomical catalog
cross-matching is becoming more pressing in the current data-driven science
era, where the size of data has rapidly increased up to the Petabyte scale. C3
(Command-line Catalogue Cross-matching) is a multi-platform tool designed to
efficiently cross-match massive catalogues from modern astronomical surveys,
ensuring high-performance capabilities through the use of a multi-core parallel
processing paradigm. The tool has been conceived to be executed as a
stand-alone command-line process or integrated within any generic data
reduction/analysis pipeline, providing the maximum flexibility to the end user,
in terms of parameter configuration, coordinates and cross-matching types. In
this work we present the architecture and the features of the tool. Moreover,
since the modular design of the tool enables an easy customization to specific
use cases and requirements, we also present an example of a customized C3
version designed and used in the FP7 project ViaLactea, dedicated to
cross-correlate Hi-GAL clumps with multi-band compact sources.
-
A variety of fundamental astrophysical science topics require the
determination of very accurate photometric redshifts (photo-z's). A wide
variety of methods have been developed, based either on template model
fitting or on empirical explorations of the photometric parameter space.
Machine learning based techniques are not explicitly dependent on physical
priors and are able to produce accurate photo-z estimates within the photometric
ranges derived from the spectroscopic training set. These estimates, however,
are not easy to characterize in terms of a photo-z Probability Density Function
(PDF), due to the fact that the analytical relation mapping the photometric
parameters onto the redshift space is virtually unknown. We present METAPHOR
(Machine-learning Estimation Tool for Accurate PHOtometric Redshifts), a method
designed to provide a reliable PDF of the error distribution for empirical
techniques. The method is implemented as a modular workflow, whose internal
engine for photo-z estimation makes use of the MLPQNA neural network (Multi
Layer Perceptron with Quasi Newton learning rule), with the possibility to
easily replace the specific machine learning model chosen to predict photo-z's.
We present a summary of results on SDSS-DR9 galaxy data, also used to perform a
direct comparison with PDFs obtained by the Le Phare SED template fitting. We
show that METAPHOR is capable of estimating the precision and reliability of
photometric redshifts obtained with three different self-adaptive techniques,
i.e. MLPQNA, Random Forest and the standard K-Nearest Neighbors models.
-
We present an innovative method called FilExSeC (Filaments Extraction,
Selection and Classification), a data mining tool developed to investigate the
possibility of refining and optimizing the shape reconstruction of filamentary
structures detected with a consolidated method based on the flux derivative
analysis, through the column-density maps computed from Herschel infrared
Galactic Plane Survey (Hi-GAL) observations of the Galactic plane. The present
methodology is based on a feature extraction module followed by a machine
learning model (Random Forest) dedicated to select features and to classify the
pixels of the input images. From tests on both simulations and real
observations the method appears reliable and robust with respect to the
variability of shape and distribution of filaments. In the cases of highly
defined filament structures, the presented method is able to bridge the gaps
among the detected fragments, thus improving their shape reconstruction. From a
preliminary "a posteriori" analysis of derived filament physical parameters,
the method appears potentially able to add a sufficient contribution to
complete and refine the filament reconstruction.
-
The VIALACTEA project has a work package dedicated to Tools and
Infrastructure and, inside it, a task for the Database and Virtual Observatory
Infrastructure. This task aims at providing an infrastructure to store all the
resources needed by the more specifically scientific work packages of the
project itself. This infrastructure includes a combination of storage
facilities, relational databases and web services on top of them, and has
taken, as a whole, the name of VIALACTEA Knowledge Base (VLKB). This
contribution illustrates the current status of this VLKB. It details the set of
data resources put together; describes the database that allows data discovery
through VO inspired metadata maintenance; illustrates the discovery, cutout and
access services built on top of the former two for the users to exploit the
data content.
-
Astronomy is undergoing a methodological revolution triggered by an
unprecedented wealth of complex and accurate data. DAMEWARE (DAta Mining &
Exploration Web Application and REsource) is a general purpose, Web-based,
Virtual Observatory compliant, distributed data mining framework specialized in
the exploration of massive data sets with machine learning methods. DAMEWARE
allows the scientific community to perform data mining and exploratory
experiments on massive data sets using a simple web browser. It offers several
tools which can be seen as working environments in which to choose data analysis
functionalities such as clustering, classification, regression, feature
extraction, etc., together with models and algorithms.
-
Due to the necessity to evaluate photo-z for a variety of huge sky survey
data sets, it seemed important to provide the astronomical community with an
instrument able to fill this gap. Besides the problem of moving massive data
sets over the network, another critical point is that a great part of
astronomical data is stored in private archives that are not fully accessible
on line. So, in order to evaluate photo-z, a desktop application is needed
that can be downloaded and used by everyone locally, i.e. on their own personal
computer or, more generally, within the local intranet hosted by a data center.
The name chosen for the application is PhotoRApToR, i.e. Photometric Research
Application To Redshift (Cavuoti et al. 2015, 2014; Brescia 2014b). It embeds a
machine learning algorithm and special tools dedicated to pre- and
post-processing data. The ML model is the MLPQNA (Multi Layer Perceptron
trained by the Quasi Newton Algorithm), which has proved particularly
powerful for photo-z calculation on the basis of a spectroscopic sample
(Cavuoti et al. 2012; Brescia et al. 2013, 2014a; Biviano et al. 2013).
The PhotoRApToR program package is available, for different platforms, at the
official website (http://dame.dsf.unina.it/dame_photoz.html#photoraptor).
-
The exploitation of present and future synoptic (multi-band and multi-epoch)
surveys requires an extensive use of automatic methods for data processing and
data interpretation. In this work, using data extracted from the Catalina Real
Time Transient Survey (CRTS), we investigate the classification performance of
some well tested methods: Random Forest, MLPQNA (Multi Layer Perceptron with
Quasi Newton Algorithm) and K-Nearest Neighbors, paying special attention to
the feature selection phase. To this end, several classification
experiments were performed, namely: identification of cataclysmic variables,
separation between galactic and extra-galactic objects, and identification of
supernovae.
-
The VIALACTEA project aims at building a predictive model of star formation
in our galaxy. We present the innovative integrated framework and the main
technologies and methodologies to reach this ambitious goal.
-
Calibrating the photometric redshifts of >10^9 galaxies for upcoming weak
lensing cosmology experiments is a major challenge for the astrophysics
community. The path to obtaining the required spectroscopic redshifts for
training and calibration is daunting, given the anticipated depths of the
surveys and the difficulty in obtaining secure redshifts for some faint galaxy
populations. Here we present an analysis of the problem based on the
self-organizing map, a method of mapping the distribution of data in a
high-dimensional space and projecting it onto a lower-dimensional
representation. We apply this method to existing photometric data from the
COSMOS survey selected to approximate the anticipated Euclid weak lensing
sample, enabling us to robustly map the empirical distribution of galaxies in
the multidimensional color space defined by the expected Euclid filters.
Mapping this multicolor distribution lets us determine where - in galaxy color
space - redshifts from current spectroscopic surveys exist and where they are
systematically missing. Crucially, the method lets us determine whether a
spectroscopic training sample is representative of the full photometric space
occupied by the galaxies in a survey. We explore optimal sampling techniques
and estimate the additional spectroscopy needed to map out the color-redshift
relation, finding that sampling the galaxy distribution in color space in a
systematic way can efficiently meet the calibration requirements. While the
analysis presented here focuses on the Euclid survey, similar analysis can be
applied to other surveys facing the same calibration challenge, such as DES,
LSST, and WFIRST.
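
The coverage test described above can be sketched with a small self-organizing map written from scratch. This is an illustration only, not the analysis pipeline of the paper: the grid size, the learning-rate and neighbourhood schedules, the two-dimensional toy "colors", and the function names are all assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid cell whose weight vector is closest to x."""
    return min(weights, key=lambda cell: math.dist(weights[cell], x))


def train_som(data, rows, cols, epochs=10, seed=0):
    """Train a rows x cols SOM with linearly decaying rate and radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)
            sigma = max(0.5, (max(rows, cols) / 2.0) * (1.0 - frac))
            bmu = best_matching_unit(weights, x)
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
            step += 1
    return weights


def uncovered_cells(weights, photometric, spectroscopic):
    """Cells occupied by photometric objects but by no spectroscopic one."""
    occupied = {best_matching_unit(weights, x) for x in photometric}
    covered = {best_matching_unit(weights, x) for x in spectroscopic}
    return occupied - covered


# Toy color space (made up): two galaxy populations, only one with spectra.
rng = random.Random(1)
pop_a = [[rng.gauss(0.2, 0.05), rng.gauss(0.2, 0.05)] for _ in range(100)]
pop_b = [[rng.gauss(0.8, 0.05), rng.gauss(0.8, 0.05)] for _ in range(100)]
photometric = pop_a + pop_b
spectroscopic = pop_a[:30]

som = train_som(photometric, rows=4, cols=4)
missing = uncovered_cells(som, photometric, spectroscopic)
print(sorted(missing))
```

Cells returned by `uncovered_cells` mark regions of color space where galaxies exist in the photometric sample but no spectroscopic redshift is available, which is where additional spectroscopy would be targeted.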
-
The Kilo-Degree Survey (KiDS) is an optical wide-field imaging survey carried
out with the VLT Survey Telescope and the OmegaCAM camera. KiDS will image 1500
square degrees in four filters (ugri), and together with its near-infrared
counterpart VIKING will produce deep photometry in nine bands. Designed for
weak lensing shape and photometric redshift measurements, the core science
driver of the survey is mapping the large-scale matter distribution in the
Universe back to a redshift of ~0.5. Secondary science cases are manifold,
covering topics such as galaxy evolution, Milky Way structure, and the
detection of high-redshift clusters and quasars.
KiDS is an ESO Public Survey and is dedicated to serving the astronomical
community with high-quality data products derived from the survey data, as well
as with calibration data. Public data releases will be made on a yearly basis,
the first two of which are presented here. For a total of 148 survey tiles
(~160 sq.deg.) astrometrically and photometrically calibrated, coadded ugri
images have been released, accompanied by weight maps, masks, source lists, and
a multi-band source catalog.
A dedicated pipeline and data management system based on the Astro-WISE
software system, combined with newly developed masking and source
classification software, is used to produce the data products
described here. The achieved data quality and early science projects based on
the data products in the first two data releases are reviewed in order to
validate the survey data. Early scientific results include the detection of
nine high-z QSOs, fifteen candidate strong gravitational lenses, high-quality
photometric redshifts and galaxy structural parameters for hundreds of
thousands of galaxies. (Abridged)
-
We estimated photometric redshifts (zphot) for more than 1.1 million galaxies
of the ESO Public Kilo-Degree Survey (KiDS) Data Release 2. KiDS is an optical
wide-field imaging survey carried out with the VLT Survey Telescope (VST) and
the OmegaCAM camera, which aims at tackling open questions in cosmology and
galaxy evolution, such as the origin of dark energy and the channel of galaxy
mass growth. We present a catalogue of photometric redshifts obtained using the
Multi Layer Perceptron with Quasi Newton Algorithm (MLPQNA) model, provided
within the framework of the DAta Mining and Exploration Web Application
REsource (DAMEWARE). These photometric redshifts are based on a spectroscopic
knowledge base which was obtained by merging spectroscopic datasets from GAMA
(Galaxy And Mass Assembly) data release 2 and SDSS-III data release 9. The
overall 1 sigma uncertainty on Delta z = (zspec - zphot) / (1+ zspec) is ~
0.03, with a very small average bias of ~ 0.001, an NMAD of ~ 0.02 and a
fraction of catastrophic outliers (| Delta z | > 0.15) of ~0.4%.
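
The quoted statistics can be computed from a spectroscopic test sample as in the sketch below. Delta z and the outlier threshold follow the definitions given above; the NMAD convention (1.4826 times the median absolute deviation from the median) and the use of the population standard deviation for the 1 sigma scatter are assumptions, since the abstract does not spell them out. The function name and the toy redshifts are hypothetical.

```python
import statistics


def photoz_statistics(z_spec, z_phot, outlier_threshold=0.15):
    """Summary statistics of Delta z = (zspec - zphot) / (1 + zspec)."""
    dz = [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    median_dz = statistics.median(dz)
    return {
        "bias": statistics.fmean(dz),
        "sigma": statistics.pstdev(dz),
        # Assumed convention: NMAD = 1.4826 * median(|dz - median(dz)|)
        "nmad": 1.4826 * statistics.median([abs(d - median_dz) for d in dz]),
        "outlier_fraction": sum(abs(d) > outlier_threshold for d in dz) / len(dz),
    }


# Toy redshifts (made up); the last object is a catastrophic outlier.
z_spec = [0.10, 0.25, 0.40, 0.55, 1.00]
z_phot = [0.11, 0.24, 0.42, 0.50, 0.50]
stats = photoz_statistics(z_spec, z_phot)
print(stats)
```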
-
We discuss whether modern machine learning methods can be used to
characterize the physical nature of the large number of objects sampled by the
modern multi-band digital surveys. In particular, we applied the MLPQNA (Multi
Layer Perceptron with Quasi Newton Algorithm) method to the optical data of the
Sloan Digital Sky Survey - Data Release 10, investigating whether photometric
data alone suffice to disentangle different classes of objects as they are
defined in the SDSS spectroscopic classification. We discuss three groups of
classification problems: (i) the simultaneous classification of galaxies,
quasars and stars; (ii) the separation of stars from quasars; (iii) the
separation of galaxies with normal spectral energy distribution from those with
peculiar spectra, such as starburst or starforming galaxies and AGN. While
confirming the difficulty of disentangling AGN from normal galaxies on a
photometric basis only, MLPQNA proved to be quite effective in the three-class
separation. In disentangling quasars from stars and galaxies, our method
achieved an overall efficiency of 91.31% and a QSO class purity of ~95%. The
resulting catalogue of candidate quasars/AGNs consists of ~3.6 million objects,
of which about half a million are also flagged as robust candidates, and will
be made available on the CDS VizieR facility.
-
Photometric redshifts (photo-z) are crucial to the scientific exploitation of
modern panchromatic digital surveys. In this paper we present PhotoRApToR
(Photometric Research Application To Redshift): a Java/C++ based desktop
application capable of solving non-linear regression and multi-variate
classification problems, in particular specialized for photo-z estimation. It
embeds a machine learning algorithm, namely a multilayer neural network trained
by the Quasi Newton learning rule, and special tools dedicated to pre- and
post-processing data. PhotoRApToR has been successfully tested on several
scientific cases. The application is available for free download from the DAME
Program web site.