-
We present a novel $N$-body simulation method that compactifies the infinite
spatial extent of the Universe into a finite sphere with isotropic boundary
conditions to follow the evolution of the large-scale structure. Our approach
eliminates the need for periodic boundary conditions, a mere numerical
convenience which is not supported by observation and which modifies the law of
force on large scales in an unrealistic fashion. We demonstrate that our method
outclasses standard simulations executed on workstation-scale hardware in
dynamic range, that it is balanced in following a comparable number of high and
low $k$ modes, and that its fundamental geometry and topology match observations. Our
approach is also capable of simulating an expanding, infinite universe in
static coordinates with Newtonian dynamics. The price of these achievements is
that most of the simulated volume has smoothly varying mass and spatial
resolution, an approximation that carries different systematics than periodic
simulations.
Our initial implementation of the method is called StePS, which stands for
Stereographically Projected Cosmological Simulations. It uses stereographic
projection for space compactification and a naive $\mathcal{O}(N^2)$ force
calculation, which nevertheless arrives at a correlation function of the same
quality faster than any standard (tree or P$^3$M) algorithm with similar
spatial and mass resolution. The $N^2$ force calculation is easy to adapt to
modern graphics cards, hence our code can function as a high-speed prediction
tool for modern large-scale surveys. To learn about the limits of the
respective methods, we compare StePS with GADGET-2
\citep{Gadget2_2005MNRAS.364.1105S} running matching initial conditions.
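
To illustrate what a naive $\mathcal{O}(N^2)$ force calculation involves, the following is a minimal sketch of direct pairwise summation in open (non-periodic) space. It is not the StePS implementation and does not include the stereographic compactification; the function name, the Plummer-type `softening` parameter and the unit choice `G=1` are assumptions made for illustration.

```python
import numpy as np

def direct_sum_accelerations(pos, mass, softening=1e-2, G=1.0):
    """Pairwise Newtonian accelerations by direct O(N^2) summation.

    pos  : (N, 3) array of particle positions
    mass : (N,) array of particle masses
    softening is a hypothetical Plummer softening length (an assumption).
    """
    # diff[i, j] = pos[j] - pos[i], the vector pointing from particle i to j
    diff = pos[None, :, :] - pos[:, None, :]
    dist2 = np.sum(diff**2, axis=-1) + softening**2
    inv_dist3 = dist2 ** -1.5
    np.fill_diagonal(inv_dist3, 0.0)  # a particle exerts no force on itself
    # a_i = G * sum_j m_j * (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    return G * np.einsum("ij,ijk->ik", mass[None, :] * inv_dist3, diff)
```

Every pair is visited once per evaluation, so the cost grows quadratically with particle number, but the arithmetic is regular and maps naturally onto data-parallel hardware.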
-
The recent AvERA cosmological simulation of R\'acz et al. (2017) has a
$\Lambda \mathrm{CDM}$-like expansion history and removes the tension between
local and Planck (cosmic microwave background) Hubble constants. We contrast
the AvERA prediction of the integrated Sachs--Wolfe (ISW) effect with that of
$\Lambda \mathrm{CDM}$. The linear ISW effect is proportional to the derivative
of the growth function, thus it is sensitive to small differences in the
expansion histories of the respective models. We create simulated ISW maps
tracing the path of light-rays through the Millennium XXL cosmological
simulation, and perform theoretical calculations of the ISW power spectrum.
AvERA predicts a significantly higher ISW effect than $\Lambda \mathrm{CDM}$,
$A=1.93-5.29$ times larger depending on the $l$ index of the spherical power
spectrum, which could be utilized to definitively differentiate the models. We
also show that AvERA predicts an opposite-sign ISW effect in the redshift range
$z \approx 1.5 - 4.4$, in clear contrast with $\Lambda \mathrm{CDM}$. Finally,
we compare our ISW predictions with previous observations. While at present
these cannot distinguish between the two models because of large error bars and
a lack of internal consistency that suggests systematics, ISW probes from future
surveys will tightly constrain the models.
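
The stated proportionality between the linear ISW effect and the derivative of the growth function can be written out explicitly. The following is a sketch in standard linear-theory notation; the symbols $\Phi$ (gravitational potential), $D$ (growth function), $a$ (scale factor), $H$ (Hubble parameter) and $f$ (growth rate) are introduced here for illustration and do not appear in the text above.

```latex
\frac{\Delta T}{T}(\hat{n}) = \frac{2}{c^{2}} \int \frac{\partial \Phi}{\partial t}\,\mathrm{d}t ,
\qquad
\Phi(\mathbf{k}, a) \propto \frac{D(a)}{a}
\;\Longrightarrow\;
\frac{\partial \Phi}{\partial t} \propto \frac{\mathrm{d}}{\mathrm{d}t}\!\left[\frac{D(a)}{a}\right]
= H(a)\,\frac{D(a)}{a}\,\bigl[f(a) - 1\bigr],
\qquad
f \equiv \frac{\mathrm{d}\ln D}{\mathrm{d}\ln a} .
```

In an Einstein--de~Sitter universe $D \propto a$, so $f = 1$ and the linear effect vanishes; in general the sign of the linear signal follows the sign of $f - 1$.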
-
In the last two decades, Computer Aided Diagnostics (CAD) systems have been
developed to help radiologists analyze screening mammograms. The reported
benefits of current CAD technologies appear to be contradictory, and the systems
need to be improved before they can be considered useful. Since 2012, deep convolutional neural
networks (CNN) have been a tremendous success in image recognition, reaching
human performance. These methods have greatly surpassed the traditional
approaches, which are similar to currently used CAD solutions. Deep CNNs have
the potential to revolutionize medical image analysis. We propose a CAD system
based on one of the most successful object detection frameworks, Faster R-CNN.
The system detects and classifies malignant or benign lesions on a mammogram
without any human intervention. The approach described here achieved
2nd place in the Digital Mammography DREAM Challenge with $\mathrm{AUC} = 0.85$. The
proposed method also sets the state-of-the-art classification performance on
the public INbreast database, $\mathrm{AUC} = 0.95$. When used as a detector, the
system reaches high sensitivity with very few false positive marks per image on
the INbreast dataset.
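
For readers unfamiliar with the AUC metric quoted above, here is a minimal sketch of how it can be computed from per-image labels and scores. It uses the rank-based (Mann-Whitney) definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name and the example data are illustrative assumptions, not part of the challenge tooling.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and real-valued scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```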
-
Monitoring network state can be crucial in Future Internet infrastructures.
Passive monitoring of all the routers is expensive and prohibitive. Storing,
accessing and sharing the data is a technological challenge among networks with
conflicting economic interests. Active monitoring methods can be attractive
alternatives as they are free from most of these issues. Here we demonstrate
that it is possible to improve the active network tomography methodology to
such an extent that the quality of the extracted link- or router-level delay is
comparable to the passively measurable information. We show that the temporal
precision of the measurements and the performance of the data analysis should
be simultaneously improved to achieve this goal. In this paper we not only
introduce a new efficient message-passing based algorithm but we also show that
it is applicable to data collected by the ETOMIC high-precision active
measurement infrastructure. The measurements are conducted in the GEANT2 high
speed academic network connecting the sites, which is an ideal test ground for
such Future Internet applications.
-
In Future Internet it is possible to change elements of congestion control in
order to eliminate jitter and batch loss caused by the current control
mechanisms based on packet loss events. We investigate the fundamental problem
of adjusting sending rates to achieve optimal utilization of highly variable
bandwidth of a network path using accurate packet rate information. This is
done by continuously controlling the sending rate with a function of the
measured packet rate at the receiver. We propose the relative loss of packet
rate between the sender and the receiver (Relative Rate Reduction, RRR) as a
new accurate and continuous measure of congestion of a network path, replacing
the erratically fluctuating packet loss. We demonstrate that, by choosing
various RRR-based feedback functions, the optimum is reached with an adjustable
congestion level. The proposed method guarantees fair bandwidth sharing among
competing flows. Finally, we present testbed experiments to demonstrate the
performance of the algorithm.
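
The RRR measure and a rate-based feedback loop can be sketched as follows. The definition of RRR as the relative loss of packet rate between sender and receiver follows the text; everything else is an assumption made for illustration: the single-bottleneck fluid model, the multiplicative feedback rule `send = recv * (1 + probe)`, and the names `simulate`, `capacity` and `probe`. It is not the feedback function studied in the paper.

```python
def relative_rate_reduction(send_rate, recv_rate):
    """RRR = (sending rate - receiving rate) / sending rate."""
    if send_rate <= 0:
        raise ValueError("send_rate must be positive")
    return (send_rate - recv_rate) / send_rate

def simulate(capacity, probe=0.05, start_rate=1.0, steps=200):
    """Toy fluid model of one flow over one bottleneck (hypothetical).

    capacity : function mapping the step index to the available bandwidth
    probe    : how far above the measured receive rate the sender transmits
    Returns a list of (send_rate, recv_rate, rrr) tuples, one per step.
    """
    send = start_rate
    history = []
    for t in range(steps):
        recv = min(send, capacity(t))  # the bottleneck caps the delivered rate
        history.append((send, recv, relative_rate_reduction(send, recv)))
        send = recv * (1.0 + probe)    # assumed feedback on the measured rate
    return history
```

With a constant capacity, this toy rule settles at a receive rate equal to the capacity and at RRR = probe / (1 + probe), which illustrates how a feedback parameter can set the steady congestion level.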
-
Viral videos can reach global penetration traveling through international
channels of communication similarly to real diseases starting from a
well-localized source. In past centuries, disease fronts propagated in a
concentric spatial fashion from the source of the outbreak via the
short-range human contact network. The emergence of long-distance air travel changed
these ancient patterns. However, recently, Brockmann and Helbing have shown
that concentric propagation waves can be reinstated if propagation time and
distance are measured in the flight-time and travel-volume weighted underlying
air-travel network. Here, we adopt this method for the analysis of viral meme
propagation in Twitter messages, and define a similar weighted network distance
in the communication network connecting countries and states of the World. We
recover a wave-like behavior on average and assess the randomizing effect of
non-locality of spreading. We show that similar results can be recovered from
Google Trends data as well.
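
As an illustration of a flux-weighted network distance of this kind, the sketch below uses the effective distance of Brockmann and Helbing, in which an edge from node $n$ to node $m$ has length $1 - \ln P_{mn}$, with $P_{mn}$ the fraction of traffic leaving $n$ that goes to $m$, and the distance from a source is the shortest path under these lengths. Whether the Twitter analysis uses exactly these weights is an assumption; the function name and the input layout are hypothetical.

```python
import heapq
import math

def effective_distances(flux, source):
    """Shortest-path effective distance from source to every reachable node.

    flux[n][m] is the traffic from node n to node m (dict of dicts).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, math.inf):
            continue  # stale heap entry
        out = flux.get(n, {})
        total = sum(out.values())
        for m, f in out.items():
            if f <= 0:
                continue
            # Edge length 1 - ln(P), P = share of n's outgoing traffic sent to m
            nd = d + 1.0 - math.log(f / total)
            if nd < dist.get(m, math.inf):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return dist
```

Because each edge length is at least 1, Dijkstra's algorithm applies directly; a weak direct link can be effectively longer than a route through a strongly connected hub.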
-
We present a flexible template-based photometric redshift estimation
framework, implemented in C#, that can be seamlessly integrated into a SQL
database (or DB) server and executed on-demand in SQL. The DB integration
eliminates the need to move large photometric datasets outside a database for
redshift estimation, and utilizes the computational capabilities of DB
hardware. The code is able to perform both maximum likelihood and Bayesian
estimation, and can handle inputs of variable photometric filter sets and
corresponding broad-band magnitudes. It is possible to take into account the
full covariance matrix between filters, and filter zero points can be
empirically calibrated using measurements with given redshifts. The list of
spectral templates and the prior can be specified flexibly, and the expensive
synthetic magnitude computations are done via lazy evaluation, coupled with a
caching of results. Parallel execution is fully supported. For large upcoming
photometric surveys such as the LSST, the ability to perform in-place photo-z
calculation would be a significant advantage. Also, the efficient handling of
variable filter sets is a necessity for heterogeneous databases, for example
the Hubble Source Catalog, and for cross-match services such as SkyQuery. We
illustrate the performance of our code on two reference photo-z estimation
testing datasets, and provide an analysis of execution time and scalability
with respect to different configurations. The code is available for download at
https://github.com/beckrob/Photo-z-SQL.
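
The maximum likelihood branch of template fitting can be sketched in a few lines. This is not the Photo-z-SQL code (which is written in C#) and it makes simplifying assumptions: it works with fluxes rather than magnitudes, uses independent per-filter errors instead of a full covariance matrix, and takes precomputed synthetic fluxes on a template-by-redshift grid. All names are hypothetical.

```python
import numpy as np

def best_fit_redshift(obs_flux, obs_err, model_flux, z_grid):
    """Chi-square template fit with a free amplitude per (template, redshift).

    obs_flux, obs_err : (F,) observed fluxes and their 1-sigma errors
    model_flux        : (T, Z, F) synthetic fluxes for T templates, Z redshifts
    z_grid            : (Z,) redshift values of the grid
    Returns (best redshift, best template index, full chi-square grid).
    """
    w = 1.0 / obs_err**2
    # Analytic best-fit scaling of each model to the observation
    amp = np.sum(w * obs_flux * model_flux, axis=-1) / np.sum(w * model_flux**2, axis=-1)
    resid = obs_flux - amp[..., None] * model_flux
    chi2 = np.sum(w * resid**2, axis=-1)
    t_idx, z_idx = np.unravel_index(np.argmin(chi2), chi2.shape)
    return z_grid[z_idx], t_idx, chi2
```

A Bayesian estimate could be obtained from the same grid by weighting `exp(-chi2 / 2)` with a prior over templates and redshifts, again under the assumptions stated above.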
-
According to the separate universe conjecture, spherically symmetric
sub-regions in an isotropic universe behave like mini-universes with their own
cosmological parameters. This is an excellent approximation in both Newtonian
and general relativistic theories. We estimate local expansion rates for a
large number of such regions, and use a scale parameter calculated from the
volume-averaged increments of local scale parameters at each time step in an
otherwise standard cosmological $N$-body simulation. The particle mass,
corresponding to a coarse graining scale, is an adjustable parameter. This mean
field approximation neglects tidal forces and boundary effects, but it is the
first step towards a non-perturbative statistical estimation of the effect of
non-linear evolution of structure on the expansion rate. Using our algorithm, a
simulation with an initial $\Omega_m=1$ Einstein--de~Sitter setting closely
tracks the expansion and structure growth history of the $\Lambda$CDM
cosmology. Due to small but characteristic differences, our model can be
distinguished from the $\Lambda$CDM model by future precision observations.
Moreover, our model can resolve the emerging tension between local Hubble
constant measurements and the Planck best-fitting cosmology. Further
improvements to the simulation are necessary to investigate light propagation
and confirm full consistency with cosmic microwave background observations.
-
Recently, numerous approaches have emerged in the social sciences to exploit
the opportunities made possible by the vast amounts of data generated by online
social networks (OSNs). Having access to information about users on such a
scale opens up a range of possibilities, all without the limitations associated
with often slow and expensive paper-based polls. A question that remains to be
satisfactorily addressed, however, is how demography is represented in the OSN
content. Here, we study language use in the US using a corpus of text compiled
from over half a billion geo-tagged messages from the online microblogging
platform Twitter. Our intention is to reveal the most important spatial
patterns in language use in an unsupervised manner and relate them to
demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented
with the Robust Principal Component Analysis (RPCA) methodology. We find
spatially correlated patterns that can be interpreted based on the words
associated with them. The main language features can be related to slang use,
urbanization, travel, religion and ethnicity, the patterns of which are shown
to correlate plausibly with traditional census data. Our findings thus validate
the concept of demography being represented in OSN language use and show that
the traits observed are inherently present in the word frequencies without any
previous assumptions about the dataset. Thus, they could form the basis of
further research focusing on the evaluation of demographic data estimation from
other big data sources, or on the dynamical processes that result in the
patterns found here.
-
We present the methodology and data behind the photometric redshift database
of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid
technique, empirically estimating the redshift via local regression on a
spectroscopic training set, then fitting a spectrum template to obtain
K-corrections and absolute magnitudes. The SDSS spectroscopic catalog was
augmented with data from other, publicly available spectroscopic surveys to
mitigate target selection effects. The training set comprises $1{,}976{,}978$
galaxies, and extends up to redshift $z\approx 0.8$, with a useful coverage of
up to $z\approx 0.6$. We provide photometric redshifts and realistic error
estimates for the $208{,}474{,}076$ galaxies of the SDSS primary photometric
catalog. We achieve an average bias of $\overline{\Delta z_{\mathrm{norm}}} =
5.84 \times 10^{-5}$, a standard deviation of $\sigma \left(\Delta
z_{\mathrm{norm}}\right)=0.0205$, and a $3\sigma$ outlier rate of $P_o=4.11\%$
when cross-validating on our training set. The published redshift error
estimates and photometric error classes enable the selection of galaxies with
high quality photometric redshifts. We also provide a supplementary error map
that allows additional, sophisticated filtering of the data.
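
The empirical step, local regression on a spectroscopic training set, can be sketched as a nearest-neighbour linear fit. The feature space (for example colours and a magnitude), the Euclidean metric, the neighbour count `k` and the function name are assumptions for illustration and are not the published configuration.

```python
import numpy as np

def local_linear_photoz(train_x, train_z, query_x, k=100):
    """Estimate a redshift by a linear fit to the k nearest training galaxies.

    train_x : (N, D) photometric features of the training set
    train_z : (N,) spectroscopic redshifts
    query_x : (D,) features of the galaxy to estimate
    """
    # Select the k nearest neighbours of the query point
    d2 = np.sum((train_x - query_x) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    # Least-squares fit of z = c0 + c . x on the neighbours only
    design = np.hstack([np.ones((len(idx), 1)), train_x[idx]])
    coef, *_ = np.linalg.lstsq(design, train_z[idx], rcond=None)
    return coef[0] + query_x @ coef[1:]
```

The scatter of the neighbours' redshifts around the local fit is one possible starting point for a per-object error estimate.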
-
We analyse the correlations between continuum properties and emission line
equivalent widths of star-forming and active galaxies from the Sloan Digital
Sky Survey. Since upcoming large sky surveys will make broad-band observations
only, including strong emission lines in theoretical modelling of spectra
will be essential to estimate physical properties of photometric galaxies. We
show that emission line equivalent widths can be fairly well reconstructed from
the stellar continuum using local multiple linear regression in the continuum
principal component analysis (PCA) space. Line reconstruction is good for
star-forming galaxies and reasonable for galaxies with active nuclei. We
propose a practical method to combine stellar population synthesis models with
empirical modelling of emission lines. The technique will help generate more
accurate model spectra and mock catalogues of galaxies to fit observations of
the new surveys. More accurate modelling of emission lines is also expected to
improve template-based photometric redshift estimation methods. We also show
that, by combining PCA coefficients from the pure continuum and the emission
lines, automatic distinction between hosts of weak active galactic nuclei
(AGNs) and quiescent star-forming galaxies can be made. The classification
method is based on a training set consisting of high-confidence starburst
galaxies and AGNs, and yields a separation of active and
star-forming galaxies similar to the empirical curve found by Kauffmann et al. We
demonstrate the use of three important machine learning algorithms in the
paper: k-nearest neighbour finding, k-means clustering and support vector
machines.
-
Why life persists at the edge of chaos is a question at the very heart of
evolution. Here we show that molecules taking part in biochemical processes
from small molecules to proteins are critical quantum mechanically. Electronic
Hamiltonians of biomolecules are tuned exactly to the critical point of the
metal-insulator transition separating the Anderson localized insulator phase
from the conducting disordered metal phase. Using tools from Random Matrix
Theory we confirm that the energy level statistics of these biomolecules show
the universal transitional distribution of the metal-insulator critical point
and the wave functions are multifractals in accordance with the theory of
Anderson transitions. The findings point to the existence of a universal
mechanism of charge transport in living matter. The revealed bio-conductor
material is neither a metal nor an insulator but a new quantum critical
material which can exist only in highly evolved systems and has unique material
properties.
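
One simple Random Matrix Theory diagnostic of energy level statistics is the ratio of consecutive level spacings, which needs no spectral unfolding. The sketch below is a swapped-in illustration and not necessarily the statistic used in the study: its mean is roughly 0.39 for uncorrelated (Poisson, localized) levels and roughly 0.53 for the Gaussian orthogonal ensemble (metallic), so a spectrum at the critical point would be expected to fall in between. The function names and matrix sizes are assumptions.

```python
import numpy as np

def mean_spacing_ratio(levels):
    """Mean of r_n = min(s_n, s_{n+1}) / max(s_n, s_{n+1}) over adjacent spacings."""
    s = np.diff(np.sort(levels))
    r = np.minimum(s[1:], s[:-1]) / np.maximum(s[1:], s[:-1])
    return r.mean()

def goe_levels(n, rng):
    """Eigenvalues of a random real symmetric (GOE-like) matrix."""
    a = rng.normal(size=(n, n))
    return np.linalg.eigvalsh((a + a.T) / 2.0)

def poisson_levels(n, rng):
    """Uncorrelated levels, as for a strongly localized system."""
    return rng.uniform(size=n)

rng = np.random.default_rng(0)
print("GOE    :", mean_spacing_ratio(goe_levels(1000, rng)))
print("Poisson:", mean_spacing_ratio(poisson_levels(1000, rng)))
```

To apply it to a molecule, the eigenvalues of its electronic Hamiltonian would take the place of the synthetic levels.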
-
One of the new discoveries in quantum biology is the role of Environment
Assisted Quantum Transport (ENAQT) in excitonic transport processes. In
disordered quantum systems transport is most efficient when the environment
just destroys quantum interferences responsible for localization, but the
coupling does not drive the system to fully classical thermal diffusion yet.
This poised realm between the pure quantum and the semi-classical domains has
not been considered in other biological transport processes, such as charge
transport through organic molecules. Binding in receptor-ligand complexes is
assumed to be static, as electrons are thought to be unable to cross the
ligand molecule. We show that ENAQT makes cross-ligand transport possible and
efficient between certain atoms, opening the way for the reorganization of the
charge distribution on the receptor when the ligand molecule docks. This new
effect can potentially change our understanding of how receptors work. We
demonstrate room temperature ENAQT on the caffeine molecule.
-
A main focus in economics research is understanding the time series of prices
of goods and assets. While statistical models using only the properties of the
time series itself have been successful in many aspects, we expect to gain a
better understanding of the phenomena involved if we can model the underlying
system of interacting agents. In this article, we consider the history of
Bitcoin, a novel digital currency system, for which the complete list of
transactions is available for analysis. Using this dataset, we reconstruct the
transaction network between users and analyze changes in the structure of the
subgraph induced by the most active users. Our approach is based on the
unsupervised identification of important features of the time variation of the
network. Applying the widely used method of Principal Component Analysis to the
matrix constructed from snapshots of the network at different times, we are
able to show how structural changes in the network accompany significant
changes in the exchange price of bitcoins.
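
A minimal sketch of applying Principal Component Analysis to a matrix built from network snapshots is given below. The layout is an assumption: each snapshot is taken to be the adjacency (or transaction-weight) matrix of the same fixed set of users, flattened into one row. The function name and the return values are hypothetical.

```python
import numpy as np

def snapshot_pca(snapshots, n_components=3):
    """PCA of a time series of network snapshots.

    snapshots : sequence of equally shaped (n, n) matrices, one per time step
    Returns (scores, components, explained variance ratios):
      scores[t, i]  - contribution of component i at time t
      components[i] - flattened base network of component i
    """
    X = np.asarray([np.ravel(s) for s in snapshots], dtype=float)
    X = X - X.mean(axis=0)  # remove the time-averaged network
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, Vt[:n_components], explained[:n_components]
```

The time series in `scores` can then be compared with an external signal such as the exchange price.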
-
We present a case study about the spatial indexing and regional
classification of billions of geographic coordinates from geo-tagged social
network data using Hierarchical Triangular Mesh (HTM) implemented for Microsoft
SQL Server. Due to the lack of certain features of the HTM library, we use it
in conjunction with the GIS functions of SQL Server to significantly increase
the efficiency of pre-filtering of spatial filter and join queries. For
example, we implemented a new algorithm to compute the HTM tessellation of
complex geographic regions and precomputed the intersections of HTM triangles
and geographic regions for faster false-positive filtering. With full control
over the index structure, HTM-based pre-filtering of simple containment
searches outperforms SQL Server spatial indices by a factor of ten and
HTM-based spatial joins run about a hundred times faster.
-
The possibility to analyze everyday monetary transactions is limited by the
scarcity of available data, as this kind of information is usually considered
highly sensitive. Present econophysics models are usually employed on presumed
random networks of interacting agents, and only macroscopic properties (e.g.
the resulting wealth distribution) are compared to real-world data. In this
paper, we analyze Bitcoin, which is a novel digital currency system, where the
complete list of transactions is publicly available. Using this dataset, we
reconstruct the network of transactions, and extract the time and amount of
each payment. We analyze the structure of the transaction network by measuring
network characteristics over time, such as the degree distribution, degree
correlations and clustering. We find that linear preferential attachment drives
the growth of the network. We also study the dynamics taking place on the
transaction network, i.e. the flow of money. We measure temporal patterns and
the wealth accumulation. Investigating the microscopic statistics of money
movement, we find that sublinear preferential attachment governs the evolution
of the wealth distribution. We report a scaling relation between the degree and
wealth associated with individual nodes.
-
Correlations and other collective phenomena in a schematic model of
heterogeneous binary agents (individual spin-glass samples) are considered on
the complete graph and also on 2d and 3d regular lattices. The system's
stochastic dynamics is studied by numerical simulations. The dynamics is so
slow that one can meaningfully speak of quasi-equilibrium states. Performing
measurements of correlations in such a quasi-equilibrium state we find that
they are random both as to their sign and absolute value, but on average they
fall off very slowly with distance in all instances that we have studied. This
means that the system is essentially non-local: small changes at one end may
have a strong impact at the other. Correlations and other local quantities are
extremely sensitive to the boundary conditions all across the system, although
this sensitivity disappears upon averaging over the samples or partially
averaging over the agents. The strong, random correlations tend to organize a
large fraction of the agents into strongly correlated clusters that act
together. If we think about this model as a distant metaphor of economic agents
or bank networks, the systemic risk implications of this tendency are clear:
any impact on even a single strongly correlated agent will spread, in an
unforeseeable manner, to the whole system via the strong random correlations.
-
Understanding the diversity in spectra is the key to determining the physical
parameters of galaxies. The optical spectra of galaxies are highly convoluted
with continuum and lines which are potentially sensitive to different physical
parameters. Defining the wavelength regions of interest is therefore an
important question. In this work, we identify informative wavelength regions in
a single-burst stellar populations model by using the CUR Matrix Decomposition.
Simulating the Lick/IDS spectrograph configuration, we recover the widely used
Dn(4000), Hbeta, and HdeltaA to be most informative. Simulating the SDSS
spectrograph configuration with a wavelength range 3450-8350 Angstrom and a
model-limited spectral resolution of 3 Angstrom, the most informative regions
are: (1) the 4000 Angstrom break and the Hdelta line; (2) the Fe-like
indices; (3) the Hbeta line; and (4) the G band and the Hgamma line. A
Principal Component Analysis on the first region
shows that the first eigenspectrum primarily traces the stellar age, the second
eigenspectrum is related to the age-metallicity degeneracy, and the third
eigenspectrum shows an anti-correlation between the strengths of the Balmer and
the Ca K and H absorptions. The regions can be used to determine the stellar
age and metallicity in early-type galaxies which have solar abundance ratios,
no dust, and a single-burst star formation history. The region identification
method can be applied to any set of spectra of the user's interest, so that we
eliminate the need for a common, fixed-resolution index system. We discuss
future directions in extending the current analysis to late-type galaxies.
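
In the usual formulation of the CUR Matrix Decomposition, columns (here, wavelength bins) are ranked by their statistical leverage scores, computed from the leading right singular vectors of the data matrix. The sketch below shows only that scoring step under assumed conventions: rows are spectra, columns are wavelength bins, and `rank` is a hypothetical choice of how many singular vectors to keep. It is not the analysis pipeline of the paper.

```python
import numpy as np

def column_leverage_scores(A, rank):
    """Normalized leverage score of each column of A.

    A    : (n_spectra, n_wavelengths) data matrix
    rank : number of leading right singular vectors used
    The scores are non-negative and sum to 1.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:rank] ** 2, axis=0) / rank
```

Columns with high scores carry most of the structure captured by the leading singular vectors, and contiguous runs of such columns would mark candidate informative wavelength regions.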
-
Twitter is a popular public conversation platform with a worldwide audience
and diverse forms of connections between users. In this paper we introduce the
concept of aggregated regional Twitter networks in order to characterize
communication between geopolitical regions. We present the study of a follower
and a mention graph created from an extensive data set collected during the
second half of 2012. A k-shell decomposition reveals the global
core-periphery structure, and by means of a modified Regional-SIR
model we also consider basic information spreading properties.
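
The k-shell decomposition mentioned above can be sketched as iterative peeling of an undirected graph: nodes of the lowest remaining degree are removed repeatedly, and each node is labelled with the peeling level at which it disappears. The adjacency-dictionary input format and the function name are assumptions for illustration; a follower graph is directed and would first have to be symmetrized or treated by in- or out-degree.

```python
def k_shell(adj):
    """Return {node: shell index} for an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    degree = {v: len(nb) for v, nb in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # The next shell index is the smallest degree still present
        k = max(k, min(degree[v] for v in remaining))
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            shell[v] = k
            # Removing v lowers its neighbours' degrees; they may drop into shell k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        stack.append(u)
    return shell
```

Nodes with the highest shell index form the core; low indices mark the periphery.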
-
Principal component analysis (PCA) and related techniques have been
successfully employed in natural language processing. Text mining applications
in the age of the online social media (OSM) face new challenges due to
properties specific to these use cases (e.g. spelling issues specific to texts
posted by users, the presence of spammers and bots, service announcements,
etc.). In this paper, we employ a Robust PCA technique to separate typical
outliers and highly localized topics from the low-dimensional structure present
in language use in online social networks. Our focus is on identifying
geospatial features among the messages posted by the users of the Twitter
microblogging service. Using a dataset which consists of over 200 million
geolocated tweets collected over the course of a year, we investigate whether
the information present in word usage frequencies can be used to identify
regional features of language use and topics of interest. Using the PCA pursuit
method, we are able to identify important low-dimensional features, which
constitute smoothly varying functions of the geographic location.
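
Robust PCA by principal component pursuit splits a data matrix $M$ into a low-rank part $L$ and a sparse part $S$ by minimizing $\|L\|_* + \lambda \|S\|_1$ subject to $L + S = M$. The sketch below is a generic augmented-Lagrangian solver with commonly used default parameters ($\lambda = 1/\sqrt{\max(m, n)}$ and a fixed penalty `mu`); these choices, the function name and the stopping rule are assumptions, not the configuration used for the Twitter data.

```python
import numpy as np

def robust_pca(M, max_iter=1000, tol=1e-7):
    """Decompose M into low-rank L and sparse S with L + S approximately M."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = m * n / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint L + S = M
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        # L update: singular value thresholding at level 1/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # S update: entrywise soft thresholding at level lam/mu
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        resid = M - L - S
        Y = Y + mu * resid
        if np.linalg.norm(resid) <= tol * norm_M:
            break
    return L, S
```

In a setting like the one described, rows could be geographic cells and columns word frequencies, so that $L$ holds the smooth regional structure and $S$ the outliers and highly localized topics.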
-
Despite their relatively low sampling factor, the freely available, randomly
sampled status streams of Twitter are very useful sources of geographically
embedded social network data. To statistically analyze the information Twitter
provides via these streams, we have collected a year's worth of data and built
a multi-terabyte relational database from it. The database is designed for fast
data loading and to support a wide range of studies focusing on the statistics
and geographic features of social networks, as well as on the linguistic
analysis of tweets. In this paper we present the method of data collection, the
database design, the data loading procedure and special treatment of geo-tagged
and multi-lingual data. We also provide some SQL recipes for computing network
statistics.
-
Position angle measurements of Sloan Digital Sky Survey (SDSS) galaxies, as
measured by the surface brightness profile fitting code of the SDSS photometric
pipeline (Lupton 2001), are known to be strongly biased, especially in the case
of almost face-on and highly inclined galaxies. To address this issue we
developed a reliable algorithm which determines position angles by means of
isophote fitting. In this paper we present our algorithm and a catalogue of
position angles for 26397 SDSS galaxies taken from the deep co-added Stripe 82
(equatorial stripe) images.
-
Scaling phenomena have been intensively studied during the past decade in the
context of complex networks. As part of these works, recently novel methods
have appeared to measure the dimension of abstract and spatially embedded
networks. In this paper we propose a new dimension measurement method for
networks that does not require global knowledge of the embedding of the
nodes; instead, it exploits link-wise information (link lengths, link delays or
other physical quantities). Our method can be regarded as a generalization of
the spectral dimension, which grasps the network's large-scale structure through
local observations made by a random walker while traversing the links. We apply
the presented method to synthetic and real-world networks, including road maps,
the Internet infrastructure and the Gowalla geosocial network. We analyze the
theoretically and empirically designated case when the length distribution of
the links has the form $P(r) \sim 1/r$. We show that while previous dimension
concepts are not applicable in this case, the new dimension measure still
exhibits scaling with two distinct scaling regimes. Our observations suggest
that the link length distribution is not sufficient in itself to entirely
control the dimensionality of complex networks, and we show that the proposed
measure provides information that complements other known measures.
-
Many fields of science rely on relational database management systems to
analyze, publish and share data. Since RDBMS were originally designed for, and
their development directions are primarily driven by, business use cases, they
often lack features that are very important for scientific applications. Horizontal
scalability is probably the most important missing feature, and its absence makes it
challenging to adapt traditional relational database systems to
ever-growing data sizes. Due to the limited support for array data types and metadata
management, successful application of RDBMS in science usually requires the
development of custom extensions. While some of these extensions are specific
to the field of science, the majority of them could easily be generalized and
reused in other disciplines. With the Graywulf project we intend to target
several goals. We are building a generic platform that offers reusable
components for efficient storage, transformation, statistical analysis and
presentation of scientific data stored in Microsoft SQL Server. Graywulf also
addresses the distributed computational issues arising from current RDBMS
technologies. The current version supports load balancing of simple queries and
parallel execution of partitioned queries over a set of mirrored databases.
Uniform user access to the data is provided through a web based query interface
and a data surface for software clients. Queries are formulated in a slightly
modified syntax of SQL that offers a transparent view of the distributed data.
The software library consists of several components that can be reused to
develop complex scientific data warehouses: a system registry, administration
tools to manage entire database server clusters, a sophisticated workflow
execution framework, and a SQL parser library.
-
Getting spectra at good signal-to-noise ratios takes orders of magnitude
more time than photometric observations. Building on the technique developed
for photometric redshift estimation of galaxies, we develop and demonstrate a
non-parametric photometric method for estimating the chemical composition of
galactic stars. We investigate the efficiency of our method using
spectroscopically determined stellar metallicities from SDSS DR7. The technique
is generic in the sense that it is not restricted to certain stellar types or
stellar parameter ranges and makes it possible to obtain metallicities and
error estimates for a much larger sample than spectroscopic surveys would
allow. We find that our method performs well, especially for brighter stars and
higher metallicities and, in contrast to many other techniques, we are able to
reliably estimate the error of the predicted metallicities.