• We present a novel $N$-body simulation method that compactifies the infinite spatial extent of the Universe into a finite sphere with isotropic boundary conditions to follow the evolution of the large-scale structure. Our approach eliminates the need for periodic boundary conditions, a mere numerical convenience which is not supported by observation and which modifies the law of force on large scales in an unrealistic fashion. We demonstrate that our method outclasses standard simulations executed on workstation-scale hardware in dynamic range, that it is balanced in following a comparable number of high and low $k$ modes, and that its fundamental geometry and topology match observations. Our approach is also capable of simulating an expanding, infinite universe in static coordinates with Newtonian dynamics. The price of these achievements is that most of the simulated volume has smoothly varying mass and spatial resolution, an approximation that carries different systematics from those of periodic simulations. Our initial implementation of the method is called StePS, which stands for Stereographically Projected Cosmological Simulations. It uses stereographic projection for space compactification and a naive $\mathcal{O}(N^2)$ force calculation, which nevertheless arrives at a correlation function of the same quality faster than any standard (tree or P$^3$M) algorithm with similar spatial and mass resolution. The $N^2$ force calculation is easy to adapt to modern graphics cards, hence our code can function as a high-speed prediction tool for modern large-scale surveys. To learn about the limits of the respective methods, we compare StePS with GADGET-2 \citep{Gadget2_2005MNRAS.364.1105S} running matching initial conditions.
  • The recent AvERA cosmological simulation of R\'acz et al. (2017) has a $\Lambda \mathrm{CDM}$-like expansion history and removes the tension between local and Planck (cosmic microwave background) Hubble constants. We contrast the AvERA prediction of the integrated Sachs--Wolfe (ISW) effect with that of $\Lambda \mathrm{CDM}$. The linear ISW effect is proportional to the derivative of the growth function, thus it is sensitive to small differences in the expansion histories of the respective models. We create simulated ISW maps tracing the path of light-rays through the Millennium XXL cosmological simulation, and perform theoretical calculations of the ISW power spectrum. AvERA predicts a significantly higher ISW effect than $\Lambda \mathrm{CDM}$, $A=1.93-5.29$ times larger depending on the $l$ index of the spherical power spectrum, which could be utilized to definitively differentiate the models. We also show that AvERA predicts an opposite-sign ISW effect in the redshift range $z \approx 1.5 - 4.4$, in clear contrast with $\Lambda \mathrm{CDM}$. Finally, we compare our ISW predictions with previous observations. While at present these cannot distinguish between the two models due to large error bars, and lack of internal consistency suggesting systematics, ISW probes from future surveys will tightly constrain the models.
  • In the last two decades, Computer Aided Diagnostics (CAD) systems have been developed to help radiologists analyze screening mammograms. The reported benefits of current CAD technologies appear to be contradictory, and the technologies should be improved before they can be considered useful. Since 2012, deep convolutional neural networks (CNNs) have been a tremendous success in image recognition, reaching human performance. These methods have greatly surpassed the traditional approaches, which are similar to currently used CAD solutions. Deep CNNs have the potential to revolutionize medical image analysis. We propose a CAD system based on one of the most successful object detection frameworks, Faster R-CNN. The system detects and classifies malignant or benign lesions on a mammogram without any human intervention. The approach described here achieved 2nd place in the Digital Mammography DREAM Challenge with $AUC = 0.85$. The proposed method also sets the state-of-the-art classification performance on the public INbreast database, $AUC = 0.95$. When used as a detector, the system reaches high sensitivity with very few false positive marks per image on the INbreast dataset.
  • Monitoring network state can be crucial in Future Internet infrastructures. Passive monitoring of all the routers is prohibitively expensive. Storing, accessing and sharing the data is a technological challenge among networks with conflicting economic interests. Active monitoring methods can be attractive alternatives as they are free from most of these issues. Here we demonstrate that it is possible to improve the active network tomography methodology to such an extent that the quality of the extracted link or router level delay is comparable to the passively measurable information. We show that the temporal precision of the measurements and the performance of the data analysis should be simultaneously improved to achieve this goal. In this paper we not only introduce a new efficient message-passing based algorithm but also show that it is applicable to data collected by the ETOMIC high precision active measurement infrastructure. The measurements are conducted in the GEANT2 high speed academic network connecting the sites, which is an ideal test ground for such Future Internet applications.
  • In the Future Internet it is possible to change elements of congestion control in order to eliminate jitter and batch loss caused by the current control mechanisms based on packet loss events. We investigate the fundamental problem of adjusting sending rates to achieve optimal utilization of the highly variable bandwidth of a network path using accurate packet rate information. This is done by continuously controlling the sending rate as a function of the packet rate measured at the receiver. We propose the relative loss of packet rate between the sender and the receiver (Relative Rate Reduction, RRR) as a new accurate and continuous measure of congestion of a network path, replacing the erratically fluctuating packet loss. We demonstrate that by choosing various RRR-based feedback functions the optimum is reached with an adjustable congestion level. The proposed method guarantees fair bandwidth sharing of competing flows. Finally, we present testbed experiments to demonstrate the performance of the algorithm.
  • Viral videos can reach global penetration traveling through international channels of communication similarly to real diseases starting from a well-localized source. In past centuries, disease fronts propagated in a concentric spatial fashion from the source of the outbreak via the short range human contact network. The emergence of long-distance air-travel changed these ancient patterns. However, recently, Brockmann and Helbing have shown that concentric propagation waves can be reinstated if propagation time and distance are measured in the flight-time and travel volume weighted underlying air-travel network. Here, we adopt this method for the analysis of viral meme propagation in Twitter messages, and define a similar weighted network distance in the communication network connecting countries and states of the World. We recover a wave-like behavior on average and assess the randomizing effect of non-locality of spreading. We show that similar results can be recovered from Google Trends data as well.
  • We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z estimation testing datasets, and provide an analysis of execution time and scalability with respect to different configurations. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
  • According to the separate universe conjecture, spherically symmetric sub-regions in an isotropic universe behave like mini-universes with their own cosmological parameters. This is an excellent approximation in both Newtonian and general relativistic theories. We estimate local expansion rates for a large number of such regions, and use a scale parameter calculated from the volume-averaged increments of local scale parameters at each time step in an otherwise standard cosmological $N$-body simulation. The particle mass, corresponding to a coarse graining scale, is an adjustable parameter. This mean field approximation neglects tidal forces and boundary effects, but it is the first step towards a non-perturbative statistical estimation of the effect of non-linear evolution of structure on the expansion rate. Using our algorithm, a simulation with an initial $\Omega_m=1$ Einstein--de~Sitter setting closely tracks the expansion and structure growth history of the $\Lambda$CDM cosmology. Due to small but characteristic differences, our model can be distinguished from the $\Lambda$CDM model by future precision observations. Moreover, our model can resolve the emerging tension between local Hubble constant measurements and the Planck best-fitting cosmology. Further improvements to the simulation are necessary to investigate light propagation and confirm full consistency with cosmic microwave background observations.
  • Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities, all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content. Here, we study language use in the US using a corpus of text compiled from over half a billion geo-tagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented with the Robust Principal Component Analysis (RPCA) methodology. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, the patterns of which are shown to correlate plausibly with traditional census data. Our findings thus validate the concept of demography being represented in OSN language use and show that the traits observed are inherently present in the word frequencies without any previous assumptions about the dataset. Thus, they could form the basis of further research focusing on the evaluation of demographic data estimation from other big data sources, or on the dynamical processes that result in the patterns found here.
  • We present the methodology and data behind the photometric redshift database of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS spectroscopic catalog was augmented with data from other, publicly available spectroscopic surveys to mitigate target selection effects. The training set comprises $1,976,978$ galaxies, and extends up to redshift $z\approx 0.8$, with a useful coverage of up to $z\approx 0.6$. We provide photometric redshifts and realistic error estimates for the $208,474,076$ galaxies of the SDSS primary photometric catalog. We achieve an average bias of $\overline{\Delta z_{\mathrm{norm}}} = 5.84 \times 10^{-5}$, a standard deviation of $\sigma \left(\Delta z_{\mathrm{norm}}\right)=0.0205$, and a $3\sigma$ outlier rate of $P_o=4.11\%$ when cross-validating on our training set. The published redshift error estimates and photometric error classes enable the selection of galaxies with high quality photometric redshifts. We also provide a supplementary error map that allows additional, sophisticated filtering of the data.
  • We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines in the theoretical modelling of spectra will be essential to estimate physical properties of photometric galaxies. We show that emission line equivalent widths can be fairly well reconstructed from the stellar continuum using local multiple linear regression in the continuum principal component analysis (PCA) space. Line reconstruction is good for star-forming galaxies and reasonable for galaxies with active nuclei. We propose a practical method to combine stellar population synthesis models with empirical modelling of emission lines. The technique will help generate more accurate model spectra and mock catalogues of galaxies to fit observations of the new surveys. More accurate modelling of emission lines is also expected to improve template-based photometric redshift estimation methods. We also show that, by combining PCA coefficients from the pure continuum and the emission lines, automatic distinction between hosts of weak active galactic nuclei (AGNs) and quiescent star-forming galaxies can be made. The classification method is based on a training set consisting of high-confidence starburst galaxies and AGNs, and allows for a separation of active and star-forming galaxies similar to that given by the empirical curve found by Kauffmann et al. We demonstrate the use of three important machine learning algorithms in the paper: k-nearest neighbour finding, k-means clustering and support vector machines.
  • Why life persists at the edge of chaos is a question at the very heart of evolution. Here we show that molecules taking part in biochemical processes from small molecules to proteins are critical quantum mechanically. Electronic Hamiltonians of biomolecules are tuned exactly to the critical point of the metal-insulator transition separating the Anderson localized insulator phase from the conducting disordered metal phase. Using tools from Random Matrix Theory we confirm that the energy level statistics of these biomolecules show the universal transitional distribution of the metal-insulator critical point and the wave functions are multifractals in accordance with the theory of Anderson transitions. The findings point to the existence of a universal mechanism of charge transport in living matter. The revealed bio-conductor material is neither a metal nor an insulator but a new quantum critical material which can exist only in highly evolved systems and has unique material properties.
  • One of the new discoveries in quantum biology is the role of Environment Assisted Quantum Transport (ENAQT) in excitonic transport processes. In disordered quantum systems transport is most efficient when the environment just destroys quantum interferences responsible for localization, but the coupling does not yet drive the system to fully classical thermal diffusion. This poised realm between the pure quantum and the semi-classical domains has not been considered in other biological transport processes, such as charge transport through organic molecules. Binding in receptor-ligand complexes is assumed to be static, as electrons are assumed to be unable to cross the ligand molecule. We show that ENAQT makes cross-ligand transport possible and efficient between certain atoms, opening the way for the reorganization of the charge distribution on the receptor when the ligand molecule docks. This new effect can potentially change our understanding of how receptors work. We demonstrate room temperature ENAQT on the caffeine molecule.
  • A main focus in economics research is understanding the time series of prices of goods and assets. While statistical models using only the properties of the time series itself have been successful in many aspects, we expect to gain a better understanding of the phenomena involved if we can model the underlying system of interacting agents. In this article, we consider the history of Bitcoin, a novel digital currency system, for which the complete list of transactions is available for analysis. Using this dataset, we reconstruct the transaction network between users and analyze changes in the structure of the subgraph induced by the most active users. Our approach is based on the unsupervised identification of important features of the time variation of the network. Applying the widely used method of Principal Component Analysis to the matrix constructed from snapshots of the network at different times, we are able to show how structural changes in the network accompany significant changes in the exchange price of bitcoins.
  • We present a case study about the spatial indexing and regional classification of billions of geographic coordinates from geo-tagged social network data using Hierarchical Triangular Mesh (HTM) implemented for Microsoft SQL Server. Due to the lack of certain features of the HTM library, we use it in conjunction with the GIS functions of SQL Server to significantly increase the efficiency of pre-filtering of spatial filter and join queries. For example, we implemented a new algorithm to compute the HTM tessellation of complex geographic regions and precomputed the intersections of HTM triangles and geographic regions for faster false-positive filtering. With full control over the index structure, HTM-based pre-filtering of simple containment searches outperforms SQL Server spatial indices by a factor of ten and HTM-based spatial joins run about a hundred times faster.
  • The possibility of analyzing everyday monetary transactions is limited by the scarcity of available data, as this kind of information is usually considered highly sensitive. Present econophysics models are usually employed on presumed random networks of interacting agents, and only macroscopic properties (e.g. the resulting wealth distribution) are compared to real-world data. In this paper, we analyze Bitcoin, which is a novel digital currency system, where the complete list of transactions is publicly available. Using this dataset, we reconstruct the network of transactions, and extract the time and amount of each payment. We analyze the structure of the transaction network by measuring network characteristics over time, such as the degree distribution, degree correlations and clustering. We find that linear preferential attachment drives the growth of the network. We also study the dynamics taking place on the transaction network, i.e. the flow of money. We measure temporal patterns and the wealth accumulation. Investigating the microscopic statistics of money movement, we find that sublinear preferential attachment governs the evolution of the wealth distribution. We report a scaling relation between the degree and wealth associated with individual nodes.
  • Correlations and other collective phenomena in a schematic model of heterogeneous binary agents (individual spin-glass samples) are considered on the complete graph and also on 2d and 3d regular lattices. The system's stochastic dynamics is studied by numerical simulations. The dynamics is so slow that one can meaningfully speak of quasi-equilibrium states. Performing measurements of correlations in such a quasi-equilibrium state we find that they are random both as to their sign and absolute value, but on average they fall off very slowly with distance in all instances that we have studied. This means that the system is essentially non-local: small changes at one end may have a strong impact at the other. Correlations and other local quantities are extremely sensitive to the boundary conditions all across the system, although this sensitivity disappears upon averaging over the samples or partially averaging over the agents. The strong, random correlations tend to organize a large fraction of the agents into strongly correlated clusters that act together. If we think about this model as a distant metaphor of economic agents or bank networks, the systemic risk implications of this tendency are clear: any impact on even a single strongly correlated agent will spread, in an unforeseeable manner, to the whole system via the strong random correlations.
  • Understanding the diversity in spectra is the key to determining the physical parameters of galaxies. The optical spectra of galaxies are highly convoluted with continuum and lines which are potentially sensitive to different physical parameters. Defining the wavelength regions of interest is therefore an important question. In this work, we identify informative wavelength regions in a single-burst stellar populations model by using the CUR Matrix Decomposition. Simulating the Lick/IDS spectrograph configuration, we recover the widely used Dn(4000), Hbeta, and HdeltaA to be most informative. Simulating the SDSS spectrograph configuration with a wavelength range 3450-8350 Angstrom and a model-limited spectral resolution of 3 Angstrom, the most informative regions are: first, the 4000 Angstrom break and the Hdelta line; second, the Fe-like indices; third, the Hbeta line; fourth, the G band and the Hgamma line. A Principal Component Analysis on the first region shows that the first eigenspectrum primarily traces the stellar age, the second eigenspectrum is related to the age-metallicity degeneracy, and the third eigenspectrum shows an anti-correlation between the strengths of the Balmer and the Ca K and H absorptions. The regions can be used to determine the stellar age and metallicity in early-type galaxies which have solar abundance ratios, no dust, and a single-burst star formation history. The region identification method can be applied to any set of spectra of the user's interest, so that we eliminate the need for a common, fixed-resolution index system. We discuss future directions in extending the current analysis to late-type galaxies.
  • Twitter is a popular public conversation platform with a worldwide audience and diverse forms of connections between users. In this paper we introduce the concept of aggregated regional Twitter networks in order to characterize communication between geopolitical regions. We present the study of a follower and a mention graph created from an extensive data set collected during the second half of 2012. With a k-shell decomposition the global core-periphery structure is revealed, and by means of a modified Regional-SIR model we also consider basic information spreading properties.
  • Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of the online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the PCA pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
  • Despite their relatively low sampling factor, the freely available, randomly sampled status streams of Twitter are very useful sources of geographically embedded social network data. To statistically analyze the information Twitter provides via these streams, we have collected a year's worth of data and built a multi-terabyte relational database from it. The database is designed for fast data loading and to support a wide range of studies focusing on the statistics and geographic features of social networks, as well as on the linguistic analysis of tweets. In this paper we present the method of data collection, the database design, the data loading procedure and special treatment of geo-tagged and multi-lingual data. We also provide some SQL recipes for computing network statistics.
  • Position angle measurements of Sloan Digital Sky Survey (SDSS) galaxies, as measured by the surface brightness profile fitting code of the SDSS photometric pipeline (Lupton 2001), are known to be strongly biased, especially in the case of almost face-on and highly inclined galaxies. To address this issue we developed a reliable algorithm which determines position angles by means of isophote fitting. In this paper we present our algorithm and a catalogue of position angles for 26397 SDSS galaxies taken from the deep co-added Stripe 82 (equatorial stripe) images.
  • Scaling phenomena have been intensively studied during the past decade in the context of complex networks. As part of these works, novel methods have recently appeared to measure the dimension of abstract and spatially embedded networks. In this paper we propose a new dimension measurement method for networks, which does not require global knowledge of the embedding of the nodes; instead, it exploits link-wise information (link lengths, link delays or other physical quantities). Our method can be regarded as a generalization of the spectral dimension, which grasps the network's large-scale structure through local observations made by a random walker while traversing the links. We apply the presented method to synthetic and real-world networks, including road maps, the Internet infrastructure and the Gowalla geosocial network. We analyze the theoretically and empirically designated case when the length distribution of the links has the form P(r) ~ 1/r. We show that while previous dimension concepts are not applicable in this case, the new dimension measure still exhibits scaling with two distinct scaling regimes. Our observations suggest that the link length distribution is not sufficient in itself to entirely control the dimensionality of complex networks, and we show that the proposed measure provides information that complements other known measures.
  • Many fields of science rely on relational database management systems (RDBMS) to analyze, publish and share data. Since RDBMS were originally designed for, and their development directions are primarily driven by, business use cases, they often lack features that are very important for scientific applications. Horizontal scalability is probably the most important missing feature, which makes it challenging to adapt traditional relational database systems to the ever growing data sizes. Due to the limited support of array data types and metadata management, successful application of RDBMS in science usually requires the development of custom extensions. While some of these extensions are specific to the field of science, the majority of them could easily be generalized and reused in other disciplines. With the Graywulf project we intend to target several goals. We are building a generic platform that offers reusable components for efficient storage, transformation, statistical analysis and presentation of scientific data stored in Microsoft SQL Server. Graywulf also addresses the distributed computational issues arising from current RDBMS technologies. The current version supports load balancing of simple queries and parallel execution of partitioned queries over a set of mirrored databases. Uniform user access to the data is provided through a web based query interface and a data surface for software clients. Queries are formulated in a slightly modified syntax of SQL that offers a transparent view of the distributed data. The software library consists of several components that can be reused to develop complex scientific data warehouses: a system registry, administration tools to manage entire database server clusters, a sophisticated workflow execution framework, and a SQL parser library.
  • Getting spectra at good signal-to-noise ratios takes orders of magnitude more time than photometric observations. Building on the technique developed for photometric redshift estimation of galaxies, we develop and demonstrate a non-parametric photometric method for estimating the chemical composition of galactic stars. We investigate the efficiency of our method using spectroscopically determined stellar metallicities from SDSS DR7. The technique is generic in the sense that it is not restricted to certain stellar types or stellar parameter ranges, and it makes it possible to obtain metallicities and error estimates for a much larger sample than spectroscopic surveys would allow. We find that our method performs well, especially for brighter stars and higher metallicities, and, in contrast to many other techniques, we are able to reliably estimate the error of the predicted metallicities.
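
As an illustration of the non-parametric photometric estimation described in the last entry, the sketch below averages the known metallicities of the nearest training stars in colour space and uses their scatter as a rough error estimate. It is a minimal sketch and not the paper's method: the plain k-nearest-neighbour average, the function name `knn_estimate`, the two-colour toy training set and the choice `k=3` are all assumptions made here for illustration.

```python
import math


def knn_estimate(train_colors, train_values, query, k=5):
    """Estimate a quantity (e.g. [Fe/H]) at `query` from its k nearest
    training points in colour space. Returns (estimate, scatter).

    Hypothetical helper: a plain neighbour average stands in for whatever
    local estimator a real pipeline would use.
    """
    # Euclidean distance from the query point to every training point
    ranked = sorted(
        (math.dist(colors, query), value)
        for colors, value in zip(train_colors, train_values)
    )
    neighbours = [value for _, value in ranked[:k]]
    mean = sum(neighbours) / len(neighbours)
    # Sample scatter of the neighbours' known values as a rough error estimate
    var = sum((v - mean) ** 2 for v in neighbours) / max(len(neighbours) - 1, 1)
    return mean, math.sqrt(var)


# Toy training set (made-up numbers): two colours per star, known metallicity
train_colors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_feh = [-0.5, -0.6, -0.4, -1.5, -1.6, -1.4]

estimate, error = knn_estimate(train_colors, train_feh, (0.05, 0.05), k=3)
print(f"[Fe/H] = {estimate:.2f} +/- {error:.2f}")
```

Because the estimate depends only on nearby training points, the same routine applies to any stellar type covered by the training set, and the neighbour scatter gives a per-object error without assuming a parametric model.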