-
Over the last few years, data analytics shifted from a descriptive era,
confined to the explanation of past events, to the emergence of predictive
techniques. Nonetheless, existing predictive techniques still fail to
effectively explore alternative futures, which continuously diverge from
current situations when exploring the effects of what-if decisions. Enabling
prescriptive analytics therefore calls for the design of scalable systems that
can cope with the complexity and the diversity of underlying data models. In
this article, we address this challenge by combining graphs and time series
within a scalable storage system that can organize a massive amount of
unstructured and continuously changing data into multi-dimensional data models,
called Many-Worlds Graphs. We demonstrate that our open source implementation,
GreyCat, can efficiently fork and update thousands of parallel worlds composed
of millions of timestamped nodes, as required by what-if exploration.
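
The forking behaviour described above can be illustrated with a minimal sketch. It is not GreyCat's API: the class `ManyWorldsStore` and its methods are hypothetical names, and the sketch assumes that a forked world stores only its own writes and resolves everything else through its parent, up to the fork time.

```python
import bisect


class ManyWorldsStore:
    """Toy store that resolves a node's value by (world, time).

    Hypothetical illustration only; not the GreyCat implementation.
    """

    def __init__(self):
        self.parents = {0: None}  # world id -> (parent world, fork time), None for the root
        self.timelines = {}       # (world, node) -> (sorted times, values)
        self._next_world = 1

    def fork(self, parent, at_time):
        """Create a world that diverges from `parent` at `at_time` without copying data."""
        world = self._next_world
        self._next_world += 1
        self.parents[world] = (parent, at_time)
        return world

    def write(self, world, node, time, value):
        """Record `value` for `node` at `time`, visible in `world` and its descendants."""
        times, values = self.timelines.setdefault((world, node), ([], []))
        i = bisect.bisect_left(times, time)
        if i < len(times) and times[i] == time:
            values[i] = value
        else:
            times.insert(i, time)
            values.insert(i, value)

    def read(self, world, node, time):
        """Return the latest value at or before `time`, falling back to ancestor worlds."""
        while True:
            times, values = self.timelines.get((world, node), ([], []))
            i = bisect.bisect_right(times, time)
            if i > 0:
                return values[i - 1]
            parent = self.parents[world]
            if parent is None:
                return None
            # The child only sees what the parent knew up to the fork time.
            world, fork_time = parent
            time = min(time, fork_time)


store = ManyWorldsStore()
store.write(0, "meter", 10, 5.0)
store.write(0, "meter", 20, 7.0)
what_if = store.fork(0, at_time=15)
store.write(what_if, "meter", 18, 99.0)
```

In this sketch a fork costs one dictionary entry regardless of graph size, which is the property that makes exploring many alternative worlds plausible; a real system would additionally need persistence and indexing, which are out of scope here.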
-
Particle physics has an ambitious and broad experimental programme for the
coming decades. This programme requires large investments in detector hardware,
either to build new facilities and experiments, or to upgrade existing ones.
Similarly, it requires commensurate investment in the R&D of software to
acquire, manage, process, and analyse the sheer amounts of data to be recorded.
In planning for the HL-LHC in particular, it is critical that all of the
collaborating stakeholders agree on the software goals and priorities, and that
the efforts complement each other. In this spirit, this white paper describes
the R&D activities required to prepare for this software upgrade.
-
T-distributed stochastic neighbor embedding (tSNE) is a popular and
prize-winning approach for dimensionality reduction and visualizing
high-dimensional data. However, tSNE is non-parametric: once a visualization is
built, tSNE is not designed to incorporate additional data into the existing
representation. This severely limits the applicability of tSNE to scenarios
where data are added or updated over time (like dashboards or series of data
snapshots).
In this paper we propose, analyze and evaluate LION-tSNE (Local Interpolation
with Outlier coNtrol) - a novel approach for incorporating new data into a tSNE
representation. LION-tSNE is based on local interpolation in the vicinity of
training data, outlier detection and a special outlier mapping algorithm. We
show that the LION-tSNE method is robust both to outliers and to new samples from
existing clusters. We also discuss multiple possible improvements for special
cases.
We compare LION-tSNE to a comprehensive list of possible benchmark approaches
that include multiple interpolation techniques, gradient descent for new data,
and neural network approximation.
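
The combination of local interpolation and outlier handling can be sketched as follows. This is an illustrative simplification, not the LION-tSNE algorithm itself: the function `embed_new_point`, its parameters, the use of inverse distance weighting, and the rule that places outliers just beyond the right edge of the existing embedding are all assumptions made for the example.

```python
import math


def embed_new_point(x, train_x, train_y, radius, power=2.0, margin=1.0):
    """Map a new sample into an existing embedding.

    Returns (embedded position, is_outlier). Hypothetical sketch: inverse
    distance weighting stands in for the interpolation step, and outliers
    are simply placed to the right of all existing embedded points.
    """
    neighbors = []
    for xi, yi in zip(train_x, train_y):
        d = math.dist(x, xi)
        if d <= radius:
            neighbors.append((d, yi))

    if not neighbors:
        # No training sample nearby: treat as an outlier and keep it away
        # from existing clusters instead of interpolating.
        max_first = max(y[0] for y in train_y)
        mean_second = sum(y[1] for y in train_y) / len(train_y)
        return (max_first + margin, mean_second), True

    for d, yi in neighbors:
        if d == 0.0:
            # Exact match with a training sample: reuse its position.
            return tuple(yi), False

    weights = [d ** -power for d, _ in neighbors]
    total = sum(weights)
    dims = len(train_y[0])
    position = tuple(
        sum(w * yi[k] for w, (_, yi) in zip(weights, neighbors)) / total
        for k in range(dims)
    )
    return position, False


train_x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
train_y = [(0.0, 0.0), (2.0, 0.0), (50.0, 50.0)]
inside, inside_flag = embed_new_point((0.5, 0.0, 0.0), train_x, train_y, radius=1.0)
far, far_flag = embed_new_point((100.0, 100.0, 100.0), train_x, train_y, radius=1.0)
```

The key design point the sketch tries to convey is that interpolation is only trusted near training data; samples with no neighbors within `radius` are routed to a separate placement step.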
-
Smart systems are characterised by their ability to analyse measured data
live and to react to changes according to expert rules. Therefore, such systems
exploit appropriate data models together with actions, triggered by
domain-related conditions. The challenge at hand is that smart systems usually
need to process thousands of updates to detect which rules need to be
triggered, often even on restricted hardware like a Raspberry Pi. Although
various approaches have been investigated to efficiently check conditions on
data models, they either assume that the models fit into main memory or rely on
high-latency persistent storage systems that severely damage the reactivity of smart
systems. To tackle this challenge, we propose a novel composition process,
which weaves executable rules into a data model with lazy loading abilities. We
quantitatively show, on a smart building case study, that our approach can
handle, at low latency, big sets of rules on top of large-scale data models on
restricted hardware.
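
The idea of weaving rules into a lazily loaded data model can be sketched as follows. This is a minimal illustration under stated assumptions, not the composition process evaluated in the case study: the class `LazyModel`, its methods, and the dictionary used as a stand-in for persistent storage are hypothetical.

```python
class LazyModel:
    """Data model that loads elements on demand and keeps rules next to them."""

    def __init__(self, storage):
        self.storage = storage  # dict standing in for slow persistent storage
        self.cache = {}         # elements currently held in memory
        self.rules = {}         # element id -> list of (condition, action)
        self.loads = 0          # number of storage accesses, for inspection

    def weave(self, element_id, condition, action):
        """Attach a rule to the element whose state it depends on."""
        self.rules.setdefault(element_id, []).append((condition, action))

    def _load(self, element_id):
        # Only touch storage the first time an element is needed.
        if element_id not in self.cache:
            self.cache[element_id] = dict(self.storage[element_id])
            self.loads += 1
        return self.cache[element_id]

    def update(self, element_id, attribute, value):
        """Apply an update and evaluate only the rules woven into that element."""
        element = self._load(element_id)
        element[attribute] = value
        fired = []
        for condition, action in self.rules.get(element_id, []):
            if condition(element):
                fired.append(action(element))
        return fired


storage = {"room1": {"temp": 20}, "room2": {"temp": 21}}
model = LazyModel(storage)
model.weave("room1", lambda e: e["temp"] > 25, lambda e: "cool room1")
hot = model.update("room1", "temp", 30)
mild = model.update("room1", "temp", 22)
```

Because each rule is indexed by the element it depends on, an update neither scans the full rule set nor loads unrelated parts of the model, which is the property that matters on memory-constrained hardware.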
-
Gaining profound insights from collected data of today's application domains
like IoT, cyber-physical systems, health care, or the financial sector is
business-critical and can create the next multi-billion dollar market. However,
analyzing these data and turning them into valuable insights is a huge challenge.
This is often due not only to the large volume of data but also to an
incredibly high domain complexity, which makes it necessary to combine various
extrapolation and prediction methods to understand the collected data.
Model-driven analytics is a refinement process of raw data driven by a model
reflecting deep domain understanding, connecting data, domain knowledge, and
learning.
-
From 2012 to 2015, together with other Linked Data community members and
experts from the social, behavioral, and economic sciences (SBE), we developed
diverse vocabularies to represent SBE metadata and tabular data in RDF. The
DDI-RDF Discovery Vocabulary (DDI-RDF) is designed to support the
dissemination, management, and reuse of unit-record data, i.e., data about
individuals, households, and businesses, collected in the form of responses to
studies and archived for research purposes. The RDF Data Cube Vocabulary (QB)
is a W3C recommendation for expressing data cubes, i.e., multi-dimensional
aggregate data and its metadata. Physical Data Description (PHDD) is a
vocabulary to model data in rectangular format, i.e., tabular data. The data
could be represented in records with either character-separated values (CSV) or
fixed length. The Simple Knowledge Organization System (SKOS) is a vocabulary
to build knowledge organization systems such as thesauri, classification
schemes, and taxonomies. XKOS is a SKOS extension to describe formal
statistical classifications.
To ensure high quality of and trust in both metadata and data, their
representation in RDF must satisfy certain criteria - specified in terms of RDF
constraints. In this paper, we evaluate the data quality of 15,694 data sets
(4.26 billion triples) of research data for the social, behavioral, and
economic sciences obtained from 33 SPARQL endpoints. We checked 115 constraints
on three different and representative SBE vocabularies (DDI-RDF, QB, and SKOS)
by means of the RDF Validator, a validation environment which is available at
http://purl.org/net/rdfval-demo.
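
As an illustration of what checking one such constraint involves, the sketch below tests the SKOS integrity condition that a resource has at most one `skos:prefLabel` per language tag. It is not the RDF Validator's implementation: the function name, the representation of triples as plain Python tuples, and the sample data are assumptions made for the example.

```python
from collections import defaultdict

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def duplicate_pref_labels(triples):
    """Return (subject, language) pairs that carry more than one preferred label.

    `triples` is an iterable of (subject, predicate, object) tuples in which
    label objects are (text, language tag) pairs. Hypothetical stand-in for
    a real RDF graph.
    """
    labels = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == SKOS_PREF_LABEL:
            text, lang = obj
            labels[(subject, lang)].add(text)
    return sorted(key for key, texts in labels.items() if len(texts) > 1)


sample = [
    ("ex:a", SKOS_PREF_LABEL, ("Economy", "en")),
    ("ex:a", SKOS_PREF_LABEL, ("Economics", "en")),
    ("ex:a", SKOS_PREF_LABEL, ("Wirtschaft", "de")),
    ("ex:b", SKOS_PREF_LABEL, ("Household", "en")),
    ("ex:b", "http://www.w3.org/2004/02/skos/core#broader", "ex:a"),
]
violations = duplicate_pref_labels(sample)
```

Each reported pair corresponds to one constraint violation; at the scale of the evaluation described above, the same check would normally be expressed as a query against the SPARQL endpoints rather than as an in-memory loop.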
-
To ensure high quality of and trust in both metadata and data, their
representation in RDF must satisfy certain criteria - specified in terms of RDF
constraints. From 2012 to 2015, together with other Linked Data community
members and experts from the social, behavioral, and economic sciences (SBE),
we developed diverse vocabularies to represent SBE metadata and rectangular
data in RDF.
The DDI-RDF Discovery Vocabulary (DDI-RDF) is designed to support the
dissemination, management, and reuse of unit-record data, i.e., data about
individuals, households, and businesses, collected in the form of responses to
studies and archived for research purposes. The RDF Data Cube Vocabulary (QB)
is a W3C recommendation for expressing data cubes, i.e., multi-dimensional
aggregate data and its metadata. Physical Data Description (PHDD) is a
vocabulary to model data in rectangular format, i.e., tabular data. The data
could be represented in records with either character-separated values (CSV) or
fixed length. The Simple Knowledge Organization System (SKOS) is a vocabulary
to build knowledge organization systems such as thesauri, classification
schemes, and taxonomies. XKOS is a SKOS extension to describe formal
statistical classifications.
In this paper, we describe RDF constraints to validate metadata on
unit-record data (DDI-RDF), aggregated data (QB), thesauri (SKOS), and
statistical classifications (XKOS) and to validate tabular data (PHDD) - all of
them represented in RDF. We classified these constraints according to the
severity of occurring constraint violations. This technical report is updated
continuously as modifying, adding, and deleting constraints remains ongoing
work.
-
Intelligent software systems continuously analyze their surrounding
environment and accordingly adapt their internal state. Depending on the
criticality index of the situation, the system should dynamically focus or
widen its analysis and reasoning scope. A naive "why have less when you can
have more" approach would consist in systematically sampling the context at a
very high rate and triggering the reasoning process regularly. This reasoning
process would then need to mine a huge amount of data, extract a relevant view,
and finally analyze this adequate view. This overall process would require some
heavy resources and/or be time-consuming, conflicting with the (near) real-time
response time requirements of intelligent systems. We claim that a continuous
and more flexible navigation into context models, in space and in time, can
significantly improve reasoning processes. This paper thus introduces a novel
modeling approach together with a navigation concept, which consider time and
locality as first-class properties crosscutting any model element, and enable
the seamless navigation of models in this space-time continuum. In particular,
we leverage a time-relative navigation (inspired by the space-time and
distortion theory [7]) able to efficiently empower continuous reasoning
processes. We integrate our approach into an open-source modeling framework and
evaluate it on a smart grid reasoning engine for electric load prediction. We
demonstrate that reasoners leveraging this distorted space-time continuum
outperform the full sampling approach and are compatible with most (near)
real-time requirements.