PhD Thesis - Inference and Learning Systems for Uncertain Relational Data


Representing uncertain information and reasoning over it are of foremost importance for real-world applications. The research field of Statistical Relational Learning (SRL) tackles these challenges. SRL combines principles and ideas from three important subfields of Artificial Intelligence: machine learning, knowledge representation, and reasoning under uncertainty. The distribution semantics provides a powerful mechanism for combining logic and probability theory.
The distribution semantics has so far been applied to extend Logic Programming (LP) languages such as Prolog, and it represents one of the most successful approaches to Probabilistic Logic Programming (PLP): several PLP languages adopt it, such as PRISM, ProbLog and LPADs. However, with the birth of the Semantic Web, which uses Description Logics (DLs) to represent knowledge, it has become increasingly important to have Probabilistic Description Logics (PDLs). The DISPONTE semantics was developed for this purpose and applies the distribution semantics to description logics.
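Under the distribution semantics, a program defines a probability distribution over "worlds", one for each choice of which probabilistic facts hold; the probability of a query is the sum of the probabilities of the worlds in which the query succeeds. The following minimal Python sketch illustrates this idea by brute-force enumeration on a toy program (the facts, probabilities, and rule are illustrative examples, not drawn from the thesis; real systems such as ProbLog avoid this exponential enumeration, e.g. via knowledge compilation):

```python
from itertools import product

# Toy probabilistic facts (illustrative): 0.3::burglary.  0.2::earthquake.
facts = {"burglary": 0.3, "earthquake": 0.2}

def query_holds(world):
    # Deterministic rules: alarm :- burglary.  alarm :- earthquake.
    return world["burglary"] or world["earthquake"]

def probability_of_query():
    # Sum the probability of every world (truth assignment to the
    # probabilistic facts) in which the query succeeds.
    total = 0.0
    names = list(facts)
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        p = 1.0
        for name in names:
            p *= facts[name] if world[name] else 1.0 - facts[name]
        if query_holds(world):
            total += p
    return total

print(probability_of_query())  # ≈ 0.44, i.e. 1 - 0.7 * 0.8
```

Here P(alarm) = 1 - (1 - 0.3)(1 - 0.2) = 0.44: the query fails only in the single world where neither fact is chosen.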
The main objective of this dissertation is to propose approaches for reasoning and learning on uncertain relational data. The first part concerns reasoning over uncertain data. In particular, with regard to reasoning in PLP, we present the latest advances in the cplint system, which allows hybrid programs, i.e., programs in which some of the random variables are continuous, as well as causal inference. Moreover, cplint offers a web interface, named cplint on SWISH, which allows the user to easily experiment with the system. To perform inference on PDLs that follow DISPONTE, a suite of algorithms was developed: BUNDLE (“Binary decision diagrams for Uncertain reasoNing on Description Logic thEories”), TRILL (“Tableau Reasoner for descrIption Logics in Prolog”) and TRILL^P (“TRILL powered by Pinpointing formulas”).
The second part, which focuses on learning, considers two problems: parameter learning and structure learning. We describe the systems EDGE (“Em over bDds for description loGics paramEter learning”) for parameter learning and LEAP (“LEArning Probabilistic description logics”) for structure learning of PDLs. Running these algorithms, as well as their PLP counterparts EMBLEM for parameter learning and SLIPCOVER for structure learning, is computationally expensive, taking a few hours on datasets of the order of megabytes. To manage larger datasets efficiently in the era of Big Data and Linked Open Data, it is extremely important to develop fast learning algorithms. One solution is to distribute the algorithms over modern computing infrastructures such as clusters and clouds. We therefore extended EMBLEM, SLIPCOVER, EDGE and LEAP to exploit these facilities by developing their MapReduce versions: EMBLEM^MR, SEMPRE, EDGE^MR and LEAP^MR.
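The key property that makes a MapReduce formulation of EM-based parameter learning possible is that the expected counts of the E-step decompose as sums over training examples: each worker computes partial counts on its data partition (map), the partial counts are added (reduce), and the M-step normalizes. The sketch below illustrates this decomposition on a deliberately trivial model with a single parameter and missing observations; all names are hypothetical and this is not the actual EMBLEM^MR implementation, which operates on BDDs:

```python
from functools import reduce

def map_partition(examples, theta):
    # "Map": expected counts for one data partition under the current
    # parameter theta (n1 = expected successes, n0 = expected failures).
    n1 = n0 = 0.0
    for observed in examples:
        if observed is None:          # missing value: use its expectation
            n1 += theta
            n0 += 1.0 - theta
        elif observed:
            n1 += 1.0
        else:
            n0 += 1.0
    return (n1, n0)

def reduce_counts(a, b):
    # "Reduce": partial counts from different workers simply add up.
    return (a[0] + b[0], a[1] + b[1])

def em_step(partitions, theta):
    n1, n0 = reduce(reduce_counts, (map_partition(p, theta) for p in partitions))
    return n1 / (n1 + n0)            # M-step: normalize to a probability

# Toy run: data split across three "workers"; None marks a missing value.
partitions = [[True, True, None], [False, True], [None, False, True]]
theta = 0.5
for _ in range(20):
    theta = em_step(partitions, theta)
print(theta)  # converges to 2/3
```

Because only the small `(n1, n0)` pairs travel between workers, the communication cost is independent of the partition size, which is what makes the distributed versions scale to larger datasets.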
We tested the proposed approaches on real-world datasets, and their performance was comparable to, or better than, that of state-of-the-art systems.