EDBT/ICDT 2010 Joint Conference

Electronic Conference Proceedings

Provenance for Database Transformations

Author

Abstract

Database transformations (queries, views, mappings) take apart, filter, and recombine source data in order to populate warehouses, materialize views, and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations' output. This relationship is what we call database provenance.

This talk presents an approach to database provenance that is based on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example to uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation, in a precise and practically useful sense. Second, the propagation of annotation through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively. This leads to annotations forming a specific algebraic structure, a commutative semiring.
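The two operations described above can be illustrated with a minimal sketch. It is not taken from the talk: the `Semiring` class, the `compose` function, and the sample relations are hypothetical names chosen for illustration. The sketch assumes relations are stored as dictionaries from tuples to annotations, uses the semiring's multiplication when two tuples are used jointly (a join), and uses its addition when the same output tuple can be derived in alternative ways (a projection).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, plus, times, zero, one). Hypothetical helper."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines jointly used annotations


# Two familiar instances: natural numbers (bag semantics) and Booleans (set semantics).
COUNTING = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
BOOLEAN = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)


def compose(sr, r, s):
    """Join r(a, b) with s(b, c) on b, then project the result onto (a, c)."""
    out = {}
    for (a, b), k1 in r.items():
        for (b2, c), k2 in s.items():
            if b == b2:
                joint = sr.times(k1, k2)  # both source tuples are used jointly
                # the same (a, c) may arise from several b values: alternatives
                out[(a, c)] = sr.plus(out.get((a, c), sr.zero), joint)
    return out


# Illustrative data: annotations are multiplicities in the counting semiring.
r = {("a", "b"): 2, ("a", "c"): 3}
s = {("b", "d"): 5, ("c", "d"): 1}
bag_result = compose(COUNTING, r, s)  # 2*5 + 3*1 = 13 copies of ("a", "d")

# The same query over Boolean annotations behaves like ordinary set semantics.
r_set = {t: True for t in r}
s_set = {t: True for t in s}
set_result = compose(BOOLEAN, r_set, s_set)
```

The query code is written once and only the semiring changes, which is the sense in which a single propagation scheme can serve several kinds of annotation.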

The semiring approach works for annotating tuples, field values and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control.
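As a rough illustration of how specific semirings yield different readings of the same provenance, the sketch below evaluates one provenance polynomial under three semirings. Everything in it is an assumption made for illustration: the dictionary encoding of polynomials, the `evaluate` function, the variable names `p`, `r`, `s`, and the reading of the Boolean semiring as trust and of the (min, +) semiring as cost.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, plus, times, zero, one). Hypothetical helper."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


COUNTING = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
BOOLEAN = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)
# Tropical semiring: alternatives take the cheaper option, joint use adds costs.
TROPICAL = Semiring(float("inf"), 0.0, min, lambda x, y: x + y)


def evaluate(poly, sr, valuation):
    """Evaluate a provenance polynomial in the semiring `sr`.

    `poly` maps each monomial (a tuple of variable names) to a natural-number
    coefficient; `valuation` maps each variable to an element of `sr`.
    """
    total = sr.zero
    for monomial, coeff in poly.items():
        term = sr.one
        for var in monomial:
            term = sr.times(term, valuation[var])
        for _ in range(coeff):  # a coefficient n stands for an n-fold sum
            total = sr.plus(total, term)
    return total


# The polynomial 2*p*r + s: two derivations that use p and r jointly,
# and one alternative derivation that uses s alone.
poly = {("p", "r"): 2, ("s",): 1}

multiplicity = evaluate(poly, COUNTING, {"p": 2, "r": 3, "s": 4})
trusted = evaluate(poly, BOOLEAN, {"p": True, "r": False, "s": True})
cheapest = evaluate(poly, TROPICAL, {"p": 1.0, "r": 2.0, "s": 5.0})
```

Under these assumptions the same polynomial gives a multiplicity of 16, a trusted output tuple (via the derivation through `s`), and a minimum derivation cost of 3.0.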

This is joint work with J.N. Foster, T.J. Green, Z. Ives, and G. Karvounarakis, done in part within the frameworks of the Orchestra and pPOD projects.

About the Speaker

Val Tannen (University of Pennsylvania, USA)

Val Tannen is a professor in the Department of Computer and Information Science of the University of Pennsylvania. He joined Penn after receiving his PhD from the Massachusetts Institute of Technology in 1987. He has always been interested in applications of Logic to Computer Science. After working for a time in Programming Languages, he now focuses mainly on Databases. He and his students and collaborators have worked on query language design and on models and systems for query optimization, parallel query processing, and data integration. More recently, their work has focused on models and systems for data sharing, data provenance, and the management of uncertain information. Since 1994 he has also worked in Bioinformatics, leading projects in genomics and phyloinformatics through the Penn Center for Bioinformatics and Genomic Frontiers Institute as well as through large NSF projects such as CIPRES (TreeBASE II), pPOD (data management for AToL), and the iPlant Collaborative (iPToL).

Session

EDBT Invited Talk: Provenance for Database Transformations (Wednesday, March 24, 09:00–10:30)