Indiana University Bloomington

Luddy School of Informatics, Computing, and Engineering

Technical Report TR725:
Provenance as Essential Infrastructure for Data Lakes

Isuru Suriarachchi and Beth Plale
(Apr 2016), 4 pages
[Accepted for IPAW 2016]
The Data Lake is emerging as a Big Data storage and management solution that can store any type of data at scale and execute data transformations for analysis. Greater flexibility in storage, however, increases the risk of Data Lakes becoming data swamps. In this paper, we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and propose a reference architecture for provenance usage in a Data Lake. Finally, we discuss the applicability of our tools in the proposed architecture.
