DPDS: Assisting Data Science with Data Provenance

Lookup NU author(s): Professor Paolo Missier

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND).


Abstract

© 2022, VLDB Endowment. All rights reserved. Successful data-driven science requires a complex combination of data engineering pipelines and data modelling techniques. Robust and defensible results can only be achieved when each pipeline step that cleans, transforms or otherwise alters data in preparation for modelling can be justified, and its effect on the data explained. The DPDS toolkit presented in this paper is designed to make this process of justification and explanation an integral part of data science practice, adding value while remaining as unobtrusive as possible to the analyst. Catering to the broad community of Python/pandas data engineers, DPDS implements an observer pattern that captures the fine-grained provenance associated with each individual element of a dataframe, across multiple transformation steps. The resulting provenance graph is stored in Neo4j and queried through a UI, with the goal of helping engineers and analysts justify and explain their choice of data operations, from raw data to model training, by highlighting the details of the changes made by each transformation.
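
The observer idea described in the abstract can be illustrated with a minimal sketch. This is not the DPDS API: the names `CellChange`, `ProvenanceTracker` and `ObservedTable` are hypothetical, a plain dict of column lists stands in for a pandas dataframe, and an in-memory list stands in for the Neo4j provenance graph. It only shows how an observer notified after each transformation step could record element-level changes across a pipeline.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class CellChange:
    """One element-level provenance record (hypothetical structure)."""
    step: str
    row: int
    column: str
    before: Any
    after: Any


@dataclass
class ProvenanceTracker:
    """Observer: logs every cell whose value differs after a step."""
    changes: list = field(default_factory=list)

    def notify(self, step, before, after):
        # Compare the two tables cell by cell. Assumes rows keep their
        # positions; dropped or reordered rows are out of scope here.
        for column, new_values in after.items():
            old_values = before.get(column, [None] * len(new_values))
            for row, (old, new) in enumerate(zip(old_values, new_values)):
                if old != new:
                    self.changes.append(
                        CellChange(step, row, column, old, new))

    def history(self, row, column):
        """All recorded changes for one cell, in pipeline order."""
        return [c for c in self.changes
                if c.row == row and c.column == column]


class ObservedTable:
    """Subject: a tiny stand-in for a dataframe that notifies a tracker."""

    def __init__(self, columns, tracker):
        self.columns = {name: list(vals) for name, vals in columns.items()}
        self.tracker = tracker

    def apply(self, step, transform):
        before = self.columns
        # Hand the transformation a copy so 'before' stays intact.
        after = transform({name: list(vals) for name, vals in before.items()})
        self.tracker.notify(step, before, after)
        self.columns = after
        return self


def fill_missing_age(cols):
    # Replace missing ages with the mean of the known ages.
    known = [a for a in cols["age"] if a is not None]
    mean = sum(known) / len(known)
    cols["age"] = [mean if a is None else a for a in cols["age"]]
    return cols


def age_to_decade(cols):
    # Bin each age down to its decade.
    cols["age"] = [int(a // 10) * 10 for a in cols["age"]]
    return cols


tracker = ProvenanceTracker()
table = ObservedTable(
    {"name": ["ann", "bob", "cy"], "age": [34, None, 52]}, tracker)
table.apply("impute_age", fill_missing_age).apply("bin_age", age_to_decade)

# Explain how the second row's age reached its final value.
for change in tracker.history(1, "age"):
    print(change.step, change.before, "->", change.after)
```

In this sketch the analyst's transformation functions stay unchanged; only the call site goes through `apply`, which is one way to keep the capture step out of the analyst's way. A graph store would replace the flat list when the changes need to be queried across many steps and elements.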


Publication metadata

Author(s): Chapman A, Missier P, Lauro L, Torlone R

Publication type: Article

Publication status: Published

Journal: Proceedings of the VLDB Endowment

Year: 2022

Volume: 15

Issue: 12

Pages: 3614-3617

Online publication date: 29/09/2022

Acceptance date: 02/04/2018

Date deposited: 16/06/2023

ISSN (electronic): 2150-8097

Publisher: VLDB Endowment

URL: https://doi.org/10.14778/3554821.3554857

DOI: 10.14778/3554821.3554857

