by Rezig, El Kindi; Cao, Lei; Simonini, Giovanni; Schoemans, Maxime; Madden, Samuel; Ouzzani, Mourad; Tang, Nan; Stonebraker, Michael
Reference: Conference on Innovative Data Systems Research (10: 12-15 January 2020: Amsterdam, The Netherlands), CIDR 2020 Proceedings
Publication: Published, 2020-01-15

Publication in conference proceedings
Abstract: With the democratization of data science libraries and frameworks, most data scientists manage and generate their data analytics pipelines using a collection of scripts (e.g., Python, R). This marks a shift from traditional applications that communicate back and forth with a DBMS that stores and manages the application data. While code debuggers have reached impressive maturity over the past decades, they fall short in assisting users to explore data-driven what-if scenarios (e.g., split the training set into two and build two ML models). Those scenarios, while doable programmatically, are a substantial burden for users to manage themselves. Dagger (Data Debugger) is an end-to-end data debugger that abstracts key data-centric primitives to enable users to quickly identify and mitigate data-related problems in a given pipeline. Dagger was motivated by a series of interviews we conducted with data scientists across several organizations. A preliminary version of Dagger has been incorporated into Data Civilizer 2.0 to help physicians at the Massachusetts General Hospital process complex pipelines.