Accountability in Machine Learning Datasets: Best Practices from Software Engineering

Published 2025-12-03


Affiliation: School of Engineering & Information Technology, Sanskriti University, Mathura

Abstract

Datasets that power machine learning are often used, shared, and reused with little visibility into the deliberation that led to their creation. As artificial intelligence systems are increasingly used in high-stakes tasks, system development and deployment practices must adapt to address the real consequences of how model development data is constructed and used in practice. This includes greater transparency about data and accountability for the decisions made when developing it. In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework builds on the cyclical, infrastructural, and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields documents that facilitate improved communication and decision-making and that draw attention to the value and necessity of careful data work. The proposed framework makes visible the often-overlooked work and decisions that go into dataset creation, a critical step in closing the accountability gap in artificial intelligence and a necessary resource aligned with recent work on auditing processes.


Issue: Most Recent

Section: Review

This work is licensed under a Creative Commons Attribution 4.0 International License.
