Whose Truth? Power, Labor, and the Production of Ground-Truth Data
To satisfy the voracious demand for more, cheaper, and increasingly differentiated data for machine learning (ML), tasks such as data collection, curation, and annotation are outsourced through specialized firms and platforms. The data workers who perform these tasks are kept apart from the rest of the ML production chain, work under precarious conditions, and are subject to continuous surveillance. This dissertation focuses on business process outsourcing companies (BPOs) where ground-truth data is produced, that is, the labeled data used to train and validate most supervised ML models. Through fieldwork at two BPOs located in Argentina and Bulgaria, interviews with data workers, managers, and ML practitioners, and a longitudinal participatory design engagement with workers at both organizations, this dissertation situates data production in specific settings shaped by particular market demands, local contexts, and labor constellations. It expands on previous research on data creation and crowdsourcing by examining the economic imperatives and labor relationships that shape ML supply chains and by arguing that labor is a fundamental aspect to be integrated into ML ethics discourses.

The findings show that ground-truth data is the product of subjective and asymmetrical social and labor relationships. Narrow instructions and work interfaces, precarized labor conditions, and local contexts shaped by economic crises keep data workers obedient to managers and clients. In such constellations, clients can impose their preferred “truth values” on data as long as they have the financial means to pay the workers who execute that imposition. Naturalized yet arbitrary forms of knowledge are thus inscribed in data through these production processes.

This dissertation argues that documentation practices are key to making the naturalized “truths” encoded in data visible and contestable. The collaborative documentation of data production processes can preserve moments of dissent, enable feedback loops, and center workers’ voices. The findings yield a series of considerations for designing documentation frameworks that allow data workers to intervene in the shaping of task instructions, the data produced through their labor, and, ultimately, the production processes themselves. Improving material conditions in data work, empowering workers, recognizing their labor as a powerful tool for producing better data, and documenting data production processes in detail are essential steps toward spaces of reflection, deliberation, and audit that help address important social and ethical questions surrounding ML technologies.