Journal article

AutoTag: a framework for generating training data

Abstract

Machine learning approaches have been successful on text data, with a wide range of applications such as sentiment analysis of customer reviews, trending-topic detection on social networks, and meaningful concept extraction from product data. For classification tasks, supervised approaches perform much better than unsupervised ones. However, supervised models require training data, and the amount required depends on the complexity of the task at hand. Generating training data is usually manual and hence requires significant human involvement, mostly from domain experts. Generating training data, particularly in large volumes, is time-consuming, expensive, and error-prone due to fatigue. The objective of this article is to use machine-learning-based methods for rapid generation of training data without human supervision. We propose a framework named AutoTag that utilizes a carefully selected ensemble of innovative heuristics, unsupervised methods, and semi-supervised methods. It constructs a model with high accuracy even with little labelled data and is found to be competitive with state-of-the-art models. We evaluated the performance of AutoTag against semi-supervised methods and LLMs on a number of classification tasks; AutoTag's performance was superior in most cases. Using only 800 training instances, AutoTag-generated labels had accuracy above 90%. Thus AutoTag can be used to generate quality training data at scale, thereby reducing cost and time. Besides, it can be used for fast quality checks on data annotated by unskilled annotators. Based on a practical deployment, we observed that average quality-control time is reduced by 45% when AutoTag is used.

Authors

Mondal SA; Bhattacharaya T; Rai A; Sodhi GS; Bansal R; Mondal A; Gupta A

Journal

Progress in Artificial Intelligence, Vol. 14, No. 2, pp. 191–210

Publisher

Springer Nature

Publication Date

June 1, 2025

DOI

10.1007/s13748-024-00360-x

ISSN

2192-6352
