SYNTHETIC DATA GENERATION FOR RESPONSIBLE TABULAR MACHINE LEARNING: REVIEW AND RESEARCH DIRECTIONS

Authors

  • Savita Harer, Shashank Swami

Abstract

Tabular machine learning is widely used in real-world applications such as healthcare, finance, and insurance. However, these systems face significant challenges, including data imbalance, bias, and privacy risks, which may affect model performance and reliability. Synthetic data generation has emerged as an effective solution to address these issues by creating artificial datasets that preserve statistical characteristics while protecting sensitive information.

This paper presents a comprehensive review of synthetic data generation techniques for tabular machine learning. It analyzes their effectiveness in improving data quality, mitigating bias, handling class imbalance, and ensuring privacy preservation. In addition, the study provides a structured taxonomy and comparative analysis of existing methods.

The review also identifies key research gaps, particularly the lack of unified frameworks and standardized evaluation metrics. It highlights the need for simple, scalable, and integrated solutions that balance performance, fairness, and privacy for real-world applications.

Published

2026-06-04

Section

Articles