[This article belongs to Volume - 57, Issue - 02, 2025]
Gongcheng Kexue Yu Jishu/Advanced Engineering Science
Journal ID : AES-26-10-2025-16

Title : NEWS CLASSIFICATION USING LDA-BASED TOPIC MODELLING AND MACHINE LEARNING
P. Malaiarasu, Dr. R. Kalaimagal

Abstract :

In the era of information explosion, the ability to classify news content automatically has become crucial for organizing and retrieving relevant information from massive digital streams. The continuous generation of unstructured textual data poses significant challenges in terms of processing efficiency, accuracy, and scalability. Big data platforms offer a promising solution by enabling distributed processing and high-speed computation. This study presents a scalable framework for news classification built on Apache Spark using PySpark. Among Natural Language Processing (NLP) techniques, Latent Dirichlet Allocation (LDA) is applied to extract latent topics from the news corpus. Compared with traditional keyword-based models, LDA captures semantic relationships between words, enabling deeper topic understanding and more informative feature extraction. These topic-driven features are then used to train classifiers that categorize news articles into predefined classes. The LDA-based framework with tuned Machine Learning (ML) classifiers achieved a classification accuracy of 86.05%, indicating reliable and consistent performance in topic-based news classification. The integration of PySpark, LDA, and ML techniques supports high performance, scalability, and interpretability. The proposed model demonstrates strong applicability for real-time news filtering, trend detection, and intelligent content management in large-scale media environments.
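
The idea of using LDA topic proportions as classifier features can be illustrated with a minimal, single-machine sketch in plain Python. This is not the PySpark implementation described above: the collapsed Gibbs sampler, the nearest-centroid classifier, the toy documents, the labels, and all function names and hyperparameter values (fit_lda, train_centroids, predict, alpha, beta, iters) are assumptions made only for illustration, since the abstract does not specify the classifiers or settings used.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the paper's code).

    docs is a list of token lists. Returns one topic-proportion vector per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions: these are the classifier features.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


def train_centroids(features, labels):
    """Average the topic vectors of each class (assumed stand-in classifier)."""
    sums, counts = {}, Counter(labels)
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], vec)),
    )


# Hypothetical toy corpus; the real study uses a large news corpus.
docs = [
    "team wins match after late goal".split(),
    "coach praises team after match".split(),
    "striker scores goal in final match".split(),
    "market falls as bank raises rates".split(),
    "bank shares drop as market slides".split(),
    "investors watch rates and market".split(),
]
labels = ["sports"] * 3 + ["business"] * 3

theta = fit_lda(docs, num_topics=2)
centroids = train_centroids(theta, labels)
predictions = [predict(centroids, vec) for vec in theta]
print(predictions)
```

Each document is reduced to a short vector of topic proportions, so the classifier works in a space whose size equals the number of topics and not the vocabulary size. In a Spark setting the same two stages would be distributed across a cluster; the sketch only shows the data flow on one machine.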