Transfer Learning Using a CNN Fused Random Forest for SMS Spam Detection with Semantic Normalization of Text Corpus
DOI:
https://doi.org/10.37933/nipes/7.2.2025.29
Abstract
With an estimated 59 billion short messages sent daily worldwide in 2023, grossing an estimated revenue of $327 billion, the Short Messaging Service (SMS) has become one of the most popular modes of interaction among network users with compatible devices. Its status as a preferred channel of exchange has, in turn, led to the proliferation of spam. Spam messages are inappropriate, unsolicited messages sent indiscriminately to unsuspecting users; they have become a menace because of the huge revenue losses to telcos and the inconvenience, annoyance, and distraction they cause users. To curb this menace, telcos have explored machine-learning-based filters, whose performance has in many cases been found to degrade owing to the limited character size of messages, the limited availability of text corpus datasets, and the imbalanced nature of the text corpus. Our study proposes a convolutional neural network (CNN) fused with a Random Forest (RF) ensemble that explores vectorization and word embedding with a semantic approach, together with text normalization, as preprocessing modes to effectively identify SMS spam. With SMOTE-Tomek balancing and the vectorization approach, the proposed CNN-RF outperformed the benchmark models, with an F1 of 98.92%, accuracy of 98.35%, precision of 96.89%, and recall of 99.01%. The benchmark models (Memetic algorithm, Random Forest, Logistic Regression, and CNN) yielded F1 scores of 81.45%, 91.05%, 92.1%, and 91.25% and accuracies of 80.32%, 91.05%, 92.28%, and 95.74%, respectively. In addition, the proposed CNN-RF identifies SMS spam with 98.35% accuracy, classifying 1,429 cases of the test dataset with only 15 incorrectly classified.
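For readers unfamiliar with the reported measures, the following is a minimal sketch of how accuracy, precision, recall, and F1 are conventionally derived from confusion-matrix counts, with spam as the positive class. The function name and the counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive (spam) class.

    tp: spam correctly flagged, fp: ham wrongly flagged,
    fn: spam missed, tn: ham correctly passed.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not the study's data).
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On an imbalanced corpus, accuracy alone can look high even when many spam messages are missed, which is why precision, recall, and F1 are reported alongside it.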
The study shows that combining semantic text normalization and vectorization with SMOTE-Tomek balancing improves the model's overall performance, yielding effective identification and classification of SMS spam.
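To illustrate what a text normalization step of this kind might look like, here is a minimal sketch using only the Python standard library. The abstract does not specify the study's normalization rules, so the slang lexicon, the placeholder tokens (`<url>`, `<phone>`, `<number>`), the regular expressions, and the function name `normalize_sms` are all assumptions made for illustration.

```python
import re

# Hypothetical slang lexicon; the study's actual dictionary is not given.
SLANG = {"u": "you", "ur": "your", "txt": "text",
         "pls": "please", "msg": "message"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")   # rough phone-like digit runs
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
TOKEN_RE = re.compile(r"<[a-z]+>|[a-z']+")     # placeholders or words


def normalize_sms(text):
    """Lowercase, mask URLs/phones/numbers, tokenize, expand slang."""
    text = text.lower()
    text = URL_RE.sub(" <url> ", text)
    text = PHONE_RE.sub(" <phone> ", text)      # before generic numbers
    text = NUMBER_RE.sub(" <number> ", text)
    tokens = TOKEN_RE.findall(text)
    return [SLANG.get(tok, tok) for tok in tokens]


sample = ("URGENT! U have won 1000 cash, txt WIN to 0800 123 4567 "
          "or visit www.example.com pls")
print(normalize_sms(sample))
```

Mapping variable surface forms (links, phone numbers, amounts, abbreviations) to a small set of canonical tokens shrinks the vocabulary seen by the vectorizer, which matters for short messages where each token carries a large share of the signal.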