Automated Spam Detection in YouTube Comments: A Natural Language Processing and Gradient Boosting Approach

B. Dhashvanth Sai; E. Bhargavi; G. Srija Naidu; G. Aditya Srinivas; A. Vimal Kumar

doi:https://doi.org/10.47001/IRJIET/2025.INSPIRE22


B. Dhashvanth Sai, B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
E. Bhargavi, B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
G. Srija Naidu, B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
G. Aditya Srinivas, B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
A. Vimal Kumar, B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India

Vol 9 No 25 (2025): Volume 9, Special Issue of INSPIRE’25 April 2025 | Pages: 134-140

International Research Journal of Innovations in Engineering and Technology

OPEN ACCESS | Research Article | Published Date: 24-04-2025


Abstract

The rapid growth of social media platforms such as YouTube, Facebook, Twitter, and TikTok has revolutionized communication but has also led to an increase in spam and harmful content. Detecting spam comments automatically is crucial to maintaining a safe and engaging digital environment. This study proposes a spam detection model that combines Natural Language Processing (NLP) with XGBoost, a gradient boosting algorithm known for its efficiency and predictive accuracy. The model is trained on a dataset of YouTube comments and applies text preprocessing techniques (tokenization, stopword removal, and lemmatization) to improve detection accuracy. Evaluated alongside traditional classifiers such as Naïve Bayes and Linear SVM, the proposed NLP-XGBoost model achieves 94% accuracy in classifying comments as spam or non-spam. The results demonstrate the potential of machine learning for improving content moderation and safeguarding online interactions.
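The three preprocessing steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the stopword set, the lemma lookup table, and the function name `preprocess` are hypothetical stand-ins chosen for illustration, and a real pipeline would likely rely on a full stopword list and a proper lemmatizer before passing the resulting tokens to a vectorizer and the classifier.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list (assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of",
             "my", "this", "for", "on", "out"}

# Hypothetical lookup table standing in for a real lemmatizer (assumption).
LEMMAS = {
    "videos": "video",
    "channels": "channel",
    "subscribers": "subscriber",
    "checking": "check",
    "loved": "love",
}

# A token is a run of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(comment: str) -> List[str]:
    """Tokenize a comment, drop stopwords, and map words to base forms."""
    tokens = TOKEN_RE.findall(comment.lower())          # tokenization
    kept = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [LEMMAS.get(t, t) for t in kept]             # toy lemmatization


if __name__ == "__main__":
    print(preprocess("Check out my channels for FREE videos!"))
```

The token list returned for each comment is what a feature extractor would consume before classification; how the paper turns tokens into features for XGBoost is not stated in the abstract.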

Keywords

Spam Detection, NLP, XGBoost, Social Media, Text Classification, Machine Learning, Content Moderation, YouTube Comments

Citation of this Article

B. Dhashvanth Sai, E. Bhargavi, G. Srija Naidu, G. Aditya Srinivas, & A. Vimal Kumar. (2025). Automated Spam Detection in YouTube Comments: A Natural Language Processing and Gradient Boosting Approach. In Proceedings of the International Conference on Sustainable Practices and Innovations in Research and Engineering (INSPIRE’25), published by IRJIET, Volume 9, Special Issue of INSPIRE’25, pp. 134-140. Article DOI: https://doi.org/10.47001/IRJIET/2025.INSPIRE22

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

