Automated Spam Detection in YouTube Comments: A Natural Language Processing and Gradient Boosting Approach

Abstract

The rapid growth of social media platforms such as YouTube, Facebook, Twitter, and TikTok has transformed communication, but it has also increased the volume of spam and harmful content. Detecting spam comments automatically is crucial to maintaining a safe and engaging digital environment. This study proposes a spam detection model that combines Natural Language Processing (NLP) with XGBoost, a gradient boosting algorithm known for its efficiency and predictive accuracy. The model is trained on a dataset of YouTube comments and applies text preprocessing (tokenization, stopword removal, and lemmatization) before classification. Evaluated against traditional classifiers such as Naïve Bayes and Linear SVM, the proposed NLP-XGBoost model achieves 94% accuracy in classifying comments as spam or non-spam. The results demonstrate the potential of machine learning for improving content moderation and safeguarding online interactions.
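The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, the stopword list and suffix rules are hypothetical, and the suffix stripping is a crude stand-in for true lemmatization, which would normally rely on a dictionary-based tool. The function names `tokenize`, `normalize`, and `preprocess` are likewise assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; the paper does not specify one).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of",
             "for", "in", "on", "my", "this", "out"}

# Hypothetical suffix rules standing in for a real lemmatizer.
SUFFIX_RULES = (("ies", "y"), ("ing", ""), ("ed", ""), ("s", ""))


def tokenize(comment: str) -> list[str]:
    """Lowercase the comment and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", comment.lower())


def normalize(token: str) -> str:
    """Strip one common suffix; a crude stand-in for lemmatization."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def preprocess(comment: str) -> list[str]:
    """Tokenize, drop stopwords, then normalize the remaining tokens."""
    return [normalize(t) for t in tokenize(comment) if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("Check out my channel for FREE gift cards!!!"))
```

In a full pipeline of the kind the abstract describes, the resulting token lists would typically be converted to numeric features (for example, term counts or TF-IDF weights) before being passed to a gradient boosting classifier; that stage is omitted here because the abstract does not specify the feature representation or model settings.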

Country: India

(1) B. Dhashvanth Sai, (2) E. Bhargavi, (3) G. Srija Naidu, (4) G. Aditya Srinivas, (5) A. Vimal Kumar

  1. B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
  2. B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
  3. B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
  4. B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India
  5. B.Tech Student, Department of CSE, Guru Nanak Institute of Technology, Hyderabad, Telangana, India

IRJIET, Volume 9, Special Issue of INSPIRE’25, April 2025, pp. 134-140

doi.org/10.47001/IRJIET/2025.INSPIRE22
