Arabic Sentiment Analysis using Apache Spark

Abstract

People express their feelings and emotions on social media platforms, including Twitter. Twitter currently hosts a huge volume of Arabic user-generated content, and this data is growing rapidly. The Arabic language has its own characteristics, syntax, grammar and morphological rules. In this research, we investigate the development of a big data system to analyze the emotion expressed in Arabic content. Such a system could be useful for many monitoring, marketing, recommendation and decision support applications. We use machine learning algorithms to achieve this goal. In particular, we incorporate and compare the supervised Naïve Bayes and Logistic Regression algorithms as the main machine learning engines that process Arabic-language tweets and classify new tweets as positive, negative or neutral. Before applying these algorithms, we pre-process the input tweets to generate numerical features from the input text and emojis, so that the data is suitable for the algorithms. We have developed a customized pre-processing pipeline that includes several Arabic NLP (ANLP) preparation steps and that suits the Arabic language and the content of real-life tweets. In addition, both algorithms have several hyper-parameters that can affect their performance and accuracy. We therefore use cross-validation to evaluate candidate hyper-parameter combinations and select the one that yields the best classification accuracy. Experiments with the designed system show promising results. We present experimental results that show the accuracy of the system on real-life Arabic tweet data in terms of F1-score, weighted precision and weighted recall.
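The three evaluation metrics named above can be illustrated with a small sketch. The abstract does not state how the paper computes them, so this assumes the usual support-weighted definition: each class's precision, recall and F1 is weighted by that class's share of the true labels. The function name `weighted_metrics` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return (weighted precision, weighted recall, weighted F1).

    Assumption: each per-class score is weighted by the fraction of
    true labels belonging to that class (its support).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    w_precision = w_recall = w_f1 = 0.0
    for label, count in support.items():
        # Precision is defined as 0 when the class was never predicted.
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        weight = count / total
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return w_precision, w_recall, w_f1


# Hypothetical toy labels for the three sentiment classes.
y_true = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "neutral", "positive"]

precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"weighted precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this definition, weighted recall equals plain accuracy, which is one reason weighted precision and F1 are usually reported alongside it when the classes are imbalanced.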

Country: Kingdom of Saudi Arabia

1 Mohamed A. Ahmed

  1. Computer Science Department, College of Computer and Information Systems, Umm Al-Qura University, Makkah Al Mukarramah, Kingdom of Saudi Arabia

IRJIET, Volume 4, Issue 2, February 2020 pp. 31-40
