PDF Malware Detection Using Machine Learning Models

Abstract

PDFs are widely used for document sharing, but their popularity also makes them a common target for malware. This project, titled "PDF Malware Detection Using Machine Learning Models," aims to develop and compare machine learning models for detecting malware in PDFs. Using a Kaggle dataset containing both malicious and benign PDFs, it will employ methods such as Random Forest, C5.0, J48, Support Vector Machines, AdaBoost, Deep Neural Networks, Gradient Boosting Machines, and K-Nearest Neighbors. The main goal is to attain high detection accuracy while integrating explainability to better understand the models' behaviour. By applying these techniques, the project seeks to strengthen cybersecurity measures by identifying and mitigating threats embedded in PDF documents.
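
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, replaces the Kaggle dataset with synthetic data from `make_classification`, covers only five of the listed model families, and uses permutation importance as one possible explainability technique. The feature count, hyperparameters, and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for PDF features (e.g. object or keyword counts).
# Label 1 plays the role of "malicious", label 0 the role of "benign".
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A subset of the model families named in the abstract.
# Distance- and margin-based models get feature scaling.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Train each model and record its held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {scores[name]:.3f}")

# Explainability step: rank features of the best model by how much
# shuffling each one degrades held-out accuracy.
best_name = max(scores, key=scores.get)
result = permutation_importance(
    models[best_name], X_test, y_test, n_repeats=5, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]
print(f"Most influential feature indices for {best_name}: {ranking[:3].tolist()}")
```

On real data, accuracy alone can mislead when malicious and benign samples are imbalanced, so precision, recall, and a confusion matrix would be natural additions to this comparison.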

Country: India

A. Komala (1), Boya Chandu (2), Medivala Reddy Hemanth (3)

  1. Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India
  2. Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India
  3. Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India

IRJIET, Volume 9, Special Issue of INSPIRE’25, April 2025, pp. 278-283

doi.org/10.47001/IRJIET/2025.INSPIRE45
