PDF Malware Detection Using Machine Learning Models

A. Komala; Boya Chandu; Medivala Reddy Hemanth

doi:https://doi.org/10.47001/IRJIET/2025.INSPIRE45

A. Komala, Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India
Boya Chandu, Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India
Medivala Reddy Hemanth, Dept. of CSE-Cybersecurity, Madanapalle Institute of Technology & Science, Madanapalle, India

Vol 9 No 25 (2025): Volume 9, Special Issue of INSPIRE’25 April 2025 | Pages: 278-283

International Research Journal of Innovations in Engineering and Technology

OPEN ACCESS | Research Article | Published Date: 24-04-2025

Abstract

PDFs are widely used for document sharing, but their popularity also makes them a common target for malware. This project, titled "PDF Malware Detection Using Machine Learning Models," aims to develop and compare machine learning models for detecting malware in PDFs. Using a Kaggle dataset containing examples of both malicious and benign PDFs, various methods such as Random Forest, C5.0, J48, Support Vector Machines, AdaBoost, Deep Neural Networks, Gradient Boosting Machines, and K-Nearest Neighbors will be employed. The main goal is to attain high detection accuracy while integrating explainability to better understand the models' behaviour. By leveraging machine learning techniques, the project seeks to strengthen cybersecurity measures by identifying and mitigating threats embedded in PDF documents.
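
As a rough illustration of the kind of evaluation the abstract describes, the sketch below implements one of the listed methods, K-Nearest Neighbors, and measures its accuracy on a held-out split. Everything specific in it is an assumption made for illustration and does not come from the paper or its Kaggle dataset: the three features (counts of /JavaScript and /OpenAction keywords and a page count), the synthetic data generator, the 80/20 split, and k = 5 are all hypothetical.

```python
import math
import random
from collections import Counter


def make_synthetic_samples(n_per_class, seed=0):
    """Generate toy (features, label) pairs. NOT real PDF data.

    Hypothetical features: [javascript_keyword_count,
    openaction_keyword_count, page_count]. Label 0 = benign, 1 = malicious.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_per_class):
        # Assumed benign profile: few script/action keywords, more pages.
        samples.append(
            ([rng.gauss(0.2, 0.3), rng.gauss(0.1, 0.3), rng.gauss(8.0, 2.0)], 0)
        )
        # Assumed malicious profile: more script/action keywords, fewer pages.
        samples.append(
            ([rng.gauss(2.0, 0.5), rng.gauss(1.5, 0.5), rng.gauss(1.5, 1.0)], 1)
        )
    rng.shuffle(samples)
    return samples


def knn_predict(train, x, k=5):
    """Return the majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=5):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


samples = make_synthetic_samples(100)
split = int(0.8 * len(samples))  # 80% train, 20% held out
train, test = samples[:split], samples[split:]
acc = accuracy(train, test)
print(f"held-out accuracy: {acc:.2f}")
```

The same train/test split and accuracy function could be reused for each of the other listed classifiers, so that all models are scored on identical held-out samples. Real features would be extracted from the PDF files rather than generated.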

Keywords

PDF malware detection, machine learning, Random Forest, SVM, DNN, cybersecurity, malicious PDF, classification algorithms, Kaggle dataset

Citation of this Article

A. Komala, Boya Chandu, & Medivala Reddy Hemanth. (2025). PDF Malware Detection Using Machine Learning Models. In Proceedings of the International Conference on Sustainable Practices and Innovations in Research and Engineering (INSPIRE'25), published by IRJIET, Volume 9, Special Issue of INSPIRE’25, pp. 278-283. Article DOI: https://doi.org/10.47001/IRJIET/2025.INSPIRE45

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
