Deep Learning-based Fingerprinting Methods for Audio Representation and Search

Abstract

Audio content is abundant and diverse in today's digital age, ranging from music to podcasts and audio streams. Efficiently representing and searching this vast audio data is essential for applications like content identification, recommendation systems, and audio retrieval. Traditional audio fingerprinting methods have relied on handcrafted features and heuristics, which may lack scalability and robustness in real-world scenarios.

In contrast, deep learning has shown remarkable capabilities in various audio-related tasks, such as speech recognition and music classification. Leveraging deep learning-based methods for audio fingerprinting offers the potential to create compact yet informative representations of audio signals, enabling faster and more accurate content identification and search.

This paper explores deep learning models for developing advanced audio fingerprinting methods. Using U-Net autoencoders (a variant of the autoencoder architecture) and Convolutional Neural Networks (CNNs), the work extracts audio features, then compresses and encodes them to reduce the feature space effectively. The scope also covers noise resilience, ensuring that the audio fingerprints remain consistent and robust even for noisy samples.
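To illustrate the shape of this pipeline, the sketch below stands in for the trained encoder with fixed operations: pooling plays the role of an encoder's strided convolutions, and sign-thresholding produces the compact binary code. This is a minimal illustration, not the paper's trained U-Net model; the function name and pooling size are hypothetical.

```python
import numpy as np

def encode_fingerprint(spec: np.ndarray, pool: int = 4) -> np.ndarray:
    """Toy stand-in for a trained encoder: pool the spectrogram down
    (as strided convolutions in an encoder would), then binarize
    against the mean to obtain a compact, noise-tolerant code."""
    f, t = spec.shape
    f2, t2 = f - f % pool, t - t % pool          # trim to a multiple of the pool size
    pooled = (spec[:f2, :t2]
              .reshape(f2 // pool, pool, t2 // pool, pool)
              .mean(axis=(1, 3)))                # average-pool by `pool` in both axes
    return (pooled > pooled.mean()).astype(np.uint8).ravel()

# Example: a 64x64 "spectrogram" compresses to a 256-bit fingerprint.
spec = np.random.default_rng(0).random((64, 64))
fp = encode_fingerprint(spec)
print(fp.shape)  # (256,)
```

In the paper's setting, the learned encoder replaces the fixed pooling, so the compression is optimized to preserve the features that matter for identification.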

This compressed, encoded audio fingerprint is then used to search the audio database efficiently for tasks such as music identification. For the audio database, the FAISS vector search library is selected because it provides efficient similarity search over dense vectors, which suits music identification well.
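The core lookup FAISS performs in its simplest index (`IndexFlatL2`) is an exact L2 nearest-neighbour search; the sketch below reproduces that lookup in NumPy to keep it self-contained. The database contents and sizes here are hypothetical, and FAISS adds the SIMD/GPU optimizations needed at real scale.

```python
import numpy as np

# Hypothetical database of 1000 fingerprints (one per indexed track),
# each a 256-dimensional vector as produced by the encoder.
rng = np.random.default_rng(42)
db = rng.random((1000, 256)).astype(np.float32)

def search(query: np.ndarray, k: int = 5):
    """Exact L2 nearest-neighbour search, the same lookup that
    faiss.IndexFlatL2 performs on an added vector set."""
    d2 = ((db - query) ** 2).sum(axis=1)   # squared L2 distance to every entry
    idx = np.argsort(d2)[:k]               # indices of the k closest fingerprints
    return idx, d2[idx]

# A query identical to a stored fingerprint retrieves itself first.
idx, dist = search(db[123])
print(idx[0], float(dist[0]))  # 123 0.0
```

For large catalogues, the same two calls map directly onto FAISS (`index.add(db)` then `index.search(queries, k)`), with approximate index types trading a little accuracy for much faster lookups.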

Divesh Singh
Infosys Limited, Mumbai, India

IRJIET, Volume 9, Issue 3, March 2025, pp. 182–192
DOI: 10.47001/IRJIET/2025.903024
