Deep Learning-based Fingerprinting Methods for Audio Representation and Search

Abstract

Audio content is abundant and diverse in today's digital age, ranging from music to podcasts and audio streams. Efficiently representing and searching this vast audio data is essential for applications like content identification, recommendation systems, and audio retrieval. Traditional audio fingerprinting methods have relied on handcrafted features and heuristics, which may lack scalability and robustness in real-world scenarios.

In contrast, deep learning has shown remarkable capabilities in various audio-related tasks, such as speech recognition and music classification. Leveraging deep learning-based methods for audio fingerprinting offers the potential to create compact yet informative representations of audio signals, enabling faster and more accurate content identification and search.

This paper explores deep learning models for developing advanced audio fingerprinting methods. Using U-Net autoencoders (a variant of the autoencoder architecture) and Convolutional Neural Networks (CNNs), the work extracts audio features, then compresses and encodes them to reduce the feature space effectively. The scope also covers noise resilience, ensuring that the audio fingerprints remain consistent and robust even for noisy samples.
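To illustrate the shape of this pipeline, the sketch below stands in for the trained encoder with fixed operations: pooling plays the role of an encoder's strided convolutions, and sign-thresholding produces the compact binary code. This is a minimal illustration, not the paper's trained U-Net model; the function name and pooling size are hypothetical.

```python
import numpy as np

def encode_fingerprint(spec: np.ndarray, pool: int = 4) -> np.ndarray:
    """Toy stand-in for a trained encoder: pool the spectrogram down
    (as strided convolutions in an encoder would), then binarize
    against the mean to obtain a compact, noise-tolerant code."""
    f, t = spec.shape
    f2, t2 = f - f % pool, t - t % pool          # trim to a multiple of the pool size
    pooled = (spec[:f2, :t2]
              .reshape(f2 // pool, pool, t2 // pool, pool)
              .mean(axis=(1, 3)))                # average-pool by `pool` in both axes
    return (pooled > pooled.mean()).astype(np.uint8).ravel()

# Example: a 64x64 "spectrogram" compresses to a 256-bit fingerprint.
spec = np.random.default_rng(0).random((64, 64))
fp = encode_fingerprint(spec)
print(fp.shape)  # (256,)
```

In the paper's setting, the learned encoder replaces the fixed pooling, so the compression is optimized to preserve the features that matter for identification.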

This compressed, encoded audio fingerprint is then used to search the audio database efficiently for tasks such as music identification. For the audio database, the FAISS vector search library is selected because it provides efficient similarity search over dense vectors, which suits music identification well.
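The core lookup FAISS performs in its simplest index (`IndexFlatL2`) is an exact L2 nearest-neighbour search; the sketch below reproduces that lookup in NumPy to keep it self-contained. The database contents and sizes here are hypothetical, and FAISS adds the SIMD/GPU optimizations needed at real scale.

```python
import numpy as np

# Hypothetical database of 1000 fingerprints (one per indexed track),
# each a 256-dimensional vector as produced by the encoder.
rng = np.random.default_rng(42)
db = rng.random((1000, 256)).astype(np.float32)

def search(query: np.ndarray, k: int = 5):
    """Exact L2 nearest-neighbour search, the same lookup that
    faiss.IndexFlatL2 performs on an added vector set."""
    d2 = ((db - query) ** 2).sum(axis=1)   # squared L2 distance to every entry
    idx = np.argsort(d2)[:k]               # indices of the k closest fingerprints
    return idx, d2[idx]

# A query identical to a stored fingerprint retrieves itself first.
idx, dist = search(db[123])
print(idx[0], float(dist[0]))  # 123 0.0
```

For large catalogues, the same two calls map directly onto FAISS (`index.add(db)` then `index.search(queries, k)`), with approximate index types trading a little accuracy for much faster lookups.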

Divesh Singh
Infosys Limited, Mumbai, India

IRJIET, Volume 9, Issue 3, March 2025, pp. 182–192
DOI: 10.47001/IRJIET/2025.903024
