Volume 10, Issue 5, May 2026 | Pages: 683-686
International Research Journal of Innovations in Engineering and Technology
OPEN ACCESS | Research Article | Published Date: 31-05-2026
Timbre transfer aims to modify the timbral characteristics of audio while preserving key elements such as melody and rhythm. Diffusion-based models have yielded promising results in image and audio synthesis, but their application to ethnic Nepali instruments remains largely unexplored. We explore an unsupervised method for flute-to-Sarangi timbre transfer using latent diffusion bridges. In our experiment, the flute model maps the input audio to its corresponding Gaussian prior, and the Sarangi model reconstructs the target audio from that prior. Once trained, the Sarangi model can serve as either the source or the target model. Experimental results demonstrate that the method preserves the melodic structure while altering timbral qualities.
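The encode-then-decode bridge described above can be illustrated with a toy sketch. This is not the authors' implementation: the trained flute and Sarangi latent diffusion models are replaced here by two 1-D Gaussian distributions whose scores are known in closed form, the noise schedule is assumed to be variance-exploding, and the solver is plain Euler integration of the probability-flow ODE. All function names and parameter values (`gaussian_score`, `pf_ode`, `bridge_transfer`, `sigma_max=80.0`) are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_score(x, sigma, mean, std):
    """Score of N(mean, std^2) after adding Gaussian noise of level sigma.
    Stands in for a trained score network (assumption)."""
    return -(x - mean) / (std**2 + sigma**2)

def pf_ode(x, sigmas, mean, std):
    """Euler integration of the probability-flow ODE
    dx/dsigma = -sigma * score(x, sigma) along the noise grid `sigmas`."""
    x = np.asarray(x, dtype=float).copy()
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        drift = -s_cur * gaussian_score(x, s_cur, mean, std)
        x = x + (s_next - s_cur) * drift
    return x

def bridge_transfer(x_src, src, tgt, sigma_min=1e-3, sigma_max=80.0, steps=2000):
    """Map source samples to the shared Gaussian prior with the source model,
    then map the prior sample back to data with the target model."""
    sigmas = np.geomspace(sigma_min, sigma_max, steps)
    latent = pf_ode(x_src, sigmas, *src)        # source model: data -> prior
    return pf_ode(latent, sigmas[::-1], *tgt)   # target model: prior -> data

if __name__ == "__main__":
    x = np.array([-1.0, 0.0, 1.0])              # toy "source" samples
    y = bridge_transfer(x, src=(0.0, 1.0), tgt=(5.0, 2.0))
    print(y)                                    # samples rescaled toward the target
```

In this toy setting the transferred samples take on the target's mean and spread while keeping their relative order, a loose analogue of changing timbre while leaving melodic content in place. Because both models share the same prior and are used independently, either one can act as encoder or decoder, which is why a single trained model can serve as source or target.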
Audio Synthesis, Latent Diffusion, Sarangi, Timbre Transfer.
Aashish Shrestha, & Sanjivan Satyal. (2026). Timbre Transfer from Flute to Sarangi Using Latent Diffusion Bridge. International Research Journal of Innovations in Engineering and Technology - IRJIET, 10(5), 683-686. Article DOI https://doi.org/10.47001/IRJIET/2026.105091
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.