A Multimodal Journey in Text-to-Image and Video Creation Using AI

Abstract

Text-to-image and text-to-video AI models are technologies that combine narrative with visual content. These models convert written text (descriptions, sentences, or phrases) into corresponding images or videos. Leveraging advanced deep learning architectures such as Generative Adversarial Networks (GANs) and Transformers, they interpret the content of a narrative and generate visuals consistent with the text. In the text-to-image domain, the model produces realistic images of the scenes and objects, simple or complex, that the text describes. In the video domain, it arranges the generated images or frames into a coherent video that follows the narrative. The impact of this technology is broad: it provides powerful tools for turning textual content into graphical representations, expanding content creation, the visual arts, e-commerce, and accessibility for the visually impaired. To produce high-resolution images, we implemented the EDSR x4 model. EDSR (Enhanced Deep Super-Resolution) is a state-of-the-art convolutional neural network (CNN) architecture designed for single-image super-resolution, that is, improving the resolution of low-quality images.
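The abstract describes a two-stage pipeline: a generative model renders an image from text, and EDSR upscales it 4x. The paper's own implementation is not shown here; the following is a minimal Python sketch of that flow, assuming Hugging Face's diffusers library for the text-to-image stage (the "runwayml/stable-diffusion-v1-5" checkpoint is an illustrative choice, not necessarily the one used in the paper) and OpenCV's dnn_superres module with a pretrained EDSR_x4.pb weights file for the super-resolution stage.

```python
# Minimal sketch of the described pipeline: text -> image -> 4x EDSR upscaling.
# Assumes: pip install diffusers transformers torch opencv-contrib-python
# and a pretrained "EDSR_x4.pb" file (distributed with OpenCV's dnn_superres examples).
import cv2
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

# Stage 1: text-to-image with a pretrained diffusion model
# (illustrative checkpoint, not confirmed by the paper).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("a lighthouse on a cliff at sunset").images[0]  # PIL RGB image

# Stage 2: 4x super-resolution with EDSR via OpenCV's dnn_superres module.
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")   # load pretrained EDSR weights
sr.setModel("edsr", 4)       # select the EDSR architecture, scale factor 4

# OpenCV expects BGR uint8 arrays; convert from the PIL RGB image.
low_res = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
high_res = sr.upsample(low_res)  # output is 4x the input resolution
cv2.imwrite("frame_highres.png", high_res)
```

Frames produced this way can then be sequenced into a clip, for example with cv2.VideoWriter, to realize the text-to-video stage the abstract describes.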


1Prof. Balaji Chaugule, 2Akanksha Gawade, 3Pranav Mane, 4Adarsh Thazhathethil, 5Shashwat Kulkarni

  1-5. Department of Information Technology, Zeal College of Engineering and Research, Savitribai Phule Pune University, Pune, India

IRJIET, Volume 8, Issue 1, January 2024, pp. 11-14

DOI: https://doi.org/10.47001/IRJIET/2024.801002
