A Multimodal Journey in Text-to-Image and Video Creation Using AI

Abstract

Text-to-image and text-to-video AI models are technologies that combine narrative with visual content. These models convert written text (descriptions, sentences, or phrases) into corresponding images or videos. Leveraging advanced deep learning architectures such as Generative Adversarial Networks (GANs) and Transformers, they interpret the content of a narrative and generate visuals consistent with the text. In the text-to-image domain, the model produces realistic images of the scenes and objects, simple or complex, that the text describes. In the video domain, it arranges the generated images or frames into a coherent video that follows the narrative. The impact of this technology is broad: it provides powerful tools for turning textual content into graphical representations, expanding content creation, the visual arts, e-commerce, and accessibility for the visually impaired. To produce high-resolution images, we implemented the EDSR x4 model. EDSR (Enhanced Deep Super-Resolution) is a state-of-the-art convolutional neural network (CNN) architecture designed for single-image super-resolution, that is, improving the resolution of low-quality images.
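The abstract describes a two-stage pipeline: a generative model renders an image from text, and EDSR upscales it 4x. The paper's own implementation is not shown here; the following is a minimal Python sketch of that flow, assuming Hugging Face's diffusers library for the text-to-image stage (the "runwayml/stable-diffusion-v1-5" checkpoint is an illustrative choice, not necessarily the one used in the paper) and OpenCV's dnn_superres module with a pretrained EDSR_x4.pb weights file for the super-resolution stage.

```python
# Minimal sketch of the described pipeline: text -> image -> 4x EDSR upscaling.
# Assumes: pip install diffusers transformers torch opencv-contrib-python
# and a pretrained "EDSR_x4.pb" file (distributed with OpenCV's dnn_superres examples).
import cv2
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

# Stage 1: text-to-image with a pretrained diffusion model
# (illustrative checkpoint, not confirmed by the paper).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("a lighthouse on a cliff at sunset").images[0]  # PIL RGB image

# Stage 2: 4x super-resolution with EDSR via OpenCV's dnn_superres module.
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")   # load pretrained EDSR weights
sr.setModel("edsr", 4)       # select the EDSR architecture, scale factor 4

# OpenCV expects BGR uint8 arrays; convert from the PIL RGB image.
low_res = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
high_res = sr.upsample(low_res)  # output is 4x the input resolution
cv2.imwrite("frame_highres.png", high_res)
```

Frames produced this way can then be sequenced into a clip, for example with cv2.VideoWriter, to realize the text-to-video stage the abstract describes.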


1Prof. Balaji Chaugule, 2Akanksha Gawade, 3Pranav Mane, 4Adarsh Thazhathethil, 5Shashwat Kulkarni

  1-5. Department of Information Technology, Zeal College of Engineering and Research, Savitribai Phule Pune University, Pune, India

IRJIET, Volume 8, Issue 1, January 2024, pp. 11-14

DOI: https://doi.org/10.47001/IRJIET/2024.801002
