Text-to-Image Datasets: Characteristics, Challenges, and Opportunities

Abstract

Text-to-image synthesis is a field of study that seeks to generate images from textual descriptions, with the primary objective of producing visuals that match the given text in both semantic coherence and visual realism. Despite significant advances in recent years, the field continues to face numerous hurdles, chiefly concerning image realism and semantic coherence. Selecting diverse datasets with comprehensive annotations can markedly improve model performance on these difficulties: datasets with varied visual material and detailed textual descriptions help models learn the intricate links between text and images, enhancing both semantic coherence and image authenticity. This review paper examines 20 datasets available for text-to-image synthesis, categorizing them by scope, variety, and application domain. Ultimately, the careful selection and curation of datasets play a pivotal role in advancing the state of the art in text-to-image synthesis.

Haitham Alhaji¹, Alaa Yaseen Taqa²

  1. Computer Science Department, College of Computer Science and Mathematics, University of Mosul, Nineveh, Iraq
  2. Computer Science Department, College of Education for Pure Science, University of Mosul, Nineveh, Iraq

IRJIET, Volume 9, Issue 3, March 2025, pp. 67–77

https://doi.org/10.47001/IRJIET/2025.903009
