Exploring the Capabilities of Large Language Model Mistral Large (Mistral) on Medical Challenge Problems and Hallucinations

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across natural language processing tasks, including question answering, text generation, and multimodal understanding. However, their performance in specialized domains such as healthcare, and their propensity to generate hallucinated (false) information, remain areas of active investigation. This paper explores the capabilities and limitations of Mistral's LLM, mistral-large-2402, in tackling medical challenge problems and assesses its tendency to hallucinate. The study is motivated by the potential of LLMs to augment medical decision-making and by the need to evaluate their reliability in critical domains such as healthcare. We investigate mistral-large-2402's performance on a curated dataset of medical challenge problems spanning diagnosis, treatment recommendation, and medical condition analysis. Additionally, we examine the model's propensity to hallucinate by analyzing its responses for factual inconsistencies and unsubstantiated claims. Through quantitative and qualitative analyses, we provide insights into mistral-large-2402's strengths and weaknesses in handling medical challenges. Our evaluation methodology measures the accuracy, completeness, and coherence of the model's responses, as well as the model's ability to recognize and mitigate hallucinations. The findings contribute to the ongoing discourse on the responsible deployment of LLMs in healthcare and highlight potential areas for improvement in model design and training.
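To illustrate the quantitative side of this methodology, the two headline measures (answer accuracy and hallucination rate) reduce to simple ratios over graded responses. The sketch below is illustrative only, not the study's actual pipeline; the record schema and the sample data are hypothetical.

```python
# Illustrative sketch: computing accuracy and hallucination rate over a batch
# of graded model responses. Field names and data are hypothetical.
from dataclasses import dataclass

@dataclass
class GradedResponse:
    predicted: str        # model's chosen option, e.g. "B"
    gold: str             # reference answer from the dataset
    hallucinated: bool    # annotator flag: response makes unsupported claims

def accuracy(responses: list[GradedResponse]) -> float:
    """Fraction of responses whose predicted answer matches the gold answer."""
    return sum(r.predicted == r.gold for r in responses) / len(responses)

def hallucination_rate(responses: list[GradedResponse]) -> float:
    """Fraction of responses flagged as containing hallucinated content."""
    return sum(r.hallucinated for r in responses) / len(responses)

graded = [
    GradedResponse("B", "B", False),
    GradedResponse("A", "C", True),
    GradedResponse("D", "D", False),
    GradedResponse("C", "C", True),
]
print(accuracy(graded))            # 0.75
print(hallucination_rate(graded))  # 0.5
```

Completeness and coherence, by contrast, require rubric-based human or model-assisted grading and do not reduce to a single ratio in the same way.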

Country: India

1 Pooja Mishra, 2 Rutuja Bhujbal, 3 Tushar Singh

  1. Dr. D. Y. Patil Institute of Engineering Management and Research, Pune, Maharashtra, India
  2. Dr. D. Y. Patil Institute of Engineering Management and Research, Pune, Maharashtra, India
  3. Dr. D. Y. Patil Institute of Engineering Management and Research, Pune, Maharashtra, India

IRJIET, Volume 8, Issue 5, May 2024, pp. 156-164

https://doi.org/10.47001/IRJIET/2024.804024
