Data-Centric Artificial Intelligence for Textual Understanding in Healthcare Decision Systems

Abstract

Data-Centric Artificial Intelligence (DCAI) reframes clinical NLP by treating data quality, coverage, and governance as the primary levers of performance and safety, rather than model tinkering alone. In healthcare, where actionable knowledge is embedded in unstructured narratives such as progress notes, discharge summaries, radiology and pathology reports, referral letters, and patient messages, this paper proposes an end-to-end, practice-oriented framework to operationalize DCAI for textual understanding in decision systems. We (1) anchor tasks to measurable clinical utility and harm profiles; (2) detail corpus assembly with stratified sampling across sites, specialties, and demographics; (3) formalize schemas linking entities, assertions (negation/uncertainty), relations, and temporal qualifiers to SNOMED CT, ICD-10/11, RxNorm, and LOINC; (4) combine programmatic labeling (heuristics, ontologies, prompts-as-LFs) with clinician adjudication, active learning, and targeted augmentation; (5) outline privacy-preserving training via de-identification, federated learning, and differential privacy; (6) present model-agnostic evaluation that goes beyond accuracy to cover calibration, uncertainty, fairness, robustness, and decision-curve net benefit; and (7) specify deployment blueprints for monitoring drift, instituting human-in-the-loop overrides, and creating auditable feedback loops that continuously improve data assets. Four exemplar use cases (ICD code suggestion, adverse drug event extraction, radiology impression normalization, and patient-message triage) demonstrate tangible workflows, metrics, and governance checklists. Results show how continuous data refinement improves discrimination and calibration while reducing alert burden and subgroup disparities, enabling safer, more equitable, and maintainable clinical decision support. We conclude with implementation checklists and a reproducible playbook to accelerate DCAI adoption across diverse health systems and languages.
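As a concrete reading of the "decision-curve net benefit" metric named in item (6), the sketch below uses the standard decision-curve definition, net benefit = TP/N - (FP/N) * p_t / (1 - p_t), where p_t is the threshold probability at which a clinician would act. The function names and the toy labels and scores are illustrative assumptions; they are not taken from the paper's experiments or code.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of acting on every case whose predicted risk is >= threshold.

    Standard decision-curve formula: TP/N - (FP/N) * threshold / (1 - threshold).
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    # Count true and false positives among the cases flagged at this threshold.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def treat_all_net_benefit(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: flag every case regardless of the model's output."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * (threshold / (1.0 - threshold))


# Hypothetical toy data: 1 = message truly needed urgent triage, 0 = it did not.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.7, 0.6, 0.1, 0.4, 0.3, 0.05]

model_nb = net_benefit(toy_labels, toy_scores, threshold=0.5)  # 0.125
treat_all_nb = treat_all_net_benefit(toy_labels, threshold=0.5)  # -0.25
```

On this toy data the model's net benefit at a threshold of 0.5 exceeds that of flagging everything, which is the kind of comparison a decision curve plots across a range of thresholds. A curve would repeat the calculation over the thresholds that are clinically plausible for the task.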

Country: USA

1 Sabiha Tasneem

  1. Senior Software Engineer, Stykkist Inc, New Jersey, USA

IRJIET, Volume 9, Issue 11, November 2025, pp. 12-25

doi.org/10.47001/IRJIET/2025.911002
