Large Language Models for Automated Healthcare Data Dictionary Generation and Maintenance

Authors

  • Bindu Madhavi Mangalampalli, Sasi Kumar Kolla

DOI:

https://doi.org/10.64149/

Keywords:

Healthcare Data Dictionaries, LLM-Based Data Governance, Automated Metadata Management, Data Dictionary Automation, Healthcare Data Interoperability, Metadata Quality Improvement, Schema Change Impact Analysis, Data Governance Frameworks, Clinical Data Standardization, Intelligent Data Documentation

Abstract

Data dictionaries enable consistent and reliable communication within and between healthcare organizations but are often incomplete, inconsistent, outdated, or poorly maintained. A methodology that harnesses large language models (LLMs) automates the creation, augmentation, and upkeep of data dictionaries and associated artifacts throughout their lifecycle. Experimental results on an end-to-end data dictionary generation pipeline show considerable improvements in artifact quality, with specific artifact categories prepared at least an order of magnitude faster than by human expert contributors. The approach addresses two crucial yet largely neglected aspects of data dictionary governance: the identification of new attributes within hospital information systems and the assessment of the impact of proposed changes.

In healthcare, data dictionaries describe the information content supported by a system. These dictionaries facilitate interoperability and data sharing by formally documenting core knowledge (such as metadata about entities and attributes), by providing an external reference against which new datasets can be validated, and by supplying domain knowledge about external datasets for data fusion, data integration, and machine learning. Nevertheless, many platforms have no data dictionary of any kind, and the quality of those that exist is often poor. Poor-quality data dictionaries suffer from a range of problems, including incompleteness, inconsistency, outdated content, and a lack of audit trails for historical queries. Poor-quality (and missing) data dictionaries also hamper data governance by slowing down, and increasing the risk of, tasks such as the identification of privacy-sensitive fields, the preparation of new data-sharing contracts, and impact assessments for schema changes.

Published

2025-12-25

How to Cite

Large Language Models for Automated Healthcare Data Dictionary Generation and Maintenance. (2025). Vascular and Endovascular Review, 8(20s), 363-375. https://doi.org/10.64149/