Generative AI for Synthetic PHI: Privacy-Preserving Training Data for Healthcare LLMs
Abstract
Generative AI models such as GANs and VAEs can create synthetic protected health information (PHI), offering a viable way to train healthcare LLMs without violating patient privacy. This paper compares the utility and privacy trade-offs of GANs, VAEs, and federated learning architectures, including variants augmented with differential privacy and variants designed to support homomorphic encryption and federated averaging. We trained and tested synthetic data pipelines on de-identified MIMIC-III and PhysioNet data under privacy budgets ε between 0.1 and 1.0. Across clinical tasks, we measured re-identification risk, distributional fidelity (Kolmogorov-Smirnov tests and Pearson correlation), and downstream LLM performance (precision, recall, and F1). Statistical analysis indicates that both approaches lower re-identification risk: differential privacy reduces it by up to 80 percent at a utility cost of an F1 drop of 15 percent or less, and federated learning with homomorphic encryption reduces it by up to 60 percent with an F1 decay of less than 9 percent. ANOVA, McNemar tests, and paired t-tests indicate that the privacy gains are statistically significant while utility remains at an acceptable level. Case studies illustrate the benefits in real-world clinical trial simulation, diagnostic model development, and rare disease analytics. A regulatory analysis maps these practices to HIPAA Safe Harbor and GDPR pseudonymization and, for practical governance, recommends model cards, datasheets, and blockchain-based audit trails. Deployment requires interdisciplinary contributions from clinicians, data scientists, and engineers. Our results indicate that privacy-preserving synthetic PHI is a safe source of training data for healthcare LLMs, enabling secure and scalable development of AI in healthcare.