Generative AI for Synthetic PHI: Privacy-Preserving Training Data for Healthcare LLMs

Kawaljeet Singh Chadha

Open Access

Kawaljeet Singh Chadha ¹

¹ Business Analyst II MI, McLaren Health Care, TX, USA

Abstract

Generative AI models such as GANs and VAEs can create synthetic protected health information (PHI), offering a viable way to train healthcare LLMs without violating patient privacy. The paper compares the utility and privacy trade-offs of GANs, VAEs, and federated learning architectures augmented with differential privacy, homomorphic encryption, and federated averaging. Synthetic data pipelines were trained and tested on de-identified MIMIC-III and PhysioNet data across privacy budgets from 0.1 to 1.0. Across clinical tasks, the evaluation measured re-identification risk, distributional fidelity (Kolmogorov-Smirnov tests and Pearson correlation), and downstream LLM performance (precision, recall, and F1). Statistical analysis indicates that both approaches lower the probability of re-identification: differential privacy reduces the risk by up to 80 percent at a utility cost of an F1 drop of 15 percent or less, and federated learning with homomorphic encryption reduces it by up to 60 percent with an F1 decay of less than 9 percent. ANOVA, McNemar tests, and paired t-tests support the view that the privacy gains are significant while utility remains at an acceptable level. Case studies illustrate the benefits in real-world clinical trial simulation, diagnostic model development, and rare disease analytics. A regulatory analysis maps these practices to HIPAA Safe Harbor and GDPR pseudonymization and, for practical governance, recommends model cards, datasheets, and blockchain-based audit trails. Deployment requires interdisciplinary contributions from clinicians, data scientists, and engineers. Our results indicate that privacy-preserving synthetic PHI is a safe source of training data for healthcare LLMs, enabling secure and scalable development of AI in healthcare.
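The distributional fidelity checks named in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline: the `toy_cohort` helper, the age and blood pressure columns, and the sample sizes are hypothetical stand-ins, and the "synthetic" sample is simply a second draw from the same toy distribution rather than generator output. The sketch computes a two-sample Kolmogorov-Smirnov statistic for one feature and the gap in Pearson correlation between a feature pair, using only the Python standard library.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The supremum of the CDF difference is attained at an observed data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def toy_cohort(n, rng):
    """Hypothetical data: age and a loosely age-dependent systolic blood pressure."""
    ages = [rng.gauss(60, 12) for _ in range(n)]
    sbp = [100 + 0.5 * age + rng.gauss(0, 8) for age in ages]
    return ages, sbp


rng = random.Random(0)
real_age, real_sbp = toy_cohort(500, rng)
synth_age, synth_sbp = toy_cohort(500, rng)  # assumption: stands in for generator output

ks_age = ks_statistic(real_age, synth_age)
corr_gap = abs(pearson(real_age, real_sbp) - pearson(synth_age, synth_sbp))
print(f"KS statistic (age): {ks_age:.3f}")
print(f"Correlation gap (age vs. SBP): {corr_gap:.3f}")
```

A KS statistic near 0 and a small correlation gap suggest that the synthetic sample tracks the real one on these two checks. Neither number says anything about re-identification risk, which needs a separate evaluation.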

How to Cite

Kawaljeet Singh Chadha. (2024). Generative AI for Synthetic PHI: Privacy-Preserving Training Data for Healthcare LLMs. Frontiers in Emerging Artificial Intelligence and Machine Learning, 1(01), 26–43. Retrieved from https://irjernet.com/index.php/feaiml/article/view/174

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.