Anonymization of medical texts with natural language processing

Authors

DOI:

https://doi.org/10.59681/2175-4411.v17.2025.1227

Keywords:

data anonymization, medical records, natural language processing

Abstract

Objective: To present and evaluate a method for anonymizing medical records in Portuguese using a pre-trained named entity recognition (NER) model without fine-tuning. Method: The GLiNER (Generalist and Lightweight Model for Named Entity Recognition) model was applied to identify and mask potentially identifying information (e.g., name, age, organization, and city) in 27,540 discharge summaries (12,163 patients) from a tertiary hospital in São Paulo (2017-2023). Information loss was evaluated with ROUGE F1, BLEU-4, and BERTScore, and a human analysis of errors was performed on a random sample (N=400). Results: Human analysis found anonymization failures in two cases (0.50%), which allowed identification of the patient or the assistant. Quantitative metrics indicated that textual utility was preserved (median BERTScore: 0.76). Conclusion: The model is efficient but not perfect, which highlights the need for a hybrid anonymization approach (automatic anonymization plus human validation) to comply with the Brazilian General Law for the Protection of Personal Data (LGPD). It can serve as a step toward creating the medical datasets needed to develop natural language processing in Brazil.
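The mask-then-measure idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The `[LABEL]` placeholder format, the helper names, the sample sentence, and the `urchade/gliner_multi-v2.1` checkpoint are all assumptions made for illustration, since the abstract does not specify them. The unigram-overlap score is a simplified stand-in for the ROUGE F1 reported in the study.

```python
from collections import Counter


def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts with "start", "end" and "label" keys
    (character offsets), assumed not to overlap.
    """
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = "[" + ent["label"].upper() + "]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


def unigram_f1(reference, candidate):
    """Whitespace-token unigram overlap F1 (a simplified ROUGE-1 F1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def detect_entities(text, labels=("name", "age", "organization", "city")):
    """Hypothetical detection step using the gliner package (not run here).

    The checkpoint name is an assumption, not taken from the article.
    """
    from gliner import GLiNER  # requires `pip install gliner`

    model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
    return model.predict_entities(text, list(labels), threshold=0.5)


def span(text, fragment, label):
    """Build a hand-written entity dict for a fragment of `text`."""
    start = text.index(fragment)
    return {"start": start, "end": start + len(fragment), "label": label}


# Made-up sentence and hand-written spans, so the sketch runs without a model.
sample = "Paciente Maria Silva, 67 anos, internada em Campinas."
spans = [
    span(sample, "Maria Silva", "name"),
    span(sample, "67 anos", "age"),
    span(sample, "Campinas", "city"),
]
masked_sample = mask_entities(sample, spans)
print(masked_sample)  # Paciente [NAME], [AGE], internada em [CITY].
print(round(unigram_f1(sample, masked_sample), 3))
```

A lower overlap score means that more of the original wording was replaced by placeholders. In the study this loss was measured over whole discharge summaries, where identifying spans make up a much smaller share of the text than in this one-sentence example.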

Author Biographies

Rildo Pinto da Silva, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo

PhD candidate in internal medicine, Department of Internal Medicine, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo.

Antonio Pazin-Filho, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo

Full Professor, Department of Internal Medicine, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo.

Published

2025-05-26

How to Cite

Silva, R. P. da, & Pazin-Filho, A. (2025). Anonymization of medical texts with natural language processing. Journal of Health Informatics, 17(1), 1227. https://doi.org/10.59681/2175-4411.v17.2025.1227

Issue

Section

Original Articles
