Anonymization of administrative billing codes with repeated diagnoses through censoring.

Abstract

Patient-specific data from electronic medical records (EMRs) is increasingly shared in a de-identified form to support research. However, EMRs are susceptible to noise, error, and variation, which can limit their utility for reuse. One way to enhance the utility of EMRs is to record the number of times diagnosis codes are assigned to a patient when this data is shared. This is, however, challenging because releasing such data may be leveraged to compromise patients' identity. In this paper, we present an approach that, to the best of our knowledge, is the first that can prevent re-identification through repeated diagnosis codes. Our method transforms records to preserve privacy while retaining much of their utility. Experiments conducted using 2676 patients from the EMR system of the Vanderbilt University Medical Center verify that our method is able to retain an average of 95.4% of the diagnosis codes in a common data sharing scenario.