Many people have asked me various questions about Personal Health Information (PHI) which is covered by the Health Insurance Portability and Accountability Act (HIPAA, not HIPPA). The process of removing data (de-identification or anonymization) that could violate someone’s privacy is complex. This is particularly true when there is unstructured data (i.e., free text). The U.S. Department of Health and Human Services has detailed guidelines on proper de-identification techniques, which are found here or at Bing’s cached copy here.
I found the guidelines to be very informative. The discussion on zip codes was interesting. Zip codes, particularly in areas that aren’t densely populated, have to be abbreviated to the first three digits. Even when you restrict a zip code to the first three digits, there is a list of 17 specific three digit zip codes that you cannot use at all.
The document says that age must be removed from a patient’s record if the patient’s age is greater than 89. Can you imagine a patient summary beginning with “The patient is a 107 year old man…”? For supercentenarians, age does provide a clue as to who they are. Changing a patient’s age or date of birth helps greatly in de-identification, but care must be taken. You don’t want to make an adult a minor or vice-versa.
Consider this statement in the medical record: “The patient became ill after eating a [insertNameOfReligiousHolidayHere] meal.” One could argue that removing the name of a religious holiday makes for a neutral record. That might be the appropriate thing to do, but there could be clinical value in knowing the religious holiday or the religion of the patient. It could be useful to know if certain things would be eaten or definitely not eaten.
There are published algorithms for processing textual data and de-identifying it. You can download Perl regular expression scripts from PhysioNet for free. The download also includes several dictionaries that the scripts use. Notice there is a dictionary of medical terms and several dictionaries of people’s names. Obviously there is value in knowing if a word is a person’s name or a medical term. DeBakey appears in the SNOMED dictionary as a medical term. There is a DeBakey clamp, a DeBakey pump, and a DeBakey graft. But what if the patient’s name was DeBakey? DeBakey does not appear in the dictionary of common names. Would the scripts recognize DeBakey as a medical term and not remove what actually is the patient’s name?
A known weakness of processing textual data against dictionaries is misspelled words and names. Some names are particularly difficult to spell and will not always be found in the dictionaries because of the inevitable misspellings.