THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web

John Paul Cook

De-identification of Personal Health Information

Many people have asked me questions about Protected Health Information (PHI), which is covered by the Health Insurance Portability and Accountability Act (HIPAA, not HIPPA). The process of removing data that could violate someone’s privacy (de-identification or anonymization) is complex. This is particularly true when there is unstructured data (i.e., free text). The U.S. Department of Health and Human Services has detailed guidelines on proper de-identification techniques, which are found here or at Bing’s cached copy here.

I found the guidelines very informative. The discussion of ZIP codes was interesting. ZIP codes, particularly in areas that aren’t densely populated, have to be abbreviated to the first three digits. Even when you restrict a ZIP code to the first three digits, there is a list of 17 specific three-digit ZIP codes that you cannot use at all.
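Here is a minimal Python sketch of the ZIP code rule, not a definitive implementation. The function name generalize_zip and the restricted_prefixes parameter are my own inventions for illustration. Replacing a restricted prefix with 000 is my reading of the HHS guidance, so verify it against the current document. The prefixes in the usage lines are placeholders, not the real list of 17.

```python
from typing import Iterable


def generalize_zip(zip_code: str, restricted_prefixes: Iterable[str]) -> str:
    """Reduce a ZIP code to its first three digits.

    Prefixes listed in restricted_prefixes are replaced with "000"
    (assumption: this mirrors the HHS guidance; check the current text).
    """
    # Keep digits only so that ZIP+4 values such as "77005-1234" also work.
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        raise ValueError(f"not a usable ZIP code: {zip_code!r}")
    prefix = digits[:3]
    return "000" if prefix in set(restricted_prefixes) else prefix


# Placeholder prefixes for demonstration only. Load the real list of
# restricted three-digit ZIP codes from the HHS guidance.
PLACEHOLDER_RESTRICTED = {"998", "999"}

print(generalize_zip("77005", PLACEHOLDER_RESTRICTED))
print(generalize_zip("99912-0042", PLACEHOLDER_RESTRICTED))
```

Passing the restricted list in as a parameter keeps the code from hard-coding values that depend on census data and may change.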

The document says that age must be removed from a patient’s record if the patient’s age is greater than 89. Can you imagine a patient summary beginning with “The patient is a 107-year-old man…”? For supercentenarians, age does provide a clue as to who they are. Changing a patient’s age or date of birth helps greatly in de-identification, but care must be taken. You don’t want to make an adult a minor or vice versa.
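The two age concerns above can be sketched in Python as follows. This is an illustration under stated assumptions, not a vetted de-identification routine. The names generalize_age and shift_age are hypothetical. The "90+" bucket reflects my understanding that the guidance allows ages over 89 to be grouped into a single category. The adult_age default of 18 is an assumption, because the age of majority varies by jurisdiction and context.

```python
def generalize_age(age: int) -> str:
    """Return the age as text, collapsing anything over 89 into "90+"."""
    if age < 0:
        raise ValueError("age cannot be negative")
    return "90+" if age > 89 else str(age)


def shift_age(age: int, offset: int, adult_age: int = 18) -> int:
    """Shift an age by offset without turning an adult into a minor
    or a minor into an adult.

    adult_age=18 is an assumption; adjust it for your jurisdiction.
    """
    shifted = max(0, age + offset)
    if age >= adult_age:
        # Adults stay at or above the threshold.
        return max(shifted, adult_age)
    # Minors stay below the threshold.
    return min(shifted, adult_age - 1)


print(generalize_age(107))
print(shift_age(18, -2))
print(shift_age(17, 3))
```

Clamping at the threshold is only one possible design. It slightly distorts ages near the boundary, which may or may not be acceptable for a given study.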

Consider this statement in the medical record: “The patient became ill after eating a [insertNameOfReligiousHolidayHere] meal.” One could argue that removing the name of a religious holiday makes for a neutral record. That might be the appropriate thing to do, but there could be clinical value in knowing the religious holiday or the religion of the patient. It could be useful to know if certain things would be eaten or definitely not eaten.

There are published algorithms for processing textual data and de-identifying it. You can download Perl regular expression scripts from PhysioNet for free. The download also includes several dictionaries that the scripts use. Notice there is a dictionary of medical terms and several dictionaries of people’s names. Obviously there is value in knowing if a word is a person’s name or a medical term. DeBakey appears in the SNOMED dictionary as a medical term. There is a DeBakey clamp, a DeBakey pump, and a DeBakey graft. But what if the patient’s name was DeBakey? DeBakey does not appear in the dictionary of common names. Would the scripts recognize DeBakey as a medical term and not remove what actually is the patient’s name?
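To make the DeBakey question concrete, here is a small Python sketch of dictionary-based scrubbing. It is not the PhysioNet algorithm, which is written in Perl and is far more elaborate. The tiny word sets, the scrub function, and the known_patient_names parameter are all hypothetical stand-ins that I made up for illustration.

```python
import re

# Hypothetical stand-ins for the real dictionaries.
MEDICAL_TERMS = {"debakey", "clamp", "graft", "pump"}
COMMON_NAMES = {"smith", "jones"}


def scrub(text: str, known_patient_names=()) -> str:
    """Replace words that look like names with "[NAME]".

    Words found in known_patient_names are always removed. Otherwise a
    medical term is kept even if it could also be a surname.
    """
    patient = {name.lower() for name in known_patient_names}

    def replace(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in patient:
            return "[NAME]"
        if key in MEDICAL_TERMS:
            return word
        if key in COMMON_NAMES:
            return "[NAME]"
        return word

    return re.sub(r"[A-Za-z]+", replace, text)


note = "Mr. DeBakey needed a DeBakey clamp."
print(scrub(note))
print(scrub(note, known_patient_names=["DeBakey"]))
```

With dictionaries alone, the patient's name survives because it is also a medical term. Supplying the known patient name removes it, but also removes the word from "DeBakey clamp", which shows the trade-off between privacy and clinical detail.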

A known weakness of processing textual data against dictionaries is misspelled words and names. Some names are particularly difficult to spell and will not always be found in the dictionaries because of the inevitable misspellings.
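One possible way to soften the misspelling problem is approximate matching. The Python sketch below uses difflib.get_close_matches from the standard library. The closest_name function, the sample name list, and the 0.8 cutoff are assumptions chosen for illustration, not something taken from the PhysioNet scripts.

```python
import difflib
from typing import Optional, Sequence


def closest_name(word: str, name_dictionary: Sequence[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the dictionary name most similar to word, or None.

    name_dictionary is expected to hold lowercase names. The cutoff of
    0.8 is an arbitrary starting point and needs tuning on real data.
    """
    matches = difflib.get_close_matches(
        word.lower(), name_dictionary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical name dictionary for demonstration.
names = ["debakey", "cook", "thomas"]

print(closest_name("Debakee", names))
print(closest_name("clamp", names))
```

A lower cutoff catches more misspellings but also flags more ordinary words as names, so the threshold is a judgment call rather than a fixed rule.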

Published Wednesday, July 10, 2013 11:25 PM by John Paul Cook


Comments

 

Alex Thomas said:

Hi,

I think there may be an issue with the link to the US Dept of Health guidelines on de-identification techniques.

July 15, 2013 10:52 PM
 

John Paul Cook said:

If you can't reach the HHS site, you can use the link I provided above to Bing's cached copy.

July 15, 2013 11:23 PM


About John Paul Cook

John Paul Cook is both a Registered Nurse and a Microsoft SQL Server MVP experienced in Microsoft SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in the oil and gas, financial, manufacturing, and healthcare industries. Experienced in systems integration and workflow analysis, John is passionate about combining his IT experience with his nursing background to solve difficult problems in healthcare. He sees opportunities in using business intelligence and Big Data to satisfy healthcare meaningful use requirements and improve patient outcomes. John graduated from Vanderbilt University with a Master of Science in Nursing Informatics and is an active member of the Sigma Theta Tau nursing honor society. He is a contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2.