De-identification and Confidentiality of Research Data | Human Research Protection Program (HRPP)

General Principles
Anonymization and pseudonymization
Whole genome/exome data
Regulatory considerations for PHI
Qualitative data
Medical imaging data
Audio/video recordings
Rare diseases, small datasets, and other distinctive data
UCSF Resources
References
Definitions

General Principles

Develop a plan for confidentiality protections and, when appropriate, de-identification before data collection begins. Follow the plan consistently through the lifecycle of the research to protect participant confidentiality and data integrity.
The informed consent process must describe whether and how participants’ confidentiality will be maintained during and after the study.
Limit the amount of identifiable information and sensitive information in your dataset. Collect and record only data that is necessary for the research.
Record and store data in accordance with UCSF policies and procedures. Review the list of commonly approved storage options.
When possible, code or pseudonymize the dataset such that identifiers are stored and secured separately from the research data.
Destroy identifiers (even those secured separately from the research dataset) at the earliest opportunity. Audio or video recordings should be transcribed, and the transcription redacted of identifiable information as soon as possible.
For research involving only secondary data/specimen analyses, the identifiability of the data/specimens is a crucial consideration in the IRB review process. Research using only de-identified data/specimens without subject contact typically does not require informed consent and might not be considered human subjects research that requires formal IRB approval.
Data-sharing agreements should include a requirement that the recipient not attempt to re-identify the data.

Anonymization and pseudonymization

Anonymization is irreversible. All direct identifiers have been permanently removed from the dataset, and no code or key exists that could link the data back to specific individuals.

Pseudonymization, also known as coding, may be reversible or irreversible. Pseudonymization requires that any direct or indirect identifiers be removed from the dataset and replaced with a pseudonym, a numerical code, or generalized/aggregated data. The pseudonym or code must not be derived from any direct identifiers such as birthdate or phone number. HIPAA requires that “the code or other means of record identification is not derived from or related to information about the individual” such as date of birth or initials.

Pseudonymization is reversible when a “key” or a “map” exists to link the pseudonym/code to individual identities. The key is retained and secured separately from the pseudonymized dataset and allows the holder to re-identify the dataset.

Pseudonymized data is irreversibly anonymized when the code or the key is permanently destroyed, or when such a key never existed. The IRB recommends that any key be destroyed at the earliest opportunity.
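
The following is a minimal Python sketch of reversible pseudonymization as described above, not a vetted implementation. It assigns each record a randomly drawn code, so the code is not derived from any identifier, and it returns the key as a separate object so that it can be stored apart from the research data. The field names (`name`, `dob`), the `P-` prefix, and the `pseudonymize` function are hypothetical and chosen only for illustration. The sketch removes only the fields listed in `id_fields`; indirect identifiers would still need to be handled separately.

```python
import secrets

def pseudonymize(records, id_fields=("name", "dob")):
    """Split records into a coded research dataset and a separate key.

    Codes are drawn at random, so they are not derived from any identifier.
    """
    key = {}       # code -> identifiers; store and secure separately
    dataset = []   # coded research data with the identifier fields removed
    for record in records:
        code = "P-" + secrets.token_hex(4)
        while code in key:  # re-draw in the unlikely event of a collision
            code = "P-" + secrets.token_hex(4)
        key[code] = {f: record[f] for f in id_fields if f in record}
        research = {f: v for f, v in record.items() if f not in id_fields}
        research["code"] = code
        dataset.append(research)
    return dataset, key

# Hypothetical example records
records = [
    {"name": "Alex Example", "dob": "1980-02-03", "systolic_bp": 128},
    {"name": "Sam Sample", "dob": "1975-11-30", "systolic_bp": 117},
]
dataset, key = pseudonymize(records)
```

In this sketch, permanently destroying `key` is what would remove the link between the codes and the individuals.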

Whole genome/exome data

Neither the HIPAA Privacy Rule nor the Common Rule considers whole genome/exome sequencing data to be inherently identifiable. The NIH Genomic Data Sharing (GDS) Policy considers genomic data to be de-identified when it “meet[s] the definition for de-identified data in the HHS Regulations for Protection of Human Subjects and [is] stripped of the 18 identifiers listed in the HIPAA Privacy Rule.” Nevertheless, there have been several instances of “de-identified” genomic data being re-identified by researchers and law enforcement. Wan et al. [1] discuss some of these cases.

Because whole genome/exome sequencing data carry a risk of re-identification, such data should be shared only through controlled-access databases. For example, only tenure-track professor-level or senior scientist-level researchers may request access to the NIH database of Genotypes and Phenotypes (dbGaP).

Regulatory considerations for PHI

The HIPAA Privacy Rule outlines two standards by which a covered entity may determine that health information is not individually identifiable: the “Safe Harbor” standard and the “Expert Determination” standard (45 CFR 164.514).

Limited Data Set

A Limited Data Set (LDS) is created by removing all HIPAA identifiers except ZIP codes, city/town, state, and dates. A covered entity may use or disclose an LDS without authorization or a waiver, provided that a Data Use Agreement (DUA) meeting the requirements of the Privacy Rule is in place. Although the Privacy Rule considers an LDS to be identifiable, the Common Rule does not. Research involving only the use of an LDS with a proper DUA generally does not meet the definition of human subjects research.

Qualitative data

De-identifying qualitative data may be challenging because the data are unstructured and de-identification may affect the integrity of the data. When appropriate, it might be advisable to obtain participants’ consent to maintain and publish identifiable information.

Remove direct identifiers such as names, addresses, dates of birth, etc. from the dataset. Reduce the precision of indirect identifiers by generalizing or aggregating the data. For example, cities may be aggregated to a region or dates aggregated to a decade, and job titles may be generalized to an area of expertise.
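
As a rough illustration of the generalization step above, the Python sketch below maps a city to a region, a birth date to a decade, and a job title to an area of expertise. The lookup tables, field names, and the `generalize` function are hypothetical; a real study would define its own categories and decide how to handle values that do not fit them.

```python
from datetime import date

# Hypothetical lookup tables, for illustration only
CITY_TO_REGION = {
    "San Francisco": "Bay Area",
    "Oakland": "Bay Area",
    "Fresno": "Central Valley",
}
JOB_TO_FIELD = {
    "Pediatric nurse": "Health care",
    "Cardiologist": "Health care",
    "Civil engineer": "Engineering",
}

def decade(d):
    """Reduce a date to its decade, e.g. 1984-06-15 becomes '1980s'."""
    return f"{d.year // 10 * 10}s"

def generalize(record):
    """Return a copy of the record with indirect identifiers coarsened."""
    return {
        "region": CITY_TO_REGION.get(record["city"], "Other"),
        "birth_decade": decade(record["birth_date"]),
        "field": JOB_TO_FIELD.get(record["job_title"], "Other"),
    }

example = {
    "city": "Oakland",
    "birth_date": date(1984, 6, 15),
    "job_title": "Cardiologist",
}
print(generalize(example))
# {'region': 'Bay Area', 'birth_decade': '1980s', 'field': 'Health care'}
```

Unmapped values fall back to "Other" in this sketch so that an unexpected city or job title is never passed through unchanged.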

The UCSF Library provides several resources for sharing and de-identifying qualitative data.

Medical imaging data

Medical images may include PHI in the image itself, either as “burned-in” annotations or as a digital overlay. DICOM data also includes PHI in the file header as metadata. UCSF offers the PACS-AIR service to automate the de-identification of header data, but this service cannot remove PHI from the pixels of the image itself. Additional services for removing any remaining PHI include DICOM Cleaner and the UCSF Radiology CRC core.

The American College of Radiology provides specific recommendations for the de-identification of medical images before publication. Ideally, a screenshot that contains only the anatomical area of interest should be published. PHI should be cropped rather than masked.

Brain images such as head CTs may require the removal of facial features such as ears, nose, and lips to prevent re-identification. So-called “defacing” may be achieved by reducing noise, locating the area of interest, and cropping the image.

Although HIPAA generally considers the Safe Harbor method sufficient to de-identify medical images, the Common Rule requires that data not be “readily identifiable” to the investigator. It may be possible for an investigator to readily identify particularly distinctive images from their own patients, even after the removal of the eighteen HIPAA identifiers by an honest broker. The IRB might determine that images from investigators’ own patients cannot be de-identified.

Audio/video recordings

Voiceprints, full-face images, and comparable images are inherently identifiable and must be protected accordingly.

Audiovisual data should be transcribed and the transcription redacted or pseudonymized at the earliest opportunity. Depending on the sensitivity of the data, it may be advisable to transcribe and destroy audiovisual recordings immediately.
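
One way to pseudonymize a finished transcript is sketched below in Python. It assumes the study team already has a list of the names that appear in the recording; the sample transcript, the replacement labels, and the `pseudonymize_transcript` function are hypothetical. The sketch only catches names on that list, so a manual review for other identifying details would still be needed.

```python
import re

def pseudonymize_transcript(text, replacements):
    """Replace each listed name with its pseudonym (whole words, any case)."""
    # Handle longer names first so "Maria Lopez" is replaced before "Maria".
    for name in sorted(replacements, key=len, reverse=True):
        pseudonym = replacements[name]
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        # A function replacement inserts the pseudonym literally.
        text = pattern.sub(lambda _match: pseudonym, text)
    return text

# Hypothetical transcript and name list
transcript = (
    "Interviewer: Maria Lopez, how long have you seen Dr. Chen? "
    "Maria: About two years."
)
replacements = {
    "Maria Lopez": "[Participant 1]",
    "Maria": "[Participant 1]",
    "Chen": "[Clinician A]",
}
redacted = pseudonymize_transcript(transcript, replacements)
print(redacted)
```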

Techniques for de-identifying audiovisual data include audio manipulation and image pixelation or redaction. The Qualitative Data Repository (QDR) notes that such techniques may be cost-prohibitive and can adversely impact the integrity of the data. The QDR suggests that it may be preferable to obtain consent to share audiovisual data, or to share de-identified transcripts.

Rare diseases, small datasets, and other distinctive data

The Common Rule considers data to be de-identified when the individual’s identity may not be readily ascertained by the investigator or otherwise associated with the data. Additionally, the Privacy Rule requires that a “covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information,” for data to be sufficiently de-identified.

It may not always be possible to sufficiently de-identify data by simply removing identifiers in accordance with the Safe Harbor method. Particularly distinctive data such as rare diseases, distinctive images, or other unique circumstances might always be identifiable and should be protected accordingly.
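
As one illustrative heuristic, which is not prescribed by the Privacy Rule or the Common Rule, a study team could count how many records share each combination of indirect identifiers and flag the combinations that occur only rarely. The Python sketch below does this; the field names, the example records, the threshold of 2, and the `flag_small_cells` function are all assumptions made for illustration.

```python
from collections import Counter

def flag_small_cells(records, fields, threshold):
    """Return records whose combination of values in `fields` is shared
    by fewer than `threshold` records in the dataset."""
    def combination(record):
        return tuple(record[f] for f in fields)

    counts = Counter(combination(r) for r in records)
    return [r for r in records if counts[combination(r)] < threshold]

# Hypothetical records that have already had direct identifiers removed
records = [
    {"region": "Bay Area", "birth_decade": "1980s", "diagnosis": "asthma"},
    {"region": "Bay Area", "birth_decade": "1980s", "diagnosis": "asthma"},
    {"region": "Bay Area", "birth_decade": "1980s", "diagnosis": "rare disease X"},
]
flagged = flag_small_cells(
    records, ["region", "birth_decade", "diagnosis"], threshold=2
)
print(flagged)
```

Here only the third record is flagged, because no other record shares its combination of values. A flagged record is a prompt for closer review, not a determination that the record is identifiable.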