Using Unsupervised Learning to Harmonize Data Across Data Systems

Details

Basic Details

Date Posted

Wednesday, July 28, 2021

Status

Complete

Description

Adopting a Common Data Model (CDM) is a critical step that unifies the coding “vocabulary” applied within the Sentinel Distributed Database (SDD). However, despite a common vocabulary, the coding “dialect” (i.e., the use and interpretation of codes for a particular clinical procedure or diagnosis) may differ across Data Partners due to heterogeneity in both care practice and financial drivers. As Sentinel Data Partners and electronic health record (EHR) coding systems become more diverse, there is more potential variation in the way a clinical concept can be coded. Variability in coding habits has been observed for decades [1] and can degrade phenotyping model accuracy and causal inference model performance when models are applied at a new Data Partner.

The goal of the project is to assess the potential of data-driven statistical methods for describing and reducing coding differences between healthcare systems in the SDD. Findings of this project will inform development and deployment of statistical methods and computational tools for transferring knowledge learned from one Data Partner to another and pave the way towards automated curation and harmonization of EHR data in the SDD more broadly.

[1] Crombie, D.L., et al., 1992; King, M.S., et al., 2001; Leone, M.A., et al., 2006

Deliverable(s) (3)

Aim 1 and 2 Computational Protocol: SKAT or Burden Test

Aim 3 Computation Protocol: Code Embedding

Aim 3 Computational Protocol: Code Mapping

Contributors

Workgroup Leader(s)

Xu Shi, PhD; Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI

Workgroup Member(s)

Jennifer Nelson, PhD; David Carrell, PhD; Charissa Tomlinson; Kaiser Permanente Washington Health Research Institute, Seattle, WA

Xianshi Yu, PhD; University of Michigan School of Public Health, Ann Arbor, MI

Patrick Heagerty, PhD, MS; James Floyd, MD; University of Washington, Seattle, WA

Brian Hazlehurst, PhD; Denis Nyongesa; Daniel Sapp; Kaiser Permanente Center for Health Research, Portland, OR

Tianxi Cai, ScD; Department of Biomedical Informatics, Harvard Medical School and Harvard T.H. Chan School of Public Health, Boston, MA

Sudha Raman, PhD; Department of Population Health Sciences at Duke University School of Medicine, Durham, NC

Shirley Wang, PhD; Division of Pharmacoepidemiology and Pharmacoeconomics at Brigham and Women’s Hospital, Boston, MA

Sharon Davis, PhD, MS; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN

Danijela Stojanovic, PharmD, PhD; Sara Karami, PhD, MPH; Patricia Bright, MSPH, PhD; Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD

Yong Ma, PhD; Office of Biostatistics, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD

Jie (Jenni) Li, PhD; Office of Pharmacovigilance and Epidemiology, Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD

Jenna Wong, PhD, MSc; Kathleen Shattuck, MPH; Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, MA