Description

Programmatic Theme: Clinical Research Informatics

Abstract: Healthcare analytics is impeded by a lack of machine learning (ML) model generalizability, the ability of a model to predict accurately on varied data sources not included in the model’s training dataset. We leveraged free-text laboratory data from a Health Information Exchange network to evaluate ML generalization, using Notifiable Condition Detection (NCD) for public health surveillance as a use case. We 1) built ML models for detecting syphilis, salmonella, and histoplasmosis; 2) evaluated the generalizability of these models across data from holdout lab systems; and 3) explored factors that contribute to weak model generalizability. The models for each disease achieved considerable accuracy. However, they generalized poorly to data from the holdout lab systems. Our evaluation determined that weak generalization was influenced by syntactic variation in the free-text data across lab systems. Results highlight the need for actionable methodology to generalize ML solutions for healthcare analytics.

Learning Objective:
Healthcare analytics is impeded by a lack of machine learning (ML) model generalizability, the ability of a model to predict accurately on varied data sources not included in the model’s training dataset.

Machine learning models trained on free-text laboratory data from a Health Information Exchange network can detect syphilis, salmonella, and histoplasmosis with high performance.

However, these models generalize poorly to data from holdout lab systems. Weak generalization is influenced by syntactic variation in the free-text data across lab systems.
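
The holdout evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the lab names, report strings, and the keyword-based `train_keyword_model` are all hypothetical stand-ins chosen only to show how a model trained on some lab systems can be scored on a lab system it never saw, and how differing report syntax alone can lower that score.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple

# (lab_system, free_text_result, label); label 1 = condition reported, 0 = not
Record = Tuple[str, str, int]


def leave_one_lab_out(records: List[Record]) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (holdout_lab, train_records, test_records) once per lab system."""
    by_lab: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        by_lab[rec[0]].append(rec)
    for holdout in sorted(by_lab):
        train = [r for lab, recs in by_lab.items() if lab != holdout for r in recs]
        yield holdout, train, by_lab[holdout]


def train_keyword_model(train: List[Record]) -> Callable[[str], int]:
    """Toy stand-in for an ML model (assumption, not the authors' method):
    flag a report if it contains a token seen only in positive training reports."""
    pos, neg = set(), set()
    for _, text, label in train:
        (pos if label == 1 else neg).update(text.lower().split())
    markers = pos - neg
    return lambda text: int(any(tok in markers for tok in text.lower().split()))


def holdout_accuracy(records: List[Record],
                     trainer: Callable = train_keyword_model) -> Dict[str, float]:
    """Train on all labs but one, then report accuracy on the held-out lab."""
    scores = {}
    for lab, train, test in leave_one_lab_out(records):
        predict = trainer(train)
        correct = sum(predict(text) == label for _, text, label in test)
        scores[lab] = correct / len(test)
    return scores


# Fabricated example data: two hypothetical labs that phrase results differently.
demo_records: List[Record] = [
    ("lab_a", "RPR reactive", 1),
    ("lab_a", "RPR nonreactive", 0),
    ("lab_b", "syphilis ab: POS", 1),
    ("lab_b", "syphilis ab: NEG", 0),
]

if __name__ == "__main__":
    print(holdout_accuracy(demo_records))
```

In this toy data the positive marker learned from one lab ("reactive" or "pos") never appears in the other lab's reports, so every held-out positive is missed. This is the kind of syntactic variation the learning objective refers to.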

Authors:

Gregory Dexter (Presenter)
Regenstrief Institute

Shaun Grannis, Regenstrief Institute
Brian Dixon, Regenstrief Institute
Suranga Kasthurirathne, Regenstrief Institute

Keywords, Themes & Types