Scalable Data-driven Phenotypes via Unsupervised Feature Learning

Abstract

Inferring precise phenotypic patterns from population-scale clinical data is a critical computational task for personalized medicine. The dominant approach uses supervised learning, in which a human expert specifies which patterns to look for (by designating a learning task and class labels) and where to look for them (by constructing input features). This approach scales poorly and misses unexpected patterns, which are the most informative. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that form a compact, expressive representation of the source data without depending on class labels or human input. Unsupervised features have become the state of the art for image and speech recognition. In this work, we introduce the idea of using them for phenotype recognition. We present an unsupervised approach for inferring broadly useful, data-driven micro-phenotypes in the form of continuous, longitudinal features learned from noisy, sparse, and irregular electronic medical record (EMR) data. In our demonstration project, unsupervised features learned from longitudinal laboratory values were as accurate (0.96 AUC) in a classification task unknown to the learning algorithm as gold-standard features engineered by an expert with complete knowledge of the domain, the data, and the class labels.