Embedding medical journeys with machine learning to improve member health at Aetna

BI, Reporting & Business Use Cases

Jai Bansal

Senior Manager, Data Science at Aetna

This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.

Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (>10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.

To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.

We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.

This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.

We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.