AI-Driven Classification of Immunization-Related Clinical Data
Overview
The Cal Poly Digital Transformation Hub (DxHub), powered by Amazon Web Services (AWS), collaborated with HLN Consulting to explore how artificial intelligence (AI) can support public health by automating the classification of immunization-related clinical data. Together, the team designed and developed a prototype that dynamically parses unstructured and structured patient data and maps it, where possible, to Clinical Decision Support for Immunization (CDSi) codes. These CDSi codes are used to determine which immunizations a patient may need based on certain medical conditions. The prototype leverages large language models (LLMs), medical ontologies, and cloud-native services to help clinicians navigate complex electronic health records (EHRs) and ensure patients receive timely, accurate immunization guidance.
Problem
For public health agencies and clinical partners, determining which immunizations a patient needs depends on accurately identifying underlying health conditions. While CDSi codes provide a standard for making these decisions, patient records often contain free text and inconsistent terminology, or lack structured codes. This creates a significant challenge: if systems can’t recognize what conditions a patient has, they can’t reliably recommend immunizations. HLN saw an opportunity to apply AI to help bridge this gap by developing a solution that could intelligently map medical conditions to CDSi codes even when data is unstructured or variably described.
Innovation In Action
The team interviewed subject matter experts to pinpoint the most frequent challenges health care providers face when interpreting patient data. From these findings, the team envisioned two complementary solutions: one using LLMs to match unstructured patient conditions to CDSi codes, and a second that leverages existing SNOMED codes in structured documents to achieve the same goal. The team then quickly prototyped both to assess their viability.
Technical Solution
The prototype explored two distinct but synergistic approaches to CDSi code classification:
1. LLM-Based Classification
- Starts with a patient’s electronic health record in text format.
- Extracts only current conditions (those without an end date) from the record.
- Retrieves a list of CDSi codes and their descriptions stored in Amazon Simple Storage Service (S3).
- Sends both the conditions and the codes to a large language model (LLM) accessed through Amazon Bedrock, with a custom prompt instructing the model to match conditions only when there is clear alignment, avoiding assumptions or inferences.
The model responds with:
- A list of matched CDSi codes
- Corresponding observation titles
- References linking each match to the source condition in the patient record
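The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the prototype's actual code: the `Condition` shape, the function names, the prompt wording, and the JSON reply format are all hypothetical, and the S3 retrieval and Amazon Bedrock call are left out so the sketch runs with the standard library alone.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Condition:
    """One condition entry from a patient record (hypothetical shape)."""
    name: str
    onset: str
    end: Optional[str] = None  # None means the condition has no end date


def current_conditions(conditions):
    """Keep only conditions without an end date."""
    return [c for c in conditions if not c.end]


def build_prompt(conditions, cdsi_codes):
    """Assemble a prompt asking for clear matches only (wording is illustrative)."""
    condition_lines = "\n".join(f"- {c.name}" for c in conditions)
    code_lines = "\n".join(
        f"- {code}: {title}" for code, title in sorted(cdsi_codes.items())
    )
    return (
        "Match each patient condition to a CDSi code only when there is clear "
        "alignment. Do not assume or infer.\n"
        'Reply with a JSON list of objects with keys "code", "title", "source".\n\n'
        f"Patient conditions:\n{condition_lines}\n\n"
        f"CDSi codes:\n{code_lines}\n"
    )


def parse_matches(reply, cdsi_codes, conditions):
    """Parse the model reply and drop matches not grounded in the inputs."""
    known_sources = {c.name for c in conditions}
    return [
        match
        for match in json.loads(reply)
        if match.get("code") in cdsi_codes and match.get("source") in known_sources
    ]
```

In use, the string returned by `build_prompt` would be sent to the model, and the model's text reply passed to `parse_matches`. Checking each returned code and source against the inputs is one simple way to guard against matches the model invents.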
2. SNOMED-to-CDSi Mapping
Recognizing that many EHRs include SNOMED clinical terms, a second approach uses SNOMED codes as a bridge to CDSi classification:
- Direct Matching: SNOMED codes are extracted from structured sections of the patient’s CDA-formatted health document, including conditions and surgeries. These codes are then matched to CDSi codes using a lookup table stored in Amazon DynamoDB.
- AI-Based Extraction: For conditions buried in unstructured text, Amazon Comprehend Medical extracts SNOMED codes, which are then queried against the same DynamoDB mapping table to return applicable CDSi codes.
These two methods give health agencies flexibility depending on the structure and quality of incoming patient data—enabling a robust CDSi classification pipeline across diverse EHR formats.
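The direct-matching path can be sketched roughly as below. This is an illustration under assumptions rather than the prototype's implementation: a plain dictionary stands in for the DynamoDB lookup table, the function names are hypothetical, and the sketch assumes one SNOMED code may map to several CDSi codes. It identifies SNOMED CT entries by the code system OID `2.16.840.1.113883.6.96` on coded elements.

```python
import xml.etree.ElementTree as ET

SNOMED_OID = "2.16.840.1.113883.6.96"  # code system OID for SNOMED CT


def extract_snomed_codes(cda_xml):
    """Return distinct SNOMED CT codes from coded elements, in document order."""
    root = ET.fromstring(cda_xml)
    codes = []
    for element in root.iter():
        # Coded CDA elements carry unqualified code/codeSystem attributes.
        if element.get("codeSystem") == SNOMED_OID:
            code = element.get("code")
            if code and code not in codes:
                codes.append(code)
    return codes


def map_to_cdsi(snomed_codes, mapping):
    """Look up each SNOMED code in the mapping; unmapped codes are skipped.

    `mapping` is a dict standing in for the lookup table (an assumption):
    SNOMED code -> list of CDSi codes.
    """
    matched = []
    for code in snomed_codes:
        for cdsi_code in mapping.get(code, []):
            if cdsi_code not in matched:
                matched.append(cdsi_code)
    return matched
```

The same `map_to_cdsi` step would serve the AI-based extraction path as well, since both paths end with a list of SNOMED codes queried against one mapping.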
Experiments & Abandoned Approaches
The team also explored several innovative but ultimately unused paths, which are documented to avoid future repetition and to share learnings with the broader community:
Medical BERT Similarity Analysis
The team tested several pretrained medical BERT models by comparing semantic similarity between patient condition terms and CDSi observations across relation types like synonymy, causality, and comorbidity. Surprisingly, unrelated or even opposite terms often scored higher than clinically relevant ones, revealing a disconnect between linguistic similarity and actual clinical relationships. As a result, this approach was deemed unreliable for downstream decision-making.
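For readers unfamiliar with this kind of comparison, the sketch below shows one common way to score semantic similarity: cosine similarity between embedding vectors, followed by a ranking. It is only an assumption that the experiment used cosine similarity, the function names are hypothetical, and the embedding step (turning a term into a vector with a pretrained model) is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vector, candidates):
    """Return (label, score) pairs sorted from most to least similar.

    `candidates` maps a label (e.g. a CDSi observation title) to its
    embedding vector; the vectors would come from the embedding model.
    """
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A ranking like this reflects only how close two vectors are in the embedding space, which is why a high score does not by itself indicate a clinically meaningful relationship.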
Context Expansion via Hyde AI Mechanism
Another early experiment used a Hyde AI-inspired approach to enrich LLM prompts with additional concepts—like likely medications or symptoms—based on each CDSi observation title. While this enriched context showed some promise, the model frequently made incorrect assumptions about clinical relationships, leading to misclassifications. Due to its unpredictability and lack of clinical reasoning, this method was ultimately set aside.
Supporting Documents
| Source Code | All of the code and assets developed during the course of creating the prototype. |
About the DxHub
The Cal Poly Digital Transformation Hub (DxHub) is a strategic relationship with Amazon Web Services (AWS) and is the world’s first cloud innovation center supported by AWS on a University campus. The primary goal of the DxHub is to provide students with real-world problem-solving experiences by immersing them in the application of proven innovation methods in combination with the latest technologies to solve important challenges in the public sector. The challenges being addressed cover a wide variety of topics including homelessness, evidence-based policing, digital literacy, virtual cybersecurity laboratories and many others. The DxHub leverages the deep subject matter expertise of government, education, and non-profit organizations to clearly understand the customers affected by public sector challenges and develop solutions that meet the customer needs.
