Mapping Genomics Data

Overview
The Cal Poly Digital Transformation Hub (DxHub), powered by Amazon Web Services (AWS), partnered with the Wisconsin State Laboratory of Hygiene and Virginia Department of General Services Division of Consolidated Laboratory Services (DCLS) to revolutionize genomic data standardization. The project resulted in ‘GenomicsMapper,’ an innovative AI-powered solution that automates the conversion of laboratory-specific genomic terminology into standardized formats required by federal repositories. This transformation reduces manual processing time from 15-20 hours to 2-3 hours per week while significantly improving submission accuracy and compliance.
Problem
Public health laboratories across the United States process thousands of genomic samples monthly, each using locally-optimized terminologies that must be standardized before submission to federal repositories. Bioinformatics teams spend an average of [15-20 hours per week] manually mapping data fields to meet National Center for Biotechnology Information (NCBI) submission requirements. This manual process results in submission rejection rates as high as 35%, creating costly delays in making vital genomic data available to the scientific community. The challenge represents over 780 hours annually per laboratory spent on data formatting rather than critical analysis and response activities.
Innovation In Action
The DxHub team developed GenomicsMapper, a cutting-edge solution that transforms how laboratories prepare genomic data for public repositories. The system leverages generative AI to automatically align diverse laboratory terminologies with federally standardized formats. Through natural language processing capabilities, GenomicsMapper analyzes laboratory-specific terms and matches them to standardized NCBI BioSample definitions, ensuring consistent and accurate data submission.
Technical Solution
The application’s architecture is API-driven and utilizes several AWS managed services for security, scalability, and reliability. Amazon Bedrock and the Claude foundation model Sonnet 3.5 v2 power the secure processing of sensitive genomic metadata. AWS Lambda and Amazon API Gateway provide the backend application logic, while data mapping definitions are securely stored using Amazon Simple Storage Service (Amazon S3). The system’s sophisticated natural language processing capabilities enable it to parse and analyze source data structure and terminology, then match fields with corresponding NCBI standardized terms using comprehensive definition libraries.
Next Steps
Early implementation results demonstrate transformative potential with 98.5% accuracy in field mapping across 10,000 samples and a 75% decrease in submission rejection rates. As genomic sequencing becomes increasingly central to public health surveillance and clinical diagnostics, the solution is positioned for broader adoption across public health laboratories nationwide.
Supporting Documents
Source Code | All of the code and assets developed during the course of creating the prototype. |
About the DxHub
The Cal Poly Digital Transformation Hub (DxHub) is a strategic relationship with Amazon Web Services (AWS) and is the world’s first cloud innovation center supported by AWS on a University campus. The primary goal of the DxHub is to provide students with real-world problem-solving experiences by immersing them in the application of proven innovation methods in combination with the latest technologies to solve important challenges in the public sector. The challenges being addressed cover a wide variety of topics including homelessness, evidence-based policing, digital literacy, virtual cybersecurity laboratories and many others. The DxHub leverages the deep subject matter expertise of government, education, and non-profit organizations to clearly understand the customers affected by public sector challenges and develop solutions that meet the customer needs.