Automating Housing Record Extraction with Amazon Textract for SLO County

Overview

The California Polytechnic State University, San Luis Obispo (Cal Poly) Digital Transformation Hub (DxHub), powered by Amazon Web Services (AWS), collaborated with the City of San Luis Obispo’s IT Department to automate and streamline the extraction of vital housing records from state-issued documentation.

The City’s Office of Sustainability and Natural Resources is developing a concierge service to help residents of manufactured homes upgrade their units. This service will assist with improving energy efficiency, supporting clean energy adoption, and enhancing health outcomes. As part of the project’s data collection phase, the City received scanned title and registration documents in PDF format. Manually reviewing these forms and entering key information, such as decal numbers, manufacturer details, serial numbers, and sale/transfer information, places a significant burden on administrative staff, especially as the volume of documents grows.

Recognizing the opportunity for innovation, the DxHub team developed a lightweight, ML-enabled web application for document data processing. This application leverages Amazon Textract, an intelligent document processing service that uses machine learning to extract key information from title and registration records and turn it into structured, actionable data. By providing a simple, user-friendly interface to upload documents and view extracted results, the solution reduces processing time, improves accuracy, and sets a foundation for scaling document handling and extraction.

This project exemplifies the DxHub’s mission of applying emerging cloud technologies to solve real-world public sector challenges while giving Cal Poly students direct experience with practical innovation.

Problem

To best serve manufactured home communities and assist with unit upgrades, the City must maintain accurate, up-to-date information about each property. This includes key data points like the home’s decal number, manufacturer, model, manufactured date, serial numbers, dimensions, ownership history, and sale price.

For the City of San Luis Obispo, this information typically arrives in the form of standardized PDF documents. Extracting the necessary data manually is tedious and time-intensive, often taking several minutes per document and opening the door to potential human errors such as typos, missed fields, or inconsistent formatting.

With plans to process records across ten different parks, the City expressed interest in reducing the time spent on manual data entry. By leveraging machine learning, they hoped to streamline the workflow and shift staff focus toward higher-value tasks.

The challenge here was clear: how to reliably automate the extraction of structured, usable data from standardized scanned forms, even when some documents contained incomplete information.

Innovation In Action

To address this challenge, the DxHub team designed and built the Document Data Extractor: an ML-enabled web application that simplifies the document intake and review process.

Using Streamlit for the frontend, the application presents a clean, intuitive interface where staff members can upload a single PDF or an entire batch of PDFs and view the extracted data within seconds. Under the hood, each uploaded file is securely processed using Amazon Textract’s AnalyzeDocument API, which detects forms and tables and extracts the key-value pairs and structural relationships found within the document.
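As a rough illustration of the Textract step described above, the sketch below calls AnalyzeDocument with the FORMS and TABLES features and walks the returned blocks to collect key-value pairs. The function names, the default region, and the flat dictionary output are assumptions made for this example and are not taken from the project's code. Table cells are not parsed here, and multi-page PDFs would likely need Textract's asynchronous StartDocumentAnalysis operation rather than the synchronous call shown.

```python
from __future__ import annotations

from typing import Any


def analyze_pdf_bytes(data: bytes, region: str = "us-west-2") -> dict[str, Any]:
    """Send one document to Textract's synchronous AnalyzeDocument API.

    The region default is a placeholder assumption.
    """
    import boto3  # imported lazily so the parsing helpers work without AWS access

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(
        Document={"Bytes": data},
        FeatureTypes=["FORMS", "TABLES"],
    )


def _text_of(block: dict[str, Any], by_id: dict[str, dict[str, Any]]) -> str:
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                words.append(child.get("Text", ""))
    return " ".join(words)


def key_value_pairs(response: dict[str, Any]) -> dict[str, str]:
    """Collect form key-value pairs from an AnalyzeDocument response."""
    by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    pairs: dict[str, str] = {}
    for block in by_id.values():
        is_key = (
            block.get("BlockType") == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        key = _text_of(block, by_id)
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    if value_id in by_id:
                        value_parts.append(_text_of(by_id[value_id], by_id))
        if key:
            # A key with no detected value maps to an empty string.
            pairs[key] = " ".join(part for part in value_parts if part)
    return pairs
```

Keeping the parsing separate from the AWS call means the block-walking logic can be exercised against a saved or hand-built response without credentials.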

The extracted data is then parsed and organized into a structured JSON format, clearly listing:

  • Decal Number
  • Manufacturer
  • Model
  • Manufactured Date
  • First Sold Date
  • Serial Numbers with HUD Insignia, Length, Width
  • Record Conditions
  • Last Reported Registered Owner
  • Sale/Transfer Details
  • Situs Address
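To make the shape of that output concrete, here is one possible way to model the fields listed above and serialize them to JSON. The class names, attribute names, and nesting are illustrative assumptions; the source does not specify the exact schema the application emits, and the sample values are placeholders rather than real records.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class SerialEntry:
    """One serial number with its HUD insignia and dimensions (assumed layout)."""
    serial_number: str = ""
    hud_insignia: str = ""
    length: str = ""
    width: str = ""


@dataclass
class HomeRecord:
    """Hypothetical record mirroring the fields listed above."""
    decal_number: str = ""
    manufacturer: str = ""
    model: str = ""
    manufactured_date: str = ""
    first_sold_date: str = ""
    serial_numbers: list[SerialEntry] = field(default_factory=list)
    record_conditions: list[str] = field(default_factory=list)
    last_reported_registered_owner: str = ""
    sale_transfer_details: dict[str, str] = field(default_factory=dict)
    situs_address: str = ""

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses such as SerialEntry.
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
example = HomeRecord(
    decal_number="LAA0000",
    manufacturer="EXAMPLE HOMES",
    serial_numbers=[SerialEntry(serial_number="S0001", length="60", width="12")],
)
print(example.to_json())
```

Because every attribute has a default, a record can be created even when only a few fields were found in a document.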

One key innovation of the system is its resilience: not all documents contain every field. Some may lack sale price details; others may omit registered owner information. The application handles these cases gracefully, leaving empty fields without crashing or requiring manual intervention.
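One way to get this kind of tolerance is to look up each expected field by its label and fall back to an empty string when nothing is found. The sketch below assumes a hypothetical label map; the real label text on the title and registration forms is not given in the source, so the variants listed here are guesses for illustration.

```python
import re

# Hypothetical mapping from output field names to label variants that might
# appear on a form. Variants are written in normalized (lowercase) form.
FIELD_LABELS = {
    "decal_number": ["decal", "decal number", "decal #"],
    "manufacturer": ["manufacturer"],
    "model": ["model"],
    "registered_owner": ["registered owner", "last reported registered owner"],
    "sale_price": ["sale price", "price"],
}


def _normalize(label):
    """Lowercase a label and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9#]+", " ", label.lower()).strip()


def map_fields(pairs, field_labels=FIELD_LABELS):
    """Map raw key-value pairs onto known fields, leaving gaps empty."""
    normalized = {_normalize(k): (v or "").strip() for k, v in pairs.items()}
    record = {}
    for field_name, labels in field_labels.items():
        # Take the first label variant with a non-empty value, else "".
        record[field_name] = next(
            (normalized[label] for label in labels if normalized.get(label)),
            "",
        )
    return record


# A document missing owner and sale price still yields a complete record.
partial = {"Decal #:": "LAA0000", "Manufacturer": "EXAMPLE HOMES"}
print(map_fields(partial))
```

Every output record has the same set of keys, so downstream code does not need to special-case incomplete documents.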

 

Technical Solution

The Document Data Extractor is built with flexibility, security, and performance in mind. The frontend is developed using Streamlit, providing a responsive, browser-accessible interface. On the backend, Amazon Textract performs the core analysis of scanned documents, using its “Forms” and “Tables” features to extract structured data. The raw output is then processed by custom Python scripts, which map the data into clean, consistent field structures suitable for storage, review, or export. Uploaded PDFs are handled securely, stored in temporary locations, and deleted immediately after processing to minimize data persistence and reduce security risks. The system currently supports CSV downloads, multi-document batch uploads, and dashboard views for tracking extraction progress across entire parks. This modular architecture ensures the solution can be easily extended, adapted to new document types, and scaled as the City’s needs evolve.
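The temporary-file handling and CSV export described above could look roughly like the following. This is a minimal sketch under assumptions: the helper names are hypothetical, the `extract` callback stands in for whatever Textract and parsing logic the application actually runs, and the CSV columns are simply whatever keys the records contain.

```python
import csv
import io
import os
import tempfile


def process_upload(data, extract):
    """Write an upload to a temp file, run extract(path), then delete the file.

    `extract` is a caller-supplied function (an assumption of this sketch).
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return extract(path)
    finally:
        # Remove the file whether extraction succeeded or raised.
        os.remove(path)


def records_to_csv(records):
    """Render a list of flat dict records as CSV text."""
    if not records:
        return ""
    # Preserve first-seen column order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

In a Streamlit app, the resulting CSV text could be handed to `st.download_button` to offer the file to the user, though how the project wires this up is not described in the source.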

Impact

With the Document Data Extractor, the City of San Luis Obispo now has a tool that can cut document processing time from minutes per file to seconds while improving data reliability.

By reducing staff workload on repetitive clerical tasks, the City frees up valuable time to focus on higher-impact activities such as strategic planning, property management, and resident services.

Additionally, the system’s consistent structuring of extracted data supports better reporting, compliance tracking, and operational transparency.

This project also provides a repeatable model for other municipalities with similar workflows, leveraging an ML-enhanced system built to efficiently process batches of documents.

Student Spotlight

Sharon Liang

Software Developer

Dhvani Goel

Software Developer

Supporting Documents

Source Code: All of the code and assets developed during the course of creating the prototype.

About the DxHub

The Cal Poly Digital Transformation Hub (DxHub) is a strategic relationship with Amazon Web Services (AWS) and is the world’s first cloud innovation center supported by AWS on a university campus. The primary goal of the DxHub is to provide students with real-world problem-solving experiences by immersing them in the application of proven innovation methods in combination with the latest technologies to solve important challenges in the public sector. The challenges being addressed cover a wide variety of topics including homelessness, evidence-based policing, digital literacy, virtual cybersecurity laboratories, and many others. The DxHub leverages the deep subject matter expertise of government, education, and non-profit organizations to clearly understand the customers affected by public sector challenges and develop solutions that meet their needs.