The Evolution of Service Level Agreements: Why AI Evaluations Matter in Mortgage
By Tela G. Mathias
Traditional service level agreements (SLAs) are how we measure technology performance in the mortgage industry, and really in all software solutions. These agreements historically focused on quantifiable metrics such as system uptime, response times, and service availability. Efforts to scale and increase adoption of generative artificial intelligence (genAI)-based solutions in mortgage have created a need for more sophisticated performance measures that go beyond traditional operational metrics.
While traditional SLAs effectively measure whether a web-based loan origination system (LOS) loads quickly or whether an automated underwriting system (AUS) remains accessible, they fall short in evaluating the quality and reliability of genAI. A system can maintain perfect uptime while delivering inaccurate or biased results. This gap between operational performance and actual effectiveness necessitates a new framework for measuring AI system performance.
AI evaluations shift how we measure technology performance in mortgage lending. These systematic assessment methods focus on the quality and reliability of AI outputs, rather than just system operational performance. For instance, let’s imagine a hypothetical genAI agent whose objective is to resolve complaints from consumers regarding escrow shock, an unexpected and significant increase in a homeowner’s monthly mortgage payment due to changes in their escrow account requirements.
This agent monitors email to identify complaints of this type, runs a root cause analyzer, creates a management action plan, kicks off a workflow for a human in the loop, and presents the contextual plan to the operator for review and communication to the homeowner. We might need an evaluation framework to measure:
- Complaint classification accuracy. Did the AI system find the right complaints? Did it miss any?
- Root cause analysis quality. Did the AI system correctly determine the root cause of the complaint?
- Action plan effectiveness. Did the AI system create the right action plan? Did it correctly report the complaint to the CFPB? Did it annotate the system correctly?
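To make the first metric concrete, here is a minimal sketch of how complaint classification accuracy might be scored against a human-labeled test set. Everything here is illustrative: the sample emails, labels, and function names are assumptions, and a real harness would call the deployed agent rather than use hard-coded predictions.

```python
# Minimal sketch of an evaluation harness for the hypothetical escrow-shock
# agent's complaint classifier. Labels and predictions are illustrative.

def evaluate_classification(predictions, labels):
    """Compare the agent's complaint flags against human-labeled ground truth."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)       # correctly flagged
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)   # wrongly flagged
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)   # missed complaints
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # "did it find the right ones?"
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # "did it miss any?"
    return {"precision": precision, "recall": recall}

# Five sample emails: human labels vs. what the model flagged
labels      = [True, True, False, True, False]   # true escrow-shock complaints
predictions = [True, False, False, True, False]  # the agent's classifications

print(evaluate_classification(predictions, labels))
# precision 1.0 (no false flags), recall ~0.67 (one complaint missed)
```

Tracking recall separately from precision matters here: a missed complaint is a consumer who never gets a response, which is a different failure mode than a false flag that merely wastes an operator's time.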
These types of metrics fit well within an organization’s responsible AI (RAI) framework and help us evaluate performance against the reliability pillar in particular.
The emergence of open-source evaluation tools has made it feasible, if still technically challenging, for mortgage companies to implement RAI frameworks. Tools like promptfoo enable systematic testing of large language models, helping organizations:
- Validate model outputs against established criteria
- Identify potential security vulnerabilities
- Ensure consistent performance across different scenarios
- Monitor and maintain compliance with industry standards
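As one sketch of what this looks like in practice, promptfoo is driven by a YAML configuration that pairs prompts with declarative test assertions. The prompt wording, provider, and test cases below are illustrative assumptions for the escrow-shock scenario, not a production setup.

```yaml
# promptfooconfig.yaml -- illustrative sketch; prompt, provider, and cases are placeholders
prompts:
  - "Does the following email describe an escrow-shock complaint? Answer YES or NO.\n\n{{email}}"

providers:
  - openai:gpt-4o-mini

tests:
  - description: "True escrow-shock complaint is flagged"
    vars:
      email: "My monthly payment jumped $400 because of an escrow adjustment I was never told about."
    assert:
      - type: icontains
        value: "YES"
  - description: "Unrelated servicing question is not flagged"
    vars:
      email: "How do I update my mailing address on my account?"
    assert:
      - type: icontains
        value: "NO"
```

Running `promptfoo eval` against a config like this produces a pass/fail matrix across test cases, which is exactly the kind of repeatable, auditable evidence an evaluation-based SLA would reference.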
As genAI continues to transform mortgage lending, the industry should adopt evaluation-based performance metrics that match the sophistication of these new technologies. This evolution from traditional SLAs to evaluation frameworks will help ensure that AI systems operate reliably and deliver trustworthy, compliant, and fair results.
Organizations that adapt their performance measurement approaches to include evaluations will be better positioned to leverage AI technologies effectively while maintaining high standards of accuracy and fairness. I believe regulators and housing agencies should look for evaluation-based performance frameworks in genAI-based systems, and I will encourage them to do so.