Implementing NIST AI RMF: Measuring (Part 3 of 4)
Beyond Trust Theater: Metrics That Matter
This is Part 3 of a four-part series on implementing practical AI governance using the NIST AI Risk Management Framework. This series transforms NIST's four core functions (GOVERN, MAP, MEASURE, MANAGE) into actionable implementation strategies.
We’ve built enabling governance structures (GOVERN) and gained visibility into hidden AI deployments (MAP). Now we address the measurement gap: why organizations focus on technical performance metrics while ignoring the trustworthiness characteristics that determine whether AI systems are safe to deploy. The MEASURE function establishes evaluation frameworks that separate meaningful assessment from trust theater.
I spoke with an org whose data team now spends just 90 minutes reviewing a few dozen AI metrics each quarter. This wasn't an LLM company or mega-corp, just a mid-sized org using AI to augment customer service. Dashboards tracked model accuracy (98.3%), latency (sub-200ms), uptime (99.9%), user adoption (50%). Beautiful layout, color-coded alerts, automated reporting. No notes!
But when leadership asked about consistency monitoring and customer trust evaluation, the team came back with answers a few days later:
eight months deployed
$300K invested
zero trustworthiness assessment conducted
CSAT scores down 23%, with complaints about inconsistent AI behavior
The AI was technically strong and broadly adopted, “perfect” even, but delivered inconsistent experiences. The team measured everything except whether the system was trustworthy enough to deploy safely.
This is a Trust Theater problem. Orgs build elaborate technical measurement systems while ignoring basic safety and fairness evaluations that are the real success indicators. In Part 2, we discussed discovering hidden AI systems through MAP. Now, the challenge is to measure those systems for trustworthiness, not just performance.
The disconnect is glaring. IBM reports that 75% of AI initiatives fail [1], yet orgs continue to report success based on technical metrics alone. It’s like tracking uptime while ignoring response quality, or measuring “App Service Availability” instead of “Application Availability”. (Microsoft Azure-loving readers will relate to that one!)
The measurement gap doesn’t stem from a lack of data, but from measuring the wrong things. The NIST framework integrates with any measurement culture (KPI, OKR, or hybrid), but only if it focuses on trustworthiness, not just efficiency.
The Disconnect
We’ve seen this pattern since AI moved from the bench to the desktop. Orgs measure what feels technically significant (accuracy, latency, throughput) while ignoring NIST-defined trustworthiness characteristics [2, Section 3.2.2: Characteristics of Trustworthy AI].
Traditional metrics don’t capture AI trustworthiness, so you need to expand your measurement scope significantly. AI systems require evaluation across seven NIST characteristics:
validity/reliability
safety
security/resilience
accountability/transparency
explainability/interpretability
privacy enhancement
fairness with managed bias
It’s easy to count accuracy, much harder to measure explainability. But without evaluating trustworthiness across those characteristics, technical metrics are misleading. Note that I refrained from saying “pointless” or “useless” here, but in reality, unless users can and do trust a system, the rest is just rearranging the furniture.
Orgs that evaluate trustworthiness before deployment avoid drift and inconsistencies that erode satisfaction. Those that don’t often encounter reliability and trust issues post-deployment [3]. The orgs seeing success define trust thresholds before deployment, assess across all seven characteristics, monitor continuously, and maintain accountability structures.
This isn’t compliance overhead; it’s what enables sustainable value. A 20% improvement in consistency evaluation isn’t just risk management. It prevents churn, rework, and system collapse.
The NIST MEASURE Function
The NIST MEASURE function [2, Section 3.3: MEASURE Function] defines a structured approach to trustworthiness evaluation. It applies the same risk-based logic introduced in GOVERN and MAP, now to measurement intensity, and you can reuse your prior classifications:
Red: comprehensive evaluation
Yellow: systematic assessment
Green: targeted validation
MEASURE 1: Foundation Selection
Start by selecting methods and metrics that target the most significant risks identified during MAP. Not what’s easiest to collect, but what matters most. (Risk significance > convenience.) The framework supports adaptation to KPI cultures (completion rates, coverage metrics) or OKRs (assessment targets, confidence levels).
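To make that concrete, here’s a minimal Python sketch of risk-driven metric selection, assuming you carry the Red/Yellow/Green tiers forward from your MAP inventory. The per-tier metric sets and the class/function names are illustrative, not a NIST-mandated list.

```python
# Hypothetical sketch: let the MAP risk tier, not collection convenience,
# decide which trustworthiness characteristics a system must report on.
from dataclasses import dataclass

# Metric sets per tier are illustrative placeholders, not a NIST requirement.
EVALUATION_PLANS = {
    "red":    {"validity", "safety", "security", "accountability",
               "explainability", "privacy", "fairness"},             # comprehensive
    "yellow": {"validity", "safety", "fairness", "explainability"},  # systematic
    "green":  {"validity", "safety"},                                # targeted
}

@dataclass
class AISystem:
    name: str
    risk_tier: str  # "red", "yellow", or "green" from the MAP inventory

def measurement_plan(system: AISystem) -> set[str]:
    """Return the trustworthiness characteristics this system must be measured on."""
    return EVALUATION_PLANS[system.risk_tier]

print(measurement_plan(AISystem("support-chatbot", "yellow")))
```

The point of the sketch is the direction of dependency: the plan is derived from risk significance, never from whichever metrics happen to be easy to pull from existing dashboards.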
MEASURE 2: Evaluating Trustworthiness Characteristics
This is the substantive work. Systematic evaluation across all seven trustworthiness characteristics:
Foundation characteristics:
Validity/reliability: performance under deployment conditions
Safety: acceptable risk boundaries and drift monitoring
Security/resilience: resistance to attack, recoverability
Trust-enabling characteristics:
Accountability/transparency: stakeholder access and auditability
Explainability/interpretability: clarity for oversight
Privacy enhancement: protection of sensitive data
Fairness: mitigation of systemic bias
(And no, this isn’t box-checking! It’s systematic analysis + deployment safety.)
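One lightweight way to keep the evaluation honest is to record all seven characteristics in a single assessment object, so gaps are visible rather than silently skipped. This is an illustrative sketch; the 0-to-1 scoring scale and the class and field names are assumptions, not part of the NIST framework.

```python
# Illustrative only: one evaluation record covering all seven characteristics,
# so no assessment ships with unexamined gaps.
from dataclasses import dataclass, field

CHARACTERISTICS = [
    "validity_reliability", "safety", "security_resilience",
    "accountability_transparency", "explainability_interpretability",
    "privacy_enhancement", "fairness_managed_bias",
]

@dataclass
class TrustworthinessAssessment:
    system_name: str
    scores: dict[str, float] = field(default_factory=dict)  # assumed 0.0–1.0 scale

    def missing(self) -> list[str]:
        """Characteristics that have not been evaluated yet."""
        return [c for c in CHARACTERISTICS if c not in self.scores]

    def is_complete(self) -> bool:
        return not self.missing()

assessment = TrustworthinessAssessment("support-chatbot")
assessment.scores["safety"] = 0.92
print(assessment.missing())  # six characteristics still unevaluated
```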
MEASURE 3: Risk Tracking Mechanisms
Initial assessments don’t hold value without ongoing monitoring. AI systems drift, fairness degrades, safety limits shift, and explainability declines under new conditions.
To mitigate these, implement systematic monitoring [2, Section 3.3.3: Risk Tracking]. As a shorthand for coverage, make sure you have just three things (a minimal sketch follows the list):
something to track behavior drift,
alert triggers for safety checks, and
a mechanism for users and staff to report issues.
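Here’s a minimal sketch of those three pieces, assuming you already log a per-period metric such as a weekly consistency score. The tolerance value, function names, and the in-memory report list are placeholders for your own monitoring and ticketing stack.

```python
# A minimal drift monitor plus a user/staff report channel; thresholds and
# storage are placeholders, the shape of the mechanism is the point.
from statistics import mean

def drift_alert(baseline: float, recent_values: list[float],
                tolerance: float = 0.05) -> bool:
    """Trigger an alert when the recent average drifts away from the baseline."""
    return abs(mean(recent_values) - baseline) > tolerance

# Hypothetical report channel: what matters is that reports are captured
# and reviewable, not the storage mechanism.
issue_reports: list[dict] = []

def report_issue(system: str, description: str, reporter: str) -> None:
    issue_reports.append({"system": system,
                          "description": description,
                          "reporter": reporter})

if drift_alert(baseline=0.93, recent_values=[0.88, 0.86, 0.87]):
    report_issue("support-chatbot",
                 "Consistency score drifting below baseline", "monitoring")
print(issue_reports)
```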
MEASURE 4: Measurement Effectiveness Feedback
Do your measurement systems work in practice? Validation happens through SME review, stakeholder feedback, and performance monitoring [2, Section 3.3.4: Feedback].
Do consistency checks catch real issues? Are safety metrics preventing incidents? Is explainability supporting human oversight? The goal is not theoretical assurance but actual deployment-relevant performance measurement.
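One way to make those questions measurable: during SME review, label past alerts as confirmed issues or false alarms, and count the incidents your alerts missed. The sketch below assumes exactly that labeling exists; the numbers and function names are hypothetical.

```python
# A rough effectiveness check, assuming SME review labels each past alert
# as a confirmed issue or a false alarm, and logs incidents found elsewhere.
def alert_precision(confirmed_alerts: int, false_alarms: int) -> float:
    """Share of fired alerts that pointed at real problems."""
    total = confirmed_alerts + false_alarms
    return confirmed_alerts / total if total else 0.0

def alert_recall(confirmed_alerts: int, missed_incidents: int) -> float:
    """Share of real problems the measurement system actually caught."""
    total = confirmed_alerts + missed_incidents
    return confirmed_alerts / total if total else 0.0

# Example quarter: 8 useful alerts, 12 false alarms, 3 incidents surfaced only via complaints.
print(round(alert_precision(8, 12), 2), round(alert_recall(8, 3), 2))
```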
30-Day Trustworthiness Measurement Implementation
In the first two parts of the series (GOVERN and MAP), I recommended phased implementations for each function; here we have an opportunity to run MEASURE in a sprint-style format. Here’s a sample timeline for building trustworthiness evaluation into existing performance frameworks. (#ymmv)
Week 1: Define Thresholds
Articulate measurable trustworthiness requirements:
safety boundaries
fairness minimums
explainability conditions
Each characteristic needs acceptance criteria. When does a failure suspend deployment? When does drift trigger review? No, you cannot decide these things later. This is the moment; use it well and really drill in, leveraging that traffic-light system if you can.
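As an illustration of the shape this can take, here’s a hedged sketch of threshold definitions. The scores, floors, and breach actions are placeholders you’d replace with your own acceptance criteria.

```python
# Illustrative thresholds only; your acceptance criteria will differ.
# The shape matters: each characteristic gets an explicit floor and an
# explicit action when the floor is breached, decided before deployment.
TRUST_THRESHOLDS = {
    "safety":         {"min_score": 0.95, "on_breach": "suspend_deployment"},
    "fairness":       {"min_score": 0.90, "on_breach": "trigger_review"},
    "explainability": {"min_score": 0.80, "on_breach": "trigger_review"},
}

def breach_action(characteristic: str, observed_score: float) -> str | None:
    """Return the pre-agreed action if the observed score falls below its floor."""
    rule = TRUST_THRESHOLDS.get(characteristic)
    if rule and observed_score < rule["min_score"]:
        return rule["on_breach"]
    return None

print(breach_action("safety", 0.91))  # -> "suspend_deployment"
```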
Week 2: Implement Systematic Evaluation
Manual spot-checking breaks down quickly. Automate wherever possible:
alerts for safety boundary violations
notifications for fairness degradation
explainability review triggers
Tie monitoring outputs into existing risk governance infrastructure. Use those familiar Red/Yellow/Green traffic lights to match evaluation intensity to risk (a sketch of the automated checks follows the tier list):
[Red] High-risk: comprehensive evaluation
[Yellow] Medium-risk: systematic assessment
[Green] Low-risk: targeted validation with continuous monitoring
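A minimal sketch of what those automated checks might look like, assuming you can pull current scores for safety, fairness, and explainability. The check thresholds reuse the Week 1 placeholders, and the print-based alert is a stand-in for whatever alerting and ticketing stack you already run.

```python
# Minimal sketch of automated trustworthiness checks; the thresholds and the
# print-based alert are stand-ins for your own monitoring infrastructure.
def run_trust_checks(system: str, tier: str, metrics: dict[str, float]) -> list[str]:
    """Return the names of any violated checks and emit an alert for each."""
    checks = {
        "safety_boundary_violation": metrics["safety"] < 0.95,
        "fairness_degradation":      metrics["fairness"] < 0.90,
        "explainability_review":     metrics["explainability"] < 0.80,
    }
    violations = [name for name, fired in checks.items() if fired]
    for name in violations:
        # Route into existing risk governance tooling here (ticket, page, etc.).
        print(f"[{tier.upper()}] {system}: {name}")
    return violations

run_trust_checks("support-chatbot", "yellow",
                 {"safety": 0.97, "fairness": 0.87, "explainability": 0.82})
```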
Week 3: Establish Baselines
Document current-state benchmarks across characteristics. Use internal expectations, external standards, and regulatory requirements to frame baselines.
Build a cadence and response plan for incidents that arise (a sketch of the cadence mapping follows the list):
monthly reviews and P1 escalations for high-risk systems
quarterly and advanced escalation management for medium-risk
annual review and normal escalation management for low-risk
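Here’s an illustrative mapping of those tiers to review cadence and escalation level. The day counts mirror the list above, and the helper function is hypothetical, not part of any standard tooling.

```python
# Illustrative cadence/escalation mapping per risk tier; adjust the day counts
# and escalation labels to your own governance calendar.
from datetime import date, timedelta

REVIEW_POLICY = {
    "red":    {"review_every_days": 30,  "escalation": "P1"},
    "yellow": {"review_every_days": 90,  "escalation": "advanced"},
    "green":  {"review_every_days": 365, "escalation": "normal"},
}

def next_review(tier: str, last_review: date) -> date:
    """Compute when the next review is due for a system in this tier."""
    return last_review + timedelta(days=REVIEW_POLICY[tier]["review_every_days"])

print(next_review("yellow", date(2025, 1, 15)))  # -> 2025-04-15
```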
Week 4: Launch Review Cycles
Operational reviews track safety, fairness, and explainability as primary health indicators. Technical metrics are always secondary: important to delivery, but not the be-all and end-all. Quarterly assessments ensure continued deployment viability, and annual framework reviews realign goals with evolving values and regulations. Orgs following this approach see improved AI reliability, fewer user experience issues, and clearer accountability pathways [4].
About Measuring Trust
Many teams overbuild performance dashboards while ignoring trustworthiness. That team from the beginning of this article is a perfect example: the dashboard looks good, but it doesn’t prevent real-world harm.
You have to be comfortable halting deployments that don’t meet thresholds, especially for initiatives like this, where deployment velocity is often a priority. Legacy measurement approaches often fail here; they’re optimized for narrow performance metrics, not multi-dimensional trust analysis.
Systematic means consistent, documented, repeatable; not improvised or reactionary. That’s what enables tracking, comparison, and accountability over time.
The framework sounds complex when you focus on the scope of its goals, but it’s really not. You need:
trustworthiness thresholds
consistent evaluation
regular review
stop conditions
If your measurement system lacks these, you’re not managing risk… you’re accumulating it or, worse yet, accepting it.
Next week, we conclude with MANAGE, focusing on how to scale AI systems without sacrificing control or customer trust.
Read the full series:
Part 1: Governing
Part 2: Mapping
Part 3: Measuring
Part 4: Managing
References
[1] IBM Institute for Business Value. "CEO Study 2025: AI Leadership." IBM Newsroom, May 6, 2025. https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-down-on-ai-while-navigating-enterprise-hurdles
[2] National Institute of Standards and Technology. "AI Risk Management Framework (AI RMF 1.0)" and "AI RMF Playbook." NIST, Jan 2023, updated Jul 2024. https://www.nist.gov/itl/ai-risk-management-framework
[3] MIT Sloan Management Review. "A Framework for Assessing AI Risk." MIT, 2024. https://mitsloan.mit.edu/ideas-made-to-matter/a-framework-assessing-ai-risk
[4] Frontiers in Computer Science. "Challenges and Best Practices in Corporate AI Governance: Lessons from the Biopharmaceutical Industry." 2024. https://www.frontiersin.org/articles/10.3389/fcomp.2022.1068361/full
CREDITS: base cover image from airc.nist.gov; Anthropic Claude Sonnet 4 for editorial review
