Article

Monitoring LLM Performance to Manage Model Risk and Measure User Satisfaction – Two Sides of the Same Coin

How to create AI experiences that are both safe and valuable by striking a balance between risk-focused and value-focused LLM evaluations and monitoring.

Published on

October 30, 2024

Florian Diem

In conversational AI, as in real life, there’s always a balancing act: creating value for users while managing the risks that come with deploying a technology that’s no longer deterministic. This inherent unpredictability makes it a lot harder to safeguard than traditional tech. Interestingly, the same metrics that keep your conversational AI safe and reliable can often double as indicators of user satisfaction. The challenge? Striking a balance in a typically resource-constrained enterprise environment - building evaluation capabilities that not only manage the inherent risk but also ensure that users derive real value from your conversational AI product.

When I first started working in conversational AI within Unilever, I saw firsthand how prioritising risk management was essential to protect users and brand integrity—especially for a large publicly traded company where a single PR misstep could tank its share price. Now, as I continue my journey of building conversational AI interfaces in a high-stakes field like health tech at Flo Health, I continue navigating this balancing act and I’d like to share a few observations on why businesses should avoid separating risk and value in their LLM evaluation efforts.

Let’s start by exploring why performance monitoring can address both these critical objectives.

1. Understanding Risk in Conversational AI: Accuracy, Privacy, and Reputation

The shift from deterministic to generative technologies is a fundamental one, altering how we evaluate and manage systems. Unlike traditional, rule-based systems that deliver predictable outcomes based on predefined inputs, generative models - like large language models (LLMs) - produce responses that can vary widely depending on context, user input, and underlying data. This inherent unpredictability means that generative AI requires a different evaluation approach.

In deterministic systems, evaluations focus on ensuring that inputs consistently produce expected results. But with generative AI, especially in conversational interfaces, evaluating performance is more complex and nuanced. Responses can be contextually accurate or may “hallucinate” - generate plausible but entirely incorrect information. So, when we talk about “risks” in AI, we’re not just discussing operational quirks; we’re talking about potential failures that could impact trust, compliance, brand image, and even your company’s share price.

Here are three key risk areas for any team building AI-driven interfaces:

1. Accuracy Risk: At its core, accuracy risk concerns the potential for AI to either “hallucinate”—generate convincing but entirely wrong information (untrue) - or fail to provide essential information (incomplete). In cases where users rely on AI for critical insights, these errors can erode trust, damage the brand, and, in the worst case, even harm users.

2. Privacy and Compliance: Privacy compliance is essential in industries handling personal or sensitive information, such as healthcare. Just as with any data-driven product, AI systems must align with local data and AI policies. Regulatory breaches here can lead to hefty penalties and long-term reputational damage.

3. Reputational Risk: Perhaps the most visible risk of all. Imagine an AI that occasionally delivers responses that seem off or insensitive. Even minor missteps can lead to a wave of user mistrust and harm the brand’s reputation, especially in sensitive sectors.

2. LLM Evaluation from a Risk Management Perspective

In the early stages of deploying conversational AI, companies understandably place heavy emphasis on risk management. After all, products and experiences are built on a new type of technology that’s largely untested and bears higher risk due to its generative nature.

Risk management within conversational AI isn’t simply about tracking accuracy; it’s about proactively measuring performance across critical dimensions like safety, bias, compliance, privacy, and reputation - before you go live, and especially after you’re live and users actually engage with your product. Establishing tools and processes to evaluate these risks forms the backbone of responsible, resilient AI deployment, creating a foundation where the AI aligns with brand values, complies with laws and regulations, and meets user expectations.

Here’s a breakdown of key risk categories and the types of monitoring required:

1. Safety Evaluations: At the core of AI risk management is keeping interactions safe by preventing harmful or inappropriate responses. Techniques like “LLM-as-a-judge,” where one model evaluates another’s responses based on predefined safety criteria, can be effective and scalable but may falter when handling nuanced or complex rule sets. If two human reviewers struggle to agree on rule compliance, the LLM-as-a-judge approach may also fall short. While this method can scale well for high-volume applications, I recommend carefully considering whether you’ll be able to fully automate the evaluation with this approach or whether a different/hybrid approach might suit your case better.

2. Accuracy Testing: Accuracy is essential for maintaining trust. It’s not only about correct answers but also about identifying when the model might be unsure or prone to hallucination. Regular testing against curated datasets helps track accuracy rates, while spot-checks or human evaluations on sensitive queries can reveal where the model’s confidence might outstrip its actual knowledge. Your architecture plays a key role here: LLMs using RAG architectures often show reduced inaccuracy due to greater control over response content. However, even RAG architectures are susceptible to hallucinations, especially when queries fall outside the database’s scope. LLMs actually behave very similar to humans: a well trained human with high integrity will admit when he or she doesn’t have the answer to a question, others will confidently try to make something up. I suppose we’ve all been there.

3. Privacy and Compliance Monitoring: Privacy is especially vital when handling personal information. Regular audits and automated compliance checks help ensure data isn’t stored or processed in ways that could breach regulations. For example, businesses can implement filters or specific prompts to prevent the AI from discussing or storing sensitive topics. Although privacy audits can be resource-intensive, they are essential for maintaining user trust and regulatory compliance.

Implementing a robust PII data-scraping capability is also highly recommended. Such a tool scans and detects PII data within user dialogues, anonymising it before it gets stored. This step is crucial because, even if your conversational AI is not intended as a data collection tool, users may still enter personal information into the dialogue. Consequently, this data could end up in your logs, potentially putting you in breach of legal regulations. Algorithm-based tools like Presidio are available for this purpose, but in my experience, an LLM-based approach can work even better, as it can detect more nuanced instances of PII without relying on a predefined library, which is almost always out-of-date by default.

4. Reputational Risk Management: Sometimes, a model’s responses can stray into ambiguous territory, potentially damaging a brand’s reputation (think McDonald’s, Google, AirCanada). Here, the key is to establish the same crisis response mechanisms that your company likely already has in place for other parts of the business. The process is simple: identify and acknowledge the issue, take responsibility, over-deliver on the fix, and recognise that there will always be some folks “trying to get you.” But when deciding between delivering value to your users and exposing yourself to the risk posed by these PR vultures, always opt for your users.

5. Balancing Scalable vs. Non-Scalable Evaluations: When managing risk in conversational AI, there’s a constant balance between scalability and precision. Popular methods, like using the LLM as a judge, are excellent for scalable, quick evaluations but aren’t foolproof, especially in sensitive contexts. In these cases, non-scalable steps like manual reviews or risk- and topic-based analysis for highly sensitive queries are invaluable. The goal is not to choose one over the other but to use each in tandem, ensuring that high-risk interactions are covered by a mix of scalable tools and meticulous human oversight.

3. The Balancing Act: Why Performance Monitoring Must Serve Dual Purposes

Here’s where the balancing act comes into play. The best performance monitoring setups allow AI to stay flexible enough to create satisfying user experiences while remaining safe and reliable. In other words, monitoring isn’t just about pinpointing mistakes or risks; it’s about identifying opportunities to add value.

The beauty of this approach lies in its adaptability. As conversational AI matures, its monitoring systems should evolve, reflecting the shifting needs of both the company and users. If AI is too regulated, it might feel safe but bland, lacking the engaging qualities that make interactions memorable. On the other hand, giving the AI too much freedom could enrich conversations at the expense of safety. The key is to adapt incrementally, using performance metrics to guide these adjustments.

4. Key Takeaways: Building a Balanced Evaluation Strategy for Your Conversational AI

How can businesses balance risk and value in a way that benefits both brand and user? Here are a few practical takeaways:

1. Don’t fight the focus on risk management: The early stages may feel overly cautious, but proving that your AI can be trusted to manage risks will pay dividends in the long run. Risk management isn’t just a box to check; it’s the foundation on which all future value is built.

2. Proving success with AI means succeeding on two fronts: delivering value to users and keeping them safe. An AI that doesn’t protect its users or the brand will ultimately undermine its own value. This dual accountability is what builds sustainable trust with both users and stakeholders.

3. Use risk management tools to gather user insights: Tools initially intended for risk assessment can often reveal opportunities to enhance the user experience. For instance, topic analytics can identify sensitive areas for safeguarding, but they can also highlight trending topics where users are interested but where the AI currently lacks depth. These insights help you spot gaps in user experience that need attention.

4. Trust the process: Once initial risk concerns are under control, the focus will naturally shift toward optimising user engagement. This transition doesn’t happen overnight, but it’s inevitable. When the CEO or CFO eventually asks for the ROI on the AI, being ready with a well-monitored, data-backed approach will allow you to demonstrate tangible value—without being caught off-guard.

5. The biggest risk is not knowing what your users are doing: Whether it’s a website, an app, or conversational AI, understanding how users engage with your product is critical. Unmuting those interactions wink is the only way to manage risks effectively while also optimising for real value. This dual-purpose insight provides the information needed to fine-tune both safety measures and user engagement.

5. Final Thoughts: Unmuting the Full Potential of AI

Balancing risk and value isn’t just about covering bases; it’s about using those insights to create impactful AI experiences. The future of conversational AI lies in its ability to adapt, learn, and continuously improve, using real-time insights to enhance user satisfaction and operational safety.

For anyone serious about conversational AI, this isn’t a nice-to-have—it’s essential. Because in the end, companies that strike this balance will lead the next evolution of AI-driven engagement.

With Unmuted.ai, I’m committed to helping businesses find this balance, “unmuting” valuable conversational AI insights. When performance monitoring looks through both risk and value lenses, we unlock a path toward AI experiences that aren’t just compliant but compelling.