Building the AI quality flywheel: How TheyDo turns user feedback into better AI

Chris Swart · Senior Machine Learning Engineer

    Businesses are buzzing about AI and its potential to revolutionize everything from operations and decision-making to the bottom line. At TheyDo, we’re on board — but we didn’t just drink the Kool-Aid. We put AI to work where it matters.

    Our Journey AI transforms unstructured customer data into actionable insights, revealing hidden opportunities and pain points across user journeys. We process multiple input sources to generate valuable outputs that enterprises rely on for strategic decisions. But those insights are only as good as the quality of the AI behind them. A journey riddled with duplicate steps, vague insights, or unnecessary complexity doesn’t just frustrate users — it erodes trust. And in AI, trust is everything.


    That’s why we built a structured, self-improving system that turns AI evaluation from a black box into a quality flywheel, ensuring every insight helps you make better decisions, faster — and with confidence.

    Here’s how we did it:

    The quality challenge: Cracking the measurement code

    AI-generated content must be accurate, structured, and meaningful — but ensuring consistent quality at scale is anything but straightforward. AI outputs can vary unpredictably, and what looks good on the surface may fail to provide real value.

    So, how do you measure and maintain the quality of AI-generated content at scale?

    To tackle this challenge, we combined multiple approaches to ensure AI-generated content meets the highest standards. By leveraging user feedback, automated guardrails, and AI-driven evaluation, we created a system that continuously refines and improves itself. Here’s how each method plays a critical role:

    1. User feedback: The gold standard (but too slow)

    The best feedback comes directly from our users. When they tell us, “This journey is amazing” or “These steps are confusing”, we get powerful qualitative insights. The problem? Feedback is sporadic, subjective, and arrives too late to prevent poor experiences.

    2. Automated guardrails: Catching obvious errors

    We implemented automated tests to flag common AI missteps:

    • Duplicate steps or phases

    • Excessive step counts

    • Placeholder text in outputs 

    These guardrails acted as a first line of defense, preventing glaring errors before they reached users. But they couldn’t evaluate more subjective elements like clarity, coherence, or value.
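
    To make this concrete, here is a minimal sketch of what guardrail checks like these can look like. The journey structure, placeholder markers, and step-count threshold below are illustrative assumptions rather than our production rules.

```python
from dataclasses import dataclass, field

# Illustrative journey structure; the real schema is richer.
@dataclass
class Journey:
    phases: list[str]
    steps: list[str] = field(default_factory=list)

PLACEHOLDER_MARKERS = ("lorem ipsum", "tbd", "[insert", "<placeholder>")
MAX_STEPS = 30  # assumed threshold for "excessive step counts"

def guardrail_failures(journey: Journey) -> list[str]:
    """Return a list of human-readable guardrail violations."""
    failures = []

    # Duplicate steps or phases (case-insensitive comparison).
    if len({p.lower().strip() for p in journey.phases}) < len(journey.phases):
        failures.append("duplicate phases")
    if len({s.lower().strip() for s in journey.steps}) < len(journey.steps):
        failures.append("duplicate steps")

    # Excessive step counts.
    if len(journey.steps) > MAX_STEPS:
        failures.append(f"too many steps ({len(journey.steps)} > {MAX_STEPS})")

    # Placeholder text left in the output.
    for text in journey.phases + journey.steps:
        if any(marker in text.lower() for marker in PLACEHOLDER_MARKERS):
            failures.append(f"placeholder text in: {text!r}")

    return failures
```

    Checks like these are cheap to run on every generation, which is exactly what you want from a first line of defense.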

    3. LLM-as-Judge: Scaling AI evaluation

    To address the shortfalls of our first two methods, we introduced an innovative approach: using AI to evaluate AI. We built an LLM-as-Judge system, leveraging large language models (LLMs) to assess the quality of AI-generated journey maps. This allowed us to:

    • Correlate AI-generated scores with real user feedback

    • Scale evaluation beyond what human reviewers could handle

    • Understand how model tweaks impacted performance
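
    The exact rubric and prompts we use are out of scope here, but the pattern is simple: ask a strong model to grade a generated journey against explicit criteria and return structured scores. The sketch below is illustrative; the provider, model name, and criteria are assumptions, not our production setup.

```python
import json
from openai import OpenAI  # any LLM provider works; OpenAI is used here for illustration

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI-generated customer journey map.
Score it from 1 (poor) to 5 (excellent) on each criterion:
- clarity: are phases and steps unambiguous?
- coherence: does the journey flow logically end to end?
- value: would a journey manager act on these insights?

Return JSON: {{"clarity": int, "coherence": int, "value": int, "rationale": str}}

Journey:
{journey}
"""

def judge_journey(journey_text: str) -> dict:
    """Ask an LLM judge to grade a generated journey and return structured scores."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; swap for your provider of choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(journey=journey_text)}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

    Because the judge returns structured scores rather than free-form prose, we can log them alongside real user feedback and measure how well the two agree.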

    The technical approach

    To effectively evaluate and improve our AI-generated outputs, we needed an LLM evaluation framework that met five key requirements:

    • Self-hostable for data privacy

    • Robust production monitoring to track performance

    • Comprehensive API/SDK for seamless integration

    • Visual prompt comparison tools for testing and refinement

    • Experiment management for prompt versioning and iteration

    Evaluating our options

    We assessed five potential solutions:

    • Braintrust (Closed-source LLM engineering platform)

    • MLflow (Open-source MLOps platform)

    • Chainforge (Visual prompt evaluator)

    • Langfuse (Open-source LLM engineering platform)

    • A custom-built solution on our Honeycomb infrastructure

    Feature comparison

    We scored each option on six key features: self-hosting, ease of setup, comprehensiveness of the API, visual prompt tools, evaluation capabilities, and prompt management. Two of those criteria illustrate the trade-offs most clearly:

    • Ease of setup: Chainforge was the simplest to stand up, Braintrust and Langfuse were moderately involved, and MLflow and a custom solution were the most complex.

    • Evaluation capabilities: Braintrust, MLflow, and Langfuse all offer advanced evaluation features out of the box, Chainforge covers only the basics, and a custom solution would require us to build them ourselves.

    The full 24-criteria evaluation went deeper, covering six areas:

    • Core capabilities: self-hosting, UI for visualizations, programmatic API/SDK, production monitoring, open source

    • Evaluation methods: LLM-as-a-Judge, heuristic metrics, statistical metrics

    • Experiment management: prompt versioning, A/B testing, experiment tracking, dataset management

    • Observability: tracing, metrics collection, cost tracking, latency monitoring

    • Integration capabilities: multiple LLM providers, integration with RAG

    • User experience: setup complexity, visual prompt comparison, team collaboration, learning curve

    And the winner is — Langfuse

    We chose Langfuse because it hit the sweet spot for our needs. It lets us keep our data in-house (unlike Braintrust), has good monitoring tools, and makes managing experiments straightforward. It's easier to set up than MLflow and more full-featured than Chainforge.
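
    We'll take a deeper look at our Langfuse setup in the next post. As a flavour of why the API mattered, here is a hedged sketch of logging a generation and attaching evaluation scores to it with the Langfuse Python SDK (v2-style calls shown; the trace and score names are illustrative, not our production naming).

```python
from langfuse import Langfuse

# Credentials come from LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST,
# which is what makes a self-hosted deployment a drop-in.
langfuse = Langfuse()

def log_generation_with_scores(journey_text: str, judge_scores: dict, user_rating: float | None):
    # One trace per Journey AI generation.
    trace = langfuse.trace(name="journey-generation", output=journey_text)

    # Attach LLM-as-Judge scores to the trace.
    for criterion, value in judge_scores.items():
        if criterion == "rationale":
            continue
        langfuse.score(trace_id=trace.id, name=f"judge-{criterion}", value=value)

    # Attach explicit user feedback when it arrives, so the two can be correlated later.
    if user_rating is not None:
        langfuse.score(trace_id=trace.id, name="user-rating", value=user_rating)
```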

    The ML flywheel: A continuous improvement loop


    The most impactful result of our LLM evaluation system is the creation of a continuous improvement loop — what we call our ML Flywheel. This self-reinforcing system ensures that every iteration enhances AI quality and reliability. At its core, the flywheel is built on four key components:

    1. AI-generated content: Journey AI generates insights based on current prompts and models.

    2. User feedback: Users interact with the generated content, offering explicit (comments, ratings) and implicit (engagement patterns) feedback.

    3. AI evaluation: Automated guardrails and LLM-as-Judge tools assess quality, correlating results with real user feedback.

    4. Better prompts: Evaluation insights feed directly into prompt improvements, refining future AI outputs.

    This creates a virtuous cycle: better prompts → better AI outputs → better user feedback → stronger evaluation data → better prompts.
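
    Stripped of infrastructure, the flywheel is just a loop. The sketch below is a deliberately simplified, hypothetical orchestration: the callables are placeholders for Journey AI generation, feedback collection, evaluation, and prompt refinement, not our internal APIs.

```python
from typing import Callable, Sequence

def run_flywheel_iteration(
    prompt_version: str,
    generate: Callable[[str], Sequence[str]],            # 1. AI-generated content
    collect_feedback: Callable[[Sequence[str]], list],   # 2. user feedback (explicit + implicit)
    evaluate: Callable[[str], dict],                      # 3. AI evaluation (guardrails + judge)
    improve_prompt: Callable[..., str],                   # 4. better prompts
) -> str:
    """One turn of the flywheel: generate -> feedback -> evaluate -> refine."""
    journeys = generate(prompt_version)
    feedback = collect_feedback(journeys)
    evaluations = [evaluate(journey) for journey in journeys]
    # The improved prompt feeds the next iteration, closing the loop.
    return improve_prompt(prompt_version, evaluations, feedback)
```

    The return value of one iteration becomes the input of the next, which is what makes the flywheel self-reinforcing.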

    Lessons learned: What it takes to get AI right

    Through this process, we uncovered several key insights:

    1. Combine human + AI evaluation

      • Rule-based checks catch obvious errors, while LLM-as-Judge handles subjective quality assessments.

    2. Validate against real feedback

      • AI evaluations must align with actual user reactions to remain meaningful (see the correlation sketch after this list).

    3. Customize for your use case

      • Generic benchmarks aren’t enough — evaluators must target your specific failure modes.

    4. Test at multiple levels

      • Some problems only appear with real-world usage, requiring both development-time and production-time testing.

    5. Make evaluation an integral part of AI development

      • Continuous feedback loops, not just final validation, drive real AI quality improvement.
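
    On point 2, validating against real feedback can start as simply as checking whether LLM-as-Judge scores and user ratings move together. A minimal sketch, assuming paired scores for the same journeys and an arbitrary agreement threshold:

```python
from scipy.stats import spearmanr

def judge_agrees_with_users(judge_scores: list[float], user_ratings: list[float],
                            threshold: float = 0.5) -> bool:
    """Rank-correlate LLM-as-Judge scores with user ratings for the same journeys.

    A weak or negative correlation means the judge is measuring something
    users do not actually care about, and its rubric needs rework.
    """
    correlation, p_value = spearmanr(judge_scores, user_ratings)
    return correlation >= threshold and p_value < 0.05
```

    The 0.5 threshold is only an illustration; the useful signal is how the correlation trends as prompts, models, and judge rubrics evolve.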

    What’s next?

    The AI flywheel doesn’t stop spinning. Our next steps include:

    • Automating prompt iteration based on historical user feedback and LLM-as-Judge scores.

    • Strengthening AI-to-user feedback loops to accelerate improvements.

    • Exploring new evaluation tools like DSPy for more sophisticated prompt tuning.

    By committing to a systematic, AI-driven approach to quality, TheyDo ensures that every AI-generated journey insight meets the highest standards — helping businesses make better decisions, faster. Stay tuned for the next installment of our AI flywheel series, where we’ll take a deeper dive into Langfuse and how it powers our AI-driven insights.

    Ready to see Journey AI in action?

    Discover how TheyDo's Journey AI can transform your customer data into clear, actionable insights. Start your free trial now.