Building the AI quality flywheel: How TheyDo turns user feedback into better AI
Businesses are buzzing about AI and its potential to revolutionize everything from operations and decision-making to the bottom line. At TheyDo, we’re on board — but we didn’t just drink the Kool-Aid. We put AI to work where it matters.
Our Journey AI transforms unstructured customer data into actionable insights, revealing hidden opportunities and pain points across user journeys. We process multiple input sources to generate valuable outputs that enterprises rely on for strategic decisions. But those insights are only as good as the quality of the AI behind them. A journey riddled with duplicate steps, vague insights, or unnecessary complexity doesn’t just frustrate users — it erodes trust. And in AI, trust is everything.
That’s why we built a structured, self-improving system that turns AI evaluation from a black box into a quality flywheel, ensuring every insight helps you make better decisions, faster — and with confidence.
Here’s how we did it:
The quality challenge: Cracking the measurement code
AI-generated content must be accurate, structured, and meaningful — but ensuring consistent quality at scale is anything but straightforward. AI outputs can vary unpredictably, and what looks good on the surface may fail to provide real value.
So, how do you measure and maintain the quality of AI-generated content at scale?
To tackle this challenge, we combined multiple approaches to ensure AI-generated content meets the highest standards. By leveraging user feedback, automated guardrails, and AI-driven evaluation, we created a system that continuously refines and improves itself. Here’s how each method plays a critical role:
1. User feedback: The gold standard (but too slow)
The best feedback comes directly from our users. When they tell us, “This journey is amazing” or “These steps are confusing”, we get powerful qualitative insights. The problem? Feedback is sporadic, subjective, and arrives too late to prevent poor experiences.
2. Automated guardrails: Catching obvious errors
We implemented automated tests to flag common AI missteps:
Duplicate steps or phases
Excessive step counts
Placeholder text in outputs
These guardrails acted as a first line of defense, preventing glaring errors before they reached users. But they couldn’t evaluate more subjective elements like clarity, coherence, or value.
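To make these guardrails concrete, here is a minimal TypeScript sketch of the kind of rule-based checks described above. The `JourneyStep` shape, the step-count threshold, and the placeholder patterns are illustrative assumptions rather than our production rules.

```typescript
// Minimal guardrail sketch: rule-based checks on an AI-generated journey.
// The JourneyStep type, MAX_STEPS threshold, and placeholder patterns are illustrative assumptions.

interface JourneyStep {
  title: string;
  description: string;
}

interface GuardrailResult {
  passed: boolean;
  violations: string[];
}

const MAX_STEPS = 12; // assumed threshold for "excessive step counts"
const PLACEHOLDER_PATTERN = /lorem ipsum|\bTODO\b|\[insert[^\]]*\]|\bplaceholder\b/i;

export function runGuardrails(steps: JourneyStep[]): GuardrailResult {
  const violations: string[] = [];

  // 1. Duplicate steps: flag repeated titles after normalizing case and whitespace.
  const seen = new Set<string>();
  for (const step of steps) {
    const key = step.title.trim().toLowerCase();
    if (seen.has(key)) violations.push(`Duplicate step: "${step.title}"`);
    seen.add(key);
  }

  // 2. Excessive step counts.
  if (steps.length > MAX_STEPS) {
    violations.push(`Too many steps: ${steps.length} (max ${MAX_STEPS})`);
  }

  // 3. Placeholder text left in outputs.
  for (const step of steps) {
    if (PLACEHOLDER_PATTERN.test(`${step.title} ${step.description}`)) {
      violations.push(`Placeholder text in step: "${step.title}"`);
    }
  }

  return { passed: violations.length === 0, violations };
}
```

Checks like these are cheap to run on every generation, which is exactly why they make a good first line of defense before anything reaches a user.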
3. LLM-as-Judge: Scaling AI evaluation
To address the shortfalls of our first two methods, we introduced an innovative approach: using AI to evaluate AI. We built an LLM-as-Judge system, leveraging large language models (LLMs) to assess the quality of AI-generated journey maps (a sketch of the judge call follows the list below). This allowed us to:
Correlate AI-generated scores with real user feedback
Scale evaluation beyond what human reviewers could handle
Understand how model tweaks impacted performance
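As a rough illustration of the LLM-as-Judge pattern, the sketch below asks a judge model to grade a generated journey on a simple rubric and return structured JSON. The rubric dimensions, model name, and prompt wording are assumptions made for illustration; our production judge prompts are more detailed.

```typescript
// LLM-as-Judge sketch: score a generated journey on a simple rubric.
// The rubric, judge model, and 1-5 scale are illustrative assumptions.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface JudgeScores {
  clarity: number;   // 1-5: are steps and insights easy to understand?
  coherence: number; // 1-5: does the journey flow logically?
  value: number;     // 1-5: would a journey manager act on this?
  rationale: string;
}

export async function judgeJourney(journeyJson: string): Promise<JudgeScores> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o", // assumed judge model
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are a strict evaluator of customer-journey maps. " +
          "Score the journey on clarity, coherence, and value (1-5 each) " +
          'and reply as JSON: {"clarity":n,"coherence":n,"value":n,"rationale":"..."}',
      },
      { role: "user", content: journeyJson },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as JudgeScores;
}
```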
The technical approach
To effectively evaluate and improve our AI-generated outputs, we needed an LLM evaluation framework that met five key requirements:
Self-hostable for data privacy
Robust production monitoring to track performance
Comprehensive API/SDK for seamless integration
Visual prompt comparison tools for testing and refinement
Experiment management for prompt versioning and iteration
Evaluating our options
We assessed five potential solutions:
Braintrust (Closed-source LLM engineering platform)
MLflow (Open-source MLOps platform)
Chainforge (Visual prompt evaluator)
Langfuse (Open-source LLM engineering platform)
A custom-built solution on our Honeycomb infrastructure
Feature comparison
Key feature | Braintrust | MLflow | Chainforge | Langfuse | Custom solution |
---|---|---|---|---|---|
Self-hosting | ❌ | ✅ | ✅ | ✅ | ✅ |
Ease of setup | ⚠️ Moderate | 🔴 Complex | 🟢 Simple | ⚠️ Moderate | 🔴 Complex |
Comprehensive API | ✅ | ✅ | ⚠️ Limited | ✅ | ✅ |
Visual prompt tools | ✅ | ✅ | ✅ | ✅ | ⚠️ Requires dev |
Evaluation capabilities | ✅ Advanced | ✅ Advanced | ⚠️ Basic | ✅ Advanced | ⚠️ Custom dev |
Prompt management | ✅ | ✅ | ❌ | ✅ | ⚠️ Requires dev |
Detailed comparison: the full criteria evaluation
LLM Eval Tools | Braintrust | MLflow | Chainforge | Langfuse | Custom Solution (Honeycomb) |
---|---|---|---|---|---|
Core Capabilities | |||||
Self-hosting | ❌ | ✅ | ✅ | ✅ | ✅ |
UI for visualizations | ✅ | ✅ | ✅ | ✅ | ⚠️ Requires development |
Programmatic API/SDK | ✅ | ✅ | ⚠️ Limited | ✅ | ✅ |
Production monitoring | ✅ | ✅ | ⚠️ Limited | ✅ | ✅ |
Open source | ⚠️ only auto-evals | ✅ | ✅ | ✅ | ✅ |
Evaluation Methods | |||||
LLM-as-a-Judge | ✅ | ✅ | ⚠️ Basic | ✅ | ⚠️ Requires implementation |
Heuristic metrics | ✅ | ✅ | ⚠️ Limited | ✅ | ⚠️ Requires implementation |
Statistical metrics | ✅ | ✅ | ❌ | ⚠️ Limited | ⚠️ Requires implementation |
Experiment Management | |||||
Prompt versioning | ✅ | ✅ | ❌ | ✅ | ⚠️ Requires implementation |
A/B testing | ✅ | ✅ | ✅ | ✅ | ⚠️ Requires implementation |
Experiment tracking | ✅ | ✅ | ⚠️ Limited | ✅ | ⚠️ Via Honeycomb |
Dataset management | ✅ | ✅ | ❌ | ✅ | ⚠️ Requires implementation |
Observability | |||||
Tracing | ✅ | ✅ | ❌ | ✅ | ✅ |
Metrics collection | ✅ | ✅ | ⚠️ Limited | ✅ | ✅ |
Cost tracking | ✅ | ⚠️ Limited | ❌ | ✅ | ✅ |
Latency monitoring | ✅ | ✅ | ❌ | ✅ | ✅ |
Integration Capabilities | |||||
Multiple LLM providers | ✅ | ✅ | ✅ | ✅ | ✅ |
Integration with RAG | ⚠️ Limited | ✅ | ❌ | ✅ | ⚠️ Requires implementation |
User Experience | |||||
Setup complexity | ⚠️ Moderate | 🔴 High | 🟢 Low | ⚠️ Moderate | 🔴 High |
Visual prompt comparison | ✅ | ✅ | ✅ | ✅ | ⚠️ Requires development |
Team collaboration | ✅ | ✅ | ⚠️ Limited | ✅ | ⚠️ Via other tools |
Learning curve | ⚠️ Moderate | 🔴 Steep | 🟢 Gentle | ⚠️ Moderate | 🔴 Steep |
And the winner is — Langfuse
We chose Langfuse because it hit the sweet spot for our needs: it lets us keep our data in-house (unlike Braintrust), offers solid production monitoring, and makes managing prompt experiments straightforward. It's easier to set up than MLflow and more fully featured than Chainforge.
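To give a feel for the integration, here is a hedged sketch of recording a Journey AI run and its LLM-as-Judge score in Langfuse using the TypeScript SDK. Trace and score names are placeholders, and the exact SDK surface may differ between versions, so treat this as a sketch rather than a drop-in snippet.

```typescript
// Sketch: record a generation and its evaluation score in Langfuse.
// Based on the Langfuse TypeScript SDK as we understand it; names and fields are illustrative.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL, // points at a self-hosted instance
});

export async function recordJourneyRun(
  input: string,
  output: string,
  judgeScore: number,
) {
  // One trace per Journey AI run.
  const trace = langfuse.trace({
    name: "journey-generation", // placeholder trace name
    input,
    output,
  });

  // Attach the LLM-as-Judge result so it can be compared with user feedback later.
  trace.score({
    name: "judge-overall", // placeholder score name
    value: judgeScore,     // e.g. average of clarity/coherence/value
  });

  await langfuse.flushAsync(); // ensure events are sent before the process exits
}
```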
The ML flywheel: A continuous improvement loop
The most impactful result of our LLM evaluation system is the creation of a continuous improvement loop — what we call our ML Flywheel. This self-reinforcing system ensures that every iteration enhances AI quality and reliability. At its core, the flywheel is built on four key components:
AI-generated content: Journey AI generates insights based on current prompts and models.
User feedback: Users interact with the generated content, offering explicit (comments, ratings) and implicit (engagement patterns) feedback.
AI evaluation: Automated guardrails and LLM-as-Judge tools assess quality, correlating results with real user feedback.
Better prompts: Evaluation insights feed directly into prompt improvements, refining future AI outputs.
This creates a virtuous cycle: better prompts → better AI outputs → better user feedback → stronger evaluation data → better prompts.
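To show how the loop closes in practice, here is a small, self-contained sketch that correlates LLM-as-Judge scores with explicit user ratings per prompt version; a judge that tracks real user sentiment is what makes the flywheel trustworthy. The `EvalRecord` shape is an illustrative assumption, not our production schema.

```typescript
// Sketch: how well does the LLM-as-Judge agree with real users, per prompt version?
// The EvalRecord shape is an illustrative assumption.

interface EvalRecord {
  promptVersion: string;
  judgeScore: number; // 1-5 from the LLM-as-Judge
  userRating: number; // 1-5 explicit user feedback
}

// Pearson correlation between two equally sized number arrays.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
    varY += (ys[i] - meanY) ** 2;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom;
}

export function judgeUserAgreement(records: EvalRecord[]): Map<string, number> {
  // Group records by prompt version.
  const byVersion = new Map<string, EvalRecord[]>();
  for (const r of records) {
    const group = byVersion.get(r.promptVersion) ?? [];
    group.push(r);
    byVersion.set(r.promptVersion, group);
  }

  // Correlate judge scores with user ratings within each version.
  const agreement = new Map<string, number>();
  for (const [version, group] of byVersion) {
    agreement.set(
      version,
      pearson(group.map((r) => r.judgeScore), group.map((r) => r.userRating)),
    );
  }
  return agreement;
}
```

A prompt version whose judge scores stop correlating with user ratings is a signal to revisit the judge itself, not just the generation prompt.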
Lessons learned: What it takes to get AI right
Through this process, we uncovered several key insights:
Combine human + AI evaluation
Rule-based checks catch obvious errors, while LLM-as-Judge handles subjective quality assessments.
Validate against real feedback
AI evaluations must align with actual user reactions to remain meaningful.
Customize for your use case
Generic benchmarks aren’t enough — evaluators must target your specific failure modes.
Test at multiple levels
Some problems only appear with real-world usage, requiring both development-time and production-time testing.
Make evaluation an integral part of AI development
Continuous feedback loops, not just final validation, drive real AI quality improvement.
What’s next?
The AI flywheel doesn’t stop spinning. Our next steps include:
Automating prompt iteration based on historical user feedback and LLM-as-Judge scores.
Strengthening AI-to-user feedback loops to accelerate improvements.
Exploring new evaluation tools like DSPy for more sophisticated prompt tuning.
By committing to a systematic, AI-driven approach to quality, TheyDo ensures that every AI-generated journey insight meets the highest standards — helping businesses make better decisions, faster. Stay tuned for the next installment of our AI flywheel series, where we’ll take a deeper dive into Langfuse and how it powers our AI-driven insights.
Ready to see Journey AI in action?
Discover how TheyDo's Journey AI can transform your customer data into clear, actionable insights. Start your free trial now.