Everyone talks about feedback loops for AI agents like it's rocket science. MLOps pipelines, A/B testing frameworks, human-in-the-loop annotation systems, vector databases for embeddings... I found a simpler way.
The Problem
My sales agent was misclassifying user messages. Someone says "sorry not 41 it's 32" (changing their shoe size), and the AI thought they were negotiating.
I needed a way to:
- Catch these mistakes
- Use them to improve the AI
- Do both without building a complex system
The Solution: Dual Confidence Thresholds
Instead of just using the AI's answer, I ask for a confidence score too.
```python
def get_user_intent_llm(message, possible_intents):
    prompt = f"""
    Classify this message: "{message}"
    Valid intents: {possible_intents}
    Respond in this format:
    INTENT: <your classification>
    CONFIDENCE: <0.0 to 1.0>
    """
    response = llm.invoke(prompt)
    intent, confidence = parse_response(response)

    # LOG EVERYTHING
    print(f"INTENT: {intent}, CONFIDENCE: {confidence}, MESSAGE: '{message}'")
    return intent, confidence
```
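The `parse_response` helper isn't shown above; a minimal sketch that pulls the two fields out of the format requested in the prompt might look like this (the field names come from the prompt; the fallback behavior is my own choice):

```python
import re

def parse_response(text):
    """Extract INTENT and CONFIDENCE from the model's reply.

    If the reply doesn't match the requested format, return
    ("other", 0.0) so a malformed answer is treated as low
    confidence instead of crashing the agent.
    """
    intent_match = re.search(r"INTENT:\s*(\S+)", text)
    conf_match = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", text)
    if not intent_match or not conf_match:
        return "other", 0.0
    return intent_match.group(1), float(conf_match.group(1))
```

Treating an unparseable reply as low confidence is deliberate: it routes the message into the "ask for clarification" branch instead of silently acting on garbage.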
Then I use two thresholds:
```python
HIGH_THRESHOLD = 0.8
LOW_THRESHOLD = 0.5

if confidence >= HIGH_THRESHOLD:
    # AI is confident, use the intent
    use_intent(intent)
elif confidence >= LOW_THRESHOLD:
    # AI is unsure, use the intent BUT LOG IT
    log_edge_case(message, intent, confidence)  # <-- This is your feedback loop
    use_intent(intent)
else:
    # AI has no idea, ask for clarification
    ask_user_to_clarify()
```
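In a running agent the same branching is easier to reuse (and to test) as one function. A sketch, with `use_intent` and `ask_user_to_clarify` replaced by return values so you can see which branch fired:

```python
HIGH_THRESHOLD = 0.8
LOW_THRESHOLD = 0.5

def route(message, intent, confidence, log=lambda *a: None):
    """Decide what to do with a classification; return the branch taken."""
    if confidence >= HIGH_THRESHOLD:
        return "use"                      # confident: act directly
    if confidence >= LOW_THRESHOLD:
        log(message, intent, confidence)  # unsure: act, but record it for review
        return "use_and_log"
    return "clarify"                      # lost: ask the user

# The medium band is the feedback loop:
logged = []
route("no problem boss", "yes", 0.58, log=lambda m, i, c: logged.append(m))
# "no problem boss" is now in `logged`, but the agent still acted on the intent.
```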
The Feedback Loop
That `log_edge_case()` function is your entire feedback loop. It appends to a simple file:
```python
def log_edge_case(message, intent, confidence):
    with open("edge_cases.txt", "a") as f:
        f.write(f"{confidence:.2f} | {intent} | {message}\n")
```
After a week, your file looks like this:
```
0.65 | negotiate | sorry not 41 it's 32
0.58 | yes | no problem boss
0.72 | location | I stay for Lekki
0.55 | other | can I exchange if size doesn't fit?
```
How to Use It
Every few days:
- Open `edge_cases.txt`
- Review the messages and their classifications
- Ask yourself: "Was the AI right?"
- Collect the corrected examples for fine-tuning
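Once the file grows past a screenful, a few lines of Python speed up the review by counting which intents show up most. This assumes the `confidence | intent | message` format used above; `summarize` is just an illustrative helper name:

```python
from collections import Counter

def summarize(path="edge_cases.txt"):
    """Count logged edge cases per intent, most common first."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            parts = line.strip().split(" | ", 2)
            if len(parts) == 3:          # confidence | intent | message
                counts[parts[1]] += 1
    return counts.most_common()
```

Reviewing the most-confused intent first usually gives the biggest improvement per example.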
These logged edge cases become your training data. Start by adding a few corrected examples to your system prompt; when you have enough (50-100 examples), you can fine-tune your model to handle these specific patterns better.
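When you reach that point, the reviewed pairs need to be in your provider's fine-tuning format. A sketch that writes OpenAI-style chat JSONL (the system prompt text and the `to_finetune_jsonl` name are my own; swap in whatever your provider expects):

```python
import json

def to_finetune_jsonl(corrected, out_path="finetune.jsonl"):
    """corrected: list of (message, correct_intent) pairs from your manual review."""
    with open(out_path, "w") as f:
        for message, intent in corrected:
            record = {
                "messages": [
                    {"role": "system", "content": "Classify the user's intent."},
                    {"role": "user", "content": message},
                    {"role": "assistant", "content": intent},
                ]
            }
            f.write(json.dumps(record) + "\n")
```

Note that the assistant turn holds the *corrected* intent from your review, not whatever the model originally guessed.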
Why This Works
- No complex infrastructure - It's just a text file
- You see real user data - Not synthetic test cases
- Medium confidence = edge cases - These are exactly the examples you need to improve
- Gradual improvement - Each week you add a few examples and the AI gets smarter
The Full Picture
```
User Message
     ↓
Intent Classification (with confidence)
     ↓
┌─────────────────────────────────────┐
│ HIGH (>0.8)      → Use directly     │
│ MEDIUM (0.5-0.8) → Use + LOG        │ ← Your feedback loop
│ LOW (<0.5)       → Ask clarification│
└─────────────────────────────────────┘
     ↓
Review logs weekly
     ↓
Add examples to system prompt
     ↓
Repeat
```