Model Serving, LLM Observability & Drift Detection
AI/ML Operations | Technical Operations Excellence
Traditional observability measures infrastructure health: latency, errors, saturation. LLM observability also has to measure:
| Dimension | Question |
|---|---|
| Behavior | Is the model doing what we expect? |
| Quality | Are outputs accurate, helpful, safe? |
| Reasoning | Is the chain-of-thought sound? |
Hallucination detection: sample multiple completions for the same prompt and check them for mutual consistency; facts that disagree across samples are likely hallucinated.
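A minimal sketch of that consistency check, assuming a hypothetical `generate(prompt)` callable for your model client. The token-overlap score is a crude stand-in for an NLI- or judge-based contradiction check, and the 0.4 threshold is illustrative.

```python
from typing import Callable, List

def consistency_score(prompt: str,
                      generate: Callable[[str], str],
                      n_samples: int = 5) -> float:
    """Sample n completions and score their mutual agreement.

    Low agreement across samples is a hallucination signal
    (SelfCheckGPT-style). Token overlap is a rough proxy for a
    proper contradiction check.
    """
    samples = [generate(prompt) for _ in range(n_samples)]
    token_sets = [set(s.lower().split()) for s in samples]

    overlaps: List[float] = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            if union:
                overlaps.append(len(token_sets[i] & token_sets[j]) / len(union))
    return sum(overlaps) / len(overlaps) if overlaps else 1.0

# Flag for review when agreement drops below a tuned threshold:
# score = consistency_score("Who founded Acme Corp?", my_llm_call)
# if score < 0.4:
#     flag_for_review()
```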
A second approach: use another LLM as a judge to evaluate whether each answer is grounded in the retrieved context.
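A sketch of an LLM-as-judge groundedness check; `judge_llm` is a hypothetical callable for whatever model client you use, and the prompt wording and naive score parsing are illustrative.

```python
from typing import Callable

GROUNDEDNESS_PROMPT = """You are a strict evaluator.
Context:
{context}

Answer:
{answer}

Does the answer contain only claims supported by the context?
Reply with a single number from 1 (fully unsupported) to 5 (fully grounded)."""

def groundedness(answer: str, context: str,
                 judge_llm: Callable[[str], str]) -> int:
    """Score how well an answer is grounded in the retrieved context."""
    reply = judge_llm(GROUNDEDNESS_PROMPT.format(context=context, answer=answer))
    digits = [int(c) for c in reply if c.isdigit()]  # naive parse of the judge's reply
    return digits[0] if digits else 1  # fail closed on unparseable output
```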
Cross-Layer Attention Probing: train a classifier on the model's internal activations (requires access to the weights, so open-weight models only).
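A simplified sketch in the same spirit: a probe trained on hidden-state activations, assuming an open-weight model loaded through Hugging Face `transformers` and a human-labeled set of hallucinated vs. faithful generations. The `gpt2` checkpoint, layer -4, mean pooling, and the logistic-regression probe are illustrative choices, not the specific Cross-Layer Attention Probing recipe.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"  # stand-in for whichever open-weight model you actually serve

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def activation_features(text: str, layer: int = -4) -> torch.Tensor:
    """Mean-pool hidden states from an intermediate layer."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt", truncation=True))
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

def train_probe(texts, labels):
    """texts: model generations; labels: 1 = hallucinated, 0 = faithful."""
    X = torch.stack([activation_features(t) for t in texts]).float().numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```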
| Aspect | Question |
|---|---|
| Faithfulness | Does CoT reflect actual reasoning? |
| Verbosity | Is reasoning externalized? |
| Readability | Can humans understand it? |
| Necessity | Is CoT required for complexity? |
CoT monitoring is most relevant when the task is difficult enough that the model needs to externalize its reasoning.
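A rough sketch of just the Necessity aspect above: ask the same question with and without chain-of-thought and compare final answers. `ask` is a hypothetical model callable and the answer parsing is deliberately naive.

```python
from typing import Callable

def final_answer(completion: str) -> str:
    """Take the last non-empty line as the final answer (naive parser)."""
    lines = [l.strip().lower() for l in completion.splitlines() if l.strip()]
    return lines[-1] if lines else ""

def cot_is_necessary(question: str, ask: Callable[[str], str]) -> bool:
    """If the answer changes when reasoning is suppressed, the externalized
    CoT is plausibly load-bearing for this task."""
    with_cot = ask(question + "\nThink step by step, then give a final answer.")
    without_cot = ask(question + "\nGive only the final answer, no explanation.")
    return final_answer(with_cot) != final_answer(without_cot)
```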
| Stage | Reliability Practice |
|---|---|
| Data Ingest | Schema validation, drift checks |
| Feature Store | Versioning, consistency |
| Training | Checkpointing, resource limits |
| Eval | Automated benchmarks, holdouts |
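A minimal sketch of the Data Ingest practices above: a hand-rolled schema check run before records enter the pipeline. Field names and types are illustrative; in practice a library such as pandera or Great Expectations covers this.

```python
from typing import Any, Dict, List

# Expected schema for incoming records (illustrative fields and types).
SCHEMA = {"user_id": str, "prompt": str, "timestamp": float}

def validate_batch(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    errors = []
    for i, rec in enumerate(records):
        missing = SCHEMA.keys() - rec.keys()
        if missing:
            errors.append(f"record {i}: missing fields {sorted(missing)}")
        for field, expected in SCHEMA.items():
            if field in rec and not isinstance(rec[field], expected):
                errors.append(f"record {i}: {field} is {type(rec[field]).__name__}, "
                              f"expected {expected.__name__}")
    return errors
```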
| Agent Metric | Target |
|---|---|
| Task Completion Rate | >90% |
| Tool Call Accuracy | >98% |
| Context Utilization | >70% |
| Hallucination Rate | <1% |
| Human Escalation | <5% |
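One way these targets can be computed from logged traces. The `Trace` fields below are assumptions about what your tracing layer records; context utilization is omitted because it depends on how retrieved context is attributed to the final answer.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Trace:
    """Assumed shape of one logged agent run."""
    task_completed: bool
    tool_calls_ok: int
    tool_calls_total: int
    hallucination_flagged: bool
    escalated_to_human: bool

def agent_metrics(traces: List[Trace]) -> Dict[str, float]:
    if not traces:
        return {}
    n = len(traces)
    tool_total = sum(t.tool_calls_total for t in traces) or 1  # avoid divide-by-zero
    return {
        "task_completion_rate": sum(t.task_completed for t in traces) / n,
        "tool_call_accuracy": sum(t.tool_calls_ok for t in traces) / tool_total,
        "hallucination_rate": sum(t.hallucination_flagged for t in traces) / n,
        "human_escalation_rate": sum(t.escalated_to_human for t in traces) / n,
    }
```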
| Drift Type | What to Watch |
|---|---|
| Data Drift | Input distribution shifts |
| Concept Drift | Relationship changes |
| Model Drift | Prediction quality decay |
Continuously compare production inputs and predictions against the training distribution.
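A sketch of a data-drift check on one numeric feature (e.g., prompt length or embedding norm) using SciPy's two-sample Kolmogorov-Smirnov test; the significance threshold is illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_check(train_values: np.ndarray,
                prod_values: np.ndarray,
                alpha: float = 0.01) -> dict:
    """Compare a production feature distribution against its training-time baseline."""
    stat, p_value = ks_2samp(train_values, prod_values)
    return {"ks_stat": float(stat),
            "p_value": float(p_value),
            "drift_detected": p_value < alpha}

# Example: prompt lengths seen in training vs. the last 24h of production traffic.
# result = drift_check(train_prompt_lengths, prod_prompt_lengths)
```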
| Platform | Strength | OSS? |
|---|---|---|
| Langfuse | Tracing, evals | Yes |
| Arize Phoenix | RAG analysis | Yes |
| LangSmith | LangChain native | No |
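Whichever platform you pick, the underlying primitive is the same: a structured span per model call. Below is a platform-agnostic sketch of what to capture; the field names, crude token counts, and the `export` hook are illustrative stand-ins for what the platform SDK provides.

```python
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict

@dataclass
class LLMSpan:
    """One traced model call; maps onto the span/generation objects
    that tools like the ones above expect."""
    trace_id: str
    name: str
    model: str
    input: str
    output: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    metadata: Dict[str, Any]

def traced_call(name: str, model: str, prompt: str,
                call: Callable[[str], str],
                export: Callable[[dict], None]) -> str:
    """Run a model call and emit a span to the observability backend."""
    start = time.time()
    output = call(prompt)
    span = LLMSpan(
        trace_id=str(uuid.uuid4()), name=name, model=model,
        input=prompt, output=output,
        latency_ms=(time.time() - start) * 1000,
        prompt_tokens=len(prompt.split()),        # crude stand-in for real token counts
        completion_tokens=len(output.split()),
        metadata={},
    )
    export(asdict(span))  # e.g. forward to whichever backend you adopted
    return output
```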
Observe the Reasoning
AI reliability requires new observability primitives.