Open-source toolkit to instrument, evaluate, and track LLM applications with feedback functions and dashboard-driven comparisons.
Key features
Fine-Grained Instrumentation: Records calls across prompt, model, retriever, and knowledge-source boundaries to capture full context for each LLM interaction and enable detailed post-hoc analysis.
Feedback Functions Framework: Pluggable evaluators (feedback functions) that run automatically alongside app executions, score each response on metrics such as groundedness, helpfulness, and safety, and flag responses that fail.
RAG-Focused Tooling: Built-in patterns and examples for Retrieval-Augmented Generation workflows, organized around the RAG Triad (context relevance, groundedness, and answer relevance), to evaluate retriever effectiveness and end-to-end grounding of responses.
Dashboard & Leaderboards: A web UI to view runs, compare app versions, surface failure modes, and maintain leaderboards for experiments and evaluation metrics.
Provider & Stack Agnostic Integrations: Works with multiple model providers and orchestration layers, so the same evaluations can be reused across different stacks. Project examples and issue threads reference OpenAI, Ollama, Gemini, and LangChain adapters.
Virtual Records & Simulation: Utilities like TruVirtual and VirtualApp to create virtualized records for offline testing and deterministic evaluation of feedback functions.
Observability & OTEL Plans: Design docs and a product requirements document (PRD) for OpenTelemetry (OTEL) integration, intended to standardize spans and make instrumentation easier to debug and extend.
Package Distribution & Quickstart: Installable Python package (pip install trulens) with quick usage examples to instrument a prototype and start collecting evaluations rapidly.
Fine-grained, stack-agnostic instrumentation to capture app records and interactions with LLMs and retrievers
Support for popular stacks like LangChain and vector stores (examples include Pinecone integration)
Extensible feedback/provider architecture to add custom evaluators and endpoints
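The feedback-function idea in the features above can be illustrated with a small standalone sketch. It does not use the TruLens API: `Record`, `token_overlap`, `evaluate`, and the 0.7 threshold are hypothetical names and values chosen for illustration, and the token-overlap metric is a toy stand-in for the LLM-based or model-based evaluators a real setup would likely use. It only shows the general shape: record a call, score it with pluggable evaluators for the three RAG Triad metrics, and flag scores that fall below a threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """One recorded app call. Hypothetical structure, not a TruLens class."""
    prompt: str
    contexts: List[str]
    response: str
    scores: Dict[str, float] = field(default_factory=dict)


def token_overlap(reference: str, candidate: str) -> float:
    """Toy metric: fraction of candidate tokens that also occur in reference."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)


def context_relevance(rec: Record) -> float:
    """Average share of prompt tokens found in each retrieved context."""
    if not rec.contexts:
        return 0.0
    return sum(token_overlap(c, rec.prompt) for c in rec.contexts) / len(rec.contexts)


def groundedness(rec: Record) -> float:
    """Share of response tokens that appear in the retrieved contexts."""
    return token_overlap(" ".join(rec.contexts), rec.response)


def answer_relevance(rec: Record) -> float:
    """Share of prompt tokens that appear in the response."""
    return token_overlap(rec.response, rec.prompt)


# Pluggable evaluators: add or swap entries without touching evaluate().
FEEDBACKS: Dict[str, Callable[[Record], float]] = {
    "context_relevance": context_relevance,
    "groundedness": groundedness,
    "answer_relevance": answer_relevance,
}


def evaluate(rec: Record,
             feedbacks: Dict[str, Callable[[Record], float]],
             threshold: float = 0.7) -> List[str]:
    """Score the record with every feedback; return names scoring below threshold."""
    failing = []
    for name, fn in feedbacks.items():
        rec.scores[name] = fn(rec)
        if rec.scores[name] < threshold:
            failing.append(name)
    return failing


record = Record(
    prompt="what is trulens",
    contexts=["trulens is an open-source toolkit"],
    response="trulens is a toolkit",
)
failing = evaluate(record, FEEDBACKS)
print(record.scores)
print("below threshold:", failing)
```

Keeping evaluators as plain callables in a dictionary is one simple way to get the "pluggable" property: a custom check is just another function from a record to a score.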
Best for
Instrumenting LLM Apps: Add TruLens instrumentation to a RAG or chat app to automatically record prompts, model outputs, retriever calls, and metadata for later analysis.
Automated Feedback Evaluation: Run feedback functions on each recorded run to detect hallucinations, grounding failures, or policy/safety violations during CI or experimentation.
Model and Prompt Comparison: Use the dashboard and leaderboards to compare different model families, prompt templates, or retriever configurations side-by-side using consistent metrics.
Offline Testing with Virtual Records: Create VirtualApp/VirtualRecord datasets to reproduce and test failure modes offline and validate feedback function fixes before deployment.
Observability Integration: Integrate TruLens traces with OpenTelemetry (or other observability tooling) to align LLM evaluations with standard telemetry and tracing pipelines.
Cost & Token Monitoring: Track token usage and cost metrics across different providers and model configurations to optimize for budget and performance.
Debugging Provider Integrations: Use recorded traces and feedback outputs to diagnose provider-specific issues (e.g., adapter errors for OpenAI, LangChain, Ollama) and iterate on provider configs.
Instrumenting and evaluating RAG systems end-to-end during development
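The comparison and cost-monitoring use cases above amount to grouping recorded runs by app version and aggregating a metric alongside token usage. The sketch below shows that aggregation in plain Python. It is not how the TruLens dashboard is implemented: the `RUNS` data, the field names, the `leaderboard` helper, and the flat price per 1,000 tokens are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Made-up example runs; a real setup would read these from recorded app calls.
RUNS: List[Dict] = [
    {"app_version": "v1", "groundedness": 0.50, "total_tokens": 1000},
    {"app_version": "v1", "groundedness": 0.75, "total_tokens": 1000},
    {"app_version": "v2", "groundedness": 0.75, "total_tokens": 1500},
    {"app_version": "v2", "groundedness": 1.00, "total_tokens": 1500},
]

# Assumed flat price; actual provider pricing varies by model and token type.
PRICE_PER_1K_TOKENS_USD = 0.002


def leaderboard(runs: List[Dict], metric: str) -> List[Dict]:
    """Aggregate runs per app version and sort by mean metric, best first."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["app_version"]].append(run)

    rows = []
    for version, items in grouped.items():
        tokens = sum(item["total_tokens"] for item in items)
        rows.append({
            "app_version": version,
            "runs": len(items),
            metric: mean(item[metric] for item in items),
            "total_tokens": tokens,
            "est_cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS_USD,
        })
    return sorted(rows, key=lambda row: row[metric], reverse=True)


for row in leaderboard(RUNS, "groundedness"):
    print(row)
```

Reporting the quality metric and the estimated cost in the same row makes the trade-off visible: in this made-up data, the higher-scoring version also uses more tokens.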