Get Started with Evaluation
This guide helps you set up your first evaluation. If you want to understand what evaluation is and why it matters, check out the Evaluation Overview first. For details on concepts like scores, datasets, and experiments, see Core Concepts.
Get API keys
- Create a Langfuse account or self-host Langfuse.
- Create new API credentials in the project settings.
Set up your AI agent
Use the Langfuse Skill in your editor's agent mode to automatically set up evaluations for your application.
What is a Skill? A reusable instruction package for AI coding agents. It gives your agent Langfuse-specific workflows and best practices out of the box.
Install the Langfuse Skill in your coding tool:
Langfuse has a Cursor Plugin that includes the skill automatically.
Claude Code stores its skills in a .claude/skills directory; you can install skills either globally or per project.
Copy the Langfuse Skill to your local Claude skills folder. We recommend using a symlink to keep the skill up to date.
You can do this using npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse" --agent "claude-code"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure the Claude skills directory exists (common location):

```
mkdir -p ~/.claude/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse ~/.claude/skills/langfuse
```

Codex stores its skills in a .agents/skills directory; you can install skills either globally or per project. See Codex docs: Where to save skills.
Copy the Langfuse Skill to your local Codex skills folder. We recommend using a symlink to keep the skill up to date.
You can do this using npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse" --agent "codex"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure the Codex skills directory exists (common location):

```
mkdir -p ~/.agents/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse ~/.agents/skills/langfuse
```

For other AI coding agents, the skill folder structure is <agent-skill-root>/skills/langfuse, where <agent-skill-root> depends on your tool. The npm command below installs to the correct location automatically.
For other AI coding agents, install via npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse"
```

If you want to target a specific agent directly:

```
npx skills add langfuse/skills --skill "langfuse" --agent "<agent-id>"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure your agent's skills directory exists:

```
mkdir -p /path/to/<agent-skill-root>/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse /path/to/<agent-skill-root>/skills/langfuse
```

Set up evals
Start a new agent session, then prompt it to set up evaluations:
"Set up Langfuse evaluations for this application. Help me choose the right evaluation approach and implement it."

The agent will analyze your codebase, recommend the best evaluation method, and help you implement it.
Pick your starting point
Different teams need different evaluation approaches. Pick the one that matches what you want to do right now — you can always add more later.
Monitor Production
Automatically score live traces to catch quality issues in real time.
Test Before Shipping
Run your app against a dataset and evaluate results before deploying.
Human Review
Set up structured review queues for domain experts to label and score traces.
Not sure which to pick? Here's a rule of thumb:
- Already have traces in Langfuse? Start with Monitor Production — you'll get scores on your existing data within minutes.
- Building something new or changing prompts? Start with Test Before Shipping — create a dataset and run experiments to validate changes.
- Need ground truth or expert review? Start with Human Review — build a labeled dataset from real traces.
Monitor Production
Use LLM-as-a-Judge to automatically evaluate live traces. An LLM scores your application's outputs against criteria you define — no code changes required.
Prerequisites: Traces flowing into Langfuse and an LLM connection configured.
Create an evaluator
Navigate to Evaluators in the sidebar and click + Set up Evaluator. Choose a managed evaluator (e.g., Hallucination, Helpfulness) or write your own evaluation prompt.
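If you write your own prompt, a minimal sketch might look like the following (this is an illustrative template, not a managed evaluator; the {{input}} and {{output}} variables are mapped to trace fields in a later step):

```
You are evaluating a chatbot response.

Question: {{input}}
Response: {{output}}

Score the response for helpfulness from 0 to 1, where 1 means the response
fully and accurately answers the question. Return a score and a one-sentence
reasoning.
```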
Select your target data
Choose Live Observations to evaluate individual operations (recommended) or Live Traces to evaluate complete workflows. Add filters to target specific operations — for example, only evaluate observations named chat-response.
Map variables and activate
Map the evaluator's variables (like {{input}} and {{output}}) to the corresponding fields in your traces. Preview how the evaluation prompt looks with real data, then save.
New matching traces will be scored automatically. Check the Scores tab on any trace to see results.
Test Before Shipping
Run your application against a fixed dataset and evaluate the outputs. This is how you catch regressions before deploying.
Prerequisites: Langfuse SDK installed (Python v3+ or JS/TS v4+).
Define test data
Start with a few representative inputs and expected outputs. You can use local data or create a dataset in Langfuse.
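Before wiring anything into Langfuse, it can help to sanity-check your test cases and scoring logic with plain Python. A minimal sketch (no SDK calls; the item shape mirrors the experiment example below, and the fake outputs are placeholders for real model responses):

```python
# Local test cases in the same shape used by the experiment runner.
test_data = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
]

def check_answer(output: str, expected_output: str) -> float:
    """Toy correctness check: does the output contain the expected answer?"""
    return 1.0 if expected_output.lower() in output.lower() else 0.0

# Dry-run the scorer against fake outputs before calling any LLM.
fake_outputs = ["The capital of France is Paris.", "I don't know."]
scores = [
    check_answer(out, case["expected_output"])
    for out, case in zip(fake_outputs, test_data)
]
print(scores)  # [1.0, 0.0]
```

This catches shape mistakes (wrong keys, bad casing logic) cheaply, before any tokens are spent.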
Run an experiment
Use the experiment runner SDK to execute your application against every test case and optionally score the results.
```python
from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI

langfuse = get_client()


def my_task(*, item, **kwargs):
    response = OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": item["input"]}],
    )
    return response.choices[0].message.content


def check_answer(*, output, expected_output, **kwargs):
    is_correct = expected_output.lower() in output.lower()
    return Evaluation(name="correctness", value=1.0 if is_correct else 0.0)


result = langfuse.run_experiment(
    name="My First Experiment",
    data=[
        {"input": "What is the capital of France?", "expected_output": "Paris"},
        {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
    ],
    task=my_task,
    evaluators=[check_answer],
)
print(result.format())
```

```typescript
import { OpenAI } from "openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseClient, ExperimentItem } from "@langfuse/client";
import { observeOpenAI } from "@langfuse/openai";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const otelSdk = new NodeSDK({ spanProcessors: [new LangfuseSpanProcessor()] });
otelSdk.start();

const langfuse = new LangfuseClient();

const testData: ExperimentItem[] = [
  { input: "What is the capital of France?", expectedOutput: "Paris" },
  { input: "What is the capital of Germany?", expectedOutput: "Berlin" },
];

const myTask = async (item: ExperimentItem) => {
  const response = await observeOpenAI(new OpenAI()).chat.completions.create({
    model: "gpt-4.1",
    messages: [{ role: "user", content: item.input as string }],
  });
  return response.choices[0].message.content;
};

const checkAnswer = async ({ output, expectedOutput }) => ({
  name: "correctness",
  value:
    expectedOutput && output.toLowerCase().includes(expectedOutput.toLowerCase())
      ? 1.0
      : 0.0,
});

const result = await langfuse.experiment.run({
  name: "My First Experiment",
  data: testData,
  task: myTask,
  evaluators: [checkAnswer],
});

console.log(await result.format());
await otelSdk.shutdown();
```

Review results
The experiment runner prints a summary table. If you used a Langfuse dataset, results are also available in the Langfuse UI under Datasets where you can compare runs side by side.
Human Review
Set up annotation queues so domain experts can review traces and add scores manually. This is the best way to build ground truth data and calibrate automated evaluators.
Prerequisites: Traces in Langfuse and at least one score config.
Create a score config
Go to Settings → Score Configs and create a config that defines what you want to measure. For example, a categorical config with values correct, partially_correct, and incorrect.
Create an annotation queue
Navigate to Annotation Queues and click New Queue. Give it a name, attach your score config, and optionally assign team members.
Add traces and start reviewing
Select traces from the Traces table and click Actions → Add to queue. Open the queue and work through items — score each one, add comments, then click Complete + next.
Next steps
Now that you have your first evaluation running, here are recommended next steps:
- Combine methods: Use annotation queues to build ground truth, then calibrate LLM-as-a-Judge evaluators against human scores.
- Build a dataset: Collect edge cases from production into a dataset for repeatable testing.
- Add to CI: Run experiments in your test suite to catch regressions automatically.
- Track trends: Use score analytics and custom dashboards to monitor evaluation scores over time.
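For the CI step, one common pattern is to fail the build when the average evaluation score drops below a threshold. A minimal sketch (the threshold and the way you collect scores from your experiment run are assumptions, not part of the SDK API):

```python
def gate_on_scores(scores: list[float], threshold: float = 0.8) -> None:
    """Raise if the mean evaluation score falls below the threshold."""
    mean = sum(scores) / len(scores)
    if mean < threshold:
        raise AssertionError(
            f"Mean score {mean:.2f} is below threshold {threshold}"
        )

# Example with hypothetical scores from an experiment run:
gate_on_scores([1.0, 1.0, 0.5], threshold=0.8)  # passes: mean is ~0.83
```

Calling this at the end of your test suite turns silent quality regressions into failing builds.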
Looking for something specific? Check the Evaluation Methods and Experiments sections for detailed guides.