Get Started with Evaluation
This guide helps you set up your first evaluation. If you want to understand what evaluation is and why it matters, check out the Evaluation Overview first. For details on concepts like scores, datasets, and experiments, see Core Concepts.
Get API keys
- Create a Langfuse account or self-host Langfuse.
- Create new API credentials in the project settings.
Set up your AI agent
Use the Langfuse Skill in your editor's agent mode to automatically set up evaluations for your application.
What is a Skill? A reusable instruction package for AI coding agents. It gives your agent Langfuse-specific workflows and best practices out of the box.
Install the Langfuse Skill in your coding tool:
Langfuse has a Cursor Plugin that includes the skill automatically.
Claude Code stores its skills in a .claude/skills directory; you can install skills either globally or per project.
Copy the Langfuse Skill to your local Claude skills folder. We recommend using a symlink to keep the skill up to date.
You can do this using npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse" --agent "claude-code"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure the Claude skills directory exists (common location):

```
mkdir -p ~/.claude/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse ~/.claude/skills/langfuse
```

Codex stores its skills in a .agents/skills directory; you can install skills either globally or per project. See Codex docs: Where to save skills.
Copy the Langfuse Skill to your local Codex skills folder. We recommend using a symlink to keep the skill up to date.
You can do this using npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse" --agent "codex"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure the Codex skills directory exists (common location):

```
mkdir -p ~/.agents/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse ~/.agents/skills/langfuse
```

For other AI coding agents, the skill folder structure is <agent-skill-root>/skills/langfuse, where <agent-skill-root> depends on your tool. The npm command below installs to the correct location automatically.
For other AI coding agents, install via npm (skills CLI):
```
npx skills add langfuse/skills --skill "langfuse"
```

If you want to target a specific agent directly:

```
npx skills add langfuse/skills --skill "langfuse" --agent "<agent-id>"
```

Alternatively, you can do this manually:

- Clone the repo somewhere stable:

```
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
```

- Make sure your agent's skills directory exists:

```
mkdir -p /path/to/<agent-skill-root>/skills
```

- Symlink the skill folder:

```
ln -s /path/to/langfuse-skills/skills/langfuse /path/to/<agent-skill-root>/skills/langfuse
```

Set up evals
Start a new agent session, then prompt it to set up evaluations:
"Set up Langfuse evaluations for this application. Help me choose the right evaluation approach and implement it."

The agent will analyze your codebase, recommend the best evaluation method, and help you implement it.
Pick your starting point
Different teams need different evaluation approaches. Pick the one that matches what you want to do right now — you can always add more later.
Monitor Production
Automatically score live traces to catch quality issues in real time.
Test Before Shipping
Run your app against a dataset and evaluate results before deploying.
Human Review
Set up structured review queues for domain experts to label and score traces.
Not sure which to pick? Here's a rule of thumb:
- Already have traces in Langfuse? Start with Monitor Production — you'll get scores on your existing data within minutes.
- Building something new or changing prompts? Start with Test Before Shipping — create a dataset and run experiments to validate changes.
- Need ground truth or expert review? Start with Human Review — build a labeled dataset from real traces.
Monitor Production
Use LLM-as-a-Judge to automatically evaluate live traces. An LLM scores your application's outputs against criteria you define — no code changes required.
Prerequisites: Traces flowing into Langfuse and an LLM connection configured.
Create an evaluator
Navigate to Evaluators in the sidebar and click + Set up Evaluator. Choose a managed evaluator (e.g., Hallucination, Helpfulness) or write your own evaluation prompt.
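If you write your own prompt, a minimal sketch might look like the following (this is an illustrative template, not a managed evaluator; the {{input}} and {{output}} variables are mapped to trace fields in a later step):

```
You are evaluating a chatbot response.

Question: {{input}}
Response: {{output}}

Score the response for helpfulness from 0 to 1, where 1 means the response
fully and accurately answers the question. Return a score and a one-sentence
reasoning.
```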
Select your target data
Choose Live Observations to evaluate individual operations (recommended) or Live Traces to evaluate complete workflows. Add filters to target specific operations — for example, only evaluate observations named chat-response.
Map variables and activate
Map the evaluator's variables (like {{input}} and {{output}}) to the corresponding fields in your traces. Preview how the evaluation prompt looks with real data, then save.
New matching traces will be scored automatically. Check the Scores tab on any trace to see results.
Test Before Shipping
Run your application against a fixed dataset and evaluate the outputs. This is how you catch regressions before deploying.
Prerequisites: Langfuse SDK installed (Python v3+ or JS/TS v4+).
Define test data
Start with a few representative inputs and expected outputs. You can use local data or create a dataset in Langfuse.
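Before wiring anything into Langfuse, it can help to sanity-check your test cases and scoring logic with plain Python. A minimal sketch (no SDK calls; the item shape mirrors the experiment example below, and the fake outputs are placeholders for real model responses):

```python
# Local test cases in the same shape used by the experiment runner.
test_data = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
]

def check_answer(output: str, expected_output: str) -> float:
    """Toy correctness check: does the output contain the expected answer?"""
    return 1.0 if expected_output.lower() in output.lower() else 0.0

# Dry-run the scorer against fake outputs before calling any LLM.
fake_outputs = ["The capital of France is Paris.", "I don't know."]
scores = [
    check_answer(out, case["expected_output"])
    for out, case in zip(fake_outputs, test_data)
]
print(scores)  # [1.0, 0.0]
```

This catches shape mistakes (wrong keys, bad casing logic) cheaply, before any tokens are spent.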
Run an experiment
Use the experiment runner SDK to execute your application against every test case and optionally score the results.
```python
from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI

langfuse = get_client()


def my_task(*, item, **kwargs):
    response = OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": item["input"]}],
    )
    return response.choices[0].message.content


def check_answer(*, output, expected_output, **kwargs):
    is_correct = expected_output.lower() in output.lower()
    return Evaluation(name="correctness", value=1.0 if is_correct else 0.0)


result = langfuse.run_experiment(
    name="My First Experiment",
    data=[
        {"input": "What is the capital of France?", "expected_output": "Paris"},
        {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
    ],
    task=my_task,
    evaluators=[check_answer],
)
print(result.format())
```

```typescript
import { OpenAI } from "openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseClient, ExperimentItem } from "@langfuse/client";
import { observeOpenAI } from "@langfuse/openai";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const otelSdk = new NodeSDK({ spanProcessors: [new LangfuseSpanProcessor()] });
otelSdk.start();

const langfuse = new LangfuseClient();

const testData: ExperimentItem[] = [
  { input: "What is the capital of France?", expectedOutput: "Paris" },
  { input: "What is the capital of Germany?", expectedOutput: "Berlin" },
];

const myTask = async (item: ExperimentItem) => {
  const response = await observeOpenAI(new OpenAI()).chat.completions.create({
    model: "gpt-4.1",
    messages: [{ role: "user", content: item.input as string }],
  });
  return response.choices[0].message.content;
};

const checkAnswer = async ({ output, expectedOutput }) => ({
  name: "correctness",
  value:
    expectedOutput && output.toLowerCase().includes(expectedOutput.toLowerCase())
      ? 1.0
      : 0.0,
});

const result = await langfuse.experiment.run({
  name: "My First Experiment",
  data: testData,
  task: myTask,
  evaluators: [checkAnswer],
});

console.log(await result.format());
await otelSdk.shutdown();
```

Review results
The experiment runner prints a summary table. If you used a Langfuse dataset, results are also available in the Langfuse UI under Datasets where you can compare runs side by side.
Human Review
Set up annotation queues so domain experts can review traces and add scores manually. This is the best way to build ground truth data and calibrate automated evaluators.
Prerequisites: Traces in Langfuse and at least one score config.
Create a score config
Go to Settings → Score Configs and create a config that defines what you want to measure. For example, a categorical config with values correct, partially_correct, and incorrect.
Create an annotation queue
Navigate to Annotation Queues and click New Queue. Give it a name, attach your score config, and optionally assign team members.
Add traces and start reviewing
Select traces from the Traces table and click Actions → Add to queue. Open the queue and work through items — score each one, add comments, then click Complete + next.
Next steps
Now that you have your first evaluation running, here are recommended next steps:
- Combine methods: Use annotation queues to build ground truth, then calibrate LLM-as-a-Judge evaluators against human scores.
- Build a dataset: Collect edge cases from production into a dataset for repeatable testing.
- Add to CI: Run experiments in your test suite to catch regressions automatically.
- Track trends: Use score analytics and custom dashboards to monitor evaluation scores over time.
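For the CI step, one common pattern is to fail the build when the average evaluation score drops below a threshold. A minimal sketch (the threshold and the way you collect scores from your experiment run are assumptions, not part of the SDK API):

```python
def gate_on_scores(scores: list[float], threshold: float = 0.8) -> None:
    """Raise if the mean evaluation score falls below the threshold."""
    mean = sum(scores) / len(scores)
    if mean < threshold:
        raise AssertionError(
            f"Mean score {mean:.2f} is below threshold {threshold}"
        )

# Example with hypothetical scores from an experiment run:
gate_on_scores([1.0, 1.0, 0.5], threshold=0.8)  # passes: mean is ~0.83
```

Calling this at the end of your test suite turns silent quality regressions into failing builds.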
Looking for something specific? Check the Evaluation Methods and Experiments sections for detailed guides.