CI/CD Integration
Use Langfuse experiments in your CI/CD pipeline to catch quality regressions before they ship.
The workflow is:
- Store your test cases in a Langfuse dataset (a minimal upload sketch follows this list).
- Write an experiment with the Python or JS/TS SDK that tests your application against the dataset.
- Add evaluators to score the experiment results.
- Raise `RegressionError` when a score violates your threshold.
- Create a GitHub Actions workflow that runs the script with `langfuse/experiment-action`.
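The dataset upload from the first bullet is a one-time setup step. Here is a minimal sketch using the Python SDK's `create_dataset` and `create_dataset_item`; the item contents are illustrative and match the `support-agent-regression-set` dataset used in the examples on this page.

```python
from langfuse import get_client

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment.
langfuse = get_client()

# Create the dataset once; the GitHub Action below references it by name.
langfuse.create_dataset(name="support-agent-regression-set")

# Illustrative test cases; shape them to match what your task function expects.
test_cases = [
    {"question": "How do I reset my password?", "expected": "Use the reset link on the login page."},
    {"question": "How do I cancel my plan?", "expected": "Open billing settings and choose cancel."},
]

for case in test_cases:
    langfuse.create_dataset_item(
        dataset_name="support-agent-regression-set",
        input={"question": case["question"]},
        expected_output=case["expected"],
    )
```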
GitHub Actions workflow
Create a workflow with the trigger you need, for example `pull_request` or `release`.
Pin the action to a release tag from the langfuse/experiment-action releases.
`RunnerContext` requires Langfuse Python SDK v4.6.0 or newer, or Langfuse JS SDK v5.3.0 or newer. The action installs the latest SDK version by default; set `python_sdk_version` or `js_sdk_version` if you need to pin a specific SDK version.
```yaml
name: Langfuse experiment gate

on:
  # Run the gate for every pull request. Change this to `push`, `release`, or another
  # trigger if you want to run experiments at a different point in your workflow.
  pull_request:

permissions:
  # Required to check out the repository.
  contents: read
  # Required to post or update the experiment result comment on pull requests.
  pull-requests: write
  # Optional: lets the result link to this specific job's logs.
  # Without this permission, the action falls back to the workflow-run URL.
  actions: read

jobs:
  experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      # Required only if you run Python experiments
      - uses: actions/setup-python@v6
        with:
          python-version: "3.14"

      # Required only if you run TypeScript or JavaScript experiments
      - uses: actions/setup-node@v6
        with:
          node-version: "24"

      - uses: langfuse/experiment-action@<release tag>
        with:
          # the credentials for Langfuse
          langfuse_public_key: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          langfuse_secret_key: ${{ secrets.LANGFUSE_SECRET_KEY }}
          langfuse_base_url: https://cloud.langfuse.com
          # the location of your experiment scripts
          experiment_path: experiments/support-agent-gate
          # the dataset to run the experiment against
          dataset_name: support-agent-regression-set
          dataset_version: "2026-04-27T00:00:00Z"
          # GitHub token so that the action can comment on PRs
          github_token: ${{ github.token }}
          # additional metadata to store with the experiment and show in the Langfuse UI
          experiment_metadata: |
            gate=pr
            suite=support-agent
```

Action inputs and outputs
| Input | Required | Description |
|---|---|---|
| `langfuse_public_key` | Yes | Langfuse public key used by the SDK client. Store it as a GitHub secret. |
| `langfuse_secret_key` | Yes | Langfuse secret key used by the SDK client. Store it as a GitHub secret. |
| `langfuse_base_url` | No | Langfuse host. Defaults to `https://cloud.langfuse.com`; change this for self-hosted Langfuse. |
| `experiment_path` | Yes | Path to an experiment script, directory, or glob pattern. Supports Python, TypeScript, and JavaScript. |
| `dataset_name` | No | Langfuse dataset loaded by the action and provided to the SDK via `RunnerContext`. If omitted, the script must provide its own data. |
| `dataset_version` | No | Timestamp to pin the dataset version for reproducible CI runs. Defaults to the latest dataset version. |
| `experiment_metadata` | No | Additional `key=value` metadata added to the experiment together with default GitHub metadata. This metadata is visible in the Langfuse UI. |
| `should_fail_on_regression` | No | Fail the CI job when an experiment raises `RegressionError`. Defaults to `true`. |
| `should_fail_on_script_error` | No | Fail the CI job when an experiment script crashes or raises a non-regression error. Defaults to `true`. |
| `should_comment_on_pr` | No | Post or update the experiment report as a pull request comment. Defaults to `true`. |
| `python_sdk_version` | No | Langfuse Python SDK version installed by the action for `.py` experiments. Defaults to latest; use v4.6.0 or newer. |
| `js_sdk_version` | No | `@langfuse/client` version installed by the action for TypeScript or JavaScript experiments. Defaults to latest; use v5.3.0 or newer. |
| `should_skip_sdk_installation` | No | Skip SDK installation when you manage the Python or Node environment yourself before this action. Defaults to `false`. |
| `github_token` | No | GitHub token used to post PR comments and resolve the current job URL. Leave blank to skip both. |
See the full input reference in the langfuse/experiment-action README.
| Output | Description |
|---|---|
| `result_json` | Normalized JSON result for downstream workflow steps. |
| `failed` | `true` if any experiment script errored or raised a regression; otherwise `false`. |
Additional secrets
If your experiment needs provider keys or other secrets, set them as environment variables on the action step. The experiment subprocess inherits the step environment.
```yaml
- uses: langfuse/experiment-action@<release tag>
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  with:
    langfuse_public_key: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
    langfuse_secret_key: ${{ secrets.LANGFUSE_SECRET_KEY }}
    experiment_path: experiments/support-agent-gate
    dataset_name: support-agent-regression-set
```

Your experiment can read these values from `os.environ[...]` in Python or `process.env...` in TypeScript and JavaScript. See the langfuse/experiment-action README for details.
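For example, a Python experiment script can pick up a provider key from the inherited step environment before constructing its client. A minimal sketch; the OpenAI client usage is illustrative:

```python
import os

from openai import OpenAI

# OPENAI_API_KEY was set on the action step and is inherited by this subprocess.
# The OpenAI client reads it from the environment by default; the explicit
# lookup here just fails fast with a clear message if the secret is missing.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set on the action step")

client = OpenAI(api_key=api_key)
```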
Experiment script definition
Each script must define an `experiment(context)` function. The action creates a `RunnerContext` and passes it to this function.
`RunnerContext` handles the CI-specific setup for you:
- initializes the Langfuse SDK client from the action inputs
- loads the dataset items from `dataset_name` and applies `dataset_version`
- adds default metadata under `langfuse.*`, such as commit SHA, branch, job URL, and actor. These values are visible in the Langfuse UI.
Use `context.runExperiment` (JS/TS) or `context.run_experiment` (Python) to run the experiment with these defaults.
```typescript
import type { RunnerContext } from "@langfuse/client";

export async function experiment(context: RunnerContext) {
  return await context.runExperiment({
    name: "PR gate",
    task: async (item) => item.input,
  });
}
```

```python
from langfuse import RunnerContext

def experiment(context: RunnerContext):
    return context.run_experiment(
        name="PR gate",
        task=lambda item, **_: item.input,
    )
```

Pass explicit values to `context.runExperiment` / `context.run_experiment` when you want to override action-provided defaults such as metadata.
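For example, a sketch of a Python script that overrides the default metadata while keeping the action-provided client and dataset; the metadata keys are illustrative:

```python
from langfuse import RunnerContext

def experiment(context: RunnerContext):
    return context.run_experiment(
        name="PR gate",
        task=lambda item, **_: item.input,
        # Explicit arguments take precedence over the action-provided defaults.
        metadata={"suite": "support-agent", "owner": "platform-team"},
    )
```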
Failing on regressions
Raise `RegressionError` when a result should block the workflow. The example below fails when average exact-match accuracy is below the threshold.
```typescript
import {
  RegressionError,
  type Evaluation,
  type RunnerContext,
} from "@langfuse/client";

const THRESHOLD = 0.95;

export async function experiment(context: RunnerContext) {
  const result = await context.runExperiment({
    name: "PR gate: support agent",
    task: answerSupportQuestion,
    evaluators: [exactMatch],
    runEvaluators: [avgAccuracy],
    metadata: { suite: "support-agent" },
  });

  const accuracy = result.runEvaluations.find(
    (evaluation) => evaluation.name === "avg_accuracy",
  )?.value;

  if (typeof accuracy !== "number" || accuracy < THRESHOLD) {
    throw new RegressionError({
      result,
      metric: "avg_accuracy",
      value: typeof accuracy === "number" ? accuracy : 0,
      threshold: THRESHOLD,
    });
  }

  return result;
}

async function answerSupportQuestion(item: { input?: unknown }) {
  const { question } = item.input as { question: string };
  // Replace this stub with your application logic.
  return question;
}

async function exactMatch({
  output,
  expectedOutput,
}: {
  output: string;
  expectedOutput?: string;
}): Promise<Evaluation> {
  const passed = output.trim() === expectedOutput?.trim();
  return {
    name: "exact_match",
    value: passed ? 1 : 0,
    comment: passed ? "match" : "mismatch",
  };
}

async function avgAccuracy({
  itemResults,
}: {
  itemResults: Array<{ evaluations: Evaluation[] }>;
}): Promise<Evaluation> {
  const scores = itemResults
    .flatMap((item) =>
      item.evaluations
        .filter((evaluation) => evaluation.name === "exact_match")
        .map((evaluation) => Number(evaluation.value)),
    )
    .filter((score) => Number.isFinite(score));

  const value = scores.length
    ? scores.reduce((sum, score) => sum + score, 0) / scores.length
    : 0;

  return { name: "avg_accuracy", value };
}
```

```python
from langfuse import Evaluation, RegressionError, RunnerContext

THRESHOLD = 0.95

def experiment(context: RunnerContext):
    result = context.run_experiment(
        name="PR gate: support agent",
        task=answer_support_question,
        evaluators=[exact_match],
        run_evaluators=[avg_accuracy],
        metadata={"suite": "support-agent"},
    )

    accuracy = next(
        (
            evaluation.value
            for evaluation in result.run_evaluations
            if evaluation.name == "avg_accuracy"
        ),
        None,
    )

    if not isinstance(accuracy, (int, float)) or accuracy < THRESHOLD:
        raise RegressionError(
            result=result,
            metric="avg_accuracy",
            value=float(accuracy) if isinstance(accuracy, (int, float)) else 0.0,
            threshold=THRESHOLD,
        )

    return result

def answer_support_question(item, **kwargs):
    # Replace this stub with your application logic.
    return item.input["question"]

def exact_match(*, output, expected_output, **kwargs):
    passed = output.strip() == (expected_output or "").strip()
    return Evaluation(
        name="exact_match",
        value=1.0 if passed else 0.0,
        comment="match" if passed else "mismatch",
    )

def avg_accuracy(*, item_results, **kwargs):
    scores = [
        evaluation.value
        for item in item_results
        for evaluation in item.evaluations
        if evaluation.name == "exact_match" and isinstance(evaluation.value, (int, float))
    ]
    return Evaluation(name="avg_accuracy", value=sum(scores) / len(scores) if scores else 0.0)
```

Action output
When `github_token` is provided and the workflow has `pull-requests: write`, the action posts or updates a pull request comment with:
- pass, regression, or script-error status per experiment script
- run-level scores such as `avg_accuracy`
- a link to the GitHub Action run
- a link to the Langfuse experiment comparison view for dataset-backed runs
- a compact table of item outputs and item-level scores
The same normalized data is available as the `result_json` action output. Use this when a later workflow step needs to upload the result as an artifact, send a Slack notification, or feed another reporting system. The output schema is available in the langfuse/experiment-action repository.
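For instance, a later step could pipe the stored result to Slack. A sketch using only the Python standard library; `SLACK_WEBHOOK_URL` is a hypothetical secret you would configure yourself, and `experiment-result.json` is the file written by the workflow snippet below:

```python
import json
import os
import urllib.request

# Load the normalized result written by the "Store experiment result" step.
with open("experiment-result.json") as f:
    result = json.load(f)

# Post the raw result as a Slack message via an incoming webhook.
# SLACK_WEBHOOK_URL is a hypothetical secret; adjust to your notification setup.
payload = json.dumps({"text": "Langfuse experiment result:\n" + json.dumps(result, indent=2)})
request = urllib.request.Request(
    os.environ["SLACK_WEBHOOK_URL"],
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(request)
```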
```yaml
- uses: langfuse/experiment-action@<release tag>
  id: experiment
  with:
    # ...

- name: Store experiment result
  run: echo '${{ steps.experiment.outputs.result_json }}' > experiment-result.json
```

Other CI/CD systems
Integrate the experiment runner with testing frameworks like Pytest and Vitest to run automated evaluations in your CI pipeline. Use evaluators to create assertions that fail tests based on experiment results.
```python
# test_geography_experiment.py
import pytest
from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI

# Test data for European capitals
test_data = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
    {"input": "What is the capital of Spain?", "expected_output": "Madrid"},
]

def geography_task(*, item, **kwargs):
    """Task function that answers geography questions"""
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

def accuracy_evaluator(*, input, output, expected_output, **kwargs):
    """Evaluator that checks if the expected answer is in the output"""
    if expected_output and expected_output.lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0)
    return Evaluation(name="accuracy", value=0.0)

def average_accuracy_evaluator(*, item_results, **kwargs):
    """Run evaluator that calculates average accuracy across all items"""
    accuracies = [
        evaluation.value
        for result in item_results
        for evaluation in result.evaluations
        if evaluation.name == "accuracy"
    ]
    if not accuracies:
        return Evaluation(name="avg_accuracy", value=None)
    avg = sum(accuracies) / len(accuracies)
    return Evaluation(name="avg_accuracy", value=avg, comment=f"Average accuracy: {avg:.2%}")

@pytest.fixture
def langfuse_client():
    """Initialize Langfuse client for testing"""
    return get_client()

def test_geography_accuracy_passes(langfuse_client):
    """Test that passes when accuracy is above threshold"""
    result = langfuse_client.run_experiment(
        name="Geography Test - Should Pass",
        data=test_data,
        task=geography_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )

    # Access the run evaluator result directly
    avg_accuracy = next(
        evaluation.value
        for evaluation in result.run_evaluations
        if evaluation.name == "avg_accuracy"
    )

    # Assert minimum accuracy threshold
    assert avg_accuracy >= 0.8, f"Average accuracy {avg_accuracy:.2f} below threshold 0.8"

def test_geography_accuracy_fails(langfuse_client):
    """Example test that demonstrates failure conditions"""
    # Use a weaker model or harder questions to demonstrate test failure
    def failing_task(*, item, **kwargs):
        # Simulate a task that gives wrong answers
        return "I don't know"

    result = langfuse_client.run_experiment(
        name="Geography Test - Should Fail",
        data=test_data,
        task=failing_task,
        evaluators=[accuracy_evaluator],
        run_evaluators=[average_accuracy_evaluator]
    )

    # Access the run evaluator result directly
    avg_accuracy = next(
        evaluation.value
        for evaluation in result.run_evaluations
        if evaluation.name == "avg_accuracy"
    )

    # This test will fail because the task gives wrong answers
    with pytest.raises(AssertionError):
        assert avg_accuracy >= 0.8, f"Expected test to fail with low accuracy: {avg_accuracy:.2f}"
```

```typescript
// test/geography-experiment.test.ts
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { OpenAI } from "openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseClient, ExperimentItem } from "@langfuse/client";
import { observeOpenAI } from "@langfuse/openai";
import { LangfuseSpanProcessor } from "@langfuse/otel";

// Test data for European capitals
const testData: ExperimentItem[] = [
  { input: "What is the capital of France?", expectedOutput: "Paris" },
  { input: "What is the capital of Germany?", expectedOutput: "Berlin" },
  { input: "What is the capital of Spain?", expectedOutput: "Madrid" },
];

let otelSdk: NodeSDK;
let langfuse: LangfuseClient;

beforeAll(async () => {
  // Initialize OpenTelemetry
  otelSdk = new NodeSDK({ spanProcessors: [new LangfuseSpanProcessor()] });
  otelSdk.start();

  // Initialize Langfuse client
  langfuse = new LangfuseClient();
});

afterAll(async () => {
  // Clean shutdown
  await otelSdk.shutdown();
});

const geographyTask = async (item: ExperimentItem) => {
  const question = item.input;
  const response = await observeOpenAI(new OpenAI()).chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: question }],
  });
  return response.choices[0].message.content;
};

const accuracyEvaluator = async ({ input, output, expectedOutput }) => {
  if (
    expectedOutput &&
    output.toLowerCase().includes(expectedOutput.toLowerCase())
  ) {
    return { name: "accuracy", value: 1 };
  }
  return { name: "accuracy", value: 0 };
};

const averageAccuracyEvaluator = async ({ itemResults }) => {
  // Calculate average accuracy across all items
  const accuracies = itemResults
    .flatMap((result) => result.evaluations)
    .filter((evaluation) => evaluation.name === "accuracy")
    .map((evaluation) => evaluation.value as number);

  if (accuracies.length === 0) {
    return { name: "avg_accuracy", value: null };
  }

  const avg = accuracies.reduce((sum, val) => sum + val, 0) / accuracies.length;
  return {
    name: "avg_accuracy",
    value: avg,
    comment: `Average accuracy: ${(avg * 100).toFixed(1)}%`,
  };
};

describe("Geography Experiment Tests", () => {
  it("should pass when accuracy is above threshold", async () => {
    const result = await langfuse.experiment.run({
      name: "Geography Test - Should Pass",
      data: testData,
      task: geographyTask,
      evaluators: [accuracyEvaluator],
      runEvaluators: [averageAccuracyEvaluator],
    });

    // Access the run evaluator result directly
    const avgAccuracy = result.runEvaluations.find(
      (evaluation) => evaluation.name === "avg_accuracy",
    )?.value as number;

    // Assert minimum accuracy threshold
    expect(avgAccuracy).toBeGreaterThanOrEqual(0.8);
  }, 30_000); // 30 second timeout for API calls

  it("should fail when accuracy is below threshold", async () => {
    // Task that gives wrong answers to demonstrate test failure
    const failingTask = async (item: ExperimentItem) => {
      return "I don't know";
    };

    const result = await langfuse.experiment.run({
      name: "Geography Test - Should Fail",
      data: testData,
      task: failingTask,
      evaluators: [accuracyEvaluator],
      runEvaluators: [averageAccuracyEvaluator],
    });

    // Access the run evaluator result directly
    const avgAccuracy = result.runEvaluations.find(
      (evaluation) => evaluation.name === "avg_accuracy",
    )?.value as number;

    // This test will fail because the task gives wrong answers
    expect(() => {
      expect(avgAccuracy).toBeGreaterThanOrEqual(0.8);
    }).toThrow();
  }, 30_000);
});
```

These examples show how to use the experiment runner's evaluation results to create meaningful test assertions in your CI pipeline. Tests can fail when accuracy drops below acceptable thresholds, ensuring model quality standards are maintained automatically.