CI/CD for AI: Running Prompt Regression Tests and LLM-as-a-Judge Gates in GitHub Actions
On February 13, 2026, GitHub shipped Agentic Workflows as a technical preview - AI agents triggered by issues and pull requests that triage bugs, generate fixes, write tests, and open PRs without human intervention. It’s a real shift in how code moves through a pipeline.
The problem: most teams adopting AI-assisted development haven’t updated their testing approach to match. The tests that work for deterministic code - input goes in, exact output comes out, compare strings - break immediately when the output is AI-generated. You get false failures on phrasing changes that don’t affect quality. You miss real regressions because the output still looks plausible. The existing test infrastructure is the wrong tool.
Here’s what actually works.
Why Your Existing Tests Don’t Work
Conventional CI/CD testing assumes determinism. Given input X, the system must produce exactly output Y. This works because code is deterministic - the same inputs reliably produce the same outputs, and any deviation is a bug.
AI outputs are probabilistic. The same prompt with the same inputs produces different outputs on different runs, at different temperatures, across different model versions. “Different” doesn’t mean “wrong.” Two responses can be entirely different strings and both be correct answers to the question asked.
This creates three failure modes that don’t exist in traditional testing:
False failures from output variation: Your test expected “The answer is 42” but the model returned “The result is 42.” The test fails. Nothing regressed. This is the most common source of test suite instability in AI pipelines.
Fragile infrastructure masking real problems: Timing, rate limits, and environmental noise generate failures that have nothing to do with your prompt or your code. When everything looks like flakiness, real regressions hide in the noise.
Compliance traps: The outcome is correct but the path was wrong. The agent returned the right answer but called the wrong tool, skipped a required validation step, or reasoned through a path that would fail on edge cases you haven’t seen yet. String-matching the output misses this entirely.
The fix is to stop testing output equality and start testing output quality.
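To make the contrast concrete, here is a minimal sketch built on the "42" example above. The helper names `exact_match` and `contains_answer` are hypothetical, and the whole-token check stands in for whatever property your task actually requires.

```python
import re


def exact_match(expected: str, actual: str) -> bool:
    """Traditional assertion: passes only on an identical string."""
    return expected == actual


def contains_answer(actual: str, answer: str) -> bool:
    """Quality-style assertion: passes if the answer appears as a whole token."""
    return re.search(rf"\b{re.escape(answer)}\b", actual) is not None


expected = "The answer is 42"
rephrased = "The result is 42."
wrong = "The result is 420."

print(exact_match(expected, rephrased))  # False: phrasing changed, nothing regressed
print(contains_answer(rephrased, "42"))  # True: the answer is still there
print(contains_answer(wrong, "42"))      # False: a real regression is still caught
```

The property check tolerates rephrasing but still fails when the answer itself changes, which is the behavior you want from a regression test.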
The Three Things You Actually Want to Test
Before building the infrastructure, be explicit about what you’re evaluating:
Correctness: Did the model produce a response that answers the question or fulfills the task? Not “did it use exactly these words” but “does this response accomplish the goal?”
Format compliance: Is the output in the expected structure? If you need JSON with specific keys, does it return that? If you need a numbered list, is it numbered? This is the easiest category to test with traditional assertions, but AI models break format compliance more often than correctness.
Behavior compliance: Did the agent take the right path to get there? Did it use the expected tools? Did it respect the constraints? A response can be correct and well-formatted while coming from a reasoning path that will fail on different inputs.
Each of these requires a different testing pattern.
Pattern 1: Prompt Snapshot Tests
Snapshot testing for AI outputs means capturing the structure of expected outputs rather than the exact text. When the structure changes, you investigate. When only the phrasing changes, you don’t.
The simplest version: define a schema for your expected output and assert that the actual output matches the schema.
# evals/test_extraction.py
import json
import pytest
from your_agent import run_extraction


def test_extraction_structure():
    result = run_extraction("Extract the key dates from: Meeting on Jan 5, deadline Feb 12")
    data = json.loads(result)

    # Test structure, not content
    assert "dates" in data
    assert isinstance(data["dates"], list)
    assert len(data["dates"]) >= 1

    # Each date has required fields
    for date in data["dates"]:
        assert "date" in date
        assert "context" in date
For outputs that aren’t structured, snapshot the semantic properties instead:
def test_summary_properties():
    summary = run_summarization(LONG_DOCUMENT)

    # Structure tests
    assert len(summary) < len(LONG_DOCUMENT) * 0.3  # actually shorter
    assert len(summary) > 100  # not empty
    assert not summary.startswith("I ")  # not first person
This catches format regressions and catastrophic failures without breaking on every rephrasing.
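One practical wrinkle with structure tests is that models sometimes wrap JSON in a Markdown code fence, which makes a bare `json.loads` fail for a reason unrelated to quality. The sketch below shows one way to tolerate that and to report every structural problem at once rather than stopping at the first failed assert. The helper names `parse_json_output` and `check_dates_payload` are hypothetical, and the expected shape (`dates` entries with `date` and `context`) is assumed from the extraction example.

```python
import json

FENCE = "`" * 3  # three backticks, built programmatically


def parse_json_output(raw: str):
    """Parse model output as JSON, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (with its optional language tag)
        # and the closing fence.
        first_newline = text.find("\n")
        text = text[first_newline + 1 : -len(FENCE)].strip()
    return json.loads(text)


def check_dates_payload(data) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    dates = data.get("dates")
    if not isinstance(dates, list) or not dates:
        return ["'dates' must be a non-empty list"]
    problems = []
    for i, entry in enumerate(dates):
        if not isinstance(entry, dict):
            problems.append(f"dates[{i}] is not an object")
            continue
        for key in ("date", "context"):
            if key not in entry:
                problems.append(f"dates[{i}] is missing '{key}'")
    return problems
```

In a test, `assert check_dates_payload(parse_json_output(result)) == []` gives a failure message that lists exactly which fields went missing.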
Pattern 2: LLM-as-a-Judge in GitHub Actions
For correctness evaluation, you need a judge - a secondary model that evaluates whether the primary model’s output meets quality criteria. This is the standard pattern for AI evaluation, and it works.
The core concept: instead of comparing to a fixed expected string, you give a fast cheap model (Claude Haiku, Gemini Flash) your output plus a rubric, and it returns a score. You fail the CI step if the score falls below your threshold.
Here’s the minimal Python implementation:
# evals/judge.py
import argparse
import json
import sys

import anthropic

# Placeholders: swap in your own agent entry point and sample input
from your_agent import run_your_agent, SAMPLE_DOCUMENT

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_output(task: str, output: str, rubric: str) -> float:
    """Returns a quality score between 0.0 and 1.0"""
    prompt = f"""You are evaluating the quality of an AI system's output.

Task: {task}

Output to evaluate:
{output}

Rubric:
{rubric}

Score this output from 0.0 to 1.0 based on the rubric.
Return ONLY a JSON object with a single key "score" containing the float.
Example: {{"score": 0.85}}"""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        temperature=0,  # keep judge scores as stable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises if the judge ignores the JSON-only instruction, which fails the step
    result = json.loads(response.content[0].text)
    return float(result["score"])


def run_judge_suite(threshold: float = 0.85):
    test_cases = [
        {
            "task": "Summarize a technical document in plain language",
            "input": SAMPLE_DOCUMENT,
            "rubric": "1.0: Clear, accurate, no jargon. 0.5: Accurate but unclear. 0.0: Inaccurate or incomplete.",
        },
        # Add your actual test cases
    ]

    results = []
    for case in test_cases:
        output = run_your_agent(case["input"])
        score = judge_output(case["task"], output, case["rubric"])
        results.append({"score": score, "passed": score >= threshold})
        print(f"Score: {score:.2f} | {'PASS' if score >= threshold else 'FAIL'}")

    failed = [r for r in results if not r["passed"]]
    if failed:
        print(f"\n{len(failed)}/{len(results)} eval cases failed")
        sys.exit(1)  # non-zero exit fails the CI step

    avg_score = sum(r["score"] for r in results) / len(results)
    print(f"\nAll cases passed. Average score: {avg_score:.2f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.85)
    args = parser.parse_args()
    run_judge_suite(args.threshold)
The GitHub Actions workflow:
name: AI Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install anthropic
      - name: Run prompt regression suite
        run: python evals/run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: LLM-as-Judge gate
        run: python evals/judge.py --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
A few things that matter in this implementation:
temperature=0 on judge calls. The judge model’s scores need to be as stable as possible. You won’t get perfect determinism, but temperature=0 gets you close enough that threshold comparisons are reliable.
Use a cheap fast model for the judge. Haiku or Flash - the judge doesn’t need to be your best model. It needs to be consistent and fast. Running expensive judge calls on every PR will either slow your pipeline or cost more than the value it provides.
Write rubrics that describe the extremes. “1.0 means X, 0.0 means Y” gives the judge model a clear reference frame. Vague rubrics produce noisy scores.
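Because temperature=0 still leaves some run-to-run variation, one option is to call the judge a few times and gate on the median, so a single outlier cannot flip the result. The sketch below is an assumption layered on top of the judge above, not part of it: `parse_score` and `stable_score` are hypothetical helpers, and the judge is passed in as a plain callable returning the raw JSON reply, which lets the example run without an API key.

```python
import json
import statistics
from typing import Callable


def parse_score(raw: str) -> float:
    """Extract the 'score' field and check that it lies in [0.0, 1.0]."""
    score = float(json.loads(raw)["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score


def stable_score(judge: Callable[[], str], runs: int = 3) -> float:
    """Call the judge several times and return the median score."""
    return statistics.median(parse_score(judge()) for _ in range(runs))


# Stand-in for real judge calls: three canned replies, one of them an outlier
replies = iter(['{"score": 0.9}', '{"score": 0.6}', '{"score": 0.88}'])
print(stable_score(lambda: next(replies)))  # 0.88
```

This multiplies judge cost by `runs`, so it fits best on the handful of cases that sit close to the threshold.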
Pattern 3: Behavior Gates
The hardest category to test: did the agent do the right thing, not just return the right output?
This matters when your agent uses tools. An agent might return a correct final answer by calling the wrong tools, skipping validations, or following a path that will break on different inputs. String-matching the output misses all of this.
The pattern: log what your agent does, not just what it returns, and assert on the log.
# Your agent wrapper
class TrackedAgent:
    def __init__(self):
        self.tool_calls = []
        self.reasoning_steps = []

    def reset(self):
        self.tool_calls = []
        self.reasoning_steps = []

    def on_tool_call(self, tool_name: str, args: dict):
        # Register this as your framework's tool-call hook
        self.tool_calls.append({"tool": tool_name, "args": args})

    def run(self, task: str) -> str:
        # Placeholder: run your agent loop here, routing every tool
        # invocation through on_tool_call so it lands in the log
        raise NotImplementedError


# In your eval
agent = TrackedAgent()
result = agent.run("Find the user's account and check their subscription status")

# Assert behavior, not just output
tool_names = [c["tool"] for c in agent.tool_calls]
assert "get_user" in tool_names, "Should call get_user"
assert "get_subscription" in tool_names, "Should check subscription"

# Assert ordering when it matters
assert tool_names.index("get_user") < tool_names.index("get_subscription"), \
    "Should get user before checking subscription"
This is more work to set up but catches a category of regression that output testing entirely misses. When you refactor your system prompt and the agent stops checking subscriptions before modifying accounts - that’s a real regression. This is how you catch it.
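Once you have more than a couple of behavior assertions, it can help to pull the checks into small helpers. The sketch below assumes the same log shape as above (a list of dicts with a `tool` key); `called_in_order` and `forbidden_calls` are hypothetical names, and the `search_docs` and `delete_account` tools are made up for the example.

```python
def called_in_order(tool_calls: list[dict], expected: list[str]) -> bool:
    """True if the expected tool names appear in this relative order in the log.

    Other calls may be interleaved; only the relative order is checked.
    """
    names = iter(call["tool"] for call in tool_calls)
    # `name in names` consumes the iterator up to the first match, so each
    # expected name must be found after the previous one.
    return all(name in names for name in expected)


def forbidden_calls(tool_calls: list[dict], forbidden: set[str]) -> list[str]:
    """Return any logged tool names that the agent was not allowed to use."""
    return [call["tool"] for call in tool_calls if call["tool"] in forbidden]


log = [
    {"tool": "get_user", "args": {"id": 7}},
    {"tool": "search_docs", "args": {"query": "billing"}},
    {"tool": "get_subscription", "args": {"user_id": 7}},
]

print(called_in_order(log, ["get_user", "get_subscription"]))  # True
print(called_in_order(log, ["get_subscription", "get_user"]))  # False
print(forbidden_calls(log, {"delete_account"}))                # []
```

Unlike `list.index`, the ordering helper returns False instead of raising when a tool was never called, which keeps the failure message in the assert rather than in a traceback.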
Practical Gotchas
Separate your eval suite from your unit tests. Evals are slow, expensive, and probabilistic. Unit tests are fast, free, and deterministic. Running your full eval suite on every commit will either wreck your CI time budget or cost $200/month in API calls. The practical split: run unit tests on every PR, run evals on merges to main and on a daily schedule. The workflow shown earlier triggers on every push and pull request to keep the example short; narrow its triggers once the suite grows.
Cache expensive eval runs. Hash the prompt plus input. If you’ve run this exact eval recently and the hash matches, skip it and use the cached score. This dramatically cuts costs without sacrificing coverage.
import hashlib
import json


def eval_cache_key(task: str, input_data: str) -> str:
    content = json.dumps({"task": task, "input": input_data}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
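The key above covers only the task and the input. If the prompt under test or the judge model changes, a cache keyed that way would keep returning the old score and hide the regression you are trying to catch. A minimal sketch of a file-backed cache that folds both into the key follows; `cache_key`, `cached_score`, and the JSON file layout are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(prompt: str, input_data: str, judge_model: str) -> str:
    """Hash everything that can change the score: prompt, input, and judge model."""
    payload = json.dumps(
        {"prompt": prompt, "input": input_data, "judge_model": judge_model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_score(cache_file: Path, key: str, compute: Callable[[], float]) -> float:
    """Return the cached score for key, computing and storing it on a miss."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if key not in cache:
        cache[key] = compute()  # the expensive agent + judge call goes here
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

In CI the cache file has to survive between runs, for example by restoring it from the workflow cache or a build artifact; otherwise every run starts cold.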
Re-establish your baseline when you upgrade model versions. Model drift is real. Two versions of the same model, for example Claude Haiku 4.5 and its successor, will score the same output differently against the same rubric. When you upgrade, run your eval suite and update your baseline scores before you start using the threshold as a gate. Otherwise you’ll either have a CI pipeline that fails for no reason or one that passes on degraded quality.
Store eval results over time. A single CI run tells you pass/fail. A history of scores tells you trend. Something that’s scoring 0.91 consistently but is now at 0.87 over three runs is a signal worth investigating even if it’s still above threshold.
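A trend check along those lines can be very small. The sketch below compares the mean of the most recent runs against the mean of everything before them. The function name `drifted` and the defaults (three recent runs, a 0.03 allowed drop) are assumptions to tune against your own score history.

```python
from statistics import mean


def drifted(history: list[float], recent_n: int = 3, max_drop: float = 0.03) -> bool:
    """True if the mean of the last recent_n scores fell more than max_drop
    below the mean of the earlier scores."""
    if len(history) <= recent_n:
        return False  # not enough history to compare against
    baseline = mean(history[:-recent_n])
    recent = mean(history[-recent_n:])
    return baseline - recent > max_drop


scores = [0.91, 0.90, 0.92, 0.91, 0.87, 0.87, 0.88]
print(drifted(scores))  # True: every run is above 0.85, but the trend dropped
```

Reporting this as a warning rather than a hard failure keeps the gate simple while still surfacing slow degradation.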
What This Actually Gets You
The goal isn’t perfect test coverage of AI outputs - that doesn’t exist. The goal is a signal: did something change in a way I should care about?
Prompt snapshot tests catch format regressions and catastrophic failures. LLM-as-a-Judge catches quality degradation. Behavior gates catch reasoning path changes. Together they give you enough coverage to merge AI changes with confidence rather than hope.
The investment is real. Writing good rubrics takes time. Setting up the tracking infrastructure for behavior gates takes time. The payoff is that you can iterate on your prompts and agent logic without manually testing every case after every change - which, at any real development velocity, you won’t actually do.
Build the simplest version first. Start with snapshot tests for format, add LLM-as-a-Judge for one or two high-stakes flows, and add behavior gates only where you’ve actually had regressions. Expand from there as you learn what breaks.