
LLM Judge

The LLM Judge cross-references every AI claim against actual tool outputs. When the AI says “the test passes” but the exit code was 1, that’s a critical contradiction flagged with full evidence.

```sh
sfs audit ses_abc --model gpt-4o
```
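The core idea can be sketched in a few lines. This is an illustrative model of the cross-referencing step, not the tool's internals: `find_contradiction` and its arguments are hypothetical names.

```python
import re

# Hypothetical check: does an AI claim of success conflict with the
# recorded exit code of the command it refers to?
def find_contradiction(claim, exit_code):
    """Return a finding string if the claim contradicts the exit code, else None."""
    says_pass = re.search(r"\btests? pass(es|ed)?\b", claim, re.IGNORECASE)
    if says_pass and exit_code != 0:
        return f"CRITICAL: claim {claim!r} contradicts exit code {exit_code}"
    return None
```

In the real audit, the judge LLM does this matching over every claim in the session transcript rather than a single regex.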

Findings are auto-classified by category:

| Severity | Categories | Example |
| --- | --- | --- |
| CRITICAL | test_result, command_output, dependency | "Test passes" but exit code 1 |
| HIGH | file_existence, data_misread, code_claim | "Created file" but no Write call |
| LOW | other | Ambiguous claims |
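The table above amounts to a category-to-severity lookup. A minimal sketch (the dictionary name and fallback behavior are assumptions, not the tool's internal identifiers):

```python
# Category → severity mapping from the table above; anything
# unrecognized falls back to LOW, matching the "other" row.
SEVERITY_BY_CATEGORY = {
    "test_result": "CRITICAL",
    "command_output": "CRITICAL",
    "dependency": "CRITICAL",
    "file_existence": "HIGH",
    "data_misread": "HIGH",
    "code_claim": "HIGH",
}

def severity(category):
    return SEVERITY_BY_CATEGORY.get(category, "LOW")
```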

Works with any OpenAI-compatible endpoint:

```sh
# LiteLLM
sfs audit ses_abc --base-url https://litellm.internal/v1 --model gpt-4o

# Ollama (no API key needed)
sfs audit ses_abc --base-url http://localhost:11434/v1 --model llama3

# vLLM
sfs audit ses_abc --base-url http://gpu-server:8000/v1 --model my-model
```
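"OpenAI-compatible" means the server accepts the standard chat-completions request body at `POST {base_url}/chat/completions`. A sketch of that payload shape (the prompt text is illustrative):

```python
import json

# The request body every OpenAI-compatible endpoint expects.
# Only the endpoint URL and model name change between providers.
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You are an audit judge."},
        {"role": "user", "content": "Claim: 'the test passes'. Observed exit code: 1. Contradiction?"},
    ],
    "temperature": 0,
}
body = json.dumps(payload)
```

This is why swapping LiteLLM, Ollama, or vLLM needs only `--base-url` and `--model`: the wire format is identical.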

Configure automatic auditing in the dashboard Settings or via CLI:

```sh
sfs config set audit.trigger on_sync   # Audit after every push
sfs config set audit.trigger on_pr     # Audit when a PR/MR is opened
sfs config set audit.trigger manual    # Only when you run sfs audit
```
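The trigger semantics can be summarized as a small dispatch rule. This is a sketch of the behavior described above; the function and event names are assumptions, not `sfs` internals:

```python
# The three trigger values from the config above. "manual" never
# fires automatically; the others fire only on their matching event.
TRIGGERS = {"on_sync", "on_pr", "manual"}

def should_audit(configured, event):
    """Decide whether an incoming event should start an audit."""
    if configured not in TRIGGERS:
        raise ValueError(f"unknown trigger: {configured}")
    return configured != "manual" and configured == event
```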

Consensus mode runs three independent judge passes and reports only the findings that at least two passes agree on:

```sh
sfs audit ses_abc --consensus   # 3x cost, higher confidence
```
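The 2-of-3 filter is simple majority voting over findings. A sketch, assuming findings can be compared for equality across passes:

```python
from collections import Counter

def consensus(passes, threshold=2):
    """Keep findings reported by at least `threshold` of the judge passes.

    `passes` is a list of finding lists, one per pass; duplicates
    within a single pass are counted once.
    """
    counts = Counter(f for findings in passes for f in set(findings))
    return [f for f, n in counts.items() if n >= threshold]
```

Requiring agreement between independent passes trades 3x token cost for fewer spurious findings, since a hallucinated finding is unlikely to recur across runs.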
Export results in the format you need:

```sh
sfs audit ses_abc --format json       # JSON
sfs audit ses_abc --format markdown   # Markdown report
sfs audit ses_abc --format csv        # CSV table
```
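The JSON format is the natural choice for CI gating. A sketch of filtering a report for critical findings; the field names (`findings`, `severity`, `category`) are hypothetical, so check them against your actual `--format json` output:

```python
import json

# Hypothetical --format json shape; field names are illustrative only.
report = json.loads("""
{"findings": [
  {"severity": "CRITICAL", "category": "test_result"},
  {"severity": "LOW", "category": "other"}
]}
""")

# Fail a CI step if any critical contradiction was found.
critical = [f for f in report["findings"] if f["severity"] == "CRITICAL"]
```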