StillMe Case Study Template (Before vs After Validation)

Use this template to produce audit-grade evidence for StillMe Lite performance.

Goal: show measurable improvement after enabling verification behavior.

1) Case Study Metadata

Case ID:
Date:
Owner:
Domain (research/support/policy/other):
Language:
LLM model:
Retrieval setup:
StillMe Lite mode (monitor|warn|enforce):
Policy file version:

2) Test Scope

Total prompts:
Prompt set source:
Prompt categories:
- Factual query
- Source-required query
- Ambiguous query
- Adversarial/prompt-injection

Rules:

Keep prompt set fixed between before/after runs.
Use same model and retrieval settings for fair comparison.
Store raw runs in JSONL for reproducibility.

3) Experimental Setup

3.1 Before (Baseline)

Validation disabled or bypassed.
Record model outputs and available context.

3.2 After (StillMe Enabled)

Validation enabled with selected mode.
Record decision, reason codes, and safe response.

3.3 Data Files

data/before_<case_id>.jsonl
data/after_<case_id>.jsonl
reports/<case_id>_summary.md

4) Mandatory Metrics

4.1 Hallucination Escape Rate

Definition: unsupported_factual_answers_passed / total_high_risk_factual_prompts

Interpretation:

Lower is better.
This is the primary risk metric.

4.2 Refusal Precision

Definition: correct_refusals / total_refusals

Interpretation:

Higher is better.
Measures whether refusals are justified instead of over-blocking.

4.3 Source Coverage

Definition: factual_answers_with_valid_citation / total_factual_answers

Interpretation:

Higher is better.
Measures evidence grounding quality.

5) Optional Supporting Metrics

False refusal rate
Clarification usefulness rate
Mean validator confidence band distribution
Decision latency delta (before vs after)

6) Output Tables

6.1 Core Metrics Table

Metric	Before	After	Delta	Target
Hallucination escape rate				lower
Refusal precision				>= 0.85
Source coverage				>= 0.80

6.2 Decision Distribution

Decision	Before	After
answer
refuse
ask_clarify

7) Error Analysis (Required)

Provide 5-10 representative examples:

Escaped hallucination (before)
- Prompt:
- Baseline answer:
- Why unsafe:
Correct refusal (after)
- Prompt:
- StillMe decision/reason:
- Why correct:
False refusal (after, if any)
- Prompt:
- Reason code:
- Fix candidate:
Citation quality improvement example
- Prompt:
- Before citation state:
- After citation state:

8) Go / No-Go Rule

Recommended internal gate for moving from monitor to warn:

Hallucination escape rate reduced by at least 50% vs before
Refusal precision >= 0.85
Source coverage >= 0.80 on factual subset

Recommended gate for moving from warn to enforce:

Metrics stable for 2 consecutive runs
No critical incident in sampled production traffic

9) Risk and Limitation Notes

Known blind spots:
Data quality issues:
Retrieval mismatch observations:
Policy thresholds that need tuning:

Always include limitations. Do not claim universal safety guarantees.

10) Short Public Summary (Optional)

Use this 4-line format for release notes:

What was tested (scope and sample size)
What improved (with numbers)
What did not improve yet
Next action for the next iteration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StillMe Case Study Template (Before vs After Validation)

1) Case Study Metadata

2) Test Scope

3) Experimental Setup

3.1 Before (Baseline)

3.2 After (StillMe Enabled)

3.3 Data Files

4) Mandatory Metrics

4.1 Hallucination Escape Rate

4.2 Refusal Precision

4.3 Source Coverage

5) Optional Supporting Metrics

6) Output Tables

6.1 Core Metrics Table

6.2 Decision Distribution

7) Error Analysis (Required)

8) Go / No-Go Rule

9) Risk and Limitation Notes

10) Short Public Summary (Optional)

FilesExpand file tree

CASE_STUDY_TEMPLATE.md

Latest commit

History

CASE_STUDY_TEMPLATE.md

File metadata and controls

StillMe Case Study Template (Before vs After Validation)

1) Case Study Metadata

2) Test Scope

3) Experimental Setup

3.1 Before (Baseline)

3.2 After (StillMe Enabled)

3.3 Data Files

4) Mandatory Metrics

4.1 Hallucination Escape Rate

4.2 Refusal Precision

4.3 Source Coverage

5) Optional Supporting Metrics

6) Output Tables

6.1 Core Metrics Table

6.2 Decision Distribution

7) Error Analysis (Required)

8) Go / No-Go Rule

9) Risk and Limitation Notes

10) Short Public Summary (Optional)