Commit 90f8278

feat: add production realism pack for schema-lens v0.1.1
1 parent 3567d6c commit 90f8278

37 files changed

Lines changed: 1890 additions & 42 deletions

README.md

Lines changed: 52 additions & 1 deletion
@@ -2,7 +2,7 @@
 
 Schema Lens is a Solr schema evolution impact simulator. It compares baseline vs shadow relevance before you ship schema/query changes.
 
-Target first release: `v0.1.0`.
+Target release: `v0.1.1`.
 
 ## Why
 
@@ -11,13 +11,20 @@ Schema and query parameter changes can silently degrade ranking quality. `schema
 ## Features (v0.1)
 
 - CLI commands: `validate`, `inspect`, `shadow create`, `shadow index`, `replay`, `compare`, `report`, `run`
+- Additional commands: `queries extract`, `docs sample`, `gate`
 - SolrCloud-first shadow collection lifecycle via Collections API
 - Schema change operations:
   - `schema.field.update`
   - `schema.fieldType.replace`
   - `schema.analyzer.remove_filter`
   - `queryparams.set`
 - Query replay and metrics: overlap@K, jaccard@K, kendall tau@K
+- Schema preflight dependency/risk report: `schema_risk.json`
+- Optional production realism sources:
+  - Queries from logs (`queries.source.type=log`)
+  - Docs sampled directly from Solr (`data.docs_source.type=solr`)
+- Optional structured explain capture (`evaluation.explain.structured=true`)
+- Quality gate policy command for CI (`schema-lens gate`)
 - Reproducible outputs: `run_manifest.json`, `replay.json`, `compare.json`, `report.json`, `report.html`
 
 ## Quickstart
@@ -50,6 +57,50 @@ schema-lens run examples/changesets/fieldtype-change.yaml --out out/demo
 cat out/demo/report.json
 ```
 
+## Run Against Production Safely
+
+Use the production realism example:
+
+```bash
+schema-lens run examples/changesets/prod_realism_example.yaml --out out/prod_like_run
+```
+
+Extract canonical replay queries from logs:
+
+```bash
+schema-lens queries extract \
+  --from examples/logs/solr_requests.log \
+  --out out/queries_extracted.jsonl \
+  --max 500 \
+  --sample reservoir
+```
+
+Sample docs from Solr:
+
+```bash
+schema-lens docs sample \
+  --solr-url http://localhost:8983/solr \
+  --collection products \
+  --mode cursormark \
+  --query "*:*" \
+  --fl "id,title,text,category,price" \
+  --sample-n 5000 \
+  --out out/docs_sample.jsonl
+```
+
+Apply CI relevance gate:
+
+```bash
+schema-lens gate \
+  --compare out/prod_like_run/compare.json \
+  --policy examples/policy/gate_default.yaml
+```
+
+Warnings:
+- Keep sampled docs lean (`fl`) and bounded (`sample_n`).
+- Keep sanitization enabled when replaying from real logs.
+- Review `schema_risk.json` before approving schema rollout.
+
 ## Example command
 
 ```bash

docs/changeset-spec.md

Lines changed: 35 additions & 5 deletions
@@ -26,18 +26,43 @@ shadow:
 
 data:
   docs_source:
-    type: "file"
-    path: "examples/docs.jsonl"
+    type: "file" # "file" | "solr"
+    path: "examples/docs.jsonl" # required for type=file
     format: "jsonl"
     id_field: "id"
+    # solr source options (type=solr):
+    # solr_url: "http://localhost:8983/solr"
+    # collection: "products"
+    # mode: "export" # "export" | "cursormark"
+    # query: "*:*"
+    # sort: "id asc"
+    # fl: "id,title,text,category"
+    # sample_n: 50000
+    # batch_size: 500
+    # out_sample_path: "out/docs_sample.jsonl"
   sample_n: 50000
 
 queries:
   source:
-    type: "file"
+    type: "file" # "file" | "log"
     path: "examples/queries.txt"
-    format: "simple"
+    format: "simple" # file: "simple" | "jsonl", log: "solr_params" | "jsonl"
   max_queries: 2000
+  sampling:
+    mode: "reservoir" # "top" | "reservoir"
+    seed: 42
+  sanitize:
+    enabled: true
+    rules:
+      - type: "mask_email"
+      - type: "mask_uuid"
+      - type: "drop_param"
+        name: "token"
+      - type: "drop_param"
+        name: "auth"
+
+preflight:
+  fail_on_risk: false
 
 changes:
   - op: "schema.field.update"
@@ -64,6 +89,7 @@ evaluation:
     - kendall_tau
   explain:
     enabled: true
+    structured: false
     max_queries: 25
     max_docs_per_query: 3
 ```
@@ -72,7 +98,8 @@ evaluation:
 
 - `baseline.solr_url`
 - `baseline.collection`
-- `data.docs_source.path`
+- `data.docs_source.path` when `data.docs_source.type=file`
+- `data.docs_source.solr_url` and `data.docs_source.collection` when `data.docs_source.type=solr`
 - `queries.source.path`
 
 ## Supported operations
78105
## Supported operations
@@ -85,6 +112,9 @@ evaluation:
 ## Notes
 
 - `queryparams.set` affects replay parameters only.
+- `queries.source.type=log` enables log extraction + canonical JSONL replay generation.
+- `data.docs_source.type=solr` samples docs from Solr and writes reproducible JSONL output.
+- Preflight always emits `schema_risk.json`; set `preflight.fail_on_risk=true` to block execution on HIGH risks.
 - `schema.analyzer.remove_filter.filter_class` can be a Java class (for example `solr.LowerCaseFilterFactory`) or the short filter name (`lowercase`).
 - `shadow.allow_shared_configset_fallback=true` allows a non-isolated fallback when Solr blocks configset clone operations (401 on trusted base configsets). This is explicit and can affect baseline behavior.
 - Empty `changes` is allowed with a warning.
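
The spec names three sanitize rule types (`mask_email`, `mask_uuid`, `drop_param`) but this diff does not show how they are applied. The following is only a minimal sketch of what such rules might do to one query's parameters. The `sanitize_params` helper, the regexes, and the `<email>` / `<uuid>` placeholders are all assumptions for illustration, not the schema-lens implementation.

```python
import re

# Hypothetical patterns; the real schema-lens rules may match differently.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def sanitize_params(params: dict[str, str], rules: list[dict]) -> dict[str, str]:
    """Apply rules shaped like the YAML `rules` list to one query's params."""
    out = dict(params)
    for rule in rules:
        kind = rule.get("type")
        if kind == "drop_param":
            # Remove the named parameter entirely (for example a token).
            out.pop(rule["name"], None)
        elif kind == "mask_email":
            out = {k: EMAIL_RE.sub("<email>", v) for k, v in out.items()}
        elif kind == "mask_uuid":
            out = {k: UUID_RE.sub("<uuid>", v) for k, v in out.items()}
    return out


rules = [
    {"type": "mask_email"},
    {"type": "mask_uuid"},
    {"type": "drop_param", "name": "token"},
]
raw = {
    "q": "owner:jane@example.com",
    "fq": "id:123e4567-e89b-12d3-a456-426614174000",
    "token": "s3cr3t",
}
clean = sanitize_params(raw, rules)
```

In this sketch the rules run in list order, so a dropped parameter is removed regardless of what the masking rules did to its value.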
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+schema_lens_version: 1
+
+baseline:
+  solr_url: "http://localhost:8983/solr"
+  collection: "products"
+  request_defaults:
+    rows: 10
+    fl: "id,score"
+    defType: "edismax"
+    extra_params:
+      fq: []
+      qf: "title^3 text"
+      pf: "title^10"
+
+shadow:
+  mode: "solrcloud"
+  solr_url: "http://localhost:8983/solr"
+  collection_name_template: "{collection}__shadow__{ts}"
+  num_shards: 1
+  replication_factor: 1
+  cleanup: true
+  allow_shared_configset_fallback: true
+
+preflight:
+  fail_on_risk: false
+
+data:
+  docs_source:
+    type: "solr"
+    solr_url: "http://localhost:8983/solr"
+    collection: "products"
+    mode: "cursormark"
+    query: "*:*"
+    sort: "id asc"
+    fl: "id,title,text,category,price"
+    sample_n: 200
+    batch_size: 50
+    out_sample_path: "out/docs_sample.jsonl"
+
+queries:
+  source:
+    type: "log"
+    path: "examples/logs/solr_requests.log"
+    format: "solr_params"
+  max_queries: 100
+  sampling:
+    mode: "reservoir"
+    seed: 42
+  sanitize:
+    enabled: true
+    rules:
+      - type: "mask_email"
+      - type: "mask_uuid"
+      - type: "drop_param"
+        name: "auth"
+      - type: "drop_param"
+        name: "token"
+
+changes:
+  - op: "schema.field.update"
+    field: "title"
+    set:
+      stored: true
+  - op: "queryparams.set"
+    set:
+      qf: "title^5 text"
+      pf: "title^20"
+
+evaluation:
+  k: 10
+  metrics:
+    - overlap
+    - jaccard
+    - kendall_tau
+  explain:
+    enabled: true
+    structured: true
+    max_queries: 5
+    max_docs_per_query: 2
+
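
The `evaluation.metrics` entries above (`overlap`, `jaccard`, `kendall_tau`) compare the baseline and shadow top-K result lists for each query. The commit does not show their formulas, so the definitions below are assumptions: overlap as shared IDs divided by K, Jaccard as intersection over union, and Kendall tau computed only over documents that appear in both lists. The function names are hypothetical, and the sketch assumes each result list contains no duplicate IDs.

```python
from itertools import combinations


def overlap_at_k(a: list[str], b: list[str], k: int) -> float:
    """Shared IDs in the two top-K lists, divided by K."""
    return len(set(a[:k]) & set(b[:k])) / k


def jaccard_at_k(a: list[str], b: list[str], k: int) -> float:
    """Intersection over union of the two top-K ID sets."""
    sa, sb = set(a[:k]), set(b[:k])
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def kendall_tau_at_k(a: list[str], b: list[str], k: int) -> float:
    """Kendall tau over the documents present in both top-K lists."""
    rank_b = {doc: i for i, doc in enumerate(b[:k])}
    shared = [doc for doc in a[:k] if doc in rank_b]  # kept in a's order
    pairs = list(combinations(shared, 2))
    if not pairs:
        # Assumption: fewer than two shared docs counts as no disagreement.
        return 1.0
    concordant = sum(1 for x, y in pairs if rank_b[x] < rank_b[y])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


baseline = ["d1", "d2", "d3", "d4"]
shadow = ["d2", "d1", "d3", "d5"]

overlap = overlap_at_k(baseline, shadow, k=4)
jaccard = jaccard_at_k(baseline, shadow, k=4)
tau = kendall_tau_at_k(baseline, shadow, k=4)
```

With these two lists, three of four IDs are shared and one of the three shared pairs swaps order, so overlap is 0.75, Jaccard is 0.6, and tau is 1/3.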

examples/logs/solr_requests.log

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+q=bearings&fq=category:tools&defType=edismax&qf=title%5E3%20text&rows=20
+q=pipes&fq=price:%5B10%20TO%2050%5D&sort=price%20asc
+/browse/select?q=bolts&fq=tenant:acme&mm=2<75%25&pf=title%5E10&debug=false
+
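
The sample log mixes bare query strings with a line that carries a request path. A rough sketch of how one `solr_params` line could be decoded with the standard library follows; `parse_solr_params_line` is a hypothetical name, and the rule for detecting a leading path (text before `?` that contains no `=`) is an assumption rather than the extractor's actual behavior.

```python
from urllib.parse import parse_qsl


def parse_solr_params_line(line: str) -> dict[str, list[str]]:
    """Decode one log line into Solr params, keeping repeated keys as lists."""
    text = line.strip()
    # Drop a leading request path such as "/browse/select?" when present.
    head, sep, tail = text.partition("?")
    if sep and "=" not in head:
        text = tail
    params: dict[str, list[str]] = {}
    for key, value in parse_qsl(text, keep_blank_values=True):
        params.setdefault(key, []).append(value)
    return params


first = parse_solr_params_line(
    "q=bearings&fq=category:tools&defType=edismax&qf=title%5E3%20text&rows=20"
)
second = parse_solr_params_line(
    "q=pipes&fq=price:%5B10%20TO%2050%5D&sort=price%20asc"
)
third = parse_solr_params_line(
    "/browse/select?q=bolts&fq=tenant:acme&mm=2<75%25&pf=title%5E10&debug=false"
)
```

Values are kept as lists because Solr parameters such as `fq` may legitimately repeat within one request.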

examples/policy/gate_default.yaml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+version: 1
+k: 10
+
+fail:
+  - metric: "avg_overlap"
+    op: "<"
+    value: 0.80
+  - metric: "pct_high_risk_queries"
+    op: ">"
+    value: 5.0
+  - metric: "pct_queries_overlap_lt"
+    args:
+      threshold: 0.60
+    op: ">"
+    value: 10.0
+
+warn:
+  - metric: "pct_med_risk_queries"
+    op: ">"
+    value: 20.0
+
+golden_queries:
+  enabled: true
+  file: "../queries/golden.jsonl"
+  requirements:
+    must_contain_topk: 10
+    max_missing_pct: 0.0
+
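
Each `fail` and `warn` entry in this policy pairs a metric name with a comparison operator and a threshold. One way such rules could be evaluated against a summary of observed metrics is sketched below. The `triggered` helper, the flat `observed` dictionary, the choice to skip metrics that are absent, and the `<=` / `>=` operators are assumptions; the policy file itself only uses `<` and `>`, and the gate command's real logic is not part of this diff.

```python
import operator

# Only "<" and ">" appear in the policy file; "<=" and ">=" are assumed extras.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def triggered(rules: list[dict], metrics: dict[str, float]) -> list[dict]:
    """Return the rules whose condition holds for the observed metrics."""
    hits = []
    for rule in rules:
        observed_value = metrics.get(rule["metric"])
        if observed_value is None:
            continue  # assumption: metrics missing from the summary are skipped
        if OPS[rule["op"]](observed_value, rule["value"]):
            hits.append(rule)
    return hits


policy = {
    "fail": [
        {"metric": "avg_overlap", "op": "<", "value": 0.80},
        {"metric": "pct_high_risk_queries", "op": ">", "value": 5.0},
    ],
    "warn": [{"metric": "pct_med_risk_queries", "op": ">", "value": 20.0}],
}
observed = {
    "avg_overlap": 0.74,
    "pct_high_risk_queries": 2.0,
    "pct_med_risk_queries": 25.0,
}

failures = triggered(policy["fail"], observed)
warnings = triggered(policy["warn"], observed)
exit_code = 1 if failures else 0  # a non-zero exit would fail the CI job
```

Here the average overlap of 0.74 falls below 0.80, so one fail rule fires and the sketch exits non-zero, while the medium-risk percentage only produces a warning.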

examples/queries/golden.jsonl

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{"name":"supplier exact","params":{"q":"SupplierId:1001247","defType":"lucene"},"expected_ids":["1001247"]}
+{"name":"category head","params":{"q":"bearings","defType":"edismax","qf":"title^3 text"},"expected_ids":["B-001","B-019"]}
+
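
Each golden entry lists the document IDs a query is expected to return. The diff does not define how `must_contain_topk` and `max_missing_pct` are interpreted, so the sketch below assumes one plausible reading: a golden query counts as missing when any of its `expected_ids` is absent from the top-K results, and the gate compares the percentage of missing queries against `max_missing_pct`. The `missing_pct` helper and the `shadow_results` mapping are hypothetical.

```python
import json


def missing_pct(golden_lines: list[str], results_by_name: dict[str, list[str]], k: int) -> float:
    """Percent of golden queries with an expected ID absent from the top-K."""
    total = 0
    missing = 0
    for line in golden_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(line)
        top_k = set(results_by_name.get(entry["name"], [])[:k])
        total += 1
        if not set(entry["expected_ids"]) <= top_k:
            missing += 1
    return 100.0 * missing / total if total else 0.0


golden = [
    '{"name":"supplier exact","params":{"q":"SupplierId:1001247"},"expected_ids":["1001247"]}',
    '{"name":"category head","params":{"q":"bearings"},"expected_ids":["B-001","B-019"]}',
]
# Hypothetical shadow results keyed by golden query name.
shadow_results = {
    "supplier exact": ["1001247", "1001300"],
    "category head": ["B-001", "B-007", "B-003"],
}
pct = missing_pct(golden, shadow_results, k=10)
passes = pct <= 0.0  # compare against max_missing_pct from the policy
```

In this example the second query lacks `B-019`, so half of the golden queries are missing and a `max_missing_pct` of 0.0 would not be met.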

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "schema-lens"
-version = "0.1.0"
+version = "0.1.1"
 description = "Schema Evolution & Impact Simulator for Solr"
 readme = "README.md"
 requires-python = ">=3.11"

schema_lens/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 """schema-lens package."""
 
-__version__ = "0.1.0"
+__version__ = "0.1.1"

schema_lens/changesets/validator.py

Lines changed: 46 additions & 11 deletions
@@ -52,16 +52,52 @@ def validate_changeset(changeset: Changeset, check_paths: bool = True) -> Valida
     if version not in (None, 1):
         report.errors.append(f"Unsupported schema_lens_version: {version}")
 
-    required = [
-        "baseline.solr_url",
-        "baseline.collection",
-        "data.docs_source.path",
-        "queries.source.path",
-    ]
+    required = ["baseline.solr_url", "baseline.collection"]
     for key in required:
         if _get_in(raw, key) in (None, ""):
             report.errors.append(f"Missing required field: {key}")
 
+    docs_source = _get_in(raw, "data.docs_source") or {}
+    if not isinstance(docs_source, dict):
+        report.errors.append("data.docs_source must be an object")
+        docs_source = {}
+    docs_source_type = str(docs_source.get("type", "file"))
+    if docs_source_type not in {"file", "solr"}:
+        report.errors.append("data.docs_source.type must be 'file' or 'solr'")
+    if docs_source_type == "file":
+        if not docs_source.get("path"):
+            report.errors.append("Missing required field: data.docs_source.path")
+    else:
+        for key in ("solr_url", "collection"):
+            if not docs_source.get(key):
+                report.errors.append(f"Missing required field: data.docs_source.{key}")
+        mode = docs_source.get("mode")
+        if mode and mode not in {"export", "cursormark"}:
+            report.errors.append("data.docs_source.mode must be 'export' or 'cursormark'")
+
+    query_source = _get_in(raw, "queries.source") or {}
+    if not isinstance(query_source, dict):
+        report.errors.append("queries.source must be an object")
+        query_source = {}
+    query_source_type = str(query_source.get("type", "file"))
+    if query_source_type not in {"file", "log"}:
+        report.errors.append("queries.source.type must be 'file' or 'log'")
+    if not query_source.get("path"):
+        report.errors.append("Missing required field: queries.source.path")
+
+    if query_source_type == "log":
+        fmt = str(query_source.get("format", "solr_params"))
+        if fmt not in {"solr_params", "jsonl"}:
+            report.errors.append("queries.source.format must be 'solr_params' or 'jsonl'")
+
+    sampling_mode = _get_in(raw, "queries.sampling.mode")
+    if sampling_mode is not None and sampling_mode not in {"top", "reservoir"}:
+        report.errors.append("queries.sampling.mode must be 'top' or 'reservoir'")
+
+    preflight_fail = _get_in(raw, "preflight.fail_on_risk")
+    if preflight_fail is not None and not isinstance(preflight_fail, bool):
+        report.errors.append("preflight.fail_on_risk must be boolean")
+
     changes = raw.get("changes", [])
     if not isinstance(changes, list):
         report.errors.append("changes must be a list")
@@ -103,12 +139,11 @@ def validate_changeset(changeset: Changeset, check_paths: bool = True) -> Valida
                 report.errors.append(f"{loc}.set must be an object")
 
     if check_paths:
-        docs_path = _get_in(raw, "data.docs_source.path")
+        docs_path = _get_in(raw, "data.docs_source.path") if docs_source_type == "file" else None
         queries_path = _get_in(raw, "queries.source.path")
-        path_entries = (
-            ("data.docs_source.path", docs_path),
-            ("queries.source.path", queries_path),
-        )
+        path_entries = [("queries.source.path", queries_path)]
+        if docs_path is not None:
+            path_entries.append(("data.docs_source.path", docs_path))
         for label, p in path_entries:
             if isinstance(p, str):
                 fp = _resolve_input_path(changeset.path, p)
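
The validator relies on a `_get_in` helper whose body is outside this diff. Judging only from the call sites (`_get_in(raw, "queries.sampling.mode")` returning `None` when a key is absent), it appears to resolve a dotted path through nested dictionaries. A minimal sketch of that idea follows; the name `get_in`, the `default` parameter, and the handling of non-dict intermediate values are assumptions, not the project's code.

```python
from typing import Any


def get_in(data: Any, dotted_key: str, default: Any = None) -> Any:
    """Walk nested dicts along a dotted path; return default on any miss."""
    current = data
    for part in dotted_key.split("."):
        # Stop as soon as the path leaves dict territory or a key is absent.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


raw = {
    "queries": {"sampling": {"mode": "reservoir"}},
    "data": "not-a-mapping",
}

mode = get_in(raw, "queries.sampling.mode")
docs_source = get_in(raw, "data.docs_source")
fail_on_risk = get_in(raw, "preflight.fail_on_risk")
```

Returning a default instead of raising lets a caller treat a missing optional section and a missing key the same way, which matches how the validator checks `is not None` before validating optional fields.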
