feat: add configurable ParquetMergePolicyConfig to index settings (#6362)
* feat: add configurable ParquetMergePolicyConfig to index settings
Adds `parquet_merge_policy` section to `IndexingSettings`, making the
Parquet merge policy configurable per-index via YAML. Parameters:
- merge_factor (default 10): min splits to trigger a merge
- max_merge_factor (default 12): max splits per merge
- max_merge_ops (default 4): bounds write amplification
- target_split_size_bytes (default 256 MiB): target output size
- maturation_period (default 48h): split maturity timeout
- max_finalize_merge_operations (default 3): cold-window shutdown limit
Mirrors the existing merge_policy config pattern for logs/traces.
Updates index-config.md documentation with the new section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add ParquetIndexingConfig with sort_fields and window_duration_secs
Adds `parquet_indexing` section to `IndexingSettings` for per-index
Parquet pipeline configuration:
- `sort_fields`: sort schema override (Husky-style pipe-delimited
syntax with /V2 suffix). Controls row ordering, query pruning,
compression locality, and compaction scope. When omitted, uses
the product-type default.
- `window_duration_secs`: time window for split partitioning
(default 900s / 15 min). Must divide 3600.
Updates docs/configuration/index-config.md with:
- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction
overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: update indexing service fingerprint constants and nightly fmt
Adding ParquetMergePolicyConfig and ParquetIndexingConfig to
IndexingSettings changes the Hash output, which changes the pipeline
params fingerprints. Updated the hardcoded test constants.
Added a comment explaining how to recompute them when IndexingSettings
fields change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/configuration/index-config.md
+83 -1 (83 additions & 1 deletion)
@@ -594,7 +594,9 @@ This section describes indexing settings for a given index.
 | ------------- | ------------- | ------------- |
 |`commit_timeout_secs`| Maximum number of seconds before committing a split since its creation. |`60`|
 |`split_num_docs_target`| Target number of docs per split. |`10000000`|
-|`merge_policy`| Describes the strategy used to trigger split merge operations (see [Merge policies](#merge-policies) section below). |
+|`merge_policy`| Describes the strategy used to trigger split merge operations for logs/traces (see [Merge policies](#merge-policies) section below). |
+|`parquet_merge_policy`| Describes the merge policy for Parquet (metrics/sketches) splits (see [Parquet merge policy](#parquet-merge-policy) section below). |
 |`resources.heap_size`| Indexer heap size per source per index. |`2000000000`|
 |`docstore_compression_level`| Level of compression used by zstd for the docstore. Lower values may increase ingest speed, at the cost of index size |`8`|
 |`docstore_blocksize`| Size of blocks in the docstore, in bytes. Lower values may improve doc retrieval speed, at the cost of index size |`1000000`|
@@ -687,6 +689,86 @@ indexing_settings:
     type: "no_merge"
 ```
 
+### Parquet indexing settings
+
+*For indexes using the Parquet indexing pipeline (metrics, sketches).*
+
+These settings control how the Parquet pipeline sorts, windows, and writes incoming data. They affect both ingest-time performance and downstream query/compaction efficiency.
+| `sort_fields` | Sort schema for row ordering in Parquet files (see syntax below). When omitted, the product-type default is used. | `metric_name\|service\|env\|datacenter\|region\|host\|timeseries_id\|timestamp_secs/V2` |
+| `window_duration_secs` | Time window duration in seconds for split partitioning. Must evenly divide 3600. Larger values produce fewer splits but coarser time pruning. | `900` (15 minutes) |
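
The "must evenly divide 3600" constraint on `window_duration_secs` can be illustrated with a minimal Python sketch. The function names are hypothetical, and the epoch-aligned window computation is an assumption about how windows are laid out, not something the diff states.

```python
def validate_window_duration(window_duration_secs: int) -> int:
    """Return the value unchanged if it is positive and evenly divides 3600."""
    if window_duration_secs <= 0 or 3600 % window_duration_secs != 0:
        raise ValueError(
            f"window_duration_secs={window_duration_secs} must evenly divide 3600"
        )
    return window_duration_secs


def window_start(timestamp_secs: int, window_duration_secs: int = 900) -> int:
    """Start of the window containing the timestamp.

    Assumption: windows are aligned to the Unix epoch. Under that assumption,
    a duration that divides 3600 never produces a window straddling an hour.
    """
    validate_window_duration(window_duration_secs)
    return timestamp_secs - timestamp_secs % window_duration_secs
```

With the default of 900 seconds, a timestamp of 4501 falls in the window starting at 4500; a value such as 700 is rejected because 3600 is not a multiple of it.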
+
+#### Sort schema syntax
+
+The sort schema uses pipe-delimited column names with a `/V2` version suffix:
+
+```text
+column1|column2|...|timestamp_secs/V2
+```
+
+**Column types** are inferred from name suffixes:
+- `__s` → string (e.g., `custom_tag__s`)
+- `__i` → int64 (e.g., `priority__i`)
+- Well-known names like `metric_name`, `service`, `env`, `host`, `timestamp_secs`, and `timeseries_id` have built-in type mappings and don't need suffixes.
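
The suffix rules above can be sketched as a small lookup. This is an illustrative Python sketch, not the pipeline's implementation: `infer_column_type` is a hypothetical name, the well-known set only contains the names listed above (the real set may be larger), and rejecting unrecognized names is an assumption.

```python
# Only the names mentioned in the docs; the actual built-in list may be longer.
WELL_KNOWN_COLUMNS = {
    "metric_name", "service", "env", "host", "timestamp_secs", "timeseries_id",
}

# Suffix -> type name, as documented in the list above.
SUFFIX_TYPES = {"__s": "string", "__i": "int64"}


def infer_column_type(name: str) -> str:
    """Return the type implied by a sort column name."""
    if name in WELL_KNOWN_COLUMNS:
        # The docs do not spell out each built-in mapping, so none is guessed here.
        return "built-in"
    for suffix, type_name in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return type_name
    # Assumption: a name with no known suffix and no built-in mapping is an error.
    raise ValueError(f"cannot infer type for column {name!r}")
```

For example, `infer_column_type("custom_tag__s")` yields `"string"` and `infer_column_type("priority__i")` yields `"int64"`.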
+
+**Sort direction** defaults to ascending for most columns and descending for timestamp columns. Override with `+` (ascending) or `-` (descending) as a prefix or suffix on the column name:
+- **Query pruning**: queries filtering on leading columns (e.g., `metric_name`) can skip entire splits whose row key ranges don't match.
+- **Compression**: grouping similar values together (e.g., all rows for the same metric name) improves columnar compression ratios.
+- **Compaction scope**: splits with different sort schemas are never merged together. Changing the sort schema on an existing index creates a new compaction scope; old splits are not re-sorted.
+
+**The `&` marker** (advanced) sets the LSM comparison cutoff: columns after `&` are used for sort order but not for compaction locality decisions. For example, `metric_name|&host|timestamp_secs/V2` sorts by metric_name then host, but only metric_name determines which splits can be merged.
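
To make the `&` cutoff concrete, here is a hedged Python sketch that splits a schema string into the full sort order and the prefix used for compaction locality. `parse_sort_schema` is a hypothetical helper; it ignores `+`/`-` direction markers, and treating a schema without `&` as "all columns count for locality" is an assumption.

```python
def parse_sort_schema(schema: str) -> tuple[list[str], list[str]]:
    """Return (sort_columns, lsm_columns) for a `/V2` sort schema string."""
    body, sep, version = schema.rpartition("/")
    if not sep or version != "V2":
        raise ValueError("sort schema must end with the /V2 version suffix")

    sort_columns: list[str] = []
    cutoff = None  # index of the first column excluded from LSM comparison
    for raw in body.split("|"):
        if raw.startswith("&"):
            if cutoff is not None:
                raise ValueError("the & marker may appear only once")
            cutoff = len(sort_columns)
            raw = raw[1:]
        if not raw:
            raise ValueError("empty column name in sort schema")
        sort_columns.append(raw)

    # Assumption: with no & marker, every column takes part in locality decisions.
    if cutoff is None:
        cutoff = len(sort_columns)
    return sort_columns, sort_columns[:cutoff]
```

Applied to the example above, `parse_sort_schema("metric_name|&host|timestamp_secs/V2")` keeps all three columns in the sort order while only `metric_name` remains in the locality prefix.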
+
+#### Parquet merge policy
+
+*For indexes using the Parquet indexing pipeline (metrics, sketches).*
+
+The Parquet merge policy controls how Parquet splits within a compaction scope (same time window, partition, and sort schema) are merged. It uses a constant write amplification strategy: splits at the same merge level are greedily accumulated until reaching `max_merge_factor` or `target_split_size_bytes`.
+
+```yaml
+version: 0.7
+index_id: "my-metrics-index"
+# ...
+indexing_settings:
+  parquet_merge_policy:
+    merge_factor: 10
+    max_merge_factor: 12
+    max_merge_ops: 4
+    target_split_size_bytes: 268435456
+    maturation_period: 48h
+    max_finalize_merge_operations: 3
+```
+
+
+| Variable | Description | Default value |
+| ------------- | ------------- | ------------- |
+| `merge_factor` | Minimum number of splits to trigger a merge. | `10` |
+| `max_merge_factor` | Maximum number of splits in a single merge operation. | `12` |
+| `max_merge_ops` | Maximum number of merges a split can undergo before becoming mature. Bounds total write amplification. | `4` |
+| `target_split_size_bytes` | Target size for merged output splits in bytes. Merges trigger when accumulated bytes reach this threshold, even if `merge_factor` is not reached. | `268435456` (256 MiB) |
+| `maturation_period` | Duration after creation when a split becomes mature (never merged again). | `48h` |
+| `max_finalize_merge_operations` | *(advanced)* Maximum number of merge operations emitted during cold-window finalization at pipeline shutdown. Set to `0` to disable. | `3` |
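
The greedy accumulation described for the Parquet merge policy can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the committed Rust implementation: `Split` and `plan_merges` are hypothetical names, all splits are assumed to belong to one compaction scope, `maturation_period` and finalization are ignored, and a batch of a single split is never emitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    split_id: str
    size_bytes: int
    num_merge_ops: int  # merge level: how many merges produced this split


def plan_merges(
    splits: list[Split],
    merge_factor: int = 10,
    max_merge_factor: int = 12,
    max_merge_ops: int = 4,
    target_split_size_bytes: int = 256 * 1024 * 1024,
) -> list[list[Split]]:
    """Group immature splits of the same merge level into merge operations."""
    by_level: dict[int, list[Split]] = {}
    for split in splits:
        if split.num_merge_ops >= max_merge_ops:
            continue  # mature: never merged again
        by_level.setdefault(split.num_merge_ops, []).append(split)

    operations: list[list[Split]] = []
    for level in sorted(by_level):
        batch: list[Split] = []
        batch_bytes = 0
        for split in by_level[level]:
            batch.append(split)
            batch_bytes += split.size_bytes
            # Close the batch once it hits the split-count cap or the size target.
            if len(batch) >= max_merge_factor or batch_bytes >= target_split_size_bytes:
                if len(batch) >= 2:  # assumption: merging one split is pointless
                    operations.append(batch)
                batch, batch_bytes = [], 0
        # A leftover batch is merged only if it reaches the minimum split count.
        if len(batch) >= merge_factor:
            operations.append(batch)
    return operations
```

With the defaults, twelve 1 MiB splits at level 0 form a single operation, nine such splits form none, and three 100 MiB splits are merged as soon as their combined size passes 256 MiB, even though `merge_factor` is not reached.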