solr-datagen

Generate and index realistic documents into Apache Solr at scale, or reindex existing documents in-place.

Given a Solr URL (with collection/core name), solr-datagen can either generate synthetic documents with realistic data across all field types and index them in parallel batches (index mode), or read existing documents from the collection and write them back in-place using cursorMark deep paging (reindex mode). Works with Solr 7.x through 10.x.

Features

Schema-aware — automatically discovers fields, types, unique key, and multiValued settings
Type-diverse generation — covers strings, text, integers, longs, floats, doubles, dates, and booleans
Solr 7–10 compatible — handles both Trie (deprecated since 7.0 but still present through 10.x) and Point (the default since 7.0) field type classes transparently
Scales to millions — threaded batch submission with backpressure and cursorMark deep paging for reindex
Reproducible — optional --seed for deterministic output (index mode)
Resilient — exponential-backoff retries on batch failures, graceful Ctrl+C handling
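
The retry behavior listed under Resilient can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: the names submit_with_retries, max_attempts, and base_delay are hypothetical, and the real attempt limit and delays are not documented here.

```python
import time
from typing import Callable


def submit_with_retries(
    send: Callable[[], None],
    max_attempts: int = 4,      # hypothetical limit, not the tool's documented value
    base_delay: float = 0.5,    # hypothetical first delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or the attempts run out.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    Returns True on success and False if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False  # last attempt failed; give up on this batch
            sleep(base_delay * (2 ** attempt))
    return False
```

Passing sleep as a parameter keeps the sketch easy to exercise without real waiting; a caller would normally leave it at the default.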

Requirements

Python 3.9+
A running Apache Solr instance with at least one collection/core

Installation

git clone https://github.com/rahulgoswami/solr-datagen.git
cd solr-datagen
pip install -r requirements.txt

Usage

# Index mode (generate and index synthetic documents)
python -m solr_datagen index <solr_url> <count> [options]

# Reindex mode (read existing docs and write them back in-place)
python -m solr_datagen reindex <solr_url> [options]

The legacy form python -m solr_datagen <solr_url> <count> (without a subcommand) still works and is equivalent to index.
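
One way such a fallback could be implemented is to rewrite the argument list before parsing. The sketch below is an assumption about the approach, not the project's cli.py; normalize_argv and SUBCOMMANDS are hypothetical names.

```python
from typing import List

SUBCOMMANDS = {"index", "reindex"}


def normalize_argv(argv: List[str]) -> List[str]:
    """Prepend "index" when the first argument is not a subcommand.

    Arguments that start with "-" (for example -h) are left alone so that
    top-level options still reach the parser unchanged.
    """
    if argv and argv[0] not in SUBCOMMANDS and not argv[0].startswith("-"):
        return ["index", *argv]
    return list(argv)
```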

Index mode examples

# Dry run — inspect schema without indexing
python -m solr_datagen index http://localhost:8983/solr/my_collection 0 --dry-run

# Index 1,000 documents with defaults
python -m solr_datagen index http://localhost:8983/solr/my_collection 1000

# Index 1M documents with tuned settings
python -m solr_datagen index http://localhost:8983/solr/my_collection 1000000 \
  --batch-size 1000 --workers 8

# With basic auth
python -m solr_datagen index http://localhost:8983/solr/my_collection 5000 \
  --auth admin:secret

# Always include specific fields alongside the diverse selection
python -m solr_datagen index http://localhost:8983/solr/my_collection 1000 \
  --include-fields title,price,category

# Reproducible run
python -m solr_datagen index http://localhost:8983/solr/my_collection 500 --seed 42

Reindex mode examples

# Reindex all documents in the collection
python -m solr_datagen reindex http://localhost:8983/solr/my_collection

# Reindex only documents matching a filter query
python -m solr_datagen reindex http://localhost:8983/solr/my_collection \
  --fq "category:electronics"

# Reindex with more workers and a larger read page
python -m solr_datagen reindex http://localhost:8983/solr/my_collection \
  --workers 8 --read-batch-size 5000

# With basic auth
python -m solr_datagen reindex http://localhost:8983/solr/my_collection \
  --auth admin:secret

Options

Shared options (both modes)

| Flag | Default | Description |
| --- | --- | --- |
| `-b`, `--batch-size` | 500 | Documents per HTTP POST request |
| `-c`, `--commit-within` | 0 | `commitWithin` in ms; 0 disables per-batch commits (see note below) |
| `-w`, `--workers` | 4 | Parallel submission threads |
| `-a`, `--auth` | None | Basic auth as `user:password` |
| `-v`, `--verbose` | false | Enable debug logging |

Index mode only

| Flag or argument | Default | Description |
| --- | --- | --- |
| `count` | required | Number of documents to generate (positional argument) |
| `-f`, `--max-fields` | 20 | Max fields to select from schema |
| `--fields-per-type` | 3 | Max fields per type category |
| `-s`, `--seed` | None | Random seed for reproducibility |
| `--include-fields` | None | Comma-separated field names to always include, in addition to the diverse selection; unknown or unstored names raise an error before indexing starts |
| `--dry-run` | false | Analyse the schema only; do not index |
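
To illustrate how --max-fields, --fields-per-type, and --include-fields might interact, here is a minimal sketch of a diverse field selection. It is an assumption about the general approach, not the logic in schema_analyzer.py: select_fields is a hypothetical name, the schema is reduced to (field name, type category) pairs, and forced fields are assumed to be added on top of the --max-fields cap.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def select_fields(
    fields: Iterable[Tuple[str, str]],   # (field name, type category)
    max_fields: int = 20,
    fields_per_type: int = 3,
    include_fields: Iterable[str] = (),
) -> List[str]:
    """Pick a type-diverse subset of fields, then append forced fields."""
    include = list(include_fields)
    by_category: Dict[str, List[str]] = defaultdict(list)
    known = set()
    for name, category in fields:
        known.add(name)
        if len(by_category[category]) < fields_per_type:
            by_category[category].append(name)

    # Fail early on names that are not in the schema.
    unknown = [name for name in include if name not in known]
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")

    # Round-robin across categories so that no single type dominates.
    selected: List[str] = []
    depth = 0
    while len(selected) < max_fields:
        layer = [names[depth] for names in by_category.values() if depth < len(names)]
        if not layer:
            break
        selected.extend(layer[: max_fields - len(selected)])
        depth += 1

    # Assumption: forced fields are added in addition to the capped selection.
    for name in include:
        if name not in selected:
            selected.append(name)
    return selected
```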

Reindex mode only

| Flag | Default | Description |
| --- | --- | --- |
| `--fq` | None | Filter query to limit the documents reindexed (Solr syntax); omit to reindex all |
| `--read-batch-size` | 2000 | Documents per cursorMark read page |

Commit behavior (changed in this release)

Both modes now default to --commit-within 0, meaning no per-batch commitWithin is sent to Solr. Instead, the tool relies on the collection's server-side autoCommit (typically every 15 seconds with openSearcher=false in modern Solr configsets) for tlog management during the run, and issues a single explicit hard commit at the end to make all documents visible and durable.

If your collection does not have autoCommit configured (uncommon in production, but possible in dev/test), the transaction log will grow without bound during a large run. In that case, pass -c 60000 (60 seconds) or a similar value so that each batch is sent with a commitWithin deadline.
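
In terms of Solr's update handler, the difference comes down to query parameters on the /update request. The sketch below shows how the request URLs could be built; commitWithin and commit are standard Solr update parameters, but update_url is a hypothetical helper and the exact requests the tool sends are an assumption.

```python
from urllib.parse import urlencode


def update_url(collection_url: str, commit_within_ms: int = 0, final_commit: bool = False) -> str:
    """Build the /update URL for one batch or for the final hard commit."""
    params = {"wt": "json"}
    if commit_within_ms > 0:
        # Ask Solr to commit this batch within the given number of milliseconds.
        params["commitWithin"] = str(commit_within_ms)
    if final_commit:
        # Explicit hard commit, issued once at the end of the run.
        params["commit"] = "true"
    return f"{collection_url.rstrip('/')}/update?{urlencode(params)}"
```

With the default of 0, per-batch URLs carry no commit parameter at all, and only the last request of the run sets commit=true.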

Reindex behavior

Reindex mode reads documents from the collection using Solr's cursorMark API (deep paging sorted by the unique key), strips fields that should not be re-posted, and writes them back in-place.

Fields stripped before reposting:

All copyField destinations (they will be repopulated from their source fields on reindex)
_root_ and _nest_path_ (nested document markers — see limitations)
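
A minimal sketch of this stripping step, assuming the copyField destinations are available as a list of field names or Solr-style glob patterns such as *_txt. The names strip_for_repost and NESTED_MARKERS are hypothetical and are not taken from copyfield_filter.py.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable

NESTED_MARKERS = ("_root_", "_nest_path_")


def strip_for_repost(doc: Dict[str, Any], copyfield_dests: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of doc without copyField destinations and nested markers.

    fnmatchcase is used as a stand-in for Solr's "*" glob matching; it also
    treats "?" and "[" specially, which Solr field patterns do not.
    """
    patterns = list(copyfield_dests)
    cleaned = {}
    for name, value in doc.items():
        if name in NESTED_MARKERS:
            continue
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            continue
        cleaned[name] = value  # _version_ and all other fields are kept
    return cleaned
```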

_version_ is preserved. Each document is re-posted with its current _version_ value. If the document was modified by another writer between the read and the write, Solr will reject the reindex of that document (HTTP 409). The batch is counted as a failed batch; the run continues with subsequent batches.
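
The rule Solr applies to a supplied _version_ value can be modeled in a few lines. This is a plain-Python model of Solr's documented optimistic concurrency semantics, not a call into Solr; version_check_passes is a hypothetical name. Reindex mode always falls into the first case, because it re-posts the version it read.

```python
from typing import Optional


def version_check_passes(supplied: int, existing: Optional[int]) -> bool:
    """Model of Solr's _version_ check on update.

    existing is the stored _version_ of the document, or None if the
    document does not exist. A False result corresponds to HTTP 409.
    """
    if supplied > 1:
        return existing == supplied   # must match the stored version exactly
    if supplied == 1:
        return existing is not None   # document must exist, any version
    if supplied < 0:
        return existing is None       # document must not exist
    return True                       # 0: no concurrency check
```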

Limitations:

Fields with stored=false are not retrievable. If a copyField source is not stored, its data will be absent in the reindexed document.
Nested/child documents are not supported in this release. Parent-child relationships (_root_, _nest_path_) are stripped, which may leave orphaned children.
If the index is being actively written to during a reindex run, cursorMark may skip or duplicate some documents. The _version_ check will surface accidental concurrent overwrites as failed batches.

How It Works

Index mode

Connect — validates the Solr URL, detects version and mode (standalone/SolrCloud)
Introspect — fetches fields and field types from the Schema API, skips internal and non-stored fields
Select — picks a diverse subset of fields (up to --max-fields), ensuring representation across type categories, plus the unique key, required fields, and any --include-fields names
Generate — creates documents using pre-computed data pools (via Faker) for high throughput
Index — submits documents in parallel batches with backpressure, retries, and progress reporting
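
The final step, parallel batch submission with backpressure, could look roughly like the sketch below. It is an illustration under assumptions, not the code in indexer.py: submit_all, batched, and post_batch are hypothetical names, and the in-flight limit of twice the worker count is an arbitrary choice.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most size items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_all(
    docs: Iterable[dict],
    post_batch: Callable[[List[dict]], None],
    batch_size: int = 500,
    workers: int = 4,
) -> int:
    """Post all batches in parallel and return the number of failed batches."""
    slots = threading.BoundedSemaphore(workers * 2)  # hypothetical in-flight limit
    lock = threading.Lock()
    failed = 0

    def run(batch: List[dict]) -> None:
        nonlocal failed
        try:
            post_batch(batch)
        except Exception:
            with lock:
                failed += 1   # count the failure and keep going
        finally:
            slots.release()   # free a slot so the producer can continue

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(docs, batch_size):
            slots.acquire()   # blocks generation while too many batches are pending
            pool.submit(run, batch)
    return failed
```

The semaphore is what provides the backpressure: document generation pauses whenever the workers fall behind, so memory use stays bounded even for runs of millions of documents.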

Reindex mode

Connect — same as above
Schema precheck — fetches the unique key, its field type, and the copyField rules; verifies that the unique key is sortable (required for cursorMark)
Read — pages through all (or filtered) documents using cursorMark, sorted by the unique key; copyField destinations are stripped from each page before batching
Index — same parallel batching pipeline as index mode; _version_ is preserved for optimistic concurrency
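
The read step follows Solr's standard cursorMark protocol: start with cursorMark=*, sort on the unique key, and stop when the returned nextCursorMark equals the one that was sent. A minimal sketch, assuming a hypothetical fetch_page callable that performs the /select request and returns the parsed JSON response; iter_docs is likewise a hypothetical name, not the API of reindex_source.py.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_docs(
    fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
    unique_key: str = "id",
    rows: int = 2000,
    fq: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield every matching document using cursorMark deep paging."""
    cursor = "*"  # Solr's marker for the first page
    while True:
        params: Dict[str, Any] = {
            "q": "*:*",
            "sort": f"{unique_key} asc",   # cursorMark requires a sort on the unique key
            "rows": rows,
            "cursorMark": cursor,
        }
        if fq:
            params["fq"] = fq
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means there are no more results
        cursor = next_cursor
```

Because the generator yields documents one page at a time, it can feed the batching pipeline directly without holding the whole collection in memory.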

Project Structure

solr_datagen/
├── __init__.py
├── __main__.py          # python -m solr_datagen entry point
├── cli.py               # argument parsing and orchestration
├── config.py            # constants and field-type mappings
├── solr_client.py       # Solr HTTP client
├── schema_analyzer.py   # schema introspection and field selection
├── data_generator.py    # per-type random data generation
├── indexer.py           # batch submission pipeline (DocSource protocol)
├── reindex_source.py    # cursorMark-based document reader for reindex mode
├── copyfield_filter.py  # copyField destination matcher
├── sort_check.py        # uniqueKey sortability precheck for cursorMark
└── progress.py          # progress tracking and reporting
