This repository provides anonymized, production-derived LLM usage traces collected from a Qwen model serving cluster on Aliyun Bailian. The dataset is designed for trace-driven evaluation of LLM serving systems, including caching, batching, scheduling, and end-to-end inference optimization.
The traces in this repository represent different usage scenarios:
| Scenario | Description | Trace File |
|---|---|---|
| To-C Trace | Chat-style interactive services | qwen_traceA_blksz_16.jsonl |
| To-B Trace | API-driven task automation | qwen_traceB_blksz_16.jsonl |
| Thinking Trace | Reasoning-intensive chat | qwen_thinking_blksz_16.jsonl |
| Coder Trace | Code generation | qwen_coder_blksz_16.jsonl |
- **New Thinking Trace**: Captures long-form reasoning workloads with long output lengths.
- **New Coder Trace**: Represents code-generation and interactive programming workloads.
- **Official Trace Replayer**: We have open-sourced a high-fidelity, timestamp-faithful trace replayer for end-to-end benchmarking:
  👉 https://github.com/blitz-serving/trace-replayer
This dataset contains a two-hour, sampled, anonymized KVCache trace of requests sent to a single Qwen model serving cluster on Aliyun Bailian. It can be used to validate design techniques for LLM serving systems and to inspire future work. The following key workload characteristics are captured:
- Temporal distribution of requests;
- Input/output token length;
- Session structure and chat turn patterns;
- Request type composition (text, search, image, file).
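As a concrete starting point, here is a minimal analysis sketch in Python (our own, not an official tool; the file name is just one of the traces listed above, and the field names follow the record schema shown at the end of this README) that computes several of these characteristics:

```python
import json
from collections import Counter

def load_trace(path):
    """Read a JSONL trace file into a list of record dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# The file name is one of the traces listed above; use a local copy.
records = load_trace("qwen_traceA_blksz_16.jsonl")

# Temporal distribution: requests per minute (timestamps are trace-relative seconds).
per_minute = Counter(int(r["timestamp"]) // 60 for r in records)

# Input/output token lengths.
avg_in = sum(r["input_length"] for r in records) / len(records)
avg_out = sum(r["output_length"] for r in records) / len(records)

# Session structure: conversation-turn distribution.
turns = Counter(r["turn"] for r in records)

# Request type composition (text/search/image/file).
types = Counter(r["type"] for r in records)

print(f"{len(records)} requests, avg input {avg_in:.1f}, avg output {avg_out:.1f} tokens")
print("busiest minute:", per_minute.most_common(1))
print("turns:", dict(sorted(turns.items())), "types:", dict(types))
```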
- Production-Representative: Subset retains real-world traffic patterns
- Privacy-Compliant: Salted hashing + domain remapping anonymization
- Structured Format: JSON Lines with schema documentation
- Apache 2.0 Licensed: Permissive open-source license for commercial use
For insights drawn from this dataset and techniques validated with it, please refer to our works:
- Optimizing KVCache cache design (KVCache@ATC'25)
- Simple yet effective LLM scheduling (LMetric@OSDI'26)
Each file contains a representative workload; e.g., qwen_traceB_blksz_16.jsonl refers to a to-B trace collected in December 2024.
All records are anonymized using the following multi-step pipeline:
- **Token Block Hashing**:
  - Group tokens into 16-token blocks
  - Apply salted SipHash-2-4 to each block
- **Domain Remapping**:
  - Map hash values to sequential integers
  - Breaks correlation between hash IDs and original content
- **ID Randomization**:
  - Replace chat IDs with sequential integers
  - No linkage to user accounts or device identifiers
- **Time-based Anonymization**:
  - All timestamps are normalized to trace-relative values, starting from 0 at the beginning of each trace file. Original absolute timestamps (e.g., Unix time) are removed to prevent temporal correlation with external events or user behavior patterns.
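A minimal sketch of the shape of such a pipeline follows; all function names and the salt are ours, and since SipHash-2-4 is not exposed by Python's standard library, keyed BLAKE2b stands in for it purely for illustration (it does not reproduce the released hashes):

```python
import hashlib
from itertools import count

BLOCK_SIZE = 16   # tokens per block, matching "blksz_16" in the trace file names
_next_id = count()

# NOTE: the dataset's real pipeline uses salted SipHash-2-4; keyed BLAKE2b
# is used here only because SipHash is not in the Python standard library.
def hash_block(token_ids, salt):
    data = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.blake2b(data, key=salt, digest_size=8).hexdigest()

def anonymize_tokens(token_ids, salt, remap):
    """Token Block Hashing + Domain Remapping: hash 16-token blocks with a
    salt, then replace each hash value with a sequential integer ID."""
    hash_ids = []
    for i in range(0, len(token_ids), BLOCK_SIZE):
        h = hash_block(token_ids[i:i + BLOCK_SIZE], salt)
        if h not in remap:
            remap[h] = next(_next_id)   # sequential integers break any
        hash_ids.append(remap[h])       # correlation with the raw content
    return hash_ids

def normalize_timestamps(timestamps):
    """Time-based Anonymization: shift absolute times so the trace starts at 0."""
    t0 = min(timestamps)
    return [t - t0 for t in timestamps]

remap = {}
salt = b"per-trace secret"                              # hypothetical salt; never released
print(anonymize_tokens(list(range(40)), salt, remap))   # -> [0, 1, 2]
print(anonymize_tokens(list(range(16)), salt, remap))   # shared block -> [0]
print(normalize_timestamps([1733000000.2, 1733000061.3]))  # -> [0.0, ~61.1]
```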
To enable end-to-end, trace-driven benchmarking, we provide an official open-source Trace Replayer:
👉 https://github.com/blitz-serving/trace-replayer
Trace Replayer is a Rust-based, high-throughput replay engine that:
- Reconstructs synthetic prompts from input length + block hashes
- Preserves KVCache hit/miss patterns
- Replays requests against real backends (e.g., vLLM) via standard APIs
- Records per-request latency, TTFT/TPOT (backend-dependent), and timing drift
It can achieve 100+ QPS and 500K+ tokens/s using ~30 CPU threads, which is sufficient to stress-test Qwen3-30B-A3B deployments of 16–32 instances.
Supported backends include OpenAI-compatible APIs, TGI, and AIBrix.
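To illustrate how faithful replay is possible without the original text, here is our sketch of the core idea (not the replayer's actual Rust implementation): derive tokens deterministically from each block's hash_id, so requests sharing hash_ids reproduce identical prefixes and hence the same KVCache hit/miss behavior. The vocabulary size is an assumption:

```python
import random

BLOCK_SIZE = 16
VOCAB_SIZE = 32000   # assumption: substitute the target model's vocabulary size

def synth_tokens(hash_ids, input_length):
    """Deterministically expand anonymized block hashes into token IDs.
    Identical hash_ids always expand to identical 16-token blocks, so
    prefix sharing (and thus KVCache hits) matches the original trace."""
    tokens = []
    for h in hash_ids:
        rng = random.Random(h)      # seed the RNG with the block hash
        tokens.extend(rng.randrange(VOCAB_SIZE) for _ in range(BLOCK_SIZE))
    return tokens[:input_length]    # the final block may be partial

# Two requests sharing their first two blocks share an identical 32-token prefix.
a = synth_tokens([1089, 1090, 7777], 40)
b = synth_tokens([1089, 1090, 8888], 40)
print(a[:32] == b[:32], a[32:] == b[32:])   # True False
```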
For common questions about trace patterns (e.g., missing tokens across turns, block hash mismatches), see docs/qa-context-growth-pattern.md.
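As a hedged illustration of one such pattern (our reading; the FAQ document above is authoritative): because a turn's final block may cover fewer than 16 tokens, the next turn can re-hash it after new tokens are appended, so cross-turn comparisons should only look at full leading blocks:

```python
def extends_previous_turn(prev_hash_ids, cur_hash_ids):
    """Check whether a turn's context grows out of the previous turn's blocks.
    The previous turn's final block may be partial, so it can be re-hashed
    with new tokens appended; compare only the full leading blocks."""
    n = max(len(prev_hash_ids) - 1, 0)   # drop the trailing, possibly partial block
    return cur_hash_ids[:n] == prev_hash_ids[:n]

# Turn 2 keeps turn 1's full blocks but re-hashes its partial last block.
print(extends_previous_turn([1089, 1090, 1091], [1089, 1090, 6326, 13148]))  # True
```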
The released hash_ids are anonymized hashes of the actual token IDs consumed by the inference engine after the model-specific chat_template has already been applied.
Do not apply chat_template again when using these traces.
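A sketch of the safe replay path, reusing synth_tokens from the replayer sketch above; the server URL and model name are hypothetical, and we assume an OpenAI-compatible /v1/completions endpoint (e.g., vLLM's) that accepts token-ID prompts:

```python
import requests

record = {"hash_ids": [1089, 1090, 1091], "input_length": 40, "output_length": 132}

# Rebuild the prompt with synth_tokens() from the replayer sketch above.
prompt_token_ids = synth_tokens(record["hash_ids"], record["input_length"])

# RIGHT: a completion-style endpoint that takes the tokens as-is.
resp = requests.post(
    "http://localhost:8000/v1/completions",   # hypothetical vLLM-style server
    json={
        "model": "qwen-serving",              # hypothetical served-model name
        "prompt": prompt_token_ids,           # template tokens are already included
        "max_tokens": record["output_length"],
    },
    timeout=60,
)

# WRONG: /v1/chat/completions with a messages list, which would make the
# server apply the model's chat_template a second time.
```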
- No PII: All content hashed with irreversible cryptographic functions
- Unlinkable: No cross-session or user-device associations preserved
- GDPR/CCPA Compliant: Meets anonymous data standards under major regulations
A permissive license allowing commercial use and modification, requiring only preservation of the license notice in derivative works.
If you find this dataset useful or use it in your research, please cite our paper with the following BibTeX entry. Thanks!
@inproceedings{kvcache,
  title     = {KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider},
  author    = {Wang, Jiahao and Han, Jinbo and Wei, Xingda and Shen, Sijie and Zhang, Dingyan and Fang, Chenguang and Chen, Rong and Yu, Wenyuan and Chen, Haibo},
  booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
  year      = {2025},
  url       = {https://www.usenix.org/conference/atc25/presentation/wang-jiahao},
  publisher = {USENIX Association},
  month     = jul,
}
{ "chat_id": 159, // Randomized chat identifier "parent_chat_id": 55, // -1 for root requests "timestamp": 61.114, // Seconds since request arrive "input_length": 521, // Input token count "output_length": 132, // Output token count "type": "text", // Request type: text/search/image/file "turn": 2, // Conversation turn number "hash_ids": [1089, 1090, 1091, 6326, ..., 13148] // Salted SipHash blocks (16 tokens per block) }