📁 DeepPrune Datasets

This repository contains the datasets used in the DeepPrune experiments, organized into four categories: pre-experiment, fine-tuning, offline evaluation, and online evaluation.


🔍 Pre-experiment Datasets

The pre_exp_data/ directory contains datasets used in preliminary experiments. These datasets facilitate a comparative analysis between semantic similarity (computed using Sentence-BERT) and zero-shot judgments from large language models (LLMs).


🛠️ Fine-tuning Datasets

Located in the finetune_data/ directory, these datasets are formatted for use with LLaMA-Factory and include:

  • train.jsonl – Training data
  • test.jsonl – Evaluation data

Each line in these .jsonl files is a JSON object with the following fields:

{
  "instruction": "A system prompt or task description",
  "input": "A pair of truncated responses, to be checked for whether their answers are identical",
  "output": "The expected model response: identical/not identical"
}

These datasets are used to fine-tune base models before applying the DeepPrune pruning strategy.
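
As a minimal sketch of reading these files, the snippet below parses JSONL content line by line and checks that each record carries the three documented fields. The helper names `iter_records` and `load_jsonl` are hypothetical and not part of this repository; the only assumptions are the one-object-per-line layout and the field names described above.

```python
import json
from typing import Iterable, Iterator

# Field names as documented for the fine-tuning .jsonl files.
REQUIRED_FIELDS = ("instruction", "input", "output")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, validating the documented fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def load_jsonl(path: str) -> list:
    """Hypothetical helper: read every record from a .jsonl file on disk."""
    with open(path, encoding="utf-8") as fh:
        return list(iter_records(fh))
```

For example, `load_jsonl("finetune_data/train.jsonl")` would return the training records as a list of dictionaries, assuming the file sits at that path relative to the working directory.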

⚠️ The answers in these .jsonl files have already been truncated, so the files can be used to fine-tune models directly. If you want to try other strategies, please use train.json and test.json to generate your own datasets.


📊 Offline Evaluation Datasets

In the offline_test_data/ directory, we provide model-generated responses from the following models on a shared set of problems:

  • glm-4.5-air
  • Qwen3-4B-Thinking-2507
  • QwQ-32B

These outputs are used to evaluate the performance of models after fine-tuning.


🌐 Online Evaluation Datasets

The online_test_data/ directory contains datasets collected through active querying of large language models. Specifically:

  • For each problem, we gathered 512 model-generated answers from:
    • DeepSeek-R1-0528-Qwen3-8B
    • gpt-oss-20b
    • Qwen3-32B

Each JSON file in this folder includes the following fields:

{
  "problem": "The original question or task",
  "answer": "The model's generated response",
  "true_answer": "The ground-truth or reference answer"
}

These datasets are used to empirically validate DeepPrune in real-world, dynamic settings by measuring how pruning affects output quality under diverse sampling conditions.
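
The sketch below shows one way to group the sampled answers by problem, for instance to confirm how many answers were collected for each. It assumes each file holds a JSON array of objects with the three fields listed above; the exact top-level layout is not stated here, so treat that as an assumption. The function names `load_records` and `group_answers` are hypothetical.

```python
import json
from collections import defaultdict


def load_records(path: str) -> list:
    """Hypothetical helper. Assumption: the file is a JSON array of objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def group_answers(records: list) -> dict:
    """Map each problem text to the list of model answers sampled for it."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["problem"]].append(rec["answer"])
    return dict(grouped)


def answers_per_problem(records: list) -> dict:
    """Count how many sampled answers each problem has."""
    return {problem: len(answers) for problem, answers in group_answers(records).items()}
```

Calling `answers_per_problem(load_records(path))` on a file from online_test_data/ gives a quick check on the number of samples per problem. Note that `answer` is the model's full generated response, so comparing it with `true_answer` would first require extracting the final answer, which this sketch does not attempt.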