Commit 5bb811c

feat: Add WikiText download script and update docs
- Added scripts/setup/download_data.py to fix missing data issue
- Updated USAGE.md with data preparation steps
- Updated ROADMAP.md to mark data curation as done
1 parent 84695e3 commit 5bb811c

3 files changed: 52 additions & 3 deletions


docs/ROADMAP.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -25,8 +25,8 @@ This phase focuses on moving from a structurally complete prototype to a model t
 
 * **Tasks:**
   * **1. Data Curation and Preprocessing:**
-    * **[IN PROGRESS]** Expand the training corpus beyond the initial small text files.
-    * **[TODO]** Download and preprocess a standard dataset (e.g., a subset of WikiText, SlimPajama, or C4).
+    * **[DONE]** Expand the training corpus beyond the initial small text files.
+    * **[DONE]** Download and preprocess a standard dataset (WikiText-103 via `scripts/setup/download_data.py`).
   * **[TODO]** Implement a robust vocabulary generation process with a larger vocabulary size (e.g., 8,000-16,000 tokens) to ensure good coverage of the training data.
   * **2. Baseline Model Training:**
     * **[TODO]** Execute the full four-stage training pipeline on the newly curated dataset.
```
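The vocabulary-generation item in the hunk above is still open; a subword tokenizer trained on the corpus this commit downloads would be one way to cover it. A minimal sketch, not part of the commit, assuming SentencePiece is installed (`pip install sentencepiece`) and using the roadmap's suggested 8,000-16,000 range; the `wikitext_bpe` prefix is a hypothetical name:

```python
import sentencepiece as spm

# Train a BPE vocabulary on the corpus produced by download_data.py.
# vocab_size follows the roadmap's suggested 8,000-16,000 token range.
spm.SentencePieceTrainer.train(
    input="data/text_corpus/wikitext_train.txt",
    model_prefix="wikitext_bpe",  # hypothetical; writes wikitext_bpe.model/.vocab
    vocab_size=16000,
    model_type="bpe",
)
```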

docs/USAGE.md

Lines changed: 12 additions & 1 deletion
````diff
@@ -34,7 +34,18 @@ print(output)
 ## Training
 
 ### 1. Prepare Data
-Ensure you have a text corpus in `data/text_corpus`.
+
+Before training, you need a text corpus. We provide a script to download and preprocess the WikiText-103 dataset:
+
+```bash
+# Install the datasets library if you haven't
+pip install datasets
+
+# Download and prepare the data
+python scripts/setup/download_data.py --output_dir data/text_corpus
+```
+
+This will create a `data/text_corpus/wikitext_train.txt` file ready for training.
 
 ### 2. Train Backbone
 ```bash
````
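After running the download step documented above, a quick sanity check can confirm the corpus landed where training expects it. A minimal sketch, not part of the commit; the path matches the script's default output:

```python
import os

path = "data/text_corpus/wikitext_train.txt"

# Confirm the corpus exists and is non-trivial before starting training.
size_mb = os.path.getsize(path) / 1e6
with open(path, encoding="utf-8") as f:
    first_line = f.readline()

print(f"{path}: {size_mb:.1f} MB")
print(f"First line: {first_line[:80]!r}")
```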

scripts/setup/download_data.py

Lines changed: 38 additions & 0 deletions
```python
import argparse
import os

from datasets import load_dataset


def download_and_process_wikitext(output_dir):
    """
    Downloads WikiText-103 and saves it as a raw text file for training.
    """
    print(f"Downloading WikiText-103 to {output_dir}...")

    try:
        dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return

    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "wikitext_train.txt")

    print("Processing and saving...")
    with open(output_file, "w", encoding="utf-8") as f:
        for i, item in enumerate(dataset):
            text = item["text"]
            # Basic filtering: remove empty lines or very short headers
            if text.strip() and len(text.strip()) > 20:
                f.write(text)

            if (i + 1) % 10000 == 0:
                print(f"Processed {i + 1} lines...", end="\r")

    print(f"\nSuccessfully saved to {output_file}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download and preprocess WikiText-103")
    parser.add_argument("--output_dir", type=str, default="data/text_corpus", help="Output directory")
    args = parser.parse_args()

    download_and_process_wikitext(args.output_dir)
```
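The script above materializes the full split through the local Hugging Face cache before writing. For corpora larger than WikiText-103, `load_dataset` also accepts `streaming=True`, which iterates records without caching the whole split first. A hedged variant, not in the commit, that keeps the same length filter:

```python
import os

from datasets import load_dataset

# Stream records instead of caching the entire split locally first.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

os.makedirs("data/text_corpus", exist_ok=True)
with open("data/text_corpus/wikitext_train.txt", "w", encoding="utf-8") as f:
    for item in dataset:
        text = item["text"]
        # Same filter as download_data.py: drop empty lines and short headers.
        if len(text.strip()) > 20:
            f.write(text)
```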
