Commit 5bb811c

feat: Add WikiText download script and update docs
- Added scripts/setup/download_data.py to fix missing data issue
- Updated USAGE.md with data preparation steps
- Updated ROADMAP.md to mark data curation as done
1 parent 84695e3 commit 5bb811c

3 files changed: 52 additions & 3 deletions


docs/ROADMAP.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -25,8 +25,8 @@ This phase focuses on moving from a structurally complete prototype to a model t
 
 * **Tasks:**
   * **1. Data Curation and Preprocessing:**
-    * **[IN PROGRESS]** Expand the training corpus beyond the initial small text files.
-    * **[TODO]** Download and preprocess a standard dataset (e.g., a subset of WikiText, SlimPajama, or C4).
+    * **[DONE]** Expand the training corpus beyond the initial small text files.
+    * **[DONE]** Download and preprocess a standard dataset (WikiText-103 via `scripts/setup/download_data.py`).
   * **[TODO]** Implement a robust vocabulary generation process with a larger vocabulary size (e.g., 8,000-16,000 tokens) to ensure good coverage of the training data.
   * **2. Baseline Model Training:**
     * **[TODO]** Execute the full four-stage training pipeline on the newly curated dataset.
```
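The vocabulary-generation item in the hunk above is still open; a subword tokenizer trained on the corpus this commit downloads would be one way to cover it. A minimal sketch, not part of the commit, assuming SentencePiece is installed (`pip install sentencepiece`) and using the roadmap's suggested 8,000-16,000 range; the `wikitext_bpe` prefix is a hypothetical name:

```python
import sentencepiece as spm

# Train a BPE vocabulary on the corpus produced by download_data.py.
# vocab_size follows the roadmap's suggested 8,000-16,000 token range.
spm.SentencePieceTrainer.train(
    input="data/text_corpus/wikitext_train.txt",
    model_prefix="wikitext_bpe",  # hypothetical; writes wikitext_bpe.model/.vocab
    vocab_size=16000,
    model_type="bpe",
)
```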

docs/USAGE.md

Lines changed: 12 additions & 1 deletion
````diff
@@ -34,7 +34,18 @@ print(output)
 ## Training
 
 ### 1. Prepare Data
-Ensure you have a text corpus in `data/text_corpus`.
+
+Before training, you need a text corpus. We provide a script to download and preprocess the WikiText-103 dataset:
+
+```bash
+# Install the datasets library if you haven't
+pip install datasets
+
+# Download and prepare the data
+python scripts/setup/download_data.py --output_dir data/text_corpus
+```
+
+This will create a `data/text_corpus/wikitext_train.txt` file ready for training.
 
 ### 2. Train Backbone
 ```bash
````
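After running the download step documented above, a quick sanity check can confirm the corpus landed where training expects it. A minimal sketch, not part of the commit; the path matches the script's default output:

```python
import os

path = "data/text_corpus/wikitext_train.txt"

# Confirm the corpus exists and is non-trivial before starting training.
size_mb = os.path.getsize(path) / 1e6
with open(path, encoding="utf-8") as f:
    first_line = f.readline()

print(f"{path}: {size_mb:.1f} MB")
print(f"First line: {first_line[:80]!r}")
```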

scripts/setup/download_data.py

Lines changed: 38 additions & 0 deletions
```python
import argparse
import os

from datasets import load_dataset


def download_and_process_wikitext(output_dir):
    """
    Downloads WikiText-103 and saves it as a raw text file for training.
    """
    print(f"Downloading WikiText-103 to {output_dir}...")

    try:
        dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return

    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "wikitext_train.txt")

    print("Processing and saving...")
    with open(output_file, "w", encoding="utf-8") as f:
        for i, item in enumerate(dataset):
            text = item["text"]
            # Basic filtering: remove empty lines or very short headers
            if text.strip() and len(text.strip()) > 20:
                f.write(text)

            if (i + 1) % 10000 == 0:
                print(f"Processed {i + 1} lines...", end="\r")

    print(f"\nSuccessfully saved to {output_file}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download and preprocess WikiText-103")
    parser.add_argument("--output_dir", type=str, default="data/text_corpus", help="Output directory")
    args = parser.parse_args()

    download_and_process_wikitext(args.output_dir)
```
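The script above materializes the full split through the local Hugging Face cache before writing. For corpora larger than WikiText-103, `load_dataset` also accepts `streaming=True`, which iterates records without caching the whole split first. A hedged variant, not in the commit, that keeps the same length filter:

```python
import os

from datasets import load_dataset

# Stream records instead of caching the entire split locally first.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

os.makedirs("data/text_corpus", exist_ok=True)
with open("data/text_corpus/wikitext_train.txt", "w", encoding="utf-8") as f:
    for item in dataset:
        text = item["text"]
        # Same filter as download_data.py: drop empty lines and short headers.
        if len(text.strip()) > 20:
            f.write(text)
```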
