Skip to content

generalMG/DocumentParser

Repository files navigation

ArXiv Database Project

Python tooling for loading the arXiv metadata snapshot into PostgreSQL, storing PDFs directly in the database, and running a PaddleOCR-VL pipeline (worker + visualization).

What you can do

  • Process arXiv papers end-to-end with a unified script (scripts/process_arxiv.py)
    • Load metadata from JSONL into PostgreSQL
    • Download PDFs directly into the database (BYTEA storage)
    • Concurrent downloads with rate limiting
    • Resume from where you left off automatically
  • Run a unified PaddleOCR-VL worker (GPU or CPU) with concurrent model workers (scripts/ocr_worker.py)
  • Extract text from PDFs with OCR and store results directly in the database
  • Visualize OCR layout boxes and extracted text side-by-side (scripts/visualize_ocr.py)
  • Quickly explore the metadata snapshot and plot year/version trends (fast_analysis.py)

Requirements

  • Python 3.8+
  • PostgreSQL 12+
  • ~5GB disk for the metadata JSON
  • Database storage for PDFs (expect ~0.37MB average per paper)
  • Optional GPU for PaddleOCR-VL (install paddlepaddle-gpu instead of paddlepaddle)

Install dependencies:

python -m venv .venv && source .venv/bin/activate  # optional but recommended
pip install -r requirements.txt

requests, PyPDF2, pdf2image, reportlab, and PaddleOCR packages are needed for the OCR pipeline.

Recommendation

Use uv package manager for faster and more reliable installation.

Configuration

Set environment variables directly or create a .env file (loaded by scripts via python-dotenv):

Variable Description Default
SQLALCHEMY_URL Full DB URL (overrides other DB vars) postgresql://user@/arxiv
DB_HOST Postgres host ("" for Unix socket) ""
DB_PORT Postgres port 5432
DB_NAME Database name arxiv
DB_USER Database user postgres
DB_PASSWORD Database password ""
ARXIV_DATA_PATH Path to the metadata JSONL snapshot ./arxiv-metadata-oai-snapshot.json
DOWNLOAD_DELAY Seconds to sleep between PDF downloads 3.0

Note: PDFs are now stored directly in PostgreSQL as binary data (BYTEA). PDF_BASE_PATH is only used by legacy scripts.

Unix socket example:

DB_HOST=
SQLALCHEMY_URL=postgresql://your_user@/arxiv

TCP example:

DB_HOST=localhost
SQLALCHEMY_URL=postgresql://your_user:password@localhost:5432/arxiv

Database setup

createdb arxiv          # once
alembic upgrade head    # apply migrations (001, 002, 003, 004)

Migrations:

  • 001: Initial schema (papers, categories, relationships)
  • 002: PDF download tracking columns
  • 003: PDF binary storage (BYTEA column for storing PDFs in database)
  • 004: OCR results storage (JSONB column + processing status tracking)

Unified Processing Pipeline (Recommended)

The process_arxiv.py script is the main entry point that handles metadata loading and PDF downloading in one go.

Starting from scratch (empty database)

# Process first 5000 papers (metadata + PDFs)
python scripts/process_arxiv.py --limit 5000 --workers 4

Resume from where you left off

# Continue processing 5000 MORE papers (auto-resumes)
python scripts/process_arxiv.py --limit 5000 --workers 4

The script automatically:

  • Checks what papers are already in the database
  • Skips those papers when reading the JSONL
  • Loads only NEW papers
  • Downloads PDFs for the new papers

Backfill missing PDFs

If you have papers in the database but some don't have PDFs:

# Check how many papers need PDFs
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE pdf_content IS NULL;"

# Download PDFs for papers that don't have them
python scripts/process_arxiv.py --skip-metadata --limit 3000 --workers 4

By default, backfill mode prioritizes papers that have never been attempted and skips rows that already have a recorded download error. To retry failed downloads explicitly:

python scripts/process_arxiv.py --skip-metadata --retry-download-errors --limit 3000 --workers 4

Common scenarios

Scenario 1: You have 4056 papers in DB, only 1294 have PDFs

# First, backfill the 2762 missing PDFs
python scripts/process_arxiv.py --skip-metadata --limit 2762 --workers 4

# Then continue loading new papers
python scripts/process_arxiv.py --limit 5000 --workers 4

Scenario 2: You want to load metadata only (no PDF downloads)

python scripts/process_arxiv.py --limit 10000 --skip-download

Scenario 3: Process everything with 8 concurrent workers

python scripts/process_arxiv.py --limit 10000 --workers 8 --delay 2.0

Pipeline features

  • Automatic resume: Skips papers already in database (use --no-resume to disable)
  • Concurrent downloads: Multi-threaded PDF downloads (default: 4 workers)
  • Rate limiting: Respects arXiv terms of service with configurable delays
  • Error handling: Tracks failed downloads with error messages for retry
  • Backfill prioritization: Avoids starving never-attempted papers behind permanent download failures
  • Progress tracking: Real-time progress bars for metadata and PDFs
  • Status syncing: Automatically ensures pdf_downloaded flag matches pdf_content presence
  • Database storage: PDFs stored as binary data (BYTEA) directly in PostgreSQL

Check your progress

# Total papers
psql -d arxiv -c "SELECT COUNT(*) as total FROM arxiv_papers;"

# Papers with PDFs
psql -d arxiv -c "SELECT COUNT(*) as with_pdfs FROM arxiv_papers WHERE pdf_content IS NOT NULL;"

# Papers needing PDFs
psql -d arxiv -c "SELECT COUNT(*) as need_pdfs FROM arxiv_papers WHERE pdf_content IS NULL;"

# Failed downloads
psql -d arxiv -c "SELECT COUNT(*) as failed FROM arxiv_papers WHERE pdf_download_error IS NOT NULL;"

# OCR progress
psql -d arxiv -c "SELECT COUNT(*) as ocr_done FROM arxiv_papers WHERE ocr_processed = true;"

Legacy/Alternative Scripts

The following individual scripts are still available if you need fine-grained control:

Load metadata only

python scripts/load_arxiv_data.py --max-records 1000

Download PDFs to filesystem

python scripts/download_pdfs.py --limit 200 --delay 3

Load filesystem PDFs into database

python scripts/load_pdfs_to_db.py --limit 1000

Sync status flags

python scripts/sync_pdf_status.py

OCR Worker (Database Storage)

Extract text from PDFs and store results directly in the database:

# Process 100 papers with 1 model worker on GPU 0
python scripts/ocr_worker.py --limit 100 --device gpu:0 --workers 1

# CPU fallback
python scripts/ocr_worker.py --limit 20 --device cpu --workers 1

# Limit pages per PDF
python scripts/ocr_worker.py --limit 100 --pages 10

# Cap rendered pages to avoid OOM on huge PDFs
python scripts/ocr_worker.py --limit 100 --render-page-limit 100

# Tune throughput and memory pressure
python scripts/ocr_worker.py --limit 100 --page-batch-size 4 --render-workers 2

# Retry previously failed papers
python scripts/ocr_worker.py --limit 100 --retry-errors

Worker features

  • Database storage: OCR results saved as JSONB directly in PostgreSQL
  • Process isolation: Model inference runs in dedicated worker processes
  • Bounded pipeline controls: Page limits, batch sizing, and render-worker controls
  • Resume support: Continues from previously stored partial OCR results
  • Progress/error tracking: Uses ocr_processed, ocr_processed_at, and ocr_error

OCR database columns

Column Type Description
ocr_results JSONB OCR output (layout boxes, text content, metadata)
ocr_processed BOOLEAN True when all pages completed
ocr_processed_at TIMESTAMP When OCR finished
ocr_error TEXT Error message if processing failed

Check OCR progress

# Papers with OCR results
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_results IS NOT NULL;"

# Fully processed
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_processed = true;"

# Failed OCR
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_error IS NOT NULL;"

Visualize OCR Output

View OCR results with bounding boxes and extracted text side-by-side:

# List papers with OCR results
python scripts/visualize_ocr.py --list

# Visualize a specific paper (pulls from database)
python scripts/visualize_ocr.py --paper-id 0704.0001 --save

# Specific page
python scripts/visualize_ocr.py --paper-id 0704.0001 --page 2 --save

# All pages
python scripts/visualize_ocr.py --paper-id 0704.0001 --all-pages --save

# Custom text panel size and font
python scripts/visualize_ocr.py --paper-id 0704.0001 --save --text-width 800 --font-size 20

Visualization features

  • Pulls PDF and OCR results directly from database (no local files needed)
  • Side-by-side view: annotated PDF page + extracted text panel
  • Color-coded bounding boxes by label type (title, text, figure, table, etc.)
  • Numbered boxes matching text entries for easy reference
  • Configurable text panel width and font size

Quick metadata analysis

python fast_analysis.py ./arxiv-metadata-oai-snapshot.json --cpu_count 8

Runs a multiprocessing scan of the JSONL file and saves outputs/arxiv_analysis_fast.png with year/version histograms.

Project layout

alembic/
  env.py
  versions/
    001_initial_schema.py
    002_add_pdf_download_tracking.py
    003_add_pdf_binary_storage.py
    004_add_ocr_results_storage.py
database/
  database.py          # URL builder, engine/session manager
  models.py            # ORM models + indexes
scripts/
  process_arxiv.py     # Unified pipeline: metadata + PDF downloads (RECOMMENDED)
  load_arxiv_data.py   # Legacy: JSONL -> Postgres loader
  download_pdfs.py     # Legacy: Download PDFs to filesystem
  load_pdfs_to_db.py   # Legacy: Load filesystem PDFs into database
  sync_pdf_status.py   # Legacy: Reconcile pdf_downloaded flags
  ocr_worker.py        # Unified OCR worker (render + inference + DB writes)
  visualize_ocr.py     # Visualize OCR results from database
  compute_perplexity.py# Evaluate OCR text quality via perplexity
fast_analysis.py       # Parallel metadata explorer
schema.sql             # Reference DDL (mirrors Alembic head)
requirements.txt
README.md

Notes & pitfalls

  • Large files: The JSON snapshot is ~5GB; PDFs stored in PostgreSQL will grow your database significantly (expect ~0.37MB average per paper)
  • Database storage: With BYTEA storage, 10k papers = ~3.7GB database growth. Plan your PostgreSQL storage accordingly
  • Socket vs TCP: Empty DB_HOST builds a Unix-socket URL; set DB_HOST=localhost if peer auth fails
  • Respect arXiv rate limits: Default delay is 3 seconds between downloads. Do not decrease below 2 seconds
  • Resume behavior: The unified script automatically resumes from where it left off. To reload existing papers, use --no-resume
  • Concurrent downloads: More workers = faster processing, but respect rate limits. Recommended: 4-8 workers
  • OCR models: The OCR worker expects models under ~/.paddlex/official_models by default; first run will download them
  • GPU memory: The OCR worker includes automatic memory cleanup and worker recycling for long runs

Storage comparison

  • Filesystem: ~3.7GB for 10k PDFs on disk
  • Database (BYTEA): ~3.7GB added to PostgreSQL database
  • Advantage of DB storage: Single source of truth, easier backups, no file path management, transactional consistency

License

Educational use only. See LICENSE.

About

Python data pipeline for arXiv metadata: SQLAlchemy + Alembic schema, PostgreSQL storage, PDF download tracking, and optional PaddleOCR processing.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors