ArXiv Database Project

Python tooling for loading the arXiv metadata snapshot into PostgreSQL, storing PDFs directly in the database, and running a PaddleOCR-VL pipeline (worker + visualization).

What you can do

Process arXiv papers end-to-end with a unified script (scripts/process_arxiv.py)
- Load metadata from JSONL into PostgreSQL
- Download PDFs directly into the database (BYTEA storage)
- Concurrent downloads with rate limiting
- Resume from where you left off automatically
Run a unified PaddleOCR-VL worker (GPU or CPU) with concurrent model workers (scripts/ocr_worker.py)
Extract text from PDFs with OCR and store results directly in the database
Visualize OCR layout boxes and extracted text side-by-side (scripts/visualize_ocr.py)
Quickly explore the metadata snapshot and plot year/version trends (fast_analysis.py)

Requirements

Python 3.8+
PostgreSQL 12+
~5GB disk for the metadata JSON
Database storage for PDFs (expect ~0.37MB average per paper)
Optional GPU for PaddleOCR-VL (install paddlepaddle-gpu instead of paddlepaddle)

Install dependencies:

python -m venv .venv && source .venv/bin/activate  # optional but recommended
pip install -r requirements.txt

requests, PyPDF2, pdf2image, reportlab, and PaddleOCR packages are needed for the OCR pipeline.

Recommendation

Use uv package manager for faster and more reliable installation.

Configuration

Set environment variables directly or create a .env file (loaded by scripts via python-dotenv):

Variable	Description	Default
`SQLALCHEMY_URL`	Full DB URL (overrides other DB vars)	`postgresql://user@/arxiv`
`DB_HOST`	Postgres host (`""` for Unix socket)	`""`
`DB_PORT`	Postgres port	`5432`
`DB_NAME`	Database name	`arxiv`
`DB_USER`	Database user	`postgres`
`DB_PASSWORD`	Database password	`""`
`ARXIV_DATA_PATH`	Path to the metadata JSONL snapshot	`./arxiv-metadata-oai-snapshot.json`
`DOWNLOAD_DELAY`	Seconds to sleep between PDF downloads	`3.0`

Note: PDFs are now stored directly in PostgreSQL as binary data (BYTEA). PDF_BASE_PATH is only used by legacy scripts.

Unix socket example:

DB_HOST=
SQLALCHEMY_URL=postgresql://your_user@/arxiv

TCP example:

DB_HOST=localhost
SQLALCHEMY_URL=postgresql://your_user:password@localhost:5432/arxiv

Database setup

createdb arxiv          # once
alembic upgrade head    # apply migrations (001, 002, 003, 004)

Migrations:

001: Initial schema (papers, categories, relationships)
002: PDF download tracking columns
003: PDF binary storage (BYTEA column for storing PDFs in database)
004: OCR results storage (JSONB column + processing status tracking)

Unified Processing Pipeline (Recommended)

The process_arxiv.py script is the main entry point that handles metadata loading and PDF downloading in one go.

Starting from scratch (empty database)

# Process first 5000 papers (metadata + PDFs)
python scripts/process_arxiv.py --limit 5000 --workers 4

Resume from where you left off

# Continue processing 5000 MORE papers (auto-resumes)
python scripts/process_arxiv.py --limit 5000 --workers 4

The script automatically:

Checks what papers are already in the database
Skips those papers when reading the JSONL
Loads only NEW papers
Downloads PDFs for the new papers

Backfill missing PDFs

If you have papers in the database but some don't have PDFs:

# Check how many papers need PDFs
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE pdf_content IS NULL;"

# Download PDFs for papers that don't have them
python scripts/process_arxiv.py --skip-metadata --limit 3000 --workers 4

By default, backfill mode prioritizes papers that have never been attempted and skips rows that already have a recorded download error. To retry failed downloads explicitly:

python scripts/process_arxiv.py --skip-metadata --retry-download-errors --limit 3000 --workers 4

Common scenarios

Scenario 1: You have 4056 papers in DB, only 1294 have PDFs

# First, backfill the 2762 missing PDFs
python scripts/process_arxiv.py --skip-metadata --limit 2762 --workers 4

# Then continue loading new papers
python scripts/process_arxiv.py --limit 5000 --workers 4

Scenario 2: You want to load metadata only (no PDF downloads)

python scripts/process_arxiv.py --limit 10000 --skip-download

Scenario 3: Process everything with 8 concurrent workers

python scripts/process_arxiv.py --limit 10000 --workers 8 --delay 2.0

Pipeline features

Automatic resume: Skips papers already in database (use --no-resume to disable)
Concurrent downloads: Multi-threaded PDF downloads (default: 4 workers)
Rate limiting: Respects arXiv terms of service with configurable delays
Error handling: Tracks failed downloads with error messages for retry
Backfill prioritization: Avoids starving never-attempted papers behind permanent download failures
Progress tracking: Real-time progress bars for metadata and PDFs
Status syncing: Automatically ensures pdf_downloaded flag matches pdf_content presence
Database storage: PDFs stored as binary data (BYTEA) directly in PostgreSQL

Check your progress

# Total papers
psql -d arxiv -c "SELECT COUNT(*) as total FROM arxiv_papers;"

# Papers with PDFs
psql -d arxiv -c "SELECT COUNT(*) as with_pdfs FROM arxiv_papers WHERE pdf_content IS NOT NULL;"

# Papers needing PDFs
psql -d arxiv -c "SELECT COUNT(*) as need_pdfs FROM arxiv_papers WHERE pdf_content IS NULL;"

# Failed downloads
psql -d arxiv -c "SELECT COUNT(*) as failed FROM arxiv_papers WHERE pdf_download_error IS NOT NULL;"

# OCR progress
psql -d arxiv -c "SELECT COUNT(*) as ocr_done FROM arxiv_papers WHERE ocr_processed = true;"

Legacy/Alternative Scripts

The following individual scripts are still available if you need fine-grained control:

Load metadata only

python scripts/load_arxiv_data.py --max-records 1000

Download PDFs to filesystem

python scripts/download_pdfs.py --limit 200 --delay 3

Load filesystem PDFs into database

python scripts/load_pdfs_to_db.py --limit 1000

Sync status flags

python scripts/sync_pdf_status.py

OCR Worker (Database Storage)

Extract text from PDFs and store results directly in the database:

# Process 100 papers with 1 model worker on GPU 0
python scripts/ocr_worker.py --limit 100 --device gpu:0 --workers 1

# CPU fallback
python scripts/ocr_worker.py --limit 20 --device cpu --workers 1

# Limit pages per PDF
python scripts/ocr_worker.py --limit 100 --pages 10

# Cap rendered pages to avoid OOM on huge PDFs
python scripts/ocr_worker.py --limit 100 --render-page-limit 100

# Tune throughput and memory pressure
python scripts/ocr_worker.py --limit 100 --page-batch-size 4 --render-workers 2

# Retry previously failed papers
python scripts/ocr_worker.py --limit 100 --retry-errors

Worker features

Database storage: OCR results saved as JSONB directly in PostgreSQL
Process isolation: Model inference runs in dedicated worker processes
Bounded pipeline controls: Page limits, batch sizing, and render-worker controls
Resume support: Continues from previously stored partial OCR results
Progress/error tracking: Uses ocr_processed, ocr_processed_at, and ocr_error

OCR database columns

Column	Type	Description
`ocr_results`	JSONB	OCR output (layout boxes, text content, metadata)
`ocr_processed`	BOOLEAN	True when all pages completed
`ocr_processed_at`	TIMESTAMP	When OCR finished
`ocr_error`	TEXT	Error message if processing failed

Check OCR progress

# Papers with OCR results
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_results IS NOT NULL;"

# Fully processed
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_processed = true;"

# Failed OCR
psql -d arxiv -c "SELECT COUNT(*) FROM arxiv_papers WHERE ocr_error IS NOT NULL;"

Visualize OCR Output

View OCR results with bounding boxes and extracted text side-by-side:

# List papers with OCR results
python scripts/visualize_ocr.py --list

# Visualize a specific paper (pulls from database)
python scripts/visualize_ocr.py --paper-id 0704.0001 --save

# Specific page
python scripts/visualize_ocr.py --paper-id 0704.0001 --page 2 --save

# All pages
python scripts/visualize_ocr.py --paper-id 0704.0001 --all-pages --save

# Custom text panel size and font
python scripts/visualize_ocr.py --paper-id 0704.0001 --save --text-width 800 --font-size 20

Visualization features

Pulls PDF and OCR results directly from database (no local files needed)
Side-by-side view: annotated PDF page + extracted text panel
Color-coded bounding boxes by label type (title, text, figure, table, etc.)
Numbered boxes matching text entries for easy reference
Configurable text panel width and font size

Quick metadata analysis

python fast_analysis.py ./arxiv-metadata-oai-snapshot.json --cpu_count 8

Runs a multiprocessing scan of the JSONL file and saves outputs/arxiv_analysis_fast.png with year/version histograms.

Project layout

alembic/
  env.py
  versions/
    001_initial_schema.py
    002_add_pdf_download_tracking.py
    003_add_pdf_binary_storage.py
    004_add_ocr_results_storage.py
database/
  database.py          # URL builder, engine/session manager
  models.py            # ORM models + indexes
scripts/
  process_arxiv.py     # Unified pipeline: metadata + PDF downloads (RECOMMENDED)
  load_arxiv_data.py   # Legacy: JSONL -> Postgres loader
  download_pdfs.py     # Legacy: Download PDFs to filesystem
  load_pdfs_to_db.py   # Legacy: Load filesystem PDFs into database
  sync_pdf_status.py   # Legacy: Reconcile pdf_downloaded flags
  ocr_worker.py        # Unified OCR worker (render + inference + DB writes)
  visualize_ocr.py     # Visualize OCR results from database
  compute_perplexity.py# Evaluate OCR text quality via perplexity
fast_analysis.py       # Parallel metadata explorer
schema.sql             # Reference DDL (mirrors Alembic head)
requirements.txt
README.md

Notes & pitfalls

Large files: The JSON snapshot is ~5GB; PDFs stored in PostgreSQL will grow your database significantly (expect ~0.37MB average per paper)
Database storage: With BYTEA storage, 10k papers = ~3.7GB database growth. Plan your PostgreSQL storage accordingly
Socket vs TCP: Empty DB_HOST builds a Unix-socket URL; set DB_HOST=localhost if peer auth fails
Respect arXiv rate limits: Default delay is 3 seconds between downloads. Do not decrease below 2 seconds
Resume behavior: The unified script automatically resumes from where it left off. To reload existing papers, use --no-resume
Concurrent downloads: More workers = faster processing, but respect rate limits. Recommended: 4-8 workers
OCR models: The OCR worker expects models under ~/.paddlex/official_models by default; first run will download them
GPU memory: The OCR worker includes automatic memory cleanup and worker recycling for long runs

Storage comparison

Filesystem: ~3.7GB for 10k PDFs on disk
Database (BYTEA): ~3.7GB added to PostgreSQL database
Advantage of DB storage: Single source of truth, easier backups, no file path management, transactional consistency

License

Educational use only. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
alembic		alembic
database		database
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
alembic.ini		alembic.ini
fast_analysis.py		fast_analysis.py
requirements.txt		requirements.txt
schema.sql		schema.sql

Folders and files

Latest commit

History

Repository files navigation

ArXiv Database Project

What you can do

Requirements

Recommendation

Configuration

Database setup

Unified Processing Pipeline (Recommended)

Starting from scratch (empty database)

Resume from where you left off

Backfill missing PDFs

Common scenarios

Pipeline features

Check your progress

Legacy/Alternative Scripts

Load metadata only

Download PDFs to filesystem

Load filesystem PDFs into database

Sync status flags

OCR Worker (Database Storage)

Worker features

OCR database columns

Check OCR progress

Visualize OCR Output

Visualization features

Quick metadata analysis

Project layout

Notes & pitfalls

Storage comparison

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages