banyan_extract is a Python module that prepares documents for use in GenAI and LLM applications.
Rather than reinventing the wheel, banyan_extract builds on existing state-of-the-art tools to provide this capability.
In a Python environment (conda, venv, etc.), use the following:
cd PATH_TO_REPO/
pip install banyan-extract
Or, to install from source:
git clone https://github.com/sandialabs/banyan-ingest.git
cd banyan-ingest/
pip install .
You will need poppler installed.
For the rotation detection functionality, you need Tesseract OCR (version 4.0 or higher recommended) installed on your system, along with the pytesseract package:
pip install pytesseract
Then install the Tesseract OCR binary:
- Linux (Ubuntu/Debian):
  sudo apt install tesseract-ocr
- Linux (Fedora/RHEL):
  sudo dnf install tesseract
- macOS:
  brew install tesseract
- Windows: Download from Tesseract GitHub
Note: Tesseract OCR is only required for automatic rotation detection. Manual rotation works without Tesseract.
Verify Installation: After installing, verify Tesseract is working:
import pytesseract
print(pytesseract.get_tesseract_version())
The default OCR backend for PPTX processing is now Nemotron (changed from Surya).
To use Nemotron OCR (default):
pip install .[nemotronparse]
To use Surya OCR:
pip install .[marker]
Currently we provide support for marker and NVIDIA's nemotron-parse models.
To install the necessary dependencies for these tools, please use pip install .[marker] or pip install .[nemotronparse], respectively.
Default OCR Backend: Nemotron is now the default OCR backend for PPTX processing (changed from Surya).
Note: Please ensure you follow the guidelines and usage licenses of these tools.
- Automatic rotation detection using Tesseract OCR's Orientation and Script Detection (OSD)
- Configurable confidence threshold for reliable results (default: 0.7)
- Graceful fallback to 0° rotation when Tesseract is not available
- Support for standard angles: 0°, 90°, 180°, and 270° detection
- Comprehensive error handling with detailed logging
Requirements: Tesseract OCR (version 4.0+) and pytesseract package for automatic detection.
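The behaviour listed above can be sketched with pytesseract's `image_to_osd` call. This is a minimal illustration, not banyan_extract's actual implementation: the function names `parse_osd`, `choose_rotation`, and `detect_rotation` are hypothetical, and the sketch assumes the 0.7 threshold is compared directly against the orientation confidence that Tesseract reports, which may differ from how the library scales that value.

```python
import logging
import re

try:
    import pytesseract  # optional: only needed for automatic detection
except ImportError:
    pytesseract = None

logger = logging.getLogger(__name__)

STANDARD_ANGLES = (0, 90, 180, 270)


def parse_osd(osd_text):
    """Pull (rotation angle, orientation confidence) out of Tesseract OSD text."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    conf = re.search(r"Orientation confidence:\s*(\d+(?:\.\d+)?)", osd_text)
    if rotate is None or conf is None:
        return None
    return int(rotate.group(1)), float(conf.group(1))


def choose_rotation(angle, confidence, threshold=0.7):
    """Accept the angle only if it is standard and confident enough, else 0."""
    if angle in STANDARD_ANGLES and confidence >= threshold:
        return angle
    return 0


def detect_rotation(image, threshold=0.7):
    """Return the detected rotation in degrees, falling back to 0 on any problem."""
    if pytesseract is None:
        logger.warning("pytesseract is not installed; assuming 0 degrees")
        return 0
    try:
        osd_text = pytesseract.image_to_osd(image)
    except Exception as exc:  # e.g. Tesseract binary missing or too little text
        logger.warning("Rotation detection failed (%s); assuming 0 degrees", exc)
        return 0
    parsed = parse_osd(osd_text)
    if parsed is None:
        logger.warning("Could not parse OSD output; assuming 0 degrees")
        return 0
    return choose_rotation(*parsed, threshold=threshold)
```

With Pillow installed, `detect_rotation(Image.open("page.png"))` would return one of 0, 90, 180, or 270; `page.png` is a placeholder file name. Keeping the threshold decision in its own function makes the fallback logic testable without a Tesseract installation.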
Copy the .env.example file, then change NEMOTRON_ENDPOINT to the endpoint of the Nemotron-parse model you want to use.
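As a rough illustration of what this configuration step provides, the sketch below reads NEMOTRON_ENDPOINT from simple KEY=VALUE text. How banyan_extract itself loads the file is not documented here, so the helpers `parse_env` and `get_nemotron_endpoint` are hypothetical, and the URL is a placeholder rather than a real endpoint.

```python
import os


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_nemotron_endpoint(env_text=""):
    """Prefer the process environment, then fall back to the parsed file text."""
    endpoint = os.environ.get("NEMOTRON_ENDPOINT")
    if not endpoint:
        endpoint = parse_env(env_text).get("NEMOTRON_ENDPOINT")
    if not endpoint:
        raise RuntimeError("NEMOTRON_ENDPOINT is not set")
    return endpoint


# Placeholder URL for illustration only; use your own deployment's endpoint.
example = "# Nemotron settings\nNEMOTRON_ENDPOINT=http://localhost:8000/v1\n"
print(parse_env(example)["NEMOTRON_ENDPOINT"])
```

Failing early with a clear error when the variable is missing tends to be easier to debug than a connection failure later in processing.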
The example_*.py scripts show basic processing of PDF documents using different OCR tools under the hood.
Use banyan-extract to run the tool from the command line. The following example reads a PDF named example.pdf and writes all the extracted content to a directory named banyan_output:
banyan-extract --backend nemoparse example.pdf banyan_output/
# Process PPTX with default Nemotron OCR backend
banyan-extract presentation.pptx output_dir/
# Process PPTX with Surya OCR backend (explicit)
banyan-extract presentation.pptx output_dir/ --pptx_ocr_backend surya
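For scripted use, the commands above can be assembled and launched from Python. This is a sketch under stated assumptions: `build_command` and `run_extract` are hypothetical helpers, and they use only the flags shown in the examples above (`--backend` and `--pptx_ocr_backend`), without assuming any other options exist.

```python
import shutil
import subprocess


def build_command(input_path, output_dir, backend=None, pptx_ocr_backend=None):
    """Assemble a banyan-extract argument list from the options shown above."""
    cmd = ["banyan-extract"]
    if backend is not None:
        cmd += ["--backend", backend]
    cmd += [input_path, output_dir]
    if pptx_ocr_backend is not None:
        cmd += ["--pptx_ocr_backend", pptx_ocr_backend]
    return cmd


def run_extract(input_path, output_dir, **options):
    """Run the CLI if it is on PATH; raise CalledProcessError on a non-zero exit."""
    if shutil.which("banyan-extract") is None:
        raise FileNotFoundError("banyan-extract is not installed in this environment")
    subprocess.run(build_command(input_path, output_dir, **options), check=True)


print(build_command("example.pdf", "banyan_output/", backend="nemoparse"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.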