Banyan Extract

banyan_extract is a Python module that prepares documents for use in GenAI and LLM applications.

Rather than reinventing the wheel, banyan_extract builds on existing state-of-the-art tools to provide this capability.

Installation

From PyPI (recommended)

In a Python environment (conda, venv, etc.), use the following:

pip install banyan-extract

From source

git clone https://github.com/sandialabs/banyan-ingest.git
cd banyan-ingest/
pip install .

Additional Dependencies

You will need poppler installed.

Rotation Detection

The rotation detection functionality requires Tesseract OCR (version 4.0 or higher recommended) installed on your system, along with the pytesseract Python package:

pip install pytesseract

Then install the Tesseract OCR binary:

Linux (Ubuntu/Debian): sudo apt install tesseract-ocr
Linux (Fedora/RHEL): sudo dnf install tesseract
macOS: brew install tesseract
Windows: Download from Tesseract GitHub

Note: Tesseract OCR is only required for automatic rotation detection. Manual rotation works without Tesseract.

Verify Installation: After installing, verify Tesseract is working:

import pytesseract
print(pytesseract.get_tesseract_version())
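
The check above raises an exception when pytesseract or the Tesseract binary is missing. A slightly more defensive variant might look like the sketch below. The function name tesseract_status is hypothetical and not part of banyan_extract; it relies only on pytesseract.get_tesseract_version().

```python
def tesseract_status():
    """Return a human-readable description of the local Tesseract setup."""
    try:
        import pytesseract  # Python wrapper; installed with `pip install pytesseract`
    except ImportError:
        return "pytesseract is not installed"
    try:
        # Asks the Tesseract binary for its version; fails if the binary is missing.
        version = pytesseract.get_tesseract_version()
    except Exception as exc:
        return f"Tesseract binary is not usable: {exc}"
    return f"Tesseract {version} is available"


print(tesseract_status())
```

If the message reports a missing binary, automatic rotation detection will not be available, but manual rotation should still work as noted above.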

OCR Backend Dependencies

The default OCR backend for PPTX processing is now Nemotron (changed from Surya).

To use Nemotron OCR (default):

pip install .[nemotronparse]

To use Surya OCR:

pip install .[marker]

Supported Tools and File Formats

Currently we support marker and NVIDIA's nemotron-parse models.

To install the necessary dependencies for these tools, use pip install .[marker] or pip install .[nemotronparse], respectively.

Default OCR Backend: Nemotron is now the default OCR backend for PPTX processing (changed from Surya).

Note: Please follow the usage guidelines and licenses of these tools.

Features

Tesseract OSD Rotation Detection

Automatic rotation detection using Tesseract OCR's Orientation and Script Detection (OSD)
Configurable confidence threshold for reliable results (default: 0.7)
Graceful fallback to 0° rotation when Tesseract is not available
Support for standard angles: 0°, 90°, 180°, and 270° detection
Comprehensive error handling with detailed logging

Requirements: Tesseract OCR (version 4.0+) and pytesseract package for automatic detection.
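
The behavior listed above (OSD detection, a confidence threshold, and a fallback to 0°) can be illustrated with a minimal sketch. This is not banyan_extract's implementation: the function names are hypothetical, and the sketch assumes the threshold is compared directly against the "Orientation confidence" value that Tesseract reports, since this README does not say whether that value is rescaled. The only library call used is pytesseract.image_to_osd(), which returns Tesseract's OSD report as text.

```python
import re

STANDARD_ANGLES = (0, 90, 180, 270)


def parse_osd(osd_text):
    """Extract (rotate_angle, orientation_confidence) from Tesseract OSD text."""
    angle = re.search(r"Rotate:\s*(\d+)", osd_text)
    confidence = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    if angle is None or confidence is None:
        return None
    return int(angle.group(1)), float(confidence.group(1))


def choose_rotation(osd_text, min_confidence=0.7):
    """Return the detected angle, or 0 if the report is unusable or uncertain."""
    parsed = parse_osd(osd_text)
    if parsed is None:
        return 0
    angle, confidence = parsed
    if angle not in STANDARD_ANGLES or confidence < min_confidence:
        return 0
    return angle


def detect_rotation(image, min_confidence=0.7):
    """Run OSD on an image; fall back to 0 if Tesseract is unavailable or fails."""
    try:
        import pytesseract
        osd_text = pytesseract.image_to_osd(image)
    except Exception:
        # Covers a missing pytesseract package, a missing binary, and OSD errors.
        return 0
    return choose_rotation(osd_text, min_confidence)
```

Keeping the parsing and thresholding separate from the Tesseract call makes the decision logic easy to test without the binary installed.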

Using Nemotron-parse

Copy the .env.example file and change NEMOTRON_ENDPOINT to the endpoint of the Nemotron-parse model you want to use.
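
To confirm that the endpoint is set before running an extraction, a small check such as the following can help. It is only a sketch: the helper names are hypothetical, the file name .env for the copied file is an assumption, and how banyan_extract itself loads this setting is not described in this README.

```python
import os


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def nemotron_endpoint(env_path=".env"):
    """Return NEMOTRON_ENDPOINT from the environment, else from the env file."""
    if "NEMOTRON_ENDPOINT" in os.environ:
        return os.environ["NEMOTRON_ENDPOINT"]
    if os.path.exists(env_path):
        return read_env_file(env_path).get("NEMOTRON_ENDPOINT")
    return None
```

A return value of None indicates that the endpoint has not been configured in either place.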

Examples

The example_*.py scripts show basic usage for processing PDF documents with different OCR tools under the hood.

CLI Usage

Use banyan-extract to run the tool from the command line. The following command reads a PDF named example.pdf and writes all extracted content to a directory named banyan_output:

banyan-extract --backend nemoparse example.pdf banyan_output/
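
To process several PDFs, the same command can be driven from Python. The sketch below is an illustration, not part of the package: build_command and extract_all are hypothetical helpers, the one-output-directory-per-PDF layout is an assumption, and only the arguments shown in the command above are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir, backend="nemoparse"):
    """Build the argument list for one banyan-extract invocation."""
    return ["banyan-extract", "--backend", backend, str(pdf_path), str(output_dir)]


def extract_all(input_dir, output_root, backend="nemoparse"):
    """Run banyan-extract once per PDF found directly inside input_dir."""
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        # Assumption: each document gets its own output directory, named after the file.
        output_dir = Path(output_root) / pdf_path.stem
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(build_command(pdf_path, output_dir, backend), check=True)
```

Calling extract_all("pdfs", "banyan_output") would then process every PDF in a directory named pdfs, stopping at the first failure.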

PPTX Processing with Default Nemotron OCR

# Process PPTX with default Nemotron OCR backend
banyan-extract presentation.pptx output_dir/

# Process PPTX with Surya OCR backend (explicit)
banyan-extract presentation.pptx output_dir/ --pptx_ocr_backend surya
