Skip to content

sandialabs/banyan-ingest

Repository files navigation

Banyan Extract

banyan_extract is a python module that prepares documents for use in GenAI and LLM applications.

Rather than re-invent the wheel, banyan_extract aims to utilize state-of-the-art tools to provide this capability.

Installation

From PyPI (recommended)

In a Python environment (conda, venv, etc.), use the following:

cd PATH_TO_REPO/
pip install banyan-extract

From source

git clone https://github.com/sandialabs/banyan-ingest.git
cd banyan-ingest/
pip install .

Additional Dependecies

You will need poppler installed.

Rottaion Detection

For the rotation detection functionality, you need Tesseract OCR (version 4.0 or higher recommended) installed on your system

pip install pytesseract

Then install the Tesseract OCR binary:

  • Linux (Ubuntu/Debian): sudo apt install tesseract-ocr
  • Linux (Fedora/RHEL): sudo dnf install tesseract
  • macOS: brew install tesseract
  • Windows: Download from Tesseract GitHub

Note: Tesseract OCR is only required for automatic rotation detection. Manual rotation works without Tesseract.

Verify Installation: After installing, verify Tesseract is working:

import pytesseract
print(pytesseract.get_tesseract_version())

OCR Backend Dependencies

The default OCR backend for PPTX processing is now Nemotron (changed from Surya).

To use Nemotron OCR (default):

pip install .[nemotronparse]

To use Surya OCR:

pip install .[marker]

Supported Tools and File Formats

Currently we provide support for marker (link here) and NVIDIA's nemotron-parse models (link here).

To install the necessary dependencies for these tools please use pip install .[marker] or pip install .[nemotronparse] respectively.

Default OCR Backend: Nemotron is now the default OCR backend for PPTX processing (changed from Surya).

Note: please ensure you follow the guidelines and usage licenses of the tools.

Features

Tesseract OSD Rotation Detection

  • Automatic rotation detection using Tesseract OCR's Orientation and Script Detection (OSD)
  • Configurable confidence threshold for reliable results (default: 0.7)
  • Graceful fallback to 0° rotation when Tesseract is not available
  • Support for standard angles: 0°, 90°, 180°, and 270° detection
  • Comprehensive error handling with detailed logging

Requirements: Tesseract OCR (version 4.0+) and pytesseract package for automatic detection.

Using Nemotron-parse

Copy the .env.example file change NEMOTRON_ENDPOINT to the endpoint of the Nemotron-parse model you want to use.

Examples

The example_*.py scripts contain basic scripts for processing PDF documents using different OCR tools under the hood.

CLI Usage

Use banyan-extract to run the tool from the command line. Example command that reads in a PDF named example.pdf and puts all the extracted content in a directory named banyan_output:

banyan-extract --backend nemoparse example.pdf banyan_output/

PPTX Processing with Default Nemotron OCR

# Process PPTX with default Nemotron OCR backend
banyan-extract presentation.pptx output_dir/

# Process PPTX with Surya OCR backend (explicit)
banyan-extract presentation.pptx output_dir/ --pptx_ocr_backend surya

About

BANYAN-ingest is a python module that focuses on extracting information rich features from scientific and technical documents. Rather than re-invent the wheel, BANYAN-ingest is a toolbox of state-of-the-art document processing tools and techniques. The goal of BANYAN-ingest is to provide an easy-to-use interface and standardized outputs. 

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages