boorjanunezz/SmartInvoice-ETL

SmartInvoice-ETL

SmartInvoice-ETL is a pipeline that extracts data from invoice PDFs with Azure Document Intelligence and stores it in a SQL Server database. A simulation mode lets you develop and test without incurring Azure costs or depending on external services.

Features

  • Azure Integration: Extracts key fields (Invoice Number, Date, Client, NIF, Amount) from PDF invoices.
  • SQL Server Storage: Automatically inserts extracted data into a structured relational database.
  • Simulation Mode: Generates realistic mock data using Faker to test the pipeline without Azure API calls.
  • Error Handling: Writes execution logs to logs/ and moves invoices that fail processing to data/error.
  • Secure Configuration: Uses environment variables for sensitive credentials.
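As a rough illustration of how the extracted fields could map onto a database row, the sketch below models one invoice as a dataclass and pairs it with a parameterized INSERT statement. The table name `invoices`, the column names, and the `to_insert` helper are hypothetical; the actual schema is defined by the scripts in sql/.

```python
from dataclasses import dataclass, astuple
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    """The five fields listed under Features, one attribute each."""
    invoice_number: str
    invoice_date: date
    client: str
    nif: str
    amount: Decimal


# Hypothetical table and column names; the real schema lives in sql/.
INSERT_SQL = (
    "INSERT INTO invoices (invoice_number, invoice_date, client, nif, amount) "
    "VALUES (?, ?, ?, ?, ?)"
)


def to_insert(invoice: Invoice):
    """Return the statement and its parameters in column order."""
    return INSERT_SQL, astuple(invoice)


sql, params = to_insert(
    Invoice("INV-0001", date(2024, 1, 15), "Acme S.L.", "B12345678", Decimal("121.00"))
)
```

The `?` placeholders follow the ODBC parameter style, so the pair could be passed to a cursor's `execute(sql, params)` rather than formatting values into the SQL string.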

Prerequisites

  • Python 3.8+
  • SQL Server (Express or Standard)
  • ODBC Driver 17 for SQL Server

Setup

  1. Clone the repository:

    git clone https://github.com/boorjanunezz/SmartInvoice-ETL.git
    cd SmartInvoice-ETL
  2. Create a virtual environment:

    python -m venv venv
    .\venv\Scripts\activate  # Windows
    # source venv/bin/activate  # Linux/Mac
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure Environment: Create a .env file based on .env.example:

    AZURE_ENDPOINT="your_endpoint"
    AZURE_KEY="your_key"
    SQL_SERVER="localhost\SQLEXPRESS"
    SQL_DB="facturas"
    SIMULATE_DATA="True"  # Set to False to use real Azure extraction
  5. Initialize Database:

    python src/setup_db.py
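The variables from step 4 might be consumed along the lines of the sketch below. It is not the project's config.py: the helper names `parse_bool`, `load_settings`, and `connection_string` are hypothetical, the values are assumed to be already present in the process environment, and Windows authentication (`Trusted_Connection=yes`) is an assumption since the .env template lists no SQL user or password.

```python
import os


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ) -> dict:
    """Collect the variables from step 4 into a plain dict."""
    return {
        "azure_endpoint": env.get("AZURE_ENDPOINT", ""),
        "azure_key": env.get("AZURE_KEY", ""),
        "sql_server": env.get("SQL_SERVER", r"localhost\SQLEXPRESS"),
        "sql_db": env.get("SQL_DB", "facturas"),
        "simulate": parse_bool(env.get("SIMULATE_DATA", "True")),
    }


def connection_string(settings: dict) -> str:
    """Build an ODBC Driver 17 connection string (Windows auth is assumed)."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={settings['sql_server']};"
        f"DATABASE={settings['sql_db']};"
        "Trusted_Connection=yes;"
    )
```

Parsing SIMULATE_DATA explicitly matters because environment values are always strings, and the string "False" is truthy in Python.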

Usage

Run the Pipeline

Place PDF invoices in data/input and run:

python src/main.py

Processed files will move to data/processed, and errors to data/error.
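The routing described above can be sketched as follows. This is a minimal illustration, not the code in main.py: `route_invoices` and its `process` callback are hypothetical names, and treating any exception as a failure is an assumption.

```python
import shutil
from pathlib import Path


def route_invoices(base_dir, process):
    """Run `process` on every PDF in input/ and move it by outcome."""
    base = Path(base_dir)
    processed, error = base / "processed", base / "error"
    processed.mkdir(parents=True, exist_ok=True)
    error.mkdir(parents=True, exist_ok=True)

    results = {}
    for pdf in sorted((base / "input").glob("*.pdf")):
        try:
            process(pdf)
            target = processed
        except Exception:
            # Any failure sends the file to error/ so the run can continue.
            target = error
        shutil.move(str(pdf), str(target / pdf.name))
        results[pdf.name] = target.name
    return results
```

Calling `route_invoices("data", extract_and_store)`, where `extract_and_store` stands in for the extraction and insert step, would leave data/input empty after a run, with each file in either data/processed or data/error.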

Test Data Simulation

To insert simulated data directly without files:

python src/insert_mock_data.py
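To give a feel for what simulated records might look like, the sketch below generates invoices with the same five fields as a real extraction. It deliberately swaps Faker for the standard library's `random` module so that it runs without extra dependencies; the function name `mock_invoice`, the client names, and the value ranges are all hypothetical.

```python
import random
from datetime import date, timedelta
from decimal import Decimal


def mock_invoice(rng: random.Random) -> dict:
    """Build one fake invoice with the same five fields as a real extraction."""
    cents = rng.randint(1_000, 500_000)  # 10.00 to 5000.00
    return {
        "invoice_number": f"INV-{rng.randint(1, 99_999):05d}",
        "invoice_date": date(2024, 1, 1) + timedelta(days=rng.randint(0, 364)),
        "client": rng.choice(["Acme S.L.", "Globex S.A.", "Initech S.L."]),
        # Placeholder shaped like a letter plus eight digits; not a validated NIF.
        "nif": rng.choice("ABCDEFGH") + f"{rng.randint(0, 99_999_999):08d}",
        "amount": Decimal(cents) / 100,
    }


rng = random.Random(42)  # fixed seed so test runs are repeatable
batch = [mock_invoice(rng) for _ in range(3)]
```

Passing a seeded `random.Random` instance keeps the generated data reproducible across runs, which makes database assertions in tests easier to write.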

Project Structure

SmartInvoice-ETL/
├── data/               # Input, processed, and error directories
├── logs/               # Execution logs
├── sql/                # SQL scripts for schema creation
├── src/
│   ├── main.py         # Main ETL pipeline
│   ├── config.py       # Configuration management
│   ├── utils.py        # Helper functions
│   ├── setup_db.py     # Database initialization script
│   └── mock_data.py    # Mock data generator
├── .env.example        # Template for environment variables
├── requirements.txt    # Python dependencies
└── README.md           # Project documentation

License

MIT

About

An intelligent solution for digitizing and managing invoices. It uses AI to transform unstructured PDF documents into processable SQL data, streamlining the financial workflow.