SmartInvoice-ETL is a pipeline that extracts data from invoice PDFs using Azure Document Intelligence and stores it in a SQL Server database. It includes a simulation mode for development and testing without incurring Azure costs or calling external services.
- Azure Integration: Extracts key fields (Invoice Number, Date, Client, NIF, Amount) from PDF invoices.
- SQL Server Storage: Automatically inserts extracted data into a structured relational database.
- Simulation Mode: Generates realistic mock data using Faker to test the pipeline without Azure API calls.
- Robust Error Handling: Logging and error management for production reliability.
- Secure Configuration: Uses environment variables for sensitive credentials.
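
As a rough illustration of the record the features above describe, the sketch below models the five extracted fields and builds a mock record. It is not the project's code: the attribute names, the client names, and the "8 digits plus a letter" NIF shape are assumptions, and the standard-library `random` module stands in for Faker so the example has no dependencies.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class InvoiceRecord:
    """The five extracted fields (attribute names are hypothetical)."""
    invoice_number: str
    invoice_date: date
    client: str
    nif: str
    amount: float


def make_mock_invoice(rng: random.Random) -> InvoiceRecord:
    """Build one plausible-looking record; a stand-in for a Faker-based generator."""
    return InvoiceRecord(
        invoice_number=f"INV-{rng.randint(1, 99999):05d}",
        invoice_date=date(2024, 1, 1) + timedelta(days=rng.randint(0, 364)),
        client=rng.choice(["Acme S.L.", "Globex S.A.", "Initech S.L."]),
        # Assumed shape only (8 digits + letter); the letter is not a valid check letter.
        nif=f"{rng.randint(0, 99999999):08d}{rng.choice('ABCDEFGH')}",
        amount=round(rng.uniform(10.0, 5000.0), 2),
    )


if __name__ == "__main__":
    # A seeded generator makes the mock data reproducible between runs.
    print(make_mock_invoice(random.Random(42)))
```

Passing a seeded `random.Random` keeps test runs repeatable, which is useful when comparing pipeline output across changes.
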
- Python 3.8+
- SQL Server (Express or Standard)
- ODBC Driver 17 for SQL Server
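
The driver listed above is typically referenced by name in an ODBC connection string. The following is a minimal sketch of how such a string might be assembled. The helper name `build_connection_string` is hypothetical, and Windows (trusted) authentication is an assumption; the project may connect differently.

```python
def build_connection_string(server: str, database: str) -> str:
    """Assemble an ODBC connection string (assumes Windows authentication)."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )


# A raw string keeps the backslash in the instance name intact.
conn_str = build_connection_string(r"localhost\SQLEXPRESS", "facturas")
print(conn_str)
# With pyodbc installed, the string would be passed to pyodbc.connect(conn_str).
```
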
- Clone the repository:

      git clone https://github.com/yourusername/SmartInvoice-ETL.git
      cd SmartInvoice-ETL

- Create a virtual environment:

      python -m venv venv
      .\venv\Scripts\activate      # Windows
      # source venv/bin/activate   # Linux/Mac

- Install dependencies:

      pip install -r requirements.txt

- Configure Environment: Create a .env file based on .env.example:

      AZURE_ENDPOINT="your_endpoint"
      AZURE_KEY="your_key"
      SQL_SERVER="localhost\SQLEXPRESS"
      SQL_DB="facturas"
      SIMULATE_DATA="True"  # Set to False to use real Azure extraction

- Initialize Database:

      python src/setup_db.py
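
The environment variables from the configuration step could be read along the following lines. This is a sketch rather than the contents of src/config.py: the `Settings` class, `load_settings`, `parse_bool`, and the default values are all assumptions. It also assumes the .env file has already been loaded into the process environment, for example with python-dotenv's `load_dotenv()`.

```python
import os
from dataclasses import dataclass


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    azure_endpoint: str
    azure_key: str
    sql_server: str
    sql_db: str
    simulate_data: bool


def load_settings(env=os.environ) -> Settings:
    """Read settings from a mapping of environment variables."""
    return Settings(
        azure_endpoint=env.get("AZURE_ENDPOINT", ""),
        azure_key=env.get("AZURE_KEY", ""),
        sql_server=env.get("SQL_SERVER", r"localhost\SQLEXPRESS"),  # assumed default
        sql_db=env.get("SQL_DB", "facturas"),                       # assumed default
        simulate_data=parse_bool(env.get("SIMULATE_DATA", "True")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Parsing the flag explicitly matters because environment variables are always strings, and `bool("False")` evaluates to `True` in Python.
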
Place PDF invoices in data/input and run:

    python src/main.py

Processed files will move to data/processed, and errors to data/error.
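
The routing described above, where each file ends up in a folder according to its outcome, can be sketched as follows. The function `route_invoices` and its `process` callback are hypothetical names used for illustration; the actual logic in src/main.py may differ.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def route_invoices(base: Path, process: Callable[[Path], None]) -> dict:
    """Run `process` on each PDF in base/input, then move it by outcome."""
    input_dir = base / "input"
    processed_dir = base / "processed"
    error_dir = base / "error"
    for directory in (input_dir, processed_dir, error_dir):
        directory.mkdir(parents=True, exist_ok=True)

    counts = {"processed": 0, "error": 0}
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            process(pdf)
            target = processed_dir
            counts["processed"] += 1
        except Exception:
            # Record the traceback and keep going with the remaining files.
            logger.exception("Failed to process %s", pdf.name)
            target = error_dir
            counts["error"] += 1
        shutil.move(str(pdf), str(target / pdf.name))
    return counts
```

Catching the exception per file means one bad invoice does not stop the batch, and the failed file is kept in data/error for inspection.
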
To insert simulated data directly without files:

    python src/insert_mock_data.py

SmartInvoice-ETL/
├── data/ # Input, processed, and error directories
├── logs/ # Execution logs
├── sql/ # SQL scripts for schema creation
├── src/
│ ├── main.py # Main ETL pipeline
│ ├── config.py # Configuration management
│ ├── utils.py # Helper functions
│ ├── setup_db.py # Database initialization script
│ └── mock_data.py # Mock data generator
├── .env.example # Template for environment variables
├── requirements.txt # Python dependencies
└── README.md # Project documentation
MIT