Configuration Guide¶
Complete guide to configuring the Workflow Bank Statement Separator for your environment.
Configuration Overview¶
The system uses environment variables for configuration, managed through a .env file. With 70+ configuration options, you can customize every aspect of the processing pipeline.
Interactive Configuration Help
Use the new env-help command for comprehensive environment variable documentation:
# Show all environment variables with descriptions and defaults
uv run bank-statement-separator env-help
# Filter by category for focused help
uv run bank-statement-separator env-help --category llm
uv run bank-statement-separator env-help --category processing
uv run bank-statement-separator env-help --category paperless
Configuration Template
Copy .env.example to .env to get started with default values and comprehensive documentation for all options.
Core Configuration¶
Required Variables¶
# LLM Provider Selection
LLM_PROVIDER=openai # openai, ollama, auto
# OpenAI Configuration (if using openai provider)
OPENAI_API_KEY=sk-your-api-key-here # Optional - fallback available
Flexible LLM Support
The system supports multiple LLM providers:
- OpenAI: Cloud-based AI with high accuracy (~95%)
- Ollama: Local AI processing for privacy and cost savings
- Fallback: Pattern-matching without AI (~85% accuracy)
No API key required when using Ollama or fallback mode.
Essential Settings¶
# LLM Provider Settings
LLM_PROVIDER=openai # Provider selection
OPENAI_MODEL=gpt-4o-mini # OpenAI model selection
LLM_TEMPERATURE=0 # Deterministic output
# Processing Settings
DEFAULT_OUTPUT_DIR=./separated_statements # Output location
LOG_LEVEL=INFO # Logging verbosity
# Security
MAX_FILE_SIZE_MB=100 # File size limit
ENABLE_AUDIT_LOGGING=true # Compliance logging
Complete Configuration Reference¶
LLM Provider Configuration¶
The system supports multiple LLM providers through a flexible abstraction layer:
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `openai` | Provider: openai, ollama, auto |
| `LLM_FALLBACK_ENABLED` | `true` | Enable fallback to pattern matching |
| `LLM_TEMPERATURE` | `0` | Model creativity (0-1) |
| `LLM_MAX_TOKENS` | `4000` | Maximum tokens per API call |
OpenAI Configuration¶
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | None | OpenAI API key for AI analysis |
| `OPENAI_MODEL` | `gpt-4o-mini` | Model: gpt-4o-mini, gpt-4o, gpt-3.5-turbo |
Ollama Configuration¶
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama3.2` | Local model name |
Provider Selection
- openai: Use OpenAI cloud models (requires API key)
- ollama: Use local Ollama models (privacy-focused, no API costs)
- auto: Automatically select best available provider
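The provider options above can be pictured as a small decision function. This is only a sketch under stated assumptions: `choose_provider` and the `ollama_available` callback are hypothetical names, and the ordering used for `auto` (OpenAI when a key is present, then Ollama, then pattern matching) is an assumption rather than documented behaviour.

```python
from typing import Callable, Optional


def choose_provider(
    llm_provider: str,
    openai_api_key: Optional[str],
    ollama_available: Callable[[], bool],
    fallback_enabled: bool = True,
) -> str:
    """Return 'openai', 'ollama', or 'fallback' for the given settings.

    Hypothetical helper: the real selection logic may differ.
    """
    provider = llm_provider.strip().lower()
    if provider == "openai":
        if openai_api_key:
            return "openai"
    elif provider == "ollama":
        if ollama_available():
            return "ollama"
    elif provider == "auto":
        # Assumed preference order for "auto": cloud first, then local.
        if openai_api_key:
            return "openai"
        if ollama_available():
            return "ollama"
    else:
        raise ValueError(f"Unknown LLM_PROVIDER: {llm_provider!r}")

    # The requested provider is unusable; use pattern matching if allowed
    # (mirrors LLM_FALLBACK_ENABLED).
    if fallback_enabled:
        return "fallback"
    raise RuntimeError("No LLM provider is available and fallback is disabled")
```

Passing the availability check as a callback keeps the sketch free of network calls; a real check might probe `OLLAMA_BASE_URL`.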
Model Selection
- gpt-4o-mini: Best balance of cost and performance (recommended)
- gpt-4o: Highest accuracy, higher cost
- gpt-3.5-turbo: Fastest, lower accuracy
Processing Configuration¶
| Variable | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | `6000` | Text chunk size for processing |
| `CHUNK_OVERLAP` | `800` | Overlap between text chunks |
| `MAX_FILENAME_LENGTH` | `240` | Maximum filename length |
| `DEFAULT_OUTPUT_DIR` | `./separated_statements` | Default output directory |
| `PROCESSED_INPUT_DIR` | Auto-generated | Processed file storage |
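To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` is hypothetical, and the sketch assumes sizes are counted in characters; the splitter the tool actually uses may count differently (for example, in tokens).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 6000, chunk_overlap: int = 800) -> List[str]:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each window advances by 5200 characters, so the last 800 characters of one chunk reappear at the start of the next. Under this scheme, a larger overlap means more chunks for the same input.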
File Processing Settings¶
| Variable | Default | Description |
|---|---|---|
| `MAX_FILE_SIZE_MB` | `100` | Maximum input file size (MB) |
| `MAX_PAGES_PER_STATEMENT` | `50` | Pages per statement limit |
| `MAX_TOTAL_PAGES` | `500` | Total pages limit |
| `INCLUDE_BANK_IN_FILENAME` | `true` | Include bank name in output |
| `DATE_FORMAT` | `YYYY-MM` | Date format for filenames |
Security & Access Control¶
| Variable | Default | Description |
|---|---|---|
| `ALLOWED_INPUT_DIRS` | None | Comma-separated allowed input directories |
| `ALLOWED_OUTPUT_DIRS` | None | Comma-separated allowed output directories |
| `ENABLE_AUDIT_LOGGING` | `true` | Enable security audit logging |
Production Security
For production deployments, always set ALLOWED_INPUT_DIRS and ALLOWED_OUTPUT_DIRS to restrict file access to specific secure directories.
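As a rough illustration of what a directory allow-list check involves, the sketch below parses a comma-separated value such as `ALLOWED_INPUT_DIRS` and tests whether a path falls inside one of the listed directories. The helper names are hypothetical, and treating an unset list as "no restriction" is an assumption about the default, not documented behaviour.

```python
from pathlib import Path
from typing import List, Optional


def parse_allowed_dirs(raw: Optional[str]) -> List[Path]:
    """Turn a comma-separated setting into a list of resolved directories."""
    if not raw:
        return []
    return [Path(part.strip()).resolve() for part in raw.split(",") if part.strip()]


def is_path_allowed(candidate: str, allowed_dirs: List[Path]) -> bool:
    """Return True if candidate lies inside one of allowed_dirs.

    Assumption: an empty allow-list means no restriction is applied.
    """
    if not allowed_dirs:
        return True
    # Resolving first collapses ".." segments, so traversal tricks such as
    # /secure/input/../other are judged by where they really point.
    target = Path(candidate).resolve()
    return any(target == d or d in target.parents for d in allowed_dirs)
```

Comparing resolved path components rather than string prefixes avoids accepting a sibling directory such as `/secure/inputs` when only `/secure/input` is allowed.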
Error Handling Configuration¶
Quarantine System¶
| Variable | Default | Description |
|---|---|---|
| `QUARANTINE_DIRECTORY` | `./quarantine` | Failed document storage |
| `AUTO_QUARANTINE_CRITICAL_FAILURES` | `true` | Auto-quarantine critical failures |
| `PRESERVE_FAILED_OUTPUTS` | `true` | Keep partial outputs on failure |
| `MAX_RETRY_ATTEMPTS` | `2` | Retry count for transient failures |
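A retry limit like `MAX_RETRY_ATTEMPTS` is typically applied as a loop around the failing operation. The sketch below shows one way to read the setting, as the number of retries after the first attempt. That interpretation is an assumption, and `TransientError` and `run_with_retries` are hypothetical names, not part of the tool.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def run_with_retries(
    operation: Callable[[], T],
    max_retry_attempts: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Call operation, retrying up to max_retry_attempts times after a TransientError."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retry_attempts:
                raise  # retries exhausted; the caller can quarantine or report
            attempt += 1
            time.sleep(delay_seconds)
```

Only `TransientError` is retried here; any other exception propagates immediately, so retries are not spent on failures that cannot succeed.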
Error Reporting¶
| Variable | Default | Description |
|---|---|---|
| `ENABLE_ERROR_REPORTING` | `true` | Generate detailed error reports |
| `ERROR_REPORT_DIRECTORY` | `./error_reports` | Error report storage |
| `CONTINUE_ON_VALIDATION_WARNINGS` | `true` | Continue processing on warnings |
Validation Settings¶
| Variable | Default | Description |
|---|---|---|
| `VALIDATION_STRICTNESS` | `normal` | Validation level: strict, normal, lenient |
| `MIN_PAGES_PER_STATEMENT` | `1` | Minimum pages per statement |
| `MAX_FILE_AGE_DAYS` | `365` | Maximum file age in days |
| `ALLOWED_FILE_EXTENSIONS` | `.pdf` | Allowed file extensions |
| `REQUIRE_TEXT_CONTENT` | `true` | Require extractable text |
| `MIN_TEXT_CONTENT_RATIO` | `0.1` | Minimum text content ratio |
Validation Strictness Levels
- Strict: All validation issues are errors (highest accuracy)
- Normal: Balanced approach with warnings (recommended)
- Lenient: Most issues are warnings (highest processing success rate)
Paperless-ngx Integration¶
Connection Settings¶
| Variable | Default | Description |
|---|---|---|
| `PAPERLESS_ENABLED` | `false` | Enable paperless integration |
| `PAPERLESS_URL` | None | Paperless-ngx server URL |
| `PAPERLESS_TOKEN` | None | API authentication token |
Document Management¶
| Variable | Default | Description |
|---|---|---|
| `PAPERLESS_TAGS` | `bank-statement,automated` | Auto-applied tags |
| `PAPERLESS_CORRESPONDENT` | `Bank` | Default correspondent |
| `PAPERLESS_DOCUMENT_TYPE` | `Bank Statement` | Document type |
| `PAPERLESS_STORAGE_PATH` | `Bank Statements` | Storage path |
| `PAPERLESS_TAG_WAIT_TIME` | `5` | Wait time (seconds) before applying tags |
Input Document Processing¶
Configure how input documents from Paperless are tagged after successful processing:
| Variable | Default | Description |
|---|---|---|
| `PAPERLESS_INPUT_TAGGING_ENABLED` | `true` | Enable input document tagging after processing |
| `PAPERLESS_INPUT_PROCESSED_TAG` | None | Tag to add to input documents after processing |
| `PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG` | `false` | Remove 'unprocessed' tag after processing |
| `PAPERLESS_INPUT_PROCESSING_TAG` | None | Custom tag to mark documents as processed |
Auto-Creation
The system automatically creates tags, correspondents, document types, and storage paths in Paperless if they don't exist.
Input Document Tagging
When processing documents that originate from Paperless (using source_document_id), the system can automatically tag the original input documents as "processed" to prevent re-processing:
- **Option 1**: Add a "processed" tag: `PAPERLESS_INPUT_PROCESSED_TAG=processed`
- **Option 2**: Remove "unprocessed" tag: `PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG=true`
- **Option 3**: Use custom tag: `PAPERLESS_INPUT_PROCESSING_TAG=bank-statement-processed`
Only one option should be configured at a time. Input document tagging only occurs after successful output document processing and upload.
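Because only one tagging option should be active at a time, a quick pre-flight check can catch a misconfigured file before any documents are processed. The following is a minimal sketch; the function names are hypothetical, and it assumes that a non-empty tag value or a literal `true` counts as "configured".

```python
from typing import List, Mapping, Optional


def configured_tagging_options(env: Mapping[str, str]) -> List[str]:
    """Return the names of the input-tagging options that are active in env."""
    active = []
    if env.get("PAPERLESS_INPUT_PROCESSED_TAG", "").strip():
        active.append("PAPERLESS_INPUT_PROCESSED_TAG")
    if env.get("PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG", "").strip().lower() == "true":
        active.append("PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG")
    if env.get("PAPERLESS_INPUT_PROCESSING_TAG", "").strip():
        active.append("PAPERLESS_INPUT_PROCESSING_TAG")
    return active


def check_single_tagging_option(env: Mapping[str, str]) -> Optional[str]:
    """Return the single active option (or None); raise if several are set."""
    active = configured_tagging_options(env)
    if len(active) > 1:
        raise ValueError(
            "Only one input tagging option should be configured, found: "
            + ", ".join(active)
        )
    return active[0] if active else None
```

Calling `check_single_tagging_option(os.environ)` after the environment file has been loaded would run the check against the live settings.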
Logging Configuration¶
Log Levels¶
| Variable | Default | Description |
|---|---|---|
| `LOG_LEVEL` | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR |
| `LOG_FILE` | `./logs/statement_processing.log` | Log file location |
Audit Logging¶
| Variable | Default | Description |
|---|---|---|
| `ENABLE_AUDIT_LOGGING` | `true` | Enable compliance logging |
| `AUDIT_LOG_FILE` | `./logs/audit.log` | Audit log location |
Environment-Specific Configurations¶
Development Environment¶
# .env for development
OPENAI_API_KEY=sk-your-dev-key
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=DEBUG
VALIDATION_STRICTNESS=lenient
PRESERVE_FAILED_OUTPUTS=true
MAX_RETRY_ATTEMPTS=1
ENABLE_ERROR_REPORTING=true
Testing Environment¶
# .env for testing
OPENAI_API_KEY="" # Test fallback mode
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
VALIDATION_STRICTNESS=normal
DEFAULT_OUTPUT_DIR=./test/output
QUARANTINE_DIRECTORY=./test/quarantine
Production Environment¶
# .env for production
OPENAI_API_KEY=sk-your-prod-key
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
VALIDATION_STRICTNESS=strict
ENABLE_AUDIT_LOGGING=true
MAX_FILE_SIZE_MB=200
# Security restrictions
ALLOWED_INPUT_DIRS=/secure/input,/approved/documents
ALLOWED_OUTPUT_DIRS=/secure/output,/processed/statements
QUARANTINE_DIRECTORY=/secure/quarantine
# Paperless integration
PAPERLESS_ENABLED=true
PAPERLESS_URL=https://paperless.yourcompany.com
PAPERLESS_TOKEN=your-production-token
# Input document processing tracking
PAPERLESS_INPUT_TAGGING_ENABLED=true
PAPERLESS_INPUT_PROCESSED_TAG=processed
Configuration Validation¶
Test your configuration:
# Validate configuration loading
uv run python -c "
from src.bank_statement_separator.config import load_config
config = load_config()
print('✅ Configuration loaded successfully')
print(f'Model: {config.llm_model}')
print(f'Output Dir: {config.default_output_dir}')
print(f'Validation: {config.validation_strictness}')
"
# Test API key (if configured)
uv run python -c "
import openai
from src.bank_statement_separator.config import load_config
config = load_config()
if config.openai_api_key:
    client = openai.Client(api_key=config.openai_api_key)
    models = client.models.list()
    print('✅ OpenAI API key valid')
else:
    print('ℹ️ No API key configured (fallback mode)')
"
Dynamic Configuration¶
Command-Line Overrides¶
Override configuration via command-line:
# Override output directory
uv run python -m src.bank_statement_separator.main \
process input.pdf --output /custom/output
# Override model
uv run python -m src.bank_statement_separator.main \
process input.pdf --model gpt-4o
# Override env file location
uv run python -m src.bank_statement_separator.main \
process input.pdf --env-file /path/to/custom.env
Environment File Management¶
The --env-file parameter enables easy switching between different environment configurations without modifying your main .env file.
Creating Environment-Specific Files¶
Create dedicated environment files for different scenarios:
# Create environment-specific configs
cp .env.example .env.dev # Development settings
cp .env.example .env.test # Testing settings
cp .env.example .env.prod # Production settings
Environment File Usage Examples¶
Create .env.dev with development-optimized settings:
# .env.dev - Development Configuration
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-dev-key-here
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=DEBUG
DEFAULT_OUTPUT_DIR=./dev_output
VALIDATION_STRICTNESS=lenient
PRESERVE_FAILED_OUTPUTS=true
ENABLE_ERROR_REPORTING=true
MAX_RETRY_ATTEMPTS=1
Use the development environment by passing --env-file .env.dev to the process command.
Create .env.test for testing with fallback mode:
# .env.test - Testing Configuration
LLM_PROVIDER=auto
OPENAI_API_KEY=invalid-key-for-testing
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=ERROR
DEFAULT_OUTPUT_DIR=./test_output
VALIDATION_STRICTNESS=normal
MAX_FILE_SIZE_MB=10
QUARANTINE_DIRECTORY=./test_quarantine
ENABLE_FALLBACK_PROCESSING=true
Run tests with the testing environment by passing --env-file .env.test to the process command.
Create .env.prod for production deployment:
# .env.prod - Production Configuration
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-production-key
OPENAI_MODEL=gpt-4o
LOG_LEVEL=WARNING
DEFAULT_OUTPUT_DIR=/var/app/output
VALIDATION_STRICTNESS=strict
MAX_FILE_SIZE_MB=200
# Security restrictions
ALLOWED_INPUT_DIRS=/secure/input,/approved/documents
ALLOWED_OUTPUT_DIRS=/secure/output,/processed/statements
QUARANTINE_DIRECTORY=/secure/quarantine
# Paperless integration
PAPERLESS_ENABLED=true
PAPERLESS_URL=https://paperless.company.com
PAPERLESS_TOKEN=prod-token-here
Deploy with production settings by passing --env-file .env.prod to the process command.
Advanced Environment Patterns¶
Team Collaboration¶
Each team member maintains their own environment file:
# Personal environment files
.env.alice # Alice's development setup
.env.bob # Bob's development setup
.env.shared # Shared team defaults
# Usage
uv run python -m src.bank_statement_separator.main \
process input.pdf --env-file .env.alice
CI/CD Integration¶
Use environment files in automated pipelines:
# GitHub Actions example
- name: Run processing with CI config
  run: |
    uv run python -m src.bank_statement_separator.main \
      process test.pdf --env-file .env.ci --dry-run
Deployment-Specific Configuration¶
# Different deployment targets
.env.staging # Staging environment
.env.production # Production environment
.env.dr # Disaster recovery site
# Deployment
uv run python -m src.bank_statement_separator.main \
process input.pdf --env-file .env.staging
Environment File Validation¶
The system validates environment files before loading:
# Test environment file validity
uv run python -c "
from src.bank_statement_separator.config import load_config, validate_env_file
# Validate file exists and is readable
validate_env_file('.env.dev')
print('✅ Environment file is valid')
# Test configuration loading
config = load_config('.env.dev')
print(f'✅ Configuration loaded successfully')
print(f'Provider: {config.llm_provider}')
print(f'Model: {config.openai_model}')
print(f'Output: {config.default_output_dir}')
"
Error Handling¶
Common environment file issues and solutions:
Environment File Best Practices
- Never commit .env files containing secrets to version control
- Use descriptive names like .env.dev, .env.prod instead of generic names
- Document required variables in each environment file header
- Test configurations before deploying to production
- Use relative paths where possible for portability
- Validate configurations after changes using the validation script above
Security Considerations
- Production env files should be stored securely and access-controlled
- Use different API keys for different environments
- Set appropriate file permissions: 600 for files that contain secrets; 644 only for files without credentials
- Never expose production credentials in development/test environments
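A permission check can be scripted as part of a deployment step. The sketch below is a minimal illustration for POSIX systems only; `env_file_is_private` is a hypothetical helper, and it simply reports whether group and other users have any access to the file (as is the case for mode 600).

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # S_IRWXG / S_IRWXO cover read, write, and execute bits for group / others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A file with mode 644 fails this check because group and others can read it, which matters when the file holds API keys or tokens.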
Environment Variable Precedence¶
Configuration precedence (highest to lowest):
- Command-line arguments
- Environment variables
- .env file values
- Default values in code
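The precedence order can be sketched as a simple lookup chain. This is an illustrative model only: `resolve_setting` is a hypothetical function, and the tool's actual loader may merge these sources differently.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli_value: Optional[str],
    env_file_values: Mapping[str, str],
    default: Optional[str],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve one setting: CLI argument > environment variable > .env file > default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value              # 1. command-line argument
    if name in environ:
        return environ[name]          # 2. process environment
    if name in env_file_values:
        return env_file_values[name]  # 3. value read from the .env file
    return default                    # 4. default defined in code
```

For example, with `LOG_LEVEL=DEBUG` in the .env file and `LOG_LEVEL=WARNING` exported in the shell, this model yields `WARNING` unless a command-line value is also supplied.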
Configuration Best Practices¶
Security¶
# Never commit .env files
echo ".env" >> .gitignore
# Use different configs per environment
cp .env.example .env.development
cp .env.example .env.production
# Restrict file access in production
ALLOWED_INPUT_DIRS=/secure/input
ALLOWED_OUTPUT_DIRS=/secure/output
Performance¶
# Optimize for large files
MAX_FILE_SIZE_MB=500
CHUNK_SIZE=8000
CHUNK_OVERLAP=1000
# Balance accuracy vs speed
OPENAI_MODEL=gpt-4o-mini # Fast
LLM_TEMPERATURE=0 # Consistent
VALIDATION_STRICTNESS=normal # Balanced
Monitoring¶
# Enable comprehensive logging
LOG_LEVEL=INFO
ENABLE_AUDIT_LOGGING=true
ENABLE_ERROR_REPORTING=true
# Log file locations (rotate these files externally)
LOG_FILE=/var/log/bank-separator/processing.log
AUDIT_LOG_FILE=/var/log/bank-separator/audit.log
Configuration Templates¶
High-Accuracy Setup¶
# For maximum processing accuracy
OPENAI_API_KEY=sk-your-key
OPENAI_MODEL=gpt-4o
LLM_TEMPERATURE=0
VALIDATION_STRICTNESS=strict
MAX_RETRY_ATTEMPTS=3
ENABLE_FALLBACK_PROCESSING=false
High-Throughput Setup¶
# For maximum processing speed
OPENAI_MODEL=gpt-4o-mini
VALIDATION_STRICTNESS=lenient
MAX_RETRY_ATTEMPTS=1
CHUNK_SIZE=8000
CONTINUE_ON_VALIDATION_WARNINGS=true
Budget-Conscious Setup¶
# Minimize API costs
OPENAI_API_KEY="" # Use fallback only
ENABLE_FALLBACK_PROCESSING=true
VALIDATION_STRICTNESS=lenient
MAX_RETRY_ATTEMPTS=1
Troubleshooting Configuration¶
Common Issues¶
Next Steps¶
After configuring your system:
- Test your setup: Run the Quick Start Guide
- Learn CLI usage: Review CLI Commands
- Set up integrations: Configure Paperless Integration
- Configure error detection: Set up Error Detection & Tagging (v0.3.0+)
- Handle errors: Understand Error Handling