Environment Variables Reference
Complete reference for all 70+ environment variables that control the Workflow Bank Statement Separator.
Configuration Overview
The system uses environment variables loaded from a .env file for configuration. Variables are organized into logical groups for different aspects of the system.
Interactive Environment Help
Get comprehensive, up-to-date environment variable documentation directly from the CLI:
```bash
# Show all environment variables with descriptions
uv run bank-statement-separator env-help
# Show variables by category
uv run bank-statement-separator env-help --category llm
uv run bank-statement-separator env-help --category processing
uv run bank-statement-separator env-help --category paperless
```
Configuration Template
Copy .env.example to .env to get started with documented default values for all variables.
The template file contains the most current variable documentation.
Core Processing Variables
AI Processing
| Variable | Type | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | String | None | OpenAI API key for LLM analysis |
| `LLM_MODEL` | Choice | `gpt-4o-mini` | AI model: gpt-4o-mini, gpt-4o, gpt-3.5-turbo |
| `LLM_TEMPERATURE` | Float | `0` | Model temperature (0-1, 0=deterministic) |
| `LLM_MAX_TOKENS` | Integer | `4000` | Maximum tokens per API call |
| `ENABLE_FALLBACK_PROCESSING` | Boolean | `true` | Enable pattern-matching when AI fails |
AI Model Selection
- gpt-4o-mini: Best balance of cost and performance (recommended)
- gpt-4o: Highest accuracy, higher cost
- gpt-3.5-turbo: Fastest processing, lower accuracy
Text Processing
| Variable | Type | Default | Description |
|---|---|---|---|
| `CHUNK_SIZE` | Integer | `6000` | Text chunk size for LLM processing |
| `CHUNK_OVERLAP` | Integer | `800` | Overlap between chunks for context |
| `TEXT_EXTRACTION_METHOD` | Choice | `auto` | Method: auto, text, layout |
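To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of overlapping chunking. The function name `chunk_text` and the choice to measure chunks in characters are assumptions made for illustration; the tool's actual splitter may count tokens or split on other boundaries.

```python
def chunk_text(text: str, chunk_size: int = 6000, chunk_overlap: int = 800) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Under these assumptions, the default values make each chunk start 5,200 characters after the previous one, so raising the overlap increases the number of chunks sent to the model.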
File Processing
| Variable | Type | Default | Description |
|---|---|---|---|
| `MAX_FILE_SIZE_MB` | Integer | `100` | Maximum input file size in MB |
| `MAX_PAGES_PER_STATEMENT` | Integer | `50` | Maximum pages per individual statement |
| `MAX_TOTAL_PAGES` | Integer | `500` | Maximum total pages in input document |
| `PDF_RESOLUTION_DPI` | Integer | `150` | DPI for PDF processing |
Output Configuration
File Organization
| Variable | Type | Default | Description |
|---|---|---|---|
| `DEFAULT_OUTPUT_DIR` | Path | `./separated_statements` | Default output directory |
| `PROCESSED_INPUT_DIR` | Path | Auto | Directory for processed input files |
| `INCLUDE_BANK_IN_FILENAME` | Boolean | `true` | Include bank name in output filenames |
| `DATE_FORMAT` | String | `YYYY-MM` | Date format for filenames |
| `MAX_FILENAME_LENGTH` | Integer | `240` | Maximum filename length |
Filename Format
With INCLUDE_BANK_IN_FILENAME=true and DATE_FORMAT=YYYY-MM-DD:
westpac-2819-2015-05-21.pdf
anz-1234-2023-12-31.pdf
File Naming Patterns
| Variable | Type | Default | Description |
|---|---|---|---|
| `FILENAME_PATTERN` | String | `{bank}-{account}-{date}` | Filename pattern template |
| `ACCOUNT_MASK_DIGITS` | Integer | `4` | Number of account digits to show |
| `BANK_NAME_CLEANUP` | Boolean | `true` | Clean bank names for filenames |
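As a rough illustration of how `FILENAME_PATTERN`, `ACCOUNT_MASK_DIGITS`, and `BANK_NAME_CLEANUP` could combine, the sketch below fills the pattern with a cleaned bank name and the trailing account digits. The function `build_filename` is hypothetical, and both the cleanup rule (lowercase, non-alphanumerics replaced by hyphens) and the choice to keep the last digits are assumptions rather than the tool's documented behaviour.

```python
import re


def build_filename(bank: str, account: str, date: str,
                   pattern: str = "{bank}-{account}-{date}",
                   mask_digits: int = 4) -> str:
    """Fill a filename pattern, showing only the last mask_digits account digits."""
    # Assumed cleanup: lowercase and collapse non-alphanumeric runs into hyphens
    clean_bank = re.sub(r"[^a-z0-9]+", "-", bank.lower()).strip("-")
    digits = re.sub(r"\D", "", account)  # drop spaces and separators
    visible = digits[-mask_digits:] if mask_digits > 0 else ""
    return pattern.format(bank=clean_bank, account=visible, date=date) + ".pdf"


print(build_filename("Westpac", "0123 4567 2819", "2015-05-21"))
```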
Security & Access Control
File Access
| Variable | Type | Default | Description |
|---|---|---|---|
| `ALLOWED_INPUT_DIRS` | List | None | Comma-separated allowed input directories |
| `ALLOWED_OUTPUT_DIRS` | List | None | Comma-separated allowed output directories |
| `RESTRICTED_PATHS` | List | None | Comma-separated forbidden paths |
Production Security
Always set ALLOWED_INPUT_DIRS and ALLOWED_OUTPUT_DIRS in production:
```bash
ALLOWED_INPUT_DIRS=/secure/input,/approved/documents
ALLOWED_OUTPUT_DIRS=/secure/output,/processed/statements
```
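A directory allowlist of this kind is typically enforced by resolving the candidate path and checking that it sits inside one of the configured directories. The sketch below shows that idea; `parse_dir_list` and `is_path_allowed` are hypothetical helper names, not functions from the tool, and treating an empty list as "no restriction" is an assumption.

```python
from pathlib import Path


def parse_dir_list(value: str) -> list[Path]:
    """Turn a comma-separated variable value into resolved directory paths."""
    return [Path(item.strip()).resolve() for item in value.split(",") if item.strip()]


def is_path_allowed(candidate: str, allowed_dirs: list[Path]) -> bool:
    """Return True when candidate lies inside one of the allowed directories."""
    if not allowed_dirs:
        return True  # assumption: an empty allowlist means no restriction
    resolved = Path(candidate).resolve()  # collapses ".." segments and symlinks
    return any(resolved == d or d in resolved.parents for d in allowed_dirs)


allowed = parse_dir_list("/secure/input,/approved/documents")
print(is_path_allowed("/secure/input/statement.pdf", allowed))
print(is_path_allowed("/secure/input/../other/statement.pdf", allowed))
```

Resolving before comparing matters: a plain string prefix check would accept `/secure/input/../other` and `/secure/input2`, both of which fall outside the allowed directory.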
API Security
| Variable | Type | Default | Description |
|---|---|---|---|
| `API_TIMEOUT_SECONDS` | Integer | `60` | API request timeout |
| `API_RETRY_ATTEMPTS` | Integer | `3` | API retry attempts on failure |
| `RATE_LIMIT_REQUESTS_PER_MINUTE` | Integer | `60` | API rate limiting |
Error Handling & Quarantine
Quarantine System
| Variable | Type | Default | Description |
|---|---|---|---|
| `QUARANTINE_DIRECTORY` | Path | `./quarantine` | Directory for failed documents |
| `AUTO_QUARANTINE_CRITICAL_FAILURES` | Boolean | `true` | Auto-quarantine critical failures |
| `PRESERVE_FAILED_OUTPUTS` | Boolean | `true` | Keep partial outputs on failure |
| `QUARANTINE_MAX_SIZE_GB` | Integer | `10` | Maximum quarantine directory size |
Error Reporting
| Variable | Type | Default | Description |
|---|---|---|---|
| `ENABLE_ERROR_REPORTING` | Boolean | `true` | Generate detailed error reports |
| `ERROR_REPORT_DIRECTORY` | Path | `./error_reports` | Error report storage location |
| `ERROR_REPORT_MAX_AGE_DAYS` | Integer | `90` | Maximum age for error reports |
| `INCLUDE_STACK_TRACES` | Boolean | `false` | Include stack traces in reports |
Retry Logic
| Variable | Type | Default | Description |
|---|---|---|---|
| `MAX_RETRY_ATTEMPTS` | Integer | `2` | Maximum retry attempts for failures |
| `RETRY_DELAY_SECONDS` | Float | `1.0` | Delay between retry attempts |
| `RETRY_BACKOFF_FACTOR` | Float | `2.0` | Exponential backoff multiplier |
| `CONTINUE_ON_VALIDATION_WARNINGS` | Boolean | `true` | Continue processing on warnings |
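The three retry variables describe a standard exponential backoff schedule. The sketch below shows how the wait times would grow under that reading; `retry_delays` and `call_with_retry` are hypothetical names, and it assumes `MAX_RETRY_ATTEMPTS` counts retries made after the initial attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_delays(max_attempts: int = 2, delay: float = 1.0, backoff: float = 2.0) -> list[float]:
    """Wait times before each retry: delay, delay * backoff, delay * backoff**2, ..."""
    return [delay * backoff**attempt for attempt in range(max_attempts)]


def call_with_retry(func: Callable[[], T], max_attempts: int = 2, delay: float = 1.0,
                    backoff: float = 2.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Run func, retrying up to max_attempts times after the first failure."""
    for wait in retry_delays(max_attempts, delay, backoff):
        try:
            return func()
        except Exception:
            sleep(wait)  # back off before the next attempt
    return func()  # last attempt: any exception propagates to the caller
```

With the defaults, a call that keeps failing waits 1.0 s and then 2.0 s before the final attempt. The `sleep` parameter exists only so the waiting can be replaced in tests.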
Document Validation
Pre-Processing Validation
| Variable | Type | Default | Description |
|---|---|---|---|
| `VALIDATION_STRICTNESS` | Choice | `normal` | Validation level: strict, normal, lenient |
| `MIN_PAGES_PER_STATEMENT` | Integer | `1` | Minimum pages required per statement |
| `MAX_FILE_AGE_DAYS` | Integer | `365` | Maximum age of input files |
| `ALLOWED_FILE_EXTENSIONS` | List | `.pdf` | Allowed file extensions |
Validation Strictness Levels
- Strict: All validation issues cause processing to fail
- Normal: Balance between validation and processing success
- Lenient: Most validation issues generate warnings only
Content Validation
| Variable | Type | Default | Description |
|---|---|---|---|
| `REQUIRE_TEXT_CONTENT` | Boolean | `true` | Require extractable text content |
| `MIN_TEXT_CONTENT_RATIO` | Float | `0.1` | Minimum ratio of pages with text |
| `DETECT_SCANNED_DOCUMENTS` | Boolean | `true` | Detect image-only documents |
| `MIN_WORDS_PER_PAGE` | Integer | `10` | Minimum words per page |
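One plausible way `MIN_TEXT_CONTENT_RATIO` and `MIN_WORDS_PER_PAGE` work together is sketched below: a page counts as "having text" when it reaches the word minimum, and the document passes when enough pages qualify. This combination is an assumption, and `has_enough_text` is a hypothetical helper, not the tool's validator.

```python
def has_enough_text(page_texts: list[str], min_ratio: float = 0.1, min_words: int = 10) -> bool:
    """Check whether the share of pages with at least min_words words reaches min_ratio."""
    if not page_texts:
        return False  # a document with no pages cannot satisfy the check
    pages_with_text = sum(1 for text in page_texts if len(text.split()) >= min_words)
    return pages_with_text / len(page_texts) >= min_ratio
```

Under this reading, a ten-page scan with a single text page still passes at the default ratio of 0.1, which is why image-only documents are detected separately.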
Format Validation
| Variable | Type | Default | Description |
|---|---|---|---|
| `VALIDATE_PDF_STRUCTURE` | Boolean | `true` | Validate PDF file structure |
| `ALLOW_PASSWORD_PROTECTED` | Boolean | `false` | Allow password-protected PDFs |
| `CHECK_PDF_CORRUPTION` | Boolean | `true` | Check for PDF corruption |
| `REQUIRE_PDF_VERSION` | String | None | Required PDF version (e.g., "1.4") |
Paperless-ngx Integration
Connection Settings
| Variable | Type | Default | Description |
|---|---|---|---|
| `PAPERLESS_ENABLED` | Boolean | `false` | Enable Paperless-ngx integration |
| `PAPERLESS_URL` | URL | None | Paperless-ngx server URL |
| `PAPERLESS_TOKEN` | String | None | API authentication token |
| `PAPERLESS_TIMEOUT_SECONDS` | Integer | `30` | API request timeout |
Document Metadata
| Variable | Type | Default | Description |
|---|---|---|---|
| `PAPERLESS_TAGS` | List | `bank-statement,automated` | Auto-applied tags |
| `PAPERLESS_CORRESPONDENT` | String | `Bank` | Default correspondent name |
| `PAPERLESS_DOCUMENT_TYPE` | String | `Bank Statement` | Document type |
| `PAPERLESS_STORAGE_PATH` | String | `Bank Statements` | Storage path |
| `PAPERLESS_TAG_WAIT_TIME` | Integer | `5` | Wait time (seconds) before applying tags |
Auto-Creation
The system automatically creates missing tags, correspondents, document types, and storage paths in Paperless-ngx.
Input Document Processing
Configure how input documents from Paperless are tagged after successful processing:
| Variable | Type | Default | Description |
|---|---|---|---|
| `PAPERLESS_INPUT_TAGGING_ENABLED` | Boolean | `true` | Enable input document tagging after processing |
| `PAPERLESS_INPUT_PROCESSED_TAG` | String | None | Tag to add to input documents after processing |
| `PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG` | Boolean | `false` | Remove 'unprocessed' tag after processing |
| `PAPERLESS_INPUT_PROCESSING_TAG` | String | None | Custom tag to mark documents as processed |
Input Document Tagging Options
When processing documents that originate from Paperless (using source_document_id), configure one of these options:
1. **Add processed tag**: `PAPERLESS_INPUT_PROCESSED_TAG=processed`
2. **Remove unprocessed tag**: `PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG=true`
3. **Use custom tag**: `PAPERLESS_INPUT_PROCESSING_TAG=bank-statement-processed`
Input document tagging only occurs after successful output document processing and upload. This prevents re-processing of documents that have already been handled.
Error Detection and Tagging
Configure automatic error detection and tagging for documents with processing issues:
| Variable | Type | Default | Description |
|---|---|---|---|
| `PAPERLESS_ERROR_DETECTION_ENABLED` | Boolean | `false` | Enable automatic error detection and tagging |
| `PAPERLESS_ERROR_TAGS` | List | None | Tags to apply to documents with processing errors |
| `PAPERLESS_ERROR_TAG_THRESHOLD` | Float | `0.5` | Error severity threshold (0.0-1.0) for tagging |
| `PAPERLESS_ERROR_SEVERITY_LEVELS` | List | `medium,high,critical` | Severity levels that trigger tagging |
| `PAPERLESS_ERROR_BATCH_TAGGING` | Boolean | `false` | Use batch tagging (true) vs individual requests |
Error Detection System
The error detection system identifies 6 types of processing errors:
- **LLM Analysis Failures**: AI model errors or timeouts
- **Low Confidence Boundaries**: Statement detection with low confidence
- **PDF Processing Errors**: PDF generation or manipulation failures
- **Metadata Extraction Issues**: Failed to extract bank names, dates, accounts
- **File Output Problems**: Generated files missing or corrupted
- **Validation Failures**: Output validation checks failed
Only errors above the configured threshold and matching severity levels trigger automatic tagging.
Error Tagging Configuration
```bash
# Basic error tagging setup
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=processing:needs-review,error:automated-detection
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_SEVERITY_LEVELS=high,critical
# Development/testing with comprehensive error detection
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=test:error-detection,test:automated-tagging
PAPERLESS_ERROR_TAG_THRESHOLD=0.0
PAPERLESS_ERROR_SEVERITY_LEVELS=low,medium,high,critical
```
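The tagging rule described above combines two conditions: a numeric threshold and a set of severity levels. A minimal sketch of that decision follows. The function `should_tag_error` and its parameters are hypothetical, and it assumes each error carries a score between 0.0 and 1.0 that is compared with "greater than or equal"; the tool may treat the boundary differently.

```python
def should_tag_error(score: float, level: str, threshold: float = 0.5,
                     levels: tuple[str, ...] = ("medium", "high", "critical")) -> bool:
    """Tag only when the score reaches the threshold AND the level is configured."""
    # Assumption: a score equal to the threshold counts as reaching it
    return score >= threshold and level.lower() in levels


# Production-style settings from the example above: threshold 0.7, high/critical only
print(should_tag_error(0.8, "high", threshold=0.7, levels=("high", "critical")))
print(should_tag_error(0.8, "medium", threshold=0.7, levels=("high", "critical")))
```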
Upload Behavior
| Variable | Type | Default | Description |
|---|---|---|---|
| `PAPERLESS_AUTO_UPLOAD` | Boolean | `true` | Auto-upload after processing |
| `PAPERLESS_DELETE_AFTER_UPLOAD` | Boolean | `false` | Delete local files after upload |
| `PAPERLESS_RETRY_UPLOADS` | Boolean | `true` | Retry failed uploads |
| `PAPERLESS_BATCH_SIZE` | Integer | `5` | Maximum documents per batch |
Logging & Monitoring
Log Configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `LOG_LEVEL` | Choice | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR |
| `LOG_FILE` | Path | `./logs/statement_processing.log` | Main log file location |
| `LOG_MAX_SIZE_MB` | Integer | `10` | Maximum log file size |
| `LOG_BACKUP_COUNT` | Integer | `5` | Number of backup log files |
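Settings like these map naturally onto Python's standard `logging.handlers.RotatingFileHandler`. The sketch below shows that mapping; whether the tool uses this handler is an assumption, and `build_rotating_logger` and the logger name are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_rotating_logger(log_file: str, level: str = "INFO",
                          max_size_mb: int = 10, backup_count: int = 5) -> logging.Logger:
    """Create a logger that rotates its file once it reaches max_size_mb."""
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create ./logs if missing
    handler = RotatingFileHandler(
        path,
        maxBytes=max_size_mb * 1024 * 1024,  # LOG_MAX_SIZE_MB converted to bytes
        backupCount=backup_count,            # LOG_BACKUP_COUNT rotated files are kept
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("statement_processing")  # hypothetical logger name
    logger.setLevel(level)  # accepts level names such as "DEBUG" or "INFO"
    logger.addHandler(handler)
    return logger
```

With the defaults, disk usage for the main log stays bounded at roughly 60 MB: one active 10 MB file plus five backups.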
Audit Logging
| Variable | Type | Default | Description |
|---|---|---|---|
| `ENABLE_AUDIT_LOGGING` | Boolean | `true` | Enable security audit logging |
| `AUDIT_LOG_FILE` | Path | `./logs/audit.log` | Audit log file location |
| `AUDIT_LOG_LEVEL` | Choice | `INFO` | Audit log level |
| `LOG_API_CALLS` | Boolean | `true` | Log all API calls for monitoring |
Performance Monitoring
| Variable | Type | Default | Description |
|---|---|---|---|
| `ENABLE_PERFORMANCE_MONITORING` | Boolean | `true` | Enable performance metrics |
| `LOG_PROCESSING_TIMES` | Boolean | `true` | Log processing duration |
| `LOG_MEMORY_USAGE` | Boolean | `false` | Log memory consumption |
| `PERFORMANCE_LOG_FILE` | Path | `./logs/performance.log` | Performance metrics log |
Development & Testing
Development Mode
| Variable | Type | Default | Description |
|---|---|---|---|
| `DEVELOPMENT_MODE` | Boolean | `false` | Enable development features |
| `DEBUG_OUTPUT_DIR` | Path | `./debug` | Debug output directory |
| `PRESERVE_INTERMEDIATE_FILES` | Boolean | `false` | Keep intermediate processing files |
| `ENABLE_PROFILING` | Boolean | `false` | Enable performance profiling |
Testing Configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `TEST_MODE` | Boolean | `false` | Enable test mode features |
| `MOCK_API_RESPONSES` | Boolean | `false` | Use mock API responses |
| `TEST_DATA_DIR` | Path | `./test/input` | Test data directory |
| `GENERATE_TEST_REPORTS` | Boolean | `false` | Generate test reports |
Configuration Validation
Variable Types
Variables are automatically validated based on their type:
Boolean. Accepts true, false, 1, 0, yes, or no (case-insensitive):
```bash
ENABLE_AUDIT_LOGGING=true
PAPERLESS_ENABLED=false
```
Integer. Must be a valid integer within the allowed range:
```bash
MAX_FILE_SIZE_MB=100
CHUNK_SIZE=6000
```
Float. Must be a valid floating-point number:
```bash
LLM_TEMPERATURE=0.1
MIN_TEXT_CONTENT_RATIO=0.15
```
Path. Validated as a file system path:
```bash
DEFAULT_OUTPUT_DIR=./separated_statements
LOG_FILE=/var/log/processing.log
```
List. Comma-separated values:
```bash
PAPERLESS_TAGS=bank-statement,automated,monthly
ALLOWED_INPUT_DIRS=/secure/input,/approved/docs
```
Choice. Must match one of the predefined options:
```bash
LLM_MODEL=gpt-4o-mini # or gpt-4o, gpt-3.5-turbo
VALIDATION_STRICTNESS=normal # or strict, lenient
```
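The type rules above can be sketched as small parsing helpers. The names `parse_bool`, `parse_list`, and `parse_choice` are hypothetical and the tool's own validation may be implemented differently; the accepted boolean spellings are the ones listed above.

```python
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}


def parse_bool(value: str) -> bool:
    """Parse a boolean variable, ignoring case and surrounding whitespace."""
    normalized = value.strip().lower()
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_list(value: str) -> list[str]:
    """Split a comma-separated variable, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_choice(value: str, options: tuple[str, ...]) -> str:
    """Return value unchanged if it is one of options, otherwise raise."""
    if value not in options:
        raise ValueError(f"{value!r} must be one of {', '.join(options)}")
    return value
```

For example, `parse_choice("medium", ("strict", "normal", "lenient"))` raises `ValueError`, mirroring the invalid `VALIDATION_STRICTNESS=medium` case.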
Environment-Specific Configurations
Development Environment
```bash
# .env.development
OPENAI_API_KEY=sk-dev-key
LOG_LEVEL=DEBUG
VALIDATION_STRICTNESS=lenient
PRESERVE_FAILED_OUTPUTS=true
DEVELOPMENT_MODE=true
ENABLE_PROFILING=true
MAX_RETRY_ATTEMPTS=1
```
Testing Environment
```bash
# .env.testing
OPENAI_API_KEY="" # Test fallback mode
LOG_LEVEL=WARNING
TEST_MODE=true
MOCK_API_RESPONSES=true
QUARANTINE_DIRECTORY=./test/quarantine
DEFAULT_OUTPUT_DIR=./test/output
```
Production Environment
```bash
# .env.production
OPENAI_API_KEY=sk-prod-key
LOG_LEVEL=INFO
VALIDATION_STRICTNESS=strict
ENABLE_AUDIT_LOGGING=true
MAX_FILE_SIZE_MB=200
# Security
ALLOWED_INPUT_DIRS=/secure/input,/approved/documents
ALLOWED_OUTPUT_DIRS=/secure/output,/processed/statements
QUARANTINE_DIRECTORY=/secure/quarantine
# Paperless integration
PAPERLESS_ENABLED=true
PAPERLESS_URL=https://paperless.company.com
PAPERLESS_TOKEN=prod-api-token
# Input document processing tracking
PAPERLESS_INPUT_TAGGING_ENABLED=true
PAPERLESS_INPUT_PROCESSED_TAG=processed
# Error detection and tagging
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=processing:needs-review,error:automated-detection
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_SEVERITY_LEVELS=high,critical
```
Configuration Validation
Test your configuration:
```bash
# Validate all variables
uv run python -c "
from src.bank_statement_separator.config import load_config
try:
    config = load_config()
    print('✅ Configuration valid')
    print(f'Model: {config.llm_model}')
    print(f'Output: {config.default_output_dir}')
    print(f'Validation: {config.validation_strictness}')
except Exception as e:
    print(f'❌ Configuration error: {e}')
"
# Check specific variables in the current process environment
# (single-quoted so the shell leaves the inner double quotes untouched)
uv run python -c '
import os
print("OPENAI_API_KEY:", "Set" if os.getenv("OPENAI_API_KEY") else "Not set")
print("LOG_LEVEL:", os.getenv("LOG_LEVEL", "Not set"))
print("VALIDATION_STRICTNESS:", os.getenv("VALIDATION_STRICTNESS", "Not set"))
'
```
Common Configuration Issues
```bash
# ❌ Invalid
LLM_TEMPERATURE=2.0 # Must be 0-1
MAX_FILE_SIZE_MB=abc # Must be integer
VALIDATION_STRICTNESS=medium # Must be strict/normal/lenient
# ✅ Valid
LLM_TEMPERATURE=0.1
MAX_FILE_SIZE_MB=100
VALIDATION_STRICTNESS=normal
```
```bash
# ❌ Problematic
DEFAULT_OUTPUT_DIR=~/output # Tilde expansion issues
QUARANTINE_DIRECTORY=output # Relative path confusion
# ✅ Better
DEFAULT_OUTPUT_DIR=/home/user/output # Absolute path
QUARANTINE_DIRECTORY=./quarantine # Explicit relative path
```
```bash
# ❌ Insecure
ALLOWED_INPUT_DIRS="" # No restrictions
LOG_LEVEL=DEBUG # Too verbose for production
# ✅ Secure
ALLOWED_INPUT_DIRS=/secure/input
LOG_LEVEL=INFO
```
Configuration Best Practices
Security
- Never commit `.env` files to version control
- Use environment-specific configs (`.env.production`, `.env.development`)
- Restrict file access in production with `ALLOWED_*_DIRS`
- Use strong API keys and rotate them regularly
Performance
- Tune chunk sizes for your document types
- Set appropriate file size limits based on your hardware
- Configure retry settings for your network reliability
- Enable performance monitoring in production
Reliability
- Set up log rotation to prevent disk space issues
- Configure quarantine cleanup to manage storage
- Enable error reporting for troubleshooting
- Use strict validation for critical applications
Monitoring
- Enable audit logging for compliance
- Set up log monitoring and alerting
- Configure performance metrics collection
- Monitor API usage and costs