Paperless-ngx Integration Guide¶

Complete guide to integrating the Workflow Bank Statement Separator with Paperless-ngx document management system.

Overview¶

Paperless-ngx integration enables automatic upload and organization of separated bank statements into your document management system. The integration includes:

Automatic Upload: Processed statements uploaded after successful separation
Smart Organization: Auto-creation of tags, correspondents, and document types
Metadata Management: Automatic extraction and application of document metadata
Error Handling: Robust handling of upload failures with retry logic
Error Detection & Tagging: Automatic tagging of documents with processing errors for manual review

Production Ready

The Paperless integration has been live-tested with actual paperless-ngx instances and includes comprehensive error handling with 27 unit tests covering all functionality.

Prerequisites¶

Paperless-ngx Setup¶

Ensure you have:

Paperless-ngx instance: Running version 1.9.0 or higher
API Access: Admin account with API token
Network Access: System can reach Paperless-ngx server

Authentication¶

Generate an API token in Paperless-ngx:

Log into your Paperless-ngx admin interface
Navigate to Settings → API Tokens
Click Add Token and give it a descriptive name
Copy the generated token for configuration

Configuration¶

Basic Setup¶

Configure Paperless integration in your .env file:

# Enable Paperless integration
PAPERLESS_ENABLED=true

# Connection settings
PAPERLESS_URL=http://localhost:8000
PAPERLESS_TOKEN=your-api-token-here

# Optional: Connection timeout
PAPERLESS_TIMEOUT_SECONDS=30

Document Organization¶

Configure how documents are organized in Paperless:

# Auto-applied tags (comma-separated)
PAPERLESS_TAGS=bank-statement,automated

# Default correspondent name
PAPERLESS_CORRESPONDENT=Bank

# Document type
PAPERLESS_DOCUMENT_TYPE=Bank Statement

# Storage path for organization
PAPERLESS_STORAGE_PATH=Bank Statements

# Tag application timing (seconds)
PAPERLESS_TAG_WAIT_TIME=5

Tag Wait Time

PAPERLESS_TAG_WAIT_TIME controls how long the system waits (in seconds) before applying tags to uploaded documents. Paperless-ngx needs time to process documents before tags can be successfully applied. Adjust this value based on your Paperless instance speed:

- **Fast instances**: 3-5 seconds (default: 5)
- **Slower instances**: 8-10 seconds
- **Testing/immediate**: 0 seconds (may cause tag failures)

Advanced Options¶

# Upload behavior
PAPERLESS_AUTO_UPLOAD=true                    # Auto-upload after processing
PAPERLESS_DELETE_AFTER_UPLOAD=false          # Keep local files after upload
PAPERLESS_RETRY_UPLOADS=true                 # Retry failed uploads
PAPERLESS_BATCH_SIZE=5                       # Max documents per batch

# Backoff configuration for uploads
PAPERLESS_BACKOFF_MIN=2.0                    # Minimum backoff delay
PAPERLESS_BACKOFF_MAX=30.0                   # Maximum backoff delay
PAPERLESS_MAX_RETRIES=3                      # Maximum retry attempts

Input Document Processing Tracking¶

When processing documents that originate from Paperless-ngx (using the source_document_id parameter), the system can automatically tag the original input documents as "processed" after successful processing. This prevents re-processing of the same documents in future runs.

Configuration Options¶

Configure one of these input document tagging options:

# Option 1: Add a "processed" tag to input documents
PAPERLESS_INPUT_PROCESSED_TAG=processed

# Option 2: Remove "unprocessed" tag from input documents
PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG=true

# Option 3: Use a custom processing tag
PAPERLESS_INPUT_PROCESSING_TAG=bank-statement-processed

# Global enable/disable (default: true)
PAPERLESS_INPUT_TAGGING_ENABLED=true

How It Works¶

Document Processing: Process bank statements from Paperless using source_document_id
Output Processing: Create separated statements and upload them to Paperless
Input Tagging: After successful upload, tag the original input document as processed
Re-processing Prevention: Tagged documents can be filtered out in future processing runs

Usage Example¶

# Process a document from Paperless with ID 123
uv run python -c "
from src.bank_statement_separator.workflow import BankStatementWorkflow
from src.bank_statement_separator.config import load_config

config = load_config()
workflow = BankStatementWorkflow(config)

# source_document_id=123 tells the system this came from Paperless
result = workflow.run(
    input_file_path='/path/to/downloaded/statement.pdf',
    output_directory='/output',
    source_document_id=123
)

# After successful processing:
# - Output documents uploaded to Paperless with configured tags
# - Input document (ID 123) tagged as processed to prevent re-processing
"

Configuration Precedence¶

Only one input tagging option should be configured. The system checks in this order:

PAPERLESS_INPUT_PROCESSED_TAG (if set, adds this tag)
PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG (if true, removes 'unprocessed' tag)
PAPERLESS_INPUT_PROCESSING_TAG (if set, adds this custom tag)

Error Handling¶

Input document tagging failures are handled gracefully:

Non-blocking: Tagging failures don't stop the workflow
Detailed Logging: All tagging attempts and results are logged
Graceful Degradation: Missing tags or API errors are handled without stopping processing

Monitoring Input Document Processing¶

Check input document tagging results in logs:

# View input document tagging results
grep "input document.*processed" logs/statement_processing.log

# Check for tagging failures
grep "Failed to mark input document" logs/statement_processing.log

# View full workflow results including input tagging
grep "upload_results.*input_tagging" logs/statement_processing.log

Error Detection and Tagging¶

The system automatically detects processing errors during bank statement separation and applies configurable tags to uploaded documents in Paperless-ngx for manual review. This feature helps identify documents that may need human attention or reprocessing.

Overview¶

When processing errors occur, the system:

Detects Processing Issues: Identifies 6 types of processing errors during workflow execution
Applies Error Tags: Automatically tags affected documents in Paperless with configurable error tags
Enables Manual Review: Tagged documents can be easily filtered for manual review and correction
Maintains Audit Trail: Detailed logging of all error detection and tagging activities

Configuration¶

Enable error detection and tagging in your .env file:

# Enable error detection and tagging
PAPERLESS_ERROR_DETECTION_ENABLED=true

# Tags to apply to documents with processing errors (comma-separated)
PAPERLESS_ERROR_TAGS=processing:needs-review,error:automated-detection

# Severity threshold (0.0-1.0) - only errors above this threshold trigger tagging
PAPERLESS_ERROR_TAG_THRESHOLD=0.5

# Severity levels that trigger tagging (comma-separated)
PAPERLESS_ERROR_SEVERITY_LEVELS=medium,high,critical

# Tagging mode: false=individual requests, true=batch requests
PAPERLESS_ERROR_BATCH_TAGGING=false

Error Types Detected¶

The system detects these processing error categories:

Error Type	Severity	Description	Example Scenarios
LLM Analysis Failures	High	AI model fails to analyze document content	API timeouts, invalid responses, model errors
Low Confidence Boundaries	Medium-High	Statement boundaries detected with low confidence	Ambiguous page breaks, unclear statement separators
PDF Processing Errors	High	PDF generation or manipulation failures	Memory limits, corruption, format issues
Metadata Extraction Issues	Medium	Failed to extract bank names, dates, account numbers	Unrecognized formats, OCR failures
File Output Problems	Critical	Generated files missing or corrupted	Disk space, permissions, write failures
Validation Failures	High	Output validation checks failed	Page counts, content sampling, format verification

How It Works¶

1. Error Detection During Processing¶

# Error detection happens automatically during workflow execution
from bank_statement_separator.workflow import BankStatementWorkflow

workflow = BankStatementWorkflow(config)
result = workflow.run(
    input_file_path='statements.pdf',
    output_directory='./output'
)

# If errors are detected during processing, they are automatically
# identified and prepared for tagging

2. Automatic Tagging of Uploaded Documents¶

When documents are uploaded to Paperless, the system:

Checks if any processing errors were detected
Evaluates error severity against the configured threshold
Applies configured error tags to affected documents
Logs detailed tagging results for audit purposes

3. Error Tag Application¶

# Example: Document with PDF processing errors gets these tags applied:
processing:needs-review        # From PAPERLESS_ERROR_TAGS
error:automated-detection      # From PAPERLESS_ERROR_TAGS
error:pdf                     # Specific error type tag
error:severity:high           # Severity level tag

Configuration Examples¶

Basic Error Tagging¶

# Simple setup - tag documents needing review
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=needs-review
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_SEVERITY_LEVELS=high,critical

Advanced Error Classification¶

# Detailed error classification with multiple tags
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=processing:error,automated:detection,review:required
PAPERLESS_ERROR_TAG_THRESHOLD=0.5
PAPERLESS_ERROR_SEVERITY_LEVELS=medium,high,critical
PAPERLESS_ERROR_BATCH_TAGGING=true

Development/Testing Setup¶

# Tag all errors including low-severity ones for development
PAPERLESS_ERROR_DETECTION_ENABLED=true
PAPERLESS_ERROR_TAGS=test:error-detection,test:automated-tagging
PAPERLESS_ERROR_TAG_THRESHOLD=0.0
PAPERLESS_ERROR_SEVERITY_LEVELS=low,medium,high,critical

Usage Examples¶

Processing with Error Detection¶

# Process documents with error detection enabled
PAPERLESS_ERROR_DETECTION_ENABLED=true \
uv run python -m src.bank_statement_separator.main \
  process statements.pdf --output ./output --yes

The system will:

Process the bank statements normally
Detect any processing errors during workflow execution
Upload separated documents to Paperless
Apply error tags to documents where processing errors occurred
Log detailed error detection and tagging results

Reviewing Tagged Documents¶

In Paperless-ngx, filter documents by error tags:

# Search for documents needing review
# In Paperless web interface: Search for "needs-review" tag
# Or use API to find tagged documents:

curl -H "Authorization: Token $PAPERLESS_TOKEN" \
  "$PAPERLESS_URL/api/documents/?tags__name__in=needs-review"

Monitoring and Verification¶

Check Error Detection Results¶

# View error detection logs
grep "Detected.*processing errors" logs/statement_processing.log

# Check tagging results
grep "Error Tagging Results" logs/statement_processing.log

# View specific tagging details
grep "Document.*tags applied" logs/statement_processing.log

Verify in Paperless¶

Navigate to Documents: Go to your Paperless-ngx document list
Filter by Error Tags: Use tag filters to find documents with error tags
Review Document Status: Check which documents need manual review
Manual Processing: Reprocess or manually correct flagged documents

Error Tag Management¶

Creating Error Tags in Paperless¶

The system assumes error tags exist in your Paperless instance. Create them manually:

Go to Paperless Settings → Tags
Create Error Tags such as:
processing:needs-review (orange color)
error:automated-detection (red color)
error:pdf (dark red)
error:severity:high (bright red)

Tag Cleanup After Review¶

After manually reviewing and fixing documents:

# Remove error tags from corrected documents
# Via Paperless web interface or API

# Example API call to remove error tag from document 123:
curl -X PATCH \
  -H "Authorization: Token $PAPERLESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": [1,2,3]}' \  # List without error tag IDs
  "$PAPERLESS_URL/api/documents/123/"

Troubleshooting Error Tagging¶

Common Issues¶

Tags Not AppliedToo Many/Few Tags AppliedTagging Permission Errors

Problem: Error tags not appearing on uploaded documents

Solutions:

# Check if error detection is enabled
grep "PAPERLESS_ERROR_DETECTION_ENABLED" .env

# Verify error tags exist in Paperless
curl -H "Authorization: Token $PAPERLESS_TOKEN" \
     "$PAPERLESS_URL/api/tags/" | grep "needs-review"

# Check threshold settings
echo "Threshold: $PAPERLESS_ERROR_TAG_THRESHOLD"

Problem: Error tagging too aggressive or not sensitive enough

Solutions:

# Adjust severity threshold
PAPERLESS_ERROR_TAG_THRESHOLD=0.7  # More selective (higher threshold)
PAPERLESS_ERROR_TAG_THRESHOLD=0.3  # Less selective (lower threshold)

# Adjust severity levels
PAPERLESS_ERROR_SEVERITY_LEVELS=high,critical  # Only major errors
PAPERLESS_ERROR_SEVERITY_LEVELS=medium,high,critical  # Include medium errors

Problem: API token lacks permission to apply tags

Solutions:

# Test tag creation permission
curl -X POST \
  -H "Authorization: Token $PAPERLESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "test-tag"}' \
  "$PAPERLESS_URL/api/tags/"

# Ensure API token has admin permissions
# Or pre-create all required tags manually

Debug Error Detection¶

# Enable verbose error detection logging
LOG_LEVEL=DEBUG
PAPERLESS_ERROR_DETECTION_ENABLED=true

# Run with detailed output
uv run python -m src.bank_statement_separator.main \
  process statements.pdf --verbose

# Check error detection details
grep "error_detector\|error_tagger" logs/statement_processing.log

Integration with Workflow¶

Error detection integrates seamlessly with the standard processing workflow:

1. PDF Ingestion → 2. Document Analysis → 3. Statement Detection →
4. Metadata Extraction → 5. PDF Generation → 6. File Organization →
7. Output Validation → 8. Paperless Upload (with Error Tagging)

Error detection occurs throughout steps 2-7, and tagging happens during step 8 when documents are uploaded to Paperless.

Best Practices¶

Tag Naming Conventions¶

Use consistent, hierarchical tag naming:

# Recommended tag structure
PAPERLESS_ERROR_TAGS="
  processing:needs-review,    # Top-level category
  error:automated,           # Error source identifier
  review:priority-high       # Action-oriented tag
"

Severity Threshold Tuning¶

Start with conservative settings and adjust based on results:

# Production: Conservative (fewer false positives)
PAPERLESS_ERROR_TAG_THRESHOLD=0.7
PAPERLESS_ERROR_SEVERITY_LEVELS=high,critical

# Development: Comprehensive (catch more issues)
PAPERLESS_ERROR_TAG_THRESHOLD=0.4
PAPERLESS_ERROR_SEVERITY_LEVELS=medium,high,critical

Regular Review Process¶

Establish a workflow for reviewing tagged documents:

Daily: Check for new documents with error tags
Weekly: Review and process flagged documents
Monthly: Analyze error patterns and adjust thresholds
Quarterly: Clean up resolved error tags

This error detection and tagging system ensures that processing issues are automatically flagged for human review, improving the reliability and quality of automated bank statement processing.

Auto-Creation Features¶

The system automatically creates missing entities in Paperless-ngx:

Tags¶

Creates tags from PAPERLESS_TAGS if they don't exist
Applies bank-specific tags based on detected bank names
Adds processing timestamp tags

Correspondents¶

Creates default correspondent from PAPERLESS_CORRESPONDENT
Creates bank-specific correspondents (e.g., "Westpac", "ANZ")
Uses exact name matching to avoid duplicates

Document Types¶

Creates document type from PAPERLESS_DOCUMENT_TYPE
Handles different statement types (checking, savings, credit card)

Storage Paths¶

Creates storage path hierarchy from PAPERLESS_STORAGE_PATH
Organizes by bank and year automatically
Example: Bank Statements/Westpac/2024/

Usage Examples¶

Basic Integration¶

# Process and upload to Paperless
uv run python -m src.bank_statement_separator.main \
  process statements.pdf --output ./separated --yes

With Paperless enabled, this will:

Process the PDF and create separated statements
Upload each separated statement to Paperless-ngx
Apply configured tags, correspondent, and document type
Move input file to processed directory

Custom Configuration¶

# Use custom Paperless settings
PAPERLESS_TAGS="bank-statement,westpac,monthly" \
PAPERLESS_CORRESPONDENT="Westpac Bank" \
uv run python -m src.bank_statement_separator.main \
  process westpac-statements.pdf --yes

Workflow Integration¶

Processing Pipeline¶

The Paperless upload occurs as the final step in the 8-node workflow:

PDF Ingestion → 2. Document Analysis → 3. Statement Detection →
Metadata Extraction → 5. PDF Generation → 6. File Organization →
Output Validation → 8. Paperless Upload

Upload Process¶

For each separated statement:

Metadata Extraction: Extract bank name, account number, statement date
Entity Resolution: Find or create tags, correspondent, document type
File Upload: Upload PDF with metadata
Verification: Confirm successful upload
Logging: Record upload results

Metadata Mapping¶

Automatic Metadata Extraction¶

Statement Data	Paperless Field	Example
Bank Name	Correspondent	"Westpac Bank"
Account Number	Tags	"account-2819"
Statement Date	Date	2024-08-31
Statement Period	Tags	"2024-08"
Document Type	Document Type	"Bank Statement"

Smart Tagging¶

The system applies intelligent tags:

# Example tags for a Westpac statement
bank-statement          # From PAPERLESS_TAGS
automated              # From PAPERLESS_TAGS
westpac                # From detected bank name
account-2819           # From account number
2024-08               # From statement period

Error Handling¶

Upload Failures¶

The system handles various upload scenarios:

=== "Network Errors" - Automatically retries with exponential backoff and jitter - Logs detailed error information including backoff delays - Continues processing other documents - Configurable retry limits and delay parameters

=== "Authentication Errors" - Validates API token before processing - Provides clear error messages - Fails fast to prevent wasted processing

=== "Server Errors" - Distinguishes between temporary and permanent failures - Retries temporary failures (5xx errors) - Reports permanent failures (4xx errors) immediately

Error Recovery¶

# Check upload status
grep "paperless" logs/statement_processing.log

# Retry failed uploads manually
uv run python -c "
from src.bank_statement_separator.utils.paperless_client import PaperlessClient
from src.bank_statement_separator.config import load_config

config = load_config()
client = PaperlessClient(config)
# Retry specific file
client.upload_document('/path/to/failed/statement.pdf', {...})
"

Monitoring and Verification¶

Upload Verification¶

Check that uploads completed successfully:

# View upload logs
grep "PAPERLESS_UPLOAD" logs/audit.log

# Check processing results for upload status
uv run python -m src.bank_statement_separator.main \
  process statements.pdf --verbose

Paperless-ngx Verification¶

In your Paperless-ngx interface:

Navigate to Documents
Filter by recent uploads
Verify tags, correspondent, and document type
Check that files are properly organized

API Health Check¶

Test Paperless connection:

# Test API connectivity
uv run python -c "
from src.bank_statement_separator.utils.paperless_client import PaperlessClient
from src.bank_statement_separator.config import load_config

config = load_config()
if config.paperless_enabled:
    client = PaperlessClient(config)
    print('✅ Paperless connection successful')
    print(f'Server: {config.paperless_url}')
else:
    print('ℹ️ Paperless integration disabled')
"

Troubleshooting¶

Common Issues¶

Connection RefusedAuthentication FailedUpload TimeoutsMetadata Issues

Problem: Cannot connect to Paperless-ngx server

Solutions:

# Check server URL and port
curl -I http://localhost:8000/api/

# Verify network connectivity
ping paperless-server

# Check if Paperless is running
docker ps | grep paperless  # If using Docker

Problem: API token invalid or expired

Solutions:

# Test API token manually
curl -H "Authorization: Token your-token-here" \
     http://localhost:8000/api/documents/

# Generate new token in Paperless admin
# Update PAPERLESS_TOKEN in .env

Problem: Large files timing out during upload

Solutions:

# Increase timeout
PAPERLESS_TIMEOUT_SECONDS=120

# Process smaller batches
PAPERLESS_BATCH_SIZE=1

Problem: Tags or correspondents not created properly

Solutions:

# Enable verbose logging
LOG_LEVEL=DEBUG

# Check API responses
grep "paperless.*response" logs/statement_processing.log

# Verify entity resolution
grep "_resolve_" logs/statement_processing.log

Debug Mode¶

Enable detailed Paperless logging:

# Debug configuration
LOG_LEVEL=DEBUG
PAPERLESS_ENABLED=true

# Run with verbose output
uv run python -m src.bank_statement_separator.main \
  process statements.pdf --verbose

Production Deployment¶

Security Considerations¶

# Use HTTPS in production
PAPERLESS_URL=https://paperless.company.com

# Secure token storage
# Store token in secure key management system
PAPERLESS_TOKEN=$(vault kv get -field=token secret/paperless-api)

Performance Optimization¶

# Optimize for high-volume processing
PAPERLESS_BATCH_SIZE=10         # Process multiple documents
PAPERLESS_TIMEOUT_SECONDS=60    # Reasonable timeout
PAPERLESS_RETRY_UPLOADS=true    # Handle transient failures

Monitoring Setup¶

# Monitor upload success rates
grep "PAPERLESS_UPLOAD" logs/audit.log | \
  grep "$(date +%Y-%m-%d)" | \
  awk '/SUCCESS/{s++} /FAILED/{f++} END{print "Upload success rate: " s/(s+f)*100 "%"}'

# Alert on upload failures
upload_failures=$(grep "PAPERLESS_UPLOAD.*FAILED" logs/audit.log | \
  grep "$(date +%Y-%m-%d)" | wc -l)

if [[ $upload_failures -gt 5 ]]; then
    echo "High Paperless upload failure rate: $upload_failures" | \
        mail -s "Paperless Upload Alert" admin@company.com
fi

Advanced Configuration¶

Custom Metadata Templates¶

Configure dynamic metadata based on document content:

# Template variables available:
# {bank} - Detected bank name
# {account} - Account number
# {date} - Statement date
# {period} - Statement period

PAPERLESS_TAGS="bank-statement,{bank},{period}"
PAPERLESS_CORRESPONDENT="{bank} Bank"
PAPERLESS_STORAGE_PATH="Statements/{bank}/{year}"

Conditional Processing¶

Only upload certain types of documents:

# Custom processing logic
PAPERLESS_AUTO_UPLOAD=false  # Disable automatic upload

# Manual upload in scripts based on conditions
if [[ "$bank_name" == "Westpac" ]]; then
    PAPERLESS_ENABLED=true uv run python -m src.bank_statement_separator.main \
      process statements.pdf --yes
fi

Integration with Other Systems¶

Paperless integration can work alongside other document management:

# Multi-system upload
PAPERLESS_ENABLED=true
SHAREPOINT_ENABLED=true  # Custom integration
S3_BACKUP_ENABLED=true   # Custom integration

API Reference¶

The Paperless client provides these key methods:

Document Upload¶

from src.bank_statement_separator.utils.paperless_client import PaperlessClient

client = PaperlessClient(config)

# Upload single document
result = client.upload_document(
    file_path="/path/to/statement.pdf",
    metadata={
        "title": "Westpac Statement 2024-08",
        "tags": ["bank-statement", "westpac"],
        "correspondent": "Westpac Bank",
        "document_type": "Bank Statement"
    }
)

Entity Management¶

# Create or find entities
tag_id = client._resolve_tags(["bank-statement", "automated"])
correspondent_id = client._resolve_correspondent("Westpac Bank")
doc_type_id = client._resolve_document_type("Bank Statement")
storage_id = client._resolve_storage_path("Bank Statements/Westpac")

Migration and Backup¶

Data Migration¶

When setting up Paperless integration:

Backup existing data before enabling integration
Test with small batches first
Verify metadata mapping is correct
Monitor upload success rates during rollout

Disaster Recovery¶

# Backup Paperless database before major changes
docker exec paperless-db pg_dump -U paperless > paperless_backup.sql

# Export document list for recovery
curl -H "Authorization: Token $PAPERLESS_TOKEN" \
     "$PAPERLESS_URL/api/documents/" > documents_backup.json

This completes the Paperless-ngx integration guide. The system provides robust, production-ready document management integration with comprehensive error handling and monitoring capabilities.