Workflow Architecture Overview

An overview of the 8-node LangGraph workflow, including its error handling and recovery mechanisms.

System Architecture

The Bank Statement Separator uses two complementary workflow systems:

Application Processing Workflow: An 8-node LangGraph pipeline for PDF processing, with error handling and recovery
CI/CD Pipeline Workflow: GitHub Actions workflows for automated testing, releasing, and documentation deployment

Complete Workflow Documentation

This document covers the application processing workflow. For the CI/CD workflows (release automation, testing, and documentation deployment), see GitHub Workflows Architecture.

Application Processing Workflow

The core PDF processing runs as an 8-node LangGraph pipeline. Each node has its own error path: retry with backoff, fallback processing, or quarantine.

Complete Workflow Diagram

flowchart TD
    Start([PDF Input File]) --> PreValidation{Pre-Processing<br/>Validation}

    PreValidation -->|✅ Valid| Node1[1. PDF Ingestion<br/>📄 Load & Validate]
    PreValidation -->|❌ Invalid| QuarantinePreValidation[Quarantine:<br/>Pre-validation Failure]

    Node1 --> Node1Error{Processing<br/>Error?}
    Node1Error -->|✅ Success| Node2[2. Document Analysis<br/>📊 Extract Text & Chunk]
    Node1Error -->|❌ Error| RetryLogic1{Retry<br/>Logic}

    Node2 --> Node2Error{Processing<br/>Error?}
    Node2Error -->|✅ Success| Node3[3. Statement Detection<br/>🤖 AI Boundary Analysis]
    Node2Error -->|❌ Error| RetryLogic2{Retry<br/>Logic}

    Node3 --> Node3Error{AI Available?}
    Node3Error -->|✅ Success| Node4[4. Metadata Extraction<br/>🏷️ Account, Date, Bank]
    Node3Error -->|❌ API Failure| FallbackMode[Fallback Mode:<br/>Pattern Matching]

    FallbackMode --> Node4

    Node4 --> Node4Error{Processing<br/>Error?}
    Node4Error -->|✅ Success| Node5[5. PDF Generation<br/>📋 Create Separate Files]
    Node4Error -->|❌ Error| RetryLogic4{Retry<br/>Logic}

    Node5 --> Node5Error{Processing<br/>Error?}
    Node5Error -->|✅ Success| Node6[6. File Organization<br/>📁 Apply Naming & Structure]
    Node5Error -->|❌ Error| RetryLogic5{Retry<br/>Logic}

    Node6 --> Node6Error{Processing<br/>Error?}
    Node6Error -->|✅ Success| Node7[7. Output Validation<br/>✅ Integrity Checking]
    Node6Error -->|❌ Error| RetryLogic6{Retry<br/>Logic}

    Node7 --> ValidationResult{Validation<br/>Result}
    ValidationResult -->|✅ Valid| Node8[8. Paperless Upload<br/>📤 Document Management]
    ValidationResult -->|❌ Failed| QuarantineValidation[Quarantine:<br/>Validation Failure]

    Node8 --> Node8Error{Upload<br/>Success?}
    Node8Error -->|✅ Success| ErrorDetection[Error Detection<br/>🔍 Analyze Processing Issues]
    Node8Error -->|❌ Upload Failed| RetryLogic8{Retry<br/>Logic}

    ErrorDetection --> ErrorsFound{Processing<br/>Errors Detected?}
    ErrorsFound -->|✅ Errors Found| ErrorTagging[Error Tagging<br/>🏷️ Apply Error Tags]
    ErrorsFound -->|❌ No Errors| InputTagging{Source Document<br/>ID Available?}

    ErrorTagging --> TaggingResult{Tagging<br/>Success?}
    TaggingResult -->|✅ Success| InputTagging
    TaggingResult -->|❌ Failed| TaggingWarning[Log Tagging Warning<br/>⚠️ Continue Processing]
    TaggingWarning --> InputTagging

    InputTagging -->|✅ Yes| TagInput[Tag Input Document<br/>🏷️ Mark as Processed]
    InputTagging -->|❌ No| ProcessedFiles[Move to Processed<br/>📂 Archive Input]

    TagInput --> TagResult{Tagging<br/>Success?}
    TagResult -->|✅ Success| ProcessedFiles
    TagResult -->|❌ Failed| TagWarning[Log Warning<br/>⚠️ Continue Processing]
    TagWarning --> ProcessedFiles

    ProcessedFiles --> Success([✅ Processing Complete<br/>📊 Generate Report])

    %% Retry Logic Flows with Backoff
    RetryLogic1 -->|Retry with Backoff| Node1
    RetryLogic1 -->|Max Retries Exceeded| QuarantineCritical[Quarantine:<br/>Critical Failure]

    RetryLogic2 -->|Retry with Backoff| Node2
    RetryLogic2 -->|Max Retries Exceeded| QuarantineCritical

    RetryLogic4 -->|Retry with Backoff| Node4
    RetryLogic4 -->|Max Retries Exceeded| QuarantineCritical

    RetryLogic5 -->|Retry with Backoff| Node5
    RetryLogic5 -->|Max Retries Exceeded| QuarantineCritical

    RetryLogic6 -->|Retry with Backoff| Node6
    RetryLogic6 -->|Max Retries Exceeded| QuarantineCritical

    RetryLogic8 -->|Retry with Backoff| Node8
    RetryLogic8 -->|Max Retries Exceeded| PartialSuccess[Partial Success:<br/>Files Created, Upload Failed]

    %% Quarantine System
    QuarantinePreValidation --> ErrorReport1[Generate Error Report<br/>📋 Recovery Suggestions]
    QuarantineCritical --> ErrorReport2[Generate Error Report<br/>📋 Recovery Suggestions]
    QuarantineValidation --> ErrorReport3[Generate Error Report<br/>📋 Recovery Suggestions]

    ErrorReport1 --> QuarantineDir[(🗂️ Quarantine Directory<br/>Failed Documents)]
    ErrorReport2 --> QuarantineDir
    ErrorReport3 --> QuarantineDir

    PartialSuccess --> PartialReport[Generate Partial Report<br/>⚠️ Upload Issue Noted]
    PartialReport --> Success

    %% Monitoring and Management
    QuarantineDir --> QuarantineManagement[Quarantine Management<br/>🧹 CLI Tools]
    QuarantineManagement --> QuarantineClean[Periodic Cleanup<br/>🗑️ Remove Old Files]
    QuarantineManagement --> QuarantineAnalysis[Error Analysis<br/>📈 Pattern Detection]

    %% Styling
    classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000
    classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000

    class Node1,Node2,Node3,Node4,Node5,Node6,Node7,Node8,TagInput,ErrorDetection,ErrorTagging nodeStyle
    class Node1Error,Node2Error,Node3Error,Node4Error,Node5Error,Node6Error,Node8Error,ValidationResult decisionStyle
    class PreValidation,RetryLogic1,RetryLogic2,RetryLogic4,RetryLogic5,RetryLogic6,RetryLogic8,InputTagging,TagResult,ErrorsFound,TaggingResult decisionStyle
    class QuarantinePreValidation,QuarantineCritical,QuarantineValidation,ErrorReport1,ErrorReport2,ErrorReport3,QuarantineDir quarantineStyle
    class FallbackMode,PartialSuccess,PartialReport,TagWarning,TaggingWarning errorStyle
    class ProcessedFiles,Success,QuarantineClean,QuarantineAnalysis successStyle

Workflow Nodes Detailed

1. PDF Ingestion 📄

Purpose: Load and validate input PDF files
Validation: File format, size, accessibility, password protection
Error Handling: Pre-validation quarantine for invalid files; processing errors are retried with backoff and quarantined once retries are exhausted
Fallback: None (critical failure point)

2. Document Analysis 📊

Purpose: Extract text content and create processing chunks
Processing: Text extraction, chunk creation with overlap
Error Handling: Retry logic for temporary failures
Fallback: Basic text extraction methods
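
Chunk creation with overlap can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `chunk_text` is hypothetical, and it assumes that chunk size and overlap (compare the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings) are measured in characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a statement
    boundary falling near a chunk edge is visible in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A larger overlap lowers the chance of cutting a boundary marker in half, at the cost of sending more duplicated text to later stages.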

3. Statement Detection 🤖

Purpose: Identify statement boundaries using AI analysis
AI Processing: OpenAI GPT models for intelligent detection
Error Handling: Automatic fallback to enhanced pattern matching
Fallback: Enhanced pattern-based detection with fragment filtering
Fragment Detection: Identifies and excludes low-confidence document fragments
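
The fallback behaviour of this node can be illustrated with a short sketch. Everything named here is an assumption made for illustration: `detect_boundaries`, `detect_with_patterns`, the `ai_detector` callable, and the header regex are hypothetical and do not reflect the project's actual detection rules.

```python
import re
from typing import Callable

# Hypothetical header pattern; real statements would need bank-specific rules.
STATEMENT_HEADER = re.compile(r"statement\s+(period|date)", re.IGNORECASE)


def detect_with_patterns(pages: list[str]) -> list[int]:
    """Return indices of pages that look like the first page of a statement."""
    if not pages:
        return []
    starts = [i for i, page in enumerate(pages) if STATEMENT_HEADER.search(page)]
    # If nothing matches, treat the whole document as a single statement.
    return starts or [0]


def detect_boundaries(
    pages: list[str],
    ai_detector: Callable[[list[str]], list[int]],
) -> tuple[list[int], str]:
    """Try AI detection first; fall back to pattern matching on any failure."""
    try:
        return ai_detector(pages), "ai"
    except Exception:
        # An API outage or model error must not stop the pipeline.
        return detect_with_patterns(pages), "pattern_fallback"
```

Returning the detection method alongside the result lets later stages, such as error detection, record that the fallback path was used.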

4. Metadata Extraction 🏷️

Purpose: Extract account numbers, dates, and bank names
Processing: AI-powered metadata identification
Error Handling: Retry logic with graceful degradation
Fallback: Pattern-based extraction

5. PDF Generation 📋

Purpose: Create separate PDF files for each statement
Processing: Page-based PDF splitting with confidence filtering
Quality Control: Skips fragments with confidence < 0.3
Error Handling: Retry logic for file system issues
Fallback: Basic page splitting with fragment detection
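
The confidence filter can be sketched in a few lines. The `Boundary` type and `split_by_confidence` function are hypothetical names; only the 0.3 threshold comes from the description above.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.3  # threshold stated in the Quality Control note


@dataclass
class Boundary:
    start_page: int
    end_page: int
    confidence: float


def split_by_confidence(boundaries: list[Boundary]) -> tuple[list[Boundary], list[Boundary]]:
    """Separate boundaries into those to generate and those skipped as fragments."""
    kept = [b for b in boundaries if b.confidence >= MIN_CONFIDENCE]
    skipped = [b for b in boundaries if b.confidence < MIN_CONFIDENCE]
    return kept, skipped
```

Keeping the skipped list, rather than discarding it, gives the validation stage the information it needs to account for pages that were intentionally left out.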

6. File Organization 📁

Purpose: Apply naming conventions and organize outputs
Processing: Filename generation, directory structure
Error Handling: Retry logic for file operations
Fallback: Simple incremental naming

7. Output Validation ✅

Purpose: Verify integrity of generated files
Validation: Page count, file size, content sampling
Fragment Handling: Adjusts validation for skipped fragments
Error Handling: Quarantine for validation failures
Fallback: None (quality gate)
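
One way to picture the fragment adjustment is a page-count check that subtracts intentionally skipped pages before comparing. This is a sketch under assumptions: `validate_page_count` is a hypothetical name, and the real validator also checks file size and content samples.

```python
def validate_page_count(
    input_pages: int,
    output_page_counts: list[int],
    skipped_fragment_pages: int = 0,
) -> tuple[bool, str]:
    """Check that the generated files account for every non-skipped input page."""
    expected = input_pages - skipped_fragment_pages
    actual = sum(output_page_counts)
    if actual == expected:
        return True, f"page count OK ({actual} pages)"
    return False, f"expected {expected} pages, found {actual}"
```

Without the adjustment, a 20-page input with a 2-page skipped fragment would fail validation even though the output is correct.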

8. Paperless Upload 📤

Purpose: Upload to document management system
Processing: API upload with metadata application
Error Detection: Automatic analysis of processing issues and failures 🔍
Error Tagging: Apply error tags to documents with processing issues 🏷️
Input Document Tagging: Mark source documents as processed (if source_document_id provided)
Error Handling: Retry logic for network failures, graceful degradation for tagging failures
Fallback: Local storage with upload notification

Error Detection and Tagging Sub-System

Error Detection 🔍:

Analyzes workflow state for processing errors (LLM failures, boundary issues, PDF problems)
Detects low-confidence boundaries, suspicious patterns, and metadata extraction failures
Configurable severity thresholds and error type filtering
Supports 6 error categories: API failures, boundary issues, PDF processing, metadata extraction, validation failures, and file output errors

Error Tagging 🏷️:

Automatically applies configurable error tags to documents with detected issues
Supports both individual and batch tagging modes
Severity-based tag application (medium, high, critical errors)
Rollback capability on tagging failures to maintain data integrity
Comprehensive audit logging for all tagging operations

Error Detection and Tagging Workflow

The error detection and tagging system automatically identifies and tags documents that encountered processing issues during workflow execution.

flowchart TD
    Upload[Documents Successfully<br/>Uploaded to Paperless] --> ErrorDetection[Error Detection<br/>🔍 Analyze Workflow State]

    ErrorDetection --> AnalyzeState{Analyze<br/>Processing State}
    AnalyzeState --> LLMCheck[Check LLM Analysis<br/>Failures]
    AnalyzeState --> BoundaryCheck[Check Boundary<br/>Detection Issues]
    AnalyzeState --> PDFCheck[Check PDF<br/>Processing Errors]
    AnalyzeState --> MetadataCheck[Check Metadata<br/>Extraction Problems]
    AnalyzeState --> ValidationCheck[Check Validation<br/>Failures]
    AnalyzeState --> OutputCheck[Check File Output<br/>Issues]

    LLMCheck --> ErrorsFound{Errors<br/>Detected?}
    BoundaryCheck --> ErrorsFound
    PDFCheck --> ErrorsFound
    MetadataCheck --> ErrorsFound
    ValidationCheck --> ErrorsFound
    OutputCheck --> ErrorsFound

    ErrorsFound -->|❌ No Errors| InputTagging[Continue to Input<br/>Document Tagging]
    ErrorsFound -->|✅ Errors Found| SeverityFilter{Meets Severity<br/>Threshold?}

    SeverityFilter -->|❌ Below Threshold| InputTagging
    SeverityFilter -->|✅ Above Threshold| ConfigCheck{Error Detection<br/>Enabled?}

    ConfigCheck -->|❌ Disabled| InputTagging
    ConfigCheck -->|✅ Enabled| TagsConfigured{Error Tags<br/>Configured?}

    TagsConfigured -->|❌ No Tags| InputTagging
    TagsConfigured -->|✅ Tags Available| BatchMode{Batch Tagging<br/>Mode?}

    BatchMode -->|✅ Batch Mode| BatchTagging[Apply Error Tags<br/>to All Documents]
    BatchMode -->|❌ Individual Mode| IndividualTagging[Apply Error Tags<br/>to Each Document]

    BatchTagging --> TaggingResult{All Tagging<br/>Operations Successful?}
    IndividualTagging --> TaggingResult

    TaggingResult -->|✅ Success| TaggingSuccess[Log Successful<br/>Error Tagging]
    TaggingResult -->|❌ Partial/Failed| TaggingWarning[Log Tagging Warning<br/>⚠️ Continue Processing]

    TaggingSuccess --> InputTagging
    TaggingWarning --> RollbackCheck{Rollback<br/>Required?}

    RollbackCheck -->|✅ Yes| RollbackTags[Rollback Failed<br/>Tag Operations]
    RollbackCheck -->|❌ No| InputTagging

    RollbackTags --> InputTagging

    InputTagging --> Complete[Continue with<br/>Input Document Processing]

    %% Error Type Details
    LLMCheck -->|API Failures| ErrorType1[error:llm<br/>error:api-failure]
    BoundaryCheck -->|Low Confidence| ErrorType2[error:confidence<br/>error:boundary]
    PDFCheck -->|Processing Issues| ErrorType3[error:pdf<br/>error:processing]
    MetadataCheck -->|Extraction Failures| ErrorType4[error:metadata<br/>error:extraction]
    ValidationCheck -->|Validation Issues| ErrorType5[error:validation]
    OutputCheck -->|File Issues| ErrorType6[error:output<br/>error:file-system]

    ErrorType1 --> SeverityFilter
    ErrorType2 --> SeverityFilter
    ErrorType3 --> SeverityFilter
    ErrorType4 --> SeverityFilter
    ErrorType5 --> SeverityFilter
    ErrorType6 --> SeverityFilter

    %% Styling
    classDef detectionStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    classDef taggingStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef errorTypeStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    classDef warningStyle fill:#fff8e1,stroke:#f57c00,stroke-width:2px,color:#000

    class ErrorDetection,LLMCheck,BoundaryCheck,PDFCheck,MetadataCheck,ValidationCheck,OutputCheck detectionStyle
    class BatchTagging,IndividualTagging,TaggingSuccess,Complete taggingStyle
    class AnalyzeState,ErrorsFound,SeverityFilter,ConfigCheck,TagsConfigured,BatchMode,TaggingResult,RollbackCheck decisionStyle
    class ErrorType1,ErrorType2,ErrorType3,ErrorType4,ErrorType5,ErrorType6 errorTypeStyle
    class TaggingWarning,RollbackTags warningStyle

Error Types and Tags

| Error Category | Detection Criteria | Applied Tags | Severity |
| --- | --- | --- | --- |
| LLM Analysis Failure | API errors, model failures, fallback usage | `error:llm`, `error:api-failure` | High |
| Boundary Detection Issues | Low confidence boundaries, suspicious patterns | `error:confidence`, `error:boundary` | Medium |
| PDF Processing Errors | File corruption, access issues, format problems | `error:pdf`, `error:processing` | High |
| Metadata Extraction Failure | Missing account data, date parsing issues | `error:metadata`, `error:extraction` | Medium |
| Validation Failures | Content validation, integrity checks | `error:validation` | Medium-High |
| File Output Issues | Write failures, permissions, disk space | `error:output`, `error:file-system` | Critical |
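
The mapping in this table lends itself to a small lookup with severity filtering. The sketch below is illustrative only: the category keys, the `tags_for` function, and the decision to record "Medium-High" as `high` are assumptions, not the project's data model.

```python
# Tags and severities copied from the table above. The table lists validation
# failures as "Medium-High"; recording that as "high" here is an assumption.
ERROR_TAGS: dict[str, tuple[str, list[str]]] = {
    "llm_analysis": ("high", ["error:llm", "error:api-failure"]),
    "boundary_detection": ("medium", ["error:confidence", "error:boundary"]),
    "pdf_processing": ("high", ["error:pdf", "error:processing"]),
    "metadata_extraction": ("medium", ["error:metadata", "error:extraction"]),
    "validation": ("high", ["error:validation"]),
    "file_output": ("critical", ["error:output", "error:file-system"]),
}


def tags_for(categories, enabled_levels, base_tags=()):
    """Collect tags for detected categories whose severity is enabled.

    Base tags are added only when at least one category qualifies, and
    duplicates are removed while preserving order.
    """
    tags: list[str] = []
    for category in categories:
        severity, category_tags = ERROR_TAGS[category]
        if severity in enabled_levels:
            tags.extend(category_tags)
    if tags:
        tags = list(base_tags) + tags
    return list(dict.fromkeys(tags))
```

For example, with only `high` and `critical` enabled, a low-confidence boundary alone produces no tags, while a file output failure produces `error:output` and `error:file-system`.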

Configuration Options

PAPERLESS_ERROR_DETECTION_ENABLED: Enable/disable error detection system
PAPERLESS_ERROR_TAGS: Base error tags to apply to all error documents
PAPERLESS_ERROR_TAG_THRESHOLD: Confidence threshold for boundary error detection
PAPERLESS_ERROR_SEVERITY_LEVELS: Error severity levels that trigger tagging
PAPERLESS_ERROR_BATCH_TAGGING: Use batch mode vs individual document tagging
PAPERLESS_TAG_WAIT_TIME: Wait time between tagging operations
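
As a rough illustration of how these options might be read, the sketch below loads them from the environment into a typed object. The variable names come from the list above; the value formats (comma-separated lists, `true`/`false` strings), the defaults, and the `ErrorTaggingConfig` class are assumptions, and the unit of `PAPERLESS_TAG_WAIT_TIME` is not specified here.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_list(value: str) -> list[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass(frozen=True)
class ErrorTaggingConfig:
    enabled: bool
    tags: list[str]
    tag_threshold: float
    severity_levels: list[str]
    batch_tagging: bool
    tag_wait_time: float


def load_error_tagging_config(env: Optional[Mapping[str, str]] = None) -> ErrorTaggingConfig:
    """Build the config from `env` (defaults to os.environ). Defaults are illustrative."""
    env = os.environ if env is None else env
    return ErrorTaggingConfig(
        enabled=_as_bool(env.get("PAPERLESS_ERROR_DETECTION_ENABLED", "false")),
        tags=_as_list(env.get("PAPERLESS_ERROR_TAGS", "")),
        tag_threshold=float(env.get("PAPERLESS_ERROR_TAG_THRESHOLD", "0.5")),
        severity_levels=_as_list(env.get("PAPERLESS_ERROR_SEVERITY_LEVELS", "")),
        batch_tagging=_as_bool(env.get("PAPERLESS_ERROR_BATCH_TAGGING", "false")),
        tag_wait_time=float(env.get("PAPERLESS_TAG_WAIT_TIME", "0")),
    )
```

Accepting an explicit mapping makes the loader easy to exercise in tests without touching the process environment.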

Error Handling Strategies

Error Classification

flowchart LR
    Error[Processing Error] --> Classification{Error Type}

    Classification -->|Network/API| Recoverable[Recoverable Error<br/>🔄 Retry Logic]
    Classification -->|File System| Recoverable
    Classification -->|Temporary| Recoverable

    Classification -->|Invalid Format| Critical[Critical Error<br/>🚫 Immediate Quarantine]
    Classification -->|Corruption| Critical
    Classification -->|Access Denied| Critical

    Classification -->|Validation| ValidationError[Validation Error<br/>⚠️ Configurable Response]

    Recoverable --> RetryCount{Retry Count<br/>< Max?}
    RetryCount -->|Yes| Delay[Exponential Backoff<br/>with Jitter]
    RetryCount -->|No| Quarantine[Move to<br/>Quarantine]

    Delay --> RateLimitCheck{Rate Limit<br/>Exceeded?}
    RateLimitCheck -->|Yes| BackoffDelay[Apply Backoff<br/>Strategy]
    RateLimitCheck -->|No| RetryProcess[Retry<br/>Processing]

    BackoffDelay --> RetryProcess

    Critical --> Quarantine
    ValidationError --> StrictnessCheck{Validation<br/>Strictness}

    StrictnessCheck -->|Strict| Quarantine
    StrictnessCheck -->|Normal| Warning[Log Warning<br/>Continue Processing]
    StrictnessCheck -->|Lenient| Warning

    Quarantine --> ErrorReport[Generate Error Report<br/>📋 Recovery Suggestions]

    classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px
    classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class Recoverable,ValidationError,Warning successStyle
    class Critical,Quarantine,ErrorReport quarantineStyle
    class Classification,RetryCount,StrictnessCheck decisionStyle
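
The classification step in this diagram could be approximated by mapping exception types to the three error classes. The sketch is an assumption-laden illustration: `InvalidPDFError`, `ValidationIssue`, and `classify` are hypothetical names, and treating unknown exceptions as critical is a conservative choice made here, not documented behaviour.

```python
from enum import Enum


class ErrorClass(Enum):
    RECOVERABLE = "recoverable"  # retry with backoff
    CRITICAL = "critical"        # quarantine immediately
    VALIDATION = "validation"    # response depends on strictness


class InvalidPDFError(ValueError):
    """Hypothetical error for invalid format or corrupted input."""


class ValidationIssue(Exception):
    """Hypothetical error raised by output validation."""


def classify(exc: BaseException) -> ErrorClass:
    if isinstance(exc, ValidationIssue):
        return ErrorClass.VALIDATION
    # PermissionError is a subclass of OSError, so it must be checked first.
    if isinstance(exc, (PermissionError, InvalidPDFError)):
        return ErrorClass.CRITICAL
    # Remaining OSErrors cover network (ConnectionError, TimeoutError)
    # and general file system failures.
    if isinstance(exc, OSError):
        return ErrorClass.RECOVERABLE
    return ErrorClass.CRITICAL  # unknown errors: fail safe
```

The ordering of the checks matters because Python's built-in network and permission errors all derive from `OSError`.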

Validation Strictness Levels

| Level | Description | Behavior | Use Case |
| --- | --- | --- | --- |
| Strict | All validation issues are errors | Fail fast, quarantine immediately | Production financial processing |
| Normal | Balanced validation approach | Warnings for minor issues, errors for critical ones | General business use |
| Lenient | Minimal validation blocking | Continue processing with warnings | Exploratory processing |
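
The three levels can be read as a small decision function. This sketch follows the table; it assumes that an "error" at the Normal level means quarantine, which the table does not state explicitly. The `Strictness` enum and `respond_to_validation_issue` are hypothetical names.

```python
from enum import Enum


class Strictness(Enum):
    STRICT = "strict"
    NORMAL = "normal"
    LENIENT = "lenient"


def respond_to_validation_issue(strictness: Strictness, critical: bool) -> str:
    """Return "quarantine" or "warn" for a validation issue."""
    if strictness is Strictness.STRICT:
        return "quarantine"  # every issue is an error
    if strictness is Strictness.NORMAL and critical:
        return "quarantine"  # only critical issues block processing
    return "warn"            # log and continue
```

A call such as `respond_to_validation_issue(Strictness(os.environ["VALIDATION_STRICTNESS"]), critical=False)` would tie this to the `VALIDATION_STRICTNESS` setting, assuming that variable holds one of the three lowercase values.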

Configuration Impact

Environment Variables Affecting Workflow

graph TD
    Config[Configuration] --> Processing[Processing Behavior]
    Config --> ErrorHandling[Error Handling]
    Config --> Integration[Integrations]

    Processing --> API[OPENAI_API_KEY<br/>LLM_MODEL<br/>LLM_TEMPERATURE]
    Processing --> Files[MAX_FILE_SIZE_MB<br/>CHUNK_SIZE<br/>CHUNK_OVERLAP]
    Processing --> Output[DEFAULT_OUTPUT_DIR<br/>FILENAME_PATTERN<br/>DATE_FORMAT]

    ErrorHandling --> Validation[VALIDATION_STRICTNESS<br/>REQUIRE_TEXT_CONTENT<br/>MIN_PAGES_PER_STATEMENT]
    ErrorHandling --> Quarantine[QUARANTINE_DIRECTORY<br/>AUTO_QUARANTINE_CRITICAL_FAILURES<br/>MAX_RETRY_ATTEMPTS]
    ErrorHandling --> Backoff[OPENAI_REQUESTS_PER_MINUTE<br/>OPENAI_BURST_LIMIT<br/>OPENAI_BACKOFF_MIN<br/>OPENAI_BACKOFF_MAX]
    ErrorHandling --> Reporting[ENABLE_ERROR_REPORTING<br/>ERROR_REPORT_DIRECTORY<br/>PRESERVE_FAILED_OUTPUTS]

    Integration --> Paperless[PAPERLESS_ENABLED<br/>PAPERLESS_URL<br/>PAPERLESS_TOKEN<br/>PAPERLESS_INPUT_TAGGING_ENABLED<br/>PAPERLESS_INPUT_PROCESSED_TAG<br/>PAPERLESS_ERROR_DETECTION_ENABLED<br/>PAPERLESS_ERROR_TAGS<br/>PAPERLESS_ERROR_TAG_THRESHOLD<br/>PAPERLESS_ERROR_SEVERITY_LEVELS<br/>PAPERLESS_ERROR_BATCH_TAGGING]
    Integration --> Logging[ENABLE_AUDIT_LOGGING<br/>LOG_LEVEL<br/>LOG_FILE]

    classDef configStyle fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef categoryStyle fill:#f1f8e9,stroke:#558b2f,stroke-width:2px

    class Config configStyle
    class Processing,ErrorHandling,Integration categoryStyle

Performance Characteristics

Processing Time Factors

Document Size: Larger documents require more processing time
AI Analysis: API calls add latency but improve accuracy
Statement Count: More statements increase processing complexity
Network Latency: Affects API calls and Paperless uploads
Rate Limiting: Backoff delays when hitting API limits (see Backoff Mechanisms)
Retry Logic: Failed operations with exponential backoff increase total processing time
Validation Level: Strict validation adds processing overhead

Typical Performance Metrics

| Document Type | Processing Time | Memory Usage | Accuracy |
| --- | --- | --- | --- |
| Single Statement (5 pages) | 2-5 seconds | <100MB | 98% |
| Multi-Statement (20 pages) | 10-30 seconds | 200-400MB | 95% |
| Large Document (50+ pages) | 1-5 minutes | 500MB+ | 93% |

Monitoring and Observability

Key Metrics to Monitor

pie title Processing Metrics
    "Successful Processing" : 80
    "Quarantined (Validation)" : 8
    "Quarantined (Critical)" : 4
    "Partial Success" : 3
    "Rate Limited (Backoff)" : 5

Backoff-Specific Metrics

Rate Limit Hits: Frequency of rate limit encounters
Backoff Delays: Average and maximum backoff times
Retry Success Rate: Percentage of retries that succeed
Burst Token Usage: Current burst token levels
API Request Patterns: Requests per minute over time
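
Burst token levels and rate limit hits are the natural outputs of a token bucket limiter. The sketch below shows the general technique, not the project's limiter: the `TokenBucket` class is hypothetical, and mapping its parameters to `OPENAI_REQUESTS_PER_MINUTE` and `OPENAI_BURST_LIMIT` is an assumption.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst` requests while enforcing an average rate."""

    def __init__(self, rate_per_minute: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)          # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; False counts as a rate limit hit."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Reading `bucket.tokens` after a refill gives the current burst token level, and counting `False` results from `try_acquire()` gives the rate limit hit frequency.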

Logging and Audit Trail

Processing Logs: Detailed execution traces
Audit Logs: Security and compliance tracking
Error Reports: Structured failure analysis
Performance Metrics: Processing time and resource usage

Recovery and Maintenance

Automated Recovery

Retry Logic: Automatic retry with exponential backoff and jitter
Rate Limiting: Token bucket rate limiting with configurable burst capacity
Fallback Processing: Pattern matching when AI unavailable
Partial Success Handling: Continue processing despite non-critical failures
Backoff Strategy: Configurable delays with jitter to prevent thundering herd
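
Retry with exponential backoff and jitter can be sketched as below. This uses the common "full jitter" variant, which may differ from what the project does; `retry_with_backoff` and `backoff_delay` are hypothetical names, and treating `min_delay`/`max_delay` as counterparts of `OPENAI_BACKOFF_MIN`/`OPENAI_BACKOFF_MAX` and `max_attempts` as `MAX_RETRY_ATTEMPTS` is an assumption.

```python
import random
import time


def backoff_delay(attempt: int, min_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Full jitter: a random delay in [0, min(max_delay, min_delay * 2**attempt)]."""
    ceiling = min(max_delay, min_delay * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry_with_backoff(operation, max_attempts: int = 3, min_delay: float = 1.0,
                       max_delay: float = 60.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The final failure is re-raised so the caller can quarantine the document.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, min_delay, max_delay))
```

Randomizing the delay spreads retries from concurrent workers over time, which is what prevents the thundering herd effect mentioned above.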

Manual Recovery

Quarantine Review: Regular review of failed documents
Configuration Tuning: Adjust validation strictness based on patterns
Batch Reprocessing: Process recovered documents in batches

Maintenance Operations

Quarantine Cleanup: Automated removal of old failed documents
Log Rotation: Prevent log files from consuming excessive disk space
Performance Monitoring: Track processing metrics over time

Workflow Integration Summary

The Bank Statement Separator implements two complementary workflow architectures:

Application Processing Workflow (This Document)

8-node LangGraph pipeline for PDF processing
Comprehensive error handling with quarantine system
AI-powered analysis with pattern-matching fallback
Rate limiting and backoff mechanisms for API calls
Audit logging and compliance tracking

CI/CD Pipeline Workflow (GitHub Workflows)

5 interconnected GitHub Actions workflows
Automated testing with Python matrix (3.11, 3.12)
Release automation using conventional commits
Security scanning and dependency review
Documentation versioning with mike deployment

Integration Points

Configuration: Environment variables control both processing behavior and CI/CD settings
Testing: CI workflows validate the processing pipeline functionality
Releases: Automated releases deploy both code and documentation updates
Monitoring: Both systems provide comprehensive logging and error reporting

This dual-workflow architecture is intended to provide:

Robust Processing: Reliable document processing with fallback mechanisms
Quality Assurance: Automated testing and security scanning
Continuous Delivery: Automated releases and documentation updates
Comprehensive Monitoring: Full visibility into both processing and deployment workflows

For detailed information about the rate limiting and backoff mechanisms, see the Backoff Mechanisms Design Document.