
Workflow Architecture Overview

Comprehensive overview of the 8-node LangGraph workflow with error handling and recovery mechanisms.

System Architecture

The Bank Statement Separator uses two complementary workflow systems:

  1. Application Processing Workflow: A sophisticated 8-node LangGraph pipeline for PDF processing with error handling and recovery
  2. CI/CD Pipeline Workflow: GitHub Actions workflows for automated testing, releasing, and documentation deployment

Complete Workflow Documentation

This document focuses on the application processing workflow. For comprehensive documentation of the CI/CD workflows including release automation, testing, and documentation deployment, see GitHub Workflows Architecture.

Application Processing Workflow

The core PDF processing uses an 8-node LangGraph pipeline with comprehensive error handling and recovery systems.
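The node order can be sketched in plain Python. This is not LangGraph code and not the project's implementation: the node identifiers, the dictionary-based state, and the `error` key are assumptions made for illustration, and the retry, fallback, and quarantine branches are left out.

```python
from typing import Callable

State = dict
Node = Callable[[State], State]

# Hypothetical identifiers for the eight nodes, in execution order.
NODE_ORDER = [
    "pdf_ingestion",
    "document_analysis",
    "statement_detection",
    "metadata_extraction",
    "pdf_generation",
    "file_organization",
    "output_validation",
    "paperless_upload",
]


def run_pipeline(state: State, nodes: dict[str, Node]) -> State:
    """Run each node in order and stop at the first node that reports an error."""
    for name in NODE_ORDER:
        state = nodes[name](state)
        # Record progress without mutating the list held by the previous state.
        state = {**state, "completed": [*state.get("completed", []), name]}
        if state.get("error"):
            break
    return state
```

A real graph would route failures to retry or quarantine handlers instead of simply stopping.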

Complete Workflow Diagram

flowchart TD
    Start([PDF Input File]) --> PreValidation{Pre-Processing<br/>Validation}

    PreValidation -->|✅ Valid| Node1[1. PDF Ingestion<br/>📄 Load & Validate]
    PreValidation -->|❌ Invalid| QuarantinePreValidation[Quarantine:<br/>Pre-validation Failure]

    Node1 --> Node1Error{Processing<br/>Error?}
    Node1Error -->|✅ Success| Node2[2. Document Analysis<br/>📊 Extract Text & Chunk]
    Node1Error -->|❌ Error| RetryLogic1{Retry<br/>Logic}

    Node2 --> Node2Error{Processing<br/>Error?}
    Node2Error -->|✅ Success| Node3[3. Statement Detection<br/>🤖 AI Boundary Analysis]
    Node2Error -->|❌ Error| RetryLogic2{Retry<br/>Logic}

    Node3 --> Node3Error{AI Available?}
    Node3Error -->|✅ Success| Node4[4. Metadata Extraction<br/>🏷️ Account, Date, Bank]
    Node3Error -->|❌ API Failure| FallbackMode[Fallback Mode:<br/>Pattern Matching]
    FallbackMode --> Node4

    Node4 --> Node4Error{Processing<br/>Error?}
    Node4Error -->|✅ Success| Node5[5. PDF Generation<br/>📋 Create Separate Files]
    Node4Error -->|❌ Error| RetryLogic4{Retry<br/>Logic}

    Node5 --> Node5Error{Processing<br/>Error?}
    Node5Error -->|✅ Success| Node6[6. File Organization<br/>📁 Apply Naming & Structure]
    Node5Error -->|❌ Error| RetryLogic5{Retry<br/>Logic}

    Node6 --> Node6Error{Processing<br/>Error?}
    Node6Error -->|✅ Success| Node7[7. Output Validation<br/>✅ Integrity Checking]
    Node6Error -->|❌ Error| RetryLogic6{Retry<br/>Logic}

    Node7 --> ValidationResult{Validation<br/>Result}
    ValidationResult -->|✅ Valid| Node8[8. Paperless Upload<br/>📤 Document Management]
    ValidationResult -->|❌ Failed| QuarantineValidation[Quarantine:<br/>Validation Failure]

    Node8 --> Node8Error{Upload<br/>Success?}
    Node8Error -->|✅ Success| ErrorDetection[Error Detection<br/>🔍 Analyze Processing Issues]
    Node8Error -->|❌ Upload Failed| RetryLogic8{Retry<br/>Logic}

    ErrorDetection --> ErrorsFound{Processing<br/>Errors Detected?}
    ErrorsFound -->|✅ Errors Found| ErrorTagging[Error Tagging<br/>🏷️ Apply Error Tags]
    ErrorsFound -->|❌ No Errors| InputTagging{Source Document<br/>ID Available?}

    ErrorTagging --> TaggingResult{Tagging<br/>Success?}
    TaggingResult -->|✅ Success| InputTagging
    TaggingResult -->|❌ Failed| TaggingWarning[Log Tagging Warning<br/>⚠️ Continue Processing]
    TaggingWarning --> InputTagging

    InputTagging -->|✅ Yes| TagInput[Tag Input Document<br/>🏷️ Mark as Processed]
    InputTagging -->|❌ No| ProcessedFiles[Move to Processed<br/>📂 Archive Input]

    TagInput --> TagResult{Tagging<br/>Success?}
    TagResult -->|✅ Success| ProcessedFiles
    TagResult -->|❌ Failed| TagWarning[Log Warning<br/>⚠️ Continue Processing]
    TagWarning --> ProcessedFiles

    ProcessedFiles --> Success([✅ Processing Complete<br/>📊 Generate Report])

    %% Retry Logic Flows with Backoff
    RetryLogic1 -->|Retry with Backoff| Node1
    RetryLogic1 -->|Max Retries Exceeded| QuarantineCritical[Quarantine:<br/>Critical Failure]
    RetryLogic2 -->|Retry with Backoff| Node2
    RetryLogic2 -->|Max Retries Exceeded| QuarantineCritical
    RetryLogic4 -->|Retry with Backoff| Node4
    RetryLogic4 -->|Max Retries Exceeded| QuarantineCritical
    RetryLogic5 -->|Retry with Backoff| Node5
    RetryLogic5 -->|Max Retries Exceeded| QuarantineCritical
    RetryLogic6 -->|Retry with Backoff| Node6
    RetryLogic6 -->|Max Retries Exceeded| QuarantineCritical
    RetryLogic8 -->|Retry with Backoff| Node8
    RetryLogic8 -->|Max Retries Exceeded| PartialSuccess[Partial Success:<br/>Files Created, Upload Failed]

    %% Quarantine System
    QuarantinePreValidation --> ErrorReport1[Generate Error Report<br/>📋 Recovery Suggestions]
    QuarantineCritical --> ErrorReport2[Generate Error Report<br/>📋 Recovery Suggestions]
    QuarantineValidation --> ErrorReport3[Generate Error Report<br/>📋 Recovery Suggestions]

    ErrorReport1 --> QuarantineDir[(🗂️ Quarantine Directory<br/>Failed Documents)]
    ErrorReport2 --> QuarantineDir
    ErrorReport3 --> QuarantineDir

    PartialSuccess --> PartialReport[Generate Partial Report<br/>⚠️ Upload Issue Noted]
    PartialReport --> Success

    %% Monitoring and Management
    QuarantineDir --> QuarantineManagement[Quarantine Management<br/>🧹 CLI Tools]
    QuarantineManagement --> QuarantineClean[Periodic Cleanup<br/>🗑️ Remove Old Files]
    QuarantineManagement --> QuarantineAnalysis[Error Analysis<br/>📈 Pattern Detection]

    %% Styling
    classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000
    classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000

    class Node1,Node2,Node3,Node4,Node5,Node6,Node7,Node8,TagInput,ErrorDetection,ErrorTagging nodeStyle
    class Node1Error,Node2Error,Node3Error,Node4Error,Node5Error,Node6Error,Node8Error,ValidationResult decisionStyle
    class PreValidation,RetryLogic1,RetryLogic2,RetryLogic4,RetryLogic5,RetryLogic6,RetryLogic8,InputTagging,TagResult,ErrorsFound,TaggingResult decisionStyle
    class QuarantinePreValidation,QuarantineCritical,QuarantineValidation,ErrorReport1,ErrorReport2,ErrorReport3,QuarantineDir quarantineStyle
    class FallbackMode,PartialSuccess,PartialReport,TagWarning,TaggingWarning errorStyle
    class ProcessedFiles,Success,QuarantineClean,QuarantineAnalysis successStyle

Workflow Nodes Detailed

1. PDF Ingestion 📄

  • Purpose: Load and validate input PDF files
  • Validation: File format, size, accessibility, password protection
  • Error Handling: Pre-validation quarantine for invalid files
  • Fallback: None (critical failure point)
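The pre-validation checks listed above could look roughly like the following. This is a minimal sketch using only the standard library; the function name, the default size limit, and the message strings are assumptions, and password-protection detection is omitted because it needs a PDF library.

```python
from pathlib import Path

MAX_FILE_SIZE_MB = 100  # assumed default; the real limit comes from configuration


def pre_validate(path: Path, max_size_mb: int = MAX_FILE_SIZE_MB) -> list[str]:
    """Return a list of problems; an empty list means the file may proceed."""
    if not path.is_file():
        return ["file does not exist or is not a regular file"]
    problems = []
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > max_size_mb:
        problems.append(f"file is {size_mb:.1f} MB, limit is {max_size_mb} MB")
    try:
        with path.open("rb") as handle:
            header = handle.read(5)
    except OSError as exc:
        return problems + [f"file is not readable: {exc}"]
    # Every PDF file begins with the "%PDF-" marker.
    if header != b"%PDF-":
        problems.append("file does not start with the %PDF- header")
    return problems
```

A non-empty result would send the file to the pre-validation quarantine rather than into node 1.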

2. Document Analysis 📊

  • Purpose: Extract text content and create processing chunks
  • Processing: Text extraction, chunk creation with overlap
  • Error Handling: Retry logic for temporary failures
  • Fallback: Basic text extraction methods
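Chunk creation with overlap can be illustrated with a character-based splitter. The project may chunk by tokens or pages instead; the function name and the character-based approach are assumptions, with `chunk_size` and `overlap` standing in for the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps text that straddles a chunk edge, such as a statement header, visible in both neighbouring chunks.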

3. Statement Detection 🤖

  • Purpose: Identify statement boundaries using AI analysis
  • AI Processing: OpenAI GPT models for intelligent detection
  • Error Handling: Automatic fallback to enhanced pattern matching
  • Fallback: Enhanced pattern-based detection with fragment filtering
  • Fragment Detection: Identifies and excludes low-confidence document fragments
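The AI-first, pattern-second behaviour can be sketched as below. The regular expression is a deliberately simple stand-in for the project's enhanced pattern matching, the detector is passed in as a callable so no real API is assumed, and all names here are hypothetical.

```python
import re
from typing import Callable

# Hypothetical marker: many statements print "Page 1 of N" on their first page.
FIRST_PAGE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def pattern_boundaries(pages: list[str]) -> list[int]:
    """Return indices of pages that look like the start of a statement."""
    starts = [i for i, text in enumerate(pages) if FIRST_PAGE.search(text)]
    # With no marker at all, treat the whole document as one statement.
    return starts or [0]


def detect_boundaries(
    pages: list[str],
    ai_detector: Callable[[list[str]], list[int]],
) -> tuple[list[int], str]:
    """Try the AI detector first; fall back to patterns if it raises."""
    try:
        return ai_detector(pages), "ai"
    except Exception:  # broad on purpose: any API failure triggers the fallback
        return pattern_boundaries(pages), "pattern"
```

Returning the method alongside the boundaries lets later stages record that the fallback was used.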

4. Metadata Extraction 🏷️

  • Purpose: Extract account numbers, dates, and bank names
  • Processing: AI-powered metadata identification
  • Error Handling: Retry logic with graceful degradation
  • Fallback: Pattern-based extraction

5. PDF Generation 📋

  • Purpose: Create separate PDF files for each statement
  • Processing: Page-based PDF splitting with confidence filtering
  • Quality Control: Skips fragments with confidence < 0.3
  • Error Handling: Retry logic for file system issues
  • Fallback: Basic page splitting with fragment detection
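The confidence filter can be shown in a few lines. The `Segment` type and the function name are assumptions; only the 0.3 threshold comes from the list above.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.3  # threshold stated in the Quality Control item above


@dataclass
class Segment:
    start_page: int
    end_page: int
    confidence: float


def split_by_confidence(
    segments: list[Segment],
    min_confidence: float = MIN_CONFIDENCE,
) -> tuple[list[Segment], list[Segment]]:
    """Separate segments to write as PDFs from fragments to skip."""
    kept = [s for s in segments if s.confidence >= min_confidence]
    skipped = [s for s in segments if s.confidence < min_confidence]
    return kept, skipped
```

Keeping the skipped list, rather than discarding it, gives the validation step what it needs to adjust its expected page count.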

6. File Organization 📁

  • Purpose: Apply naming conventions and organize outputs
  • Processing: Filename generation, directory structure
  • Error Handling: Retry logic for file operations
  • Fallback: Simple incremental naming
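Filename generation with an incremental fallback might look like this. The naming scheme shown (bank, last four account digits, date) is purely illustrative and is not the project's `FILENAME_PATTERN`; the function name is also an assumption.

```python
import re
from typing import Optional


def build_filename(
    bank: Optional[str],
    account: Optional[str],
    date: Optional[str],
    index: int,
) -> str:
    """Build a descriptive name when metadata is complete, else an incremental one."""
    if bank and account and date:
        # Collapse anything that is not a letter or digit into single hyphens.
        safe_bank = re.sub(r"[^A-Za-z0-9]+", "-", bank).strip("-").lower()
        return f"{safe_bank}_{account[-4:]}_{date}.pdf"
    # Fallback: simple incremental naming.
    return f"statement_{index:03d}.pdf"
```

For example, `build_filename("Acme Bank", "123456789", "2024-01-31", 1)` yields `acme-bank_6789_2024-01-31.pdf`, while missing metadata yields `statement_001.pdf`.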

7. Output Validation ✅

  • Purpose: Verify integrity of generated files
  • Validation: Page count, file size, content sampling
  • Fragment Handling: Adjusts validation for skipped fragments
  • Error Handling: Quarantine for validation failures
  • Fallback: None (quality gate)
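The page-count check with the fragment adjustment reduces to simple arithmetic. The sketch below assumes that every input page ends up either in exactly one output file or in a skipped fragment; the function name and parameters are hypothetical.

```python
def page_counts_match(
    input_pages: int,
    output_page_counts: list[int],
    skipped_fragment_pages: int = 0,
) -> bool:
    """Check that output pages plus skipped fragment pages account for the input."""
    if not 0 <= skipped_fragment_pages <= input_pages:
        raise ValueError("skipped_fragment_pages must be between 0 and input_pages")
    expected = input_pages - skipped_fragment_pages
    return sum(output_page_counts) == expected
```

A 20-page input split into files of 8 and 10 pages fails the check unless 2 pages were skipped as fragments.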

8. Paperless Upload 📤

  • Purpose: Upload to document management system
  • Processing: API upload with metadata application
  • Error Detection: Automatic analysis of processing issues and failures 🔍
  • Error Tagging: Apply error tags to documents with processing issues 🏷️
  • Input Document Tagging: Mark source documents as processed (if source_document_id provided)
  • Error Handling: Retry logic for network failures, graceful degradation for tagging failures
  • Fallback: Local storage with upload notification

Error Detection and Tagging Sub-System

Error Detection 🔍:

  • Analyzes workflow state for processing errors (LLM failures, boundary issues, PDF problems)
  • Detects low-confidence boundaries, suspicious patterns, and metadata extraction failures
  • Configurable severity thresholds and error type filtering
  • Supports 6 error categories: API failures, boundary issues, PDF processing, metadata extraction, validation failures, and file output errors

Error Tagging 🏷️:

  • Automatically applies configurable error tags to documents with detected issues
  • Supports both individual and batch tagging modes
  • Severity-based tag application (medium, high, critical errors)
  • Rollback capability on tagging failures to maintain data integrity
  • Comprehensive audit logging for all tagging operations

Error Detection and Tagging Workflow

The error detection and tagging system provides automatic identification and tagging of documents that encountered processing issues during the workflow execution.

flowchart TD
    Upload[Documents Successfully<br/>Uploaded to Paperless] --> ErrorDetection[Error Detection<br/>🔍 Analyze Workflow State]

    ErrorDetection --> AnalyzeState{Analyze<br/>Processing State}

    AnalyzeState --> LLMCheck[Check LLM Analysis<br/>Failures]
    AnalyzeState --> BoundaryCheck[Check Boundary<br/>Detection Issues]
    AnalyzeState --> PDFCheck[Check PDF<br/>Processing Errors]
    AnalyzeState --> MetadataCheck[Check Metadata<br/>Extraction Problems]
    AnalyzeState --> ValidationCheck[Check Validation<br/>Failures]
    AnalyzeState --> OutputCheck[Check File Output<br/>Issues]

    LLMCheck --> ErrorsFound{Errors<br/>Detected?}
    BoundaryCheck --> ErrorsFound
    PDFCheck --> ErrorsFound
    MetadataCheck --> ErrorsFound
    ValidationCheck --> ErrorsFound
    OutputCheck --> ErrorsFound

    ErrorsFound -->|❌ No Errors| InputTagging[Continue to Input<br/>Document Tagging]
    ErrorsFound -->|✅ Errors Found| SeverityFilter{Meets Severity<br/>Threshold?}

    SeverityFilter -->|❌ Below Threshold| InputTagging
    SeverityFilter -->|✅ Above Threshold| ConfigCheck{Error Detection<br/>Enabled?}

    ConfigCheck -->|❌ Disabled| InputTagging
    ConfigCheck -->|✅ Enabled| TagsConfigured{Error Tags<br/>Configured?}

    TagsConfigured -->|❌ No Tags| InputTagging
    TagsConfigured -->|✅ Tags Available| BatchMode{Batch Tagging<br/>Mode?}

    BatchMode -->|✅ Batch Mode| BatchTagging[Apply Error Tags<br/>to All Documents]
    BatchMode -->|❌ Individual Mode| IndividualTagging[Apply Error Tags<br/>to Each Document]

    BatchTagging --> TaggingResult{All Tagging<br/>Operations Successful?}
    IndividualTagging --> TaggingResult

    TaggingResult -->|✅ Success| TaggingSuccess[Log Successful<br/>Error Tagging]
    TaggingResult -->|❌ Partial/Failed| TaggingWarning[Log Tagging Warning<br/>⚠️ Continue Processing]

    TaggingSuccess --> InputTagging
    TaggingWarning --> RollbackCheck{Rollback<br/>Required?}

    RollbackCheck -->|✅ Yes| RollbackTags[Rollback Failed<br/>Tag Operations]
    RollbackCheck -->|❌ No| InputTagging
    RollbackTags --> InputTagging

    InputTagging --> Complete[Continue with<br/>Input Document Processing]

    %% Error Type Details
    LLMCheck -->|API Failures| ErrorType1[error:llm<br/>error:api-failure]
    BoundaryCheck -->|Low Confidence| ErrorType2[error:confidence<br/>error:boundary]
    PDFCheck -->|Processing Issues| ErrorType3[error:pdf<br/>error:processing]
    MetadataCheck -->|Extraction Failures| ErrorType4[error:metadata<br/>error:extraction]
    ValidationCheck -->|Validation Issues| ErrorType5[error:validation]
    OutputCheck -->|File Issues| ErrorType6[error:output<br/>error:file-system]

    ErrorType1 --> SeverityFilter
    ErrorType2 --> SeverityFilter
    ErrorType3 --> SeverityFilter
    ErrorType4 --> SeverityFilter
    ErrorType5 --> SeverityFilter
    ErrorType6 --> SeverityFilter

    %% Styling
    classDef detectionStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    classDef taggingStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef errorTypeStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    classDef warningStyle fill:#fff8e1,stroke:#f57c00,stroke-width:2px,color:#000

    class ErrorDetection,LLMCheck,BoundaryCheck,PDFCheck,MetadataCheck,ValidationCheck,OutputCheck detectionStyle
    class BatchTagging,IndividualTagging,TaggingSuccess,Complete taggingStyle
    class AnalyzeState,ErrorsFound,SeverityFilter,ConfigCheck,TagsConfigured,BatchMode,TaggingResult,RollbackCheck decisionStyle
    class ErrorType1,ErrorType2,ErrorType3,ErrorType4,ErrorType5,ErrorType6 errorTypeStyle
    class TaggingWarning,RollbackTags warningStyle

Error Types and Tags

| Error Category | Detection Criteria | Applied Tags | Severity |
| --- | --- | --- | --- |
| LLM Analysis Failure | API errors, model failures, fallback usage | error:llm, error:api-failure | High |
| Boundary Detection Issues | Low confidence boundaries, suspicious patterns | error:confidence, error:boundary | Medium |
| PDF Processing Errors | File corruption, access issues, format problems | error:pdf, error:processing | High |
| Metadata Extraction Failure | Missing account data, date parsing issues | error:metadata, error:extraction | Medium |
| Validation Failures | Content validation, integrity checks | error:validation | Medium-High |
| File Output Issues | Write failures, permissions, disk space | error:output, error:file-system | Critical |
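One way to picture how these categories turn into tags is a lookup table plus a severity filter. The dictionary keys and the function name are hypothetical, and the sketch simplifies the "Medium-High" severity of validation failures to "high" so that each category has a single level.

```python
# Category -> (tags, severity), transcribed from the categories listed above.
ERROR_TAGS = {
    "llm_analysis": (["error:llm", "error:api-failure"], "high"),
    "boundary_detection": (["error:confidence", "error:boundary"], "medium"),
    "pdf_processing": (["error:pdf", "error:processing"], "high"),
    "metadata_extraction": (["error:metadata", "error:extraction"], "medium"),
    "validation": (["error:validation"], "high"),  # simplified from Medium-High
    "file_output": (["error:output", "error:file-system"], "critical"),
}


def tags_to_apply(detected, levels, base_tags=()):
    """Collect tags for detected categories whose severity is in `levels`."""
    tags = []
    for category in detected:
        category_tags, severity = ERROR_TAGS[category]
        if severity in levels:
            tags.extend(category_tags)
    # Base tags are only added when at least one category qualified.
    if tags:
        tags = list(base_tags) + tags
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(tags))
```

Here `levels` plays the role of `PAPERLESS_ERROR_SEVERITY_LEVELS` and `base_tags` the role of `PAPERLESS_ERROR_TAGS`.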

Configuration Options

  • PAPERLESS_ERROR_DETECTION_ENABLED: Enable/disable error detection system
  • PAPERLESS_ERROR_TAGS: Base error tags to apply to all error documents
  • PAPERLESS_ERROR_TAG_THRESHOLD: Confidence threshold for boundary error detection
  • PAPERLESS_ERROR_SEVERITY_LEVELS: Error severity levels that trigger tagging
  • PAPERLESS_ERROR_BATCH_TAGGING: Use batch mode vs individual document tagging
  • PAPERLESS_TAG_WAIT_TIME: Wait time between tagging operations
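Reading these options from the environment could be sketched as follows. The variable names come from the list above, but the value formats (comma-separated lists, `true`/`false` booleans) and every default are assumptions made for this example, as are the class and function names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_list(value: str) -> list[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass
class ErrorTaggingConfig:
    enabled: bool
    tags: list[str]
    tag_threshold: float
    severity_levels: list[str]
    batch_tagging: bool
    tag_wait_time: float


def load_error_tagging_config(env: Mapping[str, str] = os.environ) -> ErrorTaggingConfig:
    # All defaults below are placeholders chosen for this sketch.
    return ErrorTaggingConfig(
        enabled=_as_bool(env.get("PAPERLESS_ERROR_DETECTION_ENABLED", "false")),
        tags=_as_list(env.get("PAPERLESS_ERROR_TAGS", "")),
        tag_threshold=float(env.get("PAPERLESS_ERROR_TAG_THRESHOLD", "0.5")),
        severity_levels=_as_list(env.get("PAPERLESS_ERROR_SEVERITY_LEVELS", "high,critical")),
        batch_tagging=_as_bool(env.get("PAPERLESS_ERROR_BATCH_TAGGING", "false")),
        tag_wait_time=float(env.get("PAPERLESS_TAG_WAIT_TIME", "0")),
    )
```

Accepting a mapping instead of reading `os.environ` directly makes the loader easy to exercise with a plain dictionary.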

Error Handling Strategies

Error Classification

flowchart LR
    Error[Processing Error] --> Classification{Error Type}

    Classification -->|Network/API| Recoverable[Recoverable Error<br/>🔄 Retry Logic]
    Classification -->|File System| Recoverable
    Classification -->|Temporary| Recoverable

    Classification -->|Invalid Format| Critical[Critical Error<br/>🚫 Immediate Quarantine]
    Classification -->|Corruption| Critical
    Classification -->|Access Denied| Critical

    Classification -->|Validation| ValidationError[Validation Error<br/>⚠️ Configurable Response]

    Recoverable --> RetryCount{Retry Count<br/>< Max?}
    RetryCount -->|Yes| Delay[Exponential Backoff<br/>with Jitter]
    RetryCount -->|No| Quarantine[Move to<br/>Quarantine]

    Delay --> RateLimitCheck{Rate Limit<br/>Exceeded?}
    RateLimitCheck -->|Yes| BackoffDelay[Apply Backoff<br/>Strategy]
    RateLimitCheck -->|No| RetryProcess[Retry<br/>Processing]
    BackoffDelay --> RetryProcess

    Critical --> Quarantine

    ValidationError --> StrictnessCheck{Validation<br/>Strictness}
    StrictnessCheck -->|Strict| Quarantine
    StrictnessCheck -->|Normal| Warning[Log Warning<br/>Continue Processing]
    StrictnessCheck -->|Lenient| Warning

    Quarantine --> ErrorReport[Generate Error Report<br/>📋 Recovery Suggestions]

    classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px
    classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class Recoverable,ValidationError,Warning successStyle
    class Critical,Quarantine,ErrorReport quarantineStyle
    class Classification,RetryCount,StrictnessCheck decisionStyle
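The decision logic in this flowchart can be written as one small function. The error-type strings, the `Action` enum, and the default of three retries are assumptions for illustration; the rate-limit branch is left out.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    QUARANTINE = "quarantine"
    WARN = "warn_and_continue"


# Hypothetical error-type labels mirroring the branches of the flowchart.
RECOVERABLE = {"network", "api", "file_system", "temporary"}
CRITICAL = {"invalid_format", "corruption", "access_denied"}


def next_action(
    error_type: str,
    retry_count: int = 0,
    max_retries: int = 3,
    strictness: str = "normal",
) -> Action:
    """Map an error to retry, quarantine, or warn-and-continue."""
    if error_type in RECOVERABLE:
        return Action.RETRY if retry_count < max_retries else Action.QUARANTINE
    if error_type in CRITICAL:
        return Action.QUARANTINE
    if error_type == "validation":
        return Action.QUARANTINE if strictness == "strict" else Action.WARN
    raise ValueError(f"unknown error type: {error_type}")
```

A retry would be preceded by a backoff delay; only the choice of action is modelled here.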

Validation Strictness Levels

| Level | Description | Behavior | Use Case |
| --- | --- | --- | --- |
| Strict | All validation issues are errors | Fail fast, quarantine immediately | Production financial processing |
| Normal | Balanced validation approach | Warnings for minor issues, errors for critical | General business use |
| Lenient | Minimal validation blocking | Continue processing with warnings | Exploratory processing |
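The three levels can be read as three rules over a list of validation issues. In this sketch each issue is a `(message, is_critical)` pair, which is an assumed representation, and the lenient level is simplified to never block.

```python
def validation_passes(issues: list[tuple[str, bool]], strictness: str) -> bool:
    """Decide whether processing continues, given (message, is_critical) issues."""
    if strictness == "strict":
        # Any issue at all is treated as an error.
        return not issues
    if strictness == "normal":
        # Minor issues become warnings; only critical ones block.
        return not any(is_critical for _, is_critical in issues)
    if strictness == "lenient":
        # Simplification: everything is logged as a warning.
        return True
    raise ValueError(f"unknown strictness level: {strictness}")
```

With a single minor issue, the result is `False` under strict and `True` under normal and lenient.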

Configuration Impact

Environment Variables Affecting Workflow

graph TD
    Config[Configuration] --> Processing[Processing Behavior]
    Config --> ErrorHandling[Error Handling]
    Config --> Integration[Integrations]

    Processing --> API[OPENAI_API_KEY<br/>LLM_MODEL<br/>LLM_TEMPERATURE]
    Processing --> Files[MAX_FILE_SIZE_MB<br/>CHUNK_SIZE<br/>CHUNK_OVERLAP]
    Processing --> Output[DEFAULT_OUTPUT_DIR<br/>FILENAME_PATTERN<br/>DATE_FORMAT]

    ErrorHandling --> Validation[VALIDATION_STRICTNESS<br/>REQUIRE_TEXT_CONTENT<br/>MIN_PAGES_PER_STATEMENT]
    ErrorHandling --> Quarantine[QUARANTINE_DIRECTORY<br/>AUTO_QUARANTINE_CRITICAL_FAILURES<br/>MAX_RETRY_ATTEMPTS]
    ErrorHandling --> Backoff[OPENAI_REQUESTS_PER_MINUTE<br/>OPENAI_BURST_LIMIT<br/>OPENAI_BACKOFF_MIN<br/>OPENAI_BACKOFF_MAX]
    ErrorHandling --> Reporting[ENABLE_ERROR_REPORTING<br/>ERROR_REPORT_DIRECTORY<br/>PRESERVE_FAILED_OUTPUTS]

    Integration --> Paperless[PAPERLESS_ENABLED<br/>PAPERLESS_URL<br/>PAPERLESS_TOKEN<br/>PAPERLESS_INPUT_TAGGING_ENABLED<br/>PAPERLESS_INPUT_PROCESSED_TAG<br/>PAPERLESS_ERROR_DETECTION_ENABLED<br/>PAPERLESS_ERROR_TAGS<br/>PAPERLESS_ERROR_TAG_THRESHOLD<br/>PAPERLESS_ERROR_SEVERITY_LEVELS<br/>PAPERLESS_ERROR_BATCH_TAGGING]
    Integration --> Logging[ENABLE_AUDIT_LOGGING<br/>LOG_LEVEL<br/>LOG_FILE]

    classDef configStyle fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef categoryStyle fill:#f1f8e9,stroke:#558b2f,stroke-width:2px

    class Config configStyle
    class Processing,ErrorHandling,Integration categoryStyle

Performance Characteristics

Processing Time Factors

  1. Document Size: Larger documents require more processing time
  2. AI Analysis: API calls add latency but improve accuracy
  3. Statement Count: More statements increase processing complexity
  4. Network Latency: Affects API calls and Paperless uploads
  5. Rate Limiting: Backoff delays when hitting API limits (see Backoff Mechanisms)
  6. Retry Logic: Failed operations with exponential backoff increase total processing time
  7. Validation Level: Strict validation adds processing overhead

Typical Performance Metrics

| Document Type | Processing Time | Memory Usage | Accuracy |
| --- | --- | --- | --- |
| Single Statement (5 pages) | 2-5 seconds | <100MB | 98% |
| Multi-Statement (20 pages) | 10-30 seconds | 200-400MB | 95% |
| Large Document (50+ pages) | 1-5 minutes | 500MB+ | 93% |

Monitoring and Observability

Key Metrics to Monitor

pie title Processing Metrics
    "Successful Processing" : 80
    "Quarantined (Validation)" : 8
    "Quarantined (Critical)" : 4
    "Partial Success" : 3
    "Rate Limited (Backoff)" : 5

Backoff-Specific Metrics

  • Rate Limit Hits: Frequency of rate limit encounters
  • Backoff Delays: Average and maximum backoff times
  • Retry Success Rate: Percentage of retries that succeed
  • Burst Token Usage: Current burst token levels
  • API Request Patterns: Requests per minute over time

Logging and Audit Trail

  • Processing Logs: Detailed execution traces
  • Audit Logs: Security and compliance tracking
  • Error Reports: Structured failure analysis
  • Performance Metrics: Processing time and resource usage

Recovery and Maintenance

Automated Recovery

  • Retry Logic: Automatic retry with exponential backoff and jitter
  • Rate Limiting: Token bucket rate limiting with configurable burst capacity
  • Fallback Processing: Pattern matching when AI unavailable
  • Partial Success Handling: Continue processing despite non-critical failures
  • Backoff Strategy: Configurable delays with jitter to prevent thundering herd
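Two of these mechanisms, exponential backoff with jitter and token bucket rate limiting, are sketched below using only the standard library. This is a generic illustration rather than the project's implementation: the "full jitter" variant, the parameter names, and the defaults are assumptions, with `cap` and `burst` loosely corresponding to `OPENAI_BACKOFF_MAX` and `OPENAI_BURST_LIMIT`.

```python
import random
import time
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * rng()


class TokenBucket:
    """Allow short bursts up to `burst` requests, refilling at a steady rate."""

    def __init__(
        self,
        rate_per_minute: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when rate limited."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Randomizing the delay spreads retries from many workers over time, which is what prevents the thundering herd; injecting `rng` and `clock` keeps both pieces deterministic under test.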

Manual Recovery

  • Quarantine Review: Regular review of failed documents
  • Configuration Tuning: Adjust validation strictness based on patterns
  • Batch Reprocessing: Process recovered documents in batches

Maintenance Operations

  • Quarantine Cleanup: Automated removal of old failed documents
  • Log Rotation: Prevent log files from consuming excessive disk space
  • Performance Monitoring: Track processing metrics over time

Workflow Integration Summary

The Bank Statement Separator implements two complementary workflow architectures:

Application Processing Workflow (This Document)

  • 8-node LangGraph pipeline for PDF processing
  • Comprehensive error handling with quarantine system
  • AI-powered analysis with pattern-matching fallback
  • Rate limiting and backoff mechanisms for API calls
  • Audit logging and compliance tracking

CI/CD Pipeline Workflow (GitHub Workflows)

  • 5 interconnected GitHub Actions workflows
  • Automated testing with Python matrix (3.11, 3.12)
  • Release automation using conventional commits
  • Security scanning and dependency review
  • Documentation versioning with mike deployment

Integration Points

  1. Configuration: Environment variables control both processing behavior and CI/CD settings
  2. Testing: CI workflows validate the processing pipeline functionality
  3. Releases: Automated releases deploy both code and documentation updates
  4. Monitoring: Both systems provide comprehensive logging and error reporting

This dual-workflow architecture ensures:

  • Robust Processing: Reliable document processing with fallback mechanisms
  • Quality Assurance: Automated testing and security scanning
  • Continuous Delivery: Automated releases and documentation updates
  • Comprehensive Monitoring: Full visibility into both processing and deployment workflows

For detailed information about the rate limiting and backoff mechanisms, see the Backoff Mechanisms Design Document.