Workflow Architecture Overview
Comprehensive overview of the 8-node LangGraph workflow with error handling and recovery mechanisms.
System Architecture
The Bank Statement Separator uses two complementary workflow systems:
- Application Processing Workflow: An 8-node LangGraph pipeline for PDF processing with error handling and recovery
- CI/CD Pipeline Workflow: GitHub Actions workflows for automated testing, releasing, and documentation deployment
Complete Workflow Documentation
This document focuses on the application processing workflow. For comprehensive documentation of the CI/CD workflows including release automation, testing, and documentation deployment, see GitHub Workflows Architecture.
Application Processing Workflow
The core PDF processing uses an 8-node LangGraph pipeline with comprehensive error handling and recovery systems.
Complete Workflow Diagram
flowchart TD
Start([PDF Input File]) --> PreValidation{Pre-Processing<br/>Validation}
PreValidation -->|Valid| Node1[1. PDF Ingestion<br/>Load & Validate]
PreValidation -->|Invalid| QuarantinePreValidation[Quarantine:<br/>Pre-validation Failure]
Node1 --> Node1Error{Processing<br/>Error?}
Node1Error -->|Success| Node2[2. Document Analysis<br/>Extract Text & Chunk]
Node1Error -->|Error| RetryLogic1{Retry<br/>Logic}
Node2 --> Node2Error{Processing<br/>Error?}
Node2Error -->|Success| Node3[3. Statement Detection<br/>AI Boundary Analysis]
Node2Error -->|Error| RetryLogic2{Retry<br/>Logic}
Node3 --> Node3Error{AI Available?}
Node3Error -->|Success| Node4[4. Metadata Extraction<br/>Account, Date, Bank]
Node3Error -->|API Failure| FallbackMode[Fallback Mode:<br/>Pattern Matching]
FallbackMode --> Node4
Node4 --> Node4Error{Processing<br/>Error?}
Node4Error -->|Success| Node5[5. PDF Generation<br/>Create Separate Files]
Node4Error -->|Error| RetryLogic4{Retry<br/>Logic}
Node5 --> Node5Error{Processing<br/>Error?}
Node5Error -->|Success| Node6[6. File Organization<br/>Apply Naming & Structure]
Node5Error -->|Error| RetryLogic5{Retry<br/>Logic}
Node6 --> Node6Error{Processing<br/>Error?}
Node6Error -->|Success| Node7[7. Output Validation<br/>Integrity Checking]
Node6Error -->|Error| RetryLogic6{Retry<br/>Logic}
Node7 --> ValidationResult{Validation<br/>Result}
ValidationResult -->|Valid| Node8[8. Paperless Upload<br/>Document Management]
ValidationResult -->|Failed| QuarantineValidation[Quarantine:<br/>Validation Failure]
Node8 --> Node8Error{Upload<br/>Success?}
Node8Error -->|Success| ErrorDetection[Error Detection<br/>Analyze Processing Issues]
Node8Error -->|Upload Failed| RetryLogic8{Retry<br/>Logic}
ErrorDetection --> ErrorsFound{Processing<br/>Errors Detected?}
ErrorsFound -->|Errors Found| ErrorTagging[Error Tagging<br/>Apply Error Tags]
ErrorsFound -->|No Errors| InputTagging{Source Document<br/>ID Available?}
ErrorTagging --> TaggingResult{Tagging<br/>Success?}
TaggingResult -->|Success| InputTagging
TaggingResult -->|Failed| TaggingWarning[Log Tagging Warning<br/>Continue Processing]
TaggingWarning --> InputTagging
InputTagging -->|Yes| TagInput[Tag Input Document<br/>Mark as Processed]
InputTagging -->|No| ProcessedFiles[Move to Processed<br/>Archive Input]
TagInput --> TagResult{Tagging<br/>Success?}
TagResult -->|Success| ProcessedFiles
TagResult -->|Failed| TagWarning[Log Warning<br/>Continue Processing]
TagWarning --> ProcessedFiles
ProcessedFiles --> Success([Processing Complete<br/>Generate Report])
%% Retry Logic Flows with Backoff
RetryLogic1 -->|Retry with Backoff| Node1
RetryLogic1 -->|Max Retries Exceeded| QuarantineCritical[Quarantine:<br/>Critical Failure]
RetryLogic2 -->|Retry with Backoff| Node2
RetryLogic2 -->|Max Retries Exceeded| QuarantineCritical
RetryLogic4 -->|Retry with Backoff| Node4
RetryLogic4 -->|Max Retries Exceeded| QuarantineCritical
RetryLogic5 -->|Retry with Backoff| Node5
RetryLogic5 -->|Max Retries Exceeded| QuarantineCritical
RetryLogic6 -->|Retry with Backoff| Node6
RetryLogic6 -->|Max Retries Exceeded| QuarantineCritical
RetryLogic8 -->|Retry with Backoff| Node8
RetryLogic8 -->|Max Retries Exceeded| PartialSuccess[Partial Success:<br/>Files Created, Upload Failed]
%% Quarantine System
QuarantinePreValidation --> ErrorReport1[Generate Error Report<br/>Recovery Suggestions]
QuarantineCritical --> ErrorReport2[Generate Error Report<br/>Recovery Suggestions]
QuarantineValidation --> ErrorReport3[Generate Error Report<br/>Recovery Suggestions]
ErrorReport1 --> QuarantineDir[(Quarantine Directory<br/>Failed Documents)]
ErrorReport2 --> QuarantineDir
ErrorReport3 --> QuarantineDir
PartialSuccess --> PartialReport[Generate Partial Report<br/>Upload Issue Noted]
PartialReport --> Success
%% Monitoring and Management
QuarantineDir --> QuarantineManagement[Quarantine Management<br/>CLI Tools]
QuarantineManagement --> QuarantineClean[Periodic Cleanup<br/>Remove Old Files]
QuarantineManagement --> QuarantineAnalysis[Error Analysis<br/>Pattern Detection]
%% Styling
classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000
classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
class Node1,Node2,Node3,Node4,Node5,Node6,Node7,Node8,TagInput,ErrorDetection,ErrorTagging nodeStyle
class Node1Error,Node2Error,Node3Error,Node4Error,Node5Error,Node6Error,Node8Error,ValidationResult decisionStyle
class PreValidation,RetryLogic1,RetryLogic2,RetryLogic4,RetryLogic5,RetryLogic6,RetryLogic8,InputTagging,TagResult,ErrorsFound,TaggingResult decisionStyle
class QuarantinePreValidation,QuarantineCritical,QuarantineValidation,ErrorReport1,ErrorReport2,ErrorReport3,QuarantineDir quarantineStyle
class FallbackMode,PartialSuccess,PartialReport,TagWarning,TaggingWarning errorStyle
class ProcessedFiles,Success,QuarantineClean,QuarantineAnalysis successStyle
Workflow Nodes Detailed
1. PDF Ingestion
- Purpose: Load and validate input PDF files
- Validation: File format, size, accessibility, password protection
- Error Handling: Pre-validation quarantine for invalid files
- Fallback: None (critical failure point)
2. Document Analysis
- Purpose: Extract text content and create processing chunks
- Processing: Text extraction, chunk creation with overlap
- Error Handling: Retry logic for temporary failures
- Fallback: Basic text extraction methods
3. Statement Detection
- Purpose: Identify statement boundaries using AI analysis
- AI Processing: OpenAI GPT models for intelligent detection
- Error Handling: Automatic fallback to enhanced pattern matching
- Fallback: Enhanced pattern-based detection with fragment filtering
- Fragment Detection: Identifies and excludes low-confidence document fragments
4. Metadata Extraction
- Purpose: Extract account numbers, dates, and bank names
- Processing: AI-powered metadata identification
- Error Handling: Retry logic with graceful degradation
- Fallback: Pattern-based extraction
5. PDF Generation
- Purpose: Create separate PDF files for each statement
- Processing: Page-based PDF splitting with confidence filtering
- Quality Control: Skips fragments with confidence < 0.3
- Error Handling: Retry logic for file system issues
- Fallback: Basic page splitting with fragment detection
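The confidence filter in this node can be illustrated with a minimal sketch. The `Segment` type and `split_by_confidence` function are hypothetical names introduced here; only the 0.3 threshold comes from the description above, and treating a score of exactly 0.3 as acceptable is an assumption.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.3  # threshold stated above; fragments below it are skipped


@dataclass
class Segment:
    """Hypothetical detected segment: inclusive 1-based page range plus a confidence score."""
    start_page: int
    end_page: int
    confidence: float


def split_by_confidence(segments, threshold=MIN_CONFIDENCE):
    """Return (kept, skipped); segments scoring below the threshold count as fragments."""
    kept = [s for s in segments if s.confidence >= threshold]
    skipped = [s for s in segments if s.confidence < threshold]
    return kept, skipped


segments = [Segment(1, 5, 0.92), Segment(6, 6, 0.15), Segment(7, 12, 0.30)]
kept, skipped = split_by_confidence(segments)
print("kept:", [(s.start_page, s.end_page) for s in kept])
print("skipped:", [(s.start_page, s.end_page) for s in skipped])
```

Returning the skipped segments alongside the kept ones lets a later validation step account for pages that were left out on purpose.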
6. File Organization
- Purpose: Apply naming conventions and organize outputs
- Processing: Filename generation, directory structure
- Error Handling: Retry logic for file operations
- Fallback: Simple incremental naming
7. Output Validation
- Purpose: Verify integrity of generated files
- Validation: Page count, file size, content sampling
- Fragment Handling: Adjusts validation for skipped fragments
- Error Handling: Quarantine for validation failures
- Fallback: None (quality gate)
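One way to picture the page-count check with fragment adjustment is the sketch below. The function name and the rule itself (output pages must equal input pages minus the pages of skipped fragments) are assumptions made for illustration, not the project's confirmed logic.

```python
def validate_page_counts(input_pages, output_page_counts, skipped_fragment_pages=0):
    """Check that output files account for every input page not skipped as a fragment.

    Assumed rule: sum(output pages) == input pages - skipped fragment pages.
    Returns (is_valid, message).
    """
    expected = input_pages - skipped_fragment_pages
    actual = sum(output_page_counts)
    if actual == expected:
        return True, f"page count OK ({actual} pages)"
    return False, f"page count mismatch: expected {expected}, found {actual}"


ok, message = validate_page_counts(12, [5, 6], skipped_fragment_pages=1)
print(ok, message)
```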
8. Paperless Upload
- Purpose: Upload to document management system
- Processing: API upload with metadata application
- Error Detection: Automatic analysis of processing issues and failures
- Error Tagging: Apply error tags to documents with processing issues
- Input Document Tagging: Mark source documents as processed (if `source_document_id` is provided)
- Error Handling: Retry logic for network failures, graceful degradation for tagging failures
- Fallback: Local storage with upload notification
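The eight nodes described above pass a shared state from one step to the next. The sketch below imitates that flow in plain Python; it does not use the LangGraph API, and the node names, the `State` shape, and `run_pipeline` are hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]

# Hypothetical identifiers for the eight nodes, in the order documented above.
NODE_NAMES: List[str] = [
    "pdf_ingestion",
    "document_analysis",
    "statement_detection",
    "metadata_extraction",
    "pdf_generation",
    "file_organization",
    "output_validation",
    "paperless_upload",
]


def placeholder(name: str) -> Node:
    """Return a stand-in node that only records that it ran."""
    def node(state: State) -> State:
        return {**state, "completed": state["completed"] + [name]}
    return node


def run_pipeline(input_path: str, nodes: List[Tuple[str, Node]]) -> State:
    """Run nodes in order, stopping at the first failure and recording where it happened."""
    state: State = {"input_path": input_path, "completed": [], "failed_node": None}
    for name, node in nodes:
        try:
            state = node(state)
        except Exception as exc:  # a real pipeline would route to retry or quarantine here
            return {**state, "failed_node": name, "error": str(exc)}
    return state


nodes = [(name, placeholder(name)) for name in NODE_NAMES]
final_state = run_pipeline("statements.pdf", nodes)
print(final_state["completed"])
```

Each node returns a new state rather than mutating its input, which keeps a failed run inspectable: the state at the point of failure still lists every node that completed.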
Error Detection and Tagging Sub-System
Error Detection:
- Analyzes workflow state for processing errors (LLM failures, boundary issues, PDF problems)
- Detects low-confidence boundaries, suspicious patterns, and metadata extraction failures
- Configurable severity thresholds and error type filtering
- Supports 6 error categories: API failures, boundary issues, PDF processing, metadata extraction, validation failures, and file output errors
Error Tagging:
- Automatically applies configurable error tags to documents with detected issues
- Supports both individual and batch tagging modes
- Severity-based tag application (medium, high, critical errors)
- Rollback capability on tagging failures to maintain data integrity
- Comprehensive audit logging for all tagging operations
Error Detection and Tagging Workflow
The error detection and tagging system provides automatic identification and tagging of documents that encountered processing issues during the workflow execution.
flowchart TD
Upload[Documents Successfully<br/>Uploaded to Paperless] --> ErrorDetection[Error Detection<br/>Analyze Workflow State]
ErrorDetection --> AnalyzeState{Analyze<br/>Processing State}
AnalyzeState --> LLMCheck[Check LLM Analysis<br/>Failures]
AnalyzeState --> BoundaryCheck[Check Boundary<br/>Detection Issues]
AnalyzeState --> PDFCheck[Check PDF<br/>Processing Errors]
AnalyzeState --> MetadataCheck[Check Metadata<br/>Extraction Problems]
AnalyzeState --> ValidationCheck[Check Validation<br/>Failures]
AnalyzeState --> OutputCheck[Check File Output<br/>Issues]
LLMCheck --> ErrorsFound{Errors<br/>Detected?}
BoundaryCheck --> ErrorsFound
PDFCheck --> ErrorsFound
MetadataCheck --> ErrorsFound
ValidationCheck --> ErrorsFound
OutputCheck --> ErrorsFound
ErrorsFound -->|No Errors| InputTagging[Continue to Input<br/>Document Tagging]
ErrorsFound -->|Errors Found| SeverityFilter{Meets Severity<br/>Threshold?}
SeverityFilter -->|Below Threshold| InputTagging
SeverityFilter -->|Above Threshold| ConfigCheck{Error Detection<br/>Enabled?}
ConfigCheck -->|Disabled| InputTagging
ConfigCheck -->|Enabled| TagsConfigured{Error Tags<br/>Configured?}
TagsConfigured -->|No Tags| InputTagging
TagsConfigured -->|Tags Available| BatchMode{Batch Tagging<br/>Mode?}
BatchMode -->|Batch Mode| BatchTagging[Apply Error Tags<br/>to All Documents]
BatchMode -->|Individual Mode| IndividualTagging[Apply Error Tags<br/>to Each Document]
BatchTagging --> TaggingResult{All Tagging<br/>Operations Successful?}
IndividualTagging --> TaggingResult
TaggingResult -->|Success| TaggingSuccess[Log Successful<br/>Error Tagging]
TaggingResult -->|Partial/Failed| TaggingWarning[Log Tagging Warning<br/>Continue Processing]
TaggingSuccess --> InputTagging
TaggingWarning --> RollbackCheck{Rollback<br/>Required?}
RollbackCheck -->|Yes| RollbackTags[Rollback Failed<br/>Tag Operations]
RollbackCheck -->|No| InputTagging
RollbackTags --> InputTagging
InputTagging --> Complete[Continue with<br/>Input Document Processing]
%% Error Type Details
LLMCheck -->|API Failures| ErrorType1[error:llm<br/>error:api-failure]
BoundaryCheck -->|Low Confidence| ErrorType2[error:confidence<br/>error:boundary]
PDFCheck -->|Processing Issues| ErrorType3[error:pdf<br/>error:processing]
MetadataCheck -->|Extraction Failures| ErrorType4[error:metadata<br/>error:extraction]
ValidationCheck -->|Validation Issues| ErrorType5[error:validation]
OutputCheck -->|File Issues| ErrorType6[error:output<br/>error:file-system]
ErrorType1 --> SeverityFilter
ErrorType2 --> SeverityFilter
ErrorType3 --> SeverityFilter
ErrorType4 --> SeverityFilter
ErrorType5 --> SeverityFilter
ErrorType6 --> SeverityFilter
%% Styling
classDef detectionStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
classDef taggingStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef errorTypeStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
classDef warningStyle fill:#fff8e1,stroke:#f57c00,stroke-width:2px,color:#000
class ErrorDetection,LLMCheck,BoundaryCheck,PDFCheck,MetadataCheck,ValidationCheck,OutputCheck detectionStyle
class BatchTagging,IndividualTagging,TaggingSuccess,Complete taggingStyle
class AnalyzeState,ErrorsFound,SeverityFilter,ConfigCheck,TagsConfigured,BatchMode,TaggingResult,RollbackCheck decisionStyle
class ErrorType1,ErrorType2,ErrorType3,ErrorType4,ErrorType5,ErrorType6 errorTypeStyle
class TaggingWarning,RollbackTags warningStyle
Error Types and Tags
| Error Category | Detection Criteria | Applied Tags | Severity |
|---|---|---|---|
| LLM Analysis Failure | API errors, model failures, fallback usage | `error:llm`, `error:api-failure` | High |
| Boundary Detection Issues | Low confidence boundaries, suspicious patterns | `error:confidence`, `error:boundary` | Medium |
| PDF Processing Errors | File corruption, access issues, format problems | `error:pdf`, `error:processing` | High |
| Metadata Extraction Failure | Missing account data, date parsing issues | `error:metadata`, `error:extraction` | Medium |
| Validation Failures | Content validation, integrity checks | `error:validation` | Medium-High |
| File Output Issues | Write failures, permissions, disk space | `error:output`, `error:file-system` | Critical |
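The table can be read as a lookup from error category to tags, filtered by severity. The sketch below is one hedged way to express that; the category keys, the `select_tags` function, and the choice to pass each error's severity in explicitly (rather than fixing one severity per category) are assumptions, while the tag strings are taken from the table.

```python
# Tag strings come from the table above; the dictionary keys are hypothetical identifiers.
CATEGORY_TAGS = {
    "llm_analysis_failure": ["error:llm", "error:api-failure"],
    "boundary_detection": ["error:confidence", "error:boundary"],
    "pdf_processing": ["error:pdf", "error:processing"],
    "metadata_extraction": ["error:metadata", "error:extraction"],
    "validation_failure": ["error:validation"],
    "file_output": ["error:output", "error:file-system"],
}


def select_tags(detected, enabled_levels, base_tags=()):
    """Collect tags for detected (category, severity) pairs whose severity is enabled.

    Base tags are added only when at least one error qualifies.
    Duplicates are removed while preserving order.
    """
    tags = []
    for category, severity in detected:
        if severity in enabled_levels:
            tags.extend(CATEGORY_TAGS[category])
    if tags:
        tags = list(base_tags) + tags
    return list(dict.fromkeys(tags))


detected = [("boundary_detection", "low"), ("file_output", "critical")]
tags = select_tags(detected, enabled_levels={"medium", "high", "critical"})
print(tags)
```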
Configuration Options
- `PAPERLESS_ERROR_DETECTION_ENABLED`: Enable or disable the error detection system
- `PAPERLESS_ERROR_TAGS`: Base error tags to apply to all error documents
- `PAPERLESS_ERROR_TAG_THRESHOLD`: Confidence threshold for boundary error detection
- `PAPERLESS_ERROR_SEVERITY_LEVELS`: Error severity levels that trigger tagging
- `PAPERLESS_ERROR_BATCH_TAGGING`: Use batch mode instead of individual document tagging
- `PAPERLESS_TAG_WAIT_TIME`: Wait time between tagging operations
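As a rough illustration of how these options might be read, the sketch below parses them from a mapping such as `os.environ`. The variable names are the ones listed above; the value formats (comma-separated lists, `true`/`false` strings, seconds for the wait time) and every default value are assumptions, and `load_error_tagging_config` is a hypothetical helper.

```python
import os


def _as_bool(value):
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_list(value):
    """Split a comma-separated string, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_error_tagging_config(env=None):
    """Read the error tagging options; all defaults here are illustrative assumptions."""
    env = os.environ if env is None else env
    return {
        "enabled": _as_bool(env.get("PAPERLESS_ERROR_DETECTION_ENABLED", "false")),
        "tags": _as_list(env.get("PAPERLESS_ERROR_TAGS", "")),
        "tag_threshold": float(env.get("PAPERLESS_ERROR_TAG_THRESHOLD", "0.5")),
        "severity_levels": _as_list(
            env.get("PAPERLESS_ERROR_SEVERITY_LEVELS", "medium,high,critical")
        ),
        "batch_tagging": _as_bool(env.get("PAPERLESS_ERROR_BATCH_TAGGING", "false")),
        "tag_wait_time": float(env.get("PAPERLESS_TAG_WAIT_TIME", "5")),
    }


example_env = {
    "PAPERLESS_ERROR_DETECTION_ENABLED": "true",
    "PAPERLESS_ERROR_TAGS": "error:processing, error:pdf",
    "PAPERLESS_ERROR_TAG_THRESHOLD": "0.4",
}
config = load_error_tagging_config(example_env)
print(config)
```

Accepting the environment as a parameter keeps the parser testable without touching the real process environment.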
Error Handling Strategies
Error Classification
flowchart LR
Error[Processing Error] --> Classification{Error Type}
Classification -->|Network/API| Recoverable[Recoverable Error<br/>Retry Logic]
Classification -->|File System| Recoverable
Classification -->|Temporary| Recoverable
Classification -->|Invalid Format| Critical[Critical Error<br/>Immediate Quarantine]
Classification -->|Corruption| Critical
Classification -->|Access Denied| Critical
Classification -->|Validation| ValidationError[Validation Error<br/>Configurable Response]
Recoverable --> RetryCount{Retry Count<br/>< Max?}
RetryCount -->|Yes| Delay[Exponential Backoff<br/>with Jitter]
RetryCount -->|No| Quarantine[Move to<br/>Quarantine]
Delay --> RateLimitCheck{Rate Limit<br/>Exceeded?}
RateLimitCheck -->|Yes| BackoffDelay[Apply Backoff<br/>Strategy]
RateLimitCheck -->|No| RetryProcess[Retry<br/>Processing]
BackoffDelay --> RetryProcess
Critical --> Quarantine
ValidationError --> StrictnessCheck{Validation<br/>Strictness}
StrictnessCheck -->|Strict| Quarantine
StrictnessCheck -->|Normal| Warning[Log Warning<br/>Continue Processing]
StrictnessCheck -->|Lenient| Warning
Quarantine --> ErrorReport[Generate Error Report<br/>Recovery Suggestions]
classDef errorStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef quarantineStyle fill:#ffebee,stroke:#c62828,stroke-width:2px
classDef successStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
class Recoverable,ValidationError,Warning successStyle
class Critical,Quarantine,ErrorReport quarantineStyle
class Classification,RetryCount,StrictnessCheck decisionStyle
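The classification and retry branches in this diagram can be sketched in a few lines of Python. Everything named here (`classify`, `backoff_delay`, `run_with_retry`, `RecoverableError`, the error-type strings, and the default base, cap, and attempt count) is hypothetical, and the "full jitter" variant of exponential backoff is one common choice rather than the confirmed implementation. Quarantining unknown error types is also an assumption.

```python
import random
import time

RECOVERABLE = {"network", "api", "file_system", "temporary"}
CRITICAL = {"invalid_format", "corruption", "access_denied"}


class RecoverableError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def classify(error_type):
    """Map an error type to the branch taken in the diagram above."""
    if error_type in RECOVERABLE:
        return "retry"
    if error_type == "validation":
        return "validation"
    if error_type in CRITICAL:
        return "quarantine"
    return "quarantine"  # assumption: unknown error types are treated as critical


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def run_with_retry(operation, max_attempts=3, sleep=time.sleep):
    """Call operation, retrying on RecoverableError until max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RecoverableError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller would quarantine the document
            sleep(backoff_delay(attempt))


calls = []


def flaky():
    """Fail twice with a recoverable error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise RecoverableError("temporary failure")
    return "ok"


result = run_with_retry(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, len(calls))
```

Jitter spreads retries from concurrent workers over the backoff window, so they do not all hit the API again at the same instant.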
Validation Strictness Levels
| Level | Description | Behavior | Use Case |
|---|---|---|---|
| Strict | All validation issues are errors | Fail fast, quarantine immediately | Production financial processing |
| Normal | Balanced validation approach | Warnings for minor issues, errors for critical | General business use |
| Lenient | Minimal validation blocking | Continue processing with warnings | Exploratory processing |
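A small decision function makes the three levels concrete. This sketch follows the table, and it assumes that an "error" means quarantine and that the Normal level distinguishes critical from minor issues; `validation_response` and its return values are hypothetical names.

```python
def validation_response(strictness, critical):
    """Return "quarantine" or "warn" for a validation issue at the given strictness level.

    Assumptions: an "error" in the table means quarantine, and the Normal level
    only quarantines issues flagged as critical.
    """
    level = strictness.strip().lower()
    if level == "strict":
        return "quarantine"
    if level == "normal":
        return "quarantine" if critical else "warn"
    if level == "lenient":
        return "warn"
    raise ValueError(f"unknown validation strictness: {strictness!r}")


for level in ("strict", "normal", "lenient"):
    print(level, validation_response(level, critical=False), validation_response(level, critical=True))
```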
Configuration Impact
Environment Variables Affecting Workflow
graph TD
Config[Configuration] --> Processing[Processing Behavior]
Config --> ErrorHandling[Error Handling]
Config --> Integration[Integrations]
Processing --> API[OPENAI_API_KEY<br/>LLM_MODEL<br/>LLM_TEMPERATURE]
Processing --> Files[MAX_FILE_SIZE_MB<br/>CHUNK_SIZE<br/>CHUNK_OVERLAP]
Processing --> Output[DEFAULT_OUTPUT_DIR<br/>FILENAME_PATTERN<br/>DATE_FORMAT]
ErrorHandling --> Validation[VALIDATION_STRICTNESS<br/>REQUIRE_TEXT_CONTENT<br/>MIN_PAGES_PER_STATEMENT]
ErrorHandling --> Quarantine[QUARANTINE_DIRECTORY<br/>AUTO_QUARANTINE_CRITICAL_FAILURES<br/>MAX_RETRY_ATTEMPTS]
ErrorHandling --> Backoff[OPENAI_REQUESTS_PER_MINUTE<br/>OPENAI_BURST_LIMIT<br/>OPENAI_BACKOFF_MIN<br/>OPENAI_BACKOFF_MAX]
ErrorHandling --> Reporting[ENABLE_ERROR_REPORTING<br/>ERROR_REPORT_DIRECTORY<br/>PRESERVE_FAILED_OUTPUTS]
Integration --> Paperless[PAPERLESS_ENABLED<br/>PAPERLESS_URL<br/>PAPERLESS_TOKEN<br/>PAPERLESS_INPUT_TAGGING_ENABLED<br/>PAPERLESS_INPUT_PROCESSED_TAG<br/>PAPERLESS_ERROR_DETECTION_ENABLED<br/>PAPERLESS_ERROR_TAGS<br/>PAPERLESS_ERROR_TAG_THRESHOLD<br/>PAPERLESS_ERROR_SEVERITY_LEVELS<br/>PAPERLESS_ERROR_BATCH_TAGGING]
Integration --> Logging[ENABLE_AUDIT_LOGGING<br/>LOG_LEVEL<br/>LOG_FILE]
classDef configStyle fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
classDef categoryStyle fill:#f1f8e9,stroke:#558b2f,stroke-width:2px
class Config configStyle
class Processing,ErrorHandling,Integration categoryStyle
Performance Characteristics
Processing Time Factors
- Document Size: Larger documents require more processing time
- AI Analysis: API calls add latency but improve accuracy
- Statement Count: More statements increase processing complexity
- Network Latency: Affects API calls and Paperless uploads
- Rate Limiting: Backoff delays when hitting API limits (see Backoff Mechanisms)
- Retry Logic: Failed operations with exponential backoff increase total processing time
- Validation Level: Strict validation adds processing overhead
Typical Performance Metrics
| Document Type | Processing Time | Memory Usage | Accuracy |
|---|---|---|---|
| Single Statement (5 pages) | 2-5 seconds | <100MB | 98% |
| Multi-Statement (20 pages) | 10-30 seconds | 200-400MB | 95% |
| Large Document (50+ pages) | 1-5 minutes | 500MB+ | 93% |
Monitoring and Observability
Key Metrics to Monitor
pie title Processing Metrics
"Successful Processing" : 80
"Quarantined (Validation)" : 8
"Quarantined (Critical)" : 4
"Partial Success" : 3
"Rate Limited (Backoff)" : 5
Backoff-Specific Metrics
- Rate Limit Hits: Frequency of rate limit encounters
- Backoff Delays: Average and maximum backoff times
- Retry Success Rate: Percentage of retries that succeed
- Burst Token Usage: Current burst token levels
- API Request Patterns: Requests per minute over time
Logging and Audit Trail
- Processing Logs: Detailed execution traces
- Audit Logs: Security and compliance tracking
- Error Reports: Structured failure analysis
- Performance Metrics: Processing time and resource usage
Recovery and Maintenance
Automated Recovery
- Retry Logic: Automatic retry with exponential backoff and jitter
- Rate Limiting: Token bucket rate limiting with configurable burst capacity
- Fallback Processing: Pattern matching when AI unavailable
- Partial Success Handling: Continue processing despite non-critical failures
- Backoff Strategy: Configurable delays with jitter to prevent thundering herd
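The token bucket mentioned above can be sketched as follows. This is a generic token bucket, not the project's implementation. Mapping `OPENAI_REQUESTS_PER_MINUTE` and `OPENAI_BURST_LIMIT` to the two constructor arguments is an assumption, and the `TokenBucket` class with its injectable clock is a hypothetical design used so the example runs without waiting.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at a steady rate, holds at most burst_limit tokens."""

    def __init__(self, requests_per_minute, burst_limit, clock=time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_limit)
        self.tokens = float(burst_limit)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        """Add tokens for the time elapsed since the last update, capped at capacity."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Consume one token if available; return whether the request may proceed."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# Demo with a fake clock so the behavior is deterministic.
now = [0.0]
bucket = TokenBucket(requests_per_minute=60, burst_limit=2, clock=lambda: now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third call exceeds the burst
now[0] = 1.0  # one second later, one token has been refilled
after_refill = bucket.try_acquire()
print(burst, after_refill)
```

A caller that gets `False` from `try_acquire` can sleep for `wait_time()` seconds before retrying, which is where the backoff delay enters the processing time.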
Manual Recovery
- Quarantine Review: Regular review of failed documents
- Configuration Tuning: Adjust validation strictness based on patterns
- Batch Reprocessing: Process recovered documents in batches
Maintenance Operations
- Quarantine Cleanup: Automated removal of old failed documents
- Log Rotation: Prevent log files from consuming excessive disk space
- Performance Monitoring: Track processing metrics over time
Workflow Integration Summary
The Bank Statement Separator implements two complementary workflow architectures:
Application Processing Workflow (This Document)
- 8-node LangGraph pipeline for PDF processing
- Comprehensive error handling with quarantine system
- AI-powered analysis with pattern-matching fallback
- Rate limiting and backoff mechanisms for API calls
- Audit logging and compliance tracking
CI/CD Pipeline Workflow (GitHub Workflows)
- 5 interconnected GitHub Actions workflows
- Automated testing with Python matrix (3.11, 3.12)
- Release automation using conventional commits
- Security scanning and dependency review
- Documentation versioning with mike deployment
Integration Points
- Configuration: Environment variables control both processing behavior and CI/CD settings
- Testing: CI workflows validate the processing pipeline functionality
- Releases: Automated releases deploy both code and documentation updates
- Monitoring: Both systems provide comprehensive logging and error reporting
This dual-workflow architecture ensures:
- Robust Processing: Reliable document processing with fallback mechanisms
- Quality Assurance: Automated testing and security scanning
- Continuous Delivery: Automated releases and documentation updates
- Comprehensive Monitoring: Full visibility into both processing and deployment workflows
For detailed information about the rate limiting and backoff mechanisms, see the Backoff Mechanisms Design Document.