Bank Statement Separator¶
An AI-powered tool for automatically separating multi-statement PDF files using LangChain and LangGraph.
What's New¶
Latest Release: Version 0.4.0
- CLI Help System Improvements: Overhauled command-line help system for a better user experience
- Enhanced Version Command: Cleaner version output with the decorative borders removed
- Code Quality Enhancements: Extensive code refactoring and quality improvements following best practices
- Modular Architecture: Improved help system architecture for better maintainability
- Backwards Compatible: All improvements are backwards compatible with no migration required
See the full Release Notes for detailed changes and the Changelog for the complete version history.
Overview¶
The Bank Statement Separator is designed for people who need to split a single PDF file that contains multiple bank statements. It uses AI models to identify statement boundaries, extract metadata, and create a separate PDF file for each statement.
Production Ready
This system is production-ready with comprehensive error handling, document management integration, and robust testing. All 37 unit tests are passing with full feature coverage.
Key Features¶
- AI-Powered Analysis: Uses OpenAI GPT models for intelligent boundary detection
- LangGraph Workflow: Stateful 8-node processing pipeline with error recovery
- Smart Metadata Extraction: Automatically extracts account numbers, statement periods, and bank names
- Batch Processing: Process multiple PDF files from directories with pattern filtering
- Paperless-ngx Integration: Automatic upload to paperless-ngx document management with auto-creation
- Comprehensive Error Handling: Advanced quarantine system with detailed error reports and recovery suggestions
- Error Detection & Tagging: Automatic identification and tagging of processing issues (v0.3.0+)
- Document Validation: Pre-processing validation with configurable strictness levels
- Security-First Design: Secure credential management and file access controls
- Rich CLI Interface: Beautiful multi-command interface with progress indicators and quarantine management
- Audit Logging: Complete processing trail for compliance requirements
Advanced Error Handling and Resilience¶
The system implements sophisticated error handling with automatic recovery mechanisms to ensure reliable operation:
Intelligent Backoff Mechanisms¶
- Exponential Backoff with Jitter: Automatically retries failed API requests with exponentially increasing delays (1s, 2s, 4s, 8s...) plus random jitter (10%-100%) to prevent thundering herd problems
- Rate Limiting Integration: Token bucket algorithm with sliding window tracking prevents API quota exhaustion
- Selective Retry Logic: Only retries on recoverable errors (RateLimitError, timeouts) while failing immediately on permanent issues
- Configurable Limits: Adjustable retry attempts (default: 3) and delay caps (max: 60 seconds)
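The retry behaviour described in this list can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `RateLimitError` here is a local stand-in class, `backoff_delay` and `call_with_retry` are hypothetical names, and the jitter is assumed to be an extra 10%-100% of the base delay, with the total capped at 60 seconds.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a recoverable API error (hypothetical, not a library class)."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... plus jitter."""
    delay = min(cap, base * (2 ** attempt))
    # Assumption: jitter adds between 10% and 100% of the base delay.
    jitter = delay * (0.1 + 0.9 * rng())
    # The total delay never exceeds the cap.
    return min(cap, delay + jitter)


def call_with_retry(func, max_retries=3,
                    retryable=(RateLimitError, TimeoutError),
                    sleep=time.sleep):
    """Call `func`, retrying only on recoverable errors."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable:
            # Out of retries: re-raise the last recoverable error.
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
        # Any other exception is treated as permanent and propagates immediately.
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. The random jitter spreads out retries from concurrent workers so they do not all hit the API at the same instant.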
Comprehensive Quarantine System¶
- Automatic Document Isolation: Failed documents are moved to quarantine with detailed error reports
- Recovery Suggestions: Actionable guidance provided for each error type (password removal, format repair, quota upgrades)
- Validation Strictness Levels: Configurable error handling from strict (high accuracy) to lenient (high success rate)
- Error Report Generation: JSON reports with timestamps, failure reasons, and system diagnostics
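As a rough illustration of the quarantine step, the sketch below moves a failed file into a quarantine directory and writes a JSON report next to it. The function name, the report field names, and the `.error.json` file suffix are all assumptions made for this example; the project's real report schema may differ.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def quarantine_document(pdf_path, quarantine_dir, reason, suggestions):
    """Move a failed PDF into quarantine and write a JSON error report beside it."""
    pdf_path = Path(pdf_path)
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)

    # Isolate the document so it is not picked up again by later runs.
    target = quarantine_dir / pdf_path.name
    shutil.move(str(pdf_path), str(target))

    # Hypothetical report layout: timestamp, failure reason, recovery guidance.
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "original_path": str(pdf_path),
        "quarantined_path": str(target),
        "failure_reason": reason,
        "recovery_suggestions": list(suggestions),
    }
    report_path = target.with_name(target.stem + ".error.json")
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report_path
```

Keeping the report beside the quarantined file means each failure can be reviewed on its own, without consulting a separate log.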
Resilience Features¶
- Fallback Processing: Pattern-based boundary detection when AI services are unavailable
- Circuit Breaker Pattern: Temporary service suspension during persistent failures
- Resource Management: Memory and disk monitoring with graceful degradation
- Audit Trail: Complete error history for compliance and troubleshooting
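The circuit breaker idea listed above can be sketched as a small wrapper around a service call. This is a generic version of the pattern under assumed defaults (three failures, a 30-second suspension); the class and parameter names are hypothetical and do not come from the project's code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are suspended because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still suspended: fail fast without touching the service.
                raise CircuitOpenError("service suspended; use fallback")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open (or re-open) the circuit starting from now.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller could catch `CircuitOpenError` and switch to the pattern-based fallback while the AI service is suspended.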
For implementation details, see the Backoff Mechanisms Design Document and Error Handling Guide.
Architecture Overview¶
graph TD
A[PDF Input] --> B[PDF Ingestion & Validation]
B --> C[Document Analysis]
C --> D[AI Statement Detection]
D --> E[Metadata Extraction]
E --> F[PDF Generation]
F --> G[File Organization]
G --> H[Output Validation]
H --> I[Paperless Upload]
I --> L[Error Detection & Tagging]
B --> J[Quarantine System]
H --> J
J --> K[Error Reports]
style A fill:#e1f5fe
style I fill:#e8f5e8
style L fill:#fff3e0
style J fill:#fff3e0
style K fill:#fff3e0
Detailed Architecture
For comprehensive workflow diagrams including error handling flows, retry logic, and configuration impacts, see the complete Workflow Architecture Overview.
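To make the flow in the diagram more concrete, here is a plain-Python sketch of a stateful pipeline in which each node receives and returns a shared state dictionary. It deliberately does not use the LangGraph API. The node functions, state keys, and the rule that any node failure leads to quarantine are simplifying assumptions for illustration; in the diagram only the ingestion and output validation steps route to quarantine.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


def run_pipeline(nodes: List[Tuple[str, Node]], state: State) -> State:
    """Run nodes in order; on the first failure, record it and stop."""
    for name, node in nodes:
        try:
            state = node(state)
            state.setdefault("completed", []).append(name)
        except Exception as exc:
            # Simplification: any failing node flags the document for quarantine.
            state["error"] = f"{name}: {exc}"
            state["quarantined"] = True
            break
    return state


def ingest(state: State) -> State:
    """Hypothetical ingestion check: accept only paths ending in .pdf."""
    if not str(state["path"]).lower().endswith(".pdf"):
        raise ValueError("not a PDF")
    return state


def detect_boundaries(state: State) -> State:
    """Placeholder: a real node would ask the model for page ranges."""
    state["boundaries"] = [(1, 3), (4, 6)]
    return state


# Only two of the pipeline's nodes are shown; the rest would follow the same shape.
PIPELINE = [
    ("pdf_ingestion", ingest),
    ("statement_detection", detect_boundaries),
]
```

Calling `run_pipeline(PIPELINE, {"path": "statements.pdf"})` returns the final state, including the list of completed nodes. A failure leaves an `error` entry that a quarantine step could turn into a report.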
Use Cases¶
Financial Analysis¶
- Multi-Bank Processing: Handle statements from multiple banks in a single document
- Period Separation: Automatically separate statements by time periods
- Compliance Reporting: Maintain audit trails for regulatory requirements
Document Management¶
- Paperless Integration: Auto-upload to document management systems
- Metadata Extraction: Automatically tag and categorize documents
- Error Recovery: Smart handling of processing failures with recovery suggestions
Cybersecurity¶
- Secure Processing: Protected credential management and access controls
- Audit Logging: Complete activity trails for security compliance
- Input Validation: Comprehensive document validation before processing
Quick Start¶
Get started in just a few minutes:
# Test with dry-run
uv run bank-statement-separator \
process statements.pdf --dry-run --yes
# Process single document
uv run bank-statement-separator \
process statements.pdf -o ./output --yes
# Process batch of documents
uv run bank-statement-separator \
batch-process /path/to/pdfs -o ./batch-output --yes
System Requirements¶
- Python: 3.11 or higher
- Package Manager: UV (recommended)
- API Access: OpenAI API key for optimal processing
- Memory: 4GB RAM minimum (8GB+ recommended for large documents)
- Storage: 100MB+ for quarantine and log files
Performance Metrics¶
| Metric | Value |
|---|---|
| Processing Speed | ~2-5 seconds per statement |
| Accuracy Rate | 95%+ with AI analysis |
| Fallback Success | 85%+ without API key |
| Memory Usage | <500MB per document |
| Test Coverage | 37/37 unit tests passing |
Roadmap¶
✅ Phase 1 - Core Features (Complete)¶
- LangGraph workflow implementation
- AI-powered boundary detection
- Metadata extraction
- CLI interface
✅ Phase 2 - Enhanced Features (Complete)¶
- Error handling & quarantine system
- Paperless-ngx integration
- Multi-command CLI
- Document validation
- Comprehensive testing
✅ Phase 3 - Error Detection & Advanced Features (Complete)¶
- Error detection & tagging system
- Enhanced Paperless integration with error flagging
- Configurable error handling with severity filtering
- Comprehensive test coverage and manual testing scripts
- Updated documentation and workflow diagrams
🚧 Phase 4 - Production Deployment (In Progress)¶
- Docker containerization
- Cloud storage integration
- Web dashboard
- Batch processing
📋 Phase 5 - Enterprise Features (Planned)¶
- Multi-tenant support
- REST API
- Custom workflows
- Advanced analytics
Documentation Versions¶
This documentation is versioned to match software releases. Use the version selector in the top navigation to access documentation for specific versions:
- Latest: Always points to the most recent release documentation
- Versioned: Access documentation for specific releases
Version URLs
- Latest: https://madeinoz67.github.io/bank-statement-separator/
- Version 0.1.0: https://madeinoz67.github.io/bank-statement-separator/v0.1.0/
Finding Your Version
Check your installed version with: uv run bank-statement-separator --version
License¶
This project is licensed under the MIT License - see the LICENSE file for details.
The MIT License is a permissive open-source license that allows you to:
- ✅ Use the software for commercial and private purposes
- ✅ Modify and distribute the software
- ✅ Include in proprietary software
- ✅ Sublicense the software
Get Help¶
- Documentation: Browse the complete documentation
- Issues: Report bugs or request features on GitHub Issues
- Discussions: Join the community discussion
- Support: Contact the development team
Ready to get started? Check out the Quick Start Guide or dive into the Installation Instructions.