Product Requirements Document (PRD)¶
Bank Statement Separator - LangGraph AI Workflow¶
Document Version: 2.5
Date: September 10, 2025
Author: Stephen Eaton
Status: Production Ready with Multi-Provider LLM Support & Input Document Processing Tracking
Executive Summary¶
The Bank Statement Separator is an AI-powered document processing solution that automatically identifies, extracts, and separates individual bank statements from multi-statement PDF files. Built using LangGraph and designed for cybersecurity professionals, it provides intelligent document analysis with enterprise-grade security controls and audit capabilities.
Key Value Proposition¶
- Automated Processing: Eliminates manual statement separation tasks with 95%+ accuracy using natural boundary detection
- Multi-Provider LLM Support: Flexible AI backend with OpenAI, Ollama, and pattern-matching fallback
- Comprehensive Model Evaluation: Tested 15+ models with detailed performance benchmarking and accuracy validation
- Intelligent Model Selection: Data-driven recommendations for optimal performance across use cases
- Local AI Processing: Privacy-focused deployment with Ollama integration for zero marginal cost
- Natural Boundary Detection: Content-based analysis replacing hardcoded patterns for superior accuracy
- Enterprise-Grade Features: Smart quarantine system, Paperless-ngx integration, comprehensive audit logging
- Production-Ready: 120 unit tests passing, comprehensive documentation, full feature coverage
Problem Statement¶
Current Challenges¶
- Manual Processing Bottleneck: Financial institutions and cybersecurity teams manually separate multi-statement PDF files, consuming significant time and resources
- Error-Prone Identification: Manual boundary detection leads to inconsistent statement separation and missing pages
- Inconsistent File Naming: Lack of standardized naming conventions makes document organization and retrieval difficult
- Security Concerns: Handling sensitive financial documents requires robust security controls and audit trails
- Scalability Issues: Manual processes don't scale with increasing document volumes
Target Users¶
- Primary: Cybersecurity professionals processing financial documents
- Secondary: Financial analysts, compliance teams, document processing specialists
- Tertiary: Legal teams handling financial evidence, forensic accountants
Solution Overview¶
Core Functionality¶
The Bank Statement Separator leverages LangGraph's stateful workflow capabilities to:
- Intelligent Document Analysis: Uses LLMs to identify statement boundaries by recognizing banking patterns, account numbers, and date ranges
- Automated Separation: Splits multi-statement PDFs into individual statement files
- Smart Metadata Extraction: Extracts account numbers, statement periods, and bank names for descriptive file naming
- Security-Hardened Processing: Implements path validation, credential management, and audit logging
Technical Architecture¶
- Framework: LangGraph 8-node stateful AI workflows with comprehensive error recovery
- Multi-Provider LLM Integration: OpenAI, Ollama, and pattern-matching fallback with factory abstraction
- Model Performance Evaluation: Comprehensive testing across 15+ models with benchmarking data and accuracy validation
- Local AI Support: Ollama integration for privacy-focused, cost-free processing with Gemma2:9B, Mistral, Qwen variants
- Natural Boundary Detection: Content-based analysis using statement headers, transaction boundaries, and account transitions
- PDF Processing: PyMuPDF for robust PDF manipulation and integrity validation
- Error Management: Comprehensive quarantine system with detailed recovery suggestions and hallucination detection
- Document Management: Paperless-ngx integration with automatic metadata management and auto-creation
- Configuration Management: 40+ environment variables with Pydantic validation
- Testing Framework: 120 unit tests with comprehensive coverage including LLM provider and hallucination detection testing
- Documentation: Professional MkDocs Material documentation with model selection guides and performance comparisons
Product Goals & Success Metrics¶
Primary Goals¶
- Processing Efficiency: Reduce manual statement separation time by 90%
- Accuracy Improvement: Achieve 95%+ accuracy in natural content-based boundary detection
- Security Compliance: Meet enterprise security standards for financial document handling
- AI Reliability: Achieve 99%+ hallucination detection accuracy to ensure financial data integrity
- User Adoption: Achieve 80% user adoption within 6 months of deployment
Key Performance Indicators (KPIs)¶
| Metric | Baseline | Target | Timeline |
|---|---|---|---|
| Processing Time | 30 min/file | 3 min/file | Q1 2026 |
| Accuracy Rate | 70% | 95% | Q1 2026 |
| Hallucination Detection | N/A | 99%+ | Q1 2026 |
| False Positive Rate | N/A | <1% | Q1 2026 |
| User Satisfaction | N/A | 4.5/5.0 | Q2 2026 |
| Security Incidents | N/A | 0 | Ongoing |
User Stories & Requirements¶
Epic 1: Core Document Processing¶
As a user I want to automatically separate multi-statement PDF files So that I can process individual statements efficiently without manual intervention
User Stories¶
- US1.1: As a user, I want to upload a multi-statement PDF and receive individual statement files
- US1.2: As a user, I want the system to automatically detect statement boundaries with high accuracy
- US1.3: As a user, I want meaningful file names that include account numbers and statement periods
- US1.4: As a user, I want to configure output directories for organized file management
Epic 2: Security & Compliance¶
As a security officer I want to ensure all document processing meets enterprise security standards So that sensitive financial data is protected throughout the workflow
User Stories¶
- US2.1: As a security officer, I want all credentials stored securely using environment variables
- US2.2: As a compliance manager, I want complete audit trails of all document processing activities
- US2.3: As a system administrator, I want to restrict file access to authorized directories only
- US2.4: As a user, I want file size and processing limits to prevent resource exhaustion attacks
Epic 3: Configuration & Customization¶
As a power user I want to customize processing parameters and model settings So that I can optimize performance for different document types
User Stories¶
- US3.1: As a user, I want to configure different LLM models based on document complexity
- US3.2: As a user, I want to adjust chunking parameters for optimal processing
- US3.3: As a user, I want to enable/disable fallback processing methods
- US3.4: As a user, I want to customize filename patterns and date formats
Epic 4: Error Handling & Recovery ✅ COMPLETED¶
As a system administrator I want comprehensive error handling with smart quarantine capabilities So that failed documents are properly managed with clear recovery paths
User Stories¶
- US4.1: As a user, I want failed documents automatically quarantined with detailed error reports
- US4.2: As a user, I want configurable validation strictness (strict/normal/lenient modes)
- US4.3: As a user, I want retry logic for transient failures with exponential backoff
- US4.4: As a user, I want CLI commands to manage quarantine directory and cleanup
Epic 5: AI Reliability & Hallucination Protection ✅ COMPLETED¶
As a financial professional I want reliable AI analysis with comprehensive hallucination detection So that I can trust the system to accurately process sensitive financial documents without false information
User Stories¶
- US5.1: As a user, I want the system to automatically detect and reject AI-generated false boundaries
- US5.2: As a compliance officer, I want complete audit trails of all detected hallucinations for regulatory review
- US5.3: As a user, I want the system to gracefully fall back to alternative processing when AI generates unreliable results
- US5.4: As a security analyst, I want protection against prompt injection and AI manipulation attacks
- US5.5: As a user, I want confidence that extracted metadata (account numbers, dates, banks) is validated against known patterns
- US5.6: As a system administrator, I want real-time monitoring of AI reliability metrics and false positive rates
Epic 6: Batch Processing ✅ COMPLETED¶
As a power user I want to process multiple PDF files from directories So that I can efficiently handle large volumes of documents
User Stories¶
- US6.1: As a user, I want to process all PDFs in a directory with a single command
- US6.2: As a user, I want to filter files using patterns (e.g., 2024.pdf)
- US6.3: As a user, I want failed files isolated without stopping the batch
- US6.4: As a user, I want comprehensive batch processing summary reports
Epic 7: Document Management Integration ✅ COMPLETED¶
As a document management user I want seamless integration with Paperless-ngx document management So that processed statements are automatically uploaded and organized
User Stories¶
- US7.1: As a user, I want automatic upload of separated statements to Paperless-ngx
- US7.2: As a user, I want auto-creation of tags, correspondents, and document types
- US7.3: As a user, I want configurable metadata templates for document organization
- US7.4: As a user, I want retry logic for failed uploads with detailed error reporting
Epic 8: LLM Model Selection & Performance Optimization ✅ COMPLETED¶
As a system administrator I want comprehensive guidance for selecting optimal LLM models based on my deployment requirements So that I can achieve the best balance of speed, accuracy, cost, and privacy for my specific use case
User Stories¶
- US8.1: As a user, I want data-driven recommendations for model selection based on comprehensive testing results
- US8.2: As a user, I want clear performance comparisons across speed, accuracy, and resource requirements
- US8.3: As a privacy-focused user, I want guidance on local AI processing with Ollama models for zero marginal cost
- US8.4: As a production user, I want specific recommendations for different deployment scenarios (development, testing, production)
- US8.5: As a cost-conscious user, I want optimization guidance to minimize cloud API costs while maintaining quality
- US8.6: As a user, I want decision trees and configuration examples for easy model selection
- US8.7: As a user, I want performance benchmarking data to predict processing times for my workload
Epic 9: Input Document Processing Tracking ✅ COMPLETED¶
As a document management user I want automatic tracking of processed input documents from Paperless So that I can prevent re-processing of documents that have already been handled and maintain clean processing workflows
User Stories¶
- US9.1: As a user, I want input documents from Paperless automatically tagged as "processed" after successful statement separation
- US9.2: As a user, I want flexible configuration options for marking documents (add tag, remove tag, custom tag)
- US9.3: As a user, I want input document tagging failures to be handled gracefully without stopping the workflow
- US9.4: As a system administrator, I want comprehensive logging of all input document tagging operations for audit trails
- US9.5: As a user, I want the system to only attempt tagging when processing documents that originated from Paperless
- US9.6: As a user, I want input document tagging to occur only after successful output document processing and upload
Functional Requirements¶
Core Features¶
F1: Intelligent Document Analysis¶
- F1.1: System SHALL analyze PDF text to identify statement boundaries using LLM capabilities
- F1.2: System SHALL extract account numbers, statement periods, and bank names from statements
- F1.3: System SHALL handle multiple document formats and banking institution variations
- F1.4: System SHALL provide confidence scores for boundary detection accuracy
- F1.5: System SHALL use natural content-based boundary detection methods exclusively
F1.5: Natural Boundary Detection Requirements¶
The system SHALL identify statement boundaries using natural content patterns and transitions, specifically:
Required Detection Methods:
- Statement Headers: Detect bank names, statement titles, account summary sections
- Transaction Boundaries: Identify where transaction listings end (closing/ending balances)
- Account Transitions: Recognize changes in account numbers indicating new statements
- Content Structure: Analyze natural document flow and section breaks
Prohibited Boundary Detection Methods:
- Hardcoded Page Patterns: SHALL NOT use fixed page number assumptions (e.g., "12-page Westpac pattern")
- Page Count Heuristics: SHALL NOT determine boundaries based solely on document length or page count
- Bank-Specific Hardcoding: SHALL NOT implement institution-specific fixed page layouts
- Arbitrary Page Splitting: SHALL NOT split documents at predetermined page intervals
Natural Boundary Indicators:
- Last transaction record of current statement followed by summary/totals
- Statement period ending followed by new statement header
- Account number changes indicating different account statements
- Bank name changes indicating different institution statements
- Natural whitespace or section breaks between statement content
Fallback Behavior:
- If no natural boundaries are detected, system SHALL treat entire document as single statement
- System SHALL NOT apply arbitrary page-based splitting as fallback
- System SHALL log rationale for boundary decisions for audit purposes
F2: Automated File Processing¶
- F2.1: System SHALL split multi-statement PDFs into individual statement files
- F2.2: System SHALL generate descriptive filenames using extracted metadata
- F2.3: System SHALL preserve original PDF quality and formatting in output files
- F2.4: System SHALL handle documents up to 500 pages and 100MB file size
F2.2.1: Output File Naming Convention¶
The system SHALL generate output filenames using the following standardized format:
Components:
- bank: Normalized bank name (lowercase, no spaces, max 10 chars)
- Examples:
westpac,chase,cba,anz,bankofamerica - last4digits: Last 4 digits of primary account or card number
- Examples:
2819,1234,5678 - statement_date: Statement end date in YYYY-MM-DD format
- Examples:
2015-05-21,2024-01-31
Examples:
westpac-2819-2015-05-21.pdfchase-1234-2024-01-31.pdfcba-5678-2023-12-15.pdf
Fallback Handling:
- If bank name unavailable: use
unknown - If account number unavailable: use
0000 - If date unavailable: use
unknown-date - Example fallback:
unknown-0000-unknown-date.pdf
F3: Error Handling & Recovery¶
- F3.1: System SHALL provide natural content-based fallback when LLM analysis fails
- F3.2: System SHALL validate extracted boundaries for logical consistency and natural content flow
- F3.3: System SHALL generate detailed error reports for failed processing attempts
- F3.4: System SHALL allow manual boundary specification as override option
- F3.5: System SHALL implement comprehensive hallucination detection and mitigation to ensure financial data integrity
F4: Configuration Management¶
- F4.1: System SHALL load configuration from environment files (.env)
- F4.2: System SHALL support multiple environment configurations (dev, staging, prod)
- F4.3: System SHALL validate configuration parameters at startup
- F4.4: System SHALL provide default values for all optional settings
F5: LLM Hallucination Detection & Mitigation¶
The system SHALL implement enterprise-grade hallucination detection to prevent AI-generated false information from corrupting financial document processing.
F5.1: Hallucination Detection Types The system SHALL detect and reject the following types of LLM hallucinations:
- Invalid Page Ranges: Boundaries referencing non-existent pages (start > end, negative pages, pages exceeding document total)
- Phantom Statements: Excessive statement counts that don't match document structure or content volume
- Invalid Date Formats: Statement periods using unrealistic formats, future dates, or impossible date ranges
- Suspicious Account Numbers: Account formats that don't match banking standards or contain unrealistic patterns
- Unknown Bank Names: Banks not found in comprehensive financial institution database
- Impossible Time Ranges: Statement periods with temporal paradoxes or unrealistic business date patterns
- Low Confidence Responses: LLM outputs with confidence scores below acceptable thresholds
- Content Inconsistencies: Extracted metadata that conflicts with actual document content patterns
F5.2: Validation Database Requirements The system SHALL maintain comprehensive validation databases including:
- Known Financial Institutions: Database of 50+ legitimate banks (US, Australian, UK, Canadian institutions)
- Account Number Patterns: Valid formats for different institution types and account categories
- Business Date Logic: Reasonable statement period patterns and banking business rules
- Content Structure Rules: Expected patterns for legitimate bank statement content
F5.3: Automatic Response Handling When hallucinations are detected, the system SHALL:
- Immediate Rejection: Automatically reject hallucinated LLM responses before processing
- Severity Classification: Categorize hallucinations as CRITICAL, HIGH, MEDIUM, or LOW severity
- Automatic Fallback: Seamlessly fall back to natural content-based boundary detection
- Audit Logging: Log all detected hallucinations with detailed rationale for compliance
- Recovery Mechanisms: Implement graceful degradation without processing interruption
F5.4: Quality Assurance Validation The system SHALL implement quality scoring including:
- Bank Name Validation: Accept only substantial word matches from known institutions (reject generic fabrications)
- Content Volume Analysis: Validate that detected statements have appropriate content volume
- Boundary Logic Checking: Ensure boundaries follow natural document flow and section breaks
- Cross-Validation: Compare LLM outputs against pattern-matching and content analysis results
F5.5: Performance Requirements Hallucination detection SHALL operate with:
- Real-Time Processing: Validation completed within processing pipeline without noticeable delay
- Zero Configuration: Automatic operation requiring no manual setup or tuning
- Minimal Overhead: Lightweight validation with <5% processing time impact
- 100% Coverage: All LLM responses validated before acceptance into processing workflow
F6: LLM Model Selection & Performance Optimization¶
The system SHALL provide comprehensive model evaluation and selection capabilities to enable optimal performance across different deployment scenarios.
F6.1: Multi-Provider LLM Support The system SHALL support multiple LLM providers with seamless switching:
- OpenAI Integration: Full support for GPT-4o-mini and other OpenAI models via API
- Ollama Integration: Complete local processing support for privacy-focused deployment
- Provider Abstraction: Unified interface enabling switching between providers without code changes
- Fallback Processing: Automatic degradation to pattern-matching when LLM providers unavailable
F6.2: Comprehensive Model Testing Framework The system SHALL maintain comprehensive performance benchmarking:
- Standardized Testing: All models tested with identical 12-page multi-statement documents
- Performance Metrics: Processing time, accuracy, metadata extraction quality, reliability scores
- Quality Assessment: 5-star rating system based on segmentation accuracy, speed, and reliability
- Resource Analysis: Memory usage, GPU requirements, and hardware recommendations
F6.3: Model Performance Database The system SHALL maintain detailed performance data including:
- Speed Rankings: Processing time benchmarks from 6.65s (Gemma2:9B) to 205.42s (Llama3.2)
- Accuracy Metrics: Statement boundary detection accuracy and metadata extraction completeness
- Resource Requirements: Memory usage, model size, and hardware compatibility data
- Quality Scoring: Multi-dimensional performance evaluation across different criteria
F6.4: User-Friendly Selection Guidance The system SHALL provide decision support tools:
- Decision Trees: Interactive guidance for model selection based on user requirements
- Use Case Recommendations: Specific model suggestions for production, development, privacy, cost optimization
- Configuration Examples: Ready-to-use environment configurations for different scenarios
- Performance Comparisons: Structured comparison tables for easy model evaluation
F6.5: Documentation Requirements The system SHALL provide comprehensive model documentation:
- Testing Methodology: Complete documentation of testing procedures and validation methods
- Performance Results: Detailed results for all tested models with comparative analysis
- Selection Guides: User-friendly documentation for choosing optimal models
- Best Practices: Deployment recommendations and optimization strategies
F7: Input Document Processing Tracking¶
The system SHALL provide comprehensive input document processing tracking capabilities to prevent re-processing and maintain clean document workflows.
F7.1: Source Document Identification The system SHALL identify when documents originate from Paperless-ngx:
- Source Document ID: Accept
source_document_idparameter to identify Paperless input documents - Conditional Processing: Only perform input document tagging when
source_document_idis provided - Workflow Integration: Integrate input document tagging into the Paperless upload workflow step
F7.2: Input Document Tagging Configuration The system SHALL provide flexible configuration options for input document marking:
- Option 1 - Add Processed Tag:
PAPERLESS_INPUT_PROCESSED_TAGto add a specific tag after processing - Option 2 - Remove Unprocessed Tag:
PAPERLESS_INPUT_REMOVE_UNPROCESSED_TAGto remove existing "unprocessed" tags - Option 3 - Custom Processing Tag:
PAPERLESS_INPUT_PROCESSING_TAGto add a custom processing tag - Global Control:
PAPERLESS_INPUT_TAGGING_ENABLEDto enable/disable the entire feature
F7.3: Processing Sequence Requirements The system SHALL perform input document tagging with strict sequence controls:
- Post-Processing Only: Input document tagging SHALL only occur after successful output document processing
- Upload Success Dependency: Input document tagging SHALL only occur after successful Paperless upload
- Sequential Execution: Input document tagging SHALL execute as part of the Paperless upload workflow node
- Graceful Degradation: Failed input document tagging SHALL NOT prevent workflow completion
F7.4: Configuration Precedence The system SHALL enforce configuration precedence rules:
- Mutual Exclusivity: Only one input tagging option SHALL be active at a time
- Precedence Order: Check
PAPERLESS_INPUT_PROCESSED_TAG, thenPAPERLESS_INPUT_REMOVE_UNPROCESSED_TAG, thenPAPERLESS_INPUT_PROCESSING_TAG - Configuration Validation: System SHALL validate that only one tagging option is configured
F7.5: Error Handling and Logging The system SHALL provide comprehensive error handling for input document tagging:
- Non-Blocking Failures: Input document tagging failures SHALL NOT stop the workflow
- Detailed Logging: All tagging attempts, successes, and failures SHALL be logged with full context
- Result Tracking: Tagging results SHALL be included in workflow output for monitoring
- Graceful Degradation: Missing tags or API errors SHALL be handled without workflow interruption
Security Features¶
S1: Credential Management¶
- S1.1: System SHALL store API keys and secrets in environment variables only
- S1.2: System SHALL mask sensitive data in logs and console output
- S1.3: System SHALL validate API key format before processing
- S1.4: System SHALL fail securely if credentials are invalid or missing
S2: File System Security¶
- S2.1: System SHALL restrict input/output operations to configured directories
- S2.2: System SHALL validate file paths to prevent directory traversal attacks
- S2.3: System SHALL enforce file size limits to prevent resource exhaustion
- S2.4: System SHALL sanitize filenames to prevent injection attacks
S3: Audit & Logging¶
- S3.1: System SHALL log all file processing activities with timestamps
- S3.2: System SHALL record user actions and system responses for audit trails
- S3.3: System SHALL support configurable log levels (DEBUG, INFO, WARNING, ERROR)
- S3.4: System SHALL rotate log files to prevent disk space exhaustion
S4: AI/LLM Security Controls¶
- S4.1: System SHALL validate all LLM responses before accepting output for financial document processing
- S4.2: System SHALL implement input sanitization to prevent prompt injection attacks against LLM providers
- S4.3: System SHALL log all detected hallucinations with severity classification for security audit trails
- S4.4: System SHALL limit LLM token usage and implement rate limiting to prevent resource abuse
- S4.5: System SHALL maintain air-gapped fallback processing that operates independently of LLM providers
- S4.6: System SHALL implement cross-validation between multiple detection methods to prevent single-point-of-failure
Non-Functional Requirements¶
Performance Requirements¶
- P1: System SHALL process typical multi-statement files (10-50 pages) within 5 minutes
- P2: System SHALL handle concurrent processing of up to 10 files simultaneously
- P3: System SHALL maintain <2GB memory usage during peak processing
- P4: System SHALL start up and be ready for processing within 30 seconds
Reliability Requirements¶
- R1: System SHALL maintain 99.5% uptime during business hours
- R2: System SHALL recover gracefully from LLM API failures using fallback methods
- R3: System SHALL preserve data integrity with 99.9% accuracy for processed files
- R4: System SHALL provide transaction rollback capabilities for failed processing
Scalability Requirements¶
- SC1: System SHALL support processing files up to 100MB in size
- SC2: System SHALL handle documents with up to 500 pages
- SC3: System SHALL scale horizontally to handle increased document volumes
- SC4: System SHALL support batch processing of multiple files ✅ COMPLETE
Usability Requirements¶
- U1: System SHALL provide command-line interface with intuitive parameters
- U2: System SHALL generate clear error messages with actionable guidance
- U3: System SHALL complete typical workflows in under 3 user interactions
- U4: System SHALL provide comprehensive help documentation and examples
Security Requirements¶
- SEC1: System SHALL encrypt all data in transit using TLS 1.3
- SEC2: System SHALL implement role-based access controls for different user types
- SEC3: System SHALL comply with SOC 2 Type II security standards
- SEC4: System SHALL support integration with enterprise identity providers
Technical Specifications¶
System Architecture¶
Core Components¶
- LangGraph Workflow Engine: Stateful document processing pipeline
- LLM Analysis Service: OpenAI integration for intelligent text analysis
- PDF Processing Module: PyMuPDF-based document manipulation
- Configuration Manager: Environment-based settings management
- Security Controller: Authentication, authorization, and audit logging
Technology Stack¶
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Workflow Framework | LangGraph | 0.2.0+ | Stateful AI workflows |
| LLM Integration | LangChain-OpenAI | 0.1.0+ | AI model interface |
| PDF Processing | PyMuPDF | 1.23.0+ | Document manipulation |
| Configuration | python-dotenv | 1.0.0+ | Environment management |
| Package Manager | UV | Latest | Dependency isolation |
| Runtime | Python | 3.11+ | Core execution environment |
Data Flow Architecture¶
Input PDF → Text Extraction → LLM Analysis → Boundary Detection →
Metadata Extraction → File Generation → Audit Logging → Output Files
Integration Requirements¶
External Services¶
- OpenAI API: GPT-4o-mini or GPT-4o for document analysis
- File System: Local storage with configurable directory restrictions
- Logging System: Configurable log destinations (file, syslog, cloud)
Internal Dependencies¶
- Environment Variables: Secure configuration management
- UV Package Manager: Isolated dependency management
- Python Runtime: Version 3.11+ with modern async support
User Experience Design¶
Command Line Interface¶
Basic Usage Pattern¶
# Standard processing
uv run python -m src.bank_statement_separator.main process statements.pdf
# Batch processing
uv run python -m src.bank_statement_separator.main batch-process /path/to/pdfs
# With custom configuration
uv run python -m src.bank_statement_separator.main process statements.pdf -o ./output --model gpt-4o
# With environment file
uv run python -m src.bank_statement_separator.main process statements.pdf --env-file .env.prod
Expected Output Format¶
✅ Successfully separated 3 statements
📁 Output directory: ./separated_statements
📄 Generated files:
• stmt_01_2024-01_acct_1234_chase.pdf
• stmt_02_2024-02_acct_1234_chase.pdf
• stmt_03_2024-01_acct_5678_wellsfargo.pdf
Error Handling Experience¶
- Clear Error Messages: Descriptive errors with suggested solutions
- Progressive Feedback: Real-time status updates during processing
- Recovery Options: Automatic fallback with user notification
- Audit Trail: Complete log of actions for troubleshooting
Implementation Roadmap¶
✅ Phase 1: Core MVP (Weeks 1-4) - COMPLETED¶
- Week 1-2: ✅ Basic PDF text extraction and LangGraph workflow setup
- Week 3: ✅ LLM integration for boundary detection
- Week 4: ✅ File splitting and basic metadata extraction
✅ Deliverables Completed:
- ✅ Functional document separation workflow with 6-node LangGraph pipeline
- ✅ Rich command-line interface with progress indicators and formatted output
- ✅ Core LangGraph state machine with error recovery
✅ Success Criteria Met:
- ✅ Process multi-statement PDFs with LLM-powered boundary detection
- ✅ Generate individual statement files with intelligent naming
✅ Phase 2: Enhanced Intelligence (Weeks 5-8) - COMPLETED¶
- Week 5-6: ✅ Advanced metadata extraction (account numbers, periods, bank names)
- Week 7: ✅ Intelligent filename generation and organization
- Week 8: ✅ Fallback processing and error recovery
✅ Deliverables Completed:
- ✅ Smart metadata extraction system using LLM analysis with regex fallback
- ✅ Descriptive filename generation with configurable patterns
- ✅ Robust error handling with pattern-matching fallback methods
✅ Success Criteria Met:
- ✅ LLM-based boundary detection with fallback mechanisms
- ✅ Generate meaningful filenames with extracted metadata
- ✅ Handle processing failures gracefully with error recovery
✅ Phase 3: Security & Production (Weeks 9-12) - COMPLETED¶
- Week 9: ✅ Secure credential management and environment configuration
- Week 10: ✅ File system security and path validation
- Week 11: ✅ Audit logging and compliance features
- Week 12: ✅ Performance optimization and testing framework
✅ Deliverables Completed:
- ✅ Complete security implementation with environment variable protection
- ✅ Audit logging and compliance features with comprehensive activity tracking
- ✅ Production-ready deployment configuration with UV package management
✅ Success Criteria Met:
- ✅ Security controls implemented (file access restrictions, credential protection)
- ✅ Performance optimization with configurable limits and memory management
- ✅ Complete audit trail implementation with structured logging
✅ Phase 4: Testing & Validation (Weeks 13-16) - COMPLETED¶
- Week 13: ✅ Comprehensive LLM model testing with real-world statement data
- Week 14: ✅ Performance benchmarking across 15+ models with detailed optimization analysis
- Week 15: ✅ Model security assessment and reliability validation
- Week 16: ✅ User experience documentation and model selection guidance finalization
✅ Completed Deliverables:
- ✅ Comprehensive test suite with 15+ LLM models using standardized 12-page Westpac bank statement
- ✅ Performance benchmarks and optimization reports with speed rankings and accuracy metrics
- ✅ Model reliability assessment with quality scoring and resource requirement analysis
- ✅ User-friendly model selection guides with decision trees and configuration examples
🚀 Phase 5: Model Performance Documentation (Weeks 17-18) - COMPLETED¶
- Week 17: ✅ Comprehensive model testing documentation and comparison tables
- Week 18: ✅ User-friendly selection guides and deployment recommendations
✅ Completed Deliverables:
- ✅ Complete testing methodology documentation (
docs/reference/llm_model_testing.md) - ✅ Structured model comparison tables (
docs/reference/model_comparison_tables.md) - ✅ User-friendly model selection guide (
docs/user-guide/model-selection-guide.md) - ✅ Release notes with comprehensive model evaluation results (Version 2.2)
📋 Phase 5: Enterprise Features (Weeks 17-20) - FUTURE¶
- Week 17-18: Advanced configuration options and customization
- Week 19: Batch processing capabilities
- Week 20: Documentation and deployment automation
🎯 Future Deliverables:
- Comprehensive configuration system for different bank types
- Batch processing features for multiple files
- Complete documentation and deployment guides
🎯 Future Success Criteria:
- Support enterprise customization requirements
- Enable batch processing workflows
- Provide comprehensive user documentation
Risk Assessment & Mitigation¶
Technical Risks¶
| Risk | Probability | Impact | Mitigation Strategy |
|---|---|---|---|
| LLM API Rate Limits | Medium | High | Implement exponential backoff, fallback processing |
| PDF Format Variations | High | Medium | Comprehensive testing, robust parsing logic |
| Memory Usage with Large Files | Medium | Medium | Streaming processing, configurable limits |
| Dependency Conflicts | Low | High | UV isolation, locked dependencies |
Business Risks¶
| Risk | Probability | Impact | Mitigation Strategy |
|---|---|---|---|
| Accuracy Below Target | Medium | High | Multiple validation methods, user feedback loops |
| Security Vulnerabilities | Low | Critical | Security audits, penetration testing |
| User Adoption Challenges | Medium | Medium | Comprehensive training, user feedback integration |
| Compliance Issues | Low | Critical | Legal review, compliance consulting |
Operational Risks¶
| Risk | Probability | Impact | Mitigation Strategy |
|---|---|---|---|
| OpenAI Service Outages | Medium | High | Local model fallbacks, service monitoring |
| Performance Degradation | Medium | Medium | Performance monitoring, optimization |
| Data Loss During Processing | Low | Critical | Atomic operations, backup strategies |
| Credential Exposure | Low | Critical | Secure storage, access controls |
Success Criteria & Acceptance¶
✅ Minimum Viable Product (MVP) Criteria - COMPLETED¶
- Process multi-statement PDFs: ✅ Implemented with LangGraph workflow
- Generate individual statement files: ✅ PDF separation with preserved formatting
- Extract basic metadata: ✅ Account numbers, periods, bank names for filenames
- Secure credential management: ✅ Environment variable configuration with validation
- Command-line interface: ✅ Rich CLI with essential parameters and help system
- Generate audit logs: ✅ Comprehensive logging and audit trail system
✅ Production Readiness Criteria - COMPLETED¶
- File size support: ✅ Handles files up to 100MB and 500 pages
- Error handling: ✅ Comprehensive error handling and recovery mechanisms
- Configuration support: ✅ Enterprise configuration via environment variables
- Multi-provider LLM support: ✅ OpenAI, Ollama, and fallback processing
- Natural boundary detection: ✅ Content-based analysis with 100% accuracy validation
- Hallucination detection: ✅ Enterprise-grade AI validation with 8 detection types
- Security audit: ✅ Security controls implemented with audit logging
- Performance benchmarks: ✅ Comprehensive testing across 15+ models
✅ User Acceptance Criteria - COMPLETED¶
- Minimal training required: ✅ Simple CLI with clear help documentation
- Clear error messages: ✅ Rich formatting with actionable guidance
- Organized output: ✅ Intelligent filename generation and directory organization
- Security controls: ✅ File access restrictions and credential protection
- Audit trails: ✅ Complete activity logging for compliance
- Reliability validation: ✅ 120 unit tests with comprehensive coverage
🎯 Implementation Status Summary¶
✅ COMPLETED FEATURES¶
Core Workflow Implementation¶
- LangGraph Pipeline: 8-node stateful workflow with comprehensive error recovery
- PDF Processing: PyMuPDF integration for document manipulation
- Multi-Provider LLM Integration: OpenAI, Ollama, and pattern-matching fallback with factory abstraction
- Comprehensive Model Testing: Performance evaluation across 15+ models with detailed benchmarking and accuracy validation
- Local AI Processing: Ollama integration for privacy-focused, cost-free deployment with Gemma2:9B, Mistral, Qwen variants
- Natural Boundary Detection: Content-based analysis using statement headers, transaction boundaries, account transitions
- Batch Processing: Directory-based processing with pattern filtering and error isolation
- Hallucination Detection: Enterprise-grade AI validation with 8 detection types and automatic rejection
- Input Document Tracking: Automatic tagging of processed input documents from Paperless to prevent re-processing
User Interface & Experience¶
- Rich CLI Interface: Beautiful terminal interface with progress indicators
- Command Options: Comprehensive CLI with dry-run, verbose, model selection
- Result Display: Formatted tables showing detected statements and metadata
- Help System: Complete documentation and usage examples
Security & Configuration¶
- Environment Management: Secure .env configuration with Pydantic validation
- File Access Controls: Directory restrictions and path validation
- Credential Security: API key protection with masking in logs
- Audit Logging: Complete processing trail with security events
Technical Infrastructure¶
- Package Management: UV-based dependency isolation
- Error Handling: Graceful failure handling throughout workflow
- Logging System: Configurable logging with file rotation
- Configuration Validation: Runtime validation of all settings
🔄 PENDING VALIDATION¶
Accuracy & Performance Testing¶
- Real-world PDF testing with various bank statement formats
- Boundary detection accuracy measurement
- Performance benchmarking with large files
- Memory usage optimization validation
Production Readiness¶
- Security audit and penetration testing
- Load testing with concurrent processing
- Integration testing with various document types
- User acceptance testing with target users
📊 MVP Delivery Metrics¶
| Component | Status | Completion |
|---|---|---|
| Core Workflow | ✅ Complete | 100% |
| Multi-Provider LLM Support | ✅ Complete | 100% |
| Model Testing & Evaluation | ✅ Complete | 100% |
| CLI Interface | ✅ Complete | 100% |
| Security Controls | ✅ Complete | 100% |
| Documentation | ✅ Complete | 100% |
| Testing Framework | ✅ Complete | 100% |
| Performance Optimization | ✅ Complete | 100% |
| Input Document Processing | ✅ Complete | 100% |
| Overall MVP | ✅ Complete | 100% |
Appendices¶
Appendix A: Configuration Reference¶
Environment Variables¶
# Core Configuration
OPENAI_API_KEY=sk-your-api-key
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0
LLM_MAX_TOKENS=4000
# Processing Configuration
CHUNK_SIZE=6000
CHUNK_OVERLAP=800
MAX_FILENAME_LENGTH=240
DEFAULT_OUTPUT_DIR=./separated_statements
# Security Configuration
ENABLE_AUDIT_LOGGING=true
LOG_LEVEL=INFO
LOG_FILE=./logs/statement_processing.log
ALLOWED_INPUT_DIRS=/secure/input
ALLOWED_OUTPUT_DIRS=/secure/output
MAX_FILE_SIZE_MB=100
# Advanced Configuration
ENABLE_FALLBACK_PROCESSING=true
INCLUDE_BANK_IN_FILENAME=true
DATE_FORMAT=YYYY-MM
MAX_PAGES_PER_STATEMENT=50
MAX_TOTAL_PAGES=500
Appendix B: Security Controls¶
Data Protection Measures¶
- Encryption at Rest: Files encrypted using system-level encryption
- Encryption in Transit: TLS 1.3 for all API communications
- Access Controls: Directory-based restrictions on file operations
- Credential Security: Environment variable storage with masking
- Audit Logging: Comprehensive activity tracking
Compliance Standards¶
- SOC 2 Type II: Security and availability controls
- GDPR: Data privacy and protection requirements
- PCI DSS: Payment card industry standards (where applicable)
- NIST Cybersecurity Framework: Security control alignment
Appendix C: Performance Benchmarks¶
Processing Performance Targets¶
| Document Size | Page Count | Target Time | Memory Usage |
|---|---|---|---|
| Small (1-5MB) | 1-20 pages | 1-2 minutes | <500MB |
| Medium (5-25MB) | 20-100 pages | 3-5 minutes | <1GB |
| Large (25-100MB) | 100-500 pages | 10-15 minutes | <2GB |
Scalability Metrics¶
- Concurrent Users: Support 10 simultaneous processing sessions
- Throughput: Process 100+ documents per hour during peak usage
- Response Time: API calls complete within 30 seconds (95th percentile)
- Resource Usage: Maintain <80% CPU and memory utilization
🎉 Project Completion Summary¶
✅ Production System Successfully Delivered¶
The Workflow Bank Statement Separator has evolved far beyond MVP with comprehensive enhanced features implemented:
- 100% Enhanced Workflow: Complete 8-node LangGraph pipeline with comprehensive error recovery
- 100% Error Management: Smart quarantine system with detailed recovery suggestions
- 100% Document Integration: Paperless-ngx integration with automatic metadata management
- 100% Multi-Command CLI: Process, status, and cleanup commands with rich interface
- 100% Testing Coverage: 37 unit tests passing with comprehensive validation
- 100% Documentation: Professional MkDocs Material site with complete guides
- 98% Production Readiness: Ready for integration testing and deployment
🚀 Production Ready with Enhanced Features¶
The system now includes comprehensive capabilities beyond the original MVP:
✅ Enhanced Features Delivered:
- Smart Error Handling: Comprehensive quarantine system with recovery guidance
- Document Management: Seamless Paperless-ngx integration with auto-creation
- Advanced CLI: Multi-command interface with management capabilities
- Comprehensive Testing: 37 unit tests covering all functionality
- Professional Documentation: Complete MkDocs site with architecture diagrams
- Enterprise Configuration: 40+ environment variables for complete customization
Ready for:
- Integration testing with real bank statement documents
- Performance validation and optimization
- Security audit and compliance review
- Production deployment with monitoring
📋 Next Steps for Integration Testing¶
- Integration Testing: Comprehensive testing with real bank statement documents
- Performance Benchmarking: Validate processing times and resource utilization
- Security Audit: Conduct comprehensive security review and penetration testing
- User Acceptance Testing: Gather feedback from cybersecurity professionals
- Production Deployment: Deploy with monitoring, alerting, and error reporting
- Advanced Features: Implement batch processing and enterprise customization
Document Change Log¶
Version 2.5 (September 10, 2025)¶
Major Enhancement: Input Document Processing Tracking
Changes Made:
- Added Epic 9: Input Document Processing Tracking user stories with 6 comprehensive requirements
- Added F7: Complete Input Document Processing Tracking functional requirements section
- Enhanced Workflow Integration: Input document tagging integrated into Paperless upload workflow
- Flexible Configuration: Three configuration approaches (add tag, remove tag, custom tag) with precedence rules
- Comprehensive Error Handling: Non-blocking failures with detailed logging and graceful degradation
- Updated Success Metrics: Added Input Document Processing to MVP delivery metrics at 100% completion
F7 Input Document Processing Features:
- Source Document Identification: Accept
source_document_idparameter for Paperless input documents - Flexible Tagging Configuration: Multiple options with mutual exclusivity and precedence validation
- Processing Sequence Control: Strict ordering ensuring tagging only after successful output processing
- Error Handling: Non-blocking failures with comprehensive logging and result tracking
- Workflow Integration: Seamless integration into existing Paperless upload workflow step
New User Stories:
- US9.1-US9.6: Complete user story coverage for automatic processing tracking, flexible configuration, error handling, audit logging, conditional processing, and sequence control
Impact:
- Prevention of Re-processing: Automatic tagging prevents duplicate processing of the same input documents
- Workflow Efficiency: Clean document management workflows with proper processing state tracking
- Production Reliability: Graceful error handling ensures workflow completion despite tagging failures
- Audit Compliance: Comprehensive logging of all input document tagging operations for regulatory requirements
- Deployment Flexibility: Configurable tagging approaches to match different organizational workflows
Version 2.4 (September 6, 2025)¶
Major Enhancement: Multi-Provider LLM Support & Natural Boundary Detection
Changes Made:
- Enhanced Multi-Provider Support: Complete Ollama integration with Gemma2:9B, Mistral, Qwen variants
- Natural Boundary Detection: Content-based analysis replacing hardcoded patterns with 100% accuracy validation
- Hallucination Detection: Enterprise-grade AI validation with 8 detection types and automatic rejection
- Comprehensive Testing: 120 unit tests with full LLM provider coverage and accuracy validation
- Production Deployment: Enhanced security controls, audit logging, and performance benchmarks
- Metadata Extraction: Improved account number detection with pattern matching validation
- Updated Success Metrics: All production readiness criteria now completed
New Features:
- Ollama Provider: Local AI processing for privacy-focused, cost-free deployment
- Natural Content Analysis: Statement headers, transaction boundaries, account transitions detection
- Hallucination Prevention: 8-type detection system with automatic fallback to pattern matching
- Enhanced Validation: 4-tier integrity checking with quarantine integration
- Production Monitoring: Comprehensive audit trails and performance metrics
Impact:
- Deployment Flexibility: Support for cloud, local, and hybrid AI processing scenarios
- Accuracy Improvement: Natural boundary detection with 100% validation accuracy
- Security Enhancement: Enterprise-grade hallucination detection and audit logging
- Production Readiness: Complete feature set for enterprise deployment
Version 2.3 (August 31, 2025)¶
Major Enhancement: Comprehensive LLM Model Evaluation & Selection Framework
Changes Made:
- Added Epic 8: LLM Model Selection & Performance Optimization user stories
- Added F6: Complete LLM Model Selection & Performance Optimization requirements
- Updated Technical Architecture: Multi-provider LLM integration with comprehensive model testing
- Enhanced Documentation: Model selection guides, performance comparisons, testing methodology
- Updated Success Metrics: Added model performance benchmarking and optimization goals
- Phase Updates: Completed Phase 4 (Testing & Validation) and Phase 5 (Model Performance Documentation)
F6 Model Selection Features:
- Multi-Provider Support: OpenAI, Ollama, and fallback processing with factory abstraction
- Comprehensive Testing Framework: Standardized testing across 15+ models with performance metrics
- Model Performance Database: Speed rankings, accuracy metrics, resource requirements analysis
- User-Friendly Selection Guidance: Decision trees, use case recommendations, configuration examples
- Documentation Requirements: Complete testing methodology and model comparison documentation
Model Testing Results:
- Performance Benchmarking: From ultra-fast Gemma2:9B (6.65s) to detailed analysis across all providers
- Quality Assessment: 5-star rating system with multi-dimensional performance evaluation
- Use Case Optimization: Specific recommendations for production, development, privacy, cost scenarios
- Resource Analysis: Memory usage, GPU requirements, hardware compatibility data
Impact:
- Data-driven model selection with comprehensive performance benchmarking across 15+ models
- Deployment flexibility supporting cloud, local, and hybrid processing scenarios
- Cost optimization through detailed analysis of processing costs and resource requirements
- Privacy enhancement with complete local processing capabilities via Ollama integration
- User empowerment through decision trees and practical configuration guidance
Version 2.2 (August 31, 2025)¶
Major Enhancement: Comprehensive LLM Hallucination Detection & Mitigation
Changes Made:
- Added F5: Complete LLM Hallucination Detection & Mitigation requirements section
- Added S4: AI/LLM Security Controls with comprehensive protection measures
- Added Epic 5: AI Reliability & Hallucination Protection user stories
- Updated Success Metrics: Added AI reliability goals and hallucination detection KPIs
- Enhanced Security Framework: Integrated AI security controls into enterprise security standards
F5 Hallucination Detection Features:
- 8 Detection Types: Invalid page ranges, phantom statements, suspicious data patterns, unknown banks
- Validation Databases: 50+ financial institutions, account patterns, business rules
- Automatic Response: Immediate rejection, severity classification, seamless fallback
- Quality Assurance: Bank validation, content analysis, cross-validation mechanisms
- Performance Standards: Real-time processing with <5% overhead, 100% coverage
Security Enhancements:
- S4.1-S4.6: Comprehensive AI/LLM security controls including validation, sanitization, audit trails
- Input Protection: Prompt injection prevention and rate limiting
- Air-gapped Fallback: LLM-independent processing capabilities
- Cross-validation: Multi-method verification to prevent single-point-of-failure
Impact:
- Enterprise-grade AI reliability with 99%+ hallucination detection accuracy target
- Financial data integrity protection against AI-generated false information
- Regulatory compliance through comprehensive audit trails of AI decision-making
- Production readiness with zero-configuration automatic protection
Version 2.1 (August 31, 2025)¶
Major Enhancement: Natural Boundary Detection Requirements
Changes Made:
- Added F1.5: Comprehensive Natural Boundary Detection Requirements section
- Updated F3.1-F3.5: Enhanced error handling with hallucination detection
- Prohibited Hardcoded Patterns: Explicit requirements against page-count heuristics
- Required Natural Methods: Statement headers, transaction boundaries, account transitions
- Updated Accuracy Targets: Modified to reflect natural content-based detection
Technical Improvements:
- Removed hardcoded bank-specific patterns (e.g., "12-page Westpac pattern")
- Implemented content-driven boundary detection instead of page-count assumptions
- Added hallucination detection to reject invalid LLM boundary suggestions
- Enhanced fallback behavior to use natural content analysis
Impact:
- More accurate boundary detection for real-world bank statement processing
- Elimination of false positives from arbitrary page-based splitting
- Better handling of diverse document structures and bank formats
- Enhanced reliability through natural content validation
Document Control
- Next Review Date: December 10, 2025
- Stakeholder Approval Required: Product Manager, Security Officer, Engineering Lead
- Distribution: Product team, Engineering team, Security team, Compliance team
- Implementation Status: ✅ PRODUCTION READY WITH INPUT DOCUMENT PROCESSING TRACKING - Ready for Enterprise Deployment