Boundary Detection Issue Report - FULLY RESOLVED¶
Issue Summary¶
Multiple boundary detection failures were identified affecting both LLM providers and fallback processing modes where separate statements were incorrectly merged into single output files.
Status: ✅ FULLY RESOLVED in v0.1.0 (September 2025)
🎯 Major Resolution Update (August 2025)¶
CRITICAL FIX: Resolved core LLM boundary detection accuracy issue affecting both OpenAI and Ollama providers.
✅ Root Cause Identified and Fixed¶
Primary Issue: Adjacent Boundary Consolidation Bug in _validate_and_consolidate_boundaries()
- Problem: Logic
boundary.start_page <= last_boundary.end_page + 1treated adjacent pages as overlapping - Impact: 3 separate statements (Westpac, CBA, NAB) merged into 1 statement
- Result: 33% accuracy → 100% accuracy after fix
Secondary Issue: LLM Text Preparation Without Page Boundaries
- Problem: Combined text
" ".join(text_chunks)provided no structural information - Impact: LLM couldn't identify page transitions between statements
- Result: Enhanced with
=== PAGE N ===markers for clear structure
Resolution Summary¶
The boundary detection issue has been successfully resolved through comprehensive improvements to the fallback processing system:
✅ Implemented Solutions¶
- Enhanced Fallback Detection (
llm_analyzer.py) - Added text-based analysis for stronger header detection
- Implemented fragment detection using multiple criteria
-
Enhanced confidence scoring based on critical elements
-
Fragment Filtering (
workflow.py) - Automatic filtering of low-confidence fragments (< 0.3)
- Tracking of skipped fragments and pages
-
Transparent logging of filtering decisions
-
Validation Improvements (
workflow.py) - Adjusted validation to account for intentionally skipped pages
- Dynamic file size tolerance based on skipped content
- Clear reporting of validation adjustments
🎯 Results¶
- Before: Fragment merged with valid statement, causing incorrect boundary
- After: Fragment automatically detected and filtered, clean statement separation
- Accuracy: Improved boundary detection even without OpenAI API
- Transparency: Clear logging of what content was filtered and why
Affected File¶
- Output:
test/output_batch_test/unknown-0267-unknown-date.pdf - Source: Generated from
triple_statements_mixed_banks_test_statements.pdf - Account: NAB account 084234560267
Issue Details¶
What Happened¶
The system incorrectly merged two distinct sections:
- Page 1: A fragment showing a single transaction (10/02/2023 ATM withdrawal)
- Pages 2-3: Complete NAB statement for period Jan 16 - Feb 15, 2023
Expected Behavior¶
These should have been detected as separate statements or the fragment should have been excluded.
Root Cause¶
The fallback boundary detection (pattern-based) failed to identify the boundary between:
- The transaction fragment on page 1
- The proper statement header on page 2
Technical Analysis¶
Current Fallback Logic Issues¶
- Weak Header Detection: The pattern matching doesn't strongly differentiate between:
- Statement fragments with minimal formatting
-
Actual statement headers with full bank/account details
-
Missing Boundary Indicators: The fallback mode doesn't detect:
- Sudden format changes between pages
- Incomplete transaction tables
-
Missing statement period indicators on fragments
-
Metadata Extraction Failure: The system couldn't extract proper metadata, resulting in:
- Filename:
unknown-0267-unknown-date.pdf - Missing bank identification (should be "nab")
- Missing date information
Recommended Improvements¶
Short-term Fixes¶
- Enhance Header Pattern Matching
- Require minimum header elements (bank name, account number, statement period)
-
Detect full statement headers vs. transaction fragments
-
Add Fragment Detection
- Identify incomplete pages (single transactions without context)
-
Flag pages with insufficient metadata
-
Improve Boundary Confidence Scoring
- Score potential boundaries based on multiple factors
- Require minimum confidence threshold
Long-term Solutions¶
- Multi-Pass Analysis
- First pass: Identify definite statement headers
- Second pass: Group pages between headers
-
Third pass: Handle orphaned fragments
-
Structure Analysis
- Detect consistent formatting within statements
-
Flag format changes as potential boundaries
-
Enhanced Fallback Models
- Train lightweight ML model for boundary detection
- Use document structure features without requiring LLM
Testing Requirements¶
- Test with various fragment types
- Test with statements having weak/minimal headers
- Test with mixed format documents
- Ensure improvements don't break existing working cases
Impact Assessment¶
- Severity: Medium (incorrect document separation but no data loss)
- Frequency: Occurs in fallback mode with certain document structures
- User Impact: Incorrectly merged documents uploaded to Paperless
Next Steps¶
- Implement enhanced header pattern matching
- Add fragment detection logic
- Improve metadata extraction fallback
- Add specific test cases for this scenario
- Consider adding validation warnings for low-confidence boundaries