Boundary Detection Issue Report - FULLY RESOLVED

Issue Summary

Multiple boundary detection failures were identified in both the LLM provider and fallback processing modes: separate statements were incorrectly merged into single output files.

Status: ✅ FULLY RESOLVED in v0.1.0 (September 2025)

🎯 Major Resolution Update (August 2025)

CRITICAL FIX: Resolved core LLM boundary detection accuracy issue affecting both OpenAI and Ollama providers.

✅ Root Cause Identified and Fixed

Primary Issue: Adjacent Boundary Consolidation Bug in _validate_and_consolidate_boundaries()

Problem: The condition boundary.start_page <= last_boundary.end_page + 1 treated adjacent pages as overlapping, so boundaries on consecutive pages were consolidated
Impact: 3 separate statements (Westpac, CBA, NAB) merged into 1 statement
Result: 33% accuracy → 100% accuracy after fix
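
The off-by-one can be reproduced with a minimal sketch. The `Boundary` dataclass, the `consolidate` function, and the page ranges are hypothetical stand-ins: the real signature of `_validate_and_consolidate_boundaries()` is not shown in this report, and the corrected condition (dropping the `+ 1`) is an assumption about what the fix looks like.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    start_page: int
    end_page: int


def consolidate(boundaries, merge_adjacent):
    """Merge boundaries whose page ranges overlap.

    merge_adjacent=True reproduces the buggy condition quoted in this report;
    merge_adjacent=False is an assumed form of the fix (no "+ 1").
    """
    merged = []
    for b in sorted(boundaries, key=lambda x: x.start_page):
        if merged:
            last = merged[-1]
            slack = 1 if merge_adjacent else 0
            if b.start_page <= last.end_page + slack:
                # Ranges are treated as overlapping: extend the previous one.
                last.end_page = max(last.end_page, b.end_page)
                continue
        merged.append(Boundary(b.start_page, b.end_page))
    return merged


# Three back-to-back statements (illustrative page ranges).
statements = [Boundary(1, 2), Boundary(3, 4), Boundary(5, 6)]
buggy = consolidate(statements, merge_adjacent=True)
fixed = consolidate(statements, merge_adjacent=False)
print(len(buggy), len(fixed))
```

With the buggy condition, each statement starts exactly one page after the previous one ends, so all three collapse into a single boundary. Without the `+ 1`, only ranges that genuinely share a page are merged.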

Secondary Issue: LLM Text Preparation Without Page Boundaries

Problem: Text combined with " ".join(text_chunks) carried no page structure
Impact: LLM couldn't identify page transitions between statements
Result: Text sent to the LLM now includes === PAGE N === markers so page transitions are explicit
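
A minimal sketch of the marker approach follows. The function name `prepare_text_with_page_markers` is hypothetical, and the sketch assumes `text_chunks` holds one entry per page, which this report does not state explicitly.

```python
def prepare_text_with_page_markers(text_chunks):
    """Join per-page text, prefixing each page with a '=== PAGE N ===' marker."""
    parts = []
    for page_number, chunk in enumerate(text_chunks, start=1):
        parts.append(f"=== PAGE {page_number} ===")
        parts.append(chunk.strip())
    return "\n".join(parts)


# Placeholder page text, one string per page (assumption).
pages = ["Westpac statement text", "CBA statement text", "NAB statement text"]

flat = " ".join(pages)                          # original approach: no page structure
marked = prepare_text_with_page_markers(pages)  # pages are now delimited
print(marked)
```

The marked form lets a prompt ask for boundaries in terms of page numbers, which the flat string cannot support.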

Resolution Summary

The boundary detection issue has been successfully resolved through comprehensive improvements to the fallback processing system:

✅ Implemented Solutions

Enhanced Fallback Detection (llm_analyzer.py):
  - Added text-based analysis for stronger header detection
  - Implemented fragment detection using multiple criteria
  - Enhanced confidence scoring based on critical elements
Fragment Filtering (workflow.py):
  - Automatic filtering of low-confidence fragments (< 0.3)
  - Tracking of skipped fragments and pages
  - Transparent logging of filtering decisions
Validation Improvements (workflow.py):
  - Adjusted validation to account for intentionally skipped pages
  - Dynamic file size tolerance based on skipped content
  - Clear reporting of validation adjustments
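
The filtering and validation steps above can be sketched as follows. Only the 0.3 threshold comes from this report. `DetectedSection`, `filter_fragments`, `validate_page_count`, and the confidence values are hypothetical names and data, not the actual contents of workflow.py. File size tolerance is left out of the sketch.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

FRAGMENT_CONFIDENCE_THRESHOLD = 0.3  # threshold quoted in this report


@dataclass
class DetectedSection:
    start_page: int
    end_page: int
    confidence: float


def filter_fragments(sections, threshold=FRAGMENT_CONFIDENCE_THRESHOLD):
    """Return (kept sections, skipped page numbers), logging each skip."""
    kept, skipped_pages = [], []
    for section in sections:
        if section.confidence < threshold:
            pages = list(range(section.start_page, section.end_page + 1))
            skipped_pages.extend(pages)
            logger.info(
                "Skipping fragment on pages %s (confidence %.2f < %.2f)",
                pages, section.confidence, threshold,
            )
        else:
            kept.append(section)
    return kept, skipped_pages


def validate_page_count(total_pages, kept, skipped_pages):
    """Output pages plus intentionally skipped pages should cover the input."""
    output_pages = sum(s.end_page - s.start_page + 1 for s in kept)
    return output_pages + len(skipped_pages) == total_pages


# Page 1 is a low-confidence fragment; pages 2-3 form a full statement.
sections = [DetectedSection(1, 1, 0.2), DetectedSection(2, 3, 0.9)]
kept, skipped = filter_fragments(sections)
print(kept, skipped, validate_page_count(3, kept, skipped))
```

Counting skipped pages in the validation keeps a deliberately dropped fragment from being reported as missing output.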

🎯 Results

Before: Fragment merged with valid statement, causing incorrect boundary
After: Fragment automatically detected and filtered, clean statement separation
Accuracy: Improved boundary detection even when the OpenAI API is not available
Transparency: Clear logging of what content was filtered and why

Affected File

Output: test/output_batch_test/unknown-0267-unknown-date.pdf
Source: Generated from triple_statements_mixed_banks_test_statements.pdf
Account: NAB account 084234560267

Issue Details

What Happened

The system incorrectly merged two distinct sections:

Page 1: A fragment showing a single transaction (10/02/2023 ATM withdrawal)
Pages 2-3: Complete NAB statement for period Jan 16 - Feb 15, 2023

Expected Behavior

These should have been detected as separate statements, or the fragment should have been excluded.

Root Cause

The fallback (pattern-based) boundary detection failed to identify the boundary between the transaction fragment on page 1 and the proper statement header on page 2.

Technical Analysis

Current Fallback Logic Issues

Weak Header Detection: The pattern matching doesn't strongly differentiate between:
  - Statement fragments with minimal formatting
  - Actual statement headers with full bank/account details
Missing Boundary Indicators: The fallback mode doesn't detect:
  - Sudden format changes between pages
  - Incomplete transaction tables
  - Missing statement period indicators on fragments
Metadata Extraction Failure: The system couldn't extract proper metadata, resulting in:
  - Filename: unknown-0267-unknown-date.pdf
  - Missing bank identification (should be "nab")
  - Missing date information

Recommended Improvements

Short-term Fixes

Enhance Header Pattern Matching:
  - Require minimum header elements (bank name, account number, statement period)
  - Detect full statement headers vs. transaction fragments
Add Fragment Detection:
  - Identify incomplete pages (single transactions without context)
  - Flag pages with insufficient metadata
Improve Boundary Confidence Scoring:
  - Score potential boundaries based on multiple factors
  - Require minimum confidence threshold
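
One way to combine the short-term fixes above is to score each page by how many required header elements it contains. The sketch below is illustrative only: the regular expressions, the `score_header` and `is_statement_header` names, and the default threshold (all three elements required) are assumptions, and real statements would need broader patterns.

```python
import re

# Hypothetical patterns for the three required header elements.
HEADER_PATTERNS = {
    "bank_name": re.compile(r"\b(westpac|commonwealth bank|cba|nab)\b", re.I),
    "account_number": re.compile(r"\b\d{8,16}\b"),
    "statement_period": re.compile(r"statement period", re.I),
}


def score_header(page_text):
    """Return (confidence, found elements); confidence is the fraction found."""
    found = {name for name, pattern in HEADER_PATTERNS.items()
             if pattern.search(page_text)}
    return len(found) / len(HEADER_PATTERNS), found


def is_statement_header(page_text, min_confidence=1.0):
    """Treat a page as a statement header only above the confidence threshold."""
    confidence, _ = score_header(page_text)
    return confidence >= min_confidence


# Sample text modelled on the pages described in this report.
header = (
    "NAB\n"
    "Account number 084234560267\n"
    "Statement period: Jan 16 - Feb 15, 2023"
)
fragment = "10/02/2023 ATM withdrawal"

print(score_header(header))
print(score_header(fragment))
```

A page such as the single-transaction fragment matches none of the elements and scores zero, so it can be flagged as a fragment rather than merged into the following statement.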

Long-term Solutions

Multi-Pass Analysis:
  - First pass: Identify definite statement headers
  - Second pass: Group pages between headers
  - Third pass: Handle orphaned fragments
Structure Analysis:
  - Detect consistent formatting within statements
  - Flag format changes as potential boundaries
Enhanced Fallback Models:
  - Train lightweight ML model for boundary detection
  - Use document structure features without requiring LLM
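
The multi-pass idea can be sketched in a few lines. This is a hypothetical outline, not an existing function in the project: `group_pages` takes one boolean per page (True where a definite statement header was found) and treats only pages before the first header as orphans, which is a simplifying assumption.

```python
def group_pages(header_flags):
    """Three-pass grouping over per-page booleans (True = statement header).

    Returns (statements, orphans): statements is a list of
    (start_page, end_page) tuples, orphans is a list of page numbers.
    Pages are 1-indexed.
    """
    total = len(header_flags)

    # Pass 1: pages that carry a definite statement header.
    header_pages = [page for page, is_header
                    in enumerate(header_flags, start=1) if is_header]

    # Pass 2: each statement runs from its header to the page before the next header.
    statements = []
    for idx, start in enumerate(header_pages):
        if idx + 1 < len(header_pages):
            end = header_pages[idx + 1] - 1
        else:
            end = total
        statements.append((start, end))

    # Pass 3: pages before the first header belong to no statement.
    first_header = header_pages[0] if header_pages else total + 1
    orphans = list(range(1, first_header))
    return statements, orphans


# The case from this report: fragment on page 1, NAB statement on pages 2-3.
print(group_pages([False, True, False]))
```

Applied to the affected file, page 1 would be reported as an orphan instead of being merged into the NAB statement on pages 2-3.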

Testing Requirements

Test with various fragment types
Test with statements having weak/minimal headers
Test with mixed format documents
Ensure improvements don't break existing working cases

Impact Assessment

Severity: Medium (incorrect document separation but no data loss)
Frequency: Occurs in fallback mode with certain document structures
User Impact: Incorrectly merged documents uploaded to Paperless

Next Steps

Implement enhanced header pattern matching
Add fragment detection logic
Improve metadata extraction fallback
Add specific test cases for this scenario
Consider adding validation warnings for low-confidence boundaries