Boundary Detection Issue Report - FULLY RESOLVED

Issue Summary

Multiple boundary detection failures were identified, affecting both the LLM providers and the fallback processing mode. In each case, separate statements were incorrectly merged into a single output file.

Status: ✅ FULLY RESOLVED in v0.1.0 (September 2025)

🎯 Major Resolution Update (August 2025)

CRITICAL FIX: Resolved core LLM boundary detection accuracy issue affecting both OpenAI and Ollama providers.

✅ Root Cause Identified and Fixed

Primary Issue: Adjacent Boundary Consolidation Bug in _validate_and_consolidate_boundaries()

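  • Problem: The condition boundary.start_page <= last_boundary.end_page + 1 treated a boundary that starts on the page immediately after the previous one as overlapping, so adjacent statements were consolidated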
  • Impact: 3 separate statements (Westpac, CBA, NAB) merged into 1 statement
  • Result: 33% accuracy → 100% accuracy after fix
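
The effect of the + 1 in that condition can be reproduced with a minimal sketch. The Boundary class and consolidate() function below are hypothetical stand-ins, not the project's _validate_and_consolidate_boundaries() code, and the corrected condition (dropping the + 1 so that only truly overlapping ranges merge) is an assumption, since the report does not quote the actual fix.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    # Hypothetical stand-in for the project's boundary type.
    # Pages are 1-indexed and inclusive.
    start_page: int
    end_page: int


def consolidate(boundaries, merge_adjacent):
    """Merge boundaries in page order.

    merge_adjacent=True reproduces the reported `+ 1` condition;
    merge_adjacent=False merges only ranges that genuinely overlap
    (an assumed form of the fix).
    """
    slack = 1 if merge_adjacent else 0
    merged = []
    for b in sorted(boundaries, key=lambda b: b.start_page):
        if merged and b.start_page <= merged[-1].end_page + slack:
            last = merged[-1]
            merged[-1] = Boundary(last.start_page, max(last.end_page, b.end_page))
        else:
            merged.append(b)
    return merged


# Three back-to-back two-page statements, as in the Westpac/CBA/NAB test file.
statements = [Boundary(1, 2), Boundary(3, 4), Boundary(5, 6)]
buggy = consolidate(statements, merge_adjacent=True)
fixed = consolidate(statements, merge_adjacent=False)
print(len(buggy), len(fixed))
```

With the + 1, each statement's first page counts as overlapping the previous statement's last page, so all three collapse into one range covering pages 1-6. Without it, the three ranges stay separate.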

Secondary Issue: LLM Text Preparation Without Page Boundaries

  • Problem: The combined text produced by " ".join(text_chunks) carried no structural information about where one page ended and the next began
  • Impact: LLM couldn't identify page transitions between statements
  • Result: Text preparation enhanced with === PAGE N === markers so that page structure is explicit
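
A minimal sketch of this kind of text preparation is shown below. It assumes text_chunks holds one string per page in page order and that pages are numbered from 1. The function name is hypothetical, and the sample page texts are made up for illustration.

```python
def prepare_text_with_page_markers(text_chunks):
    """Join per-page text, prefixing each page with a `=== PAGE N ===` marker.

    Assumes one chunk per page, in page order (an assumption, not confirmed
    by the report).
    """
    parts = []
    for page_number, chunk in enumerate(text_chunks, start=1):
        parts.append(f"=== PAGE {page_number} ===\n{chunk.strip()}")
    return "\n\n".join(parts)


pages = ["Westpac statement text", "CBA statement text", "NAB statement text"]
prepared = prepare_text_with_page_markers(pages)
print(prepared)
```

Compared with " ".join(text_chunks), the model can now refer to page numbers directly when it reports where a statement starts and ends.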

Resolution Summary

The boundary detection issue has been successfully resolved through comprehensive improvements to the fallback processing system:

✅ Implemented Solutions

  1. Enhanced Fallback Detection (llm_analyzer.py)
     • Added text-based analysis for stronger header detection
     • Implemented fragment detection using multiple criteria
     • Enhanced confidence scoring based on critical elements

  2. Fragment Filtering (workflow.py)
     • Automatic filtering of low-confidence fragments (< 0.3)
     • Tracking of skipped fragments and pages
     • Transparent logging of filtering decisions

  3. Validation Improvements (workflow.py)
     • Adjusted validation to account for intentionally skipped pages
     • Dynamic file size tolerance based on skipped content
     • Clear reporting of validation adjustments
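
The fragment filtering and the page-count side of the validation adjustment could look roughly like the sketch below. Only the 0.3 threshold comes from this report. The DetectedSection record, the function names, and the confidence values in the example are hypothetical, and the actual code in workflow.py may be organized differently.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

FRAGMENT_CONFIDENCE_THRESHOLD = 0.3  # threshold quoted in the report


@dataclass
class DetectedSection:
    # Hypothetical record; field names are illustrative only.
    start_page: int
    end_page: int
    confidence: float


def filter_fragments(sections, threshold=FRAGMENT_CONFIDENCE_THRESHOLD):
    """Split sections into kept statements and a list of skipped page numbers."""
    kept, skipped_pages = [], []
    for section in sections:
        if section.confidence < threshold:
            pages = list(range(section.start_page, section.end_page + 1))
            skipped_pages.extend(pages)
            # Log every filtering decision so skipped content is traceable.
            logger.info(
                "Skipping fragment on pages %s (confidence %.2f < %.2f)",
                pages, section.confidence, threshold,
            )
        else:
            kept.append(section)
    return kept, skipped_pages


def expected_output_pages(total_pages, skipped_pages):
    """Page count that validation should expect once skipped pages are excluded."""
    return total_pages - len(set(skipped_pages))


# Illustrative input: a one-page fragment followed by a two-page statement.
sections = [DetectedSection(1, 1, 0.1), DetectedSection(2, 3, 0.9)]
kept, skipped = filter_fragments(sections)
print(kept, skipped, expected_output_pages(3, skipped))
```

Returning the skipped pages alongside the kept sections lets the validation step compare the output against the input minus intentionally dropped pages, rather than against the full input.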

🎯 Results

  • Before: Fragment merged with valid statement, causing incorrect boundary
  • After: Fragment automatically detected and filtered, clean statement separation
  • Accuracy: Improved boundary detection even without OpenAI API
  • Transparency: Clear logging of what content was filtered and why

Affected File

  • Output: test/output_batch_test/unknown-0267-unknown-date.pdf
  • Source: Generated from triple_statements_mixed_banks_test_statements.pdf
  • Account: NAB account 084234560267

Issue Details

What Happened

The system incorrectly merged two distinct sections:

  1. Page 1: A fragment showing a single transaction (10/02/2023 ATM withdrawal)
  2. Pages 2-3: Complete NAB statement for period Jan 16 - Feb 15, 2023

Expected Behavior

The two sections should have been detected as separate statements, or the page 1 fragment should have been excluded from the output.

Root Cause

The fallback boundary detection (pattern-based) failed to identify the boundary between:

  • The transaction fragment on page 1
  • The proper statement header on page 2

Technical Analysis

Current Fallback Logic Issues

  1. Weak Header Detection: The pattern matching doesn't strongly differentiate between:
     • Statement fragments with minimal formatting
     • Actual statement headers with full bank/account details

  2. Missing Boundary Indicators: The fallback mode doesn't detect:
     • Sudden format changes between pages
     • Incomplete transaction tables
     • Missing statement period indicators on fragments

  3. Metadata Extraction Failure: The system couldn't extract proper metadata, resulting in:
     • Filename: unknown-0267-unknown-date.pdf
     • Missing bank identification (should be "nab")
     • Missing date information

Short-term Fixes

  1. Enhance Header Pattern Matching
     • Require minimum header elements (bank name, account number, statement period)
     • Detect full statement headers vs. transaction fragments

  2. Add Fragment Detection
     • Identify incomplete pages (single transactions without context)
     • Flag pages with insufficient metadata

  3. Improve Boundary Confidence Scoring
     • Score potential boundaries based on multiple factors
     • Require minimum confidence threshold
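
One way to combine the header-element requirement with a confidence threshold is sketched below. The regular expressions, the equal weighting of the three elements, and the 0.5 threshold are all assumptions chosen for illustration. The project's real patterns and scoring are not shown in this report. The sample header text is invented around the account number and statement period mentioned above.

```python
import re

# Illustrative patterns only; not the project's actual patterns.
HEADER_PATTERNS = {
    "bank_name": re.compile(
        r"\b(westpac|commonwealth bank|cba|nab|national australia bank)\b", re.I
    ),
    "account_number": re.compile(
        r"\baccount\s*(?:number|no\.?)?\s*:?\s*\d{6,}", re.I
    ),
    "statement_period": re.compile(r"\bstatement\s+period\b", re.I),
}
MIN_BOUNDARY_CONFIDENCE = 0.5  # assumed threshold, not taken from the project


def header_confidence(page_text):
    """Fraction of required header elements found on the page (0.0 to 1.0)."""
    hits = sum(1 for pattern in HEADER_PATTERNS.values() if pattern.search(page_text))
    return hits / len(HEADER_PATTERNS)


def is_statement_start(page_text, minimum=MIN_BOUNDARY_CONFIDENCE):
    """Treat a page as a statement boundary only if enough header elements appear."""
    return header_confidence(page_text) >= minimum


header_page = (
    "NAB\n"
    "Account Number: 084234560267\n"
    "Statement Period: 16 Jan 2023 - 15 Feb 2023"
)
fragment_page = "10/02/2023 ATM withdrawal"
print(header_confidence(header_page), header_confidence(fragment_page))
```

A page that shows only a transaction line scores zero and is never treated as a statement start, while a page with bank name, account number, and statement period scores the maximum.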

Long-term Solutions

  1. Multi-Pass Analysis
     • First pass: Identify definite statement headers
     • Second pass: Group pages between headers
     • Third pass: Handle orphaned fragments

  2. Structure Analysis
     • Detect consistent formatting within statements
     • Flag format changes as potential boundaries

  3. Enhanced Fallback Models
     • Train lightweight ML model for boundary detection
     • Use document structure features without requiring LLM
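
The multi-pass analysis proposed above can be sketched as follows. This is a hypothetical illustration of the idea, not existing project code. It assumes a per-page flag saying whether the page carries a definite statement header, and it treats only pages before the first header as orphans. A fragment that follows a header would still be grouped with the preceding statement.

```python
def group_pages(is_header_flags):
    """Three-pass grouping over per-page header flags (True = definite header).

    Returns (statements, orphans) as lists of 1-indexed, inclusive
    (start_page, end_page) tuples.
    """
    total = len(is_header_flags)

    # Pass 1: pages that carry a definite statement header.
    header_pages = [i for i, flag in enumerate(is_header_flags, start=1) if flag]

    # Pass 2: each statement runs from its header to the page before the next header.
    statements = []
    for idx, start in enumerate(header_pages):
        end = header_pages[idx + 1] - 1 if idx + 1 < len(header_pages) else total
        statements.append((start, end))

    # Pass 3: pages before the first header belong to no statement.
    first = header_pages[0] if header_pages else total + 1
    orphans = [(1, first - 1)] if first > 1 else []
    return statements, orphans


# The scenario from this report: a fragment on page 1, a statement on pages 2-3.
statements, orphans = group_pages([False, True, False])
print(statements, orphans)
```

Keeping orphans in a separate list leaves the decision about them (exclude, warn, or attach to a neighbor) to a later step instead of silently merging them.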

Testing Requirements

  • Test with various fragment types
  • Test with statements having weak/minimal headers
  • Test with mixed format documents
  • Ensure improvements don't break existing working cases

Impact Assessment

  • Severity: Medium (incorrect document separation but no data loss)
  • Frequency: Occurs in fallback mode with certain document structures
  • User Impact: Incorrectly merged documents uploaded to Paperless

Next Steps

  1. Implement enhanced header pattern matching
  2. Add fragment detection logic
  3. Improve metadata extraction fallback
  4. Add specific test cases for this scenario
  5. Consider adding validation warnings for low-confidence boundaries