Fragment Detection and Quality Control¶
The Workflow Bank Statement Separator includes advanced fragment detection capabilities to ensure high-quality document separation by automatically identifying and filtering out incomplete or low-quality document sections.
What Are Fragments?¶
Fragments are incomplete document sections that may appear in multi-statement PDF files, such as:
- Single transaction records without full statement context
- Test footers or headers without actual content
- Incomplete pages with minimal banking information
- Orphaned content between proper statements
Fragment Detection Features¶
Automatic Detection¶
The system automatically identifies fragments using multiple criteria:
Content Analysis¶
- Text Length: Very short text sections (< 200 characters)
- Missing Elements: Lack of critical statement components
- Pattern Matching: Recognition of fragment-specific patterns
Critical Element Validation¶
For each section, the system checks for:
- Bank Name: Major bank identifiers (Westpac, ANZ, NAB, etc.)
- Account Information: Account numbers, BSB codes, account types
- Period Information: Statement dates, period ranges
Confidence Scoring¶
- High Confidence (> 0.7): Complete statements with all elements
- Medium Confidence (0.3-0.7): Acceptable statements with some missing info
- Low Confidence (< 0.3): Fragments automatically filtered out
How Fragment Filtering Works¶
Detection Process¶
flowchart TD
A[Document Section] --> B{Text Length > 200?}
B -->|No| F[Fragment Detected]
B -->|Yes| C{Pattern Matching}
C -->|Fragment Patterns| F
C -->|Valid Patterns| D{Critical Elements}
D -->|< 2 Elements| F
D -->|≥ 2 Elements| E[Valid Statement]
F --> G[Confidence = 0.2]
E --> H[Calculate Confidence]
G --> I{Confidence < 0.3?}
H --> I
I -->|Yes| J[Skip in PDF Generation]
I -->|No| K[Include in Output]
classDef fragmentStyle fill:#ffebee,stroke:#c62828,stroke-width:2px
classDef validStyle fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
classDef decisionStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
class F,G,J fragmentStyle
class E,K validStyle
class B,C,D,I decisionStyle
Filtering Logic¶
The system applies the following filtering rules:
- Pre-Processing: Analyze each detected statement boundary
- Confidence Assessment: Calculate confidence based on multiple factors
- Quality Gate: Skip sections with confidence < 0.3
- Validation Adjustment: Account for skipped pages in validation
Fragment Pattern Examples¶
Common Fragment Patterns¶
10/02/2023 ATM WITHDRAWAL Livingston-Miller -$32.00 $25,397.00
This statement was generated for testing purposes on 31/08/2025
Detection: Single transaction without statement context
This statement was generated for testing purposes on 31/08/2025
Detection: Test text without banking content
Page 3
Detection: Minimal content, no banking information
Valid Statement Headers¶
NAB Banking Corporation
Statement Period: 16 January 2023 to 15 February 2023
Account Type: Classic Banking Account
Account Number: 084234560267
BSB: 084-419
Detection: Full statement header with all elements
Configuration Options¶
Confidence Threshold¶
You can adjust the fragment filtering threshold through environment variables:
# Default: 0.3 (30%)
FRAGMENT_CONFIDENCE_THRESHOLD=0.3
# More aggressive filtering (fewer fragments allowed)
FRAGMENT_CONFIDENCE_THRESHOLD=0.5
# Less aggressive filtering (more fragments allowed)
FRAGMENT_CONFIDENCE_THRESHOLD=0.1
Fragment Detection Sensitivity¶
# Enable/disable fragment detection
ENABLE_FRAGMENT_DETECTION=true
# Minimum text length for valid statements
MIN_STATEMENT_TEXT_LENGTH=200
# Required critical elements count
MIN_CRITICAL_ELEMENTS=2
Logging and Monitoring¶
Fragment Detection Logs¶
The system logs detailed information about fragment detection:
2025-08-31 19:42:12 - WARNING - Detected fragment page at 4-6
2025-08-31 19:42:12 - WARNING - Skipping fragment with confidence 0.2: pages 4-6 (3 pages)
Processing Summary¶
The output includes fragment information:
Validation Reporting¶
Validation accounts for skipped fragments:
Best Practices¶
When Using Fragment Detection¶
- Monitor Logs: Review fragment detection logs to ensure valid content isn't being filtered
- Adjust Thresholds: Fine-tune confidence thresholds for your document types
- Validate Results: Check that important content isn't incorrectly classified as fragments
Troubleshooting¶
Valid Content Being Filtered¶
If valid statements are being classified as fragments:
- Check Patterns: Ensure your statements contain standard banking headers
- Lower Threshold: Reduce the confidence threshold temporarily
- Review Logs: Check what elements are missing from filtered content
Fragments Not Being Filtered¶
If fragments are still appearing in output:
- Increase Threshold: Raise the confidence threshold
- Add Patterns: Contribute additional fragment patterns
- Manual Review: Use dry-run mode to preview detection results
API Integration¶
Confidence Scoring¶
When using the API, confidence scores are available in results:
{
"statements": [
{
"pages": "1-3",
"confidence": 0.85,
"status": "included"
},
{
"pages": "4-6",
"confidence": 0.15,
"status": "filtered_fragment"
}
]
}
Fragment Reporting¶
Detailed fragment information is available:
{
"processing_summary": {
"total_sections": 2,
"valid_statements": 1,
"filtered_fragments": 1,
"skipped_pages": 3
}
}
Impact on Accuracy¶
Fragment detection significantly improves output quality:
- Reduced Noise: Eliminates incomplete document sections
- Cleaner Separation: Prevents mixing of fragments with valid statements
- Better Metadata: Focuses extraction on complete statements only
- Improved Validation: Accounts for intentionally skipped content
The fragment detection feature ensures that only high-quality, complete bank statements are included in the final output, improving the overall reliability and usability of the separated documents.