# LLM Model Testing Results
## Overview
This document contains comprehensive testing results for various LLM providers and models used with the Workflow Bank Statement Separator. All tests were conducted using the same 12-page Westpac bank statement document containing multiple statements to ensure consistent comparison.
## Test Methodology
- Test Document: `westpac_12_page_test.pdf` (12 pages, 2,691 words)
- Expected Output: 3 separate bank statements
- Metrics Measured: Processing time, statement detection accuracy, metadata extraction quality, filename generation compliance
- Test Environment: Ollama server at 10.0.0.150:11434, OpenAI GPT-4o-mini
- Validation: All outputs validated for page count, file integrity, and PRD compliance (a timing sketch follows this list)
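The project's own harness drives the full separation workflow; the sketch below only shows how a single timed request to the Ollama endpoint used in these tests can be reproduced. The prompt, the text-extraction step, and the model list are illustrative assumptions, and the reported processing times cover the whole workflow, not just this one call.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://10.0.0.150:11434/api/generate"  # test server from this report


def timed_generate(model: str, prompt: str) -> tuple[float, str]:
    """Send one non-streaming generation request and return (seconds, raw response)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return time.perf_counter() - start, body.get("response", "")


# Hypothetical usage: time each candidate model on the same extracted document text.
# document_text = extract_text("westpac_12_page_test.pdf")  # extraction step not shown
# for model in ["gemma2:9b", "mistral:instruct", "qwen2.5:latest"]:
#     elapsed, _ = timed_generate(model, f"Identify statement boundaries:\n{document_text}")
#     print(f"{model}: {elapsed:.2f}s")
```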
## OpenAI Results
### GPT-4o-mini - Gold Standard ⭐⭐⭐⭐⭐
- Processing Time: 10.85 seconds
- Statements Detected: 3 (perfect segmentation)
- Status: ✅ Success
- Quality: Highest accuracy with complete metadata extraction
- Output Files:
    - `westpac-2819-2015-05-21.pdf` (1.3MB, pages 1-5)
    - `westpac-8782-2015-05-21.pdf` (651KB, pages 6-7)
    - `westpac-5261-2023-05-06.pdf` (1.9MB, pages 8-12)
Key Strengths:
- Perfect boundary detection
- Complete metadata extraction (bank, account, dates)
- PRD-compliant filename format
- Fast processing with high reliability
## Ollama Model Results
### Top Tier (6-9 seconds) ⭐⭐⭐⭐⭐
#### 1. Gemma2:9B - Best Overall Ollama Model
- Processing Time: 6.65 seconds ⚡ (fastest)
- Statements: 2 (under-segmentation but high quality)
- Quality: Excellent JSON responses, accurate metadata
- Files: `westpac-2819-2015-05-21.pdf`, `westpac-5602-2015-05-21.pdf`
#### 2. Mistral:Instruct - Best Segmentation Match
- Processing Time: 7.63 seconds
- Statements: 3 (matches OpenAI exactly)
- Quality: Correct boundaries, good account extraction
- Files: `westpac-2819-unknown-date.pdf`, `westpac-5261-unknown-date.pdf`, `westpac-1039-unknown-date.pdf`
#### 3. Qwen2.5:latest - Most Granular Analysis
- Processing Time: 8.53 seconds
- Statements: 4 (most detailed segmentation)
- Quality: Multiple date extractions, clean responses
- Files: 4 separate statement files with varying metadata quality
#### 4. Qwen2.5-Coder:latest - Code-Optimized Excellence
- Processing Time: 8.59 seconds
- Statements: 3 (perfect OpenAI match)
- Quality: Excellent segmentation and metadata
- Files: `westpac-2819-2015-05-21.pdf`, `westpac-8782-2015-05-21.pdf`, `businessch-0000-unknown-date-p9.pdf`
#### 5. OpenHermes:latest - Smart Quality Control
- Processing Time: 8.66 seconds
- Statements: 3 (4 detected, 1 filtered for low confidence)
- Quality: Intelligent confidence-based filtering
- Files: High-quality outputs with automatic quality control (a sketch of the confidence filtering follows)
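The confidence filtering is a behaviour of the separator's quality control rather than of the model itself. A minimal sketch of the idea, assuming hypothetical field names and a hypothetical 0.7 threshold (neither is confirmed by the project), follows:

```python
from dataclasses import dataclass


@dataclass
class DetectedStatement:
    start_page: int
    end_page: int
    confidence: float  # hypothetical 0-1 score attached to each detected statement


def filter_low_confidence(
    statements: list[DetectedStatement], threshold: float = 0.7
) -> list[DetectedStatement]:
    """Keep only detected statements whose confidence meets the threshold."""
    return [s for s in statements if s.confidence >= threshold]
```

With OpenHermes, four candidate statements were detected and one was dropped, which is the behaviour this kind of thresholding produces.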
#### 6. DeepSeek-Coder-v2:latest - Major Improvement
- Processing Time: 9.33 seconds (retest - 16x faster than original!)
- Statements: 2
- Quality: Dramatic speed improvement, good metadata
- Files: `westpac-2819-unknown-date.pdf`, `unknown-8782-2015-05-21.pdf`
### Mid Tier (10-20 seconds) ⭐⭐⭐⭐
#### 7. Llama3.1:latest - Speed Improvement
- Processing Time: 11.10 seconds
- Statements: 2
- Quality: Much faster than Llama3.2, some JSON issues
- Files: `westpac-2819-2015-05-21.pdf`, `unknown-0000-unknown-date-p9.pdf`
#### 8. DeepSeek-r1:latest - Solid Performer
- Processing Time: 16.50 seconds
- Statements: 2
- Quality: Good date extraction and metadata
- Files: `westpac-1831-2015-05-21.pdf`, `westpac-8782-2015-05-21.pdf`
#### 9. DeepSeek-r1:8b - Under-segmentation Issues
- Processing Time: 18.17 seconds
- Statements: 1 (treated entire document as single statement)
- Quality: Hallucination warnings, poor segmentation
- Files: Single 3.9MB file `westpac-2819-2015-05-21.pdf`
#### 10. Phi4:latest - Microsoft's Latest
- Processing Time: 20.08 seconds
- Statements: 3 (correct segmentation)
- Quality: Good metadata extraction, reliable
- Files: `westpac-2819-2015-05-21.pdf`, `westpac-8782-2015-05-21.pdf`, `westpac-0000-unknown-date-p8.pdf`
### Lower Tier (30+ seconds) ⭐⭐⭐
#### 11. Qwen3:latest - Slower Generation
- Processing Time: 30.90 seconds
- Statements: 2
- Quality: JSON parsing issues but functional
- Files: `westpac-2819-unknown-date.pdf`, `unknown-0000-unknown-date-p5.pdf`
### Poor Performance ⭐⭐ / ❌
#### Llama3.2:latest - Significant Issues
- Processing Time: 205.42 seconds (very slow)
- Statements: 3 (with major JSON parsing failures)
- Quality: Extensive metadata extraction failures
- Issues: Hallucination warnings, response formatting problems
#### Phi3 Models - Critical Failures
- Phi3:medium: Complete LLM breakdown, garbled responses
- Phi3:14b: Validation failures, missing pages (9 vs 12 expected)
- Status: ❌ Unsuitable for production use
## Fallback Pattern Matching Results
### Pattern-Only Processing ⭐⭐
- Processing Time: ~1 second (fastest)
- Statements: 9 (over-segmentation)
- Status: ❌ Failed validation (14 output pages vs 12 expected)
- Quality: No metadata extraction, over-aggressive splitting
- Use Case: Emergency fallback only (a sketch of this style of pattern matching follows)
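The fallback path skips the LLM and splits on textual cues alone. The patterns below are illustrative assumptions rather than the separator's actual rules, but they show why pattern-only splitting tends to over-segment: every page that happens to contain a header-like phrase starts a new statement.

```python
import re

# Hypothetical header cues; the separator's real fallback patterns are not shown here.
BOUNDARY_PATTERNS = [
    re.compile(r"statement\s+period", re.IGNORECASE),
    re.compile(r"opening\s+balance", re.IGNORECASE),
]


def pattern_boundaries(pages: list[str]) -> list[int]:
    """Return indices of pages that look like the first page of a statement."""
    starts = [
        index
        for index, text in enumerate(pages)
        if any(pattern.search(text) for pattern in BOUNDARY_PATTERNS)
    ]
    return starts or [0]  # degrade to a single statement if nothing matches
```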
## Performance Summary
### Speed Rankings
- Gemma2:9B - 6.65s
- Mistral:Instruct - 7.63s
- Qwen2.5:latest - 8.53s
- Qwen2.5-Coder - 8.59s
- OpenHermes - 8.66s
- DeepSeek-Coder-v2 - 9.33s
- OpenAI GPT-4o-mini - 10.85s
### Accuracy Rankings (Statement Segmentation)
- OpenAI GPT-4o-mini - 3/3 perfect
- Mistral:Instruct - 3/3 perfect match
- Qwen2.5-Coder - 3/3 perfect match
- Phi4:latest - 3/3 correct
- OpenHermes - 3/4 (smart filtering)
### Metadata Quality Rankings
- OpenAI GPT-4o-mini - Complete extraction
- Gemma2:9B - Excellent dates/accounts
- Qwen2.5 variants - Very good extraction
- DeepSeek-r1:latest - Good extraction
- Mistral:Instruct - Good accounts, missing dates
## Key Findings
### OpenAI Dominance
- GPT-4o-mini remains the gold standard for accuracy and completeness
- Consistent performance with comprehensive metadata extraction
- Best choice for production deployments requiring maximum accuracy
### Ollama Top Performers
- Gemma2:9B: Fastest Ollama model with excellent quality
- Mistral:Instruct: Best segmentation accuracy matching OpenAI
- Qwen2.5-Coder: Perfect for code-focused document processing
- OpenHermes: Best for quality control with confidence filtering
### Significant Performance Variations
- Roughly 30x speed difference between the fastest (Gemma2, 6.65s) and slowest (Llama3.2, 205.42s) Ollama models
- DeepSeek-Coder-v2 showed massive improvement on retest (151s → 9s)
- Model size doesn't guarantee performance (Phi3:14b worse than smaller models)
### JSON Processing Issues
- Most Ollama models suffer from JSON parsing issues due to:
    - Comments in JSON responses
    - Verbose explanatory text
    - Inconsistent response formatting
- Gemma2 and Qwen2.5 variants handle JSON responses cleanly (a defensive parsing sketch follows this list)
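A common mitigation for these failure modes is to recover the first JSON object from the raw model output before parsing, stripping code-fence markers, surrounding prose, and line comments. The sketch below illustrates that approach; it is not the project's actual parser.

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from a chatty LLM response."""
    # Drop Markdown code-fence markers that some models wrap around the JSON.
    cleaned = re.sub(r"```(?:json)?", "", raw)
    # Keep only the span from the first '{' to the last '}' to shed surrounding prose.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model response")
    candidate = cleaned[start : end + 1]
    # Remove '//' line comments, which some models insert despite instructions.
    candidate = re.sub(r"^\s*//.*$", "", candidate, flags=re.MULTILINE)
    return json.loads(candidate)
```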
### Filename Generation Consistency
- All successful models generate PRD-compliant filenames
- Format: `<bank>-<last4digits>-<statement_date>.pdf` (a naming sketch follows this list)
- Consistent behavior for paperless integration across all providers
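Given the shared format, the naming step reduces to filling three fields and falling back to placeholders when extraction misses one, which is consistent with the `unknown-...` names seen in the results above. A minimal sketch follows; the exact fallback rules and the page-number suffixes such as `-p9` in some outputs are not modelled here.

```python
def build_filename(bank: str | None, account: str | None, statement_date: str | None) -> str:
    """Compose a PRD-style filename, degrading to placeholders for missing metadata."""
    bank_part = (bank or "unknown").lower()
    account_part = (account or "0000")[-4:]        # last four digits of the account number
    date_part = statement_date or "unknown-date"   # expected as YYYY-MM-DD
    return f"{bank_part}-{account_part}-{date_part}.pdf"


# build_filename("Westpac", "123456782819", "2015-05-21") -> "westpac-2819-2015-05-21.pdf"
# build_filename("Westpac", None, None)                   -> "westpac-0000-unknown-date.pdf"
```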
## Recommendations
### Production Deployments
- Primary: OpenAI GPT-4o-mini for maximum accuracy
- Offline/Privacy: Gemma2:9B for best local performance
- Code Processing: Qwen2.5-Coder for structured document analysis
- Quality Control: OpenHermes for confidence-filtered outputs
### Development/Testing
- Fast Iteration: Gemma2:9B for quick testing cycles
- Segmentation Testing: Mistral:Instruct for boundary validation
- Metadata Testing: Qwen2.5:latest for comprehensive extraction
### Avoid in Production
- Llama3.2: Too slow with parsing issues
- Phi3 variants: Critical reliability failures
- Pattern-only fallback: Over-segmentation issues