# Model Comparison Tables

| Rank | Model | Provider | Time (s) | Statements | Quality | Status |
|------|-------|----------|----------|------------|---------|--------|
| 1 | Gemma2:9B | Ollama | 6.65 | 2 | ⭐⭐⭐⭐⭐ | ✅ |
| 2 | Mistral:Instruct | Ollama | 7.63 | 3 | ⭐⭐⭐⭐⭐ | ✅ |
| 3 | Qwen2.5:latest | Ollama | 8.53 | 4 | ⭐⭐⭐⭐⭐ | ✅ |
| 4 | Qwen2.5-Coder | Ollama | 8.59 | 3 | ⭐⭐⭐⭐⭐ | ✅ |
| 5 | OpenHermes | Ollama | 8.66 | 3 | ⭐⭐⭐⭐ | ✅ |
| 6 | DeepSeek-Coder-v2 | Ollama | 9.33 | 2 | ⭐⭐⭐⭐⭐ | ✅ |
| 7 | GPT-4o-mini | OpenAI | 10.85 | 3 | ⭐⭐⭐⭐⭐ | ✅ |
| 8 | Llama3.1 | Ollama | 11.10 | 2 | ⭐⭐⭐ | ✅ |
| 9 | DeepSeek-r1:latest | Ollama | 16.50 | 2 | ⭐⭐⭐⭐ | ✅ |
| 10 | DeepSeek-r1:8b | Ollama | 18.17 | 1 | ⭐⭐ | ⚠️ |
| 11 | Phi4:latest | Ollama | 20.08 | 3 | ⭐⭐⭐⭐ | ✅ |
| 12 | Qwen3:latest | Ollama | 30.90 | 2 | ⭐⭐⭐ | ✅ |
| 13 | Llama3.2 | Ollama | 205.42 | 3 | ⭐⭐ | ⚠️ |
| - | Phi3:medium | Ollama | - | 7 | ⭐ | ❌ |
| - | Phi3:14b | Ollama | - | 3 | ⭐ | ❌ |
| - | Pattern Fallback | Local | 1.0 | 9 | ⭐⭐ | ❌ |
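To reproduce timings like the ones above against local Ollama models, the sketch below shows one way to time a single segmentation request per model. It assumes Ollama's HTTP API on `http://localhost:11434` and uses an illustrative prompt and statement-counting heuristic, not the benchmark's actual harness.

```python
import json
import time

import requests

# Hypothetical benchmark helper: time one segmentation request per model
# against a local Ollama server (default endpoint http://localhost:11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

PROMPT_TEMPLATE = (
    "Split the following combined bank-statement text into individual "
    "statements and return a JSON array with one object per statement "
    "(bank_name, account_number, start_date, end_date):\n\n{text}"
)

def time_segmentation(model: str, document_text: str) -> tuple[float, int]:
    """Return (elapsed seconds, number of statements the model reported)."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": PROMPT_TEMPLATE.format(text=document_text),
            "stream": False,   # single response, easier to time
            "format": "json",  # ask Ollama to constrain output to JSON
        },
        timeout=300,
    )
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    try:
        statements = json.loads(resp.json()["response"])
        count = len(statements) if isinstance(statements, list) else 1
    except (json.JSONDecodeError, KeyError):
        count = 0  # counts as a JSON-parsing failure
    return elapsed, count

# Example: compare the two fastest local models on one document.
# for model in ("gemma2:9b", "mistral:instruct"):
#     print(model, time_segmentation(model, open("combined.txt").read()))
```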
## Detailed Comparison by Provider

### OpenAI Models

| Model | Time (s) | Accuracy | Metadata Quality | Cost | Recommendation |
|-------|----------|----------|------------------|------|----------------|
| GPT-4o-mini | 10.85 | Perfect (3/3) | Complete | Medium | ✅ Production |

Notes: Gold standard for accuracy and completeness. Best choice when maximum precision is required.
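For reference, a minimal sketch of a GPT-4o-mini extraction call using the official `openai` Python SDK is shown below; the prompt and output fields are illustrative assumptions, not the exact ones used in this comparison.

```python
from openai import OpenAI

# Minimal sketch of a GPT-4o-mini metadata-extraction call.
# The prompt and output schema are illustrative, not the benchmark's own.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_metadata(statement_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic output helps JSON parsing
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract bank_name, account_number, statement_start_date, "
                    "and statement_end_date from the bank statement. "
                    "Respond with a single JSON object."
                ),
            },
            {"role": "user", "content": statement_text},
        ],
    )
    return response.choices[0].message.content
```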
### Ollama Models - Top Tier (< 10 seconds)

| Model | Time (s) | Statements | Date Extract | Account Extract | JSON Quality | Use Case |
|-------|----------|------------|--------------|-----------------|--------------|----------|
| Gemma2:9B | 6.65 | 2 | ✅ Excellent | ✅ Complete | ✅ Clean | Speed priority |
| Mistral:Instruct | 7.63 | 3 | ❌ Missing | ✅ Complete | ⚠️ Some issues | Segmentation accuracy |
| Qwen2.5:latest | 8.53 | 4 | ✅ Multiple | ✅ Complete | ✅ Clean | Granular analysis |
| Qwen2.5-Coder | 8.59 | 3 | ✅ Excellent | ✅ Complete | ✅ Clean | Code processing |
| OpenHermes | 8.66 | 3 | ✅ Good | ✅ Complete | ✅ Clean | Quality control |
| DeepSeek-Coder-v2 | 9.33 | 2 | ⚠️ Partial | ✅ Complete | ✅ Clean | Development |
### Ollama Models - Mid Tier (10-30 seconds)

| Model | Time (s) | Issues | Strengths | Recommendation |
|-------|----------|--------|-----------|----------------|
| Llama3.1 | 11.10 | JSON parsing | Faster than Llama3.2 | ⚠️ Limited use |
| DeepSeek-r1:latest | 16.50 | None major | Good metadata | ✅ Acceptable |
| DeepSeek-r1:8b | 18.17 | Under-segmentation | - | ❌ Avoid |
| Phi4:latest | 20.08 | Slower | Reliable | ⚠️ Limited use |
| Qwen3:latest | 30.90 | JSON issues | Functional | ❌ Avoid |
### Ollama Models - Bottom Tier (very slow or broken)

| Model | Time (s) | Primary Issues | Status |
|-------|----------|----------------|--------|
| Llama3.2 | 205.42 | Very slow, JSON failures | ❌ Avoid |
| Phi3:medium | - | Garbled output, fallback | ❌ Broken |
| Phi3:14b | - | Missing pages, validation failure | ❌ Broken |
## Feature Comparison Matrix

| Model | Bank Name | Account Number | Statement Dates | Customer Info | Confidence |
|-------|-----------|----------------|-----------------|---------------|------------|
| GPT-4o-mini | ✅ Complete | ✅ Full digits | ✅ Perfect dates | ⚠️ Limited | High |
| Gemma2:9B | ✅ Complete | ✅ Last 4 digits | ✅ Perfect dates | ❌ None | High |
| Mistral:Instruct | ✅ Complete | ✅ Full numbers | ❌ Missing dates | ❌ None | Medium |
| Qwen2.5-Coder | ✅ Complete | ✅ Full numbers | ✅ Perfect dates | ❌ None | High |
| OpenHermes | ✅ Complete | ✅ Full numbers | ✅ Good dates | ❌ None | High |
| DeepSeek-Coder-v2 | ⚠️ Partial | ✅ Full numbers | ⚠️ Some dates | ❌ None | Medium |
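The columns above map naturally onto a per-statement metadata record. The dataclass below is only an illustrative sketch; the field names are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

# Illustrative record for the fields scored in the matrix above.
# Field names are assumptions for this sketch, not the real schema.
@dataclass
class StatementMetadata:
    bank_name: str | None = None              # "Bank Name" column
    account_number: str | None = None         # full number or last 4 digits
    statement_start_date: str | None = None   # "Statement Dates" column
    statement_end_date: str | None = None
    customer_name: str | None = None          # "Customer Info" column
    confidence: float = 0.0                   # model-reported confidence (0-1)
```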
## Document Segmentation Accuracy

| Model | Expected (3) | Detected | Accuracy | Notes |
|-------|--------------|----------|----------|-------|
| GPT-4o-mini | 3 | 3 | 100% | Perfect boundaries |
| Mistral:Instruct | 3 | 3 | 100% | Exact match |
| Qwen2.5-Coder | 3 | 3 | 100% | Exact match |
| Phi4:latest | 3 | 3 | 100% | Exact match |
| OpenHermes | 3 | 3 (4-1) | 100% | Smart filtering |
| Qwen2.5:latest | 3 | 4 | 75% | Over-segmentation |
| Gemma2:9B | 3 | 2 | 67% | Under-segmentation |
| DeepSeek models | 3 | 1-2 | 33-67% | Various issues |
| Llama models | 3 | 2-3 | 67-100% | With JSON issues |
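The accuracy percentages above are consistent with scoring the smaller of the expected/detected counts against the larger (e.g. 4 detected vs 3 expected gives 75%). The helper below reproduces those figures under that assumption; the benchmark's exact scoring rule is not stated here.

```python
def segmentation_accuracy(expected: int, detected: int) -> float:
    """Penalize over- and under-segmentation symmetrically.

    Reproduces the percentages in the table above, but the exact scoring
    rule used by the benchmark is an assumption in this sketch.
    """
    if expected == 0 or detected == 0:
        return 0.0
    return min(expected, detected) / max(expected, detected)

assert round(segmentation_accuracy(3, 3), 2) == 1.00  # exact match
assert round(segmentation_accuracy(3, 4), 2) == 0.75  # over-segmentation
assert round(segmentation_accuracy(3, 2), 2) == 0.67  # under-segmentation
```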
## Processing Speed Comparison

### Speed Categories

| Category | Time Range | Models | Use Cases |
|----------|------------|--------|-----------|
| Ultra Fast | < 7s | Gemma2:9B | Real-time processing |
| Fast | 7-9s | Mistral, Qwen2.5 variants, OpenHermes | Production workflows |
| Moderate | 9-15s | DeepSeek-Coder-v2, GPT-4o-mini, Llama3.1 | Standard processing |
| Slow | 15-25s | DeepSeek-r1 variants, Phi4 | Batch processing |
| Very Slow | > 30s | Qwen3, Llama3.2 | Background tasks only |
## Resource Requirements

### Model Sizes and Memory Usage

| Model | Size (GB) | Memory Req | GPU Req | CPU Performance |
|-------|-----------|------------|---------|-----------------|
| Gemma2:9B | 5.4 | 8GB+ | Recommended | Good |
| Mistral:Instruct | 4.1 | 6GB+ | Recommended | Good |
| Qwen2.5:latest | 4.7 | 6GB+ | Recommended | Good |
| Qwen2.5-Coder | 4.7 | 6GB+ | Recommended | Good |
| OpenHermes | 4.1 | 6GB+ | Recommended | Good |
| DeepSeek-Coder-v2 | 8.9 | 12GB+ | Required | Poor |
| Llama3.1 | 4.7 | 6GB+ | Recommended | Good |
| Phi4 | 9.1 | 12GB+ | Required | Moderate |
| Qwen3 | 5.2 | 8GB+ | Recommended | Poor |
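As a practical guard when provisioning hosts, a small helper can compare the memory requirements above against available RAM before pulling a model. The sketch below assumes the `psutil` package and the `ollama` CLI; the thresholds are copied from the table and are approximate.

```python
import shutil
import subprocess

import psutil

# Memory requirements from the table above (GiB); illustrative helper only.
MEMORY_REQ_GIB = {
    "gemma2:9b": 8,
    "mistral:instruct": 6,
    "qwen2.5:latest": 6,
    "qwen2.5-coder": 6,
    "openhermes": 6,
    "deepseek-coder-v2": 12,
    "llama3.1": 6,
    "phi4": 12,
    "qwen3": 8,
}

def pull_if_it_fits(model: str) -> bool:
    """Pull an Ollama model only if the host has enough free memory."""
    required = MEMORY_REQ_GIB.get(model, 8) * 1024**3
    if psutil.virtual_memory().available < required:
        print(f"Skipping {model}: needs ~{required // 1024**3} GiB free")
        return False
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama CLI not found on PATH")
    subprocess.run(["ollama", "pull", model], check=True)
    return True
```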
## Quality Scores Breakdown

### Overall Quality Rating System

- ⭐⭐⭐⭐⭐ Excellent: Perfect/near-perfect accuracy, fast processing
- ⭐⭐⭐⭐ Very Good: Minor issues, reliable performance
- ⭐⭐⭐ Good: Some issues but usable
- ⭐⭐ Poor: Major issues, limited use
- ⭐ Broken: Unsuitable for production

### Quality Factor Weights

| Factor | Weight | Description |
|--------|--------|-------------|
| Segmentation Accuracy | 40% | Correct statement boundary detection |
| Metadata Extraction | 25% | Bank name, account, date extraction |
| Processing Speed | 20% | Time to complete processing |
| Response Quality | 10% | JSON formatting, parsing success |
| Reliability | 5% | Consistent performance, error rates |
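Under these weights, an overall score is a weighted sum of normalized per-factor scores. The sketch below illustrates the arithmetic; how each factor is normalized to [0, 1] is an assumption here, since the rubric does not specify it.

```python
# Weighted overall score from the factor table above. Per-factor scores are
# assumed to already be normalized to [0, 1]; that normalization is not part
# of the rubric shown here.
QUALITY_WEIGHTS = {
    "segmentation_accuracy": 0.40,
    "metadata_extraction": 0.25,
    "processing_speed": 0.20,
    "response_quality": 0.10,
    "reliability": 0.05,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted sum of per-factor scores in [0, 1]."""
    return sum(QUALITY_WEIGHTS[factor] * scores.get(factor, 0.0)
               for factor in QUALITY_WEIGHTS)

# Example: perfect segmentation, strong metadata, mediocre speed.
print(overall_quality({
    "segmentation_accuracy": 1.0,
    "metadata_extraction": 0.9,
    "processing_speed": 0.5,
    "response_quality": 1.0,
    "reliability": 1.0,
}))  # ≈ 0.875
```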
## Use Case Recommendations

### High-Volume Production

- Gemma2:9B - Best speed/quality balance
- Mistral:Instruct - Best segmentation accuracy
- GPT-4o-mini - Maximum accuracy required

### Development/Testing

- Qwen2.5-Coder - Code-focused processing
- OpenHermes - Quality control testing
- DeepSeek-Coder-v2 - Development iteration

### Offline/Privacy-First

- Gemma2:9B - Best local performance
- Qwen2.5 variants - Feature-complete local
- Mistral:Instruct - Segmentation priority

### Budget-Conscious

- OpenAI GPT-4o-mini - Best accuracy per dollar
- Self-hosted Gemma2:9B - Zero marginal cost
- Mistral:Instruct - Good local alternative

### Experimental/Research

- Qwen2.5:latest - Most granular analysis
- OpenHermes - Confidence scoring research
- DeepSeek-r1:latest - Reasoning model testing