Model Comparison Tables

Performance Overview

| Rank | Model | Provider | Time (s) | Statements | Quality | Status |
|------|-------|----------|----------|------------|---------|--------|
| 1 | Gemma2:9B | Ollama | 6.65 | 2 | ⭐⭐⭐⭐⭐ | |
| 2 | Mistral:Instruct | Ollama | 7.63 | 3 | ⭐⭐⭐⭐⭐ | |
| 3 | Qwen2.5:latest | Ollama | 8.53 | 4 | ⭐⭐⭐⭐⭐ | |
| 4 | Qwen2.5-Coder | Ollama | 8.59 | 3 | ⭐⭐⭐⭐⭐ | |
| 5 | OpenHermes | Ollama | 8.66 | 3 | ⭐⭐⭐⭐ | |
| 6 | DeepSeek-Coder-v2 | Ollama | 9.33 | 2 | ⭐⭐⭐⭐⭐ | |
| 7 | GPT-4o-mini | OpenAI | 10.85 | 3 | ⭐⭐⭐⭐⭐ | |
| 8 | Llama3.1 | Ollama | 11.10 | 2 | ⭐⭐⭐ | |
| 9 | DeepSeek-r1:latest | Ollama | 16.50 | 2 | ⭐⭐⭐⭐ | |
| 10 | DeepSeek-r1:8b | Ollama | 18.17 | 1 | ⭐⭐ | ⚠️ |
| 11 | Phi4:latest | Ollama | 20.08 | 3 | ⭐⭐⭐⭐ | |
| 12 | Qwen3:latest | Ollama | 30.90 | 2 | ⭐⭐⭐ | |
| 13 | Llama3.2 | Ollama | 205.42 | 3 | ⭐⭐ | ⚠️ |
| - | Phi3:medium | Ollama | - | 7 | | |
| - | Phi3:14b | Ollama | - | 3 | | |
| - | Pattern Fallback | Local | 1.0 | 9 | ⭐⭐ | |
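
Times in the table are per-model wall-clock measurements for processing the same combined document. As a rough illustration of how such a number can be produced, the sketch below times a single non-streaming generation against a local Ollama server's default `/api/generate` endpoint; the prompt and model list are placeholders, not the actual benchmark harness.

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def time_model(model: str, prompt: str) -> float:
    """Wall-clock seconds for one non-streaming generation (illustrative only)."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

for name in ["gemma2:9b", "mistral:instruct", "qwen2.5:latest"]:
    print(f"{name}: {time_model(name, 'Segment this combined statement...'):.2f}s")
```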

Detailed Comparison by Provider

OpenAI Models

| Model | Time (s) | Accuracy | Metadata Quality | Cost | Recommendation |
|-------|----------|----------|------------------|------|----------------|
| GPT-4o-mini | 10.85 | Perfect (3/3) | Complete | Medium | ✅ Production |

Notes: Gold standard for accuracy and completeness. Best choice when maximum precision is required.
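
For reference, the kind of request this row describes is a Chat Completions call to `gpt-4o-mini` with JSON output requested. The prompt below is an illustrative placeholder, not the pipeline's actual prompt or schema.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative request only; the real pipeline's prompt and output schema are not shown here.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for parseable JSON
    messages=[
        {"role": "system", "content": "Split the combined PDF text into individual "
                                      "bank statements and return JSON."},
        {"role": "user", "content": "<combined statement text>"},
    ],
)
print(response.choices[0].message.content)
```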

Ollama Models - Top Tier (< 10 seconds)

| Model | Time (s) | Statements | Date Extract | Account Extract | JSON Quality | Use Case |
|-------|----------|------------|--------------|-----------------|--------------|----------|
| Gemma2:9B | 6.65 | 2 | ✅ Excellent | ✅ Complete | ✅ Clean | Speed priority |
| Mistral:Instruct | 7.63 | 3 | ❌ Missing | ✅ Complete | ⚠️ Some issues | Segmentation accuracy |
| Qwen2.5:latest | 8.53 | 4 | ✅ Multiple | ✅ Complete | ✅ Clean | Granular analysis |
| Qwen2.5-Coder | 8.59 | 3 | ✅ Excellent | ✅ Complete | ✅ Clean | Code processing |
| OpenHermes | 8.66 | 3 | ✅ Good | ✅ Complete | ✅ Clean | Quality control |
| DeepSeek-Coder-v2 | 9.33 | 2 | ⚠️ Partial | ✅ Complete | ✅ Clean | Development |

Ollama Models - Mid Tier (10-30 seconds)

| Model | Time (s) | Issues | Strengths | Recommendation |
|-------|----------|--------|-----------|----------------|
| Llama3.1 | 11.10 | JSON parsing | Faster than Llama3.2 | ⚠️ Limited use |
| DeepSeek-r1:latest | 16.50 | No major issues | Good metadata | ✅ Acceptable |
| DeepSeek-r1:8b | 18.17 | Under-segmentation | - | ❌ Avoid |
| Phi4:latest | 20.08 | Slower | Reliable | ⚠️ Limited use |
| Qwen3:latest | 30.90 | JSON issues | Functional | ❌ Avoid |

Ollama Models - Poor Performance (> 30 seconds / Failed)

| Model | Time (s) | Primary Issues | Status |
|-------|----------|----------------|--------|
| Llama3.2 | 205.42 | Very slow, JSON failures | ❌ Avoid |
| Phi3:medium | - | Garbled output, fallback | ❌ Broken |
| Phi3:14b | - | Missing pages, validation failure | ❌ Broken |

Feature Comparison Matrix

Metadata Extraction Capabilities

| Model | Bank Name | Account Number | Statement Dates | Customer Info | Confidence |
|-------|-----------|----------------|-----------------|---------------|------------|
| GPT-4o-mini | ✅ Complete | ✅ Full digits | ✅ Perfect dates | ⚠️ Limited | High |
| Gemma2:9B | ✅ Complete | ✅ Last 4 digits | ✅ Perfect dates | ❌ None | High |
| Mistral:Instruct | ✅ Complete | ✅ Full numbers | ❌ Missing dates | ❌ None | Medium |
| Qwen2.5-Coder | ✅ Complete | ✅ Full numbers | ✅ Perfect dates | ❌ None | High |
| OpenHermes | ✅ Complete | ✅ Full numbers | ✅ Good dates | ❌ None | High |
| DeepSeek-Coder-v2 | ⚠️ Partial | ✅ Full numbers | ⚠️ Some dates | ❌ None | Medium |
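
The columns above correspond to the per-statement metadata each model returns. A representative result from one of the stronger models might look like the following; the field names and values are illustrative assumptions, not the pipeline's actual output schema.

```python
# Illustrative only: field names and values are assumptions, not the real schema.
example_metadata = {
    "bank_name": "Example Bank",                                      # "Bank Name" column
    "account_number": "****1234",                                     # full digits vs last 4 varies by model
    "statement_period": {"start": "2024-01-01", "end": "2024-01-31"}, # "Statement Dates"
    "customer_info": None,                                            # most local models return nothing here
    "confidence": "high",
}
```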

Document Segmentation Accuracy

| Model | Expected | Detected | Accuracy | Notes |
|-------|----------|----------|----------|-------|
| GPT-4o-mini | 3 | 3 | 100% | Perfect boundaries |
| Mistral:Instruct | 3 | 3 | 100% | Exact match |
| Qwen2.5-Coder | 3 | 3 | 100% | Exact match |
| Phi4:latest | 3 | 3 | 100% | Exact match |
| OpenHermes | 3 | 3 (4 detected, 1 filtered) | 100% | Smart filtering |
| Qwen2.5:latest | 3 | 4 | 75% | Over-segmentation |
| Gemma2:9B | 3 | 2 | 67% | Under-segmentation |
| DeepSeek models | 3 | 1-2 | 33-67% | Various issues |
| Llama models | 3 | 2-3 | 67-100% | JSON parsing issues |
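
One formula consistent with the percentages above (3 vs 4 → 75%, 3 vs 2 → 67%) is the ratio of the smaller count to the larger count, which penalises over- and under-segmentation symmetrically. This is an inference from the table, not a documented scoring rule.

```python
def segmentation_accuracy(expected: int, detected: int) -> float:
    """Smaller count over larger count; inferred from the table, not a documented formula."""
    if expected == 0 and detected == 0:
        return 1.0
    return min(expected, detected) / max(expected, detected)

assert round(segmentation_accuracy(3, 3), 2) == 1.00  # exact match
assert round(segmentation_accuracy(3, 4), 2) == 0.75  # over-segmentation
assert round(segmentation_accuracy(3, 2), 2) == 0.67  # under-segmentation
```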

Processing Speed Comparison

Speed Categories

| Category | Time Range | Models | Use Cases |
|----------|------------|--------|-----------|
| Ultra Fast | < 7s | Gemma2:9B | Real-time processing |
| Fast | 7-9s | Mistral, Qwen2.5 variants, OpenHermes | Production workflows |
| Moderate | 9-15s | DeepSeek-Coder-v2, GPT-4o-mini, Llama3.1 | Standard processing |
| Slow | 15-25s | DeepSeek-r1 variants, Phi4 | Batch processing |
| Very Slow | > 30s | Qwen3, Llama3.2 | Background tasks only |
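
For reporting, the same buckets can be expressed as simple thresholds. The helper below is illustrative; since the table leaves 25-30s uncovered, that range is folded into "Slow" here.

```python
def speed_category(seconds: float) -> str:
    """Map a processing time to the speed categories in the table above."""
    if seconds < 7:
        return "Ultra Fast"
    if seconds < 9:
        return "Fast"
    if seconds < 15:
        return "Moderate"
    if seconds <= 30:
        return "Slow"       # table lists 15-25s; the 25-30s gap is folded in here
    return "Very Slow"

print(speed_category(6.65))   # Ultra Fast (Gemma2:9B)
print(speed_category(205.42)) # Very Slow (Llama3.2)
```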

Resource Requirements

Model Sizes and Memory Usage

| Model | Size (GB) | Memory Req | GPU Req | CPU Performance |
|-------|-----------|------------|---------|-----------------|
| Gemma2:9B | 5.4 | 8GB+ | Recommended | Good |
| Mistral:Instruct | 4.1 | 6GB+ | Recommended | Good |
| Qwen2.5:latest | 4.7 | 6GB+ | Recommended | Good |
| Qwen2.5-Coder | 4.7 | 6GB+ | Recommended | Good |
| OpenHermes | 4.1 | 6GB+ | Recommended | Good |
| DeepSeek-Coder-v2 | 8.9 | 12GB+ | Required | Poor |
| Llama3.1 | 4.7 | 6GB+ | Recommended | Good |
| Phi4 | 9.1 | 12GB+ | Required | Moderate |
| Qwen3 | 5.2 | 8GB+ | Recommended | Poor |
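
A practical way to read this table is to pick the fastest well-rated model that fits the available memory. The sketch below does exactly that; the model list and numbers are copied from the tables above, and the helper itself is illustrative rather than part of the pipeline.

```python
# (name, min_memory_gb, typical_time_s) -- values copied from the tables above.
MODELS = [
    ("gemma2:9b",         8,  6.65),
    ("mistral:instruct",  6,  7.63),
    ("qwen2.5:latest",    6,  8.53),
    ("qwen2.5-coder",     6,  8.59),
    ("openhermes",        6,  8.66),
    ("deepseek-coder-v2", 12, 9.33),
]

def fastest_model_for(memory_gb: int) -> str | None:
    """Fastest top-tier model that fits the given memory budget (illustrative helper)."""
    candidates = [m for m in MODELS if m[1] <= memory_gb]
    return min(candidates, key=lambda m: m[2])[0] if candidates else None

print(fastest_model_for(8))   # gemma2:9b
print(fastest_model_for(6))   # mistral:instruct
```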

Quality Scores Breakdown

Overall Quality Rating System

  • ⭐⭐⭐⭐⭐ Excellent: Perfect/near-perfect accuracy, fast processing
  • ⭐⭐⭐⭐ Very Good: Minor issues, reliable performance
  • ⭐⭐⭐ Good: Some issues but usable
  • ⭐⭐ Poor: Major issues, limited use
  • ⭐ Broken: Unsuitable for production

Quality Factor Weights

| Factor | Weight | Description |
|--------|--------|-------------|
| Segmentation Accuracy | 40% | Correct statement boundary detection |
| Metadata Extraction | 25% | Bank name, account, date extraction |
| Processing Speed | 20% | Time to complete processing |
| Response Quality | 10% | JSON formatting, parsing success |
| Reliability | 5% | Consistent performance, error rates |
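
As a worked example of how these weights combine: each factor is assumed to be normalised to 0-1 before weighting, and the per-factor scores below are invented purely for illustration.

```python
WEIGHTS = {
    "segmentation_accuracy": 0.40,
    "metadata_extraction":   0.25,
    "processing_speed":      0.20,
    "response_quality":      0.10,
    "reliability":           0.05,
}

def overall_quality(scores: dict) -> float:
    """Weighted sum of factor scores, each assumed to be normalised to 0-1."""
    return sum(WEIGHTS[factor] * scores.get(factor, 0.0) for factor in WEIGHTS)

# Invented example: perfect segmentation, strong metadata, middling speed.
print(overall_quality({
    "segmentation_accuracy": 1.0,
    "metadata_extraction":   0.9,
    "processing_speed":      0.6,
    "response_quality":      1.0,
    "reliability":           1.0,
}))  # 0.895
```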

Use Case Recommendations

High-Volume Production

  1. Gemma2:9B - Best speed/quality balance
  2. Mistral:Instruct - Best segmentation accuracy
  3. GPT-4o-mini - Maximum accuracy required

Development/Testing

  1. Qwen2.5-Coder - Code-focused processing
  2. OpenHermes - Quality control testing
  3. DeepSeek-Coder-v2 - Development iteration

Offline/Privacy-First

  1. Gemma2:9B - Best local performance
  2. Qwen2.5 variants - Feature-complete local
  3. Mistral:Instruct - Segmentation priority

Budget-Conscious

  1. OpenAI GPT-4o-mini - Best accuracy per dollar
  2. Self-hosted Gemma2:9B - Zero marginal cost
  3. Mistral:Instruct - Good local alternative

Experimental/Research

  1. Qwen2.5:latest - Most granular analysis
  2. OpenHermes - Confidence scoring research
  3. DeepSeek-r1:latest - Reasoning model testing