
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
- Top Score: 93.0%
- Models Evaluated: 6
- Dataset Size: 1,350 samples
- Last Updated: May 26, 2025
- Availability: three components (700 expert-reviewed MCQs from industry certifications, 150 CTF-style forensic tasks, and 500 NIST CFTT disk/memory forensic cases)
- Number of Tasks: 4

| Rank | Model | MCQ Mean Accuracy | MCQ Confidence Index | Evaluated By | Date |
|---|---|---|---|---|---|
| 1 | GPT-4o (`gpt-4o`, OpenAI) | 93.0% | 88.9% | DFIR-Metric authors | May 26, 2025 |
| 2 | GPT-4.1 (`gpt-4.1`, OpenAI) | 92.8% | 89.3% | DFIR-Metric authors | May 26, 2025 |
| 3 | Claude 3.7 Sonnet (`claude-3-7-sonnet`, Anthropic) | 91.6% | 86.4% | DFIR-Metric authors | May 26, 2025 |
| 4 | Gemini 2.5 Flash (`gemini-2.5-flash`, Google) | 90.4% | 85.4% | DFIR-Metric authors | May 26, 2025 |
| 5 | DeepSeek V3 (`deepseek-v3-671b`, DeepSeek) | 89.3% | 81.8% | DFIR-Metric authors | May 26, 2025 |
| 6 | Llama 3.3 70B Instruct (`llama-3.3-70b-instruct`, Meta) | 86.5% | 79.8% | DFIR-Metric authors | May 26, 2025 |
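The ranks in the table follow MCQ mean accuracy, and the confidence index orders the top two models differently. The short Python sketch below re-ranks the rows by each metric and computes the gap between the two scores per model. The numbers are copied from the table; the names `ROWS`, `rank_by`, and `gaps` are illustrative and not part of any DFIR-Metric tooling, and the sketch makes no assumption about how the confidence index itself is computed.

```python
# Leaderboard rows copied from the table above:
# (model, MCQ mean accuracy in %, MCQ confidence index in %).
ROWS = [
    ("GPT-4o", 93.0, 88.9),
    ("GPT-4.1", 92.8, 89.3),
    ("Claude 3.7 Sonnet", 91.6, 86.4),
    ("Gemini 2.5 Flash", 90.4, 85.4),
    ("DeepSeek V3", 89.3, 81.8),
    ("Llama 3.3 70B Instruct", 86.5, 79.8),
]


def rank_by(rows, column):
    """Return model names sorted from highest to lowest on the given column index."""
    return [row[0] for row in sorted(rows, key=lambda row: row[column], reverse=True)]


def gaps(rows):
    """Map each model to (mean accuracy - confidence index), rounded to one decimal."""
    return {name: round(acc - ci, 1) for name, acc, ci in rows}


by_accuracy = rank_by(ROWS, 1)      # ordering used by the Rank column
by_confidence = rank_by(ROWS, 2)    # alternative ordering by confidence index
accuracy_minus_ci = gaps(ROWS)

if __name__ == "__main__":
    print("By mean accuracy:   ", by_accuracy)
    print("By confidence index:", by_confidence)
    for name, gap in accuracy_minus_ci.items():
        print(f"{name}: accuracy exceeds confidence index by {gap} points")
```

Sorting by confidence index places GPT-4.1 ahead of GPT-4o, while the remaining four models keep their positions. The gap between the two metrics is smallest for GPT-4.1 and largest for DeepSeek V3.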