Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Incident Response, Digital Forensics, Forensics, Log Analysis

DFIR-Metric

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Quick Stats

Top Score: 93.0%
Models Evaluated: 6
Dataset Size: 1,350 samples
Last Updated: May 26, 2025
Availability: Dataset ✓, Code ✓
Metrics Tracked
mcq mean-accuracy, mcq confidence-index
Sources
Project
Dataset Information

Three components: 700 expert-reviewed MCQs from industry certifications, 150 CTF-style forensic tasks, and 500 NIST CFTT disk/memory forensic cases
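
The component counts above can be checked against the stated dataset size of 1,350 samples. The following is a minimal sketch of that arithmetic; the dictionary keys and variable names are illustrative labels chosen here, not identifiers from the DFIR-Metric code or dataset.

```python
# Component sizes as listed in the Dataset Information section of this page.
# The key strings are descriptive labels only (hypothetical, not dataset field names).
COMPONENTS = {
    "expert-reviewed MCQs": 700,
    "CTF-style forensic tasks": 150,
    "NIST CFTT disk/memory forensic cases": 500,
}

# Sum of the three components; this should match the "Dataset Size" quick stat.
total = sum(COMPONENTS.values())

# Share of each component as a percentage of the whole dataset.
shares = {name: round(100 * count / total, 1) for name, count in COMPONENTS.items()}

for name, count in COMPONENTS.items():
    print(f"{name}: {count} samples ({shares[name]}% of {total})")
```

The MCQ component makes up just over half of the samples, which is worth keeping in mind because both metrics tracked on this page are MCQ metrics.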

Number of Tasks: 4

Knowledge Assessment, CTF Forensic Challenges, Disk Forensics, Memory Forensics
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                                                   mcq mean-accuracy  mcq confidence-index  Evaluated By         Date          Source
1     GPT-4o (gpt-4o • OpenAI)                                93.0%              88.9%                 DFIR-Metric authors  May 26, 2025  Link
2     GPT-4.1 (gpt-4.1 • OpenAI)                              92.8%              89.3%                 DFIR-Metric authors  May 26, 2025  Link
3     Claude 3.7 Sonnet (claude-3-7-sonnet • Anthropic)       91.6%              86.4%                 DFIR-Metric authors  May 26, 2025  Link
4     Gemini 2.5 Flash (gemini-2.5-flash • Google)            90.4%              85.4%                 DFIR-Metric authors  May 26, 2025  Link
5     DeepSeek V3 (deepseek-v3-671b • DeepSeek)               89.3%              81.8%                 DFIR-Metric authors  May 26, 2025  Link
6     Llama 3.3 70B Instruct (llama-3.3-70b-instruct • Meta)  86.5%              79.8%                 DFIR-Metric authors  May 26, 2025  Link
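
The ranking on this page follows mcq mean-accuracy, and the order changes slightly when the same rows are sorted by mcq confidence-index. The sketch below re-ranks the six models on each metric and computes the gap between the two scores. The scores are copied from the results on this page; the helper `rank_by` and the tuple layout are hypothetical conveniences, and the sketch makes no assumption about how the confidence index itself is computed.

```python
# (model name, mcq mean-accuracy %, mcq confidence-index %), copied from this page.
RESULTS = [
    ("GPT-4o", 93.0, 88.9),
    ("GPT-4.1", 92.8, 89.3),
    ("Claude 3.7 Sonnet", 91.6, 86.4),
    ("Gemini 2.5 Flash", 90.4, 85.4),
    ("DeepSeek V3", 89.3, 81.8),
    ("Llama 3.3 70B Instruct", 86.5, 79.8),
]


def rank_by(rows, column):
    """Return model names ordered from highest to lowest score in the given column."""
    return [row[0] for row in sorted(rows, key=lambda r: r[column], reverse=True)]


by_accuracy = rank_by(RESULTS, 1)    # order used by the leaderboard
by_confidence = rank_by(RESULTS, 2)  # order if confidence-index were the sort key

# Difference between the two metrics for each model, in percentage points.
gaps = {name: round(acc - ci, 1) for name, acc, ci in RESULTS}

print("By mean accuracy:   ", by_accuracy)
print("By confidence index:", by_confidence)
print("Accuracy minus confidence index:", gaps)
```

Sorted by confidence index, GPT-4.1 (89.3%) moves ahead of GPT-4o (88.9%); the other four positions are unchanged.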
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub