Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Tags: incident-response · digital-forensics · forensics · log-analysis

DFIR-Metric

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

View Paper · Compare Models
Quick Stats

Top Score

0.0%

Models Evaluated

0

Dataset Size

1,350 samples

Last Updated

May 26, 2025

Paper Details

Title

DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Authors

Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky

+3 more

Published

May 26, 2025

arXiv ID

2505.19973
Metrics Tracked
accuracy · consistency · task-understanding-score
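
This page does not define the tracked metrics. A minimal sketch of how accuracy and a run-to-run consistency score over repeated trials might be computed; the definitions below (exact-match accuracy, agreement with the modal answer across runs) are assumptions for illustration, not the benchmark's own formulas:

```python
from collections import Counter

def accuracy(preds, gold):
    """Fraction of predicted answers matching the reference key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def consistency(runs):
    """Assumed definition: for each question, the fraction of repeated
    runs that agree with the most common answer, averaged over questions."""
    per_question = []
    for answers in zip(*runs):  # answers to one question across all runs
        modal_count = Counter(answers).most_common(1)[0][1]
        per_question.append(modal_count / len(answers))
    return sum(per_question) / len(per_question)

# Toy example: 4 MCQ answers, 3 repeated runs of the same model
gold = ["A", "C", "B", "D"]
runs = [["A", "C", "B", "A"],
        ["A", "C", "D", "A"],
        ["A", "B", "B", "A"]]

print(accuracy(runs[0], gold))          # 0.75
print(round(consistency(runs), 2))      # 0.83
```

Separating consistency from accuracy matters for forensic use: a model that answers confidently but differently on each run is unreliable even when its average accuracy looks acceptable.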
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

Three components: 700 expert-reviewed MCQs from industry certifications, 150 CTF-style forensic tasks, and 500 NIST CFTT disk/memory forensic cases
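The composition above can be sanity-checked against the listed dataset size; the dict layout here is purely illustrative and not the dataset's actual schema:

```python
# Component sizes as stated on this page (illustrative structure only)
components = {
    "certification_mcqs": 700,   # expert-reviewed MCQs from industry certifications
    "ctf_forensic_tasks": 150,   # CTF-style forensic tasks
    "nist_cftt_cases": 500,      # NIST CFTT disk/memory forensic cases
}

total = sum(components.values())
print(total)  # 1350, matching the listed dataset size of 1,350 samples
```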

Task Categories

knowledge-assessment · ctf-forensic-challenges · disk-forensics · memory-forensics

Dataset Size

1,350 samples

Model Results
Detailed scores for each model evaluated on this benchmark

No results yet

Be the first to submit results for this benchmark!

Submit Results