Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Tags: incident-response · digital-forensics · forensics · log-analysis

DFIR-Metric

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

View Paper · Compare Models
Quick Stats

Top Score

0.0%

Models Evaluated

0

Dataset Size

1,350 samples

Last Updated

May 26, 2025

Paper Details

Title

DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Authors

Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky

+3 more

Published

May 26, 2025

arXiv ID

2505.19973
Metrics Tracked
accuracy · consistency · task-understanding-score
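
This page does not define the tracked metrics. A minimal sketch of how accuracy and a run-to-run consistency score over repeated trials might be computed; the definitions below (exact-match accuracy, agreement with the modal answer across runs) are assumptions for illustration, not the benchmark's own formulas:

```python
from collections import Counter

def accuracy(preds, gold):
    """Fraction of predicted answers matching the reference key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def consistency(runs):
    """Assumed definition: for each question, the fraction of repeated
    runs that agree with the most common answer, averaged over questions."""
    per_question = []
    for answers in zip(*runs):  # answers to one question across all runs
        modal_count = Counter(answers).most_common(1)[0][1]
        per_question.append(modal_count / len(answers))
    return sum(per_question) / len(per_question)

# Toy example: 4 MCQ answers, 3 repeated runs of the same model
gold = ["A", "C", "B", "D"]
runs = [["A", "C", "B", "A"],
        ["A", "C", "D", "A"],
        ["A", "B", "B", "A"]]

print(accuracy(runs[0], gold))          # 0.75
print(round(consistency(runs), 2))      # 0.83
```

Separating consistency from accuracy matters for forensic use: a model that answers confidently but differently on each run is unreliable even when its average accuracy looks acceptable.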
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

Three components: 700 expert-reviewed MCQs from industry certifications, 150 CTF-style forensic tasks, and 500 NIST CFTT disk/memory forensic cases
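The composition above can be sanity-checked against the listed dataset size; the dict layout here is purely illustrative and not the dataset's actual schema:

```python
# Component sizes as stated on this page (illustrative structure only)
components = {
    "certification_mcqs": 700,   # expert-reviewed MCQs from industry certifications
    "ctf_forensic_tasks": 150,   # CTF-style forensic tasks
    "nist_cftt_cases": 500,      # NIST CFTT disk/memory forensic cases
}

total = sum(components.values())
print(total)  # 1350, matching the listed dataset size of 1,350 samples
```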

Task Categories

knowledge-assessment · ctf-forensic-challenges · disk-forensics · memory-forensics

Dataset Size

1,350 samples

Model Results
Detailed scores for each model evaluated on this benchmark

No results yet

Be the first to submit results for this benchmark!

Submit Results