
CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

- **Top Score:** 84.7%
- **Models Evaluated:** 7
- **Dataset Size:** 2,275 samples
- **Last Updated:** September 1, 2024
- **Title:** CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
- **Authors:** Lauren Deason, Adam Bali, Ciprian Bejean, +20 more
- **Published:** September 24, 2025
- **arXiv ID:** 2509.20166

Comprehensive evaluation across malware analysis and threat intelligence tasks with real-world cybersecurity scenarios.
- **Tasks:** malware-classification, threat-detection, malware-family-identification, threat-intelligence-analysis
| Rank | Model | Accuracy | F1 Score | Precision | Recall | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (`gpt-4-0613`, OpenAI) | 84.7% | 82.3% | 88.1% | 79.2% | Meta AI Research | September 1, 2024 |
| 2 | Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`, Anthropic) | 83.5% | 81.2% | 86.9% | 77.8% | Meta AI Research | September 1, 2024 |
| 3 | Claude 3 Opus (`claude-3-opus-20240229`, Anthropic) | 82.1% | 79.8% | 85.4% | 75.8% | Meta AI Research | September 1, 2024 |
| 4 | Gemini Ultra (`gemini-ultra-1.0`, Google) | 81.2% | 78.7% | 84.5% | 74.2% | Meta AI Research | September 1, 2024 |
| 5 | Llama 3.1 70B Instruct (`llama-3.1-70b-instruct`, Meta) | 78.9% | 76.5% | 82.3% | 71.8% | Meta AI Research | September 1, 2024 |
| 6 | Llama 2 70B Chat (`llama-2-70b-chat`, Meta) | 76.8% | 74.5% | 80.2% | 69.8% | Meta AI Research | September 1, 2024 |
| 7 | GPT-3.5 Turbo (`gpt-3.5-turbo-0613`, OpenAI) | 73.4% | 71.2% | 77.8% | 65.8% | Meta AI Research | September 1, 2024 |
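As a reading aid: F1 is the harmonic mean of precision and recall. The F1 values in the table sit slightly below the harmonic mean of the corresponding Precision and Recall columns, which is what you would expect if the benchmark reports macro-averaged (per-class) scores rather than a single aggregate pair; the averaging scheme is an assumption here, not stated on this page. A minimal sketch of the relation:

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (binary F1), in percent."""
    return 2 * precision * recall / (precision + recall)

# Top-ranked row: precision 88.1%, recall 79.2%, reported F1 82.3%.
# The harmonic mean of the aggregate columns comes out a bit higher,
# consistent with the reported F1 being averaged per class instead.
print(round(harmonic_f1(88.1, 79.2), 1))  # 83.4
```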