Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Malware Analysis · Threat Intelligence · Behavior Analysis · Malware Classification

CyberSOCEval

Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

Quick Stats

Top Score

33.5% (malware analysis accuracy, Claude 3.7 Sonnet)

Models Evaluated

6

Dataset Size

1,197 samples

Last Updated

September 24, 2025

Availability

Dataset ✓ · Code ✓
Metrics Tracked
malware analysis-accuracy · threat intelligence-accuracy
Dataset Information

Multiple-choice benchmark with 609 Malware Analysis test cases derived from CrowdStrike Falcon Sandbox detonation reports across five malware families, plus 588 Threat Intelligence Reasoning test cases drawn from 45 CTI reports (sources include CrowdStrike, CISA, IC3, NSA).
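As a quick consistency check, the two test-case counts above can be summed and compared with the dataset size listed in Quick Stats. This is a minimal sketch: the variable names are illustrative and are not part of the benchmark's code.

```python
# Test-case counts as stated in the dataset description above.
counts = {
    "Malware Analysis": 609,
    "Threat Intelligence Reasoning": 588,
}

# The total should match the "1,197 samples" figure in Quick Stats.
total = sum(counts.values())

# Share of the benchmark contributed by each task.
shares = {task: n / total for task, n in counts.items()}

print(f"total: {total:,}")
for task, share in shares.items():
    print(f"{task}: {share:.1%}")
```

The two tasks are close to evenly weighted, so neither dominates the sample count.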

Number of Tasks

3

Malware Analysis · Threat Intelligence Reasoning · SOC Reasoning
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | Model ID | Provider | malware analysis-accuracy | threat intelligence-accuracy | Evaluated By | Date | Source
---- | ----- | -------- | -------- | ------------------------- | ---------------------------- | ------------ | ---- | ------
1 | Claude 3.7 Sonnet | claude-3-7-sonnet | Anthropic | 33.5% | 49.3% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
2 | OpenAI o3 | o3 | OpenAI | 29.4% | 52.9% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
3 | Llama 4 Maverick | llama-4-maverick | Meta | 28.6% | 50.4% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
4 | Gemini 2.5 Pro | gemini-2.5-pro | Google | 27.4% | 45.1% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
5 | GPT-4o | gpt-4o | OpenAI | 24.6% | 43.1% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
6 | Llama 4 Scout | llama-4-scout | Meta | 23.0% | 43.8% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link
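The ranks listed here follow malware analysis accuracy, and the order changes if the models are sorted by threat intelligence accuracy instead. The sketch below re-ranks the six reported scores by either metric. The `Result` type and the `ranked` helper are hypothetical names introduced for illustration; only the scores come from this page.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    malware_acc: float        # malware analysis-accuracy, in percent
    threat_intel_acc: float   # threat intelligence-accuracy, in percent


# Scores copied from the Model Results listing on this page.
RESULTS = [
    Result("Claude 3.7 Sonnet", 33.5, 49.3),
    Result("OpenAI o3", 29.4, 52.9),
    Result("Llama 4 Maverick", 28.6, 50.4),
    Result("Gemini 2.5 Pro", 27.4, 45.1),
    Result("GPT-4o", 24.6, 43.1),
    Result("Llama 4 Scout", 23.0, 43.8),
]


def ranked(results: list[Result], metric: str) -> list[Result]:
    """Return results sorted from highest to lowest on the given metric field."""
    return sorted(results, key=lambda r: getattr(r, metric), reverse=True)


by_malware = ranked(RESULTS, "malware_acc")
by_threat_intel = ranked(RESULTS, "threat_intel_acc")

for metric, rows in (("malware_acc", by_malware), ("threat_intel_acc", by_threat_intel)):
    print(metric)
    for rank, row in enumerate(rows, start=1):
        print(f"  {rank}. {row.model}: {getattr(row, metric):.1f}%")
```

Sorting by `threat_intel_acc` puts OpenAI o3 first and moves Llama 4 Scout ahead of GPT-4o, so the listed rank should be read as specific to the malware analysis metric.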
Benchmarking frontier models across cybersecurity tasks.


© 2026 Cyber LLM Benchmark Hub