
Benchmarking LLM Capabilities for Malware Analysis and Threat Intelligence Reasoning
Top Score: 33.5% (malware analysis accuracy)
Models Evaluated: 6
Dataset Size: 1,197 samples
Last Updated: September 24, 2025
Availability: Multiple-choice benchmark with 609 Malware Analysis test cases derived from CrowdStrike Falcon Sandbox detonation reports across five malware families, plus 588 Threat Intelligence Reasoning test cases drawn from 45 CTI reports (sources include CrowdStrike, CISA, IC3, and NSA).
Number of Tasks: 3
| Rank | Model | Model ID | Organization | Malware Analysis Accuracy | Threat Intelligence Accuracy | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude 3.7 Sonnet | claude-3-7-sonnet | Anthropic | 33.5% | 49.3% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
| 2 | OpenAI o3 | o3 | OpenAI | 29.4% | 52.9% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
| 3 | Llama 4 Maverick | llama-4-maverick | Meta | 28.6% | 50.4% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
| 4 | Gemini 2.5 Pro | gemini-2.5-pro | Google | 27.4% | 45.1% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
| 5 | GPT-4o | gpt-4o | OpenAI | 24.6% | 43.1% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
| 6 | Llama 4 Scout | llama-4-scout | Meta | 23.0% | 43.8% | CyberSOCEval authors (Meta + CrowdStrike) | September 24, 2025 | Link |
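
The rank order in the table follows malware analysis accuracy, and sorting by threat intelligence accuracy gives a different order. The short Python sketch below re-derives both orderings from the published figures and checks that the two test-case counts add up to the stated dataset size. The names `Entry`, `LEADERBOARD`, and `ranked` are illustrative assumptions, not part of any CyberSOCEval tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row; accuracies are percentages copied from the table."""
    model: str
    malware_analysis: float
    threat_intelligence: float


# Figures transcribed from the leaderboard table (hypothetical container name).
LEADERBOARD = [
    Entry("Claude 3.7 Sonnet", 33.5, 49.3),
    Entry("OpenAI o3", 29.4, 52.9),
    Entry("Llama 4 Maverick", 28.6, 50.4),
    Entry("Gemini 2.5 Pro", 27.4, 45.1),
    Entry("GPT-4o", 24.6, 43.1),
    Entry("Llama 4 Scout", 23.0, 43.8),
]

# Test-case counts stated in the benchmark description.
MALWARE_CASES = 609
THREAT_INTEL_CASES = 588


def ranked(entries, metric):
    """Return model names sorted by the given accuracy field, best first."""
    ordered = sorted(entries, key=lambda e: getattr(e, metric), reverse=True)
    return [e.model for e in ordered]


by_malware = ranked(LEADERBOARD, "malware_analysis")
by_threat_intel = ranked(LEADERBOARD, "threat_intelligence")
total_cases = MALWARE_CASES + THREAT_INTEL_CASES

if __name__ == "__main__":
    print("Total test cases:", total_cases)
    print("By malware analysis accuracy:", by_malware)
    print("By threat intelligence accuracy:", by_threat_intel)
```

Under this ordering, Claude 3.7 Sonnet leads on malware analysis while OpenAI o3 leads on threat intelligence reasoning, so a single overall rank hides a task-dependent picture.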