
CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

- **Top Score:** 84.7%
- **Models Evaluated:** 7
- **Dataset Size:** 2,275 samples
- **Last Updated:** September 1, 2024
- **Title:** CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
- **Authors:** Lauren Deason, Adam Bali, Ciprian Bejean, +20 more
- **Published:** September 24, 2025
- **arXiv ID:** 2509.20166

Comprehensive evaluation across malware analysis and threat intelligence tasks with real-world cybersecurity scenarios.
- **Tasks:** malware-classification, threat-detection, malware-family-identification, threat-intelligence-analysis
| Rank | Model | Accuracy | F1 Score | Precision | Recall | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (`gpt-4-0613`, OpenAI) | 84.7% | 82.3% | 88.1% | 79.2% | Meta AI Research | September 1, 2024 |
| 2 | Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`, Anthropic) | 83.5% | 81.2% | 86.9% | 77.8% | Meta AI Research | September 1, 2024 |
| 3 | Claude 3 Opus (`claude-3-opus-20240229`, Anthropic) | 82.1% | 79.8% | 85.4% | 75.8% | Meta AI Research | September 1, 2024 |
| 4 | Gemini Ultra (`gemini-ultra-1.0`, Google) | 81.2% | 78.7% | 84.5% | 74.2% | Meta AI Research | September 1, 2024 |
| 5 | Llama 3.1 70B Instruct (`llama-3.1-70b-instruct`, Meta) | 78.9% | 76.5% | 82.3% | 71.8% | Meta AI Research | September 1, 2024 |
| 6 | Llama 2 70B Chat (`llama-2-70b-chat`, Meta) | 76.8% | 74.5% | 80.2% | 69.8% | Meta AI Research | September 1, 2024 |
| 7 | GPT-3.5 Turbo (`gpt-3.5-turbo-0613`, OpenAI) | 73.4% | 71.2% | 77.8% | 65.8% | Meta AI Research | September 1, 2024 |
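As a reading aid: F1 is the harmonic mean of precision and recall. The F1 values in the table sit slightly below the harmonic mean of the corresponding Precision and Recall columns, which is what you would expect if the benchmark reports macro-averaged (per-class) scores rather than a single aggregate pair; the averaging scheme is an assumption here, not stated on this page. A minimal sketch of the relation:

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (binary F1), in percent."""
    return 2 * precision * recall / (precision + recall)

# Top-ranked row: precision 88.1%, recall 79.2%, reported F1 82.3%.
# The harmonic mean of the aggregate columns comes out a bit higher,
# consistent with the reported F1 being averaged per class instead.
print(round(harmonic_f1(88.1, 79.2), 1))  # 83.4
```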