Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
malware-analysis • threat-intelligence • behavior-analysis • malware-classification

CyberSOCEval

Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

View Paper • Compare Models
Quick Stats

Top Score: 84.7%
Models Evaluated: 7
Dataset Size: 2,275 samples
Last Updated: September 1, 2024

Paper Details

Title: CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
Authors: Lauren Deason, Adam Bali, Ciprian Bejean, +20 more
Published: September 24, 2025
arXiv ID: 2509.20166
Metrics Tracked
accuracy • f1 score • precision • recall
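As an illustrative sketch only (not the benchmark's own evaluation code, which is not shown here), the four tracked metrics can be computed for a binary classification task as follows:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: five predictions against ground truth
m = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
# → accuracy 0.6; precision, recall, and f1 all 2/3
```

Note that the benchmark reports multi-class tasks (e.g. malware-family identification), where these metrics would typically be averaged per class; the binary form above is the simplest case.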
Availability
Dataset Available: No
Code Available: Yes
Dataset Information

Comprehensive evaluation across malware analysis and threat intelligence tasks with real-world cybersecurity scenarios.

Tasks: malware-classification • threat-detection • malware-family-identification • threat-intelligence-analysis
Dataset Size: 2,275 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
| Rank | Model | Variant | Accuracy | F1 Score | Precision | Recall | Evaluated By | Date |
|------|-------|---------|----------|----------|-----------|--------|--------------|------|
| 1 | GPT-4 | gpt-4-0613 (OpenAI) | 84.7% | 82.3% | 88.1% | 79.2% | Meta AI Research | September 1, 2024 |
| 2 | Claude 3.5 | claude-3-5-sonnet-20241022 (Anthropic) | 83.5% | 81.2% | 86.9% | 77.8% | Meta AI Research | September 1, 2024 |
| 3 | Claude 3 | claude-3-opus-20240229 (Anthropic) | 82.1% | 79.8% | 85.4% | 75.8% | Meta AI Research | September 1, 2024 |
| 4 | Gemini Ultra | gemini-ultra-1.0 (Google) | 81.2% | 78.7% | 84.5% | 74.2% | Meta AI Research | September 1, 2024 |
| 5 | Llama 3.1 | llama-3.1-70b-instruct (Meta) | 78.9% | 76.5% | 82.3% | 71.8% | Meta AI Research | September 1, 2024 |
| 6 | Llama 2 | llama-2-70b-chat (Meta) | 76.8% | 74.5% | 80.2% | 69.8% | Meta AI Research | September 1, 2024 |
| 7 | GPT-3.5 | gpt-3.5-turbo-0613 (OpenAI) | 73.4% | 71.2% | 77.8% | 65.8% | Meta AI Research | September 1, 2024 |