Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Threat Intelligence, CTI Extraction, Threat Actor Analysis, IOC Identification

AthenaBench

A dynamic benchmark for evaluating LLMs in cyber threat intelligence (CTI). It comprises six specialized CTI tasks: Knowledge Testing (CKT), Technique Extraction (ATE), Report Matching (RCM), Report Summarization (RMS), Threat Attribution (TAA), and Vulnerability Prediction (VSP).

Quick Stats

Top Score

92.0%

Models Evaluated

5

Dataset Size

3,000 samples

Last Updated

November 3, 2025

Availability

Dataset ✓, Code ✓
Metrics Tracked
CKT accuracy, RCM accuracy, combined score
Dataset Information

A comprehensive CTI benchmark with six specialized tasks covering knowledge evaluation, technique extraction, report analysis, threat attribution, and vulnerability assessment. It is available in full and mini dataset variants; the mini variant is intended for quick iteration.

Number of Tasks

6

  • CTI Knowledge Test
  • Adversary Technique Extraction
  • Report Comprehension Matching
  • Report Mapping Summarization
  • Threat Actor Attribution
  • Vulnerability Severity Prediction
Performance Comparison
[Chart: visual comparison of model performance on this benchmark]
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                   Model ID                Provider  CKT accuracy  RCM accuracy  Combined score
1     GPT-5                   gpt-5                   OpenAI    92.0%         71.6%         66.1%
2     Gemini 2.5 Pro          gemini-2.5-pro          Google    89.1%         71.2%         63.6%
3     GPT-4o                  gpt-4o                  OpenAI    85.2%         71.3%         58.0%
4     Llama 3.3 70B Instruct  llama-3.3-70b-instruct  Meta      81.4%         60.0%         46.5%
5     GPT-4                   gpt-4-turbo             OpenAI    78.7%         63.1%         51.4%

All models were evaluated by the AthenaBench authors on November 3, 2025.
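The listed ranks appear to follow CKT accuracy, since ordering by combined score would swap the fourth and fifth entries. The sketch below re-sorts the reported scores by each tracked metric to make that visible. The scores are copied from the results table; the `results` structure and the `rank_by` helper are hypothetical names introduced only for illustration and are not part of the benchmark's code.

```python
# Scores (in percent) copied from the results table on this page.
# The data layout and helper below are illustrative assumptions,
# not the benchmark's own tooling.
results = [
    {"model": "GPT-5", "ckt": 92.0, "rcm": 71.6, "combined": 66.1},
    {"model": "Gemini 2.5 Pro", "ckt": 89.1, "rcm": 71.2, "combined": 63.6},
    {"model": "GPT-4o", "ckt": 85.2, "rcm": 71.3, "combined": 58.0},
    {"model": "Llama 3.3 70B Instruct", "ckt": 81.4, "rcm": 60.0, "combined": 46.5},
    {"model": "GPT-4", "ckt": 78.7, "rcm": 63.1, "combined": 51.4},
]


def rank_by(rows, metric):
    """Return model names ordered from highest to lowest value of `metric`."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [row["model"] for row in ordered]


if __name__ == "__main__":
    for metric in ("ckt", "rcm", "combined"):
        print(f"{metric:>8}: {', '.join(rank_by(results, metric))}")
```

Sorting by `ckt` reproduces the order shown in the table, while sorting by `combined` places GPT-4 ahead of Llama 3.3 70B Instruct, and sorting by `rcm` places GPT-4o ahead of Gemini 2.5 Pro.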
Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub