Threat Intelligence, CTI Extraction, Threat Actor Analysis, IOC Identification

AthenaBench

A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence, with six specialized CTI tasks: Knowledge Testing (CKT), Technique Extraction (ATE), Report Matching (RCM), Report Summarization (RMS), Threat Attribution (TAA), and Vulnerability Prediction (VSP).

Quick Stats

Top Score

92.0% (ckt accuracy, GPT-5)

Models Evaluated

Dataset Size

3,000 samples

Last Updated

November 3, 2025

Availability

Dataset ✓, Code ✓

Metrics Tracked

ckt accuracy, rcm accuracy, combined score

Sources

Code

Dataset Information

Comprehensive CTI benchmark with six specialized tasks covering knowledge evaluation, technique extraction, report analysis, threat attribution, and vulnerability assessment. Includes full and mini dataset variants for quick iteration.

Number of Tasks

6

CTI Knowledge Test, Adversary Technique Extraction, Report Comprehension Matching, Report Mapping Summarization, Threat Actor Attribution, Vulnerability Severity Prediction

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	ckt accuracy	rcm accuracy	combined score	Evaluated By	Date	Source
1	GPT-5 (gpt-5, OpenAI)	92.0%	71.6%	66.1%	AthenaBench authors	November 3, 2025	Link
2	Gemini 2.5 Pro (gemini-2.5-pro, Google)	89.1%	71.2%	63.6%	AthenaBench authors	November 3, 2025	Link
3	GPT-4o (gpt-4o, OpenAI)	85.2%	71.3%	58.0%	AthenaBench authors	November 3, 2025	Link
4	Llama 3.3 70B Instruct (llama-3.3-70b-instruct, Meta)	81.4%	60.0%	46.5%	AthenaBench authors	November 3, 2025	Link
5	GPT-4 (gpt-4-turbo, OpenAI)	78.7%	63.1%	51.4%	AthenaBench authors	November 3, 2025	Link
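
The rank order in the table follows ckt accuracy, so sorting by another metric can reorder the models. The sketch below re-ranks the five rows by any of the three tracked metrics. It is illustrative only: the scores are copied from the table, while the `RESULTS` structure, the short metric keys (`ckt`, `rcm`, `combined`), and the `rank_by` helper are hypothetical names and not part of the AthenaBench code.

```python
# Scores (in percent) copied from the results table above.
# The dict layout and key names are assumptions made for this example.
RESULTS = [
    {"model": "GPT-5", "ckt": 92.0, "rcm": 71.6, "combined": 66.1},
    {"model": "Gemini 2.5 Pro", "ckt": 89.1, "rcm": 71.2, "combined": 63.6},
    {"model": "GPT-4o", "ckt": 85.2, "rcm": 71.3, "combined": 58.0},
    {"model": "Llama 3.3 70B Instruct", "ckt": 81.4, "rcm": 60.0, "combined": 46.5},
    {"model": "GPT-4", "ckt": 78.7, "rcm": 63.1, "combined": 51.4},
]


def rank_by(metric, rows=RESULTS):
    """Return model names sorted by `metric`, highest score first."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [row["model"] for row in ordered]


if __name__ == "__main__":
    for metric in ("ckt", "rcm", "combined"):
        print(f"{metric}: {rank_by(metric)}")
```

Sorting by combined score swaps the last two rows (GPT-4 at 51.4% ahead of Llama 3.3 70B Instruct at 46.5%), and sorting by rcm accuracy places GPT-4o (71.3%) just ahead of Gemini 2.5 Pro (71.2%).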