
A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence (CTI), with six specialized tasks: Knowledge Testing (CKT), Technique Extraction (ATE), Report Matching (RCM), Report Summarization (RMS), Threat Attribution (TAA), and Vulnerability Prediction (VSP).
- Top Score: 92.0%
- Models Evaluated: 5
- Dataset Size: 3,000 samples
- Last Updated: November 3, 2025
- Availability: Comprehensive CTI benchmark with six specialized tasks covering knowledge evaluation, technique extraction, report analysis, threat attribution, and vulnerability assessment. Includes full and mini dataset variants for quick iteration.
- Number of Tasks: 6

| Rank | Model | CKT accuracy | RCM accuracy | Combined score | Evaluated by | Date | Source |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5 (`gpt-5`, OpenAI) | 92.0% | 71.6% | 66.1% | AthenaBench authors | November 3, 2025 | Link |
| 2 | Gemini 2.5 Pro (`gemini-2.5-pro`, Google) | 89.1% | 71.2% | 63.6% | AthenaBench authors | November 3, 2025 | Link |
| 3 | GPT-4o (`gpt-4o`, OpenAI) | 85.2% | 71.3% | 58.0% | AthenaBench authors | November 3, 2025 | Link |
| 4 | Llama 3.3 70B Instruct (`llama-3.3-70b-instruct`, Meta) | 81.4% | 60.0% | 46.5% | AthenaBench authors | November 3, 2025 | Link |
| 5 | GPT-4 (`gpt-4-turbo`, OpenAI) | 78.7% | 63.1% | 51.4% | AthenaBench authors | November 3, 2025 | Link |
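
The rank order in the table follows descending CKT accuracy, not the combined score: the fourth-ranked and fifth-ranked models would swap places if the table were sorted by combined score. The sketch below re-ranks the five rows by any of the three reported metrics. It is a minimal illustration in Python that assumes the values exactly as listed above. `ROWS`, `METRICS`, and `rank_by` are hypothetical names chosen for this example and are not part of any benchmark tooling. How the combined score is computed is not stated here, so it is treated as an opaque column.

```python
# Rows copied from the leaderboard table above:
# (model, CKT accuracy, RCM accuracy, combined score), all in percent.
ROWS = [
    ("GPT-5", 92.0, 71.6, 66.1),
    ("Gemini 2.5 Pro", 89.1, 71.2, 63.6),
    ("GPT-4o", 85.2, 71.3, 58.0),
    ("Llama 3.3 70B Instruct", 81.4, 60.0, 46.5),
    ("GPT-4", 78.7, 63.1, 51.4),
]

# Hypothetical metric keys mapped to their position in each row tuple.
METRICS = {"ckt": 1, "rcm": 2, "combined": 3}


def rank_by(metric):
    """Return model names ordered from highest to lowest on the given metric."""
    idx = METRICS[metric]
    ordered = sorted(ROWS, key=lambda row: row[idx], reverse=True)
    return [row[0] for row in ordered]


if __name__ == "__main__":
    for metric in METRICS:
        print(f"{metric}: {rank_by(metric)}")
```

Sorting by `"ckt"` reproduces the published order, while `"combined"` and `"rcm"` each produce a different ordering, so the metric used for ranking matters when comparing models.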