
Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence
- Top Score: 71.2%
- Models Evaluated: 5
- Dataset Size: 691 samples
- Last Updated: October 13, 2025
- Availability: 691 QA pairs across 9 CTI tasks in three categories: 371 structured (CTI-RCM, CTI-WIM, CTI-ATD, CTI-ESD), 150 unstructured (CTI-MLA, CTI-TAP, CTI-CSC), and 170 hybrid (CTI-VCA, CTI-ATA). Built from CVE, CWE, CAPEC, and ATT&CK entries and vendor reports, using a pipeline that combines an LLM-as-judge with human cross-verification.
- Number of Tasks: 3 (CTI-CSC, CTI-TAP, and CTI-MLA, the unstructured tasks scored on this leaderboard)
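As a quick sanity check on the dataset description, the three category counts can be summed and compared with the stated dataset size. This is a minimal sketch; the variable names are illustrative and not part of any benchmark tooling.

```python
# Sample counts per category, copied from the dataset description above.
splits = {
    "structured": 371,    # CTI-RCM, CTI-WIM, CTI-ATD, CTI-ESD
    "unstructured": 150,  # CTI-MLA, CTI-TAP, CTI-CSC
    "hybrid": 170,        # CTI-VCA, CTI-ATA
}

# The total should match the stated dataset size of 691 samples.
total = sum(splits.values())

# Share of each category as a percentage of the whole dataset.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)
print(shares)
```

The counts add up to the stated 691 samples, and the structured category accounts for a little over half of the dataset.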
| Rank | Model | Model ID | Provider | CTI-CSC F1 | CTI-TAP F1 | CTI-MLA F1 | Evaluated By | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 | qwen3-235b | Alibaba | 71.2% | 70.9% | 40.7% | CTIArena authors | October 13, 2025 |
| 2 | Gemini 2.5 Pro | gemini-2.5-pro | Google | 69.4% | 60.8% | 44.1% | CTIArena authors | October 13, 2025 |
| 3 | Gemini 2.5 Flash | gemini-2.5-flash | Google | 69.4% | 69.4% | 32.8% | CTIArena authors | October 13, 2025 |
| 4 | GPT-5 | gpt-5 | OpenAI | 67.1% | 58.3% | 39.4% | CTIArena authors | October 13, 2025 |
| 5 | GPT-4o | gpt-4o | OpenAI | 66.0% | 67.1% | 39.3% | CTIArena authors | October 13, 2025 |
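The row order in the table follows the CSC F1 column, so the leader on another task can differ. The sketch below, using the scores copied from the table, finds the best model per task and computes an unweighted mean of the three F1 scores. The mean is an illustrative summary only; it is an assumption of this example, not a metric the leaderboard reports or ranks by.

```python
# F1 scores (percent) copied from the leaderboard table above.
scores = {
    "qwen3-235b":       {"csc": 71.2, "tap": 70.9, "mla": 40.7},
    "gemini-2.5-pro":   {"csc": 69.4, "tap": 60.8, "mla": 44.1},
    "gemini-2.5-flash": {"csc": 69.4, "tap": 69.4, "mla": 32.8},
    "gpt-5":            {"csc": 67.1, "tap": 58.3, "mla": 39.4},
    "gpt-4o":           {"csc": 66.0, "tap": 67.1, "mla": 39.3},
}
tasks = ("csc", "tap", "mla")

# Highest-scoring model on each task.
best_per_task = {
    task: max(scores, key=lambda model: scores[model][task]) for task in tasks
}

# Unweighted mean over the three tasks (illustrative, not an official metric).
mean_f1 = {
    model: round(sum(by_task[t] for t in tasks) / len(tasks), 1)
    for model, by_task in scores.items()
}

for task, model in best_per_task.items():
    print(f"{task}: {model} ({scores[model][task]}%)")
for model, value in sorted(mean_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: mean F1 {value}%")
```

On these numbers, `qwen3-235b` leads on CSC and TAP, while `gemini-2.5-pro` leads on MLA, where every model scores below 45%.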