
A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Top Score
71.0% (CTI-MCQ accuracy, GPT-4)
Models Evaluated
5
Dataset Size
2,500 samples
Last Updated
June 11, 2024
Availability
Five evaluation tasks: CTI-MCQ (2,500 multiple-choice questions from MITRE/CWE/standards), CTI-RCM (1,000 CVE→CWE root-cause mappings), CTI-VSP (CVSS v3 severity prediction), CTI-ATE (MITRE ATT&CK technique extraction), and CTI-TAA (threat actor attribution).
Number of Tasks
3 scored on this leaderboard (CTI-MCQ, CTI-RCM, CTI-ATE)
| Rank | Model | CTI-MCQ accuracy | CTI-RCM accuracy | CTI-ATE F1 | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (gpt-4-turbo), OpenAI | 71.0% | 72.0% | 63.9% | CTIBench authors | June 11, 2024 | Link |
| 2 | Llama 3 70B Chat (llama-3-70b-instruct), Meta | 65.7% | 65.9% | 47.2% | CTIBench authors | June 11, 2024 | Link |
| 3 | Gemini 1.5 Pro (gemini-1.5-pro), Google | 65.4% | 66.6% | 46.1% | CTIBench authors | June 11, 2024 | Link |
| 4 | Llama 3 8B Instruct (llama-3-8b-instruct), Meta | 61.3% | 44.7% | 15.6% | CTIBench authors | June 11, 2024 | Link |
| 5 | GPT-3.5 (gpt-3.5-turbo-0613), OpenAI | 54.1% | 67.2% | 31.1% | CTIBench authors | June 11, 2024 | Link |
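
The table reports accuracy for the multiple-choice and root-cause-mapping tasks and F1 for technique extraction. As a rough illustration of what those columns measure, here is a minimal sketch. It assumes exact-match scoring for accuracy and micro-averaged, set-based F1 over ATT&CK technique IDs; the function names `accuracy` and `micro_f1` are hypothetical, and the benchmark's own evaluation scripts may normalize answers or average differently.

```python
from typing import Sequence, Set


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of exact matches (case- and whitespace-insensitive).

    Assumed scoring for CTI-MCQ (answer letters) and CTI-RCM (CWE IDs).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def micro_f1(predicted: Sequence[Set[str]], reference: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over sets of technique IDs (assumed CTI-ATE scoring).

    True positives, false positives, and false negatives are pooled
    across all samples before precision and recall are computed.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)   # techniques correctly extracted
        fp += len(pred - ref)   # extracted but not in the ground truth
        fn += len(ref - pred)   # in the ground truth but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    print(accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
    print(micro_f1([{"T1059", "T1566"}, {"T1027"}],
                   [{"T1059"}, {"T1027", "T1105"}]))
```

Pooling counts before computing precision and recall means samples with many techniques weigh more than samples with few; a macro average over samples would weight each report equally instead.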