Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Threat Intelligence, CTI Extraction, Threat Actor Analysis, IOC Identification

CTIBench

A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Quick Stats

Top Score

71.0%

Models Evaluated

5

Dataset Size

2,500 samples

Last Updated

June 11, 2024

Availability

Dataset ✓, Code ✓
Metrics Tracked
cti mcq-accuracy, cti rcm-accuracy, cti ate-f1
Dataset Information

Five evaluation tasks: CTI-MCQ (2,500 multiple-choice questions from MITRE/CWE/standards), CTI-RCM (1,000 CVE→CWE root-cause mappings), CTI-VSP (CVSS v3 severity prediction), CTI-ATE (MITRE ATT&CK technique extraction), and CTI-TAA (threat actor attribution).
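
The tracked metrics are accuracy (CTI-MCQ, CTI-RCM) and F1 (CTI-ATE). As a rough illustration of what these numbers measure, the sketch below computes exact-match accuracy over answer labels and a set-based F1 over extracted ATT&CK technique IDs for a single sample. The function names and the matching rules (case-insensitive comparison, set overlap) are assumptions made for illustration; this page does not describe the exact scoring procedure used by the CTIBench authors.

```python
from typing import Iterable, Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answer.

    Assumption: matching is case-insensitive after trimming whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between a predicted and a reference set of technique IDs.

    Assumption: duplicates and ordering are ignored (plain set overlap).
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing to extract and nothing extracted
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data, not taken from the benchmark.
mcq_score = accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])
ate_score = set_f1({"T1059", "T1566"}, {"T1059", "T1027"})
print(f"accuracy={mcq_score:.2f}, f1={ate_score:.2f}")
```

In this toy example three of four multiple-choice answers match, and one of two predicted technique IDs overlaps with one of two reference IDs, so precision and recall are both 0.5.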

Number of Tasks

3

CTI Knowledge Evaluation, Threat Analysis, IOC Extraction
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                Model ID              Provider  cti mcq-accuracy  cti rcm-accuracy  cti ate-f1  Evaluated By      Date
1     GPT-4                gpt-4-turbo           OpenAI    71.0%             72.0%             63.9%       CTIBench authors  June 11, 2024
2     Llama 3 70B Chat     llama-3-70b-instruct  Meta      65.7%             65.9%             47.2%       CTIBench authors  June 11, 2024
3     Gemini 1.5 Pro       gemini-1.5-pro        Google    65.4%             66.6%             46.1%       CTIBench authors  June 11, 2024
4     Llama 3 8B Instruct  llama-3-8b-instruct   Meta      61.3%             44.7%             15.6%       CTIBench authors  June 11, 2024
5     GPT-3.5              gpt-3.5-turbo-0613    OpenAI    54.1%             67.2%             31.1%       CTIBench authors  June 11, 2024
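
The listed ranks follow the cti mcq-accuracy column, and the order changes under the other two metrics. The short sketch below re-sorts the five models by each metric using the scores reported above; the dictionary layout, the short metric keys, and the helper name `rank_by` are illustrative choices, not part of the benchmark's code.

```python
# Scores copied from the results listed above (percentages).
SCORES = {
    "GPT-4": {"mcq": 71.0, "rcm": 72.0, "ate": 63.9},
    "Llama 3 70B Chat": {"mcq": 65.7, "rcm": 65.9, "ate": 47.2},
    "Gemini 1.5 Pro": {"mcq": 65.4, "rcm": 66.6, "ate": 46.1},
    "Llama 3 8B Instruct": {"mcq": 61.3, "rcm": 44.7, "ate": 15.6},
    "GPT-3.5": {"mcq": 54.1, "rcm": 67.2, "ate": 31.1},
}


def rank_by(metric: str) -> list[str]:
    """Return model names sorted from highest to lowest score on `metric`."""
    return sorted(SCORES, key=lambda model: SCORES[model][metric], reverse=True)


for metric in ("mcq", "rcm", "ate"):
    print(metric, "->", ", ".join(rank_by(metric)))
```

Sorting by cti rcm-accuracy, for instance, moves GPT-3.5 from fifth place to second, behind only GPT-4.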
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub