Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Threat Intelligence, CTI Extraction, Threat Actor Analysis, IOC Identification

CTIArena

Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence

Quick Stats

Top Score

71.2% (CSC F1)

Models Evaluated

5

Dataset Size

691 samples

Last Updated

October 13, 2025

Availability

Dataset ✓, Code ✓
Metrics Tracked
CSC F1, TAP F1, MLA F1
Dataset Information

691 QA pairs across 9 CTI tasks in three categories: 371 structured (CTI-RCM, CTI-WIM, CTI-ATD, CTI-ESD), 150 unstructured (CTI-MLA, CTI-TAP, CTI-CSC), and 170 hybrid (CTI-VCA, CTI-ATA). Built from CVE/CWE/CAPEC/ATT&CK and vendor reports, with a pipeline that combines LLM-as-judge scoring and human cross-verification.
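
The composition above can be checked with a few lines of arithmetic. The following is a minimal sketch that only restates the counts and task names given in the description; the `CATEGORIES` structure and the `summarize` helper are hypothetical names chosen for illustration and are not part of the CTIArena code or dataset format.

```python
# Category sizes and task names copied from the dataset description.
# The dict layout is an assumption made for this illustration only.
CATEGORIES = {
    "structured": {"count": 371, "tasks": ["CTI-RCM", "CTI-WIM", "CTI-ATD", "CTI-ESD"]},
    "unstructured": {"count": 150, "tasks": ["CTI-MLA", "CTI-TAP", "CTI-CSC"]},
    "hybrid": {"count": 170, "tasks": ["CTI-VCA", "CTI-ATA"]},
}


def summarize(categories):
    """Return total samples, total tasks, and each category's share in percent."""
    total_samples = sum(c["count"] for c in categories.values())
    total_tasks = sum(len(c["tasks"]) for c in categories.values())
    shares = {
        name: round(100 * c["count"] / total_samples, 1)
        for name, c in categories.items()
    }
    return total_samples, total_tasks, shares


total_samples, total_tasks, shares = summarize(CATEGORIES)
print(total_samples, total_tasks, shares)
```

The three category counts sum to the stated 691 samples, and the listed task names add up to the stated 9 tasks, so the structured category makes up a little over half of the benchmark.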

Number of Task Categories

3

Structured CTI Analysis, Unstructured CTI Analysis, Hybrid CTI Analysis
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model             Model ID          Vendor   CSC F1  TAP F1  MLA F1  Evaluated By      Date              Source
1     Qwen 3            qwen3-235b        Alibaba  71.2%   70.9%   40.7%   CTIArena authors  October 13, 2025  Link
2     Gemini 2.5 Pro    gemini-2.5-pro    Google   69.4%   60.8%   44.1%   CTIArena authors  October 13, 2025  Link
3     Gemini 2.5 Flash  gemini-2.5-flash  Google   69.4%   69.4%   32.8%   CTIArena authors  October 13, 2025  Link
4     GPT-5             gpt-5             OpenAI   67.1%   58.3%   39.4%   CTIArena authors  October 13, 2025  Link
5     GPT-4o            gpt-4o            OpenAI   66.0%   67.1%   39.3%   CTIArena authors  October 13, 2025  Link
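
The page's ranking appears to follow the CSC F1 column, which means the order changes if a different metric is used. The sketch below re-ranks the five models from the results table by any of the three tracked metrics. The scores are copied from the table; the `RESULTS` layout and the `rank_by` and `best_per_metric` helpers are hypothetical names for illustration, not part of the benchmark's tooling.

```python
# Scores (in percent) copied from the results table above.
METRICS = ("csc_f1", "tap_f1", "mla_f1")

RESULTS = [
    {"model": "Qwen 3", "csc_f1": 71.2, "tap_f1": 70.9, "mla_f1": 40.7},
    {"model": "Gemini 2.5 Pro", "csc_f1": 69.4, "tap_f1": 60.8, "mla_f1": 44.1},
    {"model": "Gemini 2.5 Flash", "csc_f1": 69.4, "tap_f1": 69.4, "mla_f1": 32.8},
    {"model": "GPT-5", "csc_f1": 67.1, "tap_f1": 58.3, "mla_f1": 39.4},
    {"model": "GPT-4o", "csc_f1": 66.0, "tap_f1": 67.1, "mla_f1": 39.3},
]


def rank_by(results, metric):
    """Return model names sorted by one metric, highest first.

    sorted() is stable, so models with equal scores keep their input order.
    """
    ordered = sorted(results, key=lambda r: r[metric], reverse=True)
    return [r["model"] for r in ordered]


def best_per_metric(results, metrics=METRICS):
    """Return the top-scoring model name for each metric."""
    return {m: max(results, key=lambda r: r[m])["model"] for m in metrics}


for metric in METRICS:
    print(metric, rank_by(RESULTS, metric))
print(best_per_metric(RESULTS))
```

Two things stand out when the table is read this way: Gemini 2.5 Pro and Gemini 2.5 Flash tie on CSC F1 at 69.4%, so their relative rank depends on a tie-break the page does not state, and Gemini 2.5 Pro leads on MLA F1 even though Qwen 3 leads on the other two metrics.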
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub