Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Penetration Testing • Exploit Generation • Vulnerability Discovery • Automated Pentesting

OCCULT

Evaluating Large Language Models for Offensive Cyber Operation Capabilities

View Paper
Quick Stats

Top Score

91.8%

Models Evaluated

5

Dataset Size

180 samples

Last Updated

February 18, 2025

Availability

Dataset ✓ • Code ✓
Metrics Tracked
TACTL score
Sources
Leaderboard
Dataset Information

OCCULT is a lightweight operational evaluation framework. Its primary corpus, TACTL (Threat Actor Competency Test for LLMs), contains 180 multiple-choice questions whose variables are generated dynamically to mitigate memorization. OCCULT also includes the Ground2Crown scenario (30 TACTL questions spanning all 14 MITRE ATT&CK Tactics and 44 Techniques), a BloodHound Equivalency test on synthetic Active Directory data, and CyberLayer cyber attack simulations.
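
The idea of dynamically generated variables can be illustrated with a small sketch. This is not the OCCULT or TACTL generator: the function name `make_question`, the subnetting topic, and the distractor scheme are all assumptions chosen for illustration. It shows how a seeded template can produce a different concrete question on each run, so a model cannot answer from a memorized question-and-answer pair.

```python
import ipaddress
import random

LETTERS = "ABCD"


def make_question(seed: int) -> dict:
    """Hypothetical templated multiple-choice item; variables depend on the seed."""
    rng = random.Random(seed)

    # Dynamic variables: a random prefix length and a random host in 10.0.0.0/8.
    prefix = rng.choice([24, 25, 26, 27, 28])
    low = int(ipaddress.IPv4Address("10.0.1.0"))
    high = int(ipaddress.IPv4Address("10.255.254.255"))
    host = ipaddress.IPv4Address(rng.randint(low, high))

    # The correct answer is computed from the variables, never stored verbatim.
    net = ipaddress.ip_network(f"{host}/{prefix}", strict=False)
    correct = net.broadcast_address

    # Distractors are nearby addresses, all guaranteed distinct from the answer.
    options = [correct, net.network_address, correct + 1, net.network_address - 1]
    rng.shuffle(options)

    return {
        "prompt": f"What is the broadcast address of the subnet containing {host}/{prefix}?",
        "options": {letter: str(opt) for letter, opt in zip(LETTERS, options)},
        "answer": LETTERS[options.index(correct)],
    }


if __name__ == "__main__":
    # Two seeds give two different concrete questions from one template.
    for seed in (1, 2):
        q = make_question(seed)
        print(q["prompt"], q["options"], "answer:", q["answer"])
```

Because the answer is derived from the generated values, the same template yields many distinct items while scoring stays automatic; reusing a seed reproduces an item exactly.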

Number of Tasks

3

Threat Actor Competency Test • Offensive Simulation • MITRE CyberLayer Operations
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | TACTL score | Evaluated By | Date | Source
1 | DeepSeek-R1 (deepseek-r1-0528 • DeepSeek) | 91.8% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link
2 | Llama 3.1 405B Instruct (llama-3.1-405b-instruct • Meta) | 88.5% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link
3 | DeepSeek V3 (deepseek-v3-671b • DeepSeek) | 86.3% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link
4 | GPT-4o (gpt-4o • OpenAI) | 85.2% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link
5 | Llama 3.3 70B Instruct (llama-3.3-70b-instruct • Meta) | 78.7% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub