Penetration Testing, Exploit Generation, Vulnerability Discovery, Automated Pentesting

OCCULT

Evaluating Large Language Models for Offensive Cyber Operation Capabilities

Quick Stats

Top Score

91.8%

Models Evaluated

5

Dataset Size

180 samples

Last Updated

February 18, 2025

Availability

Dataset ✓, Code ✓

Metrics Tracked

TACTL score

Sources

Leaderboard

Dataset Information

OCCULT is a lightweight operational evaluation framework. Its primary corpus, TACTL (Threat Actor Competency Test for LLMs), contains 180 multiple-choice questions whose variables are generated dynamically to mitigate memorization. OCCULT also includes the Ground2Crown scenario (30 TACTL questions spanning all 14 MITRE ATT&CK tactics and 44 techniques), a BloodHound Equivalency test on synthetic Active Directory data, and CyberLayer cyber-attack simulations.
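
The idea of dynamically generated variables can be illustrated with a small sketch. The template, the field names, and the `render_question` and `grade` helpers below are hypothetical and are not taken from the OCCULT code or the TACTL corpus. The sketch only shows how filling a question template with fresh values (here a random IP address and port) and shuffling the answer options gives a different surface form on each run, while the correct answer stays tracked for grading.

```python
import random

# Hypothetical question template; not drawn from the TACTL corpus.
TEMPLATE = {
    "stem": "A scan of {ip} shows TCP port {port} open. "
            "Which service is most likely listening?",
    "port_to_service": {22: "SSH", 445: "SMB", 3389: "RDP", 5985: "WinRM"},
}


def render_question(template, rng):
    """Fill the template with fresh variables and shuffle the options."""
    # Pick the port being asked about; its service is the correct answer.
    port, answer = rng.choice(sorted(template["port_to_service"].items()))
    # Generate a throwaway private-range address for the question stem.
    ip = "10.{}.{}.{}".format(
        rng.randint(0, 255), rng.randint(0, 255), rng.randint(1, 254)
    )
    options = list(template["port_to_service"].values())
    rng.shuffle(options)
    return {
        "stem": template["stem"].format(ip=ip, port=port),
        "options": options,
        "answer_index": options.index(answer),
    }


def grade(question, chosen_index):
    """Return True when the chosen option is the tracked correct answer."""
    return chosen_index == question["answer_index"]


if __name__ == "__main__":
    q = render_question(TEMPLATE, random.Random())
    print(q["stem"])
    for i, opt in enumerate(q["options"]):
        print("  {}. {}".format(chr(ord("A") + i), opt))
```

Because the values and the option order change on every call, a model cannot score well by recalling a fixed answer letter or a fixed string from a previously seen copy of the question.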

Number of Tasks

Threat Actor Competency Test, Offensive Simulation, MITRE CyberLayer Operations

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Model ID	Developer	TACTL score	Evaluated By	Date
1	DeepSeek-R1	deepseek-r1-0528	DeepSeek	91.8%	Frontier AI Cybersecurity Observatory	April 18, 2025
2	Llama 3.1 405B Instruct	llama-3.1-405b-instruct	Meta	88.5%	Frontier AI Cybersecurity Observatory	April 18, 2025
3	DeepSeek V3	deepseek-v3-671b	DeepSeek	86.3%	Frontier AI Cybersecurity Observatory	April 18, 2025
4	GPT-4o	gpt-4o	OpenAI	85.2%	Frontier AI Cybersecurity Observatory	April 18, 2025
5	Llama 3.3 70B Instruct	llama-3.3-70b-instruct	Meta	78.7%	Frontier AI Cybersecurity Observatory	April 18, 2025
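
For readers who want to work with these results programmatically, here is a minimal sketch. The scores are copied from the table above, but the `RESULTS` layout and the `ranked`, `top_score`, and `spread` helpers are illustrative assumptions, not part of any published OCCULT tooling.

```python
# Scores copied from the results table; the data layout is illustrative only.
RESULTS = [
    ("DeepSeek-R1", 91.8),
    ("Llama 3.1 405B Instruct", 88.5),
    ("DeepSeek V3", 86.3),
    ("GPT-4o", 85.2),
    ("Llama 3.3 70B Instruct", 78.7),
]


def ranked(results):
    """Return (model, score) pairs ordered from highest to lowest score."""
    return sorted(results, key=lambda row: row[1], reverse=True)


def top_score(results):
    """Return the highest TACTL score in the result set."""
    return max(score for _, score in results)


def spread(results):
    """Return the gap in percentage points between best and worst score."""
    scores = [score for _, score in results]
    # Round to one decimal place to avoid floating-point noise.
    return round(max(scores) - min(scores), 1)


if __name__ == "__main__":
    for rank, (model, score) in enumerate(ranked(RESULTS), start=1):
        print("{}\t{}\t{:.1f}%".format(rank, model, score))
    print("Top score: {:.1f}%".format(top_score(RESULTS)))
    print("Spread: {:.1f} points".format(spread(RESULTS)))
```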