
Evaluating Large Language Models for Offensive Cyber Operation Capabilities
Top Score: 91.8%
Models Evaluated: 5
Dataset Size: 180 samples
Last Updated: February 18, 2025
Availability: A lightweight operational evaluation framework. The primary TACTL (Threat Actor Competency Test for LLMs) corpus contains 180 multiple-choice questions whose variables are generated dynamically to mitigate memorization. OCCULT also includes the Ground2Crown scenario (30 TACTL questions spanning all 14 MITRE ATT&CK Tactics and 44 Techniques), a BloodHound Equivalency test on synthetic Active Directory data, and CyberLayer cyber attack simulations.
Number of Tasks: 3
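
The description above says TACTL questions use dynamically generated variables to mitigate memorization. The page gives no details of that mechanism, so the following is only a minimal sketch of the general idea: a question template whose concrete values are drawn fresh for each instance. The `QuestionInstance` type, the `render_question` function, and the sample question itself are hypothetical and are not taken from the TACTL corpus.

```python
import random
from dataclasses import dataclass


@dataclass
class QuestionInstance:
    prompt: str
    choices: list[str]
    answer_index: int  # index of the correct choice in `choices`


def render_question(seed: int) -> QuestionInstance:
    """Render one instance of a hypothetical templated question."""
    rng = random.Random(seed)  # seeded so a given instance can be reproduced

    # Dynamically generated variables: the values change per instance, so an
    # answer string memorized from one rendering does not carry over.
    host = f"10.{rng.randint(0, 255)}.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
    port = rng.choice([8080, 8443, 9000, 9090])

    prompt = f"Which nmap command scans only TCP port {port} on host {host}?"
    correct = f"nmap -p {port} {host}"
    distractors = [
        f"nmap -sn {host}",               # host discovery only, no port scan
        f"nmap -p- {host}",               # scans every port, not just one
        f"nmap -sU -p {port} {host}",     # UDP scan rather than TCP
    ]

    # Shuffle so the position of the correct answer also varies.
    choices = [correct] + distractors
    rng.shuffle(choices)
    return QuestionInstance(prompt, choices, choices.index(correct))


if __name__ == "__main__":
    question = render_question(seed=7)
    print(question.prompt)
    for label, choice in zip("ABCD", question.choices):
        print(f"  {label}. {choice}")
```

Seeding the generator keeps each instance reproducible for auditing, while a new seed per evaluation run yields different hosts, ports, and answer positions.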
| Rank | Model | Model ID | Developer | TACTL Score | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-R1 | deepseek-r1-0528 | DeepSeek | 91.8% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 2 | Llama 3.1 405B Instruct | llama-3.1-405b-instruct | Meta | 88.5% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 3 | DeepSeek V3 | deepseek-v3-671b | DeepSeek | 86.3% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 4 | GPT-4o | gpt-4o | OpenAI | 85.2% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 5 | Llama 3.3 70B Instruct | llama-3.3-70b-instruct | Meta | 78.7% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
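
The page does not state how the TACTL score is computed. Assuming it is plain multiple-choice accuracy (the percentage of questions answered correctly), a score and a ranking like the one in the table could be derived roughly as follows. The function names `accuracy_percent` and `rank_models` are hypothetical, and the answer lists in the usage example are toy data, not evaluation results.

```python
def accuracy_percent(predicted: list[str], expected: list[str]) -> float:
    """Percentage of answers that match the key, rounded to one decimal."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return round(100.0 * correct / len(expected), 1)


def rank_models(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Order models by score, highest first, and attach a 1-based rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Toy answer key and model output, for illustration only.
    key = ["A", "B", "C", "A"]
    answers = ["A", "B", "C", "D"]
    print(accuracy_percent(answers, key))

    # Two scores copied from the table above, ranked highest first.
    for rank, name, score in rank_models({"gpt-4o": 85.2, "deepseek-r1-0528": 91.8}):
        print(rank, name, f"{score}%")
```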