
A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity
Top Score: 69.9%
Models Evaluated: 3
Dataset Size: N/A
Last Updated: January 1, 2024
Availability: Multi-task cybersecurity benchmark developed by JPMorgan Chase for comprehensive LLM evaluation across security domains
Number of Tasks: 3

| Rank | Model | Accuracy | Evaluated By | Date | Source |
|---|---|---|---|---|---|
| 1st | GPT-4 (gpt-4-0613), OpenAI | 69.9% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 2nd | GPT-3.5 (gpt-3.5-turbo-0613), OpenAI | 62.6% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 3rd | Llama 2 (llama-2-7b-chat), Meta | 50.6% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |