
A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
Top Score: 91.3%
Models Evaluated: 6
Dataset Size: 10,000 samples
Last Updated: February 12, 2024
Availability: Multiple MCQ versions (80, 500, 2000, and 10,000 questions) generated via RAG from NIST standards, research papers, books, and RFCs
Number of Tasks: 4
| Rank | Model | Model ID | Provider | Accuracy | Evaluated By | Date |
|---|---|---|---|---|---|---|
| 1 | GPT-4o | gpt-4o | OpenAI | 91.3% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
| 2 | GPT-4 | gpt-4-turbo | OpenAI | 91.0% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
| 3 | GPT-3.5 | gpt-3.5-turbo-0613 | OpenAI | 88.1% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
| 4 | Gemini Pro | gemini-pro-1.0 | Google | 84.0% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
| 5 | Llama 3 8B Instruct | llama-3-8b-instruct | Meta | 73.0% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
| 6 | Llama 2 | llama-2-70b-chat | Meta | 72.6% | Frontier AI Cybersecurity Observatory | April 18, 2025 |
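
The page does not say how the accuracy column is computed. A minimal sketch, assuming it is the percentage of multiple-choice questions for which the model's chosen option matches the reference answer, might look like the following. The function name `mcq_accuracy` and the sample answers are hypothetical and are not taken from the benchmark itself.

```python
from typing import Sequence


def mcq_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted option matches the gold option.

    Assumption: answers are single option letters such as "A" to "D";
    comparison ignores case and surrounding whitespace.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty question set")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold)
    )
    return 100.0 * correct / len(gold)


# Hypothetical answers for five questions (illustrative only, not benchmark data).
gold_answers = ["A", "C", "B", "D", "A"]
model_answers = ["A", "C", "D", "D", "a"]

score = mcq_accuracy(model_answers, gold_answers)
print(f"{score:.1f}%")  # prints 80.0% (4 of 5 correct)
```

Under this assumption, a score of 91.3% on the 10,000-question version would correspond to 9,130 correct answers.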