
A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security
Top Score
99.1% (SecQA v1, 0-shot accuracy)
Models Evaluated
8
Dataset Size
242 samples
Last Updated
December 26, 2023
Availability
Multiple-choice question-answering dataset generated from the textbook "Computer Systems Security: Planning for Success", in two versions of increasing complexity. SecQA v1 has 127 questions (dev 5 / val 12 / test 110); SecQA v2 has 115 questions (dev 5 / val 10 / test 100). Together they total 242 questions, 210 of them in the combined test split.
Number of Tasks
2
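The split arithmetic in the dataset description can be checked with a short sketch. The split sizes are copied from the description; the names `SPLITS`, `version_totals`, and `combined_split` are illustrative only and are not part of any SecQA tooling.

```python
# Split sizes as listed in the dataset description (dev / val / test).
SPLITS = {
    "SecQA v1": {"dev": 5, "val": 12, "test": 110},
    "SecQA v2": {"dev": 5, "val": 10, "test": 100},
}


def version_totals(splits):
    """Return the total number of questions in each dataset version."""
    return {name: sum(parts.values()) for name, parts in splits.items()}


def combined_split(splits, split_name):
    """Return the size of one split summed across all versions."""
    return sum(parts[split_name] for parts in splits.values())


totals = version_totals(SPLITS)
total_questions = sum(totals.values())
test_questions = combined_split(SPLITS, "test")

print(totals)
print("all questions:", total_questions)
print("combined test split:", test_questions)
```

With a 110-question v1 test split, one question is worth roughly 0.9 percentage points of accuracy; with the 100-question v2 test split, one question is worth exactly 1 point.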
| Rank | Model | SecQA v1 0-shot accuracy | SecQA v1 5-shot accuracy | SecQA v2 0-shot accuracy | SecQA v2 5-shot accuracy | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-3.5 (gpt-3.5-turbo-0613), OpenAI | 99.1% | 99.1% | 98.0% | 98.0% | SecQA authors | December 26, 2023 |
| 2 | GPT-4 (gpt-4-0613), OpenAI | 99.1% | 100.0% | 98.0% | 98.0% | SecQA authors | December 26, 2023 |
| 3 | Mistral 7B Instruct v0.2 (mistral-7b-instruct-v0.2), Mistral AI | 90.9% | 90.9% | 89.0% | 87.0% | SecQA authors | December 26, 2023 |
| 4 | Zephyr 7B Beta (zephyr-7b-beta), Hugging Face | 84.6% | 92.7% | 81.0% | 86.0% | SecQA authors | December 26, 2023 |
| 5 | Vicuna 13B v1.5 (vicuna-13b-v1.5), LMSYS | 76.4% | 40.0% | 74.0% | 42.0% | SecQA authors | December 26, 2023 |
| 6 | Llama 2 (llama-2-7b-chat), Meta | 72.7% | 61.8% | 79.0% | 50.0% | SecQA authors | December 26, 2023 |
| 7 | Vicuna 7B v1.5 (vicuna-7b-v1.5), LMSYS | 65.5% | 30.9% | 66.0% | 22.0% | SecQA authors | December 26, 2023 |
| 8 | Llama 2 (llama-2-13b-chat), Meta | 49.1% | 89.1% | 51.0% | 89.0% | SecQA authors | December 26, 2023 |
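
The table mixes models that improve with five in-context examples and models that get markedly worse. As a rough sketch, the 5-shot minus 0-shot change can be computed per model from the accuracies above. The numbers are copied from the table; the names `ROWS` and `few_shot_deltas` are illustrative and not part of the SecQA release.

```python
# (model id, v1 0-shot, v1 5-shot, v2 0-shot, v2 5-shot), accuracies in percent,
# copied from the leaderboard table.
ROWS = [
    ("gpt-3.5-turbo-0613", 99.1, 99.1, 98.0, 98.0),
    ("gpt-4-0613", 99.1, 100.0, 98.0, 98.0),
    ("mistral-7b-instruct-v0.2", 90.9, 90.9, 89.0, 87.0),
    ("zephyr-7b-beta", 84.6, 92.7, 81.0, 86.0),
    ("vicuna-13b-v1.5", 76.4, 40.0, 74.0, 42.0),
    ("llama-2-7b-chat", 72.7, 61.8, 79.0, 50.0),
    ("vicuna-7b-v1.5", 65.5, 30.9, 66.0, 22.0),
    ("llama-2-13b-chat", 49.1, 89.1, 51.0, 89.0),
]


def few_shot_deltas(rows):
    """Return {model: (v1 delta, v2 delta)} in percentage points (5-shot minus 0-shot)."""
    return {
        model: (round(v1_5 - v1_0, 1), round(v2_5 - v2_0, 1))
        for model, v1_0, v1_5, v2_0, v2_5 in rows
    }


deltas = few_shot_deltas(ROWS)

# Print models from the largest v1 drop to the largest v1 gain.
for model, (d1, d2) in sorted(deltas.items(), key=lambda kv: kv[1][0]):
    print(f"{model:28s} v1 {d1:+6.1f}  v2 {d2:+6.1f}")
```

On these figures, llama-2-13b-chat gains 40.0 points on v1 and 38.0 on v2 when given five examples, while both Vicuna models lose more than 30 points on each version.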