
A Comprehensive Large Language Model Benchmark for CyberSecurity
Top Score
87.6%
Models Evaluated
5
Dataset Size
4,369 samples
Last Updated
November 25, 2024
Availability
A bilingual (English/Chinese) benchmark of 4,369 expert-crafted questions spanning 11 major categories and 42 subcategories, organized into three cognitive levels: knowledge, ability, and application. Question formats include multiple-choice, multiple-answer, true/false, subjective, and experimental.
Number of Tasks
3
| Rank | Model | Average Accuracy | Evaluated By | Date | Source |
|---|---|---|---|---|---|
| 1 | GPT-4 (`gpt-4-0613`), OpenAI | 87.6% | CS-Eval authors | November 25, 2024 | Link |
| 2 | GPT-4o (`gpt-4o`), OpenAI | 86.1% | CS-Eval authors | November 25, 2024 | Link |
| 3 | Llama 3.1 (`llama-3.1-70b-instruct`), Meta | 84.3% | CS-Eval authors | November 25, 2024 | Link |
| 4 | GPT-3.5 (`gpt-3.5-turbo-1106`), OpenAI | 80.6% | CS-Eval authors | November 25, 2024 | Link |
| 5 | Llama 3.1 (`llama-3.1-8b-instruct`), Meta | 77.3% | CS-Eval authors | November 25, 2024 | Link |
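
For readers who want to work with these results programmatically, the following is a minimal sketch that re-derives the ranking from the reported average accuracies and computes each model's gap to the top score. The scores are copied from the table; the `leaderboard` list and the `rank_with_gaps` helper are hypothetical names introduced for illustration and are not part of the benchmark's own tooling.

```python
# Rows copied from the leaderboard table: (model identifier, average accuracy in %).
leaderboard = [
    ("gpt-4-0613", 87.6),
    ("gpt-4o", 86.1),
    ("llama-3.1-70b-instruct", 84.3),
    ("gpt-3.5-turbo-1106", 80.6),
    ("llama-3.1-8b-instruct", 77.3),
]


def rank_with_gaps(rows):
    """Sort by accuracy (descending) and attach each model's gap to the leader.

    Hypothetical helper for illustration only. Gaps are rounded to one
    decimal place to match the precision of the reported scores.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    top_score = ordered[0][1]
    return [
        (position, model, score, round(top_score - score, 1))
        for position, (model, score) in enumerate(ordered, start=1)
    ]


ranked = rank_with_gaps(leaderboard)
for position, model, score, gap in ranked:
    print(f"{position}. {model}: {score:.1f}% (gap to top: {gap:.1f} pts)")
```

Under these numbers, the spread between the first and last listed models is 10.3 percentage points, and the two Llama 3.1 variants differ by 7.0 points.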