Security Knowledge, Multiple Choice, Knowledge Retention, Logical Reasoning

CS-Eval

A Comprehensive Large Language Model Benchmark for CyberSecurity

Quick Stats

Top Score

87.6%

Models Evaluated

Dataset Size

4,369 samples

Last Updated

November 25, 2024

Availability

Dataset ✓, Code ✓

Metrics Tracked

average accuracy

Sources

Project

Dataset Information

Bilingual (English/Chinese) benchmark with 4,369 expert-crafted questions across 11 major categories and 42 subcategories, organized into three cognitive levels: knowledge, ability, and application. Includes multiple-choice, multiple-answer, true/false, subjective, and experimental question formats.
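
As a rough illustration of how the tracked metric could be broken down by the three cognitive levels, here is a minimal sketch of exact-match accuracy for the closed-form question types (multiple-choice, multiple-answer, true/false). The `Item` record layout, the field names, and the `normalize` helper are assumptions made for this example, not the CS-Eval schema or scoring code. Subjective and experimental questions would need a different grading method and are not covered.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout; the real CS-Eval schema may differ.
@dataclass
class Item:
    level: str       # "knowledge", "ability", or "application"
    answer: str      # gold label, e.g. "A", or "AC" for multiple-answer
    prediction: str  # model output, already reduced to a label string


def normalize(label: str) -> str:
    """Uppercase and sort the characters so that 'ca' matches 'AC'."""
    return "".join(sorted(label.strip().upper()))


def accuracy_by_level(items):
    """Return (per-level accuracy dict, overall accuracy) using exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.level] += 1
        if normalize(item.prediction) == normalize(item.answer):
            correct[item.level] += 1
    per_level = {lvl: correct[lvl] / total[lvl] for lvl in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_level, overall


# Toy data for illustration only; not taken from the benchmark.
sample = [
    Item("knowledge", "A", "a"),
    Item("knowledge", "B", "C"),
    Item("ability", "AC", "ca"),
    Item("application", "D", "D"),
]
per_level, overall = accuracy_by_level(sample)
print(per_level, overall)
```

The overall figure here is a micro-average over all items. Whether the reported "average accuracy" is computed that way or as a mean over categories is not stated on this page.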

Number of Tasks

Knowledge Assessment, Ability Evaluation, Application Testing

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Model ID	Provider	Average accuracy	Evaluated by	Date
1	GPT-4	gpt-4-0613	OpenAI	87.6%	CS-Eval authors	November 25, 2024
2	GPT-4o	gpt-4o	OpenAI	86.1%	CS-Eval authors	November 25, 2024
3	Llama 3.1	llama-3.1-70b-instruct	Meta	84.3%	CS-Eval authors	November 25, 2024
4	GPT-3.5	gpt-3.5-turbo-1106	OpenAI	80.6%	CS-Eval authors	November 25, 2024
5	Llama 3.1	llama-3.1-8b-instruct	Meta	77.3%	CS-Eval authors	November 25, 2024
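
For readers who want to work with these numbers, the following small sketch ranks the models and computes each one's gap to the top score in percentage points. The scores are copied from the table above. The function name `gaps_to_top` and the dictionary layout are illustrative choices, not part of any CS-Eval tooling.

```python
# Average accuracy (%) per model ID, copied from the results table.
scores = {
    "gpt-4-0613": 87.6,
    "gpt-4o": 86.1,
    "llama-3.1-70b-instruct": 84.3,
    "gpt-3.5-turbo-1106": 80.6,
    "llama-3.1-8b-instruct": 77.3,
}


def gaps_to_top(scores):
    """Return (model, score, gap to the best score) sorted best-first."""
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Round to one decimal place to hide floating-point noise.
    return [(name, score, round(top - score, 1)) for name, score in ranked]


for name, score, gap in gaps_to_top(scores):
    print(f"{name:<26} {score:5.1f}%  (-{gap:.1f} pts)")
```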