Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Security Knowledge, Multiple Choice, Knowledge Retention, Logical Reasoning

CS-Eval

A Comprehensive Large Language Model Benchmark for CyberSecurity

View Paper
Quick Stats

Top Score

87.6%

Models Evaluated

5

Dataset Size

4,369 samples

Last Updated

November 25, 2024

Availability

Dataset ✓, Code ✓
Metrics Tracked
average accuracy
Sources
Project
Dataset Information

Bilingual (English/Chinese) benchmark with 4,369 expert-crafted questions across 11 major categories and 42 subcategories, organized into three cognitive levels: knowledge, ability, and application. Includes multiple-choice, multiple-answer, true/false, subjective, and experimental question formats.
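
The tracked metric is average accuracy, and the questions are grouped into categories. This page does not say whether the average weights every question equally or every category equally, so the sketch below computes both. It is an illustration under assumptions, not the CS-Eval authors' scoring code: the function name `accuracy_summary` and the `(category, is_correct)` record format are hypothetical.

```python
from collections import defaultdict


def accuracy_summary(records):
    """Summarize graded answers.

    `records` is an iterable of (category, is_correct) pairs. This input
    format is an assumption for illustration, not the CS-Eval data schema.
    Returns (per_category, micro_average, macro_average).
    """
    totals = defaultdict(int)   # questions seen per category
    correct = defaultdict(int)  # correct answers per category

    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(bool(is_correct))

    if not totals:
        raise ValueError("no records to score")

    # Accuracy within each category.
    per_category = {c: correct[c] / totals[c] for c in totals}

    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(totals.values())

    # Macro average: every category counts equally, regardless of size.
    macro = sum(per_category.values()) / len(per_category)

    return per_category, micro, macro


if __name__ == "__main__":
    # Made-up graded answers, only to show the call shape.
    sample = [
        ("cryptography", True),
        ("cryptography", False),
        ("network security", True),
    ]
    per_category, micro, macro = accuracy_summary(sample)
    print(per_category, micro, macro)
```

The two averages diverge when categories differ in size, which matters when comparing scores that were aggregated differently.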

Number of Tasks

3

Knowledge Assessment, Ability Evaluation, Application Testing
[Figure: Performance Comparison. Visual comparison of model performance on this benchmark.]
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model      Model ID                Developer  Average accuracy  Evaluated by     Date
1     GPT-4      gpt-4-0613              OpenAI     87.6%             CS-Eval authors  November 25, 2024
2     GPT-4o     gpt-4o                  OpenAI     86.1%             CS-Eval authors  November 25, 2024
3     Llama 3.1  llama-3.1-70b-instruct  Meta       84.3%             CS-Eval authors  November 25, 2024
4     GPT-3.5    gpt-3.5-turbo-1106      OpenAI     80.6%             CS-Eval authors  November 25, 2024
5     Llama 3.1  llama-3.1-8b-instruct   Meta       77.3%             CS-Eval authors  November 25, 2024
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub