Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Security Knowledge, Multiple Choice, Knowledge Retention

CyberMetric

A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

View Paper
Quick Stats

Top Score

91.3%

Models Evaluated

6

Dataset Size

10,000 samples

Last Updated

February 12, 2024

Availability

Dataset ✓, Code ✓
Metrics Tracked
accuracy
Sources
Leaderboard
Dataset Information

Multiple-choice question (MCQ) sets in four sizes (80, 500, 2,000, and 10,000 questions), generated via retrieval-augmented generation (RAG) from NIST standards, research papers, books, and RFCs.

Number of Tasks

4

Cybersecurity Knowledge, Cryptography, Reverse Engineering, Risk Assessment
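
The single tracked metric, accuracy, is the share of multiple-choice questions answered correctly. The snippet below is a minimal sketch of that calculation, not the benchmark's own scoring code: the function name `mcq_accuracy`, the letter-based answer format, and the sample answers are all assumptions made for illustration.

```python
from typing import Sequence


def mcq_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted option matches the gold option.

    Assumes (hypothetically) that answers are option letters such as "A".."D";
    comparison ignores case and surrounding whitespace.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


# Made-up answers for four questions; not taken from the dataset.
gold_answers = ["A", "C", "B", "D"]
model_answers = ["A", "C", "D", "D"]

score = mcq_accuracy(model_answers, gold_answers)
print(f"{score:.1%}")  # 75.0%
```

Under this definition, a reported score such as 91.3% means roughly 913 correct answers per 1,000 questions in the evaluated question set.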
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                Model ID             Provider  Accuracy  Evaluated By                           Date
1     GPT-4o               gpt-4o               OpenAI    91.3%     Frontier AI Cybersecurity Observatory  April 18, 2025
2     GPT-4                gpt-4-turbo          OpenAI    91.0%     Frontier AI Cybersecurity Observatory  April 18, 2025
3     GPT-3.5              gpt-3.5-turbo-0613   OpenAI    88.1%     Frontier AI Cybersecurity Observatory  April 18, 2025
4     Gemini Pro           gemini-pro-1.0       Google    84.0%     Frontier AI Cybersecurity Observatory  April 18, 2025
5     Llama 3 8B Instruct  llama-3-8b-instruct  Meta      73.0%     Frontier AI Cybersecurity Observatory  April 18, 2025
6     Llama 2              llama-2-70b-chat     Meta      72.6%     Frontier AI Cybersecurity Observatory  April 18, 2025
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub