Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Security Knowledge, Knowledge Retention, Logical Reasoning

SECURE

Security Extraction, Understanding & Reasoning Evaluation - Benchmarking LLMs for Cybersecurity

View Paper
Quick Stats

Top Score

88.6%

Models Evaluated

6

Dataset Size

6 datasets

Last Updated

May 30, 2024

Availability

Dataset ✓, Code ✓
Metrics Tracked
maet accuracy, cwet accuracy, kcv accuracy, vood ood-accuracy
Dataset Information

Six datasets focused on the Industrial Control System (ICS) sector, evaluating knowledge extraction, understanding, and reasoning over industry-standard sources.

Number of Tasks

3

Knowledge Extraction, Understanding, Reasoning
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                     Model ID                  Provider    maet accuracy  cwet accuracy  kcv accuracy  vood ood-accuracy
1     GPT-4                     gpt-4-turbo               OpenAI      88.6%          89.6%          87.6%         87.9%
2     Llama 3 70B Chat          llama-3-70b-instruct      Meta        86.3%          90.4%          85.2%         27.1%
3     Gemini Pro                gemini-pro-1.0            Google      86.2%          87.8%          83.5%         86.7%
4     GPT-3.5                   gpt-3.5-turbo-0613        OpenAI      82.8%          84.2%          78.3%         8.4%
5     Llama 3 8B Instruct       llama-3-8b-instruct       Meta        82.1%          83.9%          82.8%         56.4%
6     Mistral 7B Instruct v0.2  mistral-7b-instruct-v0.2  Mistral AI  77.9%          80.1%          64.2%         57.1%

Evaluated By (all rows): SECURE authors. Date (all rows): May 30, 2024.
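
The listed order and the 88.6% Top Score both match the maet accuracy column, which suggests the page ranks on that metric. This is an inference from the numbers, not something the page states. The minimal Python sketch below re-ranks the six models on any of the four tracked metrics, using only the scores shown above. The names METRICS, SCORES, rank_by, and top_model are illustrative and do not come from the benchmark's own code.

```python
# Scores copied from the Model Results listing above (percent).
# Tuple order follows METRICS. All names here are illustrative assumptions.
METRICS = ("maet accuracy", "cwet accuracy", "kcv accuracy", "vood ood-accuracy")

SCORES = {
    "GPT-4": (88.6, 89.6, 87.6, 87.9),
    "Llama 3 70B Chat": (86.3, 90.4, 85.2, 27.1),
    "Gemini Pro": (86.2, 87.8, 83.5, 86.7),
    "GPT-3.5": (82.8, 84.2, 78.3, 8.4),
    "Llama 3 8B Instruct": (82.1, 83.9, 82.8, 56.4),
    "Mistral 7B Instruct v0.2": (77.9, 80.1, 64.2, 57.1),
}


def rank_by(metric: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from best to worst on one metric."""
    column = METRICS.index(metric)  # raises ValueError for an unknown metric
    pairs = [(model, scores[column]) for model, scores in SCORES.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def top_model(metric: str) -> tuple[str, float]:
    """Return the best (model, score) pair for one metric."""
    return rank_by(metric)[0]


if __name__ == "__main__":
    for metric in METRICS:
        model, score = top_model(metric)
        print(f"{metric}: {model} ({score}%)")
```

Ranking on a different column changes the picture: Llama 3 70B Chat leads on cwet accuracy, while its vood ood-accuracy is far below GPT-4 and Gemini Pro, so a single "Top Score" hides real differences between metrics.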
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub