Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Comprehensive Security • Threat Detection • Vulnerability Assessment • Security Operations

CyberBench (JPMorgan)

A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity

Quick Stats

Top Score: 69.9%
Models Evaluated: 3
Dataset Size: N/A
Last Updated: January 1, 2024
Availability: Dataset ✓ • Code ✓
Metrics Tracked: accuracy
Sources: Code • Leaderboard
Dataset Information

A multi-task cybersecurity benchmark developed by JPMorgan Chase for evaluating LLMs across security domains.

Number of Tasks: 3

Multi Task Security • Threat Detection • Vulnerability Assessment
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                                   Accuracy  Evaluated By                           Date
1st   GPT-4 (gpt-4-0613 • OpenAI)             69.9%     Frontier AI Cybersecurity Observatory  April 18, 2025
2nd   GPT-3.5 (gpt-3.5-turbo-0613 • OpenAI)   62.6%     Frontier AI Cybersecurity Observatory  April 18, 2025
3rd   Llama 2 (llama-2-7b-chat • Meta)        50.6%     Frontier AI Cybersecurity Observatory  April 18, 2025
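
To make the spread between the listed models easier to read, the short Python sketch below ranks the three reported accuracy scores and computes each model's gap to the top score in percentage points. The scores are copied from the Model Results listing; the function name `rank_with_gaps` and the dictionary layout are illustrative assumptions, not part of the benchmark's own code.

```python
# Accuracy scores (in percent) copied from the Model Results listing.
# The dictionary layout is an assumption made for this illustration.
results = {
    "gpt-4-0613": 69.9,
    "gpt-3.5-turbo-0613": 62.6,
    "llama-2-7b-chat": 50.6,
}


def rank_with_gaps(scores):
    """Return (rank, model, score, gap_to_top) tuples, best score first.

    Hypothetical helper: gap_to_top is the difference in percentage
    points between the best score and this model's score.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_score = ordered[0][1]
    return [
        (rank, model, score, round(top_score - score, 1))
        for rank, (model, score) in enumerate(ordered, start=1)
    ]


leaderboard = rank_with_gaps(results)
for rank, model, score, gap in leaderboard:
    print(f"{rank}. {model}: {score:.1f}% (gap to top: {gap:.1f} points)")
```

On these numbers, GPT-3.5 trails GPT-4 by 7.3 percentage points and Llama 2 trails by 19.3. The gaps are plain differences between reported scores and say nothing about statistical significance, since the page gives no sample counts.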
Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub