Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Tags: comprehensive-security · threat-detection · vulnerability-assessment · incident-response · security-operations

ExCyTIn-Bench

Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts

Quick Stats

Top Score

89.2%

Models Evaluated

7

Dataset Size

5,000 samples

Last Updated

October 14, 2024

Paper Details

Title

ExCyTIn-Bench: Evaluating LLM Agents on Cyber Threat Investigation

Authors

Yiran Wu, Mauricio Velazco, Andrew Zhao

+9 more

Published

September 1, 2025

arXiv ID

2507.14201v2
Metrics Tracked
accuracy, security score, false positive rate, detection rate
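Three of these metrics follow the standard confusion-matrix definitions; the sketch below shows how they relate (illustrative only — the benchmark's own "security score" is defined in the paper and is not reproduced here, and the sample counts are hypothetical):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive leaderboard-style metrics from confusion-matrix counts.

    tp/fn: threats correctly/incorrectly classified
    fp/tn: benign samples incorrectly/correctly classified
    """
    total = tp + fp + tn + fn
    return {
        # fraction of all samples classified correctly
        "accuracy": (tp + tn) / total,
        # fraction of benign samples wrongly flagged as threats
        "false_positive_rate": fp / (fp + tn),
        # fraction of true threats that were caught (recall)
        "detection_rate": tp / (tp + fn),
    }

# Hypothetical counts: 923 of 1,000 threats caught, 58 of 1,000 benign flagged
metrics = classification_metrics(tp=923, fp=58, tn=942, fn=77)
```

A lower false positive rate at a comparable detection rate is what separates the top entries on this leaderboard.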
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

Comprehensive evaluation of AI systems in realistic cybersecurity scenarios across multiple security domains

Task Types

threat-classification, vulnerability-identification, incident-triage, security-recommendation, attack-pattern-recognition

Dataset Size

5,000 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | Accuracy | Security Score | False Positive Rate | Detection Rate
-----|-------|----------|----------------|---------------------|---------------
1 | GPT-4 (gpt-4-1106-preview, OpenAI) | 89.2% | 87.6% | 5.8% | 92.3%
2 | Claude 3.5 (claude-3-5-sonnet-20241022, Anthropic) | 88.4% | 87.1% | 6.3% | 91.5%
3 | Gemini Ultra (gemini-ultra-1.0, Google) | 87.1% | 85.9% | 7.2% | 89.8%
4 | Claude 3 (claude-3-opus-20240229, Anthropic) | 86.3% | 84.7% | 6.8% | 88.9%
5 | Llama 3.1 (llama-3.1-70b-instruct, Meta) | 84.5% | 82.1% | 8.9% | 86.7%
6 | GPT-3.5 (gpt-3.5-turbo-1106, OpenAI) | 79.8% | 77.2% | 12.5% | 81.4%
7 | Llama 2 (llama-2-70b-chat, Meta) | 77.6% | 74.5% | 14.3% | 78.9%

All models evaluated by Microsoft Security Research; results dated October 14, 2024.
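The ranking above can be reproduced by sorting the published per-model accuracy figures; a minimal sketch (the data structure and field names are illustrative, not an official API):

```python
# Published accuracy figures from the leaderboard above
results = [
    {"model": "GPT-4", "accuracy": 89.2},
    {"model": "Claude 3.5", "accuracy": 88.4},
    {"model": "Gemini Ultra", "accuracy": 87.1},
    {"model": "Claude 3", "accuracy": 86.3},
    {"model": "Llama 3.1", "accuracy": 84.5},
    {"model": "GPT-3.5", "accuracy": 79.8},
    {"model": "Llama 2", "accuracy": 77.6},
]

# Rank models from highest to lowest accuracy
leaderboard = sorted(results, key=lambda r: r["accuracy"], reverse=True)

for rank, row in enumerate(leaderboard, start=1):
    print(f"{rank}. {row['model']}: {row['accuracy']:.1f}%")
```

Sorting on a single headline metric is a simplification; a fuller comparison would also weigh the false positive rate, where the gap between the top and bottom entries (5.8% vs. 14.3%) is far larger than the accuracy gap.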