
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
- **Top Score:** 89.2%
- **Models Evaluated:** 7
- **Dataset Size:** 5,000 samples
- **Last Updated:** October 14, 2024
- **Title:** Microsoft Raises the Bar: A Smarter Way to Measure AI for Cybersecurity
- **Authors:** Yiran Wu, Mauricio Velazco, Andrew Zhao, +9 more
- **Published:** September 1, 2025
- **arXiv ID:** 2507.14201v2

Comprehensive evaluation of AI systems in realistic cybersecurity scenarios across multiple security domains.
- **Tasks (5):** threat-classification, vulnerability-identification, incident-triage, security-recommendation, attack-pattern-recognition
| Rank | Model | Accuracy | Security Score | False Positive Rate | Detection Rate | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (`gpt-4-1106-preview`, OpenAI) | 89.2% | 87.6% | 5.8% | 92.3% | Microsoft Security Research | October 14, 2024 |
| 2 | Claude 3.5 (`claude-3-5-sonnet-20241022`, Anthropic) | 88.4% | 87.1% | 6.3% | 91.5% | Microsoft Security Research | October 14, 2024 |
| 3 | Gemini Ultra (`gemini-ultra-1.0`, Google) | 87.1% | 85.9% | 7.2% | 89.8% | Microsoft Security Research | October 14, 2024 |
| 4 | Claude 3 (`claude-3-opus-20240229`, Anthropic) | 86.3% | 84.7% | 6.8% | 88.9% | Microsoft Security Research | October 14, 2024 |
| 5 | Llama 3.1 (`llama-3.1-70b-instruct`, Meta) | 84.5% | 82.1% | 8.9% | 86.7% | Microsoft Security Research | October 14, 2024 |
| 6 | GPT-3.5 (`gpt-3.5-turbo-1106`, OpenAI) | 79.8% | 77.2% | 12.5% | 81.4% | Microsoft Security Research | October 14, 2024 |
| 7 | Llama 2 (`llama-2-70b-chat`, Meta) | 77.6% | 74.5% | 14.3% | 78.9% | Microsoft Security Research | October 14, 2024 |
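For readers interpreting the columns above: accuracy, false positive rate, and detection rate have standard definitions over binary malicious/benign labels. The sketch below shows those conventional definitions only; it is not the benchmark's published scoring code, and the function name and label encoding (1 = malicious, 0 = benign) are assumptions for illustration.

```python
def compute_metrics(y_true, y_pred):
    """Conventional binary-classification metrics, assuming
    1 = malicious (positive) and 0 = benign (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        # Fraction of all samples classified correctly.
        "accuracy": (tp + tn) / len(y_true),
        # Fraction of benign samples wrongly flagged as malicious.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Fraction of malicious samples correctly flagged (recall).
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example: 3 malicious and 2 benign samples.
metrics = compute_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Under these definitions a lower false positive rate is better while the other two columns are higher-is-better, which matches the ordering of the leaderboard.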