
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
- **Top Score:** 89.2%
- **Models Evaluated:** 7
- **Dataset Size:** 5,000 samples
- **Last Updated:** October 14, 2024
- **Title:** Microsoft Raises the Bar: A Smarter Way to Measure AI for Cybersecurity
- **Authors:** Yiran Wu, Mauricio Velazco, Andrew Zhao, +9 more
- **Published:** September 1, 2025
- **arXiv ID:** 2507.14201v2

Comprehensive evaluation of AI systems in realistic cybersecurity scenarios across multiple security domains.
- **Tasks (5):** threat-classification, vulnerability-identification, incident-triage, security-recommendation, attack-pattern-recognition
| Rank | Model | Accuracy | Security Score | False Positive Rate | Detection Rate | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (`gpt-4-1106-preview`, OpenAI) | 89.2% | 87.6% | 5.8% | 92.3% | Microsoft Security Research | October 14, 2024 |
| 2 | Claude 3.5 (`claude-3-5-sonnet-20241022`, Anthropic) | 88.4% | 87.1% | 6.3% | 91.5% | Microsoft Security Research | October 14, 2024 |
| 3 | Gemini Ultra (`gemini-ultra-1.0`, Google) | 87.1% | 85.9% | 7.2% | 89.8% | Microsoft Security Research | October 14, 2024 |
| 4 | Claude 3 (`claude-3-opus-20240229`, Anthropic) | 86.3% | 84.7% | 6.8% | 88.9% | Microsoft Security Research | October 14, 2024 |
| 5 | Llama 3.1 (`llama-3.1-70b-instruct`, Meta) | 84.5% | 82.1% | 8.9% | 86.7% | Microsoft Security Research | October 14, 2024 |
| 6 | GPT-3.5 (`gpt-3.5-turbo-1106`, OpenAI) | 79.8% | 77.2% | 12.5% | 81.4% | Microsoft Security Research | October 14, 2024 |
| 7 | Llama 2 (`llama-2-70b-chat`, Meta) | 77.6% | 74.5% | 14.3% | 78.9% | Microsoft Security Research | October 14, 2024 |
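For readers interpreting the columns above: accuracy, false positive rate, and detection rate have standard definitions over binary malicious/benign labels. The sketch below shows those conventional definitions only; it is not the benchmark's published scoring code, and the function name and label encoding (1 = malicious, 0 = benign) are assumptions for illustration.

```python
def compute_metrics(y_true, y_pred):
    """Conventional binary-classification metrics, assuming
    1 = malicious (positive) and 0 = benign (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        # Fraction of all samples classified correctly.
        "accuracy": (tp + tn) / len(y_true),
        # Fraction of benign samples wrongly flagged as malicious.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Fraction of malicious samples correctly flagged (recall).
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example: 3 malicious and 2 benign samples.
metrics = compute_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Under these definitions a lower false positive rate is better while the other two columns are higher-is-better, which matches the ordering of the leaderboard.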