Date: 2025-11-03

Key Takeaways:

• Azure’s ND GB300 v6 virtual machines, powered by NVIDIA GB300 NVL72 rack-scale systems, achieved 1.1 million tokens per second on Llama 2 70B inference, a 27 percent improvement over the previous Azure ND GB200 v6 record.

• The performance milestone was observed by Signal65, which called it “a definitive proof point that the performance required for large-scale, transformative AI is now available as a reliable, efficient, and resilient utility.”

• The ND GB300 v6 VMs leverage NVIDIA’s Blackwell architecture, with 50 percent more GPU memory, 16 percent higher TDP, and significant gains in HBM bandwidth and NVLink connectivity.

• Benchmarks using NVIDIA TensorRT-LLM and MLPerf Inference v5.1 demonstrate five times higher throughput per GPU compared with Azure’s previous H100 generation.

Azure has set a new bar for enterprise-scale AI inference. Its new ND GB300 v6 virtual machines reached 1,100,000 tokens per second on Llama 2 70B inference, 27 percent above the previous record of 865,000 tokens per second set on ND GB200 v6.

The test, conducted on eighteen ND GB300 v6 VMs within a single NVIDIA GB300 NVL72 rack, was observed by independent firm Signal65. The organization described the result as “more than a benchmark record,” adding that it represents a turning point in making transformative AI capabilities widely available and scalable.

Built on the NVIDIA Blackwell architecture introduced with the ND GB200 v6, the new system is designed to handle inference workloads with greater efficiency. Each VM includes four NVIDIA GB300 GPUs for a total of 72 GPUs per rack, providing 189,471 MiB of GPU memory and a power limit of 1,400 watts per unit. This configuration delivers 15,200 tokens per second per GPU (± 5 percent), a 27 percent gain over the previous ND GB200 v6 generation.
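
The rack-level and per-GPU figures can be cross-checked with simple arithmetic. The short Python sketch below is illustrative only and is not part of Azure's benchmark; the function names are hypothetical, and the inputs are the rounded numbers quoted in this article.

```python
# Figures quoted in the article (rounded aggregates, not raw benchmark logs).
VMS_PER_RACK = 18
GPUS_PER_VM = 4
GB300_RACK_TOKENS_PER_SEC = 1_100_000   # ND GB300 v6 aggregate
GB200_RACK_TOKENS_PER_SEC = 865_000     # previous ND GB200 v6 record


def gpus_per_rack(vms: int, gpus_per_vm: int) -> int:
    """Total GPUs in the rack."""
    return vms * gpus_per_vm


def per_gpu_throughput(rack_tokens_per_sec: float, gpus: int) -> float:
    """Average tokens per second contributed by each GPU."""
    return rack_tokens_per_sec / gpus


def percent_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    gpus = gpus_per_rack(VMS_PER_RACK, GPUS_PER_VM)
    per_gpu = per_gpu_throughput(GB300_RACK_TOKENS_PER_SEC, gpus)
    gain = percent_gain(GB300_RACK_TOKENS_PER_SEC, GB200_RACK_TOKENS_PER_SEC)
    print(f"GPUs per rack: {gpus}")
    print(f"Tokens/s per GPU: {per_gpu:,.0f}")
    print(f"Gain over previous record: {gain:.1f}%")
```

Dividing the rounded 1.1 million aggregate by 72 GPUs gives about 15,278 tokens per second per GPU, which sits within the stated ± 5 percent of 15,200, and the rack-level gain works out to about 27.2 percent.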

For context, an earlier MLPerf Inference v4.1 result on NVIDIA’s DGX H100 system processed 24,525 tokens per second across eight GPUs (about 3,066 tokens per second per GPU). By comparison, the Azure ND GB300 v6 setup achieved roughly five times higher throughput per GPU than the ND H100 v5 generation. According to Microsoft, these results illustrate how its latest infrastructure advancements bring cutting-edge AI performance to a broader set of enterprise users through the cloud.
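
The "roughly five times" comparison follows directly from the per-GPU numbers. As a minimal sketch, using only the figures quoted in this article (the function names are hypothetical):

```python
# Figures quoted in the article.
H100_TOKENS_PER_SEC = 24_525        # MLPerf Inference v4.1, DGX H100, 8 GPUs
H100_GPU_COUNT = 8
GB300_TOKENS_PER_SEC_PER_GPU = 15_200  # ND GB300 v6, per GPU


def per_gpu(total_tokens_per_sec: float, gpu_count: int) -> float:
    """Average tokens per second per GPU."""
    return total_tokens_per_sec / gpu_count


def speedup(new_per_gpu: float, old_per_gpu: float) -> float:
    """How many times faster the new per-GPU figure is."""
    return new_per_gpu / old_per_gpu


if __name__ == "__main__":
    h100 = per_gpu(H100_TOKENS_PER_SEC, H100_GPU_COUNT)
    ratio = speedup(GB300_TOKENS_PER_SEC_PER_GPU, h100)
    print(f"H100 tokens/s per GPU: {h100:,.0f}")
    print(f"GB300 vs H100 per-GPU speedup: {ratio:.2f}x")
```

With these inputs the ratio comes out at about 4.96, consistent with the "roughly five times" claim.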

The benchmarking used MLPerf Inference v5.1 and NVIDIA TensorRT-LLM, a production-ready software stack for large-language-model inference. The Llama 2 70B model was run using FP4 precision, a 4-bit floating-point format that speeds up processing while maintaining high accuracy. In the offline scenario, the aggregate throughput of over 1.1 million tokens per second marks the first time an Azure rack has broken the million-token threshold.

The engineering team credited multiple hardware improvements for the results. The ND GB300 v6 delivers 2.5 times more GEMM TFLOPS per GPU than the ND H100 v5. It also achieved 7.37 TB/s of high-bandwidth memory throughput at 92 percent efficiency, and four times faster CPU-to-GPU transfer speeds thanks to NVLink C2C interconnects. These enhancements combine to reduce latency and improve scalability for multi-node inference tasks.
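
The memory-bandwidth figure also implies a peak value. The sketch below assumes that "efficiency" means measured throughput divided by theoretical peak bandwidth; the article does not define the term, so this is one plausible reading, and the function name is hypothetical.

```python
# Figures quoted in the article.
MEASURED_HBM_TB_PER_SEC = 7.37
EFFICIENCY = 0.92  # assumed to mean measured / theoretical peak


def implied_peak_bandwidth(measured: float, efficiency: float) -> float:
    """Peak bandwidth implied by a measured value and an efficiency ratio."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return measured / efficiency


if __name__ == "__main__":
    peak = implied_peak_bandwidth(MEASURED_HBM_TB_PER_SEC, EFFICIENCY)
    print(f"Implied peak HBM bandwidth: {peak:.2f} TB/s")
```

Under that assumption, the quoted numbers imply a peak of roughly 8 TB/s per GPU.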

“The performance we’re seeing on ND GB300 v6 confirms our investment in co-designing hardware and software for AI scale,” said Mark Gitau, Software Engineer at Microsoft Azure. “By optimizing for memory bandwidth and communication efficiency, we can support models at unprecedented levels of throughput and responsiveness.”

Co-author Hugo Affaticati, Senior Cloud Infrastructure Engineer, added that the advancement “gives enterprises confidence that their most demanding AI applications can run reliably on Azure, from training to real-time inference.” He pointed out that the record reflects Azure’s continued collaboration with NVIDIA to tune its cloud architecture for large-language-model deployment.

From a customer perspective, the performance boost translates into faster response times, more concurrent users per instance, and the ability to scale AI applications without re-architecting workflows. Industries relying on large language models for code generation, chatbots, and document processing may see notable gains in efficiency and cost per token processed.

Azure shared the configuration and step-by-step guide for replicating the experiment on a single VM, demonstrating transparency in its benchmarking process. Developers can clone the AI Benchmarking Guide repository on GitHub, download the Llama 2 model and datasets, build the TensorRT-LLM container, and run the offline benchmark to observe results. While the test submission is unverified by MLCommons, the use of MLPerf methodology and third-party observation adds credibility to the findings.

Microsoft noted that these results underscore Azure’s broader strategy to make AI infrastructure a scalable utility for business innovation. By offering state-of-the-art GPU clusters as part of its cloud platform, the company is positioning Azure as a foundation for AI advancement rather than just a hosting environment. This approach aligns with CEO Satya Nadella’s recent statement on X that breaking the million-token barrier represents “an incredible technical achievement and a major step forward for our customers building at AI scale.”

As organizations increasingly depend on large language models for competitive advantage, the ability to run them efficiently and cost-effectively in the cloud is becoming a defining factor in AI adoption. Azure’s ND GB300 v6 release illustrates how advances in hardware design, software optimization, and benchmarking transparency can translate into real-world AI acceleration.