Intro
SemiAnalysis has been on a five-month-long quest to settle the reality of the MI300X. In theory, the MI300X should hold a huge advantage over Nvidia’s H100 and H200 in terms of specifications and Total Cost of Ownership (TCO). In reality, though, the on-paper specs given below are not representative of the performance that can be expected in a real-world environment. If AMD could deliver its marketed performance with this memory, it would be a very strong competitor in the market.
Today we are going to talk through our five-month journey conducting independent analysis and training-focused benchmarking of the MI300X, the H100, and the H200, engaging with both NVIDIA and AMD. We will give a detailed overview of the numerous low-level benchmarks that we ran; see the table of contents for a summary. Furthermore, we will compare the total cost of ownership of Nvidia and AMD GPUs and factor in performance. Ultimately, much of what we are doing is openly giving a comprehensive public recommendation to AMD on what they need to do to be competitive and fix their software issues after five months of submitting and squashing bugs. It’s not just that the software is immature; AMD needs to change how it does development.
In short, when comparing Nvidia’s GPUs to AMD’s MI300X, we found that the MI300X’s potential on-paper advantage was not realized due to shortcomings in AMD’s public release software stack and a lack of testing on AMD’s part.
AMD’s software experience is riddled with bugs that render out-of-the-box training on AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience. As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates.
We shared benchmark source code and intermediate test results for the GEMM benchmark and single-node training with both Nvidia and AMD, held calls and discussions to solicit feedback and implement improvements to the benchmarks, and worked with AMD to implement bug fixes for its software stack.
Our goal with this highly iterative interaction was to ensure that our tests are an unbiased evaluation of what real-world users would experience.
We initially planned to publish this article a few months ago but wanted to take the extra time to engage with the AMD team and explore possible fixes or development work. We spent considerable time identifying and fixing AMD software bugs so that we could give AMD every chance to show the MI300X unhindered by AMD software stack bugs, as opposed to only showing problematic performance out of the box. To give a fair impression, we also explain the considerable amount of tuning and bug-squashing that it took to get there. We think this approach provides users with the best possible level of transparency.
We wanted to contribute in any way we could to try to improve the AMD ecosystem. Though AMD software is much better now due to our bug reports and tire-kicking, its public software stack still falls short. We have open-sourced many of the benchmarks and created simple one-liner commands to reproduce them.
If Lisa Su and the AMD Leadership redouble their investment with a focus on their software and testing stack, they have a chance to be competitive with Nvidia on training. We think the engineers at AMD are extremely capable and are doing their best to advance the AMD ecosystem – and indeed support from these engineers in the form of bug fixes, configuration help and custom images improved the results we were able to get from the MI300X.
To bring our benchmarking process to a close, on November 15th, 2024 we sent Nvidia and AMD a draft of most of our major GEMM and single-node benchmarking code and results for comments, verification, and fine-tuning. We asked that any final comments, fixes, feedback, and performance improvements be submitted by November 25th. We set this time frame to crystallize test results and allow time to write an in-depth analysis and commentary and carry out multiple rounds of internal and external reviews, steps that can take a variable and often unknowable amount of time, typically 2-4 weeks.
A few days ago, after we informed both that we had confirmed an article publication date of December 20th, AMD requested that we delay publication to include results based on a beta WIP development build on an AMD developer’s branch. All of our benchmarking on Nvidia was conducted on publicly available stable release builds. In the spirit of transparency and fairness, we include these results, as well as updated testing-harness results on both the original November 25th deadline image and the latest publicly available software. However, we believe that the correct way to interpret the results is to look at the performance of the public stable releases of AMD/Nvidia software.
Below is the list of software builds that we used for benchmarking:
- H100 Public Stable Release – Out of Box experience for Nvidia H100.
- H200 Public Stable Release – Out of Box experience for Nvidia H200.
- MI300X Nov 25th Custom Build – A custom VIP docker image, hand-crafted by AMD principal engineers, that builds all dependencies from source code.
- MI300X Stable Public Release PyTorch 2.5.1 – Out of Box experience for AMD MI300X.
- MI300X Public Nightly Dec 19th – This can indicate where AMD performance will be by January 2025, when PyTorch 2.6 is released, over a year after the MI300X’s launch.
- MI300X Dec 21st WIP dev build – This is the image that AMD submitted to us after we agreed to delay publication of the article. It is an experimental development build that has not yet been merged into AMD’s internal main branch, and it does not use the native PyTorch flash attention API. Performance with this image can indicate where AMD public stable release performance will be in 1-2 quarters from now.
We are very thankful for the technical support provided by AMD and Nvidia throughout this process, but we maintain our independence in the results we publish. We want to shout out and thank our AMD counterparts, Anush Elangovan (AMD VP of AI), Hui Liu, and many dozens of amazing AMD Principal/Senior Engineers, AMD VPs of Engineering, AMD Engineering Fellows, AMD CVPs of Engineering, AMD Directors of Engineering, and AMD Software Library Leads, for triaging and fixing our various bug reports. On the Nvidia side, we are grateful to Kedar Potdar, Ian Buck, Sylvain Jeaugey, and the NCCL team from NVIDIA for their amazing support.
Thank you to Crusoe, TensorWave (AMD Ventures Portco), Nebius, Lambda, Hot Aisle and Sustainable Metal Cloud (SMC) / Firmus for the compute and for being supporters of open-source benchmarking. Crusoe, Nebius, SMC / Firmus and Lambda support managed SLURM and shared home directories out of the box. TensorWave currently has managed SLURM in beta and this feature will come to general availability (GA) at the start of next year. Sustainable Metal Cloud is one of the few neoclouds that has official MLPerf GPT-3 175B Training results.
We will be releasing a follow-up article on inferencing for the H100, H200, and MI300X. We may also release another article in a few months to revisit AMD training performance, to see whether the out-of-box experience has improved, and to test other models such as LLaVA and Mamba.
Key Findings
- Comparing on paper FLOP/s and HBM Bandwidth/Capacity is akin to comparing cameras by merely examining megapixel count. The only way to tell the actual performance is to run benchmarking.
- Nvidia’s Out of the Box Performance & Experience is amazing, and we did not run into any Nvidia-specific bugs during our benchmarks. Nvidia assigned a single engineer to us for technical support, but because we didn’t run into any Nvidia software bugs, we didn’t need much support.
- AMD’s Out of the Box Experience is very difficult to work with and can require considerable patience and elbow grease to move towards a usable state. On most of our benchmarks, public stable releases of AMD PyTorch are still broken, and we needed workarounds.
- If we weren’t supported by multiple teams of AMD engineers triaging and fixing bugs in AMD software that we ran into, AMD’s results would have been much lower than Nvidia’s.
- We ran unofficial MLPerf Training GPT-3 175B on 256 H100s in collaboration with Sustainable Metal Cloud to test the effects of different VBoost settings.
- For AMD, Real World Performance on public stable released software is nowhere close to its on paper marketed TFLOP/s. Nvidia’s real world performance also undershoots its marketing TFLOP/s, but not by nearly as much.
- The MI300X has a lower total cost of ownership (TCO) than the H100/H200, but training performance per TCO is still worse on the MI300X when using public stable releases of AMD software. This changes if one uses custom development builds of AMD software.
- Training performance is weaker, as demonstrated by the MI300X’s matrix multiplication micro-benchmarks, and AMD’s public release software still lags Nvidia’s H100 and H200 on single-node training throughput.
- MI300X performance is held back by AMD software. AMD MI300X software on BF16 development branches delivers better performance, but it has not yet been merged into the main branch of AMD’s internal repos. By the time it gets merged into the main branch and into the PyTorch stable release, Nvidia Blackwell will have already been available to everyone.
- AMD’s training performance is also held back because the MI300X does not deliver strong scale-out performance. This is due to its weaker ROCm Communication Collectives Library (RCCL) and AMD’s lower degree of vertical integration with networking and switching hardware compared to Nvidia’s strong integration of its Nvidia Collective Communications Library (NCCL), InfiniBand/Spectrum-X network fabric, and switches.
- Many of AMD’s AI libraries are forks of NVIDIA’s AI libraries, leading to suboptimal outcomes and compatibility issues.
- AMD customers tend to use hand crafted kernels only for inference, which means their performance outside of very narrow well defined use cases is poor, and their flexibility to rapidly shifting workloads is non-existent.
Executive Recommendation to AMD
We genuinely want to see another effective competitor to Nvidia and want to help AMD get to that spot, but, unfortunately, there is still much work to be done on that front. At the bottom of this article, we have a detailed list of feedback for Lisa Su and the AMD leadership team, but provide a summary here:
- Give AMD engineers more compute and engineering resources to fix and improve the AMD ecosystem; they have very few internal GPU boxes relative to what Nvidia provides its engineers. TensorWave, the largest AMD GPU cloud, has given GPU time for free to a team at AMD to fix software issues, which is insane given that TensorWave paid for the GPUs.
- AMD needs to hook up thousands more MI300X and MI325X GPUs to PyTorch CI/CD for automated testing to ensure there are no AMD performance regressions or functional AMD bugs. Nvidia has given thousands of GPUs to PyTorch CI/CD to ensure an amazing out-of-box experience.
- The AMD Executive Team should personally and intensively internally test (i.e., “dogfood”) products that are being shipped to the public rather than focus on testing internal builds. Preferably, dogfood during a livestream (twitch.tv) to show the authentic out-of-box experience, similar to how geohot livestreams.
- AMD should collaborate with Meta to get production LLM training workloads working as soon as possible on PyTorch ROCm, AMD’s answer to CUDA, as commonly, PyTorch code paths that Meta isn’t using have numerous bugs.
- Move away from over-reliance on properly setting numerous environment flags (up to dozens) to make an AMD deployment usable. Instead, bake these settings into the default configuration. Make the out of the box experience usable!
- Focus on making out of box experience good instead of over-reliance on custom VIP images that build all dependencies from source code main@specificcommit and take 5 hours to build.
- Stop expecting end users to use PYTORCH_TUNABLE_OPS, which is a buggy prototype feature that is not respectful of end users’ time, as it takes ~1 hour to re-tune every time an end user makes any changes to their code.
- AMD should submit MLPerf Training GPT-3 175B results. MLPerf is an apples-to-apples benchmarking methodology that uses time to convergence as the north star.
- We want AMD to be competitive and are open to meeting to provide more detailed feedback on how to fix the AMD datacenter GPU ecosystem for the better.
A Summary of the AMD vs Nvidia Narrative
Before we dive into various facets of AMD’s software stack that hold AMD back, we will discuss the MI300X’s basic specifications, its comparative total cost of ownership, and how most analysts and investors have evaluated its competitiveness.
The MI300X launched in late 2023 with an exciting set of on-paper specifications: 1,307 TFLOP/s of FP16 compute (stronger than the H100’s 989 TFLOP/s), 5.3 TB/s of memory bandwidth, and 192GB of HBM3, versus the H100’s 3.35 TB/s of memory bandwidth and 80GB of HBM3. These specs also outstrip those of the H200, which itself is, effectively, a memory-spec-bumped version of the H100, delivering 4.8 TB/s of memory bandwidth and 141GB of HBM3e.
On paper total cost of ownership for an MI300X deployment is extremely compelling, not only due to the lower ASP of the MI300X, but also because it is typically deployed using cheaper Ethernet networking. Comparing a cluster of 16k H200s vs a 16k MI300X ethernet cluster leads to nearly 40% of the cost savings coming from networking alone, with the remainder of the savings from a lower accelerator cost. The use of Whitebox Ethernet switches is a substantial cost savings compared to using Nvidia’s Quantum-2 switches, but the real difference is cheaper transceivers, as Nvidia branded transceivers cost as much as 2-3x over what a typical transceiver OEM charges.
At face value, the MI300X seems the best of both worlds: higher performance and lower total cost of ownership. At the time of its launch, it was logical to expect share gains to the underdog AMD from this compelling combination. The table below shows total upfront cluster capex – we present a more detailed breakdown of cluster capex components as well as a detailed networking BoM analysis in the sections near the bottom of the article.
As orders solidified, excitement built up for the potential of the MI300X, helped along by bullish commentary and guidance from AMD. With a compelling spec advantage, it was easy to argue for further upside to AMD’s guidance, which most investors assumed management was sandbagging. AMD had a strong hand, in theory. After all, they held mid-single-digit market share in datacenter GPUs for 2024, and, logically, a glide path towards even 10-12% market share by 2027 could be conservative while offering considerable earnings upside for AMD.
However, from late 2023 through most of 2024, guidance for full-year 2024 datacenter GPU sales repeatedly underperformed those lofty expectations. From its 1Q24 earnings through its 3Q24 earnings, AMD only raised guidance from $4B to $5B, well under the $6-8B investor bogey based on CoWoS and HBM supply agreements. Our demand view in the Accelerator Model tracked Microsoft’s disappointment early in the year and the lack of follow-on orders.
The earlier bullish line of reasoning was like purchasing a certain car model from a magazine without a test drive or soliciting feedback from owners of that model or reading any reviews. But fear not – SemiAnalysis has put the MI300X, H100, and H200 through their paces at scale and can show why AMD’s current software stack issues decisively disprove this line of reasoning.
General Matrix Multiply (GEMM) Performance
Most FLOPS in a transformer-based architecture (i.e. ChatGPT, Llama, etc.) go towards matrix multiplication, also known as GEMMs. For this reason, GEMM performance is a good proxy for how well frontier transformers, such as ChatGPT, Llama, Claude, Grok, etc. will train on the hardware.
GEMMs take two input matrices, Matrix A and Matrix B, with Matrix A having the shape of (M, K), M rows and K columns, and Matrix B having the shape of (K,N) to produce an output matrix of shape (M,N).
Conceptually, each element of the resulting matrix is a sum of element-wise multiplications along the “K” dimension of the inputs. For this reason, the K dimension is also known as the reduction dimension.
Below, we have tested the following real-world shapes, given in the form (M,N,K)—which is short for multiplying a matrix of dimensions (M,K) and (K,N) together.
These following matrix shapes were actually used in Meta’s Llama 70B production training:
- (16384, 8192, 1280) – Fused QKV Projection GEMM shape
- (16384, 1024, 8192) – Attention Output Projection shape
- (16384, 8192, 7168) – FFN GEMM shape
- (16384, 3584, 8192) – FFN GEMM shape
- (8192, 8192, 8192) – Standard GEMM shape for benchmarking
We used OpenAI’s do_bench function for the benchmark setup, an industry-standard method of benchmarking in PyTorch. The do_bench function provides cache clearing between runs by default and provides ways to warm up and execute the benchmark multiple times, taking the median result across the measured iterations. We used warmup=30 and rep=200 for these tests. Both input tensors A and B were randomly initialized with a normal distribution with mean 0 and variance 1, because a normal distribution comes closest to matching the actual distribution of weights and activations in modern neural networks. The distribution of the input tensors affects the measured TFLOP/s; we will discuss why later in the article.
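To make the setup concrete, below is a minimal sketch of this kind of measurement using triton.testing.do_bench (our open-sourced harness differs in detail; return_mode="median" assumes a reasonably recent Triton version):

```python
import torch
from triton.testing import do_bench  # OpenAI Triton's standard micro-benchmarking helper

def bench_gemm_tflops(M, N, K, dtype=torch.bfloat16, warmup=30, rep=200):
    # Inputs drawn from a normal distribution (mean 0, variance 1), as described above
    a = torch.randn(M, K, dtype=dtype, device="cuda")
    b = torch.randn(K, N, dtype=dtype, device="cuda")
    # do_bench clears the L2 cache between iterations by default; here we take the median time in ms
    ms = do_bench(lambda: torch.matmul(a, b), warmup=warmup, rep=rep, return_mode="median")
    return 2 * M * N * K / (ms * 1e-3) / 1e12  # a GEMM performs 2*M*N*K FLOP

# The Llama 70B production training shapes listed above, in (M, N, K) form
for shape in [(16384, 8192, 1280), (16384, 1024, 8192), (16384, 8192, 7168),
              (16384, 3584, 8192), (8192, 8192, 8192)]:
    print(shape, f"{bench_gemm_tflops(*shape):.0f} TFLOP/s")
```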
For BF16, we can see that the H100 and H200 achieve roughly 720 TFLOP/s against their marketed 989.5 TFLOP/s, while the MI300X reaches a mere ~620 TFLOP/s compared with its marketed 1,307 TFLOP/s.
This means that, despite a much higher marketed BF16 TFLOP/s, the MI300X is 14% slower than the H100 and H200. This AMD result used a custom docker image that was hand-crafted by an AMD principal engineer, yet it still achieved slower performance than Nvidia’s GPUs. For our out-of-the-box testing of the MI300X, the TFLOP/s throughput was even slower than this! In addition to a custom image, AMD also requires the user to set numerous environment flags that aren’t set by default to reach these performance results.
Unfortunately, the story is worse for FP8. The H100/H200 achieve ~1,280 TFLOP/s out of the marketed 1,979 TFLOP/s. The MI300X, in comparison, only reaches ~990 TFLOP/s. Thus, for FP8, the MI300X is 22% slower than the H100. This is with both inputs being of the e4m3 FP8 (i.e. 4 exponent bits and 3 mantissa bits) datatype.
It is important to note that calling a GEMM is a simple task, and we shouldn’t expect to run into AMD software bugs. Unfortunately, a major bug that we encountered is that the torch.matmul and F.Linear APIs delivered different performance on AMD for a couple of months during the summer. One would expect the torch.matmul and F.Linear APIs to have the same performance, but, surprisingly, F.Linear was much slower!
This is a strange bug as torch.matmul and F.Linear are both wrappers around the hardware vendor GEMM libraries, so they should achieve the same level of performance. F.Linear, in particular, is important, as this is the way most end users in PyTorch launch the GEMM kernels.
When we started testing AMD five months ago, the public AMD PyTorch still had this bug. The root cause was that AMD in fact has two different underlying GEMM libraries, rocBLAS and hipBLASLt, with hipBLASLt being more optimized for the MI300X. The bug was that torch.matmul used the optimized hipBLASLt, but AMD had not changed the default for F.Linear, leaving it to use the unoptimized rocBLAS library.
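A quick way to catch this class of bug is to time the two APIs side by side on the same shape; on a healthy software stack they should match to within noise (a sketch, reusing one of the Llama shapes above):

```python
import torch
import torch.nn.functional as F
from triton.testing import do_bench

M, N, K = 16384, 8192, 7168  # one of the Llama 70B FFN GEMM shapes from above
a = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")  # F.linear expects a weight of shape (N, K)

ms_matmul = do_bench(lambda: torch.matmul(a, w.t()), return_mode="median")
ms_linear = do_bench(lambda: F.linear(a, w), return_mode="median")
# Both are thin wrappers around the vendor GEMM library, so a large gap points at a
# dispatch problem like the rocBLAS-vs-hipBLASLt issue described above.
print(f"torch.matmul: {ms_matmul:.3f} ms   F.linear: {ms_linear:.3f} ms")
```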
This major bug was ultimately fixed by AMD a few months ago after our bug reports, and we hope it doesn’t reappear due to a lack of proper regression testing. AMD’s usability could improve considerably if it boosted its testing efforts instead of waiting for users to discover these critical issues.
We have open-sourced the GEMM benchmark used in our tests as a simple three-liner that anyone can easily run:
Stas’ GEMM Benchmark Is Wrong
Recently, a benchmark has been floating around the internet that claims that, on GEMMs, AMD MI300X’s performance is close to that of the H100. We love Stas Bekman and think he does a lot of positive work for the ML community, but unfortunately, his benchmark has some flaws.
There are two main issues with Stas’ benchmark: it isn’t properly carrying out L2 cache clearing, and it simply takes the max performance instead of the median/mean TFLOP/s over the course of the iterations for a specific shape. Without L2 cache clearing between iterations, the benchmark does not accurately reflect real-world GEMM performance. Furthermore, since the TFLOP/s change depending on which iteration it is on, you need to use a mean/median over at least 100 iterations as the basis for an accurate GEMM benchmark. OpenAI’s do_bench provides L2 cache clearing and mean/median measurement out of the box by default, so we recommend that engineers use it for micro-benchmarking. Below, we have simplified Stas’ benchmark into pseudocode and have commented on the issues mentioned above.
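The sketch below is our own reconstruction of that pattern (illustrative, not Stas’ verbatim code), with the two issues called out in comments:

```python
import time
import torch

M = N = K = 8192
a = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
b = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")

best_tflops = 0.0
for _ in range(100):
    # Issue 1: no L2 cache clearing between iterations, so repeated runs on the same
    # inputs benefit from cached data and look faster than a real workload would.
    torch.cuda.synchronize()
    start = time.perf_counter()
    torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Issue 2: keeping only the max across iterations instead of the median/mean.
    best_tflops = max(best_tflops, 2 * M * N * K / elapsed / 1e12)

print(f"{best_tflops:.0f} TFLOP/s")  # overstates real-world GEMM throughput
```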
HBM Memory Bandwidth Performance
It is widely known that the AMD MI300X has better memory bandwidth than the Nvidia H100 and H200, offering 5.3 TB/s of bandwidth vs 4.8 TB/s for the H200 and 3.35 TB/s for the H100. Improved HBM memory bandwidth is very useful in inferencing and is sometimes useful in training. In training, users can set a larger batch size if they have more HBM memory capacity and memory bandwidth. However, past a certain size, a larger global batch size will make the model take longer to converge. It is easy to run fast with a big global batch size, but beyond that point it hurts time to convergence.
From our HBM memory bandwidth benchmarking, we see that the MI300X indeed has far better memory bandwidth than both the H200 and the H100. We tested memory bandwidth in PyTorch with Tensor.copy_ and used the industry-standard OpenAI do_bench to ensure accuracy.
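For reference, a minimal version of this bandwidth test looks like the following (the buffer size here is an illustrative choice, not necessarily the one we used):

```python
import torch
from triton.testing import do_bench

# Time a large device-to-device Tensor.copy_ and convert it to bandwidth.
nbytes = 1 * 1024**3  # 1 GiB source buffer (illustrative size)
src = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
dst = torch.empty(nbytes, dtype=torch.uint8, device="cuda")

ms = do_bench(lambda: dst.copy_(src), warmup=30, rep=200, return_mode="median")
# Each copy reads nbytes and writes nbytes, so total HBM traffic is 2 * nbytes.
print(f"{2 * nbytes / (ms * 1e-3) / 1e12:.2f} TB/s")
```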
As you will see in our upcoming H100 vs H200 vs MI300X inference article, memory bandwidth is very important for inferencing.
AMD Hand-Crafted VIP Custom Builds and WIP Development Builds
The only reason we have been able to get AMD performance within 25% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60-step Dockerfile, hand-crafted by an AMD principal engineer, was provided specifically for us, since the PyTorch nightly and public PyTorch AMD images functioned poorly and had version differences. This docker image requires ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience that takes but a single line of code. Most users do not build PyTorch or hipBLASLt from source code but instead use the stable release.
When using public PyTorch, users have the choice of working with the latest stable images or a nightly PyTorch upload. Although a nightly PyTorch upload may have the latest commits that could potentially lead to better performance or fix some bugs, users must accept that the upload may not be fully tested and could contain new bugs from Meta/AMD/Nvidia or other PyTorch contributors that have not been discovered yet. Note that most end users are using the stable release of PyTorch.
Delightfully, Nvidia’s Docker images contain the complete set of developer tools needed for profiling and debugging, like Nsight Compute and Nsight Systems. AMD, in contrast, does not include their OmniTrace developer tool out of the box.
Until a couple of weeks ago, the AMD docker images only supported PyTorch 2.3, which was released 8 months ago. Mainline PyTorch 2.4 and PyTorch 2.5 have since been released, and PyTorch 2.6 is about to come out in Q1 2025. We recommended to an AMD Principal Engineer and to AMD’s VP of AI that AMD publish containers for the latest AMD PyTorch version – AMD has since started publishing containers for some of these AMD PyTorch versions, but a Docker image for AMD PyTorch 2.5 is still missing.
Dec 21st AMD Development Builds
Below is AMD’s December 21st development build docker image. As you can see, it uses a number of non-stable development branches for dependencies such as hipBLASLt, AOTriton, and ROCm Attention, and installs everything, including PyTorch, from source code, taking upwards of 5 hours to build. These versions of the dependencies haven’t even been merged into AMD’s own main branch yet. 99.9% of users will not be installing PyTorch and all of its dependencies from source code on development branches but will instead use the public stable PyPI PyTorch.
Furthermore, instead of using Flash Attention through the PyTorch-native, user-friendly torch.scaled_dot_product_attention API, this AMD development build imports an attention implementation from another library (also on a development branch). We have seen more users use Flash Attention through the PyTorch-native torch.scaled_dot_product_attention API since it is more user friendly and bundled into out-of-box PyTorch. Even AMD’s own public documentation recommends using Flash Attention through the torch.scaled_dot_product_attention API. We hope that these kernels get merged into PyTorch flash attention instead of making the end user install a separate library and spend hours of their time building it. This is not a user-friendly experience. Furthermore, AMD must support FlexAttention, as it has quickly become the go-to in the industry.
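For reference, the PyTorch-native path that AMD’s own documentation (and our benchmark) uses is a one-line call, with no extra library to build (a sketch; shapes are illustrative, and the explicit backend selection via torch.nn.attention requires a recent PyTorch):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Illustrative (batch, heads, seq_len, head_dim) tensors
q, k, v = (torch.randn(1, 16, 8192, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Explicitly request the flash attention backend of scaled_dot_product_attention
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```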
AMD’s December 21st dev build is on a hanging development branch. That means it is a branch that has not been fully QA’ed and is a use-at-your-own-risk branch. There are many concerns about the validity of results obtained from development builds and branches built from source code, as most users are not doing this in real life. Most users will be installing AMD/Nvidia PyTorch from the PyPI stable release, so we recommend readers keep this in mind when analyzing these results.
That being said, we are including these development build results as an indication of where AMD’s public stable release software will be 1-2 quarters from now. However, 1-2 quarters from now, Nvidia Blackwell will already be widely deployed, while the AMD MI355X will not commence shipments until H2 2025.
Training Testing Methodology (GPT1.5B, Llama 8B, Llama 70B, Mistral)
There are many ways to test training performance. The most accurate way is to take a medium-sized AI startup model’s internal codebases and run them on a 512-1024 GPU cluster. This way, the test run has all the optimizations that a typical user would have. Everything else is just a proxy for the performance of these training runs. Training performance takes into account HBM bandwidth, HBM capacity, TFLOP/s, networking, and system architecture. Comparing on paper HBM bandwidth/capacity is just like comparing on paper camera megapixels.
MLPerf GPT-3 175B Training is also a good proxy to measure the time it takes to train to a specific convergence. The MLPerf benchmark takes into account global batch sizes and whether a mixed-precision implementation incurs a convergence penalty. Unfortunately, MLPerf is quite difficult to run due to a lack of user-friendly documentation and instructions, and the performance is often min-maxed via a custom-tuned configuration specifically concocted for MLPerf that an average user would not adopt. Note that Nvidia has submitted MLPerf Training results with over 11k H100s, while AMD only runs MLPerf Training internally. AMD’s results are likely weak, which is why they have never submitted any MLPerf Training results, let alone for the MLPerf GPT-3 175B benchmark.
When designing our SemiAnalysis benchmark, we wanted to reflect the average user’s model implementation, and so opted for the torch.scaled_dot_product_attention API (which uses the flash attention backend), PyTorch Distributed Data Parallel (DDP) and/or Fully Sharded Data Parallel (FSDP), and torch.compile. Also note that AMD recommends users use torch.scaled_dot_product_attention in their own documentation. We believe this is the most representative of a typical user workload. Further, we used a generic PyTorch-native implementation of these models to keep it close to a typical ML scientist user and make it easy to run with a single line of code. In contrast to MLPerf, the goal of our benchmark is to be as simple to run as possible while still being a good proxy for performance. Note that, since we don’t take into account time to convergence, this benchmark has a slight bias towards AMD, as we set the micro batch size higher on AMD than on Nvidia. When taking time to convergence into account, AMD’s results would be worse than what is stated.
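A minimal sketch of that kind of setup is shown below; MyTransformer and get_dataloader are hypothetical placeholders, and the real open-sourced harness differs in detail:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with torchrun; the "nccl" backend maps to RCCL on ROCm builds of PyTorch.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = MyTransformer().cuda()              # hypothetical PyTorch-native transformer whose
                                            # attention layers call F.scaled_dot_product_attention
model = FSDP(model, use_orig_params=True)   # ZeRO-3-style sharding
model = torch.compile(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for input_ids, labels in get_dataloader():  # hypothetical dataloader
    loss = model(input_ids.cuda(), labels.cuda())  # model returns the loss directly here
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```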
As an aside, many AI practitioners have said they are not using Megatron or NeMo or 3D Parallelism due to the high level of complexity and lack of flexibility associated with those libraries, whose rigidity and complexity make their usage for ML research effectively impossible. Note that, in terms of 3D Parallelism, both Nvidia and AMD would get higher performance, assuming their software stack works, which is a big assumption for AMD. AMD Megatron is a fork of Nvidia Megatron and has fewer than 10 stars, which suggests that it is probably not dogfooded well; it would take extra months of submitting bug reports to get AMD Megatron working even for simple models.
For our SemiAnalysis model training benchmark, we will test four models, with the first being a simple GPT 1.5B DDP, as we believe this is representative of what small-scale experiments/ablations would look like before scale-out to bigger model sizes. DDP is a much simpler and less network-intensive form of parallelism. Next, we tested the standard Llama3 8B and Llama3 70B 4 Layer Proxy as a baseline for a popular model’s performance. Third, we tested Mistral 7B v0.1, which evaluates whether hardware will perform well when adding a bit of complexity, as Mistral uses sliding window attention instead of standard causal attention. Modern models such as ChatGPT, Claude, Gemini, o1, and o3 do not use standard causal attention and instead use more complex attention mechanisms.
A Modern GPT/Llama/Transformer model is built by stacking the same transformer layer over & over again. As such, measuring the performance of just 4 layers is a great proxy for the overall performance of the model.
Furthermore, in modern LLM training for all frontier LLM models, pipeline parallelism is used which means that a couple of transformer layers are placed in each GPU server. Never in modern pretraining is a whole model placed on a single node.
The model FLOP for each token trained is defined by the following formula:
6 * non_input_embedding_params + 12 * num_layers * num_heads * head_dim * max_seq_len * density
Density here refers to the fraction of the full attention mask that is actually computed. Causal attention, for example, has a density of 50%, while sliding window attention has an even lower density.
Note that originally our testing harness used 6 * params instead of 6 * non_input_embedding_params, which is the wrong way of calculating model FLOP per token. Furthermore, there was another bug in the way we used FSDP. We have since updated our testing harness, retroactively retested, and updated all of the benchmark results across all versions of software for the H100, H200, and MI300X: public stable, public nightly, VIP images, and AMD development builds. All results listed below are with the updated testing harness.
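In code, the corrected formula is simply the following (the values in the example call are illustrative, not a specific model’s configuration):

```python
def model_flop_per_token(non_input_embedding_params, num_layers, num_heads,
                         head_dim, max_seq_len, density):
    # 6 FLOP per non-embedding parameter per token, plus the attention-score term
    # scaled by how dense the attention mask is relative to a full mask.
    return (6 * non_input_embedding_params
            + 12 * num_layers * num_heads * head_dim * max_seq_len * density)

# Example: causal attention computes roughly half of the full mask (density ~= 0.5)
print(model_flop_per_token(6.9e9, 32, 32, 128, 8192, 0.5))
```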
Single Node Training Performance
Note that the H100/H200 performance we present in this report reflects out-of-the-box performance without any hand-crafted tuning from Nvidia engineers, while the results for the MI300X come after many months of tuning and bug fixes from AMD’s engineers. We did not run into any Nvidia-specific bugs, in contrast to AMD training, which was comparatively bug-filled. Five months ago, many models couldn’t run at more than 150 TFLOP/s on the AMD MI300X due to an AMD software bug in attention backwards and torch.compile, which forced the user to manually mark a region of the model as non-compilable instead of having a full graph compile.
We see that, for all models, the H100/H200 wins relative to the MI300X public release, public nightly release, and Nov 25th build-from-source VIP image. It is interesting that the MI300X does not perform well on smaller models such as GPT 1.5B or on any model that uses a non-causal attention layer, like Mistral 7B v0.1. This is due to FlexAttention not being fully operational at the time of the deadline, while, on Nvidia GPUs, it has been working since August 2024. As such, the H100/H200 beats the MI300X by more than 2.5x in terms of TFLOP/s for the MI300X public release, public nightly release, and Nov 25th VIP build.
For the Dec 21st MI300X internal WIP development build, we still see it perform worse than the H100/H200 on GPT 1.5B. Furthermore, it performs slightly worse than the H100 on Mistral 7B. For Llama3 8B and the Llama3 70B Proxy, the Dec 21st MI300X WIP development build performs better than the H100/H200, but note that this is because the MI300X WIP development build uses an AMD engineer’s development branch that has not even been merged into the AMD main branch.
Three months ago, attempting to do FP8 training on AMD led to segfaults and hard errors. On the off chance it did work, it was, in fact, slower than the same run using BF16. We worked with AMD’s FP8 team to fix this issue, as well as the AMD hipBLASLt team, which created tuning fixes for MI300X FP8 performance. FP8 training is important as it speeds up training compared to BF16, and most frontier labs use FP8 training.
After many fixes, we can see that the MI300X’s Nov 25th throughput for Llama3 8B and GPT 1.5B is somewhat competitive with H100’s. As usual, H200 wins in this category. However, for Llama3 70B 4 Layer Proxy, AMD Nov 25th’s results are sorely beaten.
For Mistral 7B, which has a non-causal attention layer, AMD’s Nov 25th performance is close to half that of an H100. This shows that, for anything that isn’t a simple model, even after months of tuning, AMD is still not competitive once there is a slight tweak to the model structure. Many frontier models and AI training startups are using complex attention layers for long context spans and efficient attention, but AMD is still far behind on those.
Unfortunately, FP8 training on AMD only works on custom images such as our November 25th VIP image and December 21st WIP development branch image. When we first started trying AMD FP8 Training, it was slower than AMD BF16 Training on public releases.
For AMD’s WIP development builds, we see that on Llama3 8B, it wins against H100 but is still slower than H200’s public stable software release. H200 performance completely beats MI300X even on their Dec 21st WIP development branches.
It is interesting that the MI300X does not perform well on non-causal attention layers, like Mistral 7B v0.1, even with AMD’s internal builds. Mistral uses sliding window attention, which some frontier models also use. It seems that if you want to train a model that doesn’t use causal attention, the AMD MI300X will automatically lose.
While a lot of people put out performance comparisons between hardware platforms, most do not open-source their testing code or make it easily reproducible. We took an open-source approach: we have open-sourced our single-node training benchmark and made it easy to run with only a couple of lines:
Multi-Node Training Performance
For multi-node, we benchmarked two nodes of H100 and two nodes of MI300X. Unfortunately, we didn’t get access to a multi-node H200 deployment in time for the article.
The H100 wins again by a big margin in this benchmark compared to the MI300X, coming in 10-25% faster. This gap widens as you add more nodes working together on a single training workload. This is a known problem, which AMD is attempting to fix next year by deploying its new in-house, AI-focused 400G NIC.
AMD PYTORCH_TUNABLE_OPS FLAG is a Bad User Experience
In order to get AMD training working decently, users need to use PYTORCH_TUNABLE_OPS, an AMD-specific prototype flag that lets the end user tune GEMMs. Since this is a prototype feature (i.e. not stable), a lot of bugs with this feature have cropped up in the past, including but not limited to seg faults, HBM memory leaks, and a whole host of other issues such as many unit tests being disabled. These known tunable-ops bugs have been fixed now, but there are likely many more unknown AMD software bugs.
Furthermore, even if users do not encounter any bugs and the runway is clear for this prototype AMD flag to work, it still takes users anywhere from 1-2 hours to tune any modern LLM model. Although these GEMM results can be cached by the end user, any minor changes to the end user’s code mean the user must spend another 1-2 hours tuning. As you can imagine, this slows down an ML scientist’s iteration cycle when trying to conduct model R&D and ablation experiments.
On Nvidia, this flag isn’t needed as their GEMM library (cuBLASLt) comes tuned out of the box and cuBLASLt’s heuristic model out of the box picks the correct algorithm for most shapes on H100/H200. In contrast, AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.
We recommend that AMD fix their GEMM libraries’ heuristic model so that it picks the correct algorithm out of the box instead of wasting the end user’s time on tuning. Users iterate quickly when doing research, and rerunning tunable ops will therefore slow down research velocity significantly.
Scale Up NVLink/xGMI Topology
Scale up fabric is extremely important for GPU Clusters, as it provides an extremely fast path for tensor and expert parallelism used in frontier model training. For this reason, we have conducted benchmarks to measure scale up fabric performance.
The scale up fabric on H100 and H200 is called NVLink and provides 450GByte/s of bandwidth per GPU and connects 8 GPUs together. On the MI300X, the scale up fabric is called xGMI and, on paper, it connects 8 GPUs, providing 448GByte/s of bandwidth per GPU. On the surface, MI300X’s scale up network is extremely similar and close in performance to that of the H100/H200, providing just 0.5% less on paper bandwidth. Unfortunately, the reality of the situation differs sharply.
First, the MI300X’s xGMI is a point-to-point fabric, which means that it isn’t actually providing 448GByte/s of bandwidth between GPU pairs. Instead, each pair of GPUs can only talk to each other at 64GByte/s. A GPU can only reach the stated 448GByte/s if it addresses all 7 other GPUs simultaneously. That means that, for Tensor Parallelism TP=2, the maximum bandwidth is 64GByte/s, and 189GByte/s for TP=4.
In contrast, since Nvidia’s NVLink uses a switched topology, one GPU can talk to another GPU at the full 450GByte/s. Furthermore, the four NVSwitches in the H100/H200 support in-network reduction (referred to as NVLink SHARP (NVLS) and enabled by default), a technique to reduce data movement by carrying out collectives/reductions inside the switch itself.
All Reduce/All to All/Reduce Scatter/All Gather Collectives Overview
We will showcase benchmarks across scale-up and scale-out networks for both Nvidia’s H100/H200 and AMD’s MI300X. The collectives that we test are the main set of collectives used in frontier LLM training: all_reduce, all_gather, reduce_scatter, and all-to-all. All reduce is used for data parallelism and tensor parallelism, all gather is used for ZeRO/FSDP parallelism (as well as for tensor parallelism), and reduce scatter is used for ZeRO/FSDP parallelism.
Due to the way that compute-communication overlapping works, real-world message sizes range from 16MiB to 256MiB, with the default PyTorch DDP size being 25MiB (NVIDIA’s MLPerf 11,000 H100 GPT-3 175B run used a message size of max 200MiB). We also test 8GiB and 16GiB just to see what the peak bus bandwidth is, though these message sizes are not used in the real world. All these collectives discussed above are used during 3D Parallelism and FSDP/ZeRO Parallelism, which are common techniques for training frontier models.
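To give a rough sense of how such a measurement is taken, here is a minimal PyTorch sketch that times an all_reduce at a real-world message size and converts it to bus bandwidth (our published numbers come from nccl-tests/rccl-tests-style harnesses; the message size and iteration counts here are illustrative):

```python
import os
import torch
import torch.distributed as dist

# Launch with torchrun; the "nccl" backend uses NCCL on Nvidia and RCCL on ROCm builds.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

numel = 256 * 1024 * 1024 // 2  # 256 MiB of bf16 elements
x = torch.randn(numel, dtype=torch.bfloat16, device="cuda")

for _ in range(10):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(50):
    dist.all_reduce(x)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / 50

# Standard all_reduce bus bandwidth: 2 * (n - 1) / n * message_bytes / time
n = dist.get_world_size()
busbw_gbs = 2 * (n - 1) / n * (numel * 2) / (ms * 1e-3) / 1e9
if dist.get_rank() == 0:
    print(f"all_reduce 256MiB: {ms:.2f} ms, busbw {busbw_gbs:.1f} GB/s")
```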
Single Node NCCL Collective
We see that Nvidia does much better than AMD across all the real-world messages for every single collective. This is not surprising due to the H100/H200’s superior 450GByte/s NVLink switched topology with in-network reduction (NVLS), compared to MI300X’s 7x64GByte/s xGMI point-to-point topology.
To reproduce this test, you can use our open-source ClusterMax-NCCL/RCCL benchmark, which we developed to be easily run with one line of Bash. ClusterMax is our upcoming evaluation, covering quantitative performance and qualitative user experience, for ranking H100/B200/GB200/MI300X Neocloud clusters. Look forward to our upcoming “ClusterMax Neocloud Evaluation | How to Rent GPUs” article.
Multi Node RCCL/NCCL Collectives and Scale Out Network Benchmarks
On both Nvidia’s H100/H200 and the MI300X, each GPU is connected to other nodes over the scale-out network using a 400G Network Interface Card (NIC), with one NIC attached directly to each GPU. The H100/H200 reference design typically uses ConnectX-7 NICs for InfiniBand NDR or BlueField-3 for Spectrum-X Ethernet. Spectrum-X is NVIDIA’s custom Ethernet solution purpose-built for AI workloads. On the MI300X, the reference design recommends RoCEv2 Ethernet with the Broadcom Thor-2 NIC.
A typical GPU cluster almost always requires more than a single-tier network, as a single-tier network can only support 128 GPUs (in the case of Broadcom Ethernet or Nvidia Spectrum-X Ethernet) or 64 GPUs (for H100/H200 InfiniBand). In such a multi-tier network, deployments typically use an 8-rail optimized fat tree, where each one of the 8 GPUs in a server is connected to a separate switch (such a connection is called a “rail”). In our AI Neocloud Playbook and Anatomy article, we explained in detail how a rail-optimized network works.
Just as Nvidia’s NVLink offers NVLS for its scale-up network, Nvidia’s H100/H200 InfiniBand scale out network also offers InfiniBand SHARP In-network Reduction which is, again, exclusive to Nvidia. AMD does not have an analogous product for the MI300X. InfiniBand SHARP works similarly to NVLink SHARP In-network Reduction as they both provide a way to reduce the amount of traffic going through the network, with the reductions carried out inside of Quantum-2 InfiniBand switches in the case of InfiniBand SHARP.
Unfortunately, unlike NVLink SHARP, which is enabled by default, InfiniBand SHARP is not enabled by default in the UFM/IB subnet manager. We have spoken to many Neoclouds, H100 cluster operators, and AI frontier labs, and most have said that they have not enabled SHARP due to increased NCCL_TIMEOUT rates and difficulties installing and configuring the network. We asked NVIDIA which AI customers use InfiniBand SHARP, but they declined to answer in specifics. One could speculate that if InfiniBand SHARP was useful in AI production workloads, NVIDIA marketing would shout at the top of their lungs to promote its successful deployment. Given the apparently limited adoption of InfiniBand SHARP for now, we show here collective performance for Nvidia both when SHARP is and is not enabled.
For some of the benchmarks, we have also collected Nvidia Spectrum-X Ethernet data on an Nvidia internal cluster called Israel-1. Nvidia Spectrum-X is used in xAI’s 200k H100/H200 cluster and can support clusters up to 100k GPUs in the Spectrum-X reference architecture version 1.2, but could potentially support up to 512k GPUs with a non-reference custom design.
We are also in the process of testing Google Cloud (GCP) H100’s in-house ethernet, as well as AWS’ H100 and H200s that are deployed on AWS’s in-house Ethernet (called EFAv2/EFAv3). We will be sharing the results in our upcoming “Collective Deep Dive” article, which will provide visualizations of the different types of collectives, explain the different NCCL protocols (SIMPLE, LL, LL128), different NCCL algorithms (NVLS, NVLSTREE, RING, TREE, COLNETDIRECT, COLNETCHAIN, PAT), and how collectives run on GCP H100 Ethernet, AWS H100/H200 EFA, InfiniBand H100, Spectrum-X, etc.
Below we show a 32 GPU all reduce collective test. You can see that MI300X RoCEv2 is in last place compared to normal InfiniBand H100 and InfiniBand H100 with SHARP enabled. Simply put, poor all reduce performance leads to poor scale-out training.
The MI300X’s performance decreases if you scale out (i.e. increase) the number of GPUs participating in a collective. As you can imagine, modern frontier training is carried out on clusters of at least 100,000 GPUs. MI300X RoCEv2 runs at half the speed for all the real-world message sizes of 16MiB to 256MiB when compared to the baseline of InfiniBand Non-SHARP. As per the chart below, Nvidia Spectrum-X Ethernet performance is quite close to InfiniBand Non-SHARP’s performance, due to Spectrum-X’s vertical integration with the NCCL collective library as well as its use of good congestion control and adaptive routing. AMD is attempting to vertically integrate next year with their upcoming Pollara 400G NIC, which supports Ultra Ethernet, hopefully making AMD competitive with Nvidia. As always, Nvidia is not standing still and by late next year, it will be ready to go into production with its 800G ConnectX-8 NICs, which provide a line rate twice as fast as AMD’s Pollara NIC.
AMD’s RCCL is a fork of Nvidia’s NCCL. AMD’s RCCL team, like many other teams at AMD, is resource limited and doesn’t have enough compute or headcount to improve the AMD ecosystem. AMD’s RCCL team currently has stable access to fewer than 32 MI300Xs for R&D, which is ironic, as improving collective operations is all about having access to many GPUs. This is frankly silly; AMD should spend more so that its software teams have access to more GPUs.
This contrasts with Nvidia’s NCCL team, which has access to R&D resources on Nvidia’s 11,000 H100 internal EOS cluster. Furthermore, Nvidia has Sylvain Jeaugey, who is the subject matter expert on collective communication. There are a lot of other world class collective experts working at Nvidia as well, and, unfortunately, AMD has largely failed to attract collective library talent due to less attractive compensation and resources – as opposed to engineers at Nvidia, where it is not uncommon to see engineers make greater than a million dollars per year thanks to appreciation in the value of RSUs.
To help alleviate these issues, TensorWave and SemiAnalysis are currently working with the AMD RCCL team to improve collective performance. TensorWave has generously sponsored a medium-sized AMD cluster to help the RCCL team have greater resources to do their jobs. The fact that TensorWave, after buying many GPUs, has to give AMD GPU time so that AMD can fix its own software is insane.
Another trend to notice is that for non-SHARP networks, all reduce collective’s speed will reduce logarithmically as you double the number of GPUs. In contrast, with SHARP, the speed/completion time stays the same. We have results for up to 1,024 H100s showing that IB SHARP all reduce is constant time across any number of GPUs in a collective. We will publish this in our upcoming “Collective Deep Dive” article.
For all gather, all to all, and reduce scatter collectives, MI300X is anywhere from 2-4 times slower than InfiniBand. Unfortunately, we did not have access to Spectrum-X or InfiniBand SHARP benchmark data for all gather or reduce scatter.
Below, we provide our nccl/rccl benchmarking script. Unfortunately, due to the nature of cluster-specific setups, it is not as simple as a one-liner. It does require you to follow the README.md of nccl/rccl and nccl-tests/rccl-tests to run properly. On AWS and Google Cloud, there may also be custom nccl adapters that you will need to install.
AMD’s User Experience is Suboptimal and the MI300X is Not Usable Out of the Box
Due to poor internal testing (i.e. “dogfooding”) and a lack of automated testing on AMD’s part, the MI300X is not usable out of the box and requires considerable amounts of work and tuning. In November 2024 at AMD’s “Advancing AI”, AMD’s SVP of AI stated that there are over 200k tests running every evening internally at AMD. However, this seems to have done little to ameliorate the many AMD software bugs we ran into, and we doubt AMD’s CI/CD tests include proper performance regression testing, or functional and convergence/numerics testing. We will outline a few examples here for readers to understand the nature of the AMD software bugs we have encountered and why we feel they have been very obstructive to a good user experience on AMD.
Although AMD’s own documentation recommends using PyTorch native Flash Attention, for a couple months this summer, AMD’s PyTorch native Flash Attention kernel ran at less than 20 TFLOP/s, meaning that a modern CPU would have calculated the attention backwards layer faster than an MI300X GPU. For a time, basically all Transformer/GPT model training using PyTorch on the MI300X ran at a turtle’s pace. Nobody at AMD noticed this until a bug report was filed following deep PyTorch/Perfetto profiling showing the backwards pass (purple/brown kernels) took up far more time than the forward pass (dark green section). Normally, the backwards section should take up just ~2x as much time as the forward pass (slightly more if using activation checkpointing).
Another issue we encountered was that the AMD PyTorch attention layer led to a hard error when used with torch.compile due to the rank of the logsumexp tensor being incorrect. What was frustrating is that this had already been fixed in internal builds of AMD PyTorch on May 30th, but did not reach any AMD PyTorch distributions or even any PyTorch nightly builds until October, when we pointed out the bug to them. This demonstrates a lack of testing and dogfooding of the packages AMD puts out to the public. Another core reason for this problem is that the lead maintainer of PyTorch (Meta) does not currently use the MI300X internally for production LLM training, leading to code paths not used internally at Meta being buggy and not dogfooded properly. We believe AMD should partner with Meta to get their internal LLM training working on the MI300X.
On August 8th, Horace He and the Meta PyTorch team released FlexAttention, a critical API for creating non-causal attention layers without losing speed. Previously, to use attention variants like document masking, sliding window attention, softcap, and ALiBi, a user would need to spend weeks handcrafting their own kernel in the CUDA/HIP language and subsequently pybinding it to PyTorch. With FlexAttention, a user can quickly generate all these attention variants using the API. FlexAttention achieves great performance through block sparsity: it only calculates the blocks of the mask that are needed and ignores the rest.
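As an illustration of how little user code an attention variant now takes, here is a sliding-window-causal mask written with FlexAttention (requires a recent PyTorch with torch.nn.attention.flex_attention; the window size and shapes are illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

WINDOW = 1024  # illustrative sliding-window size

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Keep only causal positions within the last WINDOW tokens
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

B, H, S, D = 1, 16, 8192, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# The block mask is what enables block sparsity: fully masked blocks are skipped entirely.
block_mask = create_block_mask(sliding_window_causal, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```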
With sliding window attention, FlexAttention can improve performance by 10-20x! This is amazing for the end user, but unfortunately, MI300X FlexAttention was in a poor state and suffered from numerous AMD software bugs (including convergence issues) until just a couple of days ago. While the latest PyTorch nightly now has fixes for the convergence issues, this contrasts starkly with FlexAttention on Nvidia, which has been available since August. That means a ~6-month gap exists between the availability of these fantastic PyTorch features on Nvidia’s and AMD’s platforms. For frontier AI labs, six months is a lifetime, with OpenAI, Anthropic, and Google having released numerous models in such a span.
Exploring Ideas for Better Performance on AMD
AMD recommended we try PYTORCH_TUNABLE_OPS to improve GEMM performance by sweeping through GEMM algorithms at runtime. However, as we mentioned earlier, this API works poorly because GEMMs should be tuned when compiling hipBLASLt/rocBLAS/cuBLASLt and not during the user’s runtime. Users of Nvidia H100s do not need to use PYTORCH_TUNABLE_OPS for most shapes because the cuBLAS heuristic model picks the correct algorithm. This contrasts with AMD’s heuristic model, which never seems to pick the correct algorithm for most shapes. We recommend that AMD stop suggesting that users try tunable ops and instead focus on properly tuning their GEMM libraries internally.
When we tried PYTORCH_TUNABLE_OPS on AMD, it led to an HBM memory leak of over 25 GBytes out of the MI300X’s total capacity of 192GBytes, essentially wiping out the MI300X’s HBM capacity advantage over the H100. The fix for this is to set a default hipBLASLt and rocBLAS workspace size to prevent memory leaks.
As we mentioned earlier in this article, another issue we ran into was that there was a plethora of environment flags needed on MI300X to make it actually usable. We recommend to AMD that they stop putting users in the position of having to set these environment flags themselves and, instead, set default flags that lead to a usable environment. It is not simply their number, but also the complex interactions between the flags, making troubleshooting difficult. Getting reasonable training performance out of AMD MI300X is an NP-Hard problem.
Another issue is that certain AMD ROCm libraries could not be installed inside Docker due to AMD software CMake bugs leading to hard errors. This has since been fixed. On AMD GPUs, you also need to pass in a convoluted set of flags to get the GPUs to work inside a container, whereas on Nvidia GPUs, getting Docker to see the GPUs is as simple as passing in “--gpus=all”. We recommend to AMD that they partner with Docker and ensure that Docker can autodetect AMD GPUs as well, making the workflow as streamlined as when working with Nvidia GPUs.
AMD’s Forked Libraries
Many of AMD’s libraries are forked off Nvidia’s open-source or ecosystem libraries. AMD uses a tool called Hipify to carry out source-to-source translation of Nvidia CUDA to AMD HIP. While the motivation is understandable, they are nevertheless building on top of their competitor’s platform and cannot expect to match or surpass Nvidia’s user experience with this software development strategy. They need to contribute their own software to the AMD ecosystem. For example, instead of supporting FP8 training by forking Nvidia/TransformerEngine and doing source-to-source translation, they should work to make PyTorch-native FP8 training perform well on their own hardware. Currently, PyTorch-native FP8 training recipes don’t work on AMD, the unit tests don’t even pass, and there is no CI/CD for AMD PyTorch-native FP8 training.
Detailed Recommendations to AMD on How to Fix Their Software
First, AMD needs to focus on attracting more software engineering resources and improving compensation for current engineers. The current compensation gap between AMD and Nvidia means that top talent is lured to Nvidia over AMD. This top talent is also attracted to Nvidia as it has far more compute/resources for engineers. AMD should procure more GPUs for their in-house development work and submit an MLPerf GPT3 175B result as soon as possible. Even if the result is not competitive with Nvidia right now, submitting such a benchmark will kick off the process for iterative improvement.
We also notice that AMD frequently gives their customers custom images, and, in fact, AMD developers themselves often work on top of such bespoke images. This is not best practice, as this means that AMD engineers have a different experience vs. images available to the public. AMD should instead lift the standard of public images by using these images internally and with its customers, and the AMD executive team should personally internally test (i.e. “dogfood”) what is getting shipped publicly.
We recommend that AMD create a public dashboard that runs every night, showing the performance of their hardware on benchmarks such as MLPerf or TorchBench. This dashboard should also include H100/H200 performance as a baseline.
Finally, AMD needs to completely transform its approach to environment flags. Instead of requiring users to set a myriad of flags to get running out of the box, it should set them to recommended defaults so users can get started quickly.
AMD should collaborate with Meta to get production training workloads working on ROCm, as it is well-known amongst PyTorch users that PyTorch code paths tend to have tons of bugs unless Meta uses it internally. Meta currently hand writes HIP Kernels for their production MI300X inferencing but does not use MI300X for real training. It would be a fantastic improvement for the AMD ecosystem, and a marketing victory, if a smaller version of the next Llama is trained on AMD. Not to mention that this would open the door to AMD progressively moving towards larger models/clusters with Meta. Meta using AMD GPUs for actual model training would be a win-win for both companies as Meta is also looking for alternative training chips to Nvidia.
Currently, Nvidia offers well over 1,000 GPUs for continuous improvement and development of PyTorch externally, and many more internally. AMD doesn’t. AMD needs to work with an AMD-focused GPU Neocloud to have ~10,000 GPUs of each generation for internal development purposes and PyTorch. This will still be 1/8th of what Nvidia has with its coming huge Blackwell clusters, but it’s a start. These can be dedicated to internal development and CI/CD for PyTorch.
Lisa, we are open to a meeting on how to fix AMD’s Datacenter GPU User Experience for the better!
H100/H200/MI300X Networking BoM Analysis and Performance per TCO
In addition to our benchmarking of collectives and GEMM throughput, we have conducted several experiments exploring insightful topics for conducting further benchmarks and running real-world workloads on clusters. These experiments cover benchmarking warmup and repeat effects, VBoost Power Shifting, MLPerf Training GPT-3, BF16 vs FP16 throughput, throughput by GEMM input distribution, power per FLOP, and throughput for the PyTorch PyPi distribution vs Nvidia NGC Stable PyTorch images.
We also present a detailed networking bill of materials (BoM) analysis for the 1k GPU Ethernet, 1k GPU InfiniBand, 16k GPU Ethernet, and 16k GPU InfiniBand clusters. We also discuss the impact of using 51.2T Radix vs. 25.6T Radix switches for back-end networking.
Lastly – we present a performance per TCO analysis that shows how the H100/H200/MI300X stacks up in terms of $/hr per effective training petaflop. These items are available below to all SemiAnalysis subscribers and will be of great interest to datacenter operators, ML scientists, and investors.