
AMD’s Bid Against Nvidia Derailed by Incomplete Training Software, Claims TensorWave CEO


It took some doing to fit 8,192 liquid-cooled MI325Xs into a single AI training cluster.


Interview

After some teething pains, TensorWave CEO Darrick Horton is confident that AMD’s Instinct accelerators are ready to take on large-scale AI training.

“There were certainly some challenges with the initial version of the product,” Horton, who runs a GPU rental service and was one of the early backers of the Nvidia rival, told The Register. “The underwhelming performance of AMD for training in 2024 has been widely publicized.”

In December 2023, AMD unveiled the MI300X, claiming 30% better performance than Nvidia’s H100 GPU, along with more than double the memory capacity and a 60% increase in bandwidth. On paper, those advantages should have let the chip comfortably outpace Nvidia’s Hopper GPUs.

The chip’s larger, faster memory (192GB of HBM3) offered clear benefits for inference, allowing bigger models to run on fewer GPUs than an equivalent H100 deployment. In early testing, however, Horton said performance fell short specifically for training.

He says the problem was less AMD’s hardware than its software.

“Building a training stack is a significantly greater challenge than building an inference stack,” Horton said. “Their hardware performs exceptionally well right out of the gate… but their software was somewhat lagging when they first got going.”

As a result, TensorWave initially focused on inference, first with a bare-metal offering and later a managed service. Over the past 12 to 16 months, however, Horton says AMD’s software stack has matured to the point where the company is ready to expand into large-scale AI training.

Although much of the attention around generative AI has shifted from training to inference over the past year, training still matters, he said. “Training will continue in perpetuity, especially as we discover new techniques and fundamental shifts in how we architect AI systems, which essentially restarts the process.”

“We expect several more of those foundational shifts to happen in the next few years,” he added.

This week, the GPU cloud operator secured a $100 million Series A funding round backed by Magnetar, AMD Ventures, Prosperity7, Maverick Silicon, and Nexus Venture Partners.

Part of that money will go toward growing TensorWave’s workforce, but most of it will fund an AI training cluster built around 8,192 of AMD’s top-specced MI325X GPUs, which collectively deliver roughly 21 exaFLOPS of sparse FP8 compute.
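For a back-of-the-envelope check of that figure, here is a minimal sketch assuming AMD’s published peak of roughly 2,614.9 teraFLOPS of sparse FP8 per MI325X:

```python
# Back-of-the-envelope cluster compute, assuming AMD's published peak of
# ~2,614.9 teraFLOPS of sparse FP8 per MI325X (dense FP8 is half that).
SPARSE_FP8_TFLOPS_PER_GPU = 2_614.9  # peak, with structured sparsity
NUM_GPUS = 8_192

total_tflops = NUM_GPUS * SPARSE_FP8_TFLOPS_PER_GPU
total_exaflops = total_tflops / 1_000_000  # 1 exaFLOP = 10^6 teraFLOPS

print(f"~{total_exaflops:.1f} exaFLOPS of sparse FP8")  # ~21.4 exaFLOPS
```

Bear in mind these are theoretical peak numbers; sustained training throughput is invariably a good deal lower.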

Unveiled last fall, AMD’s MI325X is essentially a bandwidth-boosted version of the original MI300X. It boasts the same floating-point performance as its predecessor, but swaps the 192GB of HBM3 for 256GB of faster HBM3e memory, good for 6TB/s of bandwidth.

In order to extract the full performance of the 1,000-watt accelerators, TensorWave has opted for direct liquid cooling right out of the gate. If you’re not familiar, this technology replaces large, heavy heat sinks with small cold plates through which warm or chilled coolant is pumped.

“Technically, this current generation can be air-cooled. Some others have opted for that, but it’s not a wise choice and they’re going to be at a disadvantage,” Horton said. “In the next generation, air cooling will mean compromised performance. And in the generation after that, it likely won’t even be an option.”

That transition is already underway to some degree with Nvidia’s latest crop of accelerators. While the GPU giant still offers air-cooled options, its top-specced NVL72 systems require liquid cooling to cope with their 120-kilowatt-per-rack power draw.

Although AMD doesn’t yet offer an equivalent rack-scale solution, its upcoming Instinct MI400-series accelerators, due in 2026, are designed with that segment in mind.

These rack-scale systems typically stitch together dozens of GPUs with high-speed interconnects, either Nvidia’s proprietary NVLink or, eventually, the open UALink standard backed by AMD, Intel, and others, so that they behave like one big accelerator. Such configurations offer several advantages, particularly for running or training very large models, such as Meta’s forthcoming Llama 4 Behemoth, said to weigh in at nearly 2 trillion parameters, of which around 288 billion are active at any given time.
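To see why models of that scale push buyers toward tightly coupled multi-GPU systems, a rough memory estimate helps. The sketch below assumes FP8 weights at one byte per parameter and the MI325X’s 256GB of HBM3e, and deliberately ignores KV cache, activations, and optimizer state, all of which add substantially more:

```python
import math

# Rough weight-only memory footprint for a ~2-trillion-parameter model,
# assuming FP8 storage (1 byte per parameter). KV cache, activations,
# and (for training) optimizer state add substantially more on top.
TOTAL_PARAMS = 2e12      # ~2 trillion parameters (Behemoth-class)
BYTES_PER_PARAM = 1      # FP8
GPU_MEMORY_GB = 256      # HBM3e capacity of a single MI325X

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
min_gpus = math.ceil(weights_gb / GPU_MEMORY_GB)

print(f"Weights alone: ~{weights_gb:,.0f}GB -> at least {min_gpus} GPUs")
```

Even this lower bound of eight accelerators just to hold the weights goes some way to explaining the appeal of a rack that behaves like one enormous GPU.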

“Some customers absolutely need NVL72, and those customers are hard to win right now,” he said. “But many customers don’t have a specific need for it. And when you factor in cost, AMD still comes out ahead on total cost of ownership.”

Horton expects the playing field to level out once AMD’s rack-scale systems arrive. TensorWave is planning several additional AI clusters in the near term, and at least some of that capacity is likely to use AMD’s upcoming MI355X accelerators, due to launch next month.

TensorWave’s overall playbook, however, remains the one it set out last year. So far, much of its compute has been acquired using venture capital. Going forward, the company plans to follow the model employed by CoreWeave and others, using its trove of GPUs as collateral to raise large sums of debt financing.
