
Amazon SageMaker HyperPod Claims 40% Reduction in AI Training Time

The News: At re:Invent, Amazon Web Services (AWS) announced the launch of Amazon SageMaker HyperPod. Now generally available, Amazon SageMaker HyperPod is a tool that aims to optimize AI infrastructure for training foundation models. AWS says the solution can reduce training time by up to 40%. Here are the key details:

  • Automatic cluster health check and repair. If an instance becomes defective during a training workload, SageMaker HyperPod automatically detects the fault and swaps the faulty node for a healthy one. To catch faulty hardware, SageMaker HyperPod regularly runs an array of health checks for graphics processing unit (GPU) and network integrity (a minimal sketch of this pattern follows the list).
  • Streamlined distributed training for large training clusters. SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, allowing users to automatically split their models and training datasets across AWS cluster instances to help them efficiently scale training workloads.
  • Optimized utilization of a cluster’s compute, memory, and network resources. Amazon SageMaker distributed training libraries optimize training jobs for AWS network infrastructure and cluster topology through two techniques: model parallelism and data parallelism. Model parallelism splits models too large to fit on a single GPU into smaller parts that are distributed across multiple GPUs for training; data parallelism splits large datasets across workers so they can be trained on concurrently, improving training speed (both are sketched below).
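
To make the resiliency idea concrete, here is a minimal sketch of the detect-and-swap pattern the first bullet describes. It is illustrative only: the function names (check_gpu_health, resume_from_checkpoint), the probe cadence, and the spare-pool mechanics are assumptions, not SageMaker HyperPod APIs.

```python
import random
import time

HEALTH_CHECK_INTERVAL_S = 60  # assumed probe cadence; AWS does not publish one

def check_gpu_health(node: str) -> bool:
    """Stand-in for the GPU and network integrity probes the article describes."""
    return random.random() > 0.01  # simulate an occasional hardware fault

def resume_from_checkpoint(nodes: list[str]) -> None:
    """Stand-in for restarting the training job from its last saved checkpoint."""
    print(f"Resuming training across {len(nodes)} healthy nodes")

def monitor_cluster(nodes: list[str], spares: list[str]) -> None:
    """Detect faulty nodes, swap in healthy spares, and resume the job."""
    while True:
        for node in list(nodes):
            if not check_gpu_health(node) and spares:
                nodes.remove(node)
                replacement = spares.pop()
                nodes.append(replacement)
                print(f"Swapped faulty node {node} for spare {replacement}")
                resume_from_checkpoint(nodes)
        time.sleep(HEALTH_CHECK_INTERVAL_S)
```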
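And a naive model-parallel sketch in generic PyTorch, assuming a machine with at least two GPUs: a model too large for one device is split into stages placed on different GPUs, with activations handed off between them. This is plain PyTorch for illustration; SageMaker's model-parallel library automates the partitioning rather than requiring manual device placement.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """A model split across two GPUs; each stage lives on its own device."""
    def __init__(self) -> None:
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))  # move activations to the second GPU

model = TwoStageModel()
logits = model(torch.randn(32, 1024))  # loss and backward proceed as usual
```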

Read the announcement on the launch of Amazon SageMaker HyperPod on the AWS website.

Analyst Take: AI training and inference workloads are a significant cost issue that, if left unaddressed, could stymie enterprise adoption of generative AI. AWS’s VP of AI Matt Wood says Amazon SageMaker HyperPod reduces the time to train foundation models by up to 40% and cuts the deployment costs associated with AI inference by an average of 50%. Has AWS found a unique solution? What will the impact of SageMaker HyperPod be? Here are my thoughts.

Some Elements Seem Replicable

AWS posted testimonials from Perplexity AI, Hugging Face, and Thomson Reuters, all of which tested SageMaker HyperPod. Surprisingly, the key driver of the efficiency gains, particularly for AI training, appears to be AWS’s automated monitoring of compute clusters. According to Hugging Face, “SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failure and quickly replace the faulty hardware without disrupting ongoing model training… this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.”

Perplexity said, “Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure.”

Automating this type of functionality is something other providers can do, and may already be doing. The challenge with claims of efficiency gains is understanding the context behind the metric so that comparisons can be made across providers.

Potential Differentiation

However, one area where AWS may have a differentiator is its distributed training libraries. According to Microsoft, “In distributed training, the workload to train a model is split up and shared among multiple worker nodes. These work in parallel to speed up model training.”
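As a generic illustration of the data-parallel half of this picture, the sketch below uses PyTorch’s built-in DistributedDataParallel rather than SageMaker’s own library: each worker holds a full model replica, trains on its own shard of the data, and gradients are averaged across workers automatically during the backward pass.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main() -> None:
    # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
    dist.init_process_group("nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Every worker holds a full replica; DDP syncs gradients across workers.
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank),
                device_ids=[local_rank])

    # DistributedSampler gives each worker a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(10_000, 128),
                            torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()      # gradient all-reduce happens here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The communication step inside backward() is the kind of operation that, per the announcement, SageMaker’s libraries tune for AWS network infrastructure and cluster topology.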

Perplexity reported, “SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast ….” Thomson Reuters said the distributed training libraries helped them “improve large scale model training performance.”

AWS’s distributed training libraries could be better than its competitors’, or, more likely, the libraries are more easily accessed and integrated into the workload stack.

Impact and Conclusions

There is a lot of focus and effort on reducing the cost of AI workloads, spanning hardware, middleware, and software. However, Amazon SageMaker HyperPod sends an important message to the marketplace: AWS, as a cloud compute provider, is interested in doing what it can to reduce AI workload costs. AWS appears to have a competitive advantage at this stage with SageMaker HyperPod, but other cloud providers will likely develop similar ways to improve AI workload efficiency, or begin to talk about the ways they are already doing so. For the time being, in this formative generative AI era, AWS could gain a competitive advantage and potentially sway some enterprises to switch, or at the very least to test some AI workloads on AWS.

Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.

Other insights from The Futurum Group:

AWS re:Invent: AWS Unveils Next-Generation Graviton, Trainium Chips

Amazon CodeWhisperer and MongoDB Collaborate

AWS, Microsoft, and Google Cloud: Tying Up LLMs

Author Information

Mark comes to The Futurum Group from Omdia’s Artificial Intelligence practice, where his focus was on natural language and AI use cases.

Previously, Mark worked as a consultant and analyst providing custom and syndicated qualitative market analysis with an emphasis on mobile technology, identifying trends and opportunities for companies like Syniverse and ABI Research. He has been cited by international media outlets including CNBC, The Wall Street Journal, Bloomberg Businessweek, and CNET. Based in Tampa, Florida, Mark is a veteran market research analyst with 25 years of experience interpreting the technology business and holds a Bachelor of Science from the University of Florida.
