The News: At re:Invent, Amazon Web Services (AWS) announced the launch of Amazon SageMaker HyperPod. Now generally available, SageMaker HyperPod aims to optimize AI infrastructure for training foundation models, and AWS says it can reduce training time by up to 40%. Here are the key details:
- Automatic cluster health check and repair. If an instance becomes defective during a training workload, SageMaker HyperPod automatically detects and swaps faulty nodes with healthy ones. To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for graphics processing unit (GPU) and network integrity.
- Streamlined distributed training for large training clusters. SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, allowing users to automatically split their models and training datasets across AWS cluster instances to help them efficiently scale training workloads.
- Optimized utilization of a cluster’s compute, memory, and network resources. Amazon SageMaker distributed training libraries optimize training jobs for AWS network infrastructure and cluster topology through two techniques: data parallelism and model parallelism. Model parallelism splits models that are too large to fit on a single GPU into smaller parts and distributes those parts across multiple GPUs for training, while data parallelism splits large datasets into shards that are trained on concurrently to improve training speed (a minimal illustration of model parallelism follows this list).
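To make the model parallelism concept concrete, here is a minimal, generic sketch in PyTorch that splits a network’s layers across two GPUs and moves activations between them during the forward pass. This illustrates the general idea only; it is not the SageMaker distributed training libraries, and the layer sizes, device names, and batch shape are placeholder assumptions.

```python
# Toy model parallelism: the model itself is split across two GPUs, so no
# single device has to hold all of the parameters. (Generic PyTorch sketch,
# not the SageMaker distributed training libraries.)
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # First half of the network lives on GPU 0, second half on GPU 1.
        self.stage1 = nn.Sequential(nn.Linear(128, 512), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = self.stage1(x.to("cuda:0"))      # compute the first stage on GPU 0
        return self.stage2(hidden.to("cuda:1"))   # hand activations to GPU 1

model = TwoStageModel()
logits = model(torch.randn(32, 128))  # output lands on cuda:1
```

Data parallelism is the complementary technique: rather than splitting the model, each GPU keeps a full replica and trains on a different slice of the data.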
Read the announcement on the launch of Amazon SageMaker HyperPod on the AWS website.
Amazon SageMaker HyperPod Claims 40% Reduction in AI Training Time
Analyst Take: AI training and inference workloads are a significant cost issue, and if not addressed, those costs could stymie the growth of enterprise generative AI use. AWS’s VP of AI, Matt Wood, says Amazon SageMaker HyperPod reduces the time to train foundation models by up to 40% and reduces the deployment costs associated with AI inference by an average of 50%. Has AWS found a unique solution? What will the impact of SageMaker HyperPod be? Here are my thoughts.
Some Elements Seem Replicable
AWS posted testimonials from Perplexity AI, Hugging Face, and Thomson Reuters, all of whom tested SageMaker HyperPod. Surprisingly, the key element to the efficiency gains, particularly for AI training, seems to be AWS’s automated monitoring of compute clusters. According to Hugging Face, “SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failure and quickly replace the faulty hardware without disrupting ongoing model training… this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.”
Perplexity said, “Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure.”
Automating this type of functionality is something other providers can do, and perhaps are already doing. The challenge with claims of efficiency gains is understanding the context behind the metric so that comparisons can be made across providers.
Potential Differentiation
However, an area where AWS may have a differentiator is in its distributed training libraries. According to Microsoft, “In distributed training, the workload to train a model is split up and shared among multiple worker nodes. These work in parallel to speed up model training.”
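To ground that definition, below is a rough, generic sketch of data-parallel distributed training using plain PyTorch’s DistributedDataParallel rather than AWS’s own libraries: each worker process drives one GPU, holds a full model replica, and trains on its own shard of the dataset, with gradients averaged across workers every step. The model, dataset, launch command, and hyperparameters are placeholder assumptions.

```python
# Generic data-parallel training loop with PyTorch DDP (not the SageMaker
# libraries). Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main() -> None:
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset; each rank holds a full replica.
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(dataset)            # each rank sees its own shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for features, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features.cuda(local_rank)), labels.cuda(local_rank))
        loss.backward()                              # gradients all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The cluster-level work that SageMaker HyperPod layers on top of a loop like this, such as provisioning nodes, monitoring their health, and resuming jobs after hardware failures, sits outside the training code itself.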
Perplexity reported, “SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast ….” Thomson Reuters said the distributed training libraries helped them “improve large scale model training performance.”
AWS’s distributed training libraries could be better than those of its competitors, or, more likely, the libraries are more easily accessed and integrated into the workload stack.
Impact and Conclusions
There is a lot of focus and effort on reducing the cost of AI workloads, ranging across hardware, middleware, and software. Amazon SageMaker HyperPod sends an important message to the marketplace that AWS, as a cloud compute provider, is interested in doing what it can to reduce AI workload costs. AWS appears to have a competitive advantage at this stage with SageMaker HyperPod, but the other cloud providers will likely develop similar ways to improve AI workload efficiency, or begin to talk about ways they are already doing so. For the time being, in this formative generative AI era, AWS could press that advantage and potentially sway some enterprises to switch, or at the very least test some AI workloads on AWS.
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.
Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.
Other insights from The Futurum Group:
AWS re:Invent: AWS Unveils Next-Generation Graviton, Trainium Chips
Amazon CodeWhisperer and MongoDB Collaborate
AWS, Microsoft, and Google Cloud: Tying Up LLMs
Author Information
Mark comes to The Futurum Group from Omdia’s Artificial Intelligence practice, where his focus was on natural language and AI use cases.
Previously, Mark worked as a consultant and analyst providing custom and syndicated qualitative market analysis with an emphasis on mobile technology and identifying trends and opportunities for companies like Syniverse and ABI Research. He has been cited by international media outlets including CNBC, The Wall Street Journal, Bloomberg Businessweek, and CNET. Based in Tampa, Florida, Mark is a veteran market research analyst with 25 years of experience interpreting the technology business and holds a Bachelor of Science from the University of Florida.