The News: Google has released a series of storage enhancements targeted at supporting AI and machine learning (ML) workloads. The announcement came at Google Cloud Next '24 as part of a larger announcement around Google's AI Hypercomputer. The storage updates include enhancements to Cloud Storage FUSE, Parallelstore, and Hyperdisk ML. More information about the AI Hypercomputer announcements can be found here.
Analyst Take: As part of its series of AI Hypercomputer enhancements, Google announced multiple storage updates aimed at AI and ML workloads. The new storage updates focus on maximizing GPU and TPU utilization to accelerate model training. The announcement includes the following product updates:
- Cloud Storage FUSE: Google announced new caching capabilities for Cloud Storage FUSE. Cloud Storage FUSE lets you mount and access Cloud Storage buckets as local file systems so that you can read and write objects using standard file system protocols (a minimal sketch of this file-style access appears after this list). Caching will improve access time to the buckets, though customers may look to other file systems to provide the very high speeds needed for training. Google claims the new caching functionality improves training performance by 2.9x and improves serving performance of Google's foundation models by 2.2x.
- Parallelstore: Google added caching to its parallel file system, which is targeted at scratch storage use cases. Parallelstore is based on DAOS (Distributed Asynchronous Object Storage), a distributed key-value store architecture written for NVMe technology. DAOS was originally designed around storage-class memory (SCM, i.e., Intel Optane), and the new caching will provide faster access. Parallelstore and its caching capability are still in preview. Google claims it can provide up to 3.9x faster training times and up to 3.7x higher training throughput compared with native ML framework data loaders.
- Hyperdisk ML: Google is introducing Hyperdisk ML, a block storage solution targeted at supporting AI inference workloads. This would be the fourth Hyperdisk offering, though details on what differentiates the ML version are not yet available. Still, Google claims Hyperdisk ML can provide up to 12x faster model load times and expects it to outperform Microsoft Azure Ultra Disk and Amazon EBS io2 Block Express in performance and throughput.
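To make the Cloud Storage FUSE point concrete, here is a minimal Python sketch of what file-style access to a bucket looks like once it has been mounted with the gcsfuse CLI. The bucket name, mount point, directory layout, and file pattern are illustrative assumptions, not details from Google's announcement; the sketch simply shows why existing file-based training pipelines work unchanged against a mounted bucket.

```python
# Minimal sketch of reading training data through a Cloud Storage FUSE mount.
# Assumes the bucket was mounted beforehand with the gcsfuse CLI, e.g.:
#   gcsfuse my-training-bucket /mnt/gcs   # bucket and mount point are hypothetical
# Once mounted, objects appear as plain files, so no GCS client library is needed.
from pathlib import Path

MOUNT_POINT = Path("/mnt/gcs")  # hypothetical mount point for the bucket

def iter_shards(prefix: str, pattern: str = "*.tfrecord"):
    """Yield the raw bytes of each data shard under a bucket 'directory'.

    With caching enabled at the FUSE layer, repeat passes over the same
    shards (typical across training epochs) can be served locally instead
    of being re-fetched from Cloud Storage on every read.
    """
    for shard in sorted((MOUNT_POINT / prefix).glob(pattern)):
        yield shard.read_bytes()

if __name__ == "__main__":
    for raw in iter_shards("datasets/train"):  # hypothetical object prefix
        pass  # hand the bytes to the framework's input pipeline
```

Because the mount presents ordinary file semantics, existing data loaders need no code changes, which is exactly why caching at this layer translates directly into the epoch-over-epoch training gains Google is claiming.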
The overall announcement shows Google optimizing for AI and ML across several layers of hardware and software. In the storage announcements specifically, the emphasis is on caching and keeping data close to compute resources to maximize training performance. Caching is certainly not a new concept, but it takes on extra significance for AI training: training models is a time-consuming process that relies on expensive, compute-intensive resources, so keeping data near those resources and maximizing their utilization becomes a key priority when setting storage requirements for AI.
Feeding data to GPUs fast enough to keep them busy during training has long been a bottleneck for traditional HPC/AI practitioners. Many workarounds have been tried to address the I/O problem, including process changes, code adjustments, and offloading to xPUs. We expect the problem to intensify as organizations take on more data and train models for their specific use cases. Google is actively addressing these needs and looks to help customers further optimize their AI training.
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.
Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.
Other Insights from The Futurum Group:
2023 Cloud Downtime Incident Report
Google Cloud Launches Axion and Enhances AI Hypercomputer
Public Cloud Storage Catered to AI Data in 2023