The News: Groq recently held its GroqDay event, where the company gave updates on product areas such as its Groq Compiler, software-hardware co-design, and its progress on graph neural networks. The company’s goal is to reduce both the complexity and cost of artificial intelligence while speeding up the process for its users. Learn more about GroqDay via its on-demand virtual event.
GroqDay: Groq Sets its Sights on Accelerating the Speed of Artificial Intelligence
Analyst Take: Groq recently held its GroqDay event, where the organization gave updates on its offerings. It was immediately clear that the artificial intelligence and machine learning systems innovator is laser-focused on speeding up the artificial intelligence process with its kernel-free Groq Compiler and its next-generation GroqChips.
Groq set out to create a chip that performs more efficiently than conventional chip architectures, particularly when performing AI inferencing. This is significant, as AI inferencing can be both resource-hungry and latency-sensitive. According to Jonathan Ross, Groq Founder and CEO, although the market is moving at a fast clip, inefficiencies exist and many organizations currently leading the way are losing money; the challenge is that there is barely enough compute power to do this affordably.
Andrew Ling, Ph.D., Senior Director of ML Compilers and Software, ran through the advantages of building the software flow and compiler before the hardware. His view is that conventional approaches start with the silicon rather than the software, which yields complexity at both the software and hardware levels. With this approach, users end up with unpredictable hardware, unpredictable data movement, a lack of visibility into the hardware from the software, and difficulty parallelizing within the software. Fundamentally, this yields complexity and difficulty in the software environment. Groq solved this by creating a kernel-less approach to compilation.
Andrew Bitar, Technical Lead & Manager, Compilers, presented on software-defined hardware for dataflow compute. His view is that predictable compute needs predictable hardware, and that prior approaches have created a plateau effect on performance. Bitar asserts there are algorithms that are very dataflow-heavy, highly parallel, and predictable, and that they call for a new class of hardware tailored to this new class of applications. A key barrier to making things run smoothly is the development of compilers for domain-specific architectures (DSAs). In prior approaches, there has been a heavy reliance on hand-tuned kernel libraries as part of the compilation process. Bitar shared that as well as kernel libraries can work, organizations need large teams of engineers to constantly update and maintain them, because applications change, bringing new sets of operations that must be supported as the hyperparameters of those applications change. In essence, this process does not scale well, since the library must be constantly updated.
According to Groq, key impediments to building a parallelizing compiler include non-uniform memory or unpredictable memory access latency, asynchronous threads, complex on-chip interconnect or unpredictable data movement between cores, and dynamic chip-to-chip communication. Groq’s notion is that there are algorithms that present an opportunity for predictable compute, but the hardware retains unpredictable characteristics. Fundamentally, the company challenged itself to overcome this by moving from hardware-defined data movement to software-defined data movement.
The company mentioned that a key building block for the Groq architecture is its SIMD functional unit, which supports a set of vector operations and is driven by a very lightweight instruction dispatch unit. Groq characterizes this as a base class of architecture that is specialized into myriad types, each supporting its own set of operations: the MXM for matrix operations, the VXM for vector operations, the SXM for data reshape operations, and the MEM (memory unit) for very high-bandwidth on-chip memory. The company mentioned that each of these functional units has its own instruction dispatch unit, but they all execute in lockstep, which provides a common time domain across the different functional units. To make communication more efficient across the functional units, Groq introduced high-bandwidth stream registers for data passing.
This empowers the compiler because the Groq tensor streaming processor (TSP) architecture exhibits many characteristics that provide predictable compute. These include software-controlled memory with no dynamic caching and no memory hierarchy, which makes the compiler fully aware of the physical location of every piece of data throughout the program and provides very high memory bandwidth. They also include lockstep execution of the functional units, allowing the compiler to perform cycle-accurate instruction scheduling, and a simple on-chip interconnect that provides more efficient communication between the different functional units on the chip. To achieve this, the architecture employs a simple, one-dimensional interconnect consisting of an eastward and a westward path that speeds data movement. Another area that speeds up the process and adds predictability is the synchronous chip-to-chip communication protocol (C2C), in which the TSP acts as both a processor and a router. Fundamentally, the Groq compiler is given the power of data orchestration, since it knows the exact cycle data needs to be sent from a source TSP to a destination TSP.
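To make the idea of cycle-accurate scheduling more concrete, here is a small, purely illustrative Python sketch. It is not Groq’s toolchain, and every name in it (LATENCY, PROGRAM, schedule) is hypothetical; it simply shows how deterministic functional-unit latencies let a compiler-like pass assign every instruction an exact issue cycle ahead of time, rather than relying on runtime caches or arbitration.

```python
# Toy illustration of static, cycle-accurate scheduling (hypothetical names,
# not Groq's actual toolchain). Each "functional unit" has a fixed, known
# latency, so a compiler-like pass can assign every instruction an exact
# issue cycle. Resource conflicts on a unit are ignored to keep it short.

# Deterministic latencies per functional-unit type (illustrative values only).
LATENCY = {"MEM": 2, "MXM": 4, "VXM": 1, "SXM": 1}

# A tiny "program": (name, unit, dependencies)
PROGRAM = [
    ("load_a",  "MEM", []),
    ("load_b",  "MEM", []),
    ("matmul",  "MXM", ["load_a", "load_b"]),
    ("relu",    "VXM", ["matmul"]),
    ("reshape", "SXM", ["relu"]),
]

def schedule(program):
    """Assign each instruction an exact issue cycle from known latencies."""
    ready_at = {}   # instruction -> cycle its result is available
    issue_at = {}   # instruction -> cycle it is issued
    for name, unit, deps in program:
        start = max((ready_at[d] for d in deps), default=0)
        issue_at[name] = start
        ready_at[name] = start + LATENCY[unit]
    return issue_at, ready_at

issue, ready = schedule(PROGRAM)
for name, unit, _ in PROGRAM:
    print(f"{name:8s} on {unit}: issue cycle {issue[name]}, ready cycle {ready[name]}")
```

Because every latency is known at compile time, the schedule is fixed before the program ever runs, which is the property that deterministic hardware hands to the compiler.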
During his session at GroqDay, Bill Xing, Tech Lead Manager of the Machine Learning compiler team at Groq, discussed how the company got Meta’s large language model, LLaMA, running on Groq systems within days of it becoming available. According to Groq, after the company downloaded the model, it started with the smaller 7-billion-parameter variant and has already advanced to the 13-billion-parameter variant. Groq says it expects to enable the largest variant (65 billion parameters) at the maximum sequence length very soon. To get it working quickly, the team said it had to “De-NVIDIA” the code to make it accelerator-agnostic; Meta researchers originally developed the model on NVIDIA hardware, and it contained low-level optimizations for GPUs.
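Groq has not published the specific changes it made, but the general pattern behind making PyTorch model code accelerator-agnostic is well understood. The sketch below is my own minimal illustration of that pattern, not Groq’s code: replacing hard-coded .cuda() calls with an explicit, configurable device so the same model code can target different backends.

```python
# Illustrative only: this shows the general pattern of making PyTorch code
# accelerator-agnostic by removing hard-coded CUDA calls in favor of a
# configurable device, not Groq's actual "De-NVIDIA" changes.
import torch
import torch.nn as nn

# GPU-specific style (what accelerator-agnostic refactoring removes):
#   model = MyModel().cuda()
#   x = torch.randn(1, 512).cuda()

def build_model(device: str = "cpu") -> nn.Module:
    """Construct the model on an explicit, configurable device."""
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
    return model.to(device)

def run(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Move inputs to wherever the model's parameters live, rather than
    # assuming a CUDA device is present.
    device = next(model.parameters()).device
    return model(x.to(device))

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = build_model(device)
    out = run(model, torch.randn(1, 512))
    print(out.shape)
```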
The company mentioned that graph neural networks (GNNs) are everywhere. They are used when people are shopping online, driving, etc. Use cases cited during the session included Google Maps, Waze, Amazon, and eBay, apps that many of us use on a regular basis. As an example, Google and Waze use them for traffic prediction and optimal routes, while Amazon and eBay utilize them for product recommendations and product relationships with one another. GNNs differ from other neural networks in that they are designed to operate on graph-structured data. Fundamentally, in graph-structured data, the data is represented as a set of nodes and edges where the nodes represent entities and edges represent relationships between those entities. In essence, GNNs can handle a variety of tasks such as graph classification where entire graphs can be classified into specific categories.
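For readers less familiar with GNNs, the short numpy sketch below (my own illustration, not tied to Groq or to any particular framework) shows graph-structured data as node features plus an edge list, one round of message passing, and a simple pooling step for graph classification.

```python
# A minimal numpy sketch of graph-structured data and one round of GNN
# message passing (illustrative only).
import numpy as np

# Graph: 4 nodes (entities) with 3-dimensional features; edges are (src, dst)
# pairs representing relationships between those entities.
node_features = np.random.rand(4, 3)
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])  # src -> dst

def message_passing(h, edges):
    """Each node sums the feature vectors ("messages") of its in-neighbors."""
    out = np.zeros_like(h)
    for src, dst in edges:
        out[dst] += h[src]          # message from src delivered to dst
    return out

h1 = message_passing(node_features, edges)

# Graph classification: pool node embeddings into one vector for the whole
# graph, then feed it to a classifier (here just a random linear layer).
graph_embedding = h1.mean(axis=0)
logits = graph_embedding @ np.random.rand(3, 2)   # 2 hypothetical classes
print(logits)
```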
A key challenge when implementing a GNN is the message-passing operation. The message-passing function of GNNs requires communication between the nodes of the graph, which creates a bottleneck on traditional AI computing hardware. The function requires accessing and updating information for each node in the graph, which can result in irregular memory access patterns. The company mentioned that traditional CPUs and GPUs are optimized for locality-based, regular memory access patterns, and the irregularity of GNN access patterns can lead to sub-optimal performance. The GroqChip provides a unique solution that addresses these challenges for GNN applications. The globally shared SRAM on the GroqChip facilitates the massive concurrency and data parallelism needed for bandwidth-sensitive applications, providing an effective bandwidth of 80 terabytes per second compared to roughly 2 terabytes per second on the latest GPU. This high bandwidth helps with the scatter-gather operations in the message-passing framework.
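To show why this is a memory-bandwidth problem, here is another small numpy sketch (again my own illustration, not Groq code) of the gather and scatter steps: node features are read and accumulated in whatever order the edge list dictates, which produces the irregular access patterns described above.

```python
# Illustrative sketch of why message passing stresses memory bandwidth: the
# gather/scatter steps index node features in an order dictated by the edge
# list, producing irregular memory access patterns.
import numpy as np

num_nodes, dim = 1000, 64
h = np.random.rand(num_nodes, dim)

# Edge list with arbitrary connectivity: source/destination indices jump
# around memory rather than streaming sequentially.
rng = np.random.default_rng(0)
src = rng.integers(0, num_nodes, size=5000)
dst = rng.integers(0, num_nodes, size=5000)

# Gather: read each edge's source-node features (random reads).
messages = h[src]

# Scatter: accumulate messages into each edge's destination node (random writes).
aggregated = np.zeros_like(h)
np.add.at(aggregated, dst, messages)

print(aggregated.shape)  # (1000, 64) updated node representations
```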
Wrapping it up, Groq’s GroqDay offered an excellent view of how artificial intelligence is here to stay and will continue to play an outsized role in business operations.
I see Groq as an extremely innovative organization and its disruptive kernel-free Groq Compiler and next-generation GroqChips are doing exactly what customers need: taking the cost and complexity out of the AI process.
Groq’s target focus includes key vertical markets such as finance, science & government, industrial, and oil & gas, which makes sense. I also see opportunities beyond these key vertical markets, as AI is a quickly evolving space and more organizations across a variety of verticals are beginning to see the power of AI and want to take steps to harness that power. I see this as extremely attractive for channel partners focused on AI and HPC, and would not be surprised to see higher education and retail as Groq targets moving forward. The future looks bright for Groq and I look forward to continuing to monitor the company’s progress.
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.
Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.
Other insights from The Futurum Group:
The Six Five Podcast: Groq Day
The Six Five Podcast: Groq Goes LLaMa
Groq: A Single Core Design Enabling Higher Compute Performance