A new type of memory, High Bandwidth Flash (HBF), is set to become the next critical component in AI infrastructure, following the much-discussed HBM.
Today's advanced AI workloads are increasingly dominated by 'inference': using trained models to generate answers or predictions. Inference requires instant access to huge amounts of data, or 'context'; in Transformer models, this context accumulates in a structure called the KV cache. Imagine a chatbot that needs to remember your entire conversation history to give a relevant answer. The current top-tier memory, HBM (High Bandwidth Memory), is incredibly fast but has limited capacity, like a small, high-speed workbench that fills up quickly. Traditional storage such as SSDs, on the other hand, is like a vast library basement: plenty of space, but too slow for real-time AI tasks. The gap between the two creates a significant performance bottleneck.
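To make the bottleneck concrete, here is a back-of-the-envelope sizing of that context memory. The model dimensions below are illustrative assumptions rather than the specs of any real model; the formula itself, one key vector and one value vector per layer per token, is the standard footprint of a Transformer KV cache.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions here are
# illustrative assumptions, not the specs of any particular model.
num_layers = 80        # Transformer layers
num_kv_heads = 8       # key/value heads (assuming grouped-query attention)
head_dim = 128         # dimension per head
bytes_per_value = 2    # fp16/bf16 precision

# Each token stores one key vector and one value vector per layer.
bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_value

def kv_cache_gib(context_tokens: int, batch_size: int = 1) -> float:
    """KV-cache footprint in GiB for a given context length and batch."""
    return context_tokens * batch_size * bytes_per_token / 2**30

print(f"{kv_cache_gib(1_000_000):.0f} GiB for a single 1M-token context")
print(f"{kv_cache_gib(1_000_000, batch_size=32):.0f} GiB for a batch of 32")
```

With these assumed dimensions, a single million-token context already consumes roughly 305 GiB, more than a top-end GPU's entire HBM, and serving many users at once multiplies that footprint. That is the scale of data HBF is meant to hold.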
So why is this coming to a head now? First, the capacity gap has become impossible to ignore. NVIDIA's latest B200 GPUs, for example, boast immense processing power but carry a relatively small HBM capacity of around 192 GB, which highlights the need for a 'middle-ground' memory tier. NVIDIA itself validated the problem by announcing its BlueField-4 CMX platform, which is specifically designed to manage this 'context memory' outside of the primary HBM.
Second, the industry is already actively developing a solution. Tech giants SK hynix and SanDisk are leading a collaborative effort to create a global standard for HBF through the Open Compute Project (OCP). Standardization is crucial because it ensures that HBF modules from different manufacturers will work together seamlessly, which should accelerate adoption across the industry. Early prototypes from companies like Kioxia already demonstrate what's possible, showcasing a single module with about 26 times the capacity of a GPU's HBM; measured against the B200's 192 GB, that works out to roughly 5 TB in one module.
Finally, the supply of HBM itself is extremely tight. Major producers have already sold out their entire 2026 capacity, and some experts warn that the underlying wafer shortage could persist until 2030. When the fastest and most premium option (HBM) is scarce and expensive, it naturally pressures the industry to develop a more abundant and cost-effective 'capacity' tier to sit right alongside it.
In conclusion, HBF isn't intended to replace HBM. Instead, it helps create a more balanced and efficient three-tier memory system: HBM for the most immediate, high-speed calculations, HBF for the massive context data that needs to be close by, and bulk storage for everything else. This evolution is a logical and necessary response to the real-world demands of large-scale AI inference.
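As a thought experiment, that three-tier idea can be sketched as a simple placement policy. Everything below, from the ContextBlock structure to the age thresholds, is an illustrative assumption about how a serving stack might route context data; it does not describe any vendor's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HBM = "HBM"   # small and fastest: the active working set
    HBF = "HBF"   # large, fast flash: warm context such as the KV cache
    SSD = "SSD"   # bulk storage: everything else

@dataclass
class ContextBlock:
    tokens: int
    last_access_step: int  # generation step at which this block was last read

def place(block: ContextBlock, current_step: int,
          hot_window: int = 1, warm_window: int = 1_000) -> Tier:
    """Toy age-based policy: recently touched context stays in HBM,
    still-warm context spills to HBF, and cold context sinks to SSD.
    The window sizes are illustrative assumptions, not vendor defaults."""
    age = current_step - block.last_access_step
    if age <= hot_window:
        return Tier.HBM
    if age <= warm_window:
        return Tier.HBF
    return Tier.SSD

# A block untouched for 10 steps is warm: too stale for scarce HBM,
# but far too hot for an SSD round-trip.
print(place(ContextBlock(tokens=4096, last_access_step=990), current_step=1000))
```

The design point is the same one the article makes: the policy only works if the middle tier exists, because without HBF every block older than the hot window would fall straight from HBM to slow bulk storage.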
Key terms:

- HBM (High Bandwidth Memory): High-performance memory built from vertically stacked DRAM dies placed very close to the GPU. It offers extremely fast data transfer but limited capacity, and it is expensive to produce.
- Inference: In AI, this is the process of using a trained model to make predictions or generate new content based on new input data. It's the 'live' operational phase after the initial 'training' phase.
- KV Cache: A specific memory optimization technique used in Transformer models (the basis for models like GPT). It stores intermediate calculations (keys and values) to speed up the process of generating sequential data, like text.
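To connect the last two definitions, here is a minimal sketch of KV caching during text generation. The single attention head and small dimensions are deliberate simplifications; real models cache keys and values for every layer and every head, which is exactly why the cache grows so large.

```python
import numpy as np

head_dim = 64  # illustrative size; real models use many layers and heads

def decode_step(k_cache, v_cache, query, new_key, new_value):
    """One generation step: append this token's key/value to the cache,
    then attend over the whole cache instead of recomputing the past."""
    k_cache = np.vstack([k_cache, new_key])
    v_cache = np.vstack([v_cache, new_value])
    scores = k_cache @ query / np.sqrt(head_dim)   # similarity to each token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over past tokens
    output = weights @ v_cache                     # weighted mix of values
    return k_cache, v_cache, output

# The cache grows by one key and one value vector per generated token,
# which is the memory that piles up over a long conversation.
k_cache = np.empty((0, head_dim))
v_cache = np.empty((0, head_dim))
for step in range(3):
    q, k, v = (np.random.randn(head_dim) for _ in range(3))
    k_cache, v_cache, out = decode_step(k_cache, v_cache, q, k, v)
print(k_cache.shape)  # (3, 64): one cached key per generated token
```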
