Press release
Why "Token = IOPS": The New Golden Rule for AI Inference Infrastructure
For years, the AI industry has measured progress in compute. More GPUs, more FLOPS, larger clusters, and increasingly sophisticated accelerators have become the defining metrics of infrastructure investment. During the training era, this made perfect sense. Today, however, AI is entering a different phase. As enterprises move beyond model development and into large-scale deployment, the bottleneck is no longer training - it is inference.
Across customer service platforms, enterprise copilots, coding assistants, AI search engines, and agentic applications, users are not asking how many GPUs power a service. They care about how quickly the first response appears, how smoothly tokens continue to stream, and whether performance remains consistent when thousands of users arrive simultaneously.
This shift is forcing infrastructure architects to rethink a long-held assumption. While GPUs remain the engines of AI, compute alone no longer determines user experience. Increasingly, the decisive factor is how efficiently data can be delivered to those GPUs.
This is why a new rule is emerging for AI infrastructure:
Token = IOPS.
Every Token Begins with Data
In real-world inference deployments, the seamless stream of text visible to users masks a highly complex orchestration of memory, storage, and compute resources. This infrastructure is now being pushed to its limits by three defining trends in modern AI systems: Mixture-of-Experts (MoE) architectures, Retrieval-Augmented Generation (RAG), and ultra-long context windows.
1. MoE architectures, exemplified by models such as DeepSeek-V3, contain hundreds of billions of parameters while activating only a subset of experts for each token. While this approach significantly reduces the compute required per generated token, it increases the complexity of expert routing, weight scheduling, and inference orchestration. As models become larger and more modular, rapid access to model data becomes increasingly important to maintaining inference efficiency.
2. At the same time, RAG has become a foundational component of enterprise AI. Organizations are increasingly grounding models with proprietary knowledge stored in vector databases containing billions of embeddings. Before the first token is generated, a single user query may trigger thousands of retrieval operations. In these environments, storage latency becomes a critical factor influencing Time to First Token (TTFT) and overall responsiveness.
3. Long-context AI introduces another layer of complexity. Context windows have expanded from 8K tokens to 128K, 256K, and even one million tokens. A 70B-class model may consume several hundred kilobytes of KV Cache per token, meaning a single 128K-token session can require tens of gigabytes of cache data. When multiplied across thousands of concurrent sessions, the aggregate cache footprint can rapidly grow into tens of terabytes.
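The KV Cache figures in the long-context item above can be checked with a back-of-envelope calculation. The sketch below assumes a Llama-style 70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128, FP16 values) and 1,000 concurrent sessions. None of these values come from the release, and other models, precisions, or cache layouts will give different results.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV Cache added per token; keys and values are both stored."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


KIB, GIB, TIB = 1024, 1024**3, 1024**4

# Assumed model configuration (illustrative, not taken from the release).
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2
CONTEXT_TOKENS = 128 * 1024      # one 128K-token session
CONCURRENT_SESSIONS = 1_000      # assumed concurrency

per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
session_bytes = per_token * CONTEXT_TOKENS
fleet_bytes = session_bytes * CONCURRENT_SESSIONS

print(f"{per_token / KIB:.0f} KiB per token")
print(f"{session_bytes / GIB:.0f} GiB per 128K-token session")
print(f"{fleet_bytes / TIB:.1f} TiB across {CONCURRENT_SESSIONS} sessions")
```

Under these assumptions the script reports 320 KiB per token, 40 GiB per 128K-token session, and roughly 39 TiB across 1,000 sessions, which is consistent with the ranges quoted above.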
As concurrency increases, infrastructure requirements shift from peak bandwidth to sustained low-latency random access at scale. The challenge is no longer simply moving large amounts of data - it is delivering the right data, at the right time, with consistently low latency.
The NVMe SSD: From Storage to a Core Performance Layer
No practical deployment can keep this expanding volume of model data, retrieval data, and KV Cache resident entirely within GPU High Bandwidth Memory (HBM). As a result, modern inference platforms have evolved into hierarchical memory architectures, where frequently accessed data remains in HBM, additional capacity resides in system DRAM, and overflow data is served by high-performance NVMe SSDs.
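As a rough illustration of the hierarchy described above, the sketch below fills tiers from fastest to slowest and reports how much data lands in each. The tier capacities and the 40 TB demand are hypothetical values chosen for the example, and the fill-in-order policy is a simplification: real inference platforms typically place data by access frequency and must also reserve HBM for model weights.

```python
def place_across_tiers(demand_bytes, tiers):
    """Fill tiers in the given order (fastest first).

    tiers is a list of (name, capacity_bytes) pairs. Returns a dict of bytes
    placed per tier, plus an "unplaced" entry for anything that did not fit.
    """
    placement = {}
    remaining = demand_bytes
    for name, capacity in tiers:
        used = min(remaining, capacity)
        placement[name] = used
        remaining -= used
    placement["unplaced"] = remaining
    return placement


GB = 10**9

# Hypothetical single-server capacities (assumptions, not from the release).
tiers = [
    ("HBM", 8 * 80 * GB),        # 8 accelerators with 80 GB each
    ("DRAM", 2_000 * GB),        # 2 TB of system memory
    ("NVMe", 8 * 15_360 * GB),   # 8 drives of 15.36 TB each
]

placement = place_across_tiers(40_000 * GB, tiers)  # 40 TB of cache data
for name, used in placement.items():
    print(f"{name}: {used / GB:,.0f} GB")
```

With these assumed capacities, HBM and DRAM together hold under 3 TB of the 40 TB demand, so the bulk of the data is served from the NVMe tier.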
A modern inference server equipped with high-end accelerators may represent hundreds of thousands of dollars in infrastructure investment. Yet when storage cannot respond quickly enough, GPUs spend valuable cycles waiting for data rather than generating tokens. TTFT increases, throughput declines, and overall infrastructure efficiency suffers.
In AI inference, keeping GPUs busy is no longer solely a compute challenge - it is increasingly a storage challenge.
Empowering the Next Generation of AI: PBlaze7 7A40
As AI deployments continue to scale and the reality of "Token = IOPS" becomes increasingly apparent, infrastructure platforms require storage specifically engineered for high-concurrency inference workloads.
To address these challenges, the Memblaze PBlaze7 7A40 series PCIe 5.0 SSD delivers up to 3.4 million random read IOPS, enabling efficient KV Cache access, vector database retrieval, and highly concurrent inference operations.
With up to 14.2 GB/s sequential read throughput, the PBlaze7 7A40 accelerates model loading, checkpoint recovery, and large-scale data movement throughout the inference lifecycle. Its ultra-low 50-microsecond random read latency helps reduce storage-induced delays that impact TTFT and token delivery consistency, ensuring responsive AI services even under demanding workloads.
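To put these figures in context, the sketch below turns the quoted peak numbers (3.4 million random read IOPS, 50-microsecond random read latency, 14.2 GB/s sequential read) into rough time estimates. It is an idealized single-drive model: it assumes the peak values are sustained, ignores software and network overhead, and treats the quoted latency as holding under load, which may not be the case at high queue depths. The example workload sizes (5,000 reads, a 140 GB model) are assumptions for illustration.

```python
RANDOM_READ_IOPS = 3_400_000      # quoted peak random read IOPS
RANDOM_READ_LATENCY_S = 50e-6     # quoted random read latency (50 microseconds)
SEQ_READ_BYTES_PER_S = 14.2e9     # quoted sequential read throughput (14.2 GB/s)


def serial_read_time(n_reads):
    """Time if each read waits for the previous one (queue depth 1)."""
    return n_reads * RANDOM_READ_LATENCY_S


def parallel_read_time(n_reads):
    """Optimistic lower bound when reads are issued concurrently:
    limited by IOPS, but never faster than one read latency."""
    return max(RANDOM_READ_LATENCY_S, n_reads / RANDOM_READ_IOPS)


def sequential_load_time(size_bytes):
    """Time to stream a file of the given size at peak sequential throughput."""
    return size_bytes / SEQ_READ_BYTES_PER_S


n = 5_000  # assumed number of retrieval reads triggered by one query
print(f"{n} reads, serial:   {serial_read_time(n) * 1e3:.1f} ms")
print(f"{n} reads, parallel: {parallel_read_time(n) * 1e3:.2f} ms")

model_bytes = 140e9  # assumed size of a 70B-parameter model at FP16
print(f"Model load: {sequential_load_time(model_bytes):.1f} s")
```

The gap between the serial and parallel estimates shows why high-concurrency random read capability, rather than latency alone, matters when a single query fans out into thousands of retrievals.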
For capacity-intensive AI environments, the Memblaze PBlaze7 7A40 Ocean extends the platform with capacities of up to 122.88 TB per drive while maintaining up to 14.2 GB/s sequential read throughput and 3.35 million random read IOPS.
The combination of extreme density and PCIe 5.0 performance makes the PBlaze7 7A40 Ocean particularly well suited for large-scale RAG repositories, long-context inference environments, AI data lakes, and other deployments where both performance and capacity are critical requirements.
Conclusion
As AI infrastructure matures, storage metrics are no longer merely storage metrics - they are AI performance metrics.
Random read latency influences TTFT.
IOPS determine how efficiently inference systems access KV Cache and vector databases.
Throughput impacts how quickly models can be loaded, updated, and deployed.
The first generation of AI infrastructure was defined by FLOPS.
The next generation will be defined by the efficiency of the entire inference pipeline.
Every token begins with data.
Every data access becomes an I/O operation.
And ultimately,
Token = IOPS.
B2-A302, Dongsheng Technology Park, No.66 Xixiaokou Road, Haidian District, 100192, Beijing China
Memblaze is the world's leading supplier of enterprise-level SSD (Solid State Drive) products and solutions. The PBlaze series SSD launched by Memblaze has been widely used in database, virtualization, cloud computing, big data, artificial intelligence and other fields, providing stable and reliable high-speed storage solutions for many customers in Internet, cloud service, finance, telecommunications and other industries.
This release was published on openPR.