NVIDIA says continued software tuning has cut the cost of serving DeepSeek V4 models on Blackwell hardware by up to a factor of five in about a month. The company attributes the improvement to changes across its inference software stack rather than a new GPU release, which shows how much performance can still be gained after hardware reaches data centers.
The announcement focuses on cost per token, a metric that measures how much it costs to generate AI output. For companies running large language models at scale, lower token costs can directly affect pricing, margins, and how widely they can deploy AI features.
NVIDIA’s figures show the estimated cost per million DeepSeek V4 tokens on a GB300 NVL72 system falling from around $0.30 to $0.06 through software improvements. The company says these gains came from better scheduling, memory handling, model execution, and communication between GPUs.
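The arithmetic behind the "up to five times" headline can be checked directly from the two quoted prices. The snippet below is a minimal sketch in Python. The function names and the monthly token volume are illustrative assumptions, not figures from NVIDIA.

```python
def reduction_factor(before_usd: float, after_usd: float) -> float:
    """How many times cheaper the new price is compared with the old one."""
    return before_usd / after_usd


def percent_saved(before_usd: float, after_usd: float) -> float:
    """Share of the original cost that is no longer paid, in percent."""
    return (before_usd - after_usd) / before_usd * 100.0


# Approximate figures quoted in the article (USD per million tokens).
before_usd = 0.30
after_usd = 0.06

factor = reduction_factor(before_usd, after_usd)
saved = percent_saved(before_usd, after_usd)

# Hypothetical workload, assumed only for illustration:
# 10,000 million tokens (10 billion) served per month.
monthly_million_tokens = 10_000
monthly_before = before_usd * monthly_million_tokens
monthly_after = after_usd * monthly_million_tokens

print(f"Reduction factor: {factor:.1f}x")
print(f"Cost saved: {saved:.0f}%")
print(f"Monthly bill: ${monthly_before:,.0f} -> ${monthly_after:,.0f}")
```

A five-fold reduction is the same thing as an 80 percent cut, which is why a small-looking per-token difference becomes significant at production volumes.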
Software Is Becoming More Important for AI Inference
AI hardware performance is no longer determined only by the number of GPUs in a server. The software layer has become equally important because it decides how efficiently those processors, memory pools, and networking links are used.
NVIDIA says its inference stack combines three main areas:
| Layer | Main role |
|---|---|
| Production operations | Handles scheduling, autoscaling, orchestration and memory management |
| Application acceleration | Improves model execution through runtime tuning and optimized kernels |
| Infrastructure access | Connects software to GPU, networking and memory hardware capabilities |
Together, these layers help AI providers reduce wasted compute time and keep models running efficiently across large server clusters.
For DeepSeek V4, NVIDIA says the result is a major reduction in token cost without requiring customers to replace existing Blackwell systems.
TensorRT-LLM and Dynamo Are Central to the Gains
NVIDIA highlighted several tools used to improve DeepSeek V4 inference performance. TensorRT-LLM is designed to optimize large language models for NVIDIA GPUs, while Dynamo is aimed at managing GPU resources and scaling AI workloads across larger deployments.
The company says these tools allow AI providers to tune their systems for tasks such as coding assistants, reasoning models, long-context requests, and real-time AI applications.
Several AI infrastructure companies are already using the Blackwell software stack for DeepSeek V4 deployments.
| Company | Reported use |
|---|---|
| Baseten | Uses TensorRT-LLM and custom runtime tuning |
| Cognition | Uses Dynamo for GPU management and reinforcement learning workloads |
| Deep Infra | Uses NVIDIA’s inference stack for open-source model deployment |
| Together AI | Uses TensorRT-LLM to support real-time coding services |
Baseten reportedly achieved up to 50 percent more tokens per second through its own optimizations on Blackwell-powered systems.
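A throughput gain translates into a cost-per-token change only under some assumption about hardware cost. The sketch below assumes the hourly cost of the system stays fixed while throughput rises. The hourly price and baseline tokens-per-second values are hypothetical placeholders, not reported numbers.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def cost_change_from_throughput(gain_fraction: float) -> float:
    """Fractional change in cost per token when throughput rises by
    gain_fraction and the hourly hardware cost is unchanged."""
    return 1.0 / (1.0 + gain_fraction) - 1.0


# Hypothetical system: these two values are assumptions for illustration.
hourly_cost_usd = 100.0
baseline_tps = 50_000.0

baseline_cost = cost_per_million_tokens(hourly_cost_usd, baseline_tps)
tuned_cost = cost_per_million_tokens(hourly_cost_usd, baseline_tps * 1.5)

# 50 percent more tokens per second, as in the reported Baseten figure.
change = cost_change_from_throughput(0.50)

print(f"Baseline: ${baseline_cost:.3f} per million tokens")
print(f"Tuned:    ${tuned_cost:.3f} per million tokens")
print(f"Change in cost per token: {change:.1%}")
```

Under that fixed-cost assumption, 50 percent more throughput lowers cost per token by about one third, not by half, because cost scales with the inverse of throughput.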
NVIDIA Claims Up to 20 Times Higher Throughput in Full Configurations
NVIDIA also says its wider Blackwell platform can provide up to 20 times more throughput than a basic FP8 inference configuration when several technologies are combined.

These include NVLink interconnects, NVFP4 precision, Multi-Token Prediction, larger expert-parallel configurations, and disaggregated inference designs. Each feature contributes a different improvement, but the largest gains come from combining them across the entire system.
This approach matters because AI inference workloads are becoming more complex. Models need more memory, faster communication between accelerators, lower latency, and better power efficiency. A faster GPU alone cannot solve every bottleneck.
Lower Token Costs Could Shape AI Pricing
The most important part of NVIDIA’s claim is not the benchmark number itself, but what lower token costs could mean for AI products. If providers can serve models more cheaply, they may be able to offer lower prices, longer context limits, faster responses, or more capable free tiers.
However, the real benefit will depend on whether cloud providers and AI companies pass those savings on to customers. Lower operating costs do not automatically mean cheaper subscriptions or API access.
For now, NVIDIA’s update shows that Blackwell performance is still changing through software. The hardware may be fixed, but the cost and speed of AI inference can continue improving long after the servers are installed.


