NVIDIA Says Blackwell Software Optimizations Cut DeepSeek V4 Token Costs by Up to Five Times

Justin Nelon
Published on July 1, 2026


NVIDIA says continued software tuning has reduced the cost of serving DeepSeek V4 models on Blackwell hardware by up to five times in about a month. The company attributes the improvement to changes across its inference software stack rather than a new GPU release, showing how much performance can still be gained after hardware reaches data centers.

The announcement focuses on cost per token, a metric that captures what it costs to generate AI output and is usually quoted in dollars per million tokens. For companies running large language models at scale, lower token costs can directly affect pricing, margins, and how widely they can deploy AI features.

NVIDIA’s figures show the estimated cost per million DeepSeek V4 tokens on a GB300 NVL72 system falling from around $0.30 to $0.06 through software improvements. The company says these gains came from better scheduling, memory handling, model execution, and communication between GPUs.
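To make the arithmetic behind those figures concrete, here is a minimal Python sketch. The $0.30 and $0.06 values come from NVIDIA's figures as reported above. The function names, the hourly system cost, and the throughput used in the last example are hypothetical placeholders, not numbers from NVIDIA.

```python
def cost_per_million_tokens(system_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars needed to generate one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return system_cost_per_hour / tokens_per_hour * 1_000_000


def reduction_factor(cost_before: float, cost_after: float) -> float:
    """How many times cheaper the new cost is compared with the old one."""
    if cost_after <= 0:
        raise ValueError("cost_after must be positive")
    return cost_before / cost_after


# Figures quoted in the article (dollars per million tokens).
before, after = 0.30, 0.06
factor = reduction_factor(before, after)
percent_lower = (1 - after / before) * 100
print(f"Reduction factor: {factor:.1f}x, cost lower by {percent_lower:.0f}%")

# Hypothetical inputs, for illustration only: a system billed at $36 per hour
# that sustains 10,000 tokens per second.
print(f"${cost_per_million_tokens(36.0, 10_000):.2f} per million tokens")
```

With the reported figures, the ratio works out to roughly five, which matches the "up to five times" claim and corresponds to a cost about 80 percent lower.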

Software Is Becoming More Important for AI Inference

AI hardware performance is no longer determined only by the number of GPUs in a server. The software layer has become equally important because it decides how efficiently those processors, memory pools, and networking links are used.

NVIDIA says its inference stack combines three main areas:

Layer	Main role
Production operations	Handles scheduling, autoscaling, orchestration and memory management
Application acceleration	Improves model execution through runtime tuning and optimized kernels
Infrastructure access	Connects software to GPU, networking and memory hardware capabilities

Together, these layers help AI providers reduce wasted compute time and keep models running efficiently across large server clusters.

For DeepSeek V4, NVIDIA says the result is a major reduction in token cost without requiring customers to replace existing Blackwell systems.

TensorRT-LLM and Dynamo Are Central to the Gains

NVIDIA highlighted several tools used to improve DeepSeek V4 inference performance. TensorRT-LLM is designed to optimize large language models for NVIDIA GPUs, while Dynamo is aimed at managing GPU resources and scaling AI workloads across larger deployments.

The company says these tools allow AI providers to tune their systems for tasks such as coding assistants, reasoning models, long-context requests, and real-time AI applications.

Several AI infrastructure companies are already using the Blackwell software stack for DeepSeek V4 deployments.

Company	Reported use
Baseten	Uses TensorRT-LLM and custom runtime tuning
Cognition	Uses Dynamo for GPU management and reinforcement learning workloads
Deep Infra	Uses NVIDIA’s inference stack for open-source model deployment
Together AI	Uses TensorRT-LLM to support real-time coding services

Baseten reportedly achieved up to 50 percent more tokens per second through its own optimizations on Blackwell-powered systems.
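A throughput gain like this translates into token cost only under an assumption the article does not state: that the hourly cost of the hardware stays the same. Under that assumption, cost per token scales with the inverse of throughput. The short Python sketch below shows the conversion; the function name is hypothetical.

```python
def cost_multiplier(throughput_gain: float) -> float:
    """New cost per token as a fraction of the old cost.

    throughput_gain is fractional: 0.5 means 50 percent more tokens per second.
    Assumes the hourly hardware cost is unchanged (an assumption, not a reported fact).
    """
    if throughput_gain <= -1:
        raise ValueError("throughput_gain must be greater than -1")
    return 1 / (1 + throughput_gain)


# 50 percent more tokens per second, as reported for Baseten.
multiplier = cost_multiplier(0.5)
print(f"Cost per token falls to {multiplier:.1%} of the original")
print(f"That is a reduction of {1 - multiplier:.1%}")
```

On these terms, 50 percent more tokens per second means each token costs about two thirds of what it did before, a reduction of about 33 percent rather than 50 percent.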

NVIDIA Claims Up to 20 Times Higher Throughput in Full Configurations

NVIDIA also says its wider Blackwell platform can provide up to 20 times more throughput than a basic FP8 inference configuration when several technologies are combined.

These include NVLink interconnects, NVFP4 precision, Multi-Token Prediction, larger expert-parallel configurations, and disaggregated inference designs. Each feature contributes a different improvement, but the largest gains come from combining them across the entire system.

This approach matters because AI inference workloads are becoming more complex. Models need more memory, faster communication between accelerators, lower latency, and better power efficiency. A faster GPU alone cannot solve every bottleneck.

Lower Token Costs Could Shape AI Pricing

The most important part of NVIDIA’s claim is not the benchmark number itself, but what lower token costs could mean for AI products. If providers can serve models more cheaply, they may be able to offer lower prices, longer context limits, faster responses, or more capable free tiers.

However, the real benefit will depend on whether cloud providers and AI companies pass those savings on to customers. Lower operating costs do not automatically mean cheaper subscriptions or API access.

For now, NVIDIA’s update shows that Blackwell performance is still changing through software. The hardware may be fixed, but the cost and speed of AI inference can continue improving long after the servers are installed.
