Nvidia has added day-one support for Google DeepMind’s DiffusionGemma open model across its RTX and DGX platforms, giving developers a fast local option for text generation on consumer GPUs, professional workstations, and deskside AI systems. The model is built to produce output faster than traditional autoregressive models, and Nvidia says its own hardware and software stack can push performance even further.
DiffusionGemma is an open-weight model that generates text using a diffusion-based approach. Instead of predicting one token at a time, it can denoise up to 256 tokens per step. That parallel generation is the main reason the model can produce output faster, especially in single-user local workloads where token-by-token generation can become the bottleneck.
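The parallel-generation argument can be made concrete by counting sequential forward passes. The sketch below is a toy model, not a description of DiffusionGemma's actual sampler: it assumes output is produced in blocks of up to 256 tokens and that each block takes a fixed number of denoising passes. The `denoise_steps_per_block` parameter is hypothetical and does not come from Nvidia or Google.

```python
import math

# From the article: up to 256 tokens can be denoised per step.
TOKENS_PER_STEP = 256


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one sequential forward pass per token."""
    return num_tokens


def diffusion_passes(num_tokens: int, denoise_steps_per_block: int) -> int:
    """Sequential passes under a toy block-diffusion model.

    ASSUMPTION: output is split into blocks of up to 256 tokens, and each
    block is refined over `denoise_steps_per_block` passes. That step count
    is a hypothetical knob, not a published figure.
    """
    blocks = math.ceil(num_tokens / TOKENS_PER_STEP)
    return blocks * denoise_steps_per_block


if __name__ == "__main__":
    n = 1024
    for steps in (8, 32, 64):
        print(
            f"{n} tokens, {steps} denoise steps/block: "
            f"{autoregressive_passes(n)} autoregressive passes vs "
            f"{diffusion_passes(n, steps)} diffusion passes"
        )
```

Pass counts are not wall-clock time, since a pass over 256 positions costs more than a pass that emits one token. The sketch only shows why fewer sequential steps can help when a single user is waiting on the output.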
The model is based on Gemma 4 and uses a mixture-of-experts design. It has 25.2 billion total parameters, but only 3.8 billion are active per step. Because only a fraction of the weights is used on each step, per-step compute stays closer to that of a much smaller model, while the full parameter count still gives the model capacity for text generation tasks.
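As a quick check on those two figures, the share of weights active on each step works out as follows. This is plain arithmetic on the numbers quoted above, and the function name is illustrative.

```python
# Parameter counts as quoted in the article.
TOTAL_PARAMS = 25.2e9
ACTIVE_PARAMS = 3.8e9


def active_fraction(active: float, total: float) -> float:
    """Fraction of the model's parameters used on a single step."""
    return active / total


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"Active per step: {share:.1%}")  # prints "Active per step: 15.1%"
```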
DiffusionGemma focuses on faster local AI generation
DiffusionGemma is designed to run locally, which means developers and creators can use it without relying on cloud inference or paying per token. It is available under an Apache 2.0 license and supports tools such as Hugging Face Transformers, vLLM, and Unsloth at launch.

| Feature | DiffusionGemma |
|---|---|
| Model type | Open-weight diffusion model |
| Base architecture | Gemma 4 |
| Total parameters | 25.2 billion |
| Active parameters | 3.8 billion per step |
| Context length | Up to 256K tokens |
| Precision formats | BF16 and NVFP4 |
| Main use | Fast local text generation |
| Supported platforms | Nvidia RTX, RTX PRO, DGX Spark, DGX Station |

The model supports both text and image modalities, giving it a broader role than a text-only model. Its main appeal, however, is fast local generation.
Nvidia says its platforms can run the model without extra tuning
Nvidia is supporting DiffusionGemma across GeForce RTX GPUs, RTX PRO workstations, DGX Spark systems, and DGX Station. The company says the model runs on its CUDA software stack and Tensor Core hardware without extra user tuning.
That matters because open models often need careful setup to perform well. If Nvidia’s stack makes DiffusionGemma easier to deploy, more developers may be able to test it quickly on local systems.
DGX Spark appears to be one of the more interesting targets. It uses Nvidia’s GB10 Grace Blackwell Superchip, includes 128GB of unified memory, and is meant for local AI development, agents, research, prototyping, and fine-tuning.
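A back-of-the-envelope estimate shows why 128GB of unified memory is relevant for a 25.2-billion-parameter model. The sketch below is a rough estimate under stated assumptions: every weight is stored in a single format, BF16 takes 16 bits per weight, NVFP4 is treated as a flat 4 bits per weight (ignoring any scale-factor overhead), and gigabytes are decimal. It covers weights only, not activations, caches, or framework overhead.

```python
TOTAL_PARAMS = 25.2e9        # total parameters, as quoted in the article
DGX_SPARK_MEMORY_GB = 128    # unified memory, as quoted in the article

# ASSUMPTION: flat bits per weight; NVFP4 scale-factor overhead is ignored.
BITS_PER_WEIGHT = {"BF16": 16, "NVFP4": 4}


def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in BITS_PER_WEIGHT.items():
        size = weight_gb(TOTAL_PARAMS, bits)
        fits = size < DGX_SPARK_MEMORY_GB
        print(f"{fmt}: ~{size:.1f} GB of weights, fits in 128 GB: {fits}")
```

Under these assumptions the weights come to roughly 50.4 GB in BF16 and 12.6 GB in NVFP4. Note that all 25.2 billion parameters must be resident even though only 3.8 billion are active per step.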
DGX Spark reaches around 150 tokens per second
Nvidia says DGX Spark can run DiffusionGemma at around 150 tokens per second. DGX Station goes much higher, with claims of up to 800 tokens per second, while a single H100 Tensor Core GPU in a DGX system can reportedly reach around 1,000 tokens per second.

| Nvidia platform | Claimed DiffusionGemma performance or role |
|---|---|
| DGX Spark | Around 150 tokens per second |
| DGX Station | Up to 800 tokens per second |
| H100 Tensor Core GPU | Around 1,000 tokens per second |
| RTX PRO 6000 workstations | Local professional inference and agent workflows |
| GeForce RTX GPUs | Local desktop AI support, with llama.cpp support coming |

The company says DiffusionGemma can be roughly four times faster than an equivalent autoregressive model in certain local generation scenarios. That could make it useful for workflows where speed and responsiveness matter more than cloud scale.
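To put the claimed throughput figures in terms of waiting time, the sketch below converts tokens per second into seconds for a response of a given length. The throughput values are Nvidia's claims as reported above, not independent measurements. The `baseline_seconds` helper is purely illustrative: it assumes the roughly fourfold speedup applies uniformly, which the article only states for certain local scenarios.

```python
# Throughput figures as claimed by Nvidia in the article (tokens per second).
CLAIMED_TOKENS_PER_SECOND = {
    "DGX Spark": 150,
    "DGX Station": 800,
    "H100 Tensor Core GPU": 1000,
}


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce `num_tokens` at a steady throughput."""
    return num_tokens / tokens_per_second


def baseline_seconds(diffusion_seconds: float, speedup: float = 4.0) -> float:
    """Time a hypothetical autoregressive baseline would take.

    ASSUMPTION: the "roughly four times faster" claim holds for this workload.
    """
    return diffusion_seconds * speedup


if __name__ == "__main__":
    response_tokens = 1000
    for platform, rate in CLAIMED_TOKENS_PER_SECOND.items():
        t = seconds_to_generate(response_tokens, rate)
        print(
            f"{platform}: ~{t:.2f} s for {response_tokens} tokens "
            f"(~{baseline_seconds(t):.2f} s at one quarter of the speed)"
        )
```

At the claimed rates, a 1,000-token response would take about 6.7 seconds on DGX Spark, 1.25 seconds on DGX Station, and 1 second on an H100.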
Local AI is becoming a bigger part of Nvidia’s RTX strategy
Nvidia has been pushing RTX hardware beyond gaming for several years, especially as local AI tools have become more common. DiffusionGemma fits neatly into that strategy because it gives RTX and DGX owners another model that can run on their own hardware.
For developers, local models are useful because they reduce cloud dependency and allow faster testing. For researchers, they make it easier to prototype agent workflows, experiment with model behavior, and fine-tune systems without sending everything to remote servers.
For creators and professionals, local inference can also help with privacy, latency, and cost control.
DiffusionGemma gives Nvidia another way to show its full AI stack
The bigger story is not only that Nvidia supports one new model. It is that Nvidia wants every major open model to work well on its hardware from day one.
That kind of support strengthens the value of RTX and DGX systems. If developers know new models will run quickly on Nvidia hardware, they have less reason to look elsewhere for local AI work.
DiffusionGemma is also a good example of where AI model design is heading. Faster generation is becoming a priority, and diffusion-based text models are one attempt to move beyond the slower one-token-at-a-time approach used by many existing systems.
For now, Nvidia’s day-one support gives DiffusionGemma a strong launch platform. RTX users, workstation owners, and DGX Spark buyers can try the model locally, while Nvidia gets another showcase for CUDA, Tensor Cores, and its wider AI software stack.


