Google updates Gemma models to make AI generation significantly faster.
Gemma 4 is a popular tool among software engineers for tasks ranging from code generation and debugging to building complex agentic systems. According to Google, the Gemma 4 models have been downloaded more than 60 million times since April 2, 2026.
Google recently added Multi-Token Prediction (MTP) to its Gemma 4 family of open models, making text generation up to 3 times faster without reducing output quality.
Previously, generating text involved waiting for the AI to produce each token sequentially. With multi-token prediction, a lightweight drafter model proposes several tokens at once, which the main model verifies in parallel. This leads to faster response times. Since the drafters are open source under Apache 2.0, companies can run them locally.
A Different Approach Than Gemini
Gemma 4 and Gemini are built for different audiences.
Gemini operates in Google's cloud infrastructure with access to large pools of server-class TPUs. When users ask Gemini questions, the system allocates resources to run the calculations across multiple TPUs.
With Gemma 4, Google has refocused on the needs of developers who want AI that runs on their own hardware. In its smaller variants, Gemma 4 runs on the consumer-grade hardware that SMBs already have on their desks: dedicated gaming graphics cards or purpose-built AI accelerators. Multi-Token Prediction speeds things up for locally hosted AI. Google also released Gemma 4 under the Apache 2.0 license.
Previously, the company used more restrictive licenses that limited how developers could use the models. Now it is easier for businesses to build their own systems: fact-checkers, document summarizers, chatbots, code generators. This opens the door to a new wave of custom applications that businesses build on-premises or in private clouds.
The Bottleneck MTP Targets
The main bottleneck for local large language models is moving weights through memory for every generated token. Because each token requires another full pass of the model, most of the time is spent waiting on data transfer rather than on computation.
When the model runs at home on a gaming PC or on a server with limited video memory, it is often waiting for weights to be fetched from VRAM. The result is a costly trade-off: either you wait a long time for the AI to generate text, or you buy expensive workstation hardware with graphics cards costing thousands of dollars to get any real work done. Google's Multi-Token Prediction raises the inference speed of Gemma 4, so models that were previously too slow can now handle real-time applications such as multi-turn conversations or text generation.
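As a rough back-of-the-envelope sketch of why decoding is bandwidth-bound: if every generated token requires reading all model weights once, memory bandwidth divided by model size gives a ceiling on tokens per second. The bandwidth figure, the 4-bit quantization and the average number of accepted tokens per pass used below are illustrative assumptions, not measurements published by Google.

```python
def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Ceiling on decode speed when each token reads every weight once."""
    # billions of parameters * bytes per parameter = model size in GB
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_size_gb


def with_drafting(baseline_tps: float, avg_tokens_per_pass: float) -> float:
    """Idealized gain: each pass of the main model yields several tokens.

    The cost of running the drafter itself is ignored here.
    """
    return baseline_tps * avg_tokens_per_pass


# Assumed numbers for illustration only:
# a 31B dense model at 4 bits (0.5 bytes) per weight, 1000 GB/s of bandwidth,
# and 2.5 tokens accepted per main-model pass on average.
baseline = max_tokens_per_second(31, 0.5, 1000.0)
drafted = with_drafting(baseline, 2.5)

print(f"baseline ceiling: {baseline:.1f} tokens/s")
print(f"with drafting:    {drafted:.1f} tokens/s")
```

The point of the estimate is that the weights are read once per pass of the main model, so accepting several tokens per pass spreads that fixed memory cost over more output.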
How the Drafters Work
The new drafting system has two text generation models: a main model (the target model) and a smaller model (the drafter).
- Quick token generation: When you ask it to write something, the drafter will quickly come up with a few tokens.
- Shared memory cache: Meanwhile, the main model stores its previous attention computations in a region of memory called the key-value (KV) cache. The drafter shares this cache, so it can generate draft text very quickly without recomputing that context from scratch.
- Intelligent mimicking: The drafter is less accurate than the main model, so it relies on its training to guess what the main model would produce.
- Parallel verification: The larger target model compares the drafter's guesses against its own predictions and accepts or rejects them, checking all candidates in parallel rather than one at a time.
- Keeping the best guesses: When a draft token matches what the target would have generated, the system keeps the text. If every guess in a batch is right, all of them get accepted at once.
- Dropping and proposing again: Otherwise it drops that token and everything after it, then the drafter proposes again from that point.
The result is faster generation with unchanged output quality, since every token that is kept is one the target model would have generated itself.
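The draft-and-verify loop described above can be sketched with two toy "models" that map a token sequence to the next token. This is a minimal illustration of greedy speculative decoding, not Google's implementation: the functions `toy_target` and `toy_drafter` are hypothetical stand-ins, there is no shared KV cache, and the verification loop here runs sequentially where a real system would score all draft positions in a single batched pass. One assumption goes beyond the description above: on a mismatch, the sketch keeps the target's own token so that generation always moves forward.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The drafter proposes k tokens, one after another.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each draft position against its own prediction.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(context + draft[:i])
        if draft[i] != expected:
            # 3. First mismatch: drop this draft token and everything after it.
            #    Assumption: the target's own token is kept in its place.
            accepted.append(expected)
            return accepted
        accepted.append(draft[i])
    # 4. Every guess matched, so the whole batch is accepted at once.
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             max_new_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate tokens with drafting; also return the number of rounds used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': a fixed rule over the last token."""
    return (ctx[-1] * 3 + 1) % 11


def toy_drafter(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except at some positions."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 11


tokens, rounds = generate(toy_target, toy_drafter, prompt=[1], max_new_tokens=12)
print(tokens, rounds)
```

Because a draft token is only kept when it equals the target's prediction, the output is identical to what the target would produce alone; the saving comes from needing fewer verification rounds than generated tokens.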
Hardware Performance Gains
- On Google's Pixel phones, E2B models run 2.8 times faster with the new drafting system.
- On the same Pixel phones, the E4B configuration runs 3.1 times faster.
- The new drafting approach is also available on Apple silicon M4 chips, where the 31B Gemma model runs 2.5 times faster.
Availability and Integration
The Multi-Token Prediction (MTP) drafters for Gemma are now available as open source under the Apache 2.0 license.
This means anyone can use them and change them for free. These reference implementations are already integrated with popular local inference frameworks: MLX, vLLM, SGLang and Ollama.
Today, developers can take the latest release of Gemma and plug drafting in without rewriting existing code or infrastructure.
- If you just want to evaluate the model, log in to Google AI Studio at aistudio.google.com and select Gemma 4 from the list of available models.
- If you want to run Gemma 4 on your laptop without any server setup, run it locally using Ollama.
- Gemma 4 can be fine-tuned to specialize for your domain. The recommended way is to use QLoRA with the TRL library.
Businesses are already using Gemma 4
This is the first time a major AI lab has put frontier-capable AI in the hands of businesses without the need for cloud subscriptions or for data to leave the premises, which is changing AI affordability for SMEs. Running the model locally keeps client data under the company’s control, an important requirement for GDPR and other regulations.
Gemma 4's function calling and JSON output allow firms to build AI agents that extract data from invoices, route support tickets or generate code without prompt engineering.
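To illustrate how an agent might consume JSON output for invoice extraction, here is a minimal sketch that parses and validates a model reply before anything downstream uses it. The field names, the `Invoice` type and the sample reply are hypothetical, and the call to the model itself is left out; only the Python standard library is used.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    vendor: str
    invoice_number: str
    total: float
    currency: str


# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "vendor": str,
    "invoice_number": str,
    "total": (int, float),
    "currency": str,
}


def parse_invoice(model_output: str) -> Invoice:
    """Parse the model's JSON reply; raise ValueError if it does not fit."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for field: {name}")
    return Invoice(data["vendor"], data["invoice_number"],
                   float(data["total"]), data["currency"])


# Hypothetical reply from a locally hosted model.
sample_reply = (
    '{"vendor": "Acme GmbH", "invoice_number": "INV-1042", '
    '"total": 1250.5, "currency": "EUR"}'
)
invoice = parse_invoice(sample_reply)
print(invoice)
```

Validating the reply at this boundary means a malformed answer is rejected before it reaches accounting data, regardless of which model produced it.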
Different model variants suit different workloads:
- E4B for low‑latency customer service and chat bots,
- 26B MoE for contract review and financial report analysis,
- 31B Dense for code generation and complex decision support,
- E2B for factory‑floor monitoring and retail assistants.
For finance leaders who cannot send general ledgers to public cloud APIs, Gemma 4's local deployment and native JSON output make it possible to automate invoice processing, revenue recognition and accounts payable while keeping payroll and vendor data air‑gapped.
Companies like WaveFlow use Gemma 4 to perform OCR on invoices, match them with local purchase orders and automate accounts payable entirely on local hardware.