Sunday, June 7, 2026

AI Intel Report

Google DeepMind Releases Gemma 4 12B On-Device Multimodal Model

Google DeepMind released Gemma 4 12B on June 3, 2026. The encoder-free multimodal model runs on standard laptops with 16GB of RAM, delivers performance nearing that of the 26B MoE variant, supports a context window of up to 256K tokens, and covers more than 140 languages.

Illustration (AI Intel Report): a person at a home-office desk uses a standard laptop to interact with an on-device multimodal AI model, with a smartphone alongside for cross-device input.

Gemma 4 12B is a unified, encoder-free multimodal model in the Gemma 4 family developed by Google DeepMind for on-device deployment.

Google released the model on June 3, 2026. Rather than routing images and audio through separate encoders, it feeds everything to a single decoder-only transformer: lightweight linear layers project raw image patches and audio waveforms directly into the model's embedding space.
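To make the encoder-free idea concrete, here is a toy, pure-Python sketch of the general technique: an image is cut into patches and a waveform into fixed-length frames, and each piece is mapped by a single linear layer to the same embedding width as text tokens, so one decoder can consume the combined sequence. Everything in it is hypothetical (the 16-wide embeddings, the 4-pixel patch size, the 160-sample audio frame, the random weights, and the simple text-then-image-then-audio ordering). It illustrates the approach described above, not Google's implementation, and it omits details a real model likely needs, such as positional information or modality boundary tokens.

```python
import random

# Hypothetical toy sizes; the article does not give the model's real dimensions.
D_MODEL = 16   # embedding width shared by all modalities
PATCH = 4      # image patch side length in pixels
CHANNELS = 3   # RGB
FRAME = 160    # audio samples per frame

rng = random.Random(0)


def random_matrix(rows, cols):
    """Weights of a linear layer: rows = output dim, cols = input dim."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(W, x):
    """Compute y = W @ x (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Partial patches at the right/bottom edges are dropped.
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - h % patch, patch):
        for left in range(0, w - w % patch, patch):
            vec = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    vec.extend(image[r][c])  # append this pixel's channels
            patches.append(vec)
    return patches


def frame_audio(samples, frame):
    """Split a 1-D waveform into non-overlapping frames, dropping the tail."""
    n = len(samples) // frame
    return [samples[i * frame:(i + 1) * frame] for i in range(n)]


# One linear projection per non-text modality: no separate encoder network.
W_IMG = random_matrix(D_MODEL, PATCH * PATCH * CHANNELS)
W_AUD = random_matrix(D_MODEL, FRAME)


def build_sequence(text_embeddings, image, waveform):
    """Concatenate text, image, and audio vectors into one decoder input sequence."""
    image_tokens = [linear(W_IMG, p) for p in patchify(image, PATCH)]
    audio_tokens = [linear(W_AUD, f) for f in frame_audio(waveform, FRAME)]
    return text_embeddings + image_tokens + audio_tokens


# Usage: 5 placeholder text embeddings, an 8x8 RGB image, 800 audio samples.
text = [[0.0] * D_MODEL for _ in range(5)]
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]
waveform = [rng.uniform(-1.0, 1.0) for _ in range(800)]
seq = build_sequence(text, image, waveform)
print(len(seq), len(seq[0]))  # 5 text + 4 image + 5 audio vectors, each 16 wide
```

The 8x8 image yields four 4x4 patches and the 800-sample waveform yields five frames, so the decoder sees 14 vectors of width 16. The design point is that the only modality-specific parameters are the two projection matrices.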

Details are available from the Google blog post announcing the model and the official Gemma model card.

What benchmarks has Gemma 4 12B achieved?

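Google says the 12B model delivers performance nearing that of the larger Gemma 4 26B MoE variant on standard benchmarks while using less than half the total memory.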

How does the encoder-free design benefit on-device AI?

By eliminating dedicated encoders, the architecture reduces complexity and memory requirements, allowing the 12B model to run efficiently on standard consumer laptops equipped with 16GB of RAM or unified memory.

“Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint.”
Olivier Lacombe and Gus Martins, Director of Product Management and Product Manager, Google DeepMind
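A rough way to sanity-check the memory claims is to multiply parameter count by bytes per parameter. The sketch below does that for weights only. The precisions (bf16, int8, int4) are assumptions, since the article does not say how the on-device build is stored, and the estimate ignores activations, the KV cache, and runtime overhead. It also assumes the 26B MoE model keeps all of its expert weights resident, so its footprint tracks total rather than active parameters.

```python
# Assumed bytes per parameter for common weight formats; the article does not
# state which precision Gemma 4 12B uses on device.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Weights only: activations, KV cache, and runtime overhead are ignored.
    """
    return params_billion * BYTES_PER_PARAM[fmt]


for fmt in ("bf16", "int8", "int4"):
    dense_12b = weight_gb(12, fmt)
    moe_26b = weight_gb(26, fmt)  # assumes every expert is held in memory
    print(f"{fmt}: 12B ~ {dense_12b:.1f} GB, 26B ~ {moe_26b:.1f} GB, "
          f"ratio {dense_12b / moe_26b:.2f}")
```

At equal precision the ratio is 12/26 ≈ 0.46, consistent with the “less than half” figure. At bf16 the 12B weights alone would come to about 24 GB, which suggests (as an inference, not something the article states) that fitting within 16GB relies on reduced-precision weights.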

What are the main specifications of the Gemma 4 12B model?

Gemma 4 12B Key Specifications

| Specification | Details |
|---|---|
| Parameters | 12 billion |
| Context window | Up to 256K tokens |
| Languages supported | Over 140 |
| License | Apache 2.0 |
| Availability | Hugging Face, as google/gemma-4-12B-it |

What input modalities does Gemma 4 12B support?

  1. Text inputs and generation
  2. Image understanding and analysis
  3. Audio waveform processing
  4. Video input handling

The model is part of a broader family including E2B, E4B, 26B A4B MoE, and 31B variants. It is optimized for Google AI Edge and LiteRT-LM.

Weights are released under Apache 2.0 and accessible via Hugging Face.