After showing impressive efficiency with Gemma 3, running powerful AI on a single GPU, Google has pushed the boundaries even further with Gemma 3n. This new release brings state-of-the-art AI to mobile and edge devices, using minimal memory while delivering fast, multimodal performance. In this article, we’ll explore what makes Gemma 3n so powerful, how it works under the hood with innovations like Per-Layer Embeddings (PLE) and MatFormer architecture, and how to access Gemma 3n easily using Google AI Studio. If you’re a developer looking to build fast, smart, and lightweight AI apps, this is your starting point.
Gemma 3 showed us that powerful AI models can run efficiently, even on a single GPU, while outperforming larger models like DeepSeek V3 in chatbot Elo scores with significantly less compute. Now, Google has taken things further with Gemma 3n, designed to bring state-of-the-art performance to even smaller, on-device environments like mobile phones and edge devices.
To make this possible, Google partnered with hardware leaders like Qualcomm, MediaTek, and Samsung System LSI, introducing a new on-device AI architecture that powers fast, private, and multimodal AI experiences. The “n” in Gemma 3n stands for nano, reflecting its small size yet powerful capabilities.
This new architecture is built on two key innovations: Per-Layer Embeddings (PLE), which keep a large share of parameters out of the model's working memory, and the MatFormer architecture, which nests smaller sub-models inside a larger one.
Together, these innovations make Gemma 3n efficient enough to run high-performance, multimodal AI on low-resource devices.
When Gemma 3n models run, Per-Layer Embedding (PLE) parameters are used to generate data that improves the performance of each model layer. As each layer executes, its PLE data can be produced independently, outside the model's working memory, cached to fast storage, and then incorporated into the inference process. By keeping PLE parameters out of the model's memory space, this approach lowers resource usage without sacrificing the quality of the model's responses.
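To make the mechanism more concrete, here is a minimal, purely illustrative Python sketch of the per-layer caching idea. This is not Gemma 3n's actual implementation: the load_ple_from_cache helper, the cache directory, and the random data are hypothetical stand-ins; the point is only that per-layer data can live in fast storage and be pulled in one layer at a time instead of staying resident in accelerator memory.

import numpy as np
from pathlib import Path

CACHE_DIR = Path("ple_cache")  # hypothetical fast local storage for per-layer data
CACHE_DIR.mkdir(exist_ok=True)

def load_ple_from_cache(layer_idx: int, dim: int) -> np.ndarray:
    """Load (or lazily create) the per-layer embedding for a single layer.

    In this toy version the "embedding" is random data persisted to disk;
    a real model would store learned PLE parameters instead.
    """
    path = CACHE_DIR / f"layer_{layer_idx}.npy"
    if not path.exists():
        np.save(path, np.random.randn(dim).astype(np.float32))
    return np.load(path)

def run_layer(hidden: np.ndarray, layer_idx: int) -> np.ndarray:
    # Pull the PLE data for just this layer; it never has to be resident
    # for all layers at once, which is what shrinks the memory footprint.
    ple = load_ple_from_cache(layer_idx, dim=hidden.shape[-1])
    return hidden + ple  # stand-in for the layer's actual computation

hidden = np.zeros(256, dtype=np.float32)
for layer_idx in range(4):
    hidden = run_layer(hidden, layer_idx)
print(hidden.shape)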
Gemma 3n models are labeled with parameter counts like E2B and E4B, which refer to their effective parameter usage, a value lower than their total number of parameters. The "E" prefix signifies that these models can operate with a reduced set of active parameters, thanks to the flexible parameter technology built into Gemma 3n, which lets them run more efficiently on lower-resource devices.
These models organize their parameters into four key categories: text, visual, audio, and per-layer embedding (PLE) parameters. For instance, while the E2B model normally loads over 5 billion parameters during standard execution, it can reduce its active memory footprint to just 1.91 billion parameters by using parameter skipping and PLE caching, as shown in the following image:
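As a rough back-of-envelope illustration (the parameter counts come from the figures above; the 2-bytes-per-parameter assumption is ours, corresponding to fp16/bf16 weights), the difference in weight memory looks roughly like this:

BYTES_PER_PARAM = 2  # assuming fp16/bf16 weights; quantized formats would be smaller

total_params = 5.0e9       # "over 5 billion" parameters loaded in standard execution
effective_params = 1.91e9  # active footprint with parameter skipping + PLE caching

print(f"standard:  ~{total_params * BYTES_PER_PARAM / 1e9:.1f} GB of weights")
print(f"effective: ~{effective_params * BYTES_PER_PARAM / 1e9:.1f} GB of weights")
# standard:  ~10.0 GB of weights
# effective: ~3.8 GB of weights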
Gemma 3n is fine-tuned for on-device tasks, which lets the model interact with its environment and lets users interact with applications naturally. On mobile, Gemma 3n responds about 1.5 times faster than Gemma 3 4B, making the user experience noticeably more fluid by cutting down on generation latency.
Gemma 3n ships with a smaller sub-model nested inside it, a unique 2-in-1 MatFormer design. This lets users dynamically trade off quality and speed as needed, without managing a separate model; everything happens within the same memory footprint.
Gemma 3n models use the Matryoshka Transformer, or MatFormer, architecture, in which smaller models are nested inside a bigger one. When responding to queries, inference can run on the nested sub-models alone, without activating the enclosing model's parameters. Running only the smaller, core model inside a MatFormer lowers the model's energy footprint, response time, and compute cost. In Gemma 3n, the E2B model's parameters are contained within the E4B model, and the architecture also lets you mix and match configurations to assemble models at sizes between 2B and 4B.
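Here is a minimal, illustrative sketch of the nesting idea behind MatFormer (again, not Gemma 3n's real implementation, and the layer sizes are arbitrary toy values): the smaller model's weights are a prefix slice of the larger model's, so choosing a width at inference time selects a sub-model without loading anything extra.

import numpy as np

rng = np.random.default_rng(0)
D_MODEL, HIDDEN_FULL, HIDDEN_SMALL = 64, 256, 128  # toy sizes, not Gemma 3n's

# One set of feed-forward weights; the "small" model is just the first HIDDEN_SMALL units.
w_in = rng.standard_normal((D_MODEL, HIDDEN_FULL)).astype(np.float32)
w_out = rng.standard_normal((HIDDEN_FULL, D_MODEL)).astype(np.float32)

def ffn(x: np.ndarray, hidden_width: int) -> np.ndarray:
    """Run the feed-forward block using only the first `hidden_width` hidden units."""
    h = np.maximum(x @ w_in[:, :hidden_width], 0.0)  # ReLU over the selected slice
    return h @ w_out[:hidden_width, :]

x = rng.standard_normal(D_MODEL).astype(np.float32)
fast = ffn(x, HIDDEN_SMALL)  # smaller nested model: cheaper, lower latency
full = ffn(x, HIDDEN_FULL)   # full model: more compute, higher quality
print(fast.shape, full.shape)

Because the small configuration is literally a slice of the larger one's weights, switching between them requires no separate model in memory, which is the 2-in-1 behavior described above.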
The Gemma 3n preview is available in Google AI Studio, through the Google GenAI SDK, and via MediaPipe (on Hugging Face and Kaggle). Here, we will access Gemma 3n using Google AI Studio.
!pip install google-genai
Step 8: Use Colab Secrets to store your GEMINI_API_KEY, and enable notebook access for it as well.
from google.colab import userdata
import os

# Read the key from Colab Secrets and expose it to the SDK as an environment variable
os.environ["GEMINI_API_KEY"] = userdata.get('GEMINI_API_KEY')
import os

from google import genai
from google.genai import types


def generate():
    # Create a client authenticated with the key stored earlier
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    # Instruction-tuned Gemma 3n model with 4B effective parameters
    model = "gemma-3n-e4b-it"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""Anu is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Anu have?"""),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="text/plain",
    )

    # Stream the response and print chunks as they arrive
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")


if __name__ == "__main__":
    generate()
Output:
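Since Gemma 3n is multimodal, you can also try sending an image alongside text with the same SDK. The snippet below is a small sketch, assuming a local file named photo.jpg exists and that the preview model accepts image input through this API; file name and prompt are placeholders.

import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

# Read a local image (hypothetical file name) and send it together with a text prompt
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemma-3n-e4b-it",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                types.Part.from_text(text="Describe this image in one sentence."),
            ],
        ),
    ],
)
print(response.text)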
Gemma 3n is a big leap for AI on small devices. It runs powerful models with less memory and faster speeds. Thanks to PLE and MatFormer, it is both efficient and capable. It works with text, images, audio, and even video, all on-device. Google has made it easy for developers to test and use Gemma 3n through Google AI Studio. If you're building mobile or edge AI apps, Gemma 3n is definitely worth exploring. Check out Google AI Edge to run Gemma 3n locally.