North Mini Code vs. Gemma 4: Why Sparse MoE is Winning Local AI Coding

Verdict: North Mini Code is the most efficient local coding model for 2026, delivering frontier-class reasoning (33.4 Coding Index) at speeds exceeding 90 tokens per second on consumer hardware. While Gemma 4 remains a strong generalist, North Mini Code’s 128-expert architecture makes it the superior choice for high-frequency, autonomous agent workflows.

Last verified: 2026-06-20
TL;DR:
• Performance: 33.4 on the Coding Index, outperforming Mistral 4 and Llama 4 Scout.
• Efficiency: 30B total parameters, but only 3B active per token (Sparse MoE).
• Speed: ~92 tokens/sec on M4 Max; roughly 2.8x faster than Devstral Small 2.
• Open: Apache 2.0 license with full support for local tool use and 256K context.

Is North Mini Code better than Gemma 4 for local coding?

Yes. Gemma 4 (released April 2026) is a formidable open-weights model, but North Mini Code specifically targets the "agentic" bottleneck: the need for deep reasoning without the high latency of dense models. In head-to-head benchmarks, North Mini Code scores 33.4 on the Coding Index compared to Gemma 4’s 31.2.

The difference lies in training. Where Gemma 4 is a generalist, North Mini Code was trained using Reinforcement Learning with Verifiable Rewards (RLVR) against real-world software environments (SWE-agent and OpenCode). This means it is optimized not just to predict the next token of code, but to complete terminal tasks and pass test suites.

How does the 128-expert MoE architecture work?

North Mini Code uses a sparse Mixture-of-Experts (MoE) design with 128 individual expert networks. A conventional dense model uses its entire parameter count for every token it generates. North Mini Code instead activates only 8 of its 128 experts (roughly 3B parameters) per token.

This provides two major advantages for small businesses and local developers:

• Intelligence of a 30B model: It has the broad knowledge base of a much larger system.
• Speed of a 3B model: Because only 3B parameters are active per token, it runs fast (up to 180 t/s via API and 90+ t/s locally), eliminating the "lag" that kills the flow of an autonomous coding agent.
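The routing idea described above can be sketched in a few lines of plain Python. This is a toy illustration of generic top-k expert routing, not North Mini Code's actual router, which the article does not describe. The function names, the softmax-over-selected-experts weighting, and the scalar "experts" are all assumptions made for the example; only the 128 and 8 come from the text.

```python
import math

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # experts activated per token, per the article


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and renormalize their weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    The weighting scheme is an assumption; real routers vary.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, experts, router_logits):
    """Run only the selected experts and blend their outputs.

    The other experts are never called, which is where the compute
    saving comes from.
    """
    output = 0.0
    for index, weight in route_token(router_logits):
        output += weight * experts[index](token)
    return output
```

Every expert still has to exist in memory so the router can choose it, but each token pays the compute cost of only the eight it selects.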

What are the hardware requirements for North Mini Code?

Despite its 30B size, North Mini Code is surprisingly accessible thanks to its sparse activation. Sparse activation reduces the compute per token, not the memory footprint: you still need enough VRAM to hold all 30B parameters.

Precision	VRAM Required	Recommended Hardware
Q4 (4-bit)	~19 GB	RTX 4090 (24GB) or Mac Studio (32GB+)
FP8 (8-bit)	~30 GB	Mac Studio (64GB) or 1x A100 (40GB)
BF16 (Full)	~60 GB	1x H100 (80GB) or 3x RTX 3090/4090 (72GB combined)

For most local builders, the Q4 GGUF version is the sweet spot: it fits comfortably on a modern Mac or a high-end consumer GPU while retaining 98% of the full-precision model's quality.
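The figures in the table above can be sanity-checked with simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below computes that weights-only lower bound. The function name and the nominal bit widths are assumptions for illustration; real usage also includes the KV cache and runtime buffers.

```python
TOTAL_PARAMS = 30e9  # 30B total parameters, per the article

# Nominal bits per weight for each precision listed in the table.
BITS_PER_WEIGHT = {"Q4": 4, "FP8": 8, "BF16": 16}


def weight_gb(params, bits):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")
# Q4: 15.0 GB
# FP8: 30.0 GB
# BF16: 60.0 GB
```

The FP8 and BF16 results match the table. The Q4 result comes out below the listed ~19 GB. One plausible explanation is that common GGUF 4-bit variants store more than 4 bits per weight on average once scaling factors are included, so treat the nominal figure as a floor.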

How do you run North Mini Code locally with Ollama?

As of June 2026, support for North Mini Code’s custom MoE architecture has been integrated into the major local serving stacks. You can run it in minutes:

1. Update Ollama: Ensure you are on the latest version (v0.6.4 or higher).
2. Pull and run the model: Run ollama run north-mini-code, which downloads the weights on first use.
3. Integrate: Point your Hermes Agent or IDE extension to the local Ollama endpoint (localhost:11434).

For those building complex autonomous agent stations, we recommend keeping the model "warm" in memory to ensure near-instant response times for multi-step tasks.
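As a minimal sketch of the integration and "warm" model advice above, the snippet below calls Ollama's REST generate endpoint using only the Python standard library. The model name and port come from the steps above. The 30-minute keep_alive value, the timeout, and the helper names are assumptions chosen for illustration, so adjust them to your setup.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # local endpoint from the article
MODEL = "north-mini-code"                            # model name from the article


def build_request(prompt, keep_alive="30m"):
    """Build a non-streaming generate request that keeps the model loaded."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,           # return one JSON object instead of a stream
        "keep_alive": keep_alive,  # how long Ollama keeps the model in memory
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server with the model pulled):
#   print(generate("Write a Python function that reverses a string."))
```

A longer keep_alive trades memory for latency: the weights stay resident between agent steps, so follow-up requests skip the load time.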

What this means for you

The arrival of North Mini Code marks the end of the "Frontier Tax" for software development. Small businesses can now deploy private, offline agents that actually understand their codebase without sending proprietary data to a cloud provider.

By routing repetitive "grind" tasks—like unit testing, documentation, and refactoring—to a local North Mini instance, you can save your premium Claude or GPT-4o tokens for high-level architecture and strategic decisions.
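One way to picture that split is a small routing table that sends the "grind" categories to the local model and everything else to a premium provider. This is only a sketch. The backend labels and function names are hypothetical placeholders, not real endpoints, and the category list simply mirrors the examples above.

```python
# Task categories the article suggests keeping local; extend as needed.
LOCAL_TASKS = {"unit_testing", "documentation", "refactoring"}

# Hypothetical backend labels, not real endpoints or product names.
LOCAL_BACKEND = "local-north-mini"
CLOUD_BACKEND = "premium-cloud"


def pick_backend(task_type):
    """Return the backend label for a task category."""
    return LOCAL_BACKEND if task_type in LOCAL_TASKS else CLOUD_BACKEND


def plan(tasks):
    """Group (task_type, description) pairs by the backend that should run them."""
    routed = {LOCAL_BACKEND: [], CLOUD_BACKEND: []}
    for task_type, description in tasks:
        routed[pick_backend(task_type)].append(description)
    return routed
```

For example, plan([("refactoring", "split utils.py"), ("architecture", "design the billing service")]) places the first task under the local backend and the second under the cloud one.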

FAQ

Q: Can I use North Mini Code for commercial projects?
A: Yes. It is released under the Apache 2.0 license, which allows for commercial use, modification, and distribution without royalties.

Q: How does it handle large files?
A: It features a 256K context window, allowing it to ingest and reason across entire modules or small repositories in a single pass.

Q: Does it support voice-to-code building?
A: Yes, when paired with a voice-capable harness like Hermes Agent, North Mini Code’s low latency makes real-time voice-controlled building possible.

Q: Is it better than Code Llama 3.6?
A: Benchmarks suggest North Mini Code is superior for "agentic" tasks (multi-step terminal use), whereas Code Llama 3.6 remains slightly more consistent at raw Python generation.

Updates & Corrections

2026-06-20: Original article published. Verified 33.4 Coding Index score and 128-expert MoE architecture.
