Hugging Face

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

June 1, 2026 · Commentary on a Hugging Face Blog announcement

Summary

JetBrains released Mellum2, a 12B-parameter MoE model that activates 2.5B parameters per token, aimed at latency-sensitive code and text tasks. It's Apache 2.0 licensed and available on Hugging Face.

JetBrains announced Mellum2, a Mixture-of-Experts model with 12B total parameters designed for fast inference on text and code workloads. The release was published on the Hugging Face Blog with the model weights available under the Apache 2.0 license.

What’s actually new

Mellum2 is a from-scratch MoE architecture — not a fine-tune of an existing base. It activates only 2.5B of its 12B parameters per token, which is where the inference speed claim comes from: JetBrains reports more than 2x faster inference compared to similarly sized open models. The model is deliberately scoped to text and code, skipping multimodal capabilities entirely. JetBrains positions it not as a standalone general-purpose model but as a “focal” component — something you’d slot into a larger system for routing, RAG post-processing, sub-agent tasks, or context compression. The original Mellum was a code completion model; Mellum2 broadens that to natural language while keeping the same efficiency-first design philosophy. A full technical report with architecture details, training setup, and benchmark methodology is linked from the announcement on arXiv.

What it means for your config

Mellum2 is a model release, not a library or framework update, so there’s no direct config migration path to worry about. That said, if you’re running inference pipelines that reference Hugging Face model IDs — in transformers configs, vLLM serving configs, or orchestration YAML — you’ll want to note the new collection path (JetBrains/mellum-2) for when you swap it in. MoE models can behave differently from dense models in serving setups: tensor parallelism strategies, memory allocation, and expert routing overhead may require adjustments to your inference server configuration. The announcement doesn’t detail specific serving recommendations (quantization support, recommended frameworks, or minimum hardware), so check the model card and technical report before committing to a deployment config. If you’re using model routing logic that dispatches to different models based on task type — the kind of setup Mellum2 is explicitly designed for — this is a new candidate to add to your routing table alongside larger reasoning models.

Recommended next step

Pull up the model collection on Hugging Face and the linked arXiv technical report. Before integrating, verify the benchmark results are relevant to your specific workload — “competitive with similarly sized models” covers a wide range of tasks, and performance on code generation versus general reasoning can vary significantly. If you’re running multi-model agent systems where intermediate calls (routing, summarization, validation) are bottlenecked on latency or cost, Mellum2’s 2.5B active parameter footprint makes it worth benchmarking against whatever you’re currently using for those steps.

Read the full announcement on Hugging Face Blog → Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

More Hugging Face Updates

July 28, 2026

The OlmoEarth Platform: Geospatial inference at planetary scale

Allen Institute for AI details the infrastructure behind OlmoEarth Platform, which runs geospatial foundation model inference at continent scale by splitting work across CPU and GPU stages with heavy parallelism.

July 27, 2026

NVIDIA Cosmos-H-Dreams: Bringing Real-Time Generative Simulation to Surgical Robotics

NVIDIA releases Cosmos-H-Dreams, a distilled world model that runs action-conditioned surgical simulation at ~160 fps on a single GPU, along with FlashDreams inference engine and a recipe for adapting to custom embodiments.

July 23, 2026

Bringing Nunchaku 4-bit Diffusion Inference to Diffusers

Hugging Face integrates Nunchaku Lite into Diffusers, enabling native loading of SVDQuant W4A4 checkpoints via from_pretrained() with no local CUDA compilation. This cuts diffusion model VRAM usage roughly in half while also delivering inference speedups.