ConfigDeck

Hugging Face Updates

Stay updated with the latest Hugging Face releases, security patches, and feature updates.

Latest Hugging Face Updates

Hugging Face

olmo-eval: An evaluation workbench for the model development loop

Allen AI released olmo-eval, an open-source evaluation workbench that extends OLMES to cover the iterative model development loop — not just final-model benchmarking. It emphasizes modularity, pairwise checkpoint comparison, and flexible sandboxing.

Hugging Face

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

ServiceNow AI published a benchmark and dataset evaluating seven ASR models on code-switched (bilingual) speech across four language pairs, released through their AU-Harness evaluation tool. ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal 3-Pro came out on top, while Whisper Large V3 Turbo performed poorly because it defaulted to translation mode on mixed-language audio.

Hugging Face

Amazing Digital Dentures (a failed project)

A Hugging Face hackathon participant documents their failed attempt to build an LLM-powered game generator using Nemotron 30b, detailing the prompt engineering and RAG strategies that didn't work and the scaled-back HTML toy maker that did.

Hugging Face

Five labs, five minds: building a multi-model finance drama on small models

A Hugging Face hackathon project runs four different labs' small models as separate agents in an emergent economy simulation, surfacing practical lessons about serving heterogeneous models, information isolation, and bounded memory in multi-agent setups.

Hugging Face

Thousand Token Wood: shipping a multi-agent economy on a 3B model

A Build Small Hackathon project runs five autonomous trading agents on Qwen2.5-3B via vLLM, demonstrating that a small model can reliably produce structured output for a multi-agent simulation while requiring heavy prompt engineering to compensate for weak reasoning.

Hugging Face

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

NVIDIA releases Nemotron 3.5 Content Safety, a 4B-parameter model built on Gemma 3 that adds custom policy enforcement, auditable reasoning traces, and a public safety dataset to its existing multimodal and multilingual classification capabilities.

Hugging Face

Holo3.1: Fast & Local Computer Use Agents

H Company releases the Holo3.1 family of computer-use models in four sizes (0.8B to 35B-A3B) with quantized checkpoints for local inference, expanded mobile support, and native function-calling protocols.

Hugging Face

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

JetBrains released Mellum2, a 12B-parameter MoE model that activates 2.5B parameters per token, aimed at latency-sensitive code and text tasks. It's Apache 2.0 licensed and available on Hugging Face.

Hugging Face

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Artificial Analysis and IBM launch ITBench-AA, a benchmark testing frontier AI models on agentic SRE tasks like Kubernetes incident diagnosis. No model breaks 50%, making it one of the least saturated agentic benchmarks available.