Verbalized Sampling
// Unlocking the diversity we thought alignment killed
const diversity = model.verbalize();
We've been asking LLMs the wrong question. Instead of "give me one answer," we should be asking: "show me your distribution."
A recent paper from Stanford and Northeastern researchers dropped something that should change how we think about aligned LLMs forever. It's called Verbalized Sampling (VS), and it's solving a problem most of us didn't even realize we had: mode collapse.
The Problem: Mode Collapse is Real
You know that feeling when you ask GPT-4 to write a joke about coffee, and it keeps giving you variations of the same "scientists and atoms" pun? That's mode collapse. The model has learned to favor the most typical, familiar responses—not because it's broken, but because that's what human preference data taught it.
The researchers found something fascinating: this isn't just an algorithmic problem. It's baked into the preference data itself. Human annotators have a typicality bias—we prefer text that feels familiar and predictable. This gets amplified during RLHF, sharpening the model's distribution toward a few stereotypical modes.
When many answers are equally valid (like jokes or creative writing), typicality bias acts as a tiebreaker. The aligned model collapses to:
π*(y|x) ∝ π_ref(y|x)^γ
where γ > 1 sharpens the distribution, compressing probability mass toward the most typical completions. Result? Mode collapse.
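You can see the effect of that sharpening in a few lines. This is a toy illustration (the four-joke distribution and γ = 4 are made up for the demo, not taken from the paper):

```python
def sharpen(probs, gamma):
    """Raise each probability to the power gamma and renormalize."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

base = [0.40, 0.30, 0.20, 0.10]   # reference model's spread over 4 jokes
aligned = sharpen(base, gamma=4)  # RLHF-style sharpening with gamma > 1

print(max(base))     # 0.4  -- probability mass is spread out
print(max(aligned))  # ~0.72 -- mass piles onto the single most typical joke
```

Even a modest γ moves most of the probability mass onto the mode, which is exactly the repetitive behavior you see in practice.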
The Solution: Just Ask for the Distribution
Here's where it gets beautiful. Instead of asking for one response, you ask the model to generate multiple responses with their probabilities. Like this:
"Generate 5 jokes about coffee, each with their corresponding probability (0.0 to 1.0) relative to the full distribution."
When you ask for a distribution explicitly, the model doesn't collapse to a single mode. It approximates the diverse distribution it learned during pretraining. The researchers showed this empirically: when prompted to generate US states with probabilities, VS produces a distribution that closely matches the pretraining corpus (KL divergence: 0.12). Direct prompting? It just keeps saying "California" and "Texas."
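The KL-divergence comparison is easy to reproduce in spirit. The numbers below are illustrative stand-ins (a three-state toy distribution, not the paper's US-states data), but they show the measurement:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) over a shared support, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative only: a toy "pretraining" distribution vs. two prompting modes.
pretrain = [0.5, 0.3, 0.2]
verbalized = [0.45, 0.33, 0.22]  # VS roughly tracks the pretraining distribution
direct = [0.95, 0.04, 0.01]      # direct prompting piles onto one mode

print(kl_divergence(pretrain, verbalized))  # small: distributions nearly match
print(kl_divergence(pretrain, direct))      # large: collapsed to the mode
```

A small KL against the pretraining distribution is the quantitative version of "the diversity is still in there."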
Why This Matters
The results are wild. On creative writing tasks, VS increases diversity by 1.6-2.1× over direct prompting. It recovers 66.8% of the base-model diversity that alignment training suppressed. And here's the kicker: it doesn't sacrifice quality or safety.
The method works across tasks:
- Creative Writing: Poems, stories, jokes—all more diverse without quality loss
- Dialogue Simulation: More human-like behaviors, better donation distributions
- Open-Ended QA: Better coverage of valid answers, closer to pretraining distribution
- Synthetic Data Generation: More diverse training data improves downstream performance
My Take: This is Bigger Than We Think
Look, I've been vibecoding for a while now. I've seen firsthand how aligned models get stuck in ruts. You ask for something creative, and they give you the same three variations. This paper isn't just solving mode collapse—it's revealing something fundamental about how we should interact with LLMs.
The Real Insight
We've been treating LLMs like deterministic functions when they're actually probability distributions. By asking them to verbalize their uncertainty, we're not just getting better outputs—we're accessing the full generative potential that alignment tried to constrain.
The fact that larger models benefit more from VS is telling. It suggests that as models get more capable, they retain more of that pretraining diversity—we just need to ask for it correctly. This is a prompt engineering breakthrough that doesn't require training, fine-tuning, or API changes. Just better prompts.
The Practical Implications
For builders like us, this changes everything:
- Creative tasks get more creative. Instead of 5 similar jokes, you get 5 actually different jokes.
- Exploration in RL gets better. More diverse rollouts mean better policy learning.
- Synthetic data generation becomes viable. You can generate diverse training data without mode collapse.
- We can tune diversity. By adjusting probability thresholds in the prompt, you control how diverse the outputs are.
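That last bullet can be sketched in code. A minimal, hypothetical post-processing step, assuming the model has already returned (text, probability) pairs in the VS format:

```python
import random

def sample_from_tail(candidates, threshold=0.10, rng=None):
    """Keep only low-probability (atypical) candidates, then draw one
    in proportion to its verbalized probability.
    `candidates` is a list of (text, probability) pairs."""
    rng = rng or random.Random()
    tail = [(t, p) for t, p in candidates if p < threshold]
    if not tail:  # nothing under the threshold; fall back to the full set
        tail = candidates
    texts, weights = zip(*tail)
    return rng.choices(texts, weights=weights, k=1)[0]

jokes = [("decaf pun", 0.40), ("barista pun", 0.30),
         ("espresso haiku", 0.08), ("percolator noir", 0.05)]
print(sample_from_tail(jokes, threshold=0.10))  # one of the two tail jokes
```

Lowering the threshold pushes you further into the tail; raising it pulls you back toward the typical modes. That's the diversity knob.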
The researchers even showed you can combine VS with temperature scaling and other decoding strategies. It's orthogonal to existing techniques, which means it's additive.
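Because the probabilities are verbalized, you can even apply temperature on top of them yourself. A sketch (my own post-hoc rescaling of the model's stated probabilities, not an API feature):

```python
def apply_temperature(probs, temperature):
    """Rescale a verbalized distribution softmax-style.
    temperature > 1 flattens it (more diversity); < 1 sharpens it."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

verbalized = [0.5, 0.3, 0.2]               # probabilities the model wrote out
print(apply_temperature(verbalized, 2.0))  # flatter: more tail mass
print(apply_temperature(verbalized, 0.5))  # sharper: back toward the mode
```

Note this is the same math as the γ-sharpening above, just run in whichever direction you want, which is why the two knobs compose cleanly.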
The Meta-Lesson
Here's what I find most interesting: this paper proves that aligned models haven't lost their diversity. They've just learned to hide it. By changing how we prompt, we can unlock what was always there.
This is a data-centric perspective on a problem everyone thought was algorithmic. The typicality bias in preference data is the root cause, and VS is the inference-time workaround. But it's more than a workaround—it's a new way of thinking about LLM interaction.
If we can mitigate mode collapse at inference time, what else can we unlock? This opens up possibilities for richer exploration, better hypothesis generation, and more realistic simulations. The diversity-quality trade-off isn't fixed—it's promptable.
Try It Yourself
The next time you're stuck with repetitive outputs, try this:
"Generate {k} responses to: {your_prompt}
Return as JSON with 'responses' (list of dicts).
Each dict must include:
• text: the response string
• probability: estimated probability (0.0-1.0) relative to full distribution"
You'll be surprised how much more diverse your outputs become. And if you want even more diversity, add: "Sample from the tail of the distribution where each response has probability < 0.1."
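Consuming that output is a few lines of standard-library code. The JSON below is a hypothetical model response in the shape the prompt requests, just to show the plumbing:

```python
import json
import random

# Hypothetical model output matching the prompt's requested JSON shape.
raw = '''{"responses": [
  {"text": "Why did the coffee file a police report? It got mugged.", "probability": 0.35},
  {"text": "Espresso yourself.", "probability": 0.25},
  {"text": "A latte can change in a day.", "probability": 0.15}
]}'''

data = json.loads(raw)["responses"]
texts = [r["text"] for r in data]
weights = [r["probability"] for r in data]

# Verbalized probabilities needn't sum to 1; random.choices treats
# them as relative weights, so no manual normalization is required.
pick = random.choices(texts, weights=weights, k=1)[0]
print(pick)
```

From here you can sample one response per call, or keep the whole list and filter by probability as shown earlier.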
Final Thoughts
This paper is a reminder that sometimes the best solutions are the simplest ones. We don't need new architectures or training methods. We just need to ask better questions.
Verbalized Sampling isn't just a technique—it's a philosophy. Treat LLMs as distributions, not deterministic functions. Ask for their uncertainty. Trust that the diversity is still there, waiting to be unlocked.
The model hasn't forgotten. We just stopped asking.
Reference: Zhang et al. (2025). "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity." arXiv:2510.01171