
July 22, 2025

What happens when GPT-4 starts grading other AI models? UltraFeedback is what happens—and it’s changing the game for how language models learn to align with human preferences.


Why UltraFeedback Even Matters

Building smart language models isn’t the hard part anymore. It’s getting them to behave—answer the way we want, follow instructions without hallucinating, and make judgments that feel human. That’s the real challenge.

Traditionally, teams throw humans at the problem. Annotators compare responses from models and pick the better one. It works—but only at small scale. It’s slow, expensive, and, let’s be honest, humans aren’t always consistent.

So instead of relying on people for every decision, the UltraFeedback project did something bold: use GPT-4 to generate the feedback.

Not just basic “good/bad” judgments. GPT-4 was asked to write out why a model’s answer was better or worse—and to score them on multiple dimensions like helpfulness, honesty, and truthfulness. Basically, GPT-4 became the teacher.

How They Built It

Here’s how the whole pipeline works.

They started with 64,000 prompts. These weren’t random internet junk. They pulled from reliable sources like UltraChat, Evol-Instruct, TruthfulQA, and even user-generated conversations from ShareGPT. A good mix of real-world and challenging stuff.

Then, for each prompt, they collected responses from four different models sampled from a larger pool. Some were open-source, like LLaMA or Falcon; others were proprietary, like GPT and Bard. This gave them over 250,000 responses to grade.

Now here’s the twist. GPT-4 looked at the responses and didn’t just pick the winner. It gave each response scores across four axes: how well it followed instructions, whether it was truthful, whether it was honest (yes, different thing), and how helpful it was. And it wrote out its reasoning for each one.

That added up to over a million feedback entries. From a quality standpoint, this dataset blows past most RLHF (Reinforcement Learning from Human Feedback) sets. Instead of sparse, binary comparisons, UltraFeedback gives fine-grained, multi-aspect evaluations—basically the full report card, not just a thumbs-up.
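In sketch form, one record of that pipeline might look like the Python below. The `annotate` callable here is a hypothetical stand-in for the GPT-4 API call, and the toy scorer at the bottom exists only so the example runs; the four aspect names follow the axes described above.

```python
import json
from dataclasses import dataclass

ASPECTS = ["instruction_following", "truthfulness", "honesty", "helpfulness"]

@dataclass
class Feedback:
    prompt: str
    response: str
    model: str
    scores: dict      # aspect -> numeric rating from the annotator
    rationales: dict  # aspect -> written justification

def build_record(prompt, response, model, annotate):
    # `annotate` stands in for the GPT-4 call: for one aspect, return
    # a score plus the free-text reasoning behind it.
    scores, rationales = {}, {}
    for aspect in ASPECTS:
        score, why = annotate(prompt, response, aspect)
        scores[aspect] = score
        rationales[aspect] = why
    return Feedback(prompt, response, model, scores, rationales)

# toy annotator for illustration only: longer answers score higher
def toy_annotate(prompt, response, aspect):
    return min(10, 1 + len(response) // 20), f"{aspect}: placeholder rationale"

rec = build_record(
    "Explain RLHF briefly.",
    "RLHF fine-tunes a model against a learned reward.",
    "llama-7b",
    toy_annotate,
)
print(json.dumps(rec.scores))
```

Multiply one record like this by four responses per prompt, four aspects per response, and 64,000 prompts, and you get the million-plus feedback entries.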

Why It Works So Well

There are a few reasons this works better than you might expect.

First, GPT-4 is shockingly good at playing critic. When prompted correctly, it doesn't just assign a number; it justifies it with nuanced reasoning. That's a big upgrade over human annotators, who get fatigued and drift over long sessions. GPT-4 applies the same rubric to example ten thousand as it does to example one.

Second, it’s scalable. Humans can only annotate so many examples in a day; GPT-4, run in parallel, can churn out thousands of judgments in the same time. That isn’t a minor efficiency boost, it’s orders of magnitude.

Third, the diversity is built in. The prompts span all kinds of topics and tones, and the candidate responses come from a wide spread of models. You don’t end up with a dataset that overfits to one kind of conversation.

What It’s Being Used For

UltraFeedback isn’t just a shiny dataset sitting on GitHub. It’s already powering some impressive systems.

They built UltraRM, a reward model trained on UltraFeedback data using a LLaMA2-13B base. This model basically predicts which of two responses is better based on all the multi-aspect feedback. On evaluation tasks like AlpacaEval, UltraRM aligns with GPT-4’s own rankings better than most open models.
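Reward models like this are typically trained with a Bradley-Terry-style pairwise objective: push the scalar score of the preferred response above the rejected one. A minimal sketch of that loss (not UltraRM's actual training code) in PyTorch:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize the log-probability that the
    # chosen response outranks the rejected one, i.e. sigmoid(r_c - r_r).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy scores: the reward model already rates chosen responses higher,
# so the loss is small (well below -log(0.5) ~= 0.693)
chosen = torch.tensor([2.0, 1.5])
rejected = torch.tensor([0.5, -0.3])
loss = pairwise_rm_loss(chosen, rejected)
```

In practice `r_chosen` and `r_rejected` would come from a scalar head on the LLaMA2-13B backbone, and "chosen" vs. "rejected" would be derived from the multi-aspect GPT-4 scores.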

Then there's UltraLM, a full-on chatbot fine-tuned with this reward model using PPO (a reinforcement learning algorithm). It doesn’t just look good on paper—it outperforms many open-source models in head-to-head tests.
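The heart of PPO is its clipped surrogate objective, which keeps each update from pulling the fine-tuned policy too far from the policy that generated the samples. A minimal sketch of just that loss (not the UltraLM training code, which involves a full rollout-and-reward loop):

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the sampling policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic minimum,
    # so a single batch can't move the policy by more than ~20%.
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In the RLHF setting, the advantages are derived from the reward model's scores on sampled responses; the clipping is what makes the procedure stable enough to run on a chatbot-scale policy.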

They even built UltraCM, a critique model that can generate detailed feedback text the same way GPT-4 does in the dataset. Think of it like a self-critic module for language models.

All of this is open-source. Code, models, data—up on GitHub and Hugging Face under MIT license. That’s a big deal because a lot of high-performing models are locked down or limited to research use.

Real Results, Not Just Hype

What’s interesting is how data-efficient the dataset is. In a follow-up paper, researchers fine-tuned a model on just 10% of UltraFeedback—and still got performance gains of 3–8% across multiple tasks. That’s rare. Usually you need the whole dataset to move the needle.

It’s also being used directly as preference data in methods like DPO (Direct Preference Optimization), which skips the explicit reward model and learns straight from preference pairs. That approach lives or dies on clean, structured comparisons, and UltraFeedback gives it a gold mine.
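DPO's trick is that the reward is implicit: the beta-scaled log-probability ratio between the fine-tuned policy and a frozen reference model. A minimal sketch of the loss, assuming the per-token log-probs of each response have already been summed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit rewards: how much the policy has raised (or lowered) each
    # response's log-probability relative to the frozen reference model.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Same Bradley-Terry form as a reward model, but no reward model needed.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy log-probs: the policy has shifted mass toward the chosen response
loss = dpo_loss(
    pi_chosen=torch.tensor([-10.0]), pi_rejected=torch.tensor([-15.0]),
    ref_chosen=torch.tensor([-12.0]), ref_rejected=torch.tensor([-12.0]),
)
```

A preference pair from UltraFeedback (best-scored response vs. a lower-scored one for the same prompt) plugs into this directly, which is exactly why the dataset became a standard input for DPO-style training.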

And it’s starting to show up in reward model evaluation benchmarks too. When people want to test how good a reward model is at mimicking human preference, they increasingly use UltraFeedback-trained systems as baselines.

Is It Perfect? Of Course Not.

Letting GPT-4 grade other models sounds risky. What if GPT-4 has its own blind spots? What if it introduces bias that’s hard to detect?

That’s a fair concern. But the UltraFeedback team did a bunch of work to calibrate GPT-4’s scores—like balancing prompt phrasing, comparing outputs across different orderings, and analyzing consistency across axes.

Still, this is AI supervising AI. There’s always the chance that the feedback isn’t quite human-aligned in the edge cases. That’s why a lot of teams are pairing AI-generated feedback with periodic human evaluations as a sanity check.

Where This Is Headed

The real innovation here isn’t just the dataset. It’s the idea that we can scale model alignment using models themselves. Not blindly, but with structure, controls, and transparency.

Imagine a future where models are constantly generating, critiquing, and improving on their own work—guided by AI-generated feedback loops. UltraFeedback is an early glimpse of that future.

Open models trained on this kind of data are starting to catch up to proprietary giants. Not because they have more data, but because the feedback is sharper and more targeted.

This changes the playing field. You no longer need a massive human annotation team to build a strong aligned model. You need the right prompts, a good annotator model, and a feedback pipeline that scales.

Final Thought

UltraFeedback isn’t just another benchmark. It’s a shift in how language models learn from preferences. AI, it turns out, can teach AI. And in some cases—maybe even most cases—it can do it better, faster, and more reliably than humans.

That's not a maybe. That’s already happening.