In her presentation "Inference Scaling: A New Frontier for AI Capabilities" at Sutter Hill Ventures, Azalia Mirhoseini shared her team's research showing that giving AI models multiple attempts at a task, and carefully selecting the best result, can significantly improve performance. Here are my notes from her talk:
Improving Model Performance
- Pre-training and fine-tuning have been key focus areas for scaling language models.
- Traditional fine-tuning starts with next-token prediction on high-quality, specialized data.
- Reinforcement Learning from Human Feedback (RLHF) introduced human preferences into the process: people rate or rank model outputs, and those rankings are used to steer model behavior.
- Constitutional AI moves beyond collecting thousands of human labels to using ~10 human-written principles in a two-stage approach: first, models generate, critique, and revise their own outputs against these principles; then RLAIF (Reinforcement Learning from AI Feedback) uses model-generated preference labels in place of human ones (see the sketch after this list).
- This improves both harmlessness and helpfulness while reducing dependency on human data collection.
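To make the two-stage idea concrete, here is a minimal sketch of a Constitutional-AI-style critique-and-revise loop plus AI preference labeling. The `generate(prompt)` helper, the principles, and the prompt wording are placeholders I'm assuming for illustration, not the actual implementation described in the talk.

```python
# Sketch of a Constitutional-AI-style pipeline (illustrative, not the original implementation).
# Assumes a hypothetical generate(prompt: str) -> str helper that queries a base model.

PRINCIPLES = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def critique_and_revise(prompt: str, generate) -> tuple[str, str]:
    """Stage 1: generate a draft, critique it against each principle, then revise."""
    draft = generate(prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Point out any way the response violates the principle."
        )
        draft = generate(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, draft  # (prompt, revised) pairs become supervised fine-tuning data

def ai_preference_label(prompt: str, response_a: str, response_b: str, generate) -> str:
    """Stage 2 (RLAIF): the model itself labels which response better follows the principles."""
    verdict = generate(
        f"Principles: {PRINCIPLES}\nPrompt: {prompt}\n"
        f"A: {response_a}\nB: {response_b}\n"
        "Answer with the single letter of the better response."
    )
    return "A" if "A" in verdict.upper() else "B"
```

The point is the shape of the data flow: stage 1 turns a handful of written principles into revised responses for supervised fine-tuning, and stage 2 replaces human preference labels with model-generated ones.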
Inference Time Scaling
- The "Large Language Monkeys" project showed that repeated sampling (trying multiple times) during inference can significantly improve performance on complex tasks like math and coding
- Even smaller models showed major gains from increased sampling
- Performance improvements follow an exponential power law relationship
- Some correct solutions only appeared in <10 out of 10,000 attempts
- Key inference-time techniques that can be combined: repeated sampling (generating multiple attempts), fusion (synthesizing multiple responses into one), criticism and ranking of responses, and verification of outputs.
- Verification problems fall into two categories: automated (coding, formal math proofs) and manual (requires human judgment).
- Basic approaches like majority voting don't work well on their own; better verifiers are needed (a best-of-n sketch with a simple automated verifier follows this list).
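One common way to quantify the "rare success" point is the standard unbiased pass@k estimator: given n sampled attempts of which c were correct, it estimates the probability that at least one of k draws would succeed. The sample counts below are made up for illustration, not numbers from the talk.

```python
# Unbiased pass@k estimator, commonly used to measure how coverage grows with more samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n total attempts is correct,
    given that c of the n attempts were correct."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k draws, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a problem where only 8 of 10,000 attempts were correct. A single try
# almost never succeeds, but large sample budgets recover it.
for k in (1, 100, 1000, 10000):
    print(k, round(pass_at_k(n=10_000, c=8, k=k), 4))
```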
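To show how repeated sampling combines with automated verification (rather than plain majority voting), here is a minimal best-of-n sketch for a coding task. The `generate(prompt)` helper, the `solve` entry point, and the test format are assumptions for illustration, not any specific system from the talk.

```python
# Sketch of repeated sampling plus automated verification (best-of-n for code generation).
# generate() and the test cases are placeholders, not a specific system's API.

def verify(candidate_fn, tests) -> bool:
    """Automated verifier: run the candidate function against unit tests."""
    try:
        return all(candidate_fn(x) == expected for x, expected in tests)
    except Exception:
        return False

def best_of_n(prompt: str, generate, tests, n: int = 100):
    """Sample n candidate solutions and return the first that passes verification."""
    for _ in range(n):
        source = generate(prompt)            # one attempt from the model
        namespace = {}
        try:
            exec(source, namespace)          # turn the sampled source into callables
            candidate = namespace["solve"]   # assumed entry point name
        except Exception:
            continue                         # unparsable or crashing sample: discard
        if verify(candidate, tests):
            return source                    # verified solution found
    return None                              # no sample passed; raise n or improve the verifier
```

Unlike majority voting over final answers, this keeps only candidates that pass an explicit check, which is why the strength of the verifier ends up mattering as much as the sampling budget.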
Future Directions
- Need deeper investigation into whether parallel inference (many independent samples) or serial inference (iteratively refining one attempt) is more effective.
- As inference becomes a larger part of both training and deployment, high-throughput model serving infrastructure becomes increasingly critical.
- The line between inference and training is blurring, with inference results being fed back into training processes to improve model capabilities.
- Future models will need seamless self-improvement cycles that continuously enhance their capabilities, more like how humans learn through constant interaction and feedback than through discrete training periods.