Replies: 1 comment
Yes, we're looking at using a multi-agent framework approach to serve as a "user proxy".
---
Gist:
Imagine you have a large, powerful model capable of solving complex puzzles. Normally, a human might guide this model when it encounters challenges, offering subtle hints or guiding the model’s reasoning. Now, what if we shift that interaction "down the ranks"? Instead of a human guiding the large model, we have the large model guiding a smaller, weaker model through a problem it can't solve on its own. The large model offers guidance at key moments, and we use this interaction to create a dataset that captures the process of problem-solving. Later, we replace the guidance (hints) with inner reflections, creating a dataset that can teach models not just answers, but how to approach problems.
Overview
This method was inspired by David Shapiro's video, which discusses some interesting ideas about solving complex tasks. The question stuck with us: how could we automate this process of guiding models through challenges, without relying on humans?
In many tasks, a human guides a frontier model, offering hints or feedback when it stumbles on difficult parts of a problem. But we wondered: what if we replaced the human with another model? Instead of a person guiding the model, we shift down a level, letting a large "frontier" model take over the role of guiding a weaker one. This interaction, where the frontier model helps the weaker model solve the puzzle, becomes the foundation for a dataset designed to teach problem-solving skills.
Here's the process, step by step, as I envision it.
Step 1: Generating Complex Puzzles and Solutions
The first step involves using a frontier model (a large, powerful model) to generate complex puzzles that are difficult but solvable. These puzzles should be hard enough that a smaller model won’t be able to solve them in one shot. The frontier model also generates the solution to the puzzle.
For example, the frontier model might produce a multi-step logic puzzle, say a scheduling or river-crossing problem, together with a worked, step-by-step solution.
The goal is to create puzzles that are easy enough for the frontier model to generate and solve, but hard enough that a smaller model won’t solve them on its first try.
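As a rough sketch, Step 1 could be automated along these lines. The `frontier_complete` helper here is hypothetical, a stand-in for whatever frontier-model API you use, and the prompt and JSON schema are illustrative rather than prescriptive:

```python
import json

def frontier_complete(prompt: str) -> str:
    """Hypothetical stand-in for your frontier model's completion API."""
    raise NotImplementedError("wire this up to your model client")

PUZZLE_PROMPT = (
    "Invent a self-contained logic puzzle that a small (~7B) model is "
    "unlikely to solve in one attempt, then solve it yourself. "
    'Return JSON: {"puzzle": "...", "solution": "..."}'
)

def generate_puzzle() -> dict:
    """Ask the frontier model for a puzzle plus its reference solution."""
    item = json.loads(frontier_complete(PUZZLE_PROMPT))  # assumes valid JSON back
    assert {"puzzle", "solution"} <= item.keys(), "unexpected schema"
    return item
```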
Step 2: Shifting Down the Guidance
Now comes the shift in roles. Originally, we might think of the interaction as a human guiding a frontier model through a complex task. The human provides hints, offers corrections, and ultimately helps the model reach the solution. But instead of this human-guided process, we "shift down" the interaction.
Here’s what this looks like: the frontier model now becomes the guide, and the lesser model (e.g., a 7B model) becomes the subject that needs help solving the puzzle. The lesser model attempts the puzzle on its own and, by design, gets it wrong at first. The frontier model then steps in with a hint, and that guidance helps the lesser model improve its next answer.
The process of guidance and iteration continues until the lesser model eventually arrives at the correct solution. What we’re capturing here is the interaction between the models—one guiding the other—replacing the role that a human would traditionally have.
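One way the guidance loop might be wired up, again as a sketch: `weak_complete` and `frontier_complete` are the same kind of hypothetical model wrappers as above, and the yes/no grading prompt is just one simple way to check correctness.

```python
def solve_with_guidance(puzzle: str, solution: str,
                        weak_complete, frontier_complete,
                        max_rounds: int = 5) -> list[dict]:
    """Weak model attempts the puzzle; the frontier model checks each
    attempt and, if it is wrong, supplies a hint for the next round."""
    transcript = [{"role": "puzzle", "content": puzzle}]
    context = puzzle
    for _ in range(max_rounds):
        attempt = weak_complete(context)
        transcript.append({"role": "attempt", "content": attempt})
        # Frontier model grades the attempt against the reference solution.
        verdict = frontier_complete(
            f"Puzzle: {puzzle}\nReference solution: {solution}\n"
            f"Attempt: {attempt}\nIs the attempt correct? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            break  # solved; stop hinting
        hint = frontier_complete(
            f"Puzzle: {puzzle}\nReference solution: {solution}\n"
            f"Wrong attempt: {attempt}\n"
            "Give one subtle hint that nudges the solver forward "
            "without revealing the answer."
        )
        transcript.append({"role": "hint", "content": hint})
        context += f"\n\nHint: {hint}\nTry again."
    return transcript
```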
Step 3: Replacing Guidance with Inner Reflections
Now that the puzzle has been solved using hints, we take the process one step further. We go back through the interaction and replace the hints with inner reflections. These reflections simulate the thought process of the model as it moves from one step to the next, guiding itself through the problem.
For each hint, we generate a reflection that represents the model’s internal reasoning, as though it were pausing to think before its next attempt. The hint turns are then swapped out for these reflections, so the full conversation reads as a single model working through the problem on its own.
These inner reflections are generated by the frontier model and replace the explicit hints that were guiding the lesser model, creating a more introspective narrative of problem-solving.
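A sketch of how that swap could be implemented, built on the same hypothetical `frontier_complete` wrapper and the transcript format from the loop above:

```python
def hints_to_reflections(transcript: list[dict], frontier_complete) -> list[dict]:
    """Rewrite every hint turn as a first-person inner reflection, so the
    conversation reads as the model guiding itself."""
    rewritten = []
    for turn in transcript:
        if turn["role"] != "hint":
            rewritten.append(turn)  # keep puzzle and attempt turns as-is
            continue
        reflection = frontier_complete(
            "Rewrite the following hint as a short first-person thought, "
            "as if the solver paused to reflect before its next attempt. "
            f"Do not mention receiving a hint.\nHint: {turn['content']}"
        )
        rewritten.append({"role": "reflection", "content": reflection})
    return rewritten
```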
Step 4: Building a Dataset of Problem-Solving Conversations
Once we have the full conversation—starting with the puzzle, then the initial attempts, the reflections, and finally the correct answer—we package this into a dataset. The dataset captures the entire process of problem-solving rather than just the answer itself.
Here’s what a full entry in the dataset might look like; the field names below are illustrative, one possible schema rather than a fixed format:
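```python
# An illustrative entry; the field names are assumptions, not a fixed schema.
entry = {
    "puzzle": "...",                                # the generated puzzle
    "turns": [
        {"role": "attempt",    "content": "..."},  # first (wrong) try
        {"role": "reflection", "content": "..."},  # rewritten hint
        {"role": "attempt",    "content": "..."},  # improved try
    ],
    "final_answer": "...",                          # matches the reference solution
}
```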
The key idea here is that we’re not just teaching models what the answer is; we’re teaching them how to approach and refine their problem-solving process. The inner reflections offer a window into how a model might guide itself to better solutions, similar to the process demonstrated in David Shapiro's video.
Closing Thoughts
This approach came about after watching David Shapiro's video, and it’s built on the idea of shifting the interaction down from human-guided to model-guided. By having a frontier model guide a weaker model, we can fully automate the generation of these guided problem-solving traces. The dataset that emerges isn’t just about the solutions; it’s about the journey of reflecting, iterating, and improving.
I'm still experimenting with this idea, and I’d love to hear from others who might have thoughts or ideas. Have you tried something similar? Do you think this approach could be useful in other contexts? Feel free to comment below or submit a pull request to join the conversation.