Enhancing Reward Function for MCTS in Marco-o1 #13
Thank you for your attention. As you mentioned, and as we noted in our README, we have identified limitations in our current reward function, which is a significant constraint on the model's capabilities. From a test@k perspective, this has a considerable impact on final performance. We are currently training a reward model, and we believe that as the precision of the reward signal improves, the model's performance will improve as well.
Thank you for your detailed response and for acknowledging the limitations of the current reward function. It's great to know that you are already working on training a reward model to address this issue. Given the impact of the reward function on test@k performance, I believe that incorporating task-specific knowledge or logical rules into the reward evaluation could provide a significant boost. For example, introducing global consistency checks would help ensure that reasoning paths align with the task's goals. Looking forward to your insights and progress updates!
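To make the suggestion concrete, here is a minimal sketch of what blending the existing confidence score with a global consistency check could look like. All names here (`combined_reward`, `consistency_check`, `alpha`) are hypothetical illustrations, not part of the Marco-o1 codebase; `consistency_check` is assumed to be a user-supplied rule-based check or a small verifier model returning a score in [0, 1].

```python
# Hypothetical sketch: blend the token-level confidence score with a
# task-level consistency check. `consistency_check` is an assumed
# user-supplied callable (rule-based or a small verifier model).
def combined_reward(confidence_score: float,
                    reasoning_path: list[str],
                    task_goal: str,
                    consistency_check,
                    alpha: float = 0.7) -> float:
    """Weighted mix of token-level confidence and a global consistency score."""
    consistency_score = consistency_check(reasoning_path, task_goal)  # assumed in [0, 1]
    return alpha * confidence_score + (1.0 - alpha) * consistency_score
```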
Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it.
Thank you for the clarification! Your plan of starting with ORM + MCTS and using tree search results as unsupervised labels for PRM training sounds solid. Excited to see how this develops!
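For reference, one common way to turn tree-search results into unsupervised process labels is Monte Carlo estimation: each intermediate step is scored by the fraction of completed rollouts through it that end in a correct final answer. The sketch below illustrates this general idea only; the function and data layout are assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: derive step-level (PRM) labels from MCTS rollouts by
# Monte Carlo estimation. A step is scored by the fraction of completed
# rollouts passing through it whose final answer the ORM judges correct.
from collections import defaultdict

def prm_labels_from_rollouts(rollouts):
    """rollouts: list of (steps, is_correct) pairs, where `steps` is the
    sequence of reasoning steps in one rollout and `is_correct` is the
    ORM/outcome judgment for its final answer."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for steps, is_correct in rollouts:
        for i in range(len(steps)):
            prefix = tuple(steps[: i + 1])  # identify a step by its prefix
            totals[prefix] += 1
            hits[prefix] += int(is_correct)
    # Label each step prefix with its empirical success rate.
    return {prefix: hits[prefix] / totals[prefix] for prefix in totals}
```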
I'm curious how you use ORM for tasks that don't have a standard answer, such as the translation task mentioned in your technical report.
How do you combine ORM and MCTS? Generally, MCTS requires a process reward and a judgment at each step.
Perhaps the node scores in the Monte Carlo tree are simply calculated with UCB, as in existing work?
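For clarity, this is the standard UCB1 node-scoring rule used in many MCTS implementations; the `Node` attributes (`visits`, `total_reward`, `parent`, `children`) are illustrative assumptions, not the repo's actual data structures.

```python
import math

# Minimal UCB1 sketch for scoring MCTS nodes. `node` is assumed to expose
# `visits`, `total_reward`, `parent`, and `children` attributes.
def ucb_score(node, c: float = math.sqrt(2)) -> float:
    if node.visits == 0:
        return float("inf")  # always explore unvisited children first
    exploitation = node.total_reward / node.visits  # average rollout/ORM reward
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploitation + exploration

# Selection picks the child with the highest UCB score at each level of the tree.
def select_child(node):
    return max(node.children, key=ucb_score)
```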
Why choose Qwen2-7B-Instruct as the base model instead of Qwen2.5-7B-Instruct? I suspect it is because Qwen2-7B-Instruct showed improvements in your experiments while Qwen2.5 did not show significant gains. I have also compared these two base models before, and in my tests the base capabilities of Qwen2 were significantly better than those of Qwen2.5.
The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities (a rough sketch of this kind of scoring follows the list below). While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations:
Local Optimality: Token-level probabilities may lead to paths that seem promising locally but fail to achieve global correctness.
Model Bias: The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
Context Insensitivity: The reward function does not evaluate the logical consistency of the tokens in the broader context of the reasoning path.
Lack of Task-Specificity: The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
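To illustrate the confidence-based scoring described above, here is a rough sketch of one way such a reward can be computed: each generated token's probability is compared against its top-k alternatives, and the per-token confidences are averaged over the rollout. The function name, arguments, and exact normalization are assumptions for illustration, not the repository's actual implementation.

```python
import math

# Rough sketch of a token-level confidence reward: each chosen token's
# probability is normalized against its top-k alternatives, and the
# per-token confidences are averaged to score the whole reasoning path.
def confidence_reward(token_logprobs: list[float],
                      topk_alternative_logprobs: list[list[float]]) -> float:
    """token_logprobs[i]: log-prob of the chosen token at step i.
    topk_alternative_logprobs[i]: log-probs of the top-k candidates at step i."""
    confidences = []
    for lp, alt_lps in zip(token_logprobs, topk_alternative_logprobs):
        denom = sum(math.exp(a) for a in alt_lps)
        confidences.append(math.exp(lp) / denom if denom > 0 else 0.0)
    # Average confidence over the rollout serves as the path reward.
    return sum(confidences) / len(confidences) if confidences else 0.0
```

A reward like this is cheap to compute from the decoder's logits, which is exactly why it tends to inherit the local-optimality and bias issues listed above.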