
Benchmarks #4

Open · slavakurilyak opened this issue Jan 3, 2025 · 1 comment

@slavakurilyak

It would be great to see some benchmarks for MCP Reasoner.

I'm interested in comparing this to the Sequential Thinking MCP Server by @Skirano.

@frgmt0 (Contributor) commented Jan 6, 2025

I opened a PR to update the readme, where I did some loose benchmarking. The API is a bit expensive for me right now, so more thorough benchmarking may have to wait, but you can read up on my testing there. I'll paste a snippet of it below:

===

Beam Search

Beam search is pretty straightforward: it keeps only the most promising solution paths at each step as it goes. It works really well on problems with clear right answers, like math problems or certain types of puzzles.
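
To make that concrete, here's a minimal Python sketch of generic beam search over reasoning paths. This is just an illustration of the idea, not the reasoner's actual implementation; `expand`, `score`, the beam width, and the depth limit are all placeholder assumptions.

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    initial: str,
    expand: Callable[[str], List[str]],   # proposes next reasoning steps for a path (placeholder)
    score: Callable[[str], float],        # heuristic: higher = more promising path (placeholder)
    beam_width: int = 3,
    max_depth: int = 5,
) -> str:
    """Keep only the `beam_width` most promising partial solutions at each depth."""
    beam: List[Tuple[float, str]] = [(score(initial), initial)]
    for _ in range(max_depth):
        candidates: List[Tuple[float, str]] = []
        for _, path in beam:
            for nxt in expand(path):
                candidates.append((score(nxt), nxt))
        if not candidates:
            break
        # Prune everything except the top-scoring paths.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Return the best path found.
    return max(beam, key=lambda c: c[0])[1]
```

The key property is the pruning step: anything outside the top `beam_width` candidates is dropped, which is why it works best when the scoring heuristic reliably tracks "clear right answers."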

An interesting thing I found while testing: when I threw 50 puzzles from the ARC-AGI benchmark at it, it only scored 24%. It wasn't completely lost, but not great. Here's how I tested it:

  • First, I'd check whether Claude actually got the pattern from the examples. If it seemed confused, I'd nudge it in the right direction (but dock points, since that's not ideal).
  • Then for the actual test cases, I used this scoring system (a rough sketch of rolling per-puzzle scores into a percentage follows this list):
    • 5 points - nailed it
    • 4 points - reasoning was solid, but the instructions may not have been followed exactly
    • 3 points - kind of got the pattern but didn't quite nail it
    • 2 points - straight up failed
    • 1 point - at least the initial reasoning wasn't completely off
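
For illustration, here's one way per-puzzle rubric scores could roll up into an overall percentage (each puzzle scored out of 5). The snippet above doesn't spell out the exact aggregation behind the 24% figure, so treat this as an assumption about one reasonable approach, not the actual formula.

```python
def rubric_percentage(scores: list[int], max_points_per_puzzle: int = 5) -> float:
    """Convert per-puzzle rubric scores (1-5) into an overall percentage."""
    if not scores:
        return 0.0
    earned = sum(scores)
    possible = max_points_per_puzzle * len(scores)
    return 100.0 * earned / possible

# Hypothetical example: 50 puzzles with mixed results.
# rubric_percentage([5, 2, 1, 3, 4] * 10)  # -> 60.0
```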

Monte Carlo Tree Search

Now this is where it gets interesting. MCTS absolutely crushed it compared to beam search: 48% on a different set of 50 ARC puzzles. Maybe they were easier puzzles (this isn't an official benchmark or anything), but roughly doubling the score probably isn't just luck.

The cool thing about MCTS is how it explores different possibilities. Instead of committing to whatever looks best right away, it tries out different paths to see what might work better in the long run. Claude spent way more time understanding the examples before diving in, which probably helped a lot.
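
Here's a compact sketch of vanilla MCTS (UCT selection, expansion, rollout, backpropagation) to show what "trying out different paths" means mechanically. It's a generic illustration, not the reasoner's actual code; `expand`, `rollout_value`, the exploration constant, and the iteration count are stand-ins.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # accumulated rollout reward

def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state: str,
         expand: Callable[[str], List[str]],     # candidate next reasoning steps (placeholder)
         rollout_value: Callable[[str], float],  # simulate to the end, return reward in [0, 1] (placeholder)
         iterations: int = 200) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: walk down the tree by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. Expansion: add children for the leaf's possible next steps.
        for nxt in expand(node.state):
            node.children.append(Node(nxt, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: estimate how good this path is.
        reward = rollout_value(node.state)
        # 4. Backpropagation: push the result back up to the root.
        cursor: Optional[Node] = node
        while cursor is not None:
            cursor.visits += 1
            cursor.value += reward
            cursor = cursor.parent
    if not root.children:
        return root_state
    # Pick the most-visited child as the best next step.
    return max(root.children, key=lambda ch: ch.visits).state
```

The exploration term in `uct` is what keeps it from just following whatever looks best right away: paths that haven't been visited much still get tried, which matches the behavior described above.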

Why This Matters

Adding structured reasoning to Claude makes it noticeably better, which is no surprise. What's really interesting is how different search methods suit different types of problems.

Why did I test on puzzles instead of coding problems? Honestly, Claude has already proven itself on benchmarks like Polyglot and Codeforces. I wanted to see how it handled more abstract reasoning, the kind of thing that's harder to measure.
