
Benchmarks #4

Open · slavakurilyak opened this issue Jan 3, 2025 · 1 comment

@slavakurilyak

It would be great to see some benchmarks for MCP Reasoner.

I'm interested in comparing this to the Sequential Thinking MCP Server by @Skirano.

@frgmt0 (Contributor) commented Jan 6, 2025

I opened a PR to update the readme, where I did some loose benchmarking. The API is a bit expensive for me right now, so more thorough benchmarking may have to wait, but you can read up on my testing there. I'll paste a snippet of it below:

===

Beam Search

Beam search is pretty straightforward: it keeps only the most promising solution paths at each step as it goes. It works really well on problems with clear right answers, like math problems or certain types of puzzles.
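
To make that concrete, here's a minimal Python sketch of generic beam search over reasoning paths. This is just an illustration of the idea, not the reasoner's actual implementation; `expand`, `score`, the beam width, and the depth limit are all placeholder assumptions.

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    initial: str,
    expand: Callable[[str], List[str]],   # proposes next reasoning steps for a path (placeholder)
    score: Callable[[str], float],        # heuristic: higher = more promising path (placeholder)
    beam_width: int = 3,
    max_depth: int = 5,
) -> str:
    """Keep only the `beam_width` most promising partial solutions at each depth."""
    beam: List[Tuple[float, str]] = [(score(initial), initial)]
    for _ in range(max_depth):
        candidates: List[Tuple[float, str]] = []
        for _, path in beam:
            for nxt in expand(path):
                candidates.append((score(nxt), nxt))
        if not candidates:
            break
        # Prune everything except the top-scoring paths.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Return the best path found.
    return max(beam, key=lambda c: c[0])[1]
```

The key property is the pruning step: anything outside the top `beam_width` candidates is dropped, which is why it works best when the scoring heuristic reliably tracks "clear right answers."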

An interesting thing I found while testing: when I threw 50 puzzles from the ARC-AGI benchmark at it, it only scored 24%. It wasn't completely lost, but not great. Here's how I tested it:

  • First, I'd check whether Claude actually got the pattern from the examples. If it seemed confused, I'd nudge it in the right direction (but dock points, since that's not ideal).
  • Then for the actual test cases, I used this scoring system (a rough sketch of rolling per-puzzle scores into a percentage follows this list):
    • 5 points - nailed it
    • 4 points - reasoning was solid, but the instructions may not have been followed exactly
    • 3 points - kind of got the pattern but didn't quite nail it
    • 2 points - straight up failed
    • 1 point - at least the initial reasoning wasn't completely off
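
For illustration, here's one way per-puzzle rubric scores could roll up into an overall percentage (each puzzle scored out of 5). The snippet above doesn't spell out the exact aggregation behind the 24% figure, so treat this as an assumption about one reasonable approach, not the actual formula.

```python
def rubric_percentage(scores: list[int], max_points_per_puzzle: int = 5) -> float:
    """Convert per-puzzle rubric scores (1-5) into an overall percentage."""
    if not scores:
        return 0.0
    earned = sum(scores)
    possible = max_points_per_puzzle * len(scores)
    return 100.0 * earned / possible

# Hypothetical example: 50 puzzles with mixed results.
# rubric_percentage([5, 2, 1, 3, 4] * 10)  # -> 60.0
```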

Monte Carlo Tree Search

Now this is where it gets interesting. MCTS absolutely crushed it compared to beam search: 48% on a different set of 50 ARC puzzles. Maybe they were easier puzzles (this isn't an official benchmark or anything), but roughly doubling the score probably isn't just luck.

The cool thing about MCTS is how it explores different possibilities. Instead of committing to whatever looks best right away, it tries out different paths to see what might work better in the long run. Claude spent way more time understanding the examples before diving in, which probably helped a lot.
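
Here's a compact sketch of vanilla MCTS (UCT selection, expansion, rollout, backpropagation) to show what "trying out different paths" means mechanically. It's a generic illustration, not the reasoner's actual code; `expand`, `rollout_value`, the exploration constant, and the iteration count are stand-ins.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # accumulated rollout reward

def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state: str,
         expand: Callable[[str], List[str]],     # candidate next reasoning steps (placeholder)
         rollout_value: Callable[[str], float],  # simulate to the end, return reward in [0, 1] (placeholder)
         iterations: int = 200) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: walk down the tree by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. Expansion: add children for the leaf's possible next steps.
        for nxt in expand(node.state):
            node.children.append(Node(nxt, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: estimate how good this path is.
        reward = rollout_value(node.state)
        # 4. Backpropagation: push the result back up to the root.
        cursor: Optional[Node] = node
        while cursor is not None:
            cursor.visits += 1
            cursor.value += reward
            cursor = cursor.parent
    if not root.children:
        return root_state
    # Pick the most-visited child as the best next step.
    return max(root.children, key=lambda ch: ch.visits).state
```

The exploration term in `uct` is what keeps it from just following whatever looks best right away: paths that haven't been visited much still get tried, which matches the behavior described above.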

Why This Matters

Adding structured reasoning to Claude makes it noticeably better, which is no surprise. What's really interesting is how different search methods suit different types of problems.

Why did I test on puzzles instead of coding problems? Honestly, Claude has already proven itself on benchmarks like Polyglot and Codeforces. I wanted to see how it handled more abstract reasoning, the kind of thing that's harder to measure.
