Complete MVP #48

daveshap · 2024-09-21T18:38:37Z

MVP Deliverables

One finetuning file JSONL with at least 500 samples of CoT reasoning
Preferably completed by Oct 1, 2024

Milestones

Synthesize data (various approaches)
Validate data (TBD)
Clean up/format data (final format to be JSONL finetuning format)
Test against SIMPLE bench

Data Synthesis

We've discussed several potential approaches, generally revolving around either an agentic workflow and/or multi-model approach

Agentic workflow with multi-step reasoning
"Peer review" agents for critique/reflection
Reverse-engineering (start with answer first and work backwards)

Data Validation

Numerous potential approaches discussed. Generally, we agreed that solving validation may be just as hard (if not harder) than data synthesis. We did agree that this is critical as it could be infinitely scalable (e.g. we will only use manual validation to prove out the process, but all final data must be 100% automated)

Model Choice(s)

Various approaches recommended/discussed. We can use free/open source/small models to prove out the process. The only hard requirement is that we will not use o1-mini or o1-preview for any part of data synthesis. We agreed that we must validate the approach using previous generation models. This is for our own rigor as well as to prove the methodology (e.g. using last-gen models to approximate next-gen performance)

daveshap · 2024-09-28T12:31:22Z

Approach updates

After initial brainstorming and work by the team, I believe we need to focus on the methodology/approach before focusing on data. For instance, duplicating the success here should be a top priority: #10

Recommended methodology

Start with a battery of tough logic and reasoning puzzles that are self contained to prove out the reasoning methodology. So far, I've found that the following three Latent Space Activations tend to work best:

Chain of Thought Reasoning
Systems 2 Thinking
Let's Verify Step by Step

In other words telling the model to use these methods tends to work best, right out of the gate. It still takes some turns.

User proxy agent

We need a "user proxy" agent that provides critical feedback and has a fundamentally different role from the expert. This is the EXPERT/CRITIQUE model of agents. We may need an even more sophisticated agent framework with 3 agents, or an out of band process (perhaps a verification loop?)

Criteria

I believe that starting with self-contained logic puzzles is the best approach because they are

Provable
Require logic, math, reasoning, and planning
Also require lateral thinking
Require no external resources

Brainstorming with Claude

User

I am working on a project where I need a bunch of brain teasers that are difficult to solve, requiring reasoning, math, logic, etc, but require no outside resources. Example:

write a 10 word coherent sentence where each word has exactly 1 letter more than its previous word starting with first word having one letter and so on till tenth word having 10 letters

Can you help me brainstorm more? It may be helpful to talk through this problem, as lateral thinking brain teasers are very old. Let's use a few techniques:

Chain of Thought Reasoning
Let's Verify Step By Step

Claude

Certainly! I'd be happy to help you brainstorm more brain teasers using the techniques you mentioned. Let's go through this process step by step.

Chain of Thought Reasoning:

1. Let's consider different types of puzzles that require reasoning, math, and logic.
5. We want puzzles that can be solved without external resources.
6. The puzzles should be challenging but not impossible.
7. We can draw inspiration from classic puzzles and modify them.
8. We can also create new puzzles by combining different concepts.

Let's Verify Step By Step:

1. Number sequences: Create a sequence with a hidden pattern.
Example: What's the next number in the sequence: 1, 11, 21, 1211, 111221, ...?
Verification: This is the "look and say" sequence. Each number describes the previous number.

2. Word puzzles: Create puzzles involving letter manipulation or word patterns.
Example: What 8-letter word can have a letter taken away and still form a word, take another letter away and still form a word, and take a third letter away and still form a word?
Verification: The word is "starting". starting → staring → string → sting

3. Logic puzzles: Create scenarios that require deductive reasoning.
Example: In a room are two men. One always lies, and one always tells the truth. You don't know which is which. You can ask one question to one of them to determine which door leads to freedom. What question do you ask?
Verification: Ask either man: "If I asked the other man which door leads to freedom, what would he say?" Then choose the opposite door.

4. Mathematical puzzles: Create problems that require creative thinking with numbers.
Example: Using only addition, how can you add eight 8's and get the number 1000?
Verification: 888 + 88 + 8 + 8 + 8 = 1000

5. Lateral thinking puzzles: Create scenarios that require "thinking outside the box".
Example: A man pushes his car to a hotel and tells the owner he's bankrupt. Why?
Verification: He's playing Monopoly.

These examples demonstrate different types of brain teasers that meet your criteria. We can continue to generate more by applying similar principles to new concepts or by combining different puzzle types.

User

Please brainstorm 10 puzzles, 2 per category

Claude

Certainly! Here are 10 puzzles, 2 for each category:

Number Sequences:

1. What's the next number in this sequence: 2, 3, 5, 9, 17, 33, ...?
Verification: The next number is 65. Each number is doubled and then decreased by 1.

5. Find the missing number: 1, 4, 9, 16, 25, 36, ?, 64, 81, 100
Verification: The missing number is 49. These are perfect squares (1², 2², 3², ..., 10²).

Word Puzzles:

9. What common English word becomes shorter when you add two letters to it?
Verification: The word is "short". Adding "er" makes it "shorter".

10. I am a word of letters three. Add two and fewer there will be. What am I?
Verification: The word is "few". Adding "er" makes it "fewer".

Logic Puzzles:

11. You have 12 balls, all identical except one which is either heavier or lighter. Using a balance scale only 3 times, how can you identify the odd ball and determine if it's heavier or lighter?
Verification: This requires a multi-step process of dividing the balls and comparing groups.

12. A man is looking at a photograph of someone. His friend asks who it is. The man replies, "Brothers and sisters, I have none. But that man's father is my father's son." Who was in the photograph?
Verification: The man is looking at a photograph of his own son.

Mathematical Puzzles:

13. How can you add 8 eights to get the number 1000?
Verification: 888 + 88 + 8 + 8 + 8 = 1000

14. Using only the number 4 (four times) and basic math operations, how can you make 17?
Verification: (4 + 4) + (4 + 4/4) = 17

Lateral Thinking Puzzles:

15. A woman shoots her husband, then holds him underwater for five minutes. Next, she hangs him. But five minutes later they both go out and enjoy a wonderful dinner together. How can this be?
Verification: She took a picture of him, developed it in a darkroom, and hung it up to dry.

16. Two boxers are in a match scheduled for 12 rounds. One knocks out the other after only six rounds, yet no man throws a punch. How is this possible?
Verification: They're female boxers.

These puzzles cover a range of difficulty levels and require different types of thinking to solve.

Repository owner deleted a comment from ori1337 Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Complete MVP #48

Complete MVP #48

daveshap commented Sep 21, 2024 •

edited

Loading

daveshap commented Sep 28, 2024

Complete MVP #48

Complete MVP #48

Comments

daveshap commented Sep 21, 2024 • edited Loading

MVP Deliverables

Milestones

Data Synthesis

Data Validation

Model Choice(s)

daveshap commented Sep 28, 2024

Approach updates

Recommended methodology

User proxy agent

Criteria

Brainstorming with Claude

User

Claude

User

Claude

daveshap commented Sep 21, 2024 •

edited

Loading