Integrate Level 1 Assertions Into Evaluator #261

Closed
jlewi opened this issue Sep 26, 2024 · 0 comments · Fixed by #300
jlewi commented Sep 26, 2024

#253 disabled/removed the code for running level 1 assertions.
We should reintegrate running level 1 assertions into the evaluator and store the results in our SQLite database.

jlewi added a commit that referenced this issue Sep 27, 2024
# Use Simulation For Evaluation

This PR completely overhauls how we do evaluation, as outlined in
[TN011 Eval Data](https://foyle.io/docs/tech-notes/tn011_eval_data/).

One of the major pain points in our approach to evaluation has been
building up a sufficiently large dataset for evaluation. This PR solves
this problem by using examples generated from sessions produced by
actual usage. This ensures that the more we use Foyle, the more data we
have available for evaluation.

Another challenge for evaluation has been deciding which set of learned
examples to use during evaluation. Using actual sessions solves this
problem because sessions are ordered in time. During evaluation we start
out with no learned examples and then replay the sessions in the same
order they occurred. Foyle can then learn from those sessions, using its
learning process to improve accuracy on subsequent examples.

# Making the Evaluator a Simulator

To achieve this, we rework the Evaluator to act like a simulator: it
simulates what a user would do, using the sessions as examples of intent
and actions.

We refactor the Evaluator to follow the pattern we first used in the
AssertJob of having the experiment driver (the evaluator) interact with
the Agent via RPC. This makes it easy to set up and configure an
independent instance of the Agent with parameters suitable for the
experiment.
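
To make the replay loop concrete, here is a minimal Go sketch of an experiment driver talking to an Agent over an RPC-style interface. The `Session` and `AgentClient` types and the `Generate`/`Learn` methods are hypothetical stand-ins, not Foyle's actual API; scoring with LLM-as-judge and persisting results are elided.

```go
// Package simulator sketches the evaluator-as-simulator loop.
package simulator

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// Session is a simplified stand-in for a recorded user session.
type Session struct {
	StartTime time.Time
	Intent    string // what the user asked for
	Actions   string // what the user actually executed
}

// AgentClient abstracts the RPC connection to an independently configured Agent.
type AgentClient interface {
	Generate(ctx context.Context, intent string) (string, error)
	Learn(ctx context.Context, s Session) error
}

// RunSimulation replays sessions in chronological order. The Agent starts
// with no learned examples and learns from each session only after it has
// been evaluated on it.
func RunSimulation(ctx context.Context, agent AgentClient, sessions []Session) error {
	sort.Slice(sessions, func(i, j int) bool {
		return sessions[i].StartTime.Before(sessions[j].StartTime)
	})

	for _, s := range sessions {
		got, err := agent.Generate(ctx, s.Intent)
		if err != nil {
			return fmt.Errorf("generate failed: %w", err)
		}
		// Scoring and persisting the result are omitted in this sketch.
		fmt.Printf("expected %q, got %q\n", s.Actions, got)

		// Feed the session back so the Agent can learn before the next example.
		if err := agent.Learn(ctx, s); err != nil {
			return fmt.Errorf("learn failed: %w", err)
		}
	}
	return nil
}
```

The key property is that learning happens only after each session is scored, so the Agent never sees an example before it is evaluated on it.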

# Use SQLite for storing the results

We rewrite the evaluator to use SQLite to store the evaluation results
rather than using pebble. This gives much better querying capabilities
for exploring the evaluation results.

We store the EvalResult proto as JSON rather than in binary format so
that we can use SQLite's JSON functions to query the data.
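
As a rough illustration of why JSON helps, here is a hedged Go sketch that filters rows with SQLite's `json_extract`. The `results` table, `proto_json` column, and `$.error` field are assumptions made for illustration, as is the choice of the `modernc.org/sqlite` driver; the actual schema and driver may differ.

```go
// Package evalstore sketches querying JSON-encoded EvalResults in SQLite.
package evalstore

import (
	"database/sql"
	"fmt"

	_ "modernc.org/sqlite" // pure-Go SQLite driver; registers the "sqlite" name
)

// Open opens (or creates) the results database at path.
func Open(path string) (*sql.DB, error) {
	return sql.Open("sqlite", path)
}

// CountErrors uses json_extract to filter on a field inside the JSON-encoded
// EvalResult without deserializing every row in Go. Table, column, and field
// names here are placeholders.
func CountErrors(db *sql.DB) (int, error) {
	const q = `
		SELECT COUNT(*)
		FROM results
		WHERE json_extract(proto_json, '$.error') IS NOT NULL`
	var n int
	if err := db.QueryRow(q).Scan(&n); err != nil {
		return 0, fmt.Errorf("querying results: %w", err)
	}
	return n, nil
}
```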

# Level 1 Evals

This PR deletes the Assertor code because it is rendered out of date by
all the changes. In a subsequent PR we should integrate the level 1
assertions into the evaluator.

Tracked in #261

# Code Cleanup

Delete the code for computing the distance between expected and actual
programs. We have switched to LLM-as-judge. The distance metric is
likely no longer useful because generated code is often a multi-line
mini program that the metric couldn't handle.

Delete the data/eval directory. It contained handcrafted evaluation
examples expressed as markdown files. With this PR we are making two
changes:
1. Store EvalExamples as protos to allow richer data representations
2. Produce evaluation datasets from logs and actual usage

Fix #140
jlewi changed the title from "Integrate Level Assertions Into Evaluator" to "Integrate Level 1 Assertions Into Evaluator" on Sep 28, 2024
jlewi added a commit that referenced this issue Oct 15, 2024
# Experiment Report

After running an evaluation experiment, we compute a report that
contains the key metrics we want to track. To start with, these are:
* Number of cell match results
* Number of errors and examples
* Generate latency, measured as percentiles (see the sketch after this list)
* Level 1 assertion stats
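
As a sketch of the latency line, the helper below computes nearest-rank percentiles over per-example Generate latencies in Go. This is illustrative only; the actual report may compute percentiles differently, for example directly in SQL.

```go
// Package report sketches the percentile computation for the latency metrics.
package report

import (
	"math"
	"sort"
	"time"
)

// Percentile returns the p-th percentile (0 < p <= 100) of latencies using
// the nearest-rank method: the value at rank ceil(p/100 * N) in sorted order.
func Percentile(latencies []time.Duration, p float64) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	rank := int(math.Ceil(float64(len(sorted)) * p / 100.0))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Usage: p50 := Percentile(latencies, 50); p95 := Percentile(latencies, 95)
```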

# Level 1 Assertion stats

* Add a level 1 assertion to test whether the document sent to the Agent is empty.
* I believe I observed this start happening when #285 included a fix for
outputs not being included (#286).
* I think the problem is that cell outputs can be very long and can end
up consuming all of the available context window.

# Reintegrate Level 1 Assertions Into Evaluation
* Fix #261 
* We start computing level 1 assertions at runtime so that they are
available in production and in evaluation
* Level 1 assertions are computed and then logged (a sketch of such an
assertion follows this list)
* Our Analyzer pipeline reads the assertions from the logs and adds them
to the trace
* Our evaluation report accumulates assertion statistics and reports
them
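
To make the level 1 idea concrete, here is a hedged Go sketch of a deterministic (LLM-free) assertion along the lines of the empty-document check described above. The `EvalExample`, `AssertionResult`, and `NonEmptyDocument` names are illustrative placeholders; Foyle's real assertions and result types live in its own packages and protos.

```go
// Package assertions sketches a level 1 (deterministic) assertion.
package assertions

// EvalExample is a simplified stand-in for the request/response pair an
// assertion inspects.
type EvalExample struct {
	DocumentCells  []string // cells of the document sent to the Agent
	GeneratedCells []string // cells the Agent produced
}

// AssertionResult records whether a single deterministic check passed.
type AssertionResult struct {
	Name   string
	Passed bool
	Detail string
}

// NonEmptyDocument fails if the document sent to the Agent is empty, e.g.
// because long cell outputs consumed the entire context window.
func NonEmptyDocument(ex EvalExample) AssertionResult {
	if len(ex.DocumentCells) == 0 {
		return AssertionResult{
			Name:   "non_empty_document",
			Passed: false,
			Detail: "request document contained no cells",
		}
	}
	return AssertionResult{Name: "non_empty_document", Passed: true}
}
```

Because a check like this is deterministic and cheap, it can run at runtime on every request and again during evaluation, which is what lets the report accumulate assertion statistics across both settings.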