Motivation and Context
Evaluating the quality of LLM-generated feedback for modeling exercises is challenging because natural language is variable and subjective, and traditional evaluation methods struggle to capture the nuances of how feedback is expressed. Human evaluation is possible but time-consuming, expensive, and inconsistent. This PR introduces the Eunomia integration test to address these challenges by providing an automated and consistent framework for evaluating LLM feedback generation. The test leverages predefined Structured Grading Instructions (SGIs) to reduce the evaluation to a comparison of instruction IDs, enabling scalable and objective assessment.
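Because each Structured Grading Instruction carries a stable ID, feedback quality can be scored by comparing sets of IDs instead of free-form text. The sketch below illustrates that idea only; the data layout and the function name `instruction_id_accuracy` are hypothetical and not the test's actual interface (see the README for the real implementation).

```python
# Minimal sketch of ID-based feedback evaluation. The data layout
# (dicts mapping submission IDs to lists of grading-instruction IDs)
# and the function name are illustrative assumptions, not the
# integration test's actual API.
from typing import Dict, List


def instruction_id_accuracy(
    reference: Dict[int, List[int]],
    generated: Dict[int, List[int]],
) -> float:
    """Fraction of reference grading-instruction IDs that the LLM
    feedback reproduced, averaged over all submissions."""
    scores = []
    for submission_id, expected_ids in reference.items():
        if not expected_ids:
            continue  # no reference instructions to compare against
        predicted_ids = set(generated.get(submission_id, []))
        hits = sum(1 for sgi_id in expected_ids if sgi_id in predicted_ids)
        scores.append(hits / len(expected_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Example: the LLM matched 2 of 3 expected instruction IDs on one
# submission and the single expected ID on another -> (2/3 + 1/1) / 2.
reference = {1: [101, 102, 103], 2: [201]}
generated = {1: [101, 103, 999], 2: [201]}
print(f"accuracy: {instruction_id_accuracy(reference, generated):.2f}")  # 0.83
```

Comparing IDs this way sidesteps text-similarity metrics entirely: two feedback items count as equivalent exactly when they reference the same grading instruction.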
Description
The implementation is outlined in detail in the README of the integration test.
Steps for Testing
Testserver States
Note
These badges show the current state of the test servers: green = available, red = locked.
Click a badge to open the corresponding test server.
Screenshots