
Automatically generate draft answers for student questions #5331

Open
bmesuere opened this issue Jan 30, 2024 · 3 comments
Labels
low priority Thing we want to see implemented at some point

Comments

@bmesuere (Member)

With the increasing capabilities of LLMs, it is only a matter of time before they become powerful and cheap enough to use inside Dodona. A first step might be to generate draft answers for questions from students. Here's how it might work:

  • A student asks a question about a line of code.
  • This triggers a job that generates a draft answer based on the student's code and their question (do we also need to add the problem description?).
  • When a TA is ready to answer the question, the draft is pre-loaded and clearly labeled as an AI-generated draft.
  • The TA then has the option to approve, modify, or discard the draft response.

This approach minimizes risk since each AI-generated answer undergoes human review and editing. Moreover, it's not time-sensitive. If the AI draft is inadequate or fails, the situation remains as it is currently. However, the potential time savings could be substantial.
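
A minimal sketch of what that job could look like, assuming hypothetical `generateDraft` and `saveDraftAnswer` helpers and field names that do not exist in Dodona today:

async function onQuestionCreated(question) {
  // Build the prompt input from the student's code and their question
  // (and possibly the problem description, see the open question above).
  const draft = await generateDraft({
    code: question.submission.code,
    lineNr: question.lineNr,
    questionText: question.text,
  });

  // Store the draft next to the question, flagged as AI-generated, so the TA
  // interface can pre-load it for approval, editing, or discarding.
  await saveDraftAnswer(question.id, { text: draft, aiGenerated: true });
}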


Since this would be our first LLM integration, it will involve some research work:

  • Effectiveness Assessment: We must evaluate the quality of the drafts. This involves preserving every AI-generated draft alongside the TA's final response for future analysis (see the sketch after this list) and potentially soliciting feedback from TAs.
  • Model Selection: Deciding which model to deploy – be it a local instance of Code Llama, GPT-3, GPT-4, or another – requires careful consideration. We could conduct experiments using a selection of existing questions from our database to compare and assess the responses generated by different models.
  • Prompt Optimization: Determining the most effective system prompts for generating the draft answers.
  • Cost Analysis: What is the cost of using GPT-3 and GPT-4? Is Code Llama on a Mac Studio fast enough?
  • Continuous Evaluation: Developing a method for assessing new models or prompts as they emerge, potentially through explicit A/B testing where TAs are asked to judge the quality of responses.
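
For the effectiveness assessment and continuous evaluation, a rough sketch of the kind of record we could store per draft (field names are illustrative, not an existing schema):

const draftEvaluation = {
  questionId: 148513,       // the student question the draft was generated for
  model: "gpt-4",           // model that produced the draft
  promptVersion: "v1",      // which system prompt variant was used
  draft: "...",             // the AI-generated text shown to the TA
  finalAnswer: "...",       // what the TA actually sent after review
  verdict: "modified",      // "approved" | "modified" | "discarded"
  generationTimeMs: 45000   // how long generation took
};

Storing the verdict and the final answer next to the draft would make both the retrospective analysis and later A/B comparisons between models or prompts straightforward.
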
@bmesuere bmesuere added the feature New feature or request label Jan 30, 2024
@github-project-automation github-project-automation bot moved this to Unplanned in Roadmap Jan 30, 2024
@bmesuere bmesuere moved this from Unplanned to Todo in Roadmap Jan 30, 2024
@bmesuere bmesuere added the medium priority Things we want to see implemented relatively soon label Jan 30, 2024
@bmesuere (Member Author)

Some old code I wrote to generate answers based on questions as a stand-alone script:

import OpenAI from "openai";

import { JSDOM } from 'jsdom';

// Dodona API token (intentionally left empty here)
const dodonaHeaders = new Headers({
  "Authorization": ""
});

// OpenAI API key (intentionally left empty here)
const openai = new OpenAI({
  apiKey: ""
});

const systemPrompt = "Your goal is to help a teaching assistant answer student questions for a university-level programming course. You will be provided with the problem description, the code of the student, and the question of the student. Your answer should consist of 2 parts. First, very briefly summarize what the student did wrong to the teaching assistant. Second, provide a short response to the question aimed at the student in the same language as the student's question.";


// Example question to generate a draft answer for
const questionId = 148513;

async function fetchData(questionId) {
  // fetch question data from https://dodona.be/nl/annotations/<ID>.json
  let r = await fetch(`https://dodona.be/nl/annotations/${questionId}.json`, {headers: dodonaHeaders});
  const questionData = await r.json();
  const lineNr = questionData.line_nr;
  const question = questionData.annotation_text;
  const submissionUrl = questionData.submission_url;

  // fetch submission data
  r = await fetch(submissionUrl, { headers: dodonaHeaders });
  const submissionData = await r.json();
  const code = submissionData.code;
  const exerciseUrl = submissionData.exercise;

  // fetch exercise data
  r = await fetch(exerciseUrl, { headers: dodonaHeaders });
  const exerciseData = await r.json();
  const descriptionUrl = exerciseData.description_url;

  // fetch description
  r = await fetch(descriptionUrl, { headers: dodonaHeaders });
  const descriptionHtml = await r.text();
  const description = htmlToText(descriptionHtml);

  return {description, code, question, lineNr};
}

async function generateAnswer({description, code, question, lineNr}) {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {"role": "system", "content": systemPrompt},
      {"role": "user", "content": `Description: ${description}\nCode: ${code}\nQuestion on line ${lineNr}: ${question}`}
    ]
  });
  // Return the generated draft: the content of the first choice's message
  return response.choices[0].message.content;
}

function htmlToText(html) {
  // Extract the readable text from the description page and drop leftover
  // inline script lines (I18n config, dodona.ready handlers).
  const dom = new JSDOM(html);
  const text = dom.window.document.body.textContent
    .split("\n")
    .map(l => l.trim())
    .filter(line => !line.includes("I18n"))
    .filter(line => !line.includes("dodona.ready"))
    .join("\n");
  // Cut off everything after the "Links" marker (trailing parts of the page).
  return removeTextAfterSubstring(text, "Links").trim();
}

function removeTextAfterSubstring(str, substring) {
  const index = str.indexOf(substring);

  if (index === -1) {
    return str;  // substring not found
  }

  return str.substring(0, index);
}

const data = await fetchData(questionId);
console.log(data);
console.log(await generateAnswer(data));
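For reference: running this script should only require Node 18+ (for the built-in fetch/Headers and top-level await in an ES module), the openai and jsdom npm packages, a Dodona API token in the Authorization header, and an OpenAI API key.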

@bmesuere (Member Author)

bmesuere commented Feb 2, 2024

I tested the runtime performance of a few models on my Mac Studio (64 GB of memory):

| Model                  | Quantization | Memory usage | Inference  |
| ---------------------- | ------------ | ------------ | ---------- |
| codellama-34b-instruct | Q5_K_M       | 22.13 GB     | 9.87 tok/s |
| codellama-34b-instruct | Q6_K         | 25.63 GB     | 9.58 tok/s |
| codellama-34b-instruct | Q8_0         | 33.06 GB     | 9.32 tok/s |
| codellama-70b-instruct | Q4_K_M       | 38.37 GB     | 7.00 tok/s |
| codellama-70b-instruct | Q6_0         | 49.39 GB     | crashed    |
| mixtral-8x7b-instruct  | Q5_K_M       | 29.64 GB     | 21.5 tok/s |

I could not validate the output of codellama-70b since it seems to use a different prompt format.

@bmesuere (Member Author)

bmesuere commented Feb 3, 2024

I played around with the various models this afternoon. Some early observations:

  • I tweaked the system prompt from above: I left out the "2-part answer" and focused only on letting it generate a draft.
  • I couldn't get Code Llama to answer in Dutch; Mixtral did fine.
  • I did a quick search for existing questions we could use to evaluate the models, but was a bit disappointed: the data quality is very low.
    • many questions have no answer (often because it was given in person or the student had already solved the exercise)
    • questions are sometimes spread over multiple messages (a follow-up message is added before the question is answered)
    • questions are attached to a line, but often have nothing to do with that line
    • the question often assumes external knowledge that isn't explicitly mentioned, which makes it hard to answer; for example, "I don't know why this fails" where "this" is actually one of the tests
    • most importantly: it is often not clear what the question is
  • We'll probably have to include the problem description in the prompt to add extra context. Unfortunately, for many exercises this takes up a huge number of tokens, for several reasons: many exercises are written in HTML, which is much more verbose than Markdown; many contain a lot of irrelevant information; and the examples can be very long, token-wise. The problem with these huge prompts is that they are too large for self-hosted models and expensive for OpenAI models. Even with some strategies to reduce the description's token count (see the sketch after this list), I encountered descriptions of 5000+ tokens.
  • Problem descriptions in English work better than those in Dutch, but we can't use them because they sometimes require different function names.
  • It often takes 40-50 seconds to get a response, for both GPT-4 and Mixtral.
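
As a stopgap for the huge problem descriptions, even a crude character-based budget would keep prompts bounded; a real implementation would probably use a proper tokenizer, but as a sketch (the 4-characters-per-token ratio is only a rough approximation):

// Cap the problem description at an approximate token budget before prompting.
// Assumes roughly 4 characters per token, which is only a crude heuristic.
function truncateToTokenBudget(text, maxTokens = 2000) {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[description truncated]";
}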

@bmesuere bmesuere added low priority Thing we want to see implemented at some point and removed medium priority Things we want to see implemented relatively soon labels Feb 18, 2024
@bmesuere bmesuere removed the feature New feature or request label Oct 24, 2024