
Feature: add after_llm_cb to modify llm generated text #1044

Open
achapla opened this issue Nov 5, 2024 · 1 comment
Labels
question Further information is requested

Comments


achapla commented Nov 5, 2024

Question regarding handling special tokens in conversation transcription

First of all, thanks for making this wonderful SDK to easily create voice-enabled applications!

I'm currently building a quiz agent that asks questions to users. The user's response is evaluated by an LLM, and if it's correct, the agent congratulates the user and adds a special 'QUESTION_END' token in the response. This token is used to identify that the conversation related to the current question is finished. I then create a new chat context with the next question.

Issue

The issue I'm facing: although I strip the special 'QUESTION_END' token in the before_tts_cb function, so my agent never speaks it, the token still appears in the conversation text. The 'QUESTION_END' text is not removed from the LLM's response before transcription.
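For context, my before_tts_cb filter looks roughly like the sketch below. This is simplified (only the stream-filtering part is shown, not the full callback signature the SDK passes), and the buffering logic is my own workaround for tokens that get split across streaming chunk boundaries:

```python
import asyncio
from typing import AsyncIterable

QUESTION_END = "QUESTION_END"


async def strip_question_end(text: AsyncIterable[str]) -> AsyncIterable[str]:
    """Remove the control token, even when it arrives split across chunks."""
    buffer = ""
    async for chunk in text:
        buffer += chunk
        buffer = buffer.replace(QUESTION_END, "")
        # hold back any tail that could be the start of a split token
        keep = 0
        for i in range(len(QUESTION_END) - 1, 0, -1):
            if buffer.endswith(QUESTION_END[:i]):
                keep = i
                break
        emit, buffer = buffer[: len(buffer) - keep], buffer[len(buffer) - keep :]
        if emit:
            yield emit
    if buffer:  # leftover tail that never completed into the token
        yield buffer


async def _demo() -> str:
    async def chunks() -> AsyncIterable[str]:
        # token split across two streaming chunks
        for c in ("Correct! QUEST", "ION_END"):
            yield c

    return "".join([c async for c in strip_question_end(chunks())])


result = asyncio.run(_demo())
```

This works for the audio path, but as described below, it never touches the transcript.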

Upon further investigation, I discovered that the following code runs after the LLM call; it uses two different variables, tts_source and transcript_source, for different purposes:

def _synthesize_agent_speech(
    self,
    speech_id: str,
    source: str | LLMStream | AsyncIterable[str],
) -> SynthesisHandle:
    assert (
        self._agent_output is not None
    ), "agent output should be initialized when ready"
 
    if isinstance(source, LLMStream):
        source = _llm_stream_to_str_iterable(speech_id, source)
 
    og_source = source
    transcript_source = source
    if isinstance(og_source, AsyncIterable):
        og_source, transcript_source = utils.aio.itertools.tee(og_source, 2)
 
    tts_source = self._opts.before_tts_cb(self, og_source)
    if tts_source is None:
        raise ValueError("before_tts_cb must return str or AsyncIterable[str]")
 
    return self._agent_output.synthesize(
        speech_id=speech_id,
        tts_source=tts_source,
        transcript_source=transcript_source,
        transcription=self._opts.transcription.agent_transcription,
        transcription_speed=self._opts.transcription.agent_transcription_speed,
        sentence_tokenizer=self._opts.transcription.sentence_tokenizer,
        word_tokenizer=self._opts.transcription.word_tokenizer,
        hyphenate_word=self._opts.transcription.hyphenate_word,
    )

The tts_source does not contain the 'QUESTION_END' token, so it is never spoken, but the transcript_source still contains 'QUESTION_END', so it ends up in the transcription when committed.
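To illustrate the divergence: the two branches returned by tee see identical chunks, and before_tts_cb filters only the TTS branch. A self-contained toy (using a list-buffering stand-in for utils.aio.itertools.tee, which is not stdlib) reproduces it:

```python
import asyncio
from typing import AsyncIterable, Tuple


async def _collect(src: AsyncIterable[str]) -> str:
    return "".join([c async for c in src])


async def tee2(source: AsyncIterable[str]):
    # toy stand-in for utils.aio.itertools.tee: buffer once, replay twice
    items = [c async for c in source]

    async def replay():
        for item in items:
            yield item

    return replay(), replay()


async def before_tts_cb(src: AsyncIterable[str]) -> AsyncIterable[str]:
    # strips the token from the TTS branch only
    async for chunk in src:
        yield chunk.replace("QUESTION_END", "")


async def _demo() -> Tuple[str, str]:
    async def llm() -> AsyncIterable[str]:
        yield "Nice! QUESTION_END"

    og_source, transcript_source = await tee2(llm())
    tts_text = await _collect(before_tts_cb(og_source))
    transcript_text = await _collect(transcript_source)
    return tts_text, transcript_text


tts_text, transcript_text = asyncio.run(_demo())
```

The token survives in transcript_text even though tts_text is clean, which is exactly the behavior I'm seeing.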

Request for Enhancement

Currently, I've been unable to find a way to remove the 'QUESTION_END' text from the transcription, so I've resorted to a crude hack: stripping it in my frontend.

I am looking for an after_llm_cb function or similar hook that would allow for observation and modification of the LLM-generated text before it's committed to transcription.
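A minimal sketch of what I have in mind, assuming a hypothetical after_llm_cb shaped like before_tts_cb: the callback runs on the LLM stream before it is split, so an edit there reaches both the spoken audio and the committed transcript. The names synthesize and after_llm_cb below are illustrative, not actual SDK API:

```python
import asyncio
from typing import AsyncIterable, Callable, Tuple

StreamFilter = Callable[[AsyncIterable[str]], AsyncIterable[str]]


async def after_llm_cb(src: AsyncIterable[str]) -> AsyncIterable[str]:
    # hypothetical callback: edit the LLM-generated text once, up front
    async for chunk in src:
        cleaned = chunk.replace("QUESTION_END", "")
        if cleaned:
            yield cleaned


async def synthesize(
    source: AsyncIterable[str], cb: StreamFilter
) -> Tuple[str, str]:
    # run the callback *before* the stream is duplicated, so the edit
    # reaches both the TTS branch and the transcript branch
    edited = [c async for c in cb(source)]
    tts_text = "".join(edited)         # what the agent speaks
    transcript_text = "".join(edited)  # what gets committed
    return tts_text, transcript_text


async def _demo() -> Tuple[str, str]:
    async def llm() -> AsyncIterable[str]:
        for c in ("Well done! ", "QUESTION_END"):
            yield c

    return await synthesize(llm(), after_llm_cb)


spoken, committed = asyncio.run(_demo())
```

With a hook at this point, per-branch filtering in before_tts_cb would no longer be needed for this use case.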

Thank you!

achapla added the question (Further information is requested) label on Nov 5, 2024
davidzhao (Member) commented:

This is a good point; we should offer a way to override it before it's committed to the context.
