diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md index 5c0d54b9..be3e7d9f 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md @@ -39,4 +39,4 @@ In the future, we look forward to your patches and contributions to this project All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more -information on using pull requests. \ No newline at end of file +information on using pull requests. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md index 41e558a0..4b92730e 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md @@ -7,6 +7,7 @@ This repository serves as a comprehensive developer guide for [Google's Gemini M ## What You'll Learn By following this guide, you'll be able to: + - Build real-time audio chat applications with Gemini - Implement live video interactions through webcam and screen sharing - Create multimodal experiences combining audio and video @@ -18,18 +19,21 @@ The guide progresses from basic concepts to advanced implementations, culminatin ## Key Concepts Covered - **Real-time Communication:** + - WebSocket-based streaming - Bidirectional audio chat - Live video processing - Turn-taking and interruption handling - **Audio Processing:** + - Microphone input capture - Audio chunking and streaming - Voice Activity Detection (VAD) - Real-time audio playback - **Video Integration:** + - Webcam and screen capture - Frame processing and encoding - Simultaneous audio-video streaming @@ -45,20 +49,26 @@ The guide progresses from basic concepts to advanced implementations, culminatin ## Guide Structure ### [Part 1](part_1_intro): Introduction to Gemini's Multimodal Live API + Basic concepts and SDK usage: + - SDK setup and authentication - Text and audio interactions - Real-time audio chat implementation ### [Part 2](part_2_dev_api): WebSocket Development with [Gemini Developer API](https://ai.google.dev/api/multimodal-live) + Direct WebSocket implementation, building towards Project Pastra - a production-ready multimodal AI assistant inspired by Project Astra: + - Low-level WebSocket communication - Audio and video streaming - Function calling and system instructions - Mobile-first deployment ### [Part 3](part_3_vertex_api): WebSocket Development with [Vertex AI API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) + Enterprise-grade implementation using Vertex AI, mirroring Part 2's journey with production-focused architecture: + - Proxy-based authentication - Service account integration - Cloud deployment architecture @@ -68,14 +78,14 @@ Enterprise-grade implementation using Vertex AI, mirroring Part 2's journey with Below is a comprehensive overview of where each feature is implemented across the Development API and Vertex AI versions: -| Feature | Part 1 - Intro Chapter | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter | -|---------|----------------|----------------|-------------------| -| SDK setup and 
authentication | [Chapter 1](part_1_intro/chapter_01) | - | - | -| Text and audio interactions | [Chapter 1](part_1_intro/chapter_01) | - | - | -| Real-time Audio Chat | [Chapter 2](part_1_intro/chapter_02) | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) | -| Multimodal (Audio + Video) | - | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) | -| Function Calling & Instructions | - | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) | -| Production Deployment | - | [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) | +| Feature | Part 1 - Intro Chapter | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter | +| ------------------------------- | ------------------------------------ | -------------------------------------- | ------------------------------------------ | +| SDK setup and authentication | [Chapter 1](part_1_intro/chapter_01) | - | - | +| Text and audio interactions | [Chapter 1](part_1_intro/chapter_01) | - | - | +| Real-time Audio Chat | [Chapter 2](part_1_intro/chapter_02) | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) | +| Multimodal (Audio + Video) | - | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) | +| Function Calling & Instructions | - | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) | +| Production Deployment | - | [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) | Note: Vertex AI implementation starts directly with advanced features, skipping basic WebSocket and text-to-speech examples. @@ -94,6 +104,7 @@ Note: Vertex AI implementation starts directly with advanced features, skipping ## Key Differences Between Dev API and Vertex AI ### Development API (Part 2) + - Simple API key authentication - Direct WebSocket connection - All tools available simultaneously @@ -101,6 +112,7 @@ Note: Vertex AI implementation starts directly with advanced features, skipping - Ideal for prototyping and learning ### Vertex AI (Part 3) + - Service account authentication - Proxy-based architecture - Single tool limitation @@ -117,4 +129,3 @@ Note: Vertex AI implementation starts directly with advanced features, skipping ## License This project is licensed under the Apache License. 
- diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md index 5a6bfabe..2f9cb0b4 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md @@ -5,6 +5,7 @@ This section provides a foundational introduction to working with Google's Gemin ## Contents ### Chapter 1: SDK Basics + - Introduction to the Google Gemini AI SDK - Setting up the development environment - Basic text interactions with Gemini @@ -12,12 +13,14 @@ This section provides a foundational introduction to working with Google's Gemin - Examples using both direct API key authentication and Vertex AI authentication ### Chapter 2: Multimodal Interactions + - Real-time audio conversations with Gemini - Streaming audio input and output - Voice activity detection and turn-taking - Example implementation of an interactive voice chat ## Key Features Covered + - Text generation and conversations - Audio output generation - Real-time streaming interactions @@ -25,6 +28,7 @@ This section provides a foundational introduction to working with Google's Gemin - Multimodal capabilities (text-to-audio, audio-to-audio) ## Prerequisites + - Python environment - Google Gemini API access - Required packages: @@ -32,4 +36,5 @@ This section provides a foundational introduction to working with Google's Gemin - `pyaudio` (for audio examples) ## Getting Started -Each chapter contains Jupyter notebooks and Python scripts that demonstrate different aspects of the Gemini AI capabilities. Start with Chapter 1's notebooks for basic SDK usage, and then move on to the more advanced multimodal examples in Chapter 2. \ No newline at end of file + +Each chapter contains Jupyter notebooks and Python scripts that demonstrate different aspects of the Gemini AI capabilities. Start with Chapter 1's notebooks for basic SDK usage, and then move on to the more advanced multimodal examples in Chapter 2. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md index 944a57f1..ac961444 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md @@ -16,39 +16,40 @@ The application's functionality can be broken down into several key components: The `pyaudio` library is used to create input and output streams that interface with the user's audio hardware. -* **Input Stream:** An input stream is initialized to capture audio data from the user's microphone. The stream is configured with parameters such as format, channels, sample rate, and chunk size. The `SEND_SAMPLE_RATE` is set to 16000 Hz, which is a common sample rate for speech recognition. The `CHUNK_SIZE` determines the number of audio frames read from the microphone at a time. The `exception_on_overflow` parameter is set to `False` to prevent the stream from raising an exception if the buffer overflows. -* **Output Stream:** An output stream is initialized to play audio data through the user's speakers. Similar to the input stream, it is configured with appropriate parameters. 
The `RECEIVE_SAMPLE_RATE` is set to 24000 Hz, which is suitable for high-quality audio playback. +- **Input Stream:** An input stream is initialized to capture audio data from the user's microphone. The stream is configured with parameters such as format, channels, sample rate, and chunk size. The `SEND_SAMPLE_RATE` is set to 16000 Hz, which is a common sample rate for speech recognition. The `CHUNK_SIZE` determines the number of audio frames read from the microphone at a time. The `exception_on_overflow` parameter is set to `False` to prevent the stream from raising an exception if the buffer overflows. +- **Output Stream:** An output stream is initialized to play audio data through the user's speakers. Similar to the input stream, it is configured with appropriate parameters. The `RECEIVE_SAMPLE_RATE` is set to 24000 Hz, which is suitable for high-quality audio playback. ### Communication with Gemini API The `google-genai` library provides the necessary tools to connect to the Gemini API and establish a communication session. -* **Client Initialization:** A `genai.Client` is created to interact with the API. The `http_options` parameter is used to specify the API version, which is set to `'v1alpha'` in this case. -* **Session Configuration:** A configuration object `CONFIG` is defined to customize the interaction with the model. This includes: - * `generation_config`: Specifies the response modality as "AUDIO" and configures the "speech_config" to "Puck". - * `system_instruction`: Sets a system instruction to always start the model's sentences with "mate". -* **Live Connection:** The `client.aio.live.connect` method establishes a live connection to the Gemini model specified by `MODEL`, which is set to `"models/gemini-2.0-flash-exp"`. +- **Client Initialization:** A `genai.Client` is created to interact with the API. The `http_options` parameter is used to specify the API version, which is set to `'v1alpha'` in this case. +- **Session Configuration:** A configuration object `CONFIG` is defined to customize the interaction with the model. This includes: + - `generation_config`: Specifies the response modality as "AUDIO" and configures the "speech_config" to "Puck". + - `system_instruction`: Sets a system instruction to always start the model's sentences with "mate". +- **Live Connection:** The `client.aio.live.connect` method establishes a live connection to the Gemini model specified by `MODEL`, which is set to `"models/gemini-2.0-flash-exp"`. ### Asynchronous Audio Handling The `asyncio` library is used to manage the asynchronous operations involved in audio processing and communication. -* **Audio Queue:** An `asyncio.Queue` is created to store audio data temporarily. This queue is not used in the current implementation but is defined for potential future use. -* **Task Group:** An `asyncio.TaskGroup` is used to manage two concurrent tasks: `listen_and_send` and `receive_and_play`. -* **`listen_and_send` Task:** This task continuously reads audio data from the input stream in chunks and sends it to the Gemini API. It checks if the model is currently speaking (`model_speaking` flag) and only sends data if the model is not speaking. The chunking is performed using the `pyaudio` library's `read()` method, which is called with a specific `CHUNK_SIZE` (number of audio frames per chunk). Here's how it's done in the code: +- **Audio Queue:** An `asyncio.Queue` is created to store audio data temporarily. 
This queue is not used in the current implementation but is defined for potential future use. +- **Task Group:** An `asyncio.TaskGroup` is used to manage two concurrent tasks: `listen_and_send` and `receive_and_play`. +- **`listen_and_send` Task:** This task continuously reads audio data from the input stream in chunks and sends it to the Gemini API. It checks if the model is currently speaking (`model_speaking` flag) and only sends data if the model is not speaking. The chunking is performed using the `pyaudio` library's `read()` method, which is called with a specific `CHUNK_SIZE` (number of audio frames per chunk). Here's how it's done in the code: - ```python - while True: - if not model_speaking: - try: - data = await asyncio.to_thread(input_stream.read, CHUNK_SIZE, exception_on_overflow=False) - # ... send data to API ... - except OSError as e: - # ... handle error ... - ``` + ```python + while True: + if not model_speaking: + try: + data = await asyncio.to_thread(input_stream.read, CHUNK_SIZE, exception_on_overflow=False) + # ... send data to API ... + except OSError as e: + # ... handle error ... + ``` - In this code, `input_stream.read(CHUNK_SIZE)` reads a chunk of audio frames from the microphone's input buffer. Each chunk is then sent to the API along with the `end_of_turn=True` flag. -* **`receive_and_play` Task:** This task continuously receives responses from the Gemini API and plays the audio data through the output stream. It sets the `model_speaking` flag to `True` when the model starts speaking and to `False` when the turn is complete. It then iterates through the parts of the response and writes the audio data to the output stream. + In this code, `input_stream.read(CHUNK_SIZE)` reads a chunk of audio frames from the microphone's input buffer. Each chunk is then sent to the API along with the `end_of_turn=True` flag. + +- **`receive_and_play` Task:** This task continuously receives responses from the Gemini API and plays the audio data through the output stream. It sets the `model_speaking` flag to `True` when the model starts speaking and to `False` when the turn is complete. It then iterates through the parts of the response and writes the audio data to the output stream. ### Audio Chunking and Real-time Interaction @@ -68,7 +69,6 @@ In this case, with a `CHUNK_SIZE` of 512 frames and a `SEND_SAMPLE_RATE` of 1600 `Chunk Duration = 512 frames / 16000 Hz = 0.032 seconds = 32 milliseconds` - Therefore, each chunk represents 32 milliseconds of audio. **Real-time Interaction Flow:** @@ -97,21 +97,21 @@ The application distinguishes between user input and model output through a comb **Distinguishing Input from Output:** -* **`model_speaking` Flag:** This boolean flag serves as a primary mechanism to differentiate between when the user is providing input and when the model is generating output. - * When `model_speaking` is `False`, the application assumes it's the user's turn to speak. The `listen_and_send` task reads audio data from the microphone and sends it to the API. - * When `model_speaking` is `True`, the application understands that the model is currently generating an audio response. The `listen_and_send` task pauses, preventing user input from being sent to the API while the model is "speaking." The `receive_and_play` task is active during this time, receiving and playing the model's audio output. 
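
To make this coordination concrete, here is a minimal sketch of how the two tasks described above might be structured. Treat it as an illustrative outline rather than the chapter's exact script: the `session.send(...)` payload and `end_of_turn=True` flag are taken directly from this README, while the `session.receive()` async iterator and the `server_content.model_turn.parts[...].inline_data.data` attribute path are assumptions about the shape of the SDK's response objects.

```python
import asyncio

CHUNK_SIZE = 512
model_speaking = False  # True while the model's audio response is being played back


async def listen_and_send(session, input_stream):
    """Forward microphone chunks to the API while the model is not speaking."""
    while True:
        if not model_speaking:
            # pyaudio's blocking read runs in a worker thread so the event loop stays responsive.
            data = await asyncio.to_thread(
                input_stream.read, CHUNK_SIZE, exception_on_overflow=False
            )
            # Each chunk is sent as raw PCM and marked end_of_turn; the API's
            # server-side VAD decides where the user's turn actually ends.
            await session.send({"data": data, "mime_type": "audio/pcm"}, end_of_turn=True)
        else:
            await asyncio.sleep(0.01)  # yield control while the model is speaking


async def receive_and_play(session, output_stream):
    """Play the model's audio and toggle model_speaking around each turn."""
    global model_speaking
    while True:
        async for response in session.receive():  # assumed: async iterator of responses
            server_content = response.server_content
            if server_content is None:
                continue
            if server_content.model_turn is not None:
                model_speaking = True
                for part in server_content.model_turn.parts:
                    if part.inline_data is not None:  # raw PCM bytes from the model
                        await asyncio.to_thread(output_stream.write, part.inline_data.data)
            if server_content.turn_complete:
                model_speaking = False  # model's turn is over; resume sending user audio
```

In the application itself, these two coroutines run concurrently inside an `asyncio.TaskGroup` under the live session opened with `client.aio.live.connect`.
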
+- **`model_speaking` Flag:** This boolean flag serves as a primary mechanism to differentiate between when the user is providing input and when the model is generating output. + - When `model_speaking` is `False`, the application assumes it's the user's turn to speak. The `listen_and_send` task reads audio data from the microphone and sends it to the API. + - When `model_speaking` is `True`, the application understands that the model is currently generating an audio response. The `listen_and_send` task pauses, preventing user input from being sent to the API while the model is "speaking." The `receive_and_play` task is active during this time, receiving and playing the model's audio output. **How Audio Chunks are Sent:** -* **`end_of_turn=True` with Each Chunk:** The `listen_and_send` task sends each chunk of audio data (determined by `CHUNK_SIZE`) with `end_of_turn=True` in the message payload: `await session.send({"data": data, "mime_type": "audio/pcm"}, end_of_turn=True)`. This might seem like it would constantly interrupt the conversation flow. However, the API handles this gracefully. -* **API-Side Buffering and VAD:** The Gemini API likely buffers the incoming audio chunks on its end. Even though each chunk is marked as the end of a turn with `end_of_turn=True`, the API's Voice Activity Detection (VAD) analyzes the buffered audio to identify longer pauses or periods of silence that more accurately represent the actual end of the user's speech. The API can group several chunks into what it considers a single user turn based on its VAD analysis, rather than strictly treating each chunk as a separate turn. -* **Low-Latency Processing:** The API is designed for low-latency interaction. It starts processing the received audio chunks as soon as possible. Even if `end_of_turn=True` is sent with each chunk, the API can begin generating a response while still receiving more audio from the user, as long as it hasn't detected a significant enough pause to finalize the user's turn based on its VAD. +- **`end_of_turn=True` with Each Chunk:** The `listen_and_send` task sends each chunk of audio data (determined by `CHUNK_SIZE`) with `end_of_turn=True` in the message payload: `await session.send({"data": data, "mime_type": "audio/pcm"}, end_of_turn=True)`. This might seem like it would constantly interrupt the conversation flow. However, the API handles this gracefully. +- **API-Side Buffering and VAD:** The Gemini API likely buffers the incoming audio chunks on its end. Even though each chunk is marked as the end of a turn with `end_of_turn=True`, the API's Voice Activity Detection (VAD) analyzes the buffered audio to identify longer pauses or periods of silence that more accurately represent the actual end of the user's speech. The API can group several chunks into what it considers a single user turn based on its VAD analysis, rather than strictly treating each chunk as a separate turn. +- **Low-Latency Processing:** The API is designed for low-latency interaction. It starts processing the received audio chunks as soon as possible. Even if `end_of_turn=True` is sent with each chunk, the API can begin generating a response while still receiving more audio from the user, as long as it hasn't detected a significant enough pause to finalize the user's turn based on its VAD. **Determining End of Model Turn:** -* **`turn_complete` Field:** The `receive_and_play` task continuously listens for responses from the API. 
Each response includes a `server_content` object, which contains a `turn_complete` field. - * When `turn_complete` is `True`, it signifies that the model has finished generating its response for the current turn. - * Upon receiving a `turn_complete: True` signal, the `receive_and_play` task sets the `model_speaking` flag to `False`. This signals that the model's turn is over, and the application is ready to accept new user input. +- **`turn_complete` Field:** The `receive_and_play` task continuously listens for responses from the API. Each response includes a `server_content` object, which contains a `turn_complete` field. + - When `turn_complete` is `True`, it signifies that the model has finished generating its response for the current turn. + - Upon receiving a `turn_complete: True` signal, the `receive_and_play` task sets the `model_speaking` flag to `False`. This signals that the model's turn is over, and the application is ready to accept new user input. **Turn-Taking Flow:** @@ -127,7 +127,7 @@ In essence, although `end_of_turn=True` is sent with each audio chunk, the API's ### Why Always Set `end_of_turn=True`? -Setting `end_of_turn=True` with each audio chunk, even when the user hasn't finished speaking, might seem counterintuitive. Here are some reasons for this design choice: +Setting `end_of_turn=True` with each audio chunk, even when the user hasn't finished speaking, might seem counterintuitive. Here are some reasons for this design choice: 1. **Simplicity and Reduced Client-Side Complexity:** Implementing robust Voice Activity Detection (VAD) on the client-side can be complex. By always setting `end_of_turn=True`, the developers might have opted for a simpler client-side implementation that offloads the more complex VAD task to the Gemini API. 2. **Lower Latency:** Sending smaller chunks with `end_of_turn=True` might allow the API to start processing the audio sooner. However, this potential latency benefit depends heavily on how the API is designed. @@ -149,5 +149,4 @@ The `if __name__ == "__main__":` block ensures that the `audio_loop` function is ## Limitations -The current implementation does not support user interruption of the model's speech. Future implementations could support interruption by sending a specific interrupt signal to the API or by modifying the current `end_of_turn` logic to be more responsive to shorter pauses in user speech. - +The current implementation does not support user interruption of the model's speech. Future implementations could support interruption by sending a specific interrupt signal to the API or by modifying the current `end_of_turn` logic to be more responsive to shorter pauses in user speech. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md index 57b8152a..cd39b9ff 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md @@ -3,47 +3,55 @@ This section demonstrates how to work directly with the Gemini API using WebSockets, progressively building towards Project Pastra - a production-ready multimodal AI assistant inspired by Google DeepMind's Project Astra. Through a series of chapters, we evolve from basic implementations to a sophisticated, mobile-first application that showcases the full potential of the Gemini API. 
## Journey to Project Pastra + Starting with fundamental WebSocket concepts, each chapter adds new capabilities, ultimately culminating in Project Pastra - our implementation of a universal AI assistant that can see, hear, and interact in real-time. Like Project Astra (Google DeepMind's research prototype), our application demonstrates how to create an AI assistant that can engage in natural, multimodal interactions while maintaining production-grade reliability. ## Contents ### Chapter 3: Basic WebSocket Communication + - Single exchange example with the Gemini API - Core WebSocket setup and communication - Understanding the API's message formats - Handling the mandatory setup phase ### Chapter 4: Text-to-Speech Implementation + - Converting text input to audio responses - Real-time audio playback in the browser - Audio chunk management and streaming - WebSocket and AudioContext integration ### Chapter 5: Real-time Audio Chat + - Bidirectional audio communication - Live microphone input processing - Voice activity detection and turn management - Advanced audio streaming techniques ### Chapter 6: Multimodal Interactions + - Adding video capabilities (webcam and screen sharing) - Frame capture and processing - Simultaneous audio and video streaming - Enhanced user interface controls ### Chapter 7: Advanced Features + - Function calling capabilities - System instructions integration - External API integrations (weather, search) - Code execution functionality ### Chapter 8: Project Pastra + - Mobile-first UI design inspired by Project Astra - Cloud Run deployment setup - Production-grade error handling - Scalable architecture implementation ## Key Features + - Direct WebSocket communication with Gemini API - Real-time audio and video processing - Browser-based implementation @@ -51,6 +59,7 @@ Starting with fundamental WebSocket concepts, each chapter adds new capabilities - Production deployment guidance ## Prerequisites + - Basic understanding of WebSockets - Familiarity with JavaScript and HTML5 - Google Gemini API access @@ -59,20 +68,24 @@ Starting with fundamental WebSocket concepts, each chapter adds new capabilities ## Getting Started This guide uses a simple development server to: + - Serve the HTML/JavaScript files for each chapter - Provide access to shared components (audio processing, media handling, etc.) used across chapters - Enable proper loading of JavaScript modules and assets - Avoid CORS issues when accessing local files 1. Start the development server: + ```bash python server.py ``` + This will serve both the chapter files and shared components at http://localhost:8000 2. Navigate to the specific chapter you want to work with: + - Chapter 3: http://localhost:8000/chapter_03/ - Chapter 4: http://localhost:8000/chapter_04/ - And so on... + And so on... -3. Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. By Chapter 8, you'll have transformed the development prototype into Project Pastra - a production-ready AI assistant that demonstrates the future of human-AI interaction. \ No newline at end of file +3. Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. 
By Chapter 8, you'll have transformed the development prototype into Project Pastra - a production-ready AI assistant that demonstrates the future of human-AI interaction. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md index 3105385d..88454102 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md @@ -10,97 +10,99 @@ The application's functionality can be broken down into several key components: ### 1. Establishing a WebSocket Connection -* **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: - ``` - wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} - ``` - This URL includes the API host, the service path, and an API key for authentication. Replace `${apiKey}` with your actual API key. -* **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: - ```javascript - const ws = new WebSocket(endpoint); - ``` -* **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: - * `onopen`: Triggered when the connection is successfully opened. - * `onmessage`: Triggered when a message is received from the server. - * `onerror`: Triggered if an error occurs during the connection. - * `onclose`: Triggered when the connection is closed. +- **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: + ``` + wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} + ``` + This URL includes the API host, the service path, and an API key for authentication. Replace `${apiKey}` with your actual API key. +- **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: + ```javascript + const ws = new WebSocket(endpoint); + ``` +- **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: + - `onopen`: Triggered when the connection is successfully opened. + - `onmessage`: Triggered when a message is received from the server. + - `onerror`: Triggered if an error occurs during the connection. + - `onclose`: Triggered when the connection is closed. ### 2. Sending a Setup Message (Mandatory First Step) -* **API Requirement:** The Gemini API requires a setup message to be sent as the **very first message** after the WebSocket connection is established. This is crucial for configuring the session. -* **`onopen` Handler:** The `onopen` event handler, which is triggered when the connection is open, is responsible for sending this setup message. -* **Setup Message Structure:** The setup message is a JSON object that conforms to the `BidiGenerateContentSetup` format as defined in the API documentation: - ```javascript - const setupMessage = { - setup: { - model: "models/gemini-2.0-flash-exp", - generation_config: { - response_modalities: ["text"] - } - } - }; - ``` - * `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). 
- * `generation_config`: Configures the generation parameters, such as the `response_modalities` (set to `"text"` for text-based output). You can also specify other parameters like `temperature`, `top_p`, `top_k`, etc., within `generation_config` as needed. -* **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`: - ```javascript - ws.send(JSON.stringify(setupMessage)); - ``` +- **API Requirement:** The Gemini API requires a setup message to be sent as the **very first message** after the WebSocket connection is established. This is crucial for configuring the session. +- **`onopen` Handler:** The `onopen` event handler, which is triggered when the connection is open, is responsible for sending this setup message. +- **Setup Message Structure:** The setup message is a JSON object that conforms to the `BidiGenerateContentSetup` format as defined in the API documentation: + ```javascript + const setupMessage = { + setup: { + model: "models/gemini-2.0-flash-exp", + generation_config: { + response_modalities: ["text"], + }, + }, + }; + ``` + - `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). + - `generation_config`: Configures the generation parameters, such as the `response_modalities` (set to `"text"` for text-based output). You can also specify other parameters like `temperature`, `top_p`, `top_k`, etc., within `generation_config` as needed. +- **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`: + ```javascript + ws.send(JSON.stringify(setupMessage)); + ``` ### 3. Receiving and Processing Messages -* **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. -* **Data Handling:** The code handles potential `Blob` data using `new Response(event.data).text()`, but in this text-only example, it directly parses the message as JSON. -* **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. -* **Message Types:** The code specifically checks for a `BidiGenerateContentSetupComplete` message type, indicated by the `setupComplete` field in the response. +- **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. +- **Data Handling:** The code handles potential `Blob` data using `new Response(event.data).text()`, but in this text-only example, it directly parses the message as JSON. +- **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. +- **Message Types:** The code specifically checks for a `BidiGenerateContentSetupComplete` message type, indicated by the `setupComplete` field in the response. ### 4. Confirming Setup Completion Before Proceeding -* **`setupComplete` Check:** The code includes a conditional check to ensure that a `setupComplete` message is received before sending any user content: - ```javascript - if (response.setupComplete) { - // ... Send user message ... - } - ``` -* **Why This Is Important:** This check is essential because the API will not process user content messages until the setup is complete. Sending content before receiving confirmation that the setup is complete will likely result in an error or unexpected behavior. The API might close the connection if messages other than the initial setup message are sent before the setup is completed. 
+- **`setupComplete` Check:** The code includes a conditional check to ensure that a `setupComplete` message is received before sending any user content: + ```javascript + if (response.setupComplete) { + // ... Send user message ... + } + ``` +- **Why This Is Important:** This check is essential because the API will not process user content messages until the setup is complete. Sending content before receiving confirmation that the setup is complete will likely result in an error or unexpected behavior. The API might close the connection if messages other than the initial setup message are sent before the setup is completed. ### 5. Sending a Hardcoded User Message -* **Triggered by `setupComplete`:** Only after the `setupComplete` message is received and processed does the application send a user message to the model. -* **User Message Structure:** The user message is a JSON object conforming to the `BidiGenerateContentClientContent` format: - ```javascript - const contentMessage = { - client_content: { - turns: [{ +- **Triggered by `setupComplete`:** Only after the `setupComplete` message is received and processed does the application send a user message to the model. +- **User Message Structure:** The user message is a JSON object conforming to the `BidiGenerateContentClientContent` format: + ```javascript + const contentMessage = { + client_content: { + turns: [ + { role: "user", - parts: [{ text: "Hello! Are you there?" }] - }], - turn_complete: true - } - }; - ``` - * `client_content`: Contains the conversation content. - * `turns`: An array representing the conversation turns. - * `role`: Indicates the role of the speaker ("user" in this case). - * `parts`: An array of content parts (in this case, a single text part). - * `text`: The actual user message (hardcoded to "Hello! Are you there?"). - * `turn_complete`: Set to `true` to signal the end of the user's turn. -* **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. + parts: [{ text: "Hello! Are you there?" }], + }, + ], + turn_complete: true, + }, + }; + ``` + - `client_content`: Contains the conversation content. + - `turns`: An array representing the conversation turns. + - `role`: Indicates the role of the speaker ("user" in this case). + - `parts`: An array of content parts (in this case, a single text part). + - `text`: The actual user message (hardcoded to "Hello! Are you there?"). + - `turn_complete`: Set to `true` to signal the end of the user's turn. +- **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. ### 6. Displaying the Model's Response -* **`serverContent` Handling:** When a `serverContent` message is received (which contains the model's response), the application extracts the response text. -* **Response Extraction:** The model's response is accessed using `response.serverContent.modelTurn.parts[0]?.text`. -* **Displaying the Response:** The `logMessage()` function displays the model's response in the `output` div on the HTML page. +- **`serverContent` Handling:** When a `serverContent` message is received (which contains the model's response), the application extracts the response text. +- **Response Extraction:** The model's response is accessed using `response.serverContent.modelTurn.parts[0]?.text`. +- **Displaying the Response:** The `logMessage()` function displays the model's response in the `output` div on the HTML page. ### 7. 
Error Handling and Connection Closure -* **`onerror` Handler:** The `onerror` event handler logs any WebSocket errors to the console and displays an error message on the page. -* **`onclose` Handler:** The `onclose` event handler logs information about the connection closure, including the reason and status code. +- **`onerror` Handler:** The `onerror` event handler logs any WebSocket errors to the console and displays an error message on the page. +- **`onclose` Handler:** The `onclose` event handler logs information about the connection closure, including the reason and status code. ### 8. Logging Messages -* **`logMessage()` Function:** This utility function creates a new paragraph element (`
<p>
`) and appends it to the `output` div, displaying the provided message on the page. +- **`logMessage()` Function:** This utility function creates a new paragraph element (`
<p>
`) and appends it to the `output` div, displaying the provided message on the page. ## Educational Purpose @@ -117,20 +119,20 @@ By examining this code, you can gain a deeper understanding of the underlying co **Note:** This is a simplified example for educational purposes. A real-world chat application would involve more complex features like: -* Dynamic user input. -* Handling multiple conversation turns. -* Maintaining conversation history. -* Potentially integrating audio or video. +- Dynamic user input. +- Handling multiple conversation turns. +- Maintaining conversation history. +- Potentially integrating audio or video. This example provides a solid foundation for understanding the basic principles involved in interacting with the Gemini API at a low level using WebSockets, especially the crucial setup process. **Note:** This is a simplified example for educational purposes. A real-world chat application would involve more complex features like: -* Dynamic user input (see Chapter 4). -* Handling multiple conversation turns. -* Maintaining conversation history. -* Potentially integrating audio or video (see Chapter 5 & 6). +- Dynamic user input (see Chapter 4). +- Handling multiple conversation turns. +- Maintaining conversation history. +- Potentially integrating audio or video (see Chapter 5 & 6). **Security Best Practices:** -For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. \ No newline at end of file +For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html index fda55344..16c19623 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html @@ -1,4 +1,4 @@ - + - - Gemini WebSocket Test - - - -
[chapter_03/index.html — markup omitted; visible page text: heading "Gemini WebSocket Test"; description: "This is a simple demonstration of WebSocket communication with the Gemini API, showing a single exchange between user and model. It illustrates the fundamental principles of interacting with the API at a low level, without using an SDK."]
- - - - \ No newline at end of file + + + diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/README.md index fd4c7787..f06188a3 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/README.md @@ -15,121 +15,123 @@ The application's functionality can be broken down into several key components: ### 1. Establishing a WebSocket Connection -* **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: - ``` - wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} - ``` - This URL includes the API host, the service path, and an API key for authentication. Remember to replace `${apiKey}` with your actual API key. -* **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: - ```javascript - const ws = new WebSocket(endpoint); - ``` -* **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: - * `onopen`: Triggered when the connection is successfully opened. - * `onmessage`: Triggered when a message is received from the server. - * `onerror`: Triggered if an error occurs during the connection. - * `onclose`: Triggered when the connection is closed. +- **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: + ``` + wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} + ``` + This URL includes the API host, the service path, and an API key for authentication. Remember to replace `${apiKey}` with your actual API key. +- **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: + ```javascript + const ws = new WebSocket(endpoint); + ``` +- **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: + - `onopen`: Triggered when the connection is successfully opened. + - `onmessage`: Triggered when a message is received from the server. + - `onerror`: Triggered if an error occurs during the connection. + - `onclose`: Triggered when the connection is closed. ### 2. Sending a Setup Message -* **`onopen` Handler:** When the `onopen` event is triggered, the application sends a setup message to the API. -* **Setup Message Structure:** The setup message is a JSON object that configures the interaction: - ```javascript - const setupMessage = { - setup: { - model: "models/gemini-2.0-flash-exp", - generation_config: { - response_modalities: ["AUDIO"] - } - } - }; - ``` - * `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). - * `generation_config`: Configures the generation parameters. Here, `response_modalities` is set to `["AUDIO"]` to request audio output. -* **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`. -* **Input Disabled:** Initially, the user input field and send button are disabled. They are only enabled after the setup is complete. +- **`onopen` Handler:** When the `onopen` event is triggered, the application sends a setup message to the API. 
+- **Setup Message Structure:** The setup message is a JSON object that configures the interaction: + ```javascript + const setupMessage = { + setup: { + model: "models/gemini-2.0-flash-exp", + generation_config: { + response_modalities: ["AUDIO"], + }, + }, + }; + ``` + - `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). + - `generation_config`: Configures the generation parameters. Here, `response_modalities` is set to `["AUDIO"]` to request audio output. +- **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`. +- **Input Disabled:** Initially, the user input field and send button are disabled. They are only enabled after the setup is complete. ### 3. Receiving and Processing Messages -* **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. -* **Handling Different Response Types:** The code handles either `Blob` or `JSON` data. It converts `Blob` data to text and parses the text as JSON. -* **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. -* **Message Types:** The code checks for two types of messages: - * **`setupComplete`:** Indicates that the setup process is finished. - * **`serverContent`:** Contains the model's response, which in this case will be audio data. +- **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. +- **Handling Different Response Types:** The code handles either `Blob` or `JSON` data. It converts `Blob` data to text and parses the text as JSON. +- **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. +- **Message Types:** The code checks for two types of messages: + - **`setupComplete`:** Indicates that the setup process is finished. + - **`serverContent`:** Contains the model's response, which in this case will be audio data. ### 4. Sending User Messages -* **Enabling Input:** When a `setupComplete` message is received, the application enables the user input field and the send button. -* **`sendUserMessage()` Function:** This function is called when the user clicks the "Send" button or presses Enter in the input field. -* **User Message Structure:** The user message is a JSON object: - ```javascript - const contentMessage = { - client_content: { - turns: [{ - role: "user", - parts: [{ text: message }] - }], - turn_complete: true - } - }; - ``` - * `client_content`: Contains the conversation content. - * `turns`: An array representing the conversation turns. - * `role`: Indicates the role of the speaker ("user" in this case). - * `parts`: An array of content parts (in this case, a single text part containing the user's message). - * `turn_complete`: Set to `true` to signal the end of the user's turn. -* **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. -* **Clearing Input:** The input field is cleared after the message is sent. +- **Enabling Input:** When a `setupComplete` message is received, the application enables the user input field and the send button. +- **`sendUserMessage()` Function:** This function is called when the user clicks the "Send" button or presses Enter in the input field. 
+- **User Message Structure:** The user message is a JSON object: + ```javascript + const contentMessage = { + client_content: { + turns: [ + { + role: "user", + parts: [{ text: message }], + }, + ], + turn_complete: true, + }, + }; + ``` + - `client_content`: Contains the conversation content. + - `turns`: An array representing the conversation turns. + - `role`: Indicates the role of the speaker ("user" in this case). + - `parts`: An array of content parts (in this case, a single text part containing the user's message). + - `turn_complete`: Set to `true` to signal the end of the user's turn. +- **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. +- **Clearing Input:** The input field is cleared after the message is sent. ### 5. Handling Audio Responses -* **`serverContent` with Audio:** When a `serverContent` message containing audio data is received, the application extracts the base64-encoded audio data. -* **`inlineData`:** The audio data is found in `response.serverContent.modelTurn.parts[0].inlineData.data`. -* **`playAudioChunk()`:** This function is called to handle the audio chunk. -* **Audio Queue:** Audio is pushed into an `audioQueue` array for processing. -* **Audio Playback Management:** `isPlayingAudio` flag ensures that chunks are played sequentially, one after the other. +- **`serverContent` with Audio:** When a `serverContent` message containing audio data is received, the application extracts the base64-encoded audio data. +- **`inlineData`:** The audio data is found in `response.serverContent.modelTurn.parts[0].inlineData.data`. +- **`playAudioChunk()`:** This function is called to handle the audio chunk. +- **Audio Queue:** Audio is pushed into an `audioQueue` array for processing. +- **Audio Playback Management:** `isPlayingAudio` flag ensures that chunks are played sequentially, one after the other. ### 6. Audio Playback with `AudioContext` -* **`ensureAudioInitialized()`:** This function initializes the `AudioContext` when the first audio chunk is received. This is done lazily to comply with browser autoplay policies. It sets a sample rate of 24000. - * **Lazy Initialization:** The `AudioContext` is only created when the first audio chunk is received. This is because some browsers restrict audio playback unless it's initiated by a user action. - * **Sample Rate:** The sample rate is set to 24000 Hz, which is a common sample rate for speech audio. -* **`playAudioChunk()`:** This function adds an audio chunk to a queue (`audioQueue`) and initiates audio playback if it's not already playing. -* **`processAudioQueue()`:** This function is responsible for processing and playing audio chunks from the queue. - * **Chunk Handling:** It retrieves an audio chunk from the queue. - * **Base64 Decoding:** The base64-encoded audio chunk is decoded to an `ArrayBuffer` using `base64ToArrayBuffer()`. - * **PCM to Float32 Conversion:** The raw PCM16LE (16-bit little-endian Pulse Code Modulation) audio data is converted to Float32 format using `convertPCM16LEToFloat32()`. This is necessary because `AudioContext` works with floating-point audio data. - * **Creating an `AudioBuffer`:** An `AudioBuffer` is created with a single channel, the appropriate length, and a sample rate of 24000 Hz. The Float32 audio data is then copied into the `AudioBuffer`. - * **Creating an `AudioBufferSourceNode`:** An `AudioBufferSourceNode` is created, which acts as a source for the audio data. The `AudioBuffer` is assigned to the source node. 
- * **Connecting to Destination:** The source node is connected to the `AudioContext`'s destination (the speakers). - * **Starting Playback:** `source.start(0)` starts the playback of the audio chunk immediately. - * **`onended` Event:** A promise is used with the `onended` event of the source node to ensure that the next chunk in the queue is only played after the current chunk has finished playing. This is crucial for maintaining the correct order and avoiding overlapping audio. +- **`ensureAudioInitialized()`:** This function initializes the `AudioContext` when the first audio chunk is received. This is done lazily to comply with browser autoplay policies. It sets a sample rate of 24000. + - **Lazy Initialization:** The `AudioContext` is only created when the first audio chunk is received. This is because some browsers restrict audio playback unless it's initiated by a user action. + - **Sample Rate:** The sample rate is set to 24000 Hz, which is a common sample rate for speech audio. +- **`playAudioChunk()`:** This function adds an audio chunk to a queue (`audioQueue`) and initiates audio playback if it's not already playing. +- **`processAudioQueue()`:** This function is responsible for processing and playing audio chunks from the queue. + - **Chunk Handling:** It retrieves an audio chunk from the queue. + - **Base64 Decoding:** The base64-encoded audio chunk is decoded to an `ArrayBuffer` using `base64ToArrayBuffer()`. + - **PCM to Float32 Conversion:** The raw PCM16LE (16-bit little-endian Pulse Code Modulation) audio data is converted to Float32 format using `convertPCM16LEToFloat32()`. This is necessary because `AudioContext` works with floating-point audio data. + - **Creating an `AudioBuffer`:** An `AudioBuffer` is created with a single channel, the appropriate length, and a sample rate of 24000 Hz. The Float32 audio data is then copied into the `AudioBuffer`. + - **Creating an `AudioBufferSourceNode`:** An `AudioBufferSourceNode` is created, which acts as a source for the audio data. The `AudioBuffer` is assigned to the source node. + - **Connecting to Destination:** The source node is connected to the `AudioContext`'s destination (the speakers). + - **Starting Playback:** `source.start(0)` starts the playback of the audio chunk immediately. + - **`onended` Event:** A promise is used with the `onended` event of the source node to ensure that the next chunk in the queue is only played after the current chunk has finished playing. This is crucial for maintaining the correct order and avoiding overlapping audio. ### 7. Helper Functions -* **`base64ToArrayBuffer()`:** Converts a base64-encoded string to an `ArrayBuffer`. -* **`convertPCM16LEToFloat32()`:** Converts PCM16LE audio data to Float32 format. -* **`logMessage()`:** Appends a message to the `output` div on the HTML page. +- **`base64ToArrayBuffer()`:** Converts a base64-encoded string to an `ArrayBuffer`. +- **`convertPCM16LEToFloat32()`:** Converts PCM16LE audio data to Float32 format. +- **`logMessage()`:** Appends a message to the `output` div on the HTML page. ### 8. Error Handling and Connection Closure -* **`onerror` Handler:** Logs WebSocket errors to the console and displays an error message on the page. -* **`onclose` Handler:** Logs information about the connection closure. +- **`onerror` Handler:** Logs WebSocket errors to the console and displays an error message on the page. +- **`onclose` Handler:** Logs information about the connection closure. 
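
To tie sections 6 and 7 together, here is a minimal sketch of the decoding helpers and the sequential playback loop described above. The function names mirror the ones this chapter describes, but the bodies are an illustrative approximation rather than a verbatim copy of the chapter's code; the `audioContext` and `audioQueue` parameters stand in for the globals the chapter mentions.

```javascript
// Decode a base64 string (as delivered in inlineData.data) into an ArrayBuffer.
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes.buffer;
}

// Convert 16-bit little-endian PCM samples to Float32 values in [-1, 1],
// the format AudioContext works with.
function convertPCM16LEToFloat32(arrayBuffer) {
  const pcm = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    float32[i] = pcm[i] / 32768;
  }
  return float32;
}

// Play queued base64 chunks one after another so they never overlap.
async function processAudioQueue(audioContext, audioQueue) {
  while (audioQueue.length > 0) {
    const floatData = convertPCM16LEToFloat32(base64ToArrayBuffer(audioQueue.shift()));

    // One mono channel at 24000 Hz, matching the API's output sample rate.
    const buffer = audioContext.createBuffer(1, floatData.length, 24000);
    buffer.copyToChannel(floatData, 0);

    const source = audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(audioContext.destination);

    // Resolve only when this chunk has finished, then move on to the next one.
    await new Promise((resolve) => {
      source.onended = resolve;
      source.start(0);
    });
  }
}
```
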
## Summary This example demonstrates a basic text-to-speech application using the Gemini API with WebSockets. It showcases: -* Establishing a WebSocket connection and sending a setup message. -* Handling user input and sending text messages to the API. -* Receiving audio responses in base64-encoded chunks. -* Decoding and converting audio data to a format suitable for playback. -* Using `AudioContext` to play the audio in the browser sequentially, one chunk after the other. -* Implementing basic error handling and connection closure. +- Establishing a WebSocket connection and sending a setup message. +- Handling user input and sending text messages to the API. +- Receiving audio responses in base64-encoded chunks. +- Decoding and converting audio data to a format suitable for playback. +- Using `AudioContext` to play the audio in the browser sequentially, one chunk after the other. +- Implementing basic error handling and connection closure. This example provides a starting point for building more sophisticated applications that can generate audio responses from the Gemini model and play them back in real time, all within the browser environment using low-level WebSockets and `AudioContext` for audio management. The sample rate is set to 24000 Hz to match the API's output sample rate, ensuring correct playback speed and pitch. **Security Best Practices:** -For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. \ No newline at end of file +For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/index.html b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/index.html index 3789347a..cfafeaf2 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/index.html +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_04/index.html @@ -1,4 +1,4 @@ - + - - Gemini Text-to-Speech WebSocket Test - - - -
[chapter_04/index.html — markup omitted; visible page text: heading "Gemini Text-to-Speech with WebSockets"; description: "This application demonstrates real-time text-to-speech using the Gemini API. Type a message and receive an audio response that plays automatically in your browser. The app uses WebSockets for communication and AudioContext for handling audio playback."]
- - - - \ No newline at end of file + + + diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/README.md index a3b9da60..927b2e0d 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/README.md @@ -6,37 +6,37 @@ This chapter presents a real-time audio-to-audio chat application that interacts **How This Chapter Differs from Previous Chapters:** -* **Chapter 2 (Live Audio Chat with Gemini):** Utilized the Python SDK for simplifying the audio streaming, but didn't run in the browser. It handled audio-to-audio but with the assistance of the SDK's higher-level abstractions. **Importantly, Chapter 2 used a `model_speaking` flag on the client-side to prevent the model's output from being treated as input.** This chapter achieves a similar outcome through a different mechanism, relying on the API's turn management. -* **Chapter 3 (Low-Level WebSocket Interaction - Single Exchange Example):** Introduced low level WebSocket interaction but only for sending a single text query to the model. -* **Chapter 4 (Text-to-Speech with WebSockets):** Focused on text-to-speech, sending text to the API and playing back the received audio. It introduced basic audio handling but did not involve live microphone input or complex audio stream management. +- **Chapter 2 (Live Audio Chat with Gemini):** Utilized the Python SDK for simplifying the audio streaming, but didn't run in the browser. It handled audio-to-audio but with the assistance of the SDK's higher-level abstractions. **Importantly, Chapter 2 used a `model_speaking` flag on the client-side to prevent the model's output from being treated as input.** This chapter achieves a similar outcome through a different mechanism, relying on the API's turn management. +- **Chapter 3 (Low-Level WebSocket Interaction - Single Exchange Example):** Introduced low level WebSocket interaction but only for sending a single text query to the model. +- **Chapter 4 (Text-to-Speech with WebSockets):** Focused on text-to-speech, sending text to the API and playing back the received audio. It introduced basic audio handling but did not involve live microphone input or complex audio stream management. **Chapter 5, in contrast, combines the real-time nature of Chapter 2 with the low-level WebSocket approach of Chapters 3 and 4 but implements a full audio-to-audio chat entirely within the browser.** This requires handling: -* **Live Microphone Input:** Capturing and processing a continuous stream of audio data from the user's microphone. -* **Bidirectional Audio Streaming:** Sending audio chunks to the API while simultaneously receiving and playing back audio responses in real time. -* **Advanced Audio Processing:** Converting between audio formats, managing audio buffers, and ensuring smooth playback using the Web Audio API. -* **Complex State Management:** Handling interruptions, turn-taking, and potential errors in a real-time audio stream. +- **Live Microphone Input:** Capturing and processing a continuous stream of audio data from the user's microphone. +- **Bidirectional Audio Streaming:** Sending audio chunks to the API while simultaneously receiving and playing back audio responses in real time. 
+- **Advanced Audio Processing:** Converting between audio formats, managing audio buffers, and ensuring smooth playback using the Web Audio API. +- **Complex State Management:** Handling interruptions, turn-taking, and potential errors in a real-time audio stream. **Why the Increased Complexity?** The jump in complexity comes from the need to manage real-time, bidirectional audio streams directly within the browser using low-level APIs. This involves: -* **No SDK Abstraction:** We're working directly with WebSockets and handling the raw message formats defined by the Gemini API, including setup and control messages. -* **Manual Audio Handling:** We must manually capture, chunk, encode, decode, process, and play audio data, without the convenience of an SDK's built-in methods. -* **Real-time Constraints:** We need to ensure that audio is processed and played back with minimal latency to maintain a natural conversational flow. -* **Asynchronous Operations:** We rely heavily on asynchronous JavaScript and Promises to manage the non-blocking nature of WebSockets and audio processing. +- **No SDK Abstraction:** We're working directly with WebSockets and handling the raw message formats defined by the Gemini API, including setup and control messages. +- **Manual Audio Handling:** We must manually capture, chunk, encode, decode, process, and play audio data, without the convenience of an SDK's built-in methods. +- **Real-time Constraints:** We need to ensure that audio is processed and played back with minimal latency to maintain a natural conversational flow. +- **Asynchronous Operations:** We rely heavily on asynchronous JavaScript and Promises to manage the non-blocking nature of WebSockets and audio processing. ## Project Structure This chapter's application consists of the following files: -* **`index.html`:** The main HTML file that sets up the user interface (a microphone button and an output area for messages) and includes the core JavaScript logic for WebSocket communication and overall application flow. -* **`audio-recorder.js`:** Contains the `AudioRecorder` class, which handles capturing audio from the microphone, converting it to the required format, and emitting chunks of audio data using an `EventEmitter3` interface. -* **`audio-streamer.js`:** Contains the `AudioStreamer` class, which manages audio playback using the Web Audio API. It handles queuing, buffering, and playing audio chunks received from the API, ensuring smooth and continuous playback. -* **`audio-recording-worklet.js`:** Defines an `AudioWorkletProcessor` that runs in a separate thread and performs the low-level audio processing, including float32 to int16 conversion and chunking. -* **`audioworklet-registry.js`:** A utility to help register and manage `AudioWorklet`s, preventing duplicate registration. -* **`utils.js`:** Provides utility functions like `audioContext` (for creating an `AudioContext`) and `base64ToArrayBuffer` (for decoding base64 audio data). -* **`style.css`:** Contains basic CSS styles for the user interface. +- **`index.html`:** The main HTML file that sets up the user interface (a microphone button and an output area for messages) and includes the core JavaScript logic for WebSocket communication and overall application flow. +- **`audio-recorder.js`:** Contains the `AudioRecorder` class, which handles capturing audio from the microphone, converting it to the required format, and emitting chunks of audio data using an `EventEmitter3` interface. 
+- **`audio-streamer.js`:** Contains the `AudioStreamer` class, which manages audio playback using the Web Audio API. It handles queuing, buffering, and playing audio chunks received from the API, ensuring smooth and continuous playback. +- **`audio-recording-worklet.js`:** Defines an `AudioWorkletProcessor` that runs in a separate thread and performs the low-level audio processing, including float32 to int16 conversion and chunking. +- **`audioworklet-registry.js`:** A utility to help register and manage `AudioWorklet`s, preventing duplicate registration. +- **`utils.js`:** Provides utility functions like `audioContext` (for creating an `AudioContext`) and `base64ToArrayBuffer` (for decoding base64 audio data). +- **`style.css`:** Contains basic CSS styles for the user interface. ## System Architecture @@ -48,181 +48,190 @@ The audio processing pipeline in this application is crucial for real-time perfo **1. Microphone Input and `AudioRecorder`:** -* **`AudioRecorder` Class:** This class encapsulates the logic for capturing audio from the user's microphone using the browser's `MediaDevices` API (`navigator.mediaDevices.getUserMedia`). -* **`AudioWorklet`:** It utilises an `AudioWorklet` to perform audio processing in a separate thread, preventing the main thread from being blocked by computationally intensive audio operations, which is essential for maintaining a smooth user experience. -* **`audio-recording-worklet.js`:** This file defines the `AudioProcessingWorklet` class, which extends `AudioWorkletProcessor`. It performs the following: - * **Float32 to Int16 Conversion:** Converts the raw audio data from Float32 format (used by the Web Audio API) to Int16 format (required by the Gemini API for PCM audio). The conversion involves scaling the Float32 values (ranging from -1.0 to 1.0) to the Int16 range (-32768 to 32767). - ```javascript - // convert float32 -1 to 1 to int16 -32768 to 32767 - const int16Value = float32Array[i] * 32768; - ``` - * **Chunking:** Buffers audio samples and sends them in chunks. This is where the frequency of audio transmission is determined. The `buffer` has a fixed length of **2048 samples**. When the `bufferWriteIndex` reaches the end of the buffer, the `sendAndClearBuffer` function is called. The buffer is sent via `postMessage` and then cleared, ready for new data. - ```javascript - // send and clear buffer every 2048 samples, - buffer = new Int16Array(2048); - - // ... - - if(this.bufferWriteIndex >= this.buffer.length) { - this.sendAndClearBuffer(); - } +- **`AudioRecorder` Class:** This class encapsulates the logic for capturing audio from the user's microphone using the browser's `MediaDevices` API (`navigator.mediaDevices.getUserMedia`). +- **`AudioWorklet`:** It utilises an `AudioWorklet` to perform audio processing in a separate thread, preventing the main thread from being blocked by computationally intensive audio operations, which is essential for maintaining a smooth user experience. +- **`audio-recording-worklet.js`:** This file defines the `AudioProcessingWorklet` class, which extends `AudioWorkletProcessor`. It performs the following: - // ... + - **Float32 to Int16 Conversion:** Converts the raw audio data from Float32 format (used by the Web Audio API) to Int16 format (required by the Gemini API for PCM audio). The conversion involves scaling the Float32 values (ranging from -1.0 to 1.0) to the Int16 range (-32768 to 32767). 
+ ```javascript + // convert float32 -1 to 1 to int16 -32768 to 32767 + const int16Value = float32Array[i] * 32768; + ``` + - **Chunking:** Buffers audio samples and sends them in chunks. This is where the frequency of audio transmission is determined. The `buffer` has a fixed length of **2048 samples**. When the `bufferWriteIndex` reaches the end of the buffer, the `sendAndClearBuffer` function is called. The buffer is sent via `postMessage` and then cleared, ready for new data. - sendAndClearBuffer() { - this.port.postMessage({ - event: "chunk", - data: { - int16arrayBuffer: this.buffer.slice(0, this.bufferWriteIndex).buffer, - }, - }); - this.bufferWriteIndex = 0; - } - ``` - **At the input sample rate of 16000 Hz, a chunk of 2048 samples is created and sent approximately every 128 milliseconds (2048 / 16000 = 0.128 seconds).** -* **EventEmitter3:** The `AudioRecorder` class extends `EventEmitter3`, allowing it to emit events. Specifically, it emits a `data` event whenever a chunk of audio data is ready to be sent. Other parts of the application can listen for this event to receive the audio data. -* **`start()` and `stop()` Methods:** These methods control the recording process, starting and stopping the microphone capture and managing the associated resources. + ```javascript + // send and clear buffer every 2048 samples, + buffer = new Int16Array(2048); + + // ... + + if(this.bufferWriteIndex >= this.buffer.length) { + this.sendAndClearBuffer(); + } + + // ... + + sendAndClearBuffer() { + this.port.postMessage({ + event: "chunk", + data: { + int16arrayBuffer: this.buffer.slice(0, this.bufferWriteIndex).buffer, + }, + }); + this.bufferWriteIndex = 0; + } + ``` + + **At the input sample rate of 16000 Hz, a chunk of 2048 samples is created and sent approximately every 128 milliseconds (2048 / 16000 = 0.128 seconds).** + +- **EventEmitter3:** The `AudioRecorder` class extends `EventEmitter3`, allowing it to emit events. Specifically, it emits a `data` event whenever a chunk of audio data is ready to be sent. Other parts of the application can listen for this event to receive the audio data. +- **`start()` and `stop()` Methods:** These methods control the recording process, starting and stopping the microphone capture and managing the associated resources. **2. WebSocket Communication (`index.html`)** -* **`ws.onopen`:** Sends the initial `setup` message to the Gemini API, specifying the model, audio output as the response modality, and the desired voice. -* **`ws.onmessage`:** Handles incoming messages from the API: - * **`setupComplete`:** Enables the microphone button, indicating that the connection is ready. - * **`serverContent`:** Processes audio data, handles interruptions, and sends continuation signals as needed. -* **`sendAudioChunk()`:** This function is triggered by the `data` event emitted by the `AudioRecorder`. It takes a chunk of audio data (which has already been converted to Int16 and then to base64 in the `AudioRecorder`), constructs a `realtime_input` message, and sends it to the API via `ws.send()`. The message format adheres to the `BidiGenerateContentRealtimeInput` structure defined in the API documentation. -* **`sendEndMessage()` and `sendContinueSignal()`:** These are crucial for managing the conversation flow. - * **`sendEndMessage()`:** Sends a message with `turn_complete: true` when the user stops recording (by clicking the "Stop Mic" button). This signals to the API that the user's turn is finished. 
- ```javascript - const message = { - client_content: { - turns: [{ - role: "user", - parts: [] // no more audio for this turn - }], - turn_complete: true // end of turn - } - }; - ``` - * **`sendContinueSignal()`:** Sends a message with `turn_complete: false` immediately after receiving an audio chunk from the model, *unless* the model indicates `turnComplete: true`. This serves as a keep-alive, letting the API know that the client is still listening and ready for more audio data. This is important for the low-latency, real-time nature of the interaction. +- **`ws.onopen`:** Sends the initial `setup` message to the Gemini API, specifying the model, audio output as the response modality, and the desired voice. +- **`ws.onmessage`:** Handles incoming messages from the API: + - **`setupComplete`:** Enables the microphone button, indicating that the connection is ready. + - **`serverContent`:** Processes audio data, handles interruptions, and sends continuation signals as needed. +- **`sendAudioChunk()`:** This function is triggered by the `data` event emitted by the `AudioRecorder`. It takes a chunk of audio data (which has already been converted to Int16 and then to base64 in the `AudioRecorder`), constructs a `realtime_input` message, and sends it to the API via `ws.send()`. The message format adheres to the `BidiGenerateContentRealtimeInput` structure defined in the API documentation. +- **`sendEndMessage()` and `sendContinueSignal()`:** These are crucial for managing the conversation flow. + - **`sendEndMessage()`:** Sends a message with `turn_complete: true` when the user stops recording (by clicking the "Stop Mic" button). This signals to the API that the user's turn is finished. ```javascript - const message = { - client_content: { - turns: [{ - role: "user", - parts: [] // no more audio for this turn - }], - turn_complete: false // not the end of turn, keep going - } - }; + const message = { + client_content: { + turns: [ + { + role: "user", + parts: [], // no more audio for this turn + }, + ], + turn_complete: true, // end of turn + }, + }; ``` -* **`toggleMicrophone()`:** Starts and stops the recording process, calling the appropriate methods in `AudioRecorder`. + - **`sendContinueSignal()`:** Sends a message with `turn_complete: false` immediately after receiving an audio chunk from the model, _unless_ the model indicates `turnComplete: true`. This serves as a keep-alive, letting the API know that the client is still listening and ready for more audio data. This is important for the low-latency, real-time nature of the interaction. + ```javascript + const message = { + client_content: { + turns: [ + { + role: "user", + parts: [], // no more audio for this turn + }, + ], + turn_complete: false, // not the end of turn, keep going + }, + }; + ``` +- **`toggleMicrophone()`:** Starts and stops the recording process, calling the appropriate methods in `AudioRecorder`. **3. Audio Playback and `AudioStreamer`:** -* **`AudioStreamer` Class:** This class manages the playback of audio chunks received from the Gemini API. -* **`AudioContext`:** It utilizes the Web Audio API's `AudioContext` for handling audio playback. The `AudioContext` is initialized only when the first audio chunk is received to comply with browser autoplay policies. It sets a sample rate of 24000 Hz. - * **Lazy Initialization:** The `AudioContext` is only created when the first audio chunk is received. This is because some browsers restrict audio playback unless it's initiated by a user action. 
- * **Sample Rate:** The sample rate is set to 24000 Hz, which is a common sample rate for speech audio. -* **`addPCM16()`:** This method receives PCM16 audio chunks, converts them back to Float32, creates `AudioBuffer` objects, and adds them to an internal queue (`audioQueue`). -* **`playNextBuffer()`:** This method retrieves audio buffers from the queue and plays them using an `AudioBufferSourceNode`. It ensures that chunks are played sequentially, one after the other, using the `onended` event of the source node and a small delay. -* **`isPlaying` Flag:** This flag tracks whether audio is currently being played, preventing overlapping playback. -* **`stop()` and `resume()`:** These methods provide control over stopping and resuming audio playback. -* **`complete()`:** This method is called to signal the end of an audio stream, allowing any remaining buffers in the queue to be played out. -* **Stall Detection:** Implements a mechanism to detect and recover from playback stalls, ensuring continuous audio flow. The `checkPlaybackStatus()` function periodically checks if audio playback has stalled (by comparing the current time with the last playback time). If a stall is detected and there are still buffers in the queue, it attempts to restart playback by calling `playNextBuffer()`. This is a safety net to handle situations where the `onended` event might not fire reliably or if there are unexpected delays in audio processing. - ```javascript - checkPlaybackStatus() { - // Clear any existing timeout - if (this.playbackTimeout) { - clearTimeout(this.playbackTimeout); +- **`AudioStreamer` Class:** This class manages the playback of audio chunks received from the Gemini API. +- **`AudioContext`:** It utilizes the Web Audio API's `AudioContext` for handling audio playback. The `AudioContext` is initialized only when the first audio chunk is received to comply with browser autoplay policies. It sets a sample rate of 24000 Hz. + - **Lazy Initialization:** The `AudioContext` is only created when the first audio chunk is received. This is because some browsers restrict audio playback unless it's initiated by a user action. + - **Sample Rate:** The sample rate is set to 24000 Hz, which is a common sample rate for speech audio. +- **`addPCM16()`:** This method receives PCM16 audio chunks, converts them back to Float32, creates `AudioBuffer` objects, and adds them to an internal queue (`audioQueue`). +- **`playNextBuffer()`:** This method retrieves audio buffers from the queue and plays them using an `AudioBufferSourceNode`. It ensures that chunks are played sequentially, one after the other, using the `onended` event of the source node and a small delay. +- **`isPlaying` Flag:** This flag tracks whether audio is currently being played, preventing overlapping playback. +- **`stop()` and `resume()`:** These methods provide control over stopping and resuming audio playback. +- **`complete()`:** This method is called to signal the end of an audio stream, allowing any remaining buffers in the queue to be played out. +- **Stall Detection:** Implements a mechanism to detect and recover from playback stalls, ensuring continuous audio flow. The `checkPlaybackStatus()` function periodically checks if audio playback has stalled (by comparing the current time with the last playback time). If a stall is detected and there are still buffers in the queue, it attempts to restart playback by calling `playNextBuffer()`. 
This is a safety net to handle situations where the `onended` event might not fire reliably or if there are unexpected delays in audio processing. + + ```javascript + checkPlaybackStatus() { + // Clear any existing timeout + if (this.playbackTimeout) { + clearTimeout(this.playbackTimeout); + } + + // Set a new timeout to check playback status + this.playbackTimeout = setTimeout(() => { + const now = this.context.currentTime; + const timeSinceLastPlayback = now - this.lastPlaybackTime; + + // If more than 1 second has passed since last playback and we have buffers to play + if (timeSinceLastPlayback > 1 && this.audioQueue.length > 0 && this.isPlaying) { + console.log('Playback appears to have stalled, restarting...'); + this.playNextBuffer(); } - // Set a new timeout to check playback status - this.playbackTimeout = setTimeout(() => { - const now = this.context.currentTime; - const timeSinceLastPlayback = now - this.lastPlaybackTime; - - // If more than 1 second has passed since last playback and we have buffers to play - if (timeSinceLastPlayback > 1 && this.audioQueue.length > 0 && this.isPlaying) { - console.log('Playback appears to have stalled, restarting...'); - this.playNextBuffer(); - } - - // Continue checking if we're still playing - if (this.isPlaying) { - this.checkPlaybackStatus(); - } - }, 1000); - } - ``` + // Continue checking if we're still playing + if (this.isPlaying) { + this.checkPlaybackStatus(); + } + }, 1000); + } + ``` **4. Interruption Handling:** -* **Detection:** The API signals an interruption by sending a `serverContent` message with the `interrupted` flag set to `true`. This typically happens when the API's VAD detects speech from the user while the model is still speaking. - ```javascript - if (wsResponse.serverContent.interrupted) { - logMessage('Gemini: Interrupted'); - isInterrupted = true; - audioStreamer.stop(); - return; - } - ``` -* **Client-Side Handling:** When the `interrupted` flag is received: - 1. The `isInterrupted` flag is set to `true`. - 2. The `AudioStreamer`'s `stop()` method is called to immediately halt any ongoing audio playback. This ensures that the interrupted audio is not played. -* **Latency:** The latency for interruption detection is primarily determined by the API's VAD and the network latency. The client-side processing adds minimal delay. On a fast connection, the interruption should feel near-instantaneous. -* **No Specific Parameter:** There is no specific parameter in this code to tune the interruption sensitivity, as that is primarily controlled by the API's VAD. -* **Effects of Changing VAD (if possible):** If the API provided a way to adjust VAD sensitivity (which it currently doesn't for the Multimodal Live API), the effects would be: - * **More Sensitive VAD:** Interruptions would be triggered more easily, potentially leading to a more responsive but also more "jumpy" conversation. - * **Less Sensitive VAD:** The model would be more likely to finish its turn, but it might feel less responsive to user interruptions. +- **Detection:** The API signals an interruption by sending a `serverContent` message with the `interrupted` flag set to `true`. This typically happens when the API's VAD detects speech from the user while the model is still speaking. + ```javascript + if (wsResponse.serverContent.interrupted) { + logMessage("Gemini: Interrupted"); + isInterrupted = true; + audioStreamer.stop(); + return; + } + ``` +- **Client-Side Handling:** When the `interrupted` flag is received: + 1. 
The `isInterrupted` flag is set to `true`. + 2. The `AudioStreamer`'s `stop()` method is called to immediately halt any ongoing audio playback. This ensures that the interrupted audio is not played. +- **Latency:** The latency for interruption detection is primarily determined by the API's VAD and the network latency. The client-side processing adds minimal delay. On a fast connection, the interruption should feel near-instantaneous. +- **No Specific Parameter:** There is no specific parameter in this code to tune the interruption sensitivity, as that is primarily controlled by the API's VAD. +- **Effects of Changing VAD (if possible):** If the API provided a way to adjust VAD sensitivity (which it currently doesn't for the Multimodal Live API), the effects would be: + - **More Sensitive VAD:** Interruptions would be triggered more easily, potentially leading to a more responsive but also more "jumpy" conversation. + - **Less Sensitive VAD:** The model would be more likely to finish its turn, but it might feel less responsive to user interruptions. **5. Preventing Feedback Loop (No Echo):** In Chapter 2 with the Python SDK we introduced a `model_speaking` flag to prevent to model from listening to itself. In this chapter, we achieve this without an explicit flag on the client-side, **relying on the API's built-in turn management capabilities.** Here's how it works: -* **Turn Detection:** The Gemini API uses its Voice Activity Detection (VAD) to determine when a user's turn begins and ends. When the user starts speaking, the VAD detects this as the start of a turn. When the user stops speaking for a certain duration (a pause), the VAD determines that the user's turn has ended. +- **Turn Detection:** The Gemini API uses its Voice Activity Detection (VAD) to determine when a user's turn begins and ends. When the user starts speaking, the VAD detects this as the start of a turn. When the user stops speaking for a certain duration (a pause), the VAD determines that the user's turn has ended. -* **`turn_complete` Signal:** The `turn_complete: true` signal sent in the `sendEndMessage()` function after the user stops speaking explicitly tells the API that the user's turn is over. This is important for the API to properly segment the conversation. The sending of this signal is directly tied to the user clicking the "Stop Mic" button, which in turn is only clickable when the user is speaking. This means the user has control when a turn ends. +- **`turn_complete` Signal:** The `turn_complete: true` signal sent in the `sendEndMessage()` function after the user stops speaking explicitly tells the API that the user's turn is over. This is important for the API to properly segment the conversation. The sending of this signal is directly tied to the user clicking the "Stop Mic" button, which in turn is only clickable when the user is speaking. This means the user has control when a turn ends. -* **API-Side Management:** The API manages the conversation flow internally, ensuring that the model only processes audio input that is considered part of the user's turn. The model does not start generating its response until the user's turn is deemed complete (either by `turn_complete: true` or by the VAD detecting a sufficiently long pause). +- **API-Side Management:** The API manages the conversation flow internally, ensuring that the model only processes audio input that is considered part of the user's turn. 
The model does not start generating its response until the user's turn is deemed complete (either by `turn_complete: true` or by the VAD detecting a sufficiently long pause). -* **`sendContinueSignal()`:** The `sendContinueSignal()` function sends `turn_complete: false` after model audio is received unless the model indicated `turn_complete: true`. This is important. Without that the model would not continue to speak if the generated audio takes longer than the VAD's pause detection. +- **`sendContinueSignal()`:** The `sendContinueSignal()` function sends `turn_complete: false` after model audio is received unless the model indicated `turn_complete: true`. This is important. Without that the model would not continue to speak if the generated audio takes longer than the VAD's pause detection. Essentially, the API is designed to handle the "listen while speaking" scenario gracefully. It's not simply feeding the output audio back into the input. The VAD and turn management logic ensure that the model only processes audio it considers as user input. **6. Audio Streaming and Context Window:** -* **Continuous Streaming:** As long as the microphone is active and the user is speaking, audio data is continuously sent to the Gemini API in chunks. This is necessary for real-time interaction. -* **Chunk Size and Data Rate:** - * Each chunk contains 2048 samples of 16-bit PCM audio. - * Each sample is 2 bytes (16 bits = 2 bytes). - * Therefore, each chunk is 2048 samples * 2 bytes/sample = 4096 bytes. - * Chunks are sent roughly every 128 milliseconds. - * This translates to a data rate of approximately 4096 bytes / 0.128 seconds = 32 KB/s (kilobytes per second). - * **VAD and Turn Boundaries:** The API's VAD plays a crucial role in determining the boundaries of a turn. When VAD detects a significant enough pause in the user's speech, it considers the turn to be over, and the model generates a response based on that segment of audio. - * **Practical Implications:** For a natural conversational flow, it's generally a good practice to keep your utterances relatively concise and allow for turn-taking. This helps the API process the audio effectively and generate relevant responses. +- **Continuous Streaming:** As long as the microphone is active and the user is speaking, audio data is continuously sent to the Gemini API in chunks. This is necessary for real-time interaction. +- **Chunk Size and Data Rate:** + - Each chunk contains 2048 samples of 16-bit PCM audio. + - Each sample is 2 bytes (16 bits = 2 bytes). + - Therefore, each chunk is 2048 samples \* 2 bytes/sample = 4096 bytes. + - Chunks are sent roughly every 128 milliseconds. + - This translates to a data rate of approximately 4096 bytes / 0.128 seconds = 32 KB/s (kilobytes per second). + - **VAD and Turn Boundaries:** The API's VAD plays a crucial role in determining the boundaries of a turn. When VAD detects a significant enough pause in the user's speech, it considers the turn to be over, and the model generates a response based on that segment of audio. + - **Practical Implications:** For a natural conversational flow, it's generally a good practice to keep your utterances relatively concise and allow for turn-taking. This helps the API process the audio effectively and generate relevant responses. **7. User Interface (`index.html`)** -* **"Start Mic"/"Stop Mic" Button:** This button controls the microphone recording. Its text toggles between "Start Mic" and "Stop Mic" depending on the recording state. 
-* **Output Area:** The `div` with the ID `output` is used to display messages to the user, such as "Recording started...", "Recording stopped...", "Gemini: Speaking...", and "Gemini: Finished speaking". -* **Visual Feedback:** The UI provides basic visual feedback about the state of the application (recording, playing audio, etc.). -* **Initial State:** When the page loads, the microphone button is disabled. It is only enabled after the WebSocket connection is successfully established and the setup message exchange is complete. +- **"Start Mic"/"Stop Mic" Button:** This button controls the microphone recording. Its text toggles between "Start Mic" and "Stop Mic" depending on the recording state. +- **Output Area:** The `div` with the ID `output` is used to display messages to the user, such as "Recording started...", "Recording stopped...", "Gemini: Speaking...", and "Gemini: Finished speaking". +- **Visual Feedback:** The UI provides basic visual feedback about the state of the application (recording, playing audio, etc.). +- **Initial State:** When the page loads, the microphone button is disabled. It is only enabled after the WebSocket connection is successfully established and the setup message exchange is complete. **8. Debugging** -* **Browser Developer Tools:** The primary tool for debugging this application is your browser's developer tools (usually accessed by pressing F12). - * **Console:** Use the console to view `console.log` messages, errors, and warnings. The code includes numerous `console.log` statements to help you track the flow of execution and the data being processed. - * **Network Tab:** Use the Network tab to monitor WebSocket traffic. You can inspect the individual messages being sent and received, including their contents and timing. This is invaluable for understanding the communication with the API. - * **Debugger:** Use the JavaScript debugger to set breakpoints, step through the code, inspect variables, and analyze the call stack. -* **`logMessage()` Function:** This function provides a simple way to display messages in the `output` div on the page, providing visual feedback within the application itself. +- **Browser Developer Tools:** The primary tool for debugging this application is your browser's developer tools (usually accessed by pressing F12). + - **Console:** Use the console to view `console.log` messages, errors, and warnings. The code includes numerous `console.log` statements to help you track the flow of execution and the data being processed. + - **Network Tab:** Use the Network tab to monitor WebSocket traffic. You can inspect the individual messages being sent and received, including their contents and timing. This is invaluable for understanding the communication with the API. + - **Debugger:** Use the JavaScript debugger to set breakpoints, step through the code, inspect variables, and analyze the call stack. +- **`logMessage()` Function:** This function provides a simple way to display messages in the `output` div on the page, providing visual feedback within the application itself. **9. Further Considerations** -* **Error Handling:** The code includes basic error handling, but it could be made more robust by handling specific error codes or messages from the API and providing more informative feedback to the user. -* **Security:** The API key is currently hardcoded in the HTML file. For production, you should **never** expose your API key directly in client-side code. 
Instead, use a secure backend server to handle authentication and proxy requests to the API. -* **Scalability:** This example is designed for a single user. For a multi-user scenario, you would need to manage multiple WebSocket connections and potentially use a server-side component to handle user sessions and routing. -* **Audio Quality:** The audio quality depends on the microphone, network conditions, and the API's processing. You can experiment with different sample rates and chunk sizes, but these values are often constrained by the API's requirements and the need to balance latency and bandwidth. -* **Network Latency:** Network latency can significantly impact the real-time performance of the application. There's no single solution to mitigate network latency, but using a server closer to the user's location and optimizing the audio processing pipeline can help. -* **Audio Level:** There is a `gainNode` to allow for controlling the volume of the output audio in the `AudioStreamer`. This is not used yet but could be exposed to the user through the UI if needed. +- **Error Handling:** The code includes basic error handling, but it could be made more robust by handling specific error codes or messages from the API and providing more informative feedback to the user. +- **Security:** The API key is currently hardcoded in the HTML file. For production, you should **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. +- **Scalability:** This example is designed for a single user. For a multi-user scenario, you would need to manage multiple WebSocket connections and potentially use a server-side component to handle user sessions and routing. +- **Audio Quality:** The audio quality depends on the microphone, network conditions, and the API's processing. You can experiment with different sample rates and chunk sizes, but these values are often constrained by the API's requirements and the need to balance latency and bandwidth. +- **Network Latency:** Network latency can significantly impact the real-time performance of the application. There's no single solution to mitigate network latency, but using a server closer to the user's location and optimizing the audio processing pipeline can help. +- **Audio Level:** There is a `gainNode` to allow for controlling the volume of the output audio in the `AudioStreamer`. This is not used yet but could be exposed to the user through the UI if needed. ## Web Audio API @@ -230,67 +239,67 @@ The Web Audio API is a high-level JavaScript API for processing and synthesizing **Key Concepts:** -* **`AudioContext`:** The primary interface for working with the Web Audio API. It represents an audio-processing graph built from audio nodes. You can only have one `AudioContext` per document. Think of it as the container or the manager for all audio operations. -* **Audio Nodes:** Building blocks of the audio graph. They perform specific audio processing tasks. Examples include: - * **`AudioBufferSourceNode`:** Represents an audio source consisting of in-memory audio data stored in an `AudioBuffer`. Used here to play the audio chunks received from the API. - * **`MediaStreamAudioSourceNode`:** Represents an audio source consisting of a `MediaStream` (e.g., from a microphone). Used here to capture audio from the microphone. - * **`GainNode`:** Controls the volume (gain) of the audio signal. Used here for potential volume adjustments. 
- * **`AudioWorkletNode`:** A special type of node that allows you to run custom audio processing JavaScript code in a separate thread (the audio rendering thread). This is essential for real-time audio processing as it prevents blocking the main thread and causing glitches. Used here (`audio-recording-worklet.js`) to handle audio chunking and format conversion in a separate thread. -* **`AudioBuffer`:** Represents a short audio asset residing in memory. Used to hold the audio data of each chunk. -* **`AudioParam`:** Represents a parameter of an audio node (e.g., the gain of a `GainNode`). Can be automated over time. -* **`AudioWorklet`:** Enables developers to write custom audio processing scripts that run in a separate thread. This is crucial for performance-sensitive audio applications, as it ensures that audio processing doesn't block the main thread and cause glitches or delays. `AudioWorklet`s are defined in separate JavaScript files (like `audio-recording-worklet.js`) and are added to the `AudioContext` using `audioContext.audioWorklet.addModule()`. +- **`AudioContext`:** The primary interface for working with the Web Audio API. It represents an audio-processing graph built from audio nodes. You can only have one `AudioContext` per document. Think of it as the container or the manager for all audio operations. +- **Audio Nodes:** Building blocks of the audio graph. They perform specific audio processing tasks. Examples include: + - **`AudioBufferSourceNode`:** Represents an audio source consisting of in-memory audio data stored in an `AudioBuffer`. Used here to play the audio chunks received from the API. + - **`MediaStreamAudioSourceNode`:** Represents an audio source consisting of a `MediaStream` (e.g., from a microphone). Used here to capture audio from the microphone. + - **`GainNode`:** Controls the volume (gain) of the audio signal. Used here for potential volume adjustments. + - **`AudioWorkletNode`:** A special type of node that allows you to run custom audio processing JavaScript code in a separate thread (the audio rendering thread). This is essential for real-time audio processing as it prevents blocking the main thread and causing glitches. Used here (`audio-recording-worklet.js`) to handle audio chunking and format conversion in a separate thread. +- **`AudioBuffer`:** Represents a short audio asset residing in memory. Used to hold the audio data of each chunk. +- **`AudioParam`:** Represents a parameter of an audio node (e.g., the gain of a `GainNode`). Can be automated over time. +- **`AudioWorklet`:** Enables developers to write custom audio processing scripts that run in a separate thread. This is crucial for performance-sensitive audio applications, as it ensures that audio processing doesn't block the main thread and cause glitches or delays. `AudioWorklet`s are defined in separate JavaScript files (like `audio-recording-worklet.js`) and are added to the `AudioContext` using `audioContext.audioWorklet.addModule()`. **How This Application Uses the Web Audio API:** -* **`AudioContext`:** An `AudioContext` is created to manage the entire audio graph. It's initialized with a sample rate of 24000 Hz, matching the API's output sample rate. -* **`AudioWorkletNode`:** An `AudioWorkletNode` is used to run the `AudioProcessingWorklet` defined in `audio-recording-worklet.js`. This handles the real-time processing of microphone input, converting it to Int16 format and dividing it into chunks. 
-* **`AudioBufferSourceNode`:** An `AudioBufferSourceNode` is created for each audio chunk received from the API. The audio data is decoded, converted to Float32, and then used to create an `AudioBuffer` that is assigned to the source node. -* **`MediaStreamAudioSourceNode`:** A `MediaStreamAudioSourceNode` is created to capture the audio stream from the user's microphone. -* **`GainNode`:** A `GainNode` is connected to the output for potential volume control. -* **Connections:** The nodes are connected: `MediaStreamAudioSourceNode` -> `AudioWorkletNode` (for input processing), and `AudioBufferSourceNode` -> `GainNode` -> `AudioContext.destination` (for output). +- **`AudioContext`:** An `AudioContext` is created to manage the entire audio graph. It's initialized with a sample rate of 24000 Hz, matching the API's output sample rate. +- **`AudioWorkletNode`:** An `AudioWorkletNode` is used to run the `AudioProcessingWorklet` defined in `audio-recording-worklet.js`. This handles the real-time processing of microphone input, converting it to Int16 format and dividing it into chunks. +- **`AudioBufferSourceNode`:** An `AudioBufferSourceNode` is created for each audio chunk received from the API. The audio data is decoded, converted to Float32, and then used to create an `AudioBuffer` that is assigned to the source node. +- **`MediaStreamAudioSourceNode`:** A `MediaStreamAudioSourceNode` is created to capture the audio stream from the user's microphone. +- **`GainNode`:** A `GainNode` is connected to the output for potential volume control. +- **Connections:** The nodes are connected: `MediaStreamAudioSourceNode` -> `AudioWorkletNode` (for input processing), and `AudioBufferSourceNode` -> `GainNode` -> `AudioContext.destination` (for output). **Audio Queueing and Buffering:** -* **`audioQueue`:** This array in `AudioStreamer` acts as a queue for incoming audio chunks. Chunks are added to the queue as they are received from the API. -* **`playNextBuffer()`:** This function retrieves and plays buffers from the queue sequentially. It uses the `onended` event of the `AudioBufferSourceNode` to trigger the playback of the next chunk, ensuring a continuous stream. -* **Buffering:** The Web Audio API internally handles some buffering, but the `audioQueue` provides an additional layer of buffering to smooth out any irregularities in the arrival of audio chunks. +- **`audioQueue`:** This array in `AudioStreamer` acts as a queue for incoming audio chunks. Chunks are added to the queue as they are received from the API. +- **`playNextBuffer()`:** This function retrieves and plays buffers from the queue sequentially. It uses the `onended` event of the `AudioBufferSourceNode` to trigger the playback of the next chunk, ensuring a continuous stream. +- **Buffering:** The Web Audio API internally handles some buffering, but the `audioQueue` provides an additional layer of buffering to smooth out any irregularities in the arrival of audio chunks. **Batched Sending:** -* The term "batching" isn't explicitly used in the code, but the concept is present in how audio chunks are created and sent. The `AudioWorklet` buffers 2048 samples before sending a chunk. This can be considered a form of batching, as it sends data in discrete units rather than a continuous stream of individual samples. This approach balances the need for real-time responsiveness with the efficiency of sending data in larger packets. 
+- The term "batching" isn't explicitly used in the code, but the concept is present in how audio chunks are created and sent. The `AudioWorklet` buffers 2048 samples before sending a chunk. This can be considered a form of batching, as it sends data in discrete units rather than a continuous stream of individual samples. This approach balances the need for real-time responsiveness with the efficiency of sending data in larger packets. ## Configuration and Parameters The following parameters and values are used in this application and can be customized: -* **`model`:** `"models/gemini-2.0-flash-exp"` (specifies the Gemini model). -* **`response_modalities`:** `["audio"]` (requests audio output from the API). -* **`speech_config`:** - * **`voice_config`**: - * **`prebuilt_voice_config`**: - * **`voice_name`**: `Aoede` (specifies which voice to use). - Possible values: `Aoede`, `Charon`, `Fenrir`, `Kore`, `Puck` -* **`sampleRate`:** - The sample rate is set to 16000 Hz for the input and 24000 Hz for the output. This is dictated by the API's requirements. - * **Input (Microphone):** 16000 Hz (set in `audio-recorder.js`). This is a common sample rate for speech recognition. - * **Why 16000 Hz for input?** 16000 Hz is a standard sample rate for speech processing and is often used in speech recognition systems because it captures most of the relevant frequency information in human speech while keeping computational costs manageable. Using a higher sample rate for input might not provide significant improvements in speech recognition accuracy for this application. - * **Output (API):** 24000 Hz (specified in the API documentation and when creating the `AudioContext`). This is a higher sample rate, providing better audio quality for playback. - * **Why 24000 Hz for output?** 24000 Hz is chosen because it's the sample rate at which the API provides audio output. Using this rate ensures that the audio is played back at the correct speed and pitch. -* **`CHUNK_SIZE` (in `audio-recording-worklet.js`):** 2048 samples. This determines the size of the audio chunks sent to the API. It represents a good balance between latency and processing overhead. - * **Calculation:** With a sample rate of 16000 Hz, a 2048-sample chunk corresponds to 2048 / 16000 = 0.128 seconds, or 128 milliseconds. - * **Why 2048 samples per chunk?** This value is chosen to balance the need for low latency with the overhead of sending frequent messages. Smaller chunks would result in lower latency but would increase the number of messages sent to the API, potentially leading to higher processing overhead and network congestion. Larger chunks would reduce the frequency of messages but increase latency. - * **Effects of Changing `CHUNK_SIZE`:** - * **Smaller `CHUNK_SIZE` (e.g., 1024 samples):** - * **Pros:** Lower latency (around 64 milliseconds per chunk). The application would feel more responsive. - * **Cons:** Increased processing overhead on both the client and server sides due to more frequent message sending and handling. Increased network traffic. The audio might also start to sound choppy and distorted due to potential buffer underruns. - * **Larger `CHUNK_SIZE` (e.g., 4096 samples):** - * **Pros:** Reduced processing overhead and network traffic. - * **Cons:** Higher latency (around 256 milliseconds per chunk). The application would feel less responsive, and the conversation might feel sluggish. -* **Audio Format:** - * **Input:** The microphone provides audio data in Float32 format. 
- * **API Input:** The API expects audio data in 16-bit linear PCM (Int16) format, little-endian. - * **API Output:** The API provides audio data in base64-encoded 16-bit linear PCM (Int16) format, little-endian. - * **Output:** The `AudioContext` works with Float32 audio data. +- **`model`:** `"models/gemini-2.0-flash-exp"` (specifies the Gemini model). +- **`response_modalities`:** `["audio"]` (requests audio output from the API). +- **`speech_config`:** + - **`voice_config`**: + - **`prebuilt_voice_config`**: + - **`voice_name`**: `Aoede` (specifies which voice to use). + Possible values: `Aoede`, `Charon`, `Fenrir`, `Kore`, `Puck` +- **`sampleRate`:** + The sample rate is set to 16000 Hz for the input and 24000 Hz for the output. This is dictated by the API's requirements. + - **Input (Microphone):** 16000 Hz (set in `audio-recorder.js`). This is a common sample rate for speech recognition. + - **Why 16000 Hz for input?** 16000 Hz is a standard sample rate for speech processing and is often used in speech recognition systems because it captures most of the relevant frequency information in human speech while keeping computational costs manageable. Using a higher sample rate for input might not provide significant improvements in speech recognition accuracy for this application. + - **Output (API):** 24000 Hz (specified in the API documentation and when creating the `AudioContext`). This is a higher sample rate, providing better audio quality for playback. + - **Why 24000 Hz for output?** 24000 Hz is chosen because it's the sample rate at which the API provides audio output. Using this rate ensures that the audio is played back at the correct speed and pitch. +- **`CHUNK_SIZE` (in `audio-recording-worklet.js`):** 2048 samples. This determines the size of the audio chunks sent to the API. It represents a good balance between latency and processing overhead. + - **Calculation:** With a sample rate of 16000 Hz, a 2048-sample chunk corresponds to 2048 / 16000 = 0.128 seconds, or 128 milliseconds. + - **Why 2048 samples per chunk?** This value is chosen to balance the need for low latency with the overhead of sending frequent messages. Smaller chunks would result in lower latency but would increase the number of messages sent to the API, potentially leading to higher processing overhead and network congestion. Larger chunks would reduce the frequency of messages but increase latency. + - **Effects of Changing `CHUNK_SIZE`:** + - **Smaller `CHUNK_SIZE` (e.g., 1024 samples):** + - **Pros:** Lower latency (around 64 milliseconds per chunk). The application would feel more responsive. + - **Cons:** Increased processing overhead on both the client and server sides due to more frequent message sending and handling. Increased network traffic. The audio might also start to sound choppy and distorted due to potential buffer underruns. + - **Larger `CHUNK_SIZE` (e.g., 4096 samples):** + - **Pros:** Reduced processing overhead and network traffic. + - **Cons:** Higher latency (around 256 milliseconds per chunk). The application would feel less responsive, and the conversation might feel sluggish. +- **Audio Format:** + - **Input:** The microphone provides audio data in Float32 format. + - **API Input:** The API expects audio data in 16-bit linear PCM (Int16) format, little-endian. + - **API Output:** The API provides audio data in base64-encoded 16-bit linear PCM (Int16) format, little-endian. + - **Output:** The `AudioContext` works with Float32 audio data. 
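Putting these parameters together, the initial `setup` message sent in `ws.onopen` might look roughly like the sketch below. The exact nesting — in particular the `setup` wrapper and the `generation_config` block — is an assumption based on the Multimodal Live API message format, not a copy of the chapter's code, so consult the API reference for the authoritative schema:

```javascript
// Illustrative setup message assembled from the values listed above.
const setupMessage = {
  setup: {
    model: "models/gemini-2.0-flash-exp",
    generation_config: {
      response_modalities: ["audio"],
      speech_config: {
        voice_config: {
          prebuilt_voice_config: {
            voice_name: "Aoede", // or Charon, Fenrir, Kore, Puck
          },
        },
      },
    },
  },
};

// Sent as the first message once the WebSocket connection is open.
ws.send(JSON.stringify(setupMessage));
```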
## Lessons Learned and Best Practices @@ -298,55 +307,55 @@ Through the development of this real-time audio streaming application, several i ### Audio Context Setup -* **Lazy Initialization:** Initialize the `AudioContext` only when needed, typically in response to a user interaction, to comply with browser autoplay policies. +- **Lazy Initialization:** Initialize the `AudioContext` only when needed, typically in response to a user interaction, to comply with browser autoplay policies. ### Audio Buffer Management -* **Avoid Fixed Buffer Sizes:** Instead of using fixed buffer sizes and trying to manage partial buffers, adapt to the natural chunk sizes provided by the API. Process each chunk as it arrives. This simplifies buffer management and improves playback smoothness. -* **Don't Overcomplicate:** Simple sequential playback using the `onended` event is often more effective and easier to manage than complex scheduling logic. +- **Avoid Fixed Buffer Sizes:** Instead of using fixed buffer sizes and trying to manage partial buffers, adapt to the natural chunk sizes provided by the API. Process each chunk as it arrives. This simplifies buffer management and improves playback smoothness. +- **Don't Overcomplicate:** Simple sequential playback using the `onended` event is often more effective and easier to manage than complex scheduling logic. ### PCM16 Data Handling -* **Correct Conversion:** Ensure that PCM16 data is correctly interpreted and converted to Float32 format for the Web Audio API. The conversion involves normalizing the 16-bit integer values to the range [-1, 1]. +- **Correct Conversion:** Ensure that PCM16 data is correctly interpreted and converted to Float32 format for the Web Audio API. The conversion involves normalizing the 16-bit integer values to the range [-1, 1]. ### Playback Timing and Scheduling -* **Sequential Playback:** Use the `onended` event of `AudioBufferSourceNode` to trigger the playback of the next audio chunk. This ensures that chunks are played sequentially without overlap. -* **Avoid Aggressive Scheduling:** Do not schedule buffers too far in advance. This can lead to memory issues and make it difficult to handle interruptions. +- **Sequential Playback:** Use the `onended` event of `AudioBufferSourceNode` to trigger the playback of the next audio chunk. This ensures that chunks are played sequentially without overlap. +- **Avoid Aggressive Scheduling:** Do not schedule buffers too far in advance. This can lead to memory issues and make it difficult to handle interruptions. ### Interruption Handling -* **Immediate Stop:** When an interruption is detected (using the `interrupted` flag from the API), stop the current audio playback immediately using `audioStreamer.stop()`. -* **State Reset:** Reset the `isInterrupted` flag and any other relevant state variables to prepare for new audio input. -* **Clear Buffers:** Ensure that any pending audio buffers are cleared to prevent stale audio from playing. +- **Immediate Stop:** When an interruption is detected (using the `interrupted` flag from the API), stop the current audio playback immediately using `audioStreamer.stop()`. +- **State Reset:** Reset the `isInterrupted` flag and any other relevant state variables to prepare for new audio input. +- **Clear Buffers:** Ensure that any pending audio buffers are cleared to prevent stale audio from playing. ### Protocol Management -* **Setup Message:** Send the `setup` message as the very first message after establishing the WebSocket connection. 
This configures the session with the API. -* **Voice Selection:** In the setup message, select a voice in the speech config, which determines the voice of the audio response. -* **Continue Signals:** Send `client_content` messages with `turn_complete: false` to maintain the streaming connection and signal that the client is ready for more audio data. Send these signals immediately after receiving and processing an audio chunk from the model. -* **Turn Completion:** Send a `client_content` message with `turn_complete: true` to indicate the end of the user's turn. +- **Setup Message:** Send the `setup` message as the very first message after establishing the WebSocket connection. This configures the session with the API. +- **Voice Selection:** In the setup message, select a voice in the speech config, which determines the voice of the audio response. +- **Continue Signals:** Send `client_content` messages with `turn_complete: false` to maintain the streaming connection and signal that the client is ready for more audio data. Send these signals immediately after receiving and processing an audio chunk from the model. +- **Turn Completion:** Send a `client_content` message with `turn_complete: true` to indicate the end of the user's turn. ### State Management -* **Track Essential States:** Keep track of states like `isRecording`, `initialized`, and `isInterrupted` to manage the application flow correctly. -* **Reset States Appropriately:** Reset these states at the appropriate times, such as when starting a new recording or after an interruption. +- **Track Essential States:** Keep track of states like `isRecording`, `initialized`, and `isInterrupted` to manage the application flow correctly. +- **Reset States Appropriately:** Reset these states at the appropriate times, such as when starting a new recording or after an interruption. ### Technical Requirements and Best Practices -* **`AudioContext` Sample Rate:** Always initialize the `AudioContext` with a sample rate of 24000 Hz for compatibility with the Gemini API. -* **WebSocket Configuration:** Ensure the WebSocket connection is properly configured with the correct API endpoint and API key. -* **Event Handling:** Implement proper event handling for all relevant audio and WebSocket events, including `onopen`, `onmessage`, `onerror`, `onclose`, `onended`, and custom events like the `data` event from `AudioRecorder`. -* **State Management:** Implement robust state management to track the recording state, initialization state, interruption state, and other relevant flags. +- **`AudioContext` Sample Rate:** Always initialize the `AudioContext` with a sample rate of 24000 Hz for compatibility with the Gemini API. +- **WebSocket Configuration:** Ensure the WebSocket connection is properly configured with the correct API endpoint and API key. +- **Event Handling:** Implement proper event handling for all relevant audio and WebSocket events, including `onopen`, `onmessage`, `onerror`, `onclose`, `onended`, and custom events like the `data` event from `AudioRecorder`. +- **State Management:** Implement robust state management to track the recording state, initialization state, interruption state, and other relevant flags. ### Common Pitfalls to Avoid -* **Overly Complex Buffer Management:** Avoid using fixed buffer sizes or complex buffering logic when a simpler sequential approach is sufficient. 
 
 ### Common Pitfalls to Avoid
 
-* **Overly Complex Buffer Management:** Avoid using fixed buffer sizes or complex buffering logic when a simpler sequential approach is sufficient.
-* **Aggressive Buffer Scheduling:** Don't schedule audio buffers too far in advance, as this can lead to memory issues and complicate interruption handling.
-* **Incorrect PCM16 Handling:** Ensure that PCM16 data is correctly converted to Float32 format, and that the sample rate is properly considered.
-* **Ignoring `turn_complete`:** Always handle the `turn_complete` signal from the API to properly manage turn-taking.
-* **Neglecting State Management:** Failing to properly manage and reset state variables can lead to unexpected behavior and bugs.
-* **Forgetting Continue Signals:** Remember to send continue signals to maintain the streaming connection, especially during long audio generation.
+- **Overly Complex Buffer Management:** Avoid using fixed buffer sizes or complex buffering logic when a simpler sequential approach is sufficient.
+- **Aggressive Buffer Scheduling:** Don't schedule audio buffers too far in advance, as this can lead to memory issues and complicate interruption handling.
+- **Incorrect PCM16 Handling:** Ensure that PCM16 data is correctly converted to Float32 format, and that the sample rate is properly considered.
+- **Ignoring `turn_complete`:** Always handle the `turn_complete` signal from the API to properly manage turn-taking.
+- **Neglecting State Management:** Failing to properly manage and reset state variables can lead to unexpected behavior and bugs.
+- **Forgetting Continue Signals:** Remember to send continue signals to maintain the streaming connection, especially during long audio generation.
 
 ## Summary
diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/index.html b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/index.html
index 81d94098..7f7d3176 100644
--- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/index.html
+++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_05/index.html
@@ -1,4 +1,4 @@
[Markup-only changes to the Chapter 5 demo page; the HTML in this hunk is not recoverable. The surviving page text: the titles "Gemini Audio-to-Audio WebSocket Demo (Dev API)" and "Gemini Live Audio Chat (Dev API)", plus the intro paragraph: "This application demonstrates real-time audio-to-audio chat using the Gemini API and WebSockets. Speak into your microphone and receive audio responses in real time. The app uses the Web Audio API for capturing microphone input and playing back responses, with support for natural conversation flow and interruptions."]
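To connect the pieces of the audio chat page described above, here is one possible receive loop. It is a sketch, not the chapter's implementation: the response field names (`serverContent`, `modelTurn`, `inlineData`, `turnComplete`) are assumed to match the JSON responses logged in this part of the guide, messages are assumed to arrive as JSON text frames, and `player` stands in for an `AudioStreamer`-style sequential player like the one sketched earlier.

```javascript
// Decode a base64-encoded PCM16 payload into an ArrayBuffer.
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes.buffer;
}

// Route one server message: play audio chunks, stop on interruption,
// and send a continue signal after each processed chunk.
function handleServerMessage(ws, player, event) {
  const message = JSON.parse(event.data);      // assumes a JSON text frame
  const content = message.serverContent;
  if (!content) return;

  // The model was interrupted (e.g. the user started speaking): stop playback immediately.
  if (content.interrupted) {
    player.stop();
    return;
  }

  const parts = content.modelTurn?.parts ?? [];
  for (const part of parts) {
    if (part.inlineData?.data) {
      player.addChunk(base64ToArrayBuffer(part.inlineData.data));
    }
  }
  if (parts.length > 0) {
    // Continue signal: tell the server the client is ready for more audio.
    ws.send(JSON.stringify({ client_content: { turns: [], turn_complete: false } }));
  }

  if (content.turnComplete) {
    // The model finished its turn; the app can resume listening for the user.
  }
}
```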
diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_06/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_06/README.md
index 7017a463..f0177615 100644
--- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_06/README.md
+++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_06/README.md
@@ -6,86 +6,92 @@ This chapter takes the real-time audio chat application from **Chapter 5** and s
 
 Chapter 6 leverages the foundational concepts and components established in earlier chapters:
 
-* **Chapter 2 (Live Audio Chat with Gemini):** Provided the basis for real-time audio interaction, which we extend here.
-* **Chapter 3 (Low-Level WebSocket Interaction):** Introduced the core WebSocket communication principles that are essential for this chapter.
-* **Chapter 4 (Text-to-Speech with WebSockets):** Demonstrated basic audio handling with WebSockets, which we build upon for live audio streaming.
-* **Chapter 5 (Real-time Audio-to-Audio):** Established the foundation for real-time audio streaming using WebSockets and the Web Audio API. Chapter 6 extends this by adding video capabilities. We'll reuse the `AudioRecorder`, `AudioStreamer`, and WebSocket communication logic from this chapter.
+- **Chapter 2 (Live Audio Chat with Gemini):** Provided the basis for real-time audio interaction, which we extend here.
+- **Chapter 3 (Low-Level WebSocket Interaction):** Introduced the core WebSocket communication principles that are essential for this chapter.
+- **Chapter 4 (Text-to-Speech with WebSockets):** Demonstrated basic audio handling with WebSockets, which we build upon for live audio streaming.
+- **Chapter 5 (Real-time Audio-to-Audio):** Established the foundation for real-time audio streaming using WebSockets and the Web Audio API. Chapter 6 extends this by adding video capabilities. We'll reuse the `AudioRecorder`, `AudioStreamer`, and WebSocket communication logic from this chapter.
 
 **New Functionalities in Chapter 6:**
 
 This chapter introduces the following key additions:
 
 1. **Video Capture and Management:**
-   * **`MediaHandler` Class:** A new `MediaHandler` class is introduced to manage user media, specifically for webcam and screen capture. It's responsible for:
-     * Requesting access to the user's webcam or screen using `navigator.mediaDevices.getUserMedia()` and `navigator.mediaDevices.getDisplayMedia()`.
-     * Starting and stopping video streams.
-     * Capturing individual frames from the video stream.
-     * Managing the active state of the webcam and screen sharing (using `isWebcamActive` and `isScreenActive` flags).
-   * **Webcam and Screen Sharing Toggle:** The UI now includes two new buttons with material symbol icons:
-     * **Webcam Button:** Toggles the webcam on and off.
-     * **Screen Sharing Button:** Toggles screen sharing on and off.
-   * **Video Preview:** A `