diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md index 5c0d54b9..be3e7d9f 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/CONTRIBUTING.md @@ -39,4 +39,4 @@ In the future, we look forward to your patches and contributions to this project All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more -information on using pull requests. \ No newline at end of file +information on using pull requests. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md index 41e558a0..4b92730e 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/README.md @@ -7,6 +7,7 @@ This repository serves as a comprehensive developer guide for [Google's Gemini M ## What You'll Learn By following this guide, you'll be able to: + - Build real-time audio chat applications with Gemini - Implement live video interactions through webcam and screen sharing - Create multimodal experiences combining audio and video @@ -18,18 +19,21 @@ The guide progresses from basic concepts to advanced implementations, culminatin ## Key Concepts Covered - **Real-time Communication:** + - WebSocket-based streaming - Bidirectional audio chat - Live video processing - Turn-taking and interruption handling - **Audio Processing:** + - Microphone input capture - Audio chunking and streaming - Voice Activity Detection (VAD) - Real-time audio playback - **Video Integration:** + - Webcam and screen capture - Frame processing and encoding - Simultaneous audio-video streaming @@ -45,20 +49,26 @@ The guide progresses from basic concepts to advanced implementations, culminatin ## Guide Structure ### [Part 1](part_1_intro): Introduction to Gemini's Multimodal Live API + Basic concepts and SDK usage: + - SDK setup and authentication - Text and audio interactions - Real-time audio chat implementation ### [Part 2](part_2_dev_api): WebSocket Development with [Gemini Developer API](https://ai.google.dev/api/multimodal-live) + Direct WebSocket implementation, building towards Project Pastra - a production-ready multimodal AI assistant inspired by Project Astra: + - Low-level WebSocket communication - Audio and video streaming - Function calling and system instructions - Mobile-first deployment ### [Part 3](part_3_vertex_api): WebSocket Development with [Vertex AI API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) + Enterprise-grade implementation using Vertex AI, mirroring Part 2's journey with production-focused architecture: + - Proxy-based authentication - Service account integration - Cloud deployment architecture @@ -68,14 +78,14 @@ Enterprise-grade implementation using Vertex AI, mirroring Part 2's journey with Below is a comprehensive overview of where each feature is implemented across the Development API and Vertex AI versions: -| Feature | Part 1 - Intro Chapter | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter | -|---------|----------------|----------------|-------------------| -| SDK setup and 
authentication | [Chapter 1](part_1_intro/chapter_01) | - | - | -| Text and audio interactions | [Chapter 1](part_1_intro/chapter_01) | - | - | -| Real-time Audio Chat | [Chapter 2](part_1_intro/chapter_02) | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) | -| Multimodal (Audio + Video) | - | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) | -| Function Calling & Instructions | - | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) | -| Production Deployment | - | [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) | +| Feature | Part 1 - Intro Chapter | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter | +| ------------------------------- | ------------------------------------ | -------------------------------------- | ------------------------------------------ | +| SDK setup and authentication | [Chapter 1](part_1_intro/chapter_01) | - | - | +| Text and audio interactions | [Chapter 1](part_1_intro/chapter_01) | - | - | +| Real-time Audio Chat | [Chapter 2](part_1_intro/chapter_02) | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) | +| Multimodal (Audio + Video) | - | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) | +| Function Calling & Instructions | - | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) | +| Production Deployment | - | [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) | Note: Vertex AI implementation starts directly with advanced features, skipping basic WebSocket and text-to-speech examples. @@ -94,6 +104,7 @@ Note: Vertex AI implementation starts directly with advanced features, skipping ## Key Differences Between Dev API and Vertex AI ### Development API (Part 2) + - Simple API key authentication - Direct WebSocket connection - All tools available simultaneously @@ -101,6 +112,7 @@ Note: Vertex AI implementation starts directly with advanced features, skipping - Ideal for prototyping and learning ### Vertex AI (Part 3) + - Service account authentication - Proxy-based architecture - Single tool limitation @@ -117,4 +129,3 @@ Note: Vertex AI implementation starts directly with advanced features, skipping ## License This project is licensed under the Apache License. 
- diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md index 5a6bfabe..2f9cb0b4 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/README.md @@ -5,6 +5,7 @@ This section provides a foundational introduction to working with Google's Gemin ## Contents ### Chapter 1: SDK Basics + - Introduction to the Google Gemini AI SDK - Setting up the development environment - Basic text interactions with Gemini @@ -12,12 +13,14 @@ This section provides a foundational introduction to working with Google's Gemin - Examples using both direct API key authentication and Vertex AI authentication ### Chapter 2: Multimodal Interactions + - Real-time audio conversations with Gemini - Streaming audio input and output - Voice activity detection and turn-taking - Example implementation of an interactive voice chat ## Key Features Covered + - Text generation and conversations - Audio output generation - Real-time streaming interactions @@ -25,6 +28,7 @@ This section provides a foundational introduction to working with Google's Gemin - Multimodal capabilities (text-to-audio, audio-to-audio) ## Prerequisites + - Python environment - Google Gemini API access - Required packages: @@ -32,4 +36,5 @@ This section provides a foundational introduction to working with Google's Gemin - `pyaudio` (for audio examples) ## Getting Started -Each chapter contains Jupyter notebooks and Python scripts that demonstrate different aspects of the Gemini AI capabilities. Start with Chapter 1's notebooks for basic SDK usage, and then move on to the more advanced multimodal examples in Chapter 2. \ No newline at end of file + +Each chapter contains Jupyter notebooks and Python scripts that demonstrate different aspects of the Gemini AI capabilities. Start with Chapter 1's notebooks for basic SDK usage, and then move on to the more advanced multimodal examples in Chapter 2. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md index 944a57f1..ac961444 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_1_intro/chapter_02/README.md @@ -16,39 +16,40 @@ The application's functionality can be broken down into several key components: The `pyaudio` library is used to create input and output streams that interface with the user's audio hardware. -* **Input Stream:** An input stream is initialized to capture audio data from the user's microphone. The stream is configured with parameters such as format, channels, sample rate, and chunk size. The `SEND_SAMPLE_RATE` is set to 16000 Hz, which is a common sample rate for speech recognition. The `CHUNK_SIZE` determines the number of audio frames read from the microphone at a time. The `exception_on_overflow` parameter is set to `False` to prevent the stream from raising an exception if the buffer overflows. -* **Output Stream:** An output stream is initialized to play audio data through the user's speakers. Similar to the input stream, it is configured with appropriate parameters. 
The `RECEIVE_SAMPLE_RATE` is set to 24000 Hz, which is suitable for high-quality audio playback. +- **Input Stream:** An input stream is initialized to capture audio data from the user's microphone. The stream is configured with parameters such as format, channels, sample rate, and chunk size. The `SEND_SAMPLE_RATE` is set to 16000 Hz, which is a common sample rate for speech recognition. The `CHUNK_SIZE` determines the number of audio frames read from the microphone at a time. The `exception_on_overflow` parameter is set to `False` to prevent the stream from raising an exception if the buffer overflows. +- **Output Stream:** An output stream is initialized to play audio data through the user's speakers. Similar to the input stream, it is configured with appropriate parameters. The `RECEIVE_SAMPLE_RATE` is set to 24000 Hz, which is suitable for high-quality audio playback. ### Communication with Gemini API The `google-genai` library provides the necessary tools to connect to the Gemini API and establish a communication session. -* **Client Initialization:** A `genai.Client` is created to interact with the API. The `http_options` parameter is used to specify the API version, which is set to `'v1alpha'` in this case. -* **Session Configuration:** A configuration object `CONFIG` is defined to customize the interaction with the model. This includes: - * `generation_config`: Specifies the response modality as "AUDIO" and configures the "speech_config" to "Puck". - * `system_instruction`: Sets a system instruction to always start the model's sentences with "mate". -* **Live Connection:** The `client.aio.live.connect` method establishes a live connection to the Gemini model specified by `MODEL`, which is set to `"models/gemini-2.0-flash-exp"`. +- **Client Initialization:** A `genai.Client` is created to interact with the API. The `http_options` parameter is used to specify the API version, which is set to `'v1alpha'` in this case. +- **Session Configuration:** A configuration object `CONFIG` is defined to customize the interaction with the model. This includes: + - `generation_config`: Specifies the response modality as "AUDIO" and configures the "speech_config" to "Puck". + - `system_instruction`: Sets a system instruction to always start the model's sentences with "mate". +- **Live Connection:** The `client.aio.live.connect` method establishes a live connection to the Gemini model specified by `MODEL`, which is set to `"models/gemini-2.0-flash-exp"`. ### Asynchronous Audio Handling The `asyncio` library is used to manage the asynchronous operations involved in audio processing and communication. -* **Audio Queue:** An `asyncio.Queue` is created to store audio data temporarily. This queue is not used in the current implementation but is defined for potential future use. -* **Task Group:** An `asyncio.TaskGroup` is used to manage two concurrent tasks: `listen_and_send` and `receive_and_play`. -* **`listen_and_send` Task:** This task continuously reads audio data from the input stream in chunks and sends it to the Gemini API. It checks if the model is currently speaking (`model_speaking` flag) and only sends data if the model is not speaking. The chunking is performed using the `pyaudio` library's `read()` method, which is called with a specific `CHUNK_SIZE` (number of audio frames per chunk). Here's how it's done in the code: +- **Audio Queue:** An `asyncio.Queue` is created to store audio data temporarily. 
This queue is not used in the current implementation but is defined for potential future use. +- **Task Group:** An `asyncio.TaskGroup` is used to manage two concurrent tasks: `listen_and_send` and `receive_and_play`. +- **`listen_and_send` Task:** This task continuously reads audio data from the input stream in chunks and sends it to the Gemini API. It checks if the model is currently speaking (`model_speaking` flag) and only sends data if the model is not speaking. The chunking is performed using the `pyaudio` library's `read()` method, which is called with a specific `CHUNK_SIZE` (number of audio frames per chunk). Here's how it's done in the code: - ```python - while True: - if not model_speaking: - try: - data = await asyncio.to_thread(input_stream.read, CHUNK_SIZE, exception_on_overflow=False) - # ... send data to API ... - except OSError as e: - # ... handle error ... - ``` + ```python + while True: + if not model_speaking: + try: + data = await asyncio.to_thread(input_stream.read, CHUNK_SIZE, exception_on_overflow=False) + # ... send data to API ... + except OSError as e: + # ... handle error ... + ``` - In this code, `input_stream.read(CHUNK_SIZE)` reads a chunk of audio frames from the microphone's input buffer. Each chunk is then sent to the API along with the `end_of_turn=True` flag. -* **`receive_and_play` Task:** This task continuously receives responses from the Gemini API and plays the audio data through the output stream. It sets the `model_speaking` flag to `True` when the model starts speaking and to `False` when the turn is complete. It then iterates through the parts of the response and writes the audio data to the output stream. + In this code, `input_stream.read(CHUNK_SIZE)` reads a chunk of audio frames from the microphone's input buffer. Each chunk is then sent to the API along with the `end_of_turn=True` flag. + +- **`receive_and_play` Task:** This task continuously receives responses from the Gemini API and plays the audio data through the output stream. It sets the `model_speaking` flag to `True` when the model starts speaking and to `False` when the turn is complete. It then iterates through the parts of the response and writes the audio data to the output stream. ### Audio Chunking and Real-time Interaction @@ -68,7 +69,6 @@ In this case, with a `CHUNK_SIZE` of 512 frames and a `SEND_SAMPLE_RATE` of 1600 `Chunk Duration = 512 frames / 16000 Hz = 0.032 seconds = 32 milliseconds` - Therefore, each chunk represents 32 milliseconds of audio. **Real-time Interaction Flow:** @@ -97,21 +97,21 @@ The application distinguishes between user input and model output through a comb **Distinguishing Input from Output:** -* **`model_speaking` Flag:** This boolean flag serves as a primary mechanism to differentiate between when the user is providing input and when the model is generating output. - * When `model_speaking` is `False`, the application assumes it's the user's turn to speak. The `listen_and_send` task reads audio data from the microphone and sends it to the API. - * When `model_speaking` is `True`, the application understands that the model is currently generating an audio response. The `listen_and_send` task pauses, preventing user input from being sent to the API while the model is "speaking." The `receive_and_play` task is active during this time, receiving and playing the model's audio output. 
+- **`model_speaking` Flag:** This boolean flag serves as a primary mechanism to differentiate between when the user is providing input and when the model is generating output. + - When `model_speaking` is `False`, the application assumes it's the user's turn to speak. The `listen_and_send` task reads audio data from the microphone and sends it to the API. + - When `model_speaking` is `True`, the application understands that the model is currently generating an audio response. The `listen_and_send` task pauses, preventing user input from being sent to the API while the model is "speaking." The `receive_and_play` task is active during this time, receiving and playing the model's audio output. **How Audio Chunks are Sent:** -* **`end_of_turn=True` with Each Chunk:** The `listen_and_send` task sends each chunk of audio data (determined by `CHUNK_SIZE`) with `end_of_turn=True` in the message payload: `await session.send({"data": data, "mime_type": "audio/pcm"}, end_of_turn=True)`. This might seem like it would constantly interrupt the conversation flow. However, the API handles this gracefully. -* **API-Side Buffering and VAD:** The Gemini API likely buffers the incoming audio chunks on its end. Even though each chunk is marked as the end of a turn with `end_of_turn=True`, the API's Voice Activity Detection (VAD) analyzes the buffered audio to identify longer pauses or periods of silence that more accurately represent the actual end of the user's speech. The API can group several chunks into what it considers a single user turn based on its VAD analysis, rather than strictly treating each chunk as a separate turn. -* **Low-Latency Processing:** The API is designed for low-latency interaction. It starts processing the received audio chunks as soon as possible. Even if `end_of_turn=True` is sent with each chunk, the API can begin generating a response while still receiving more audio from the user, as long as it hasn't detected a significant enough pause to finalize the user's turn based on its VAD. +- **`end_of_turn=True` with Each Chunk:** The `listen_and_send` task sends each chunk of audio data (determined by `CHUNK_SIZE`) with `end_of_turn=True` in the message payload: `await session.send({"data": data, "mime_type": "audio/pcm"}, end_of_turn=True)`. This might seem like it would constantly interrupt the conversation flow. However, the API handles this gracefully. +- **API-Side Buffering and VAD:** The Gemini API likely buffers the incoming audio chunks on its end. Even though each chunk is marked as the end of a turn with `end_of_turn=True`, the API's Voice Activity Detection (VAD) analyzes the buffered audio to identify longer pauses or periods of silence that more accurately represent the actual end of the user's speech. The API can group several chunks into what it considers a single user turn based on its VAD analysis, rather than strictly treating each chunk as a separate turn. +- **Low-Latency Processing:** The API is designed for low-latency interaction. It starts processing the received audio chunks as soon as possible. Even if `end_of_turn=True` is sent with each chunk, the API can begin generating a response while still receiving more audio from the user, as long as it hasn't detected a significant enough pause to finalize the user's turn based on its VAD. **Determining End of Model Turn:** -* **`turn_complete` Field:** The `receive_and_play` task continuously listens for responses from the API. 
Each response includes a `server_content` object, which contains a `turn_complete` field. - * When `turn_complete` is `True`, it signifies that the model has finished generating its response for the current turn. - * Upon receiving a `turn_complete: True` signal, the `receive_and_play` task sets the `model_speaking` flag to `False`. This signals that the model's turn is over, and the application is ready to accept new user input. +- **`turn_complete` Field:** The `receive_and_play` task continuously listens for responses from the API. Each response includes a `server_content` object, which contains a `turn_complete` field. + - When `turn_complete` is `True`, it signifies that the model has finished generating its response for the current turn. + - Upon receiving a `turn_complete: True` signal, the `receive_and_play` task sets the `model_speaking` flag to `False`. This signals that the model's turn is over, and the application is ready to accept new user input. **Turn-Taking Flow:** @@ -127,7 +127,7 @@ In essence, although `end_of_turn=True` is sent with each audio chunk, the API's ### Why Always Set `end_of_turn=True`? -Setting `end_of_turn=True` with each audio chunk, even when the user hasn't finished speaking, might seem counterintuitive. Here are some reasons for this design choice: +Setting `end_of_turn=True` with each audio chunk, even when the user hasn't finished speaking, might seem counterintuitive. Here are some reasons for this design choice: 1. **Simplicity and Reduced Client-Side Complexity:** Implementing robust Voice Activity Detection (VAD) on the client-side can be complex. By always setting `end_of_turn=True`, the developers might have opted for a simpler client-side implementation that offloads the more complex VAD task to the Gemini API. 2. **Lower Latency:** Sending smaller chunks with `end_of_turn=True` might allow the API to start processing the audio sooner. However, this potential latency benefit depends heavily on how the API is designed. @@ -149,5 +149,4 @@ The `if __name__ == "__main__":` block ensures that the `audio_loop` function is ## Limitations -The current implementation does not support user interruption of the model's speech. Future implementations could support interruption by sending a specific interrupt signal to the API or by modifying the current `end_of_turn` logic to be more responsive to shorter pauses in user speech. - +The current implementation does not support user interruption of the model's speech. Future implementations could support interruption by sending a specific interrupt signal to the API or by modifying the current `end_of_turn` logic to be more responsive to shorter pauses in user speech. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md index 57b8152a..cd39b9ff 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/README.md @@ -3,47 +3,55 @@ This section demonstrates how to work directly with the Gemini API using WebSockets, progressively building towards Project Pastra - a production-ready multimodal AI assistant inspired by Google DeepMind's Project Astra. Through a series of chapters, we evolve from basic implementations to a sophisticated, mobile-first application that showcases the full potential of the Gemini API. 
## Journey to Project Pastra + Starting with fundamental WebSocket concepts, each chapter adds new capabilities, ultimately culminating in Project Pastra - our implementation of a universal AI assistant that can see, hear, and interact in real-time. Like Project Astra (Google DeepMind's research prototype), our application demonstrates how to create an AI assistant that can engage in natural, multimodal interactions while maintaining production-grade reliability. ## Contents ### Chapter 3: Basic WebSocket Communication + - Single exchange example with the Gemini API - Core WebSocket setup and communication - Understanding the API's message formats - Handling the mandatory setup phase ### Chapter 4: Text-to-Speech Implementation + - Converting text input to audio responses - Real-time audio playback in the browser - Audio chunk management and streaming - WebSocket and AudioContext integration ### Chapter 5: Real-time Audio Chat + - Bidirectional audio communication - Live microphone input processing - Voice activity detection and turn management - Advanced audio streaming techniques ### Chapter 6: Multimodal Interactions + - Adding video capabilities (webcam and screen sharing) - Frame capture and processing - Simultaneous audio and video streaming - Enhanced user interface controls ### Chapter 7: Advanced Features + - Function calling capabilities - System instructions integration - External API integrations (weather, search) - Code execution functionality ### Chapter 8: Project Pastra + - Mobile-first UI design inspired by Project Astra - Cloud Run deployment setup - Production-grade error handling - Scalable architecture implementation ## Key Features + - Direct WebSocket communication with Gemini API - Real-time audio and video processing - Browser-based implementation @@ -51,6 +59,7 @@ Starting with fundamental WebSocket concepts, each chapter adds new capabilities - Production deployment guidance ## Prerequisites + - Basic understanding of WebSockets - Familiarity with JavaScript and HTML5 - Google Gemini API access @@ -59,20 +68,24 @@ Starting with fundamental WebSocket concepts, each chapter adds new capabilities ## Getting Started This guide uses a simple development server to: + - Serve the HTML/JavaScript files for each chapter - Provide access to shared components (audio processing, media handling, etc.) used across chapters - Enable proper loading of JavaScript modules and assets - Avoid CORS issues when accessing local files 1. Start the development server: + ```bash python server.py ``` + This will serve both the chapter files and shared components at http://localhost:8000 2. Navigate to the specific chapter you want to work with: + - Chapter 3: http://localhost:8000/chapter_03/ - Chapter 4: http://localhost:8000/chapter_04/ - And so on... + And so on... -3. Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. By Chapter 8, you'll have transformed the development prototype into Project Pastra - a production-ready AI assistant that demonstrates the future of human-AI interaction. \ No newline at end of file +3. Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. 
By Chapter 8, you'll have transformed the development prototype into Project Pastra - a production-ready AI assistant that demonstrates the future of human-AI interaction. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md index 3105385d..88454102 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/README.md @@ -10,97 +10,99 @@ The application's functionality can be broken down into several key components: ### 1. Establishing a WebSocket Connection -* **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: - ``` - wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} - ``` - This URL includes the API host, the service path, and an API key for authentication. Replace `${apiKey}` with your actual API key. -* **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: - ```javascript - const ws = new WebSocket(endpoint); - ``` -* **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: - * `onopen`: Triggered when the connection is successfully opened. - * `onmessage`: Triggered when a message is received from the server. - * `onerror`: Triggered if an error occurs during the connection. - * `onclose`: Triggered when the connection is closed. +- **API Endpoint:** The application connects to the Gemini API using a specific WebSocket endpoint URL: + ``` + wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey} + ``` + This URL includes the API host, the service path, and an API key for authentication. Replace `${apiKey}` with your actual API key. +- **WebSocket Object:** A new `WebSocket` object is created in JavaScript, initiating the connection: + ```javascript + const ws = new WebSocket(endpoint); + ``` +- **Event Handlers:** Event handlers are defined to manage the connection's lifecycle and handle incoming messages: + - `onopen`: Triggered when the connection is successfully opened. + - `onmessage`: Triggered when a message is received from the server. + - `onerror`: Triggered if an error occurs during the connection. + - `onclose`: Triggered when the connection is closed. ### 2. Sending a Setup Message (Mandatory First Step) -* **API Requirement:** The Gemini API requires a setup message to be sent as the **very first message** after the WebSocket connection is established. This is crucial for configuring the session. -* **`onopen` Handler:** The `onopen` event handler, which is triggered when the connection is open, is responsible for sending this setup message. -* **Setup Message Structure:** The setup message is a JSON object that conforms to the `BidiGenerateContentSetup` format as defined in the API documentation: - ```javascript - const setupMessage = { - setup: { - model: "models/gemini-2.0-flash-exp", - generation_config: { - response_modalities: ["text"] - } - } - }; - ``` - * `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). 
- * `generation_config`: Configures the generation parameters, such as the `response_modalities` (set to `"text"` for text-based output). You can also specify other parameters like `temperature`, `top_p`, `top_k`, etc., within `generation_config` as needed. -* **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`: - ```javascript - ws.send(JSON.stringify(setupMessage)); - ``` +- **API Requirement:** The Gemini API requires a setup message to be sent as the **very first message** after the WebSocket connection is established. This is crucial for configuring the session. +- **`onopen` Handler:** The `onopen` event handler, which is triggered when the connection is open, is responsible for sending this setup message. +- **Setup Message Structure:** The setup message is a JSON object that conforms to the `BidiGenerateContentSetup` format as defined in the API documentation: + ```javascript + const setupMessage = { + setup: { + model: "models/gemini-2.0-flash-exp", + generation_config: { + response_modalities: ["text"], + }, + }, + }; + ``` + - `model`: Specifies the Gemini model to use (`"models/gemini-2.0-flash-exp"` in this case). + - `generation_config`: Configures the generation parameters, such as the `response_modalities` (set to `"text"` for text-based output). You can also specify other parameters like `temperature`, `top_p`, `top_k`, etc., within `generation_config` as needed. +- **Sending the Message:** The setup message is stringified and sent to the server using `ws.send()`: + ```javascript + ws.send(JSON.stringify(setupMessage)); + ``` ### 3. Receiving and Processing Messages -* **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. -* **Data Handling:** The code handles potential `Blob` data using `new Response(event.data).text()`, but in this text-only example, it directly parses the message as JSON. -* **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. -* **Message Types:** The code specifically checks for a `BidiGenerateContentSetupComplete` message type, indicated by the `setupComplete` field in the response. +- **`onmessage` Handler:** The `onmessage` event handler receives messages from the server. +- **Data Handling:** The code handles potential `Blob` data using `new Response(event.data).text()`, but in this text-only example, it directly parses the message as JSON. +- **Response Parsing:** The received message is parsed as a JSON object using `JSON.parse()`. +- **Message Types:** The code specifically checks for a `BidiGenerateContentSetupComplete` message type, indicated by the `setupComplete` field in the response. ### 4. Confirming Setup Completion Before Proceeding -* **`setupComplete` Check:** The code includes a conditional check to ensure that a `setupComplete` message is received before sending any user content: - ```javascript - if (response.setupComplete) { - // ... Send user message ... - } - ``` -* **Why This Is Important:** This check is essential because the API will not process user content messages until the setup is complete. Sending content before receiving confirmation that the setup is complete will likely result in an error or unexpected behavior. The API might close the connection if messages other than the initial setup message are sent before the setup is completed. 
+- **`setupComplete` Check:** The code includes a conditional check to ensure that a `setupComplete` message is received before sending any user content: + ```javascript + if (response.setupComplete) { + // ... Send user message ... + } + ``` +- **Why This Is Important:** This check is essential because the API will not process user content messages until the setup is complete. Sending content before receiving confirmation that the setup is complete will likely result in an error or unexpected behavior. The API might close the connection if messages other than the initial setup message are sent before the setup is completed. ### 5. Sending a Hardcoded User Message -* **Triggered by `setupComplete`:** Only after the `setupComplete` message is received and processed does the application send a user message to the model. -* **User Message Structure:** The user message is a JSON object conforming to the `BidiGenerateContentClientContent` format: - ```javascript - const contentMessage = { - client_content: { - turns: [{ +- **Triggered by `setupComplete`:** Only after the `setupComplete` message is received and processed does the application send a user message to the model. +- **User Message Structure:** The user message is a JSON object conforming to the `BidiGenerateContentClientContent` format: + ```javascript + const contentMessage = { + client_content: { + turns: [ + { role: "user", - parts: [{ text: "Hello! Are you there?" }] - }], - turn_complete: true - } - }; - ``` - * `client_content`: Contains the conversation content. - * `turns`: An array representing the conversation turns. - * `role`: Indicates the role of the speaker ("user" in this case). - * `parts`: An array of content parts (in this case, a single text part). - * `text`: The actual user message (hardcoded to "Hello! Are you there?"). - * `turn_complete`: Set to `true` to signal the end of the user's turn. -* **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. + parts: [{ text: "Hello! Are you there?" }], + }, + ], + turn_complete: true, + }, + }; + ``` + - `client_content`: Contains the conversation content. + - `turns`: An array representing the conversation turns. + - `role`: Indicates the role of the speaker ("user" in this case). + - `parts`: An array of content parts (in this case, a single text part). + - `text`: The actual user message (hardcoded to "Hello! Are you there?"). + - `turn_complete`: Set to `true` to signal the end of the user's turn. +- **Sending the Message:** The content message is stringified and sent to the server using `ws.send()`. ### 6. Displaying the Model's Response -* **`serverContent` Handling:** When a `serverContent` message is received (which contains the model's response), the application extracts the response text. -* **Response Extraction:** The model's response is accessed using `response.serverContent.modelTurn.parts[0]?.text`. -* **Displaying the Response:** The `logMessage()` function displays the model's response in the `output` div on the HTML page. +- **`serverContent` Handling:** When a `serverContent` message is received (which contains the model's response), the application extracts the response text. +- **Response Extraction:** The model's response is accessed using `response.serverContent.modelTurn.parts[0]?.text`. +- **Displaying the Response:** The `logMessage()` function displays the model's response in the `output` div on the HTML page. ### 7. 
Error Handling and Connection Closure -* **`onerror` Handler:** The `onerror` event handler logs any WebSocket errors to the console and displays an error message on the page. -* **`onclose` Handler:** The `onclose` event handler logs information about the connection closure, including the reason and status code. +- **`onerror` Handler:** The `onerror` event handler logs any WebSocket errors to the console and displays an error message on the page. +- **`onclose` Handler:** The `onclose` event handler logs information about the connection closure, including the reason and status code. ### 8. Logging Messages -* **`logMessage()` Function:** This utility function creates a new paragraph element (`
<p>`) and appends it to the `output` div, displaying the provided message on the page.
+- **`logMessage()` Function:** This utility function creates a new paragraph element (`<p>
`) and appends it to the `output` div, displaying the provided message on the page. ## Educational Purpose @@ -117,20 +119,20 @@ By examining this code, you can gain a deeper understanding of the underlying co **Note:** This is a simplified example for educational purposes. A real-world chat application would involve more complex features like: -* Dynamic user input. -* Handling multiple conversation turns. -* Maintaining conversation history. -* Potentially integrating audio or video. +- Dynamic user input. +- Handling multiple conversation turns. +- Maintaining conversation history. +- Potentially integrating audio or video. This example provides a solid foundation for understanding the basic principles involved in interacting with the Gemini API at a low level using WebSockets, especially the crucial setup process. **Note:** This is a simplified example for educational purposes. A real-world chat application would involve more complex features like: -* Dynamic user input (see Chapter 4). -* Handling multiple conversation turns. -* Maintaining conversation history. -* Potentially integrating audio or video (see Chapter 5 & 6). +- Dynamic user input (see Chapter 4). +- Handling multiple conversation turns. +- Maintaining conversation history. +- Potentially integrating audio or video (see Chapter 5 & 6). **Security Best Practices:** -For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. \ No newline at end of file +For production applications, **never** expose your API key directly in client-side code. Instead, use a secure backend server to handle authentication and proxy requests to the API. This protects your API key from unauthorized access. diff --git a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html index fda55344..16c19623 100644 --- a/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html +++ b/genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide/part_2_dev_api/chapter_03/index.html @@ -1,4 +1,4 @@ - + -
-This is a simple demonstration of WebSocket communication with the Gemini API, showing a single exchange between user and model. It illustrates the fundamental principles of interacting with the API at a low level, without using an SDK.
-