diff --git a/README.md b/README.md index d642e92..3f6ba77 100644 --- a/README.md +++ b/README.md @@ -11,16 +11,9 @@ The Swiss Army Llama is designed to facilitate and optimize the process of worki Some additional useful endpoints are provided, such as computing semantic similarity between submitted text strings. The service leverages a high-performance Rust-based library, `fast_vector_similarity`, to offer a range of similarity measures including `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_similarity`, and [`hoeffding_d`](https://blogs.sas.com/content/iml/2021/05/03/examples-hoeffding-d.html). Additionally, semantic search across all your cached embeddings is supported using FAISS vector searching. You can either use the built in cosine similarity from FAISS, or supplement this with a second pass that computes the more sophisticated similarity measures for the most relevant subset of the stored vectors found using cosine similarity (see the advanced semantic search endpoint for this functionality). Also, we now support multiple embedding pooling methods for combining token-level embedding vectors into a single fixed-length embedding vector for any length of input text, including the following: - - `means`: Element-wise average of the token embeddings. - - `means_mins_maxes`: Concatenation of element-wise mean, min, and max of the token embeddings. - - `means_mins_maxes_stds_kurtoses`: Concatenation of element-wise mean, min, max, standard deviation, and kurtosis of the token embeddings. - `svd`: Concatenation of the first two singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix. - `svd_first_four`: Concatenation of the first four singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix. - - `gram_matrix`: Flattened Gram matrix (dot product of the token embeddings matrix with its transpose). - - `qr_decomposition`: Concatenation of the flattened Q and R matrices from QR decomposition of the token embeddings. - - `cholesky_decomposition`: Flattened lower triangular matrix from Cholesky decomposition of the covariance matrix of the token embeddings. - `ica`: Flattened independent components obtained from Independent Component Analysis (ICA) of the token embeddings. - - `nmf`: Flattened components obtained from Non-Negative Matrix Factorization (NMF) of the token embeddings. - `factor_analysis`: Flattened factors obtained from Factor Analysis of the token embeddings. - `gaussian_random_projection`: Flattened embeddings obtained from Gaussian Random Projection of the token embeddings. @@ -694,50 +687,22 @@ The primary goal of these pooling methods is to retain as much useful informatio #### Explanation of Pooling Methods -1. **Means**: - - **How it works**: Computes the element-wise average of the token embeddings. - - **Rationale**: The mean pooling method provides a simple yet effective way to summarize the central tendency of the token embeddings, capturing the overall semantic content of the text. - -2. **Means_Mins_Maxes**: - - **How it works**: Concatenates the element-wise mean, min, and max of the token embeddings. - - **Rationale**: This method captures the central tendency (mean) as well as the range (min and max) of the embeddings, providing a richer representation by considering the distribution of values. - -3. 
**Means_Mins_Maxes_Stds_Kurtoses**: - - **How it works**: Concatenates the element-wise mean, min, max, standard deviation, and kurtosis of the token embeddings. - - **Rationale**: This method captures various statistical properties of the embeddings, including their central tendency, variability, and distribution shape, offering a comprehensive summary of the token embeddings. - -4. **SVD (Singular Value Decomposition)**: +1. **SVD (Singular Value Decomposition)**: - **How it works**: Concatenates the first two singular vectors obtained from the SVD of the token embeddings matrix. - **Rationale**: SVD is a dimensionality reduction technique that captures the most important features of the data. Using the first two singular vectors provides a compact representation that retains significant information. -5. **SVD_First_Four**: +2. **SVD_First_Four**: - **How it works**: Uses the first four singular vectors obtained from the SVD of the token embeddings matrix. - **Rationale**: By using more singular vectors, this method captures more of the variance in the data, providing a richer representation while still reducing dimensionality. -6. **Gram_Matrix**: - - **How it works**: Computes the Gram matrix (dot product of the embeddings matrix with its transpose) and flattens it. - - **Rationale**: The Gram matrix captures the pairwise similarities between token embeddings, providing a summary of their relationships. - -7. **QR_Decomposition**: - - **How it works**: Performs QR decomposition on the embeddings matrix and concatenates the flattened Q and R matrices. - - **Rationale**: QR decomposition provides an orthogonal basis (Q) and upper triangular matrix (R), summarizing the embeddings in terms of these basis vectors and their coefficients. - -8. **Cholesky_Decomposition**: - - **How it works**: Performs Cholesky decomposition on the covariance matrix of the embeddings and flattens the resulting matrix. - - **Rationale**: This method factors the covariance matrix into a lower triangular matrix, capturing the structure of the variance in the embeddings. - -9. **ICA (Independent Component Analysis)**: +3. **ICA (Independent Component Analysis)**: - **How it works**: Applies ICA to the embeddings matrix to find statistically independent components, then flattens the result. - **Rationale**: ICA is useful for identifying independent sources in the data, providing a representation that highlights these independent features. -10. **NMF (Non-Negative Matrix Factorization)**: - - **How it works**: Applies NMF to the embeddings matrix and flattens the result. - - **Rationale**: NMF finds parts-based representations by factorizing the data into non-negative components, useful for interpretability and feature extraction. - -11. **Factor_Analysis**: +4. **Factor_Analysis**: - **How it works**: Applies factor analysis to the embeddings matrix to identify underlying factors, then flattens the result. - **Rationale**: Factor analysis models the data in terms of latent factors, providing a summary that captures these underlying influences. -12. **Gaussian_Random_Projection**: +5. **Gaussian_Random_Projection**: - **How it works**: Applies Gaussian random projection to reduce the dimensionality of the embeddings, then flattens the result. - **Rationale**: This method provides a fast and efficient way to reduce dimensionality while preserving the pairwise distances between points, useful for large datasets. 
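To make the retained methods above concrete, here is a minimal, self-contained sketch of how a token-embedding matrix can be pooled into a fixed-length vector. It mirrors the logic in the `service_functions.py` hunk later in this patch, but the function name, the single-token shortcut, and the example shapes are illustrative rather than part of the codebase:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, FastICA, FactorAnalysis
from sklearn.random_projection import GaussianRandomProjection

def pool_token_embeddings(embeddings: np.ndarray, method: str = "svd") -> np.ndarray:
    """Reduce a (num_tokens, embedding_dim) matrix to a single fixed-length vector."""
    if embeddings.shape[0] == 1:
        # A single token embedding is returned as-is; there is nothing to pool.
        return embeddings.flatten()
    if method == "svd":
        reducer = TruncatedSVD(n_components=2)
    elif method == "svd_first_four":
        reducer = TruncatedSVD(n_components=4)
    elif method == "ica":
        reducer = FastICA(n_components=2)
    elif method == "factor_analysis":
        reducer = FactorAnalysis(n_components=2)
    elif method == "gaussian_random_projection":
        reducer = GaussianRandomProjection(n_components=2)
    else:
        raise ValueError(f"Unknown embedding_pooling_method: {method}")
    # Fitting on the transposed matrix means the pooled length depends only on
    # the model's embedding dimension, not on how many tokens the text has.
    return reducer.fit_transform(embeddings.T).flatten()

# Example: 12 token embeddings of dimension 384 -> a 768-dimensional pooled vector for "svd".
pooled = pool_token_embeddings(np.random.rand(12, 384), "svd")
print(pooled.shape)  # (768,)
```

Because the reducers are fit on the transposed matrix, the pooled vector's length is `n_components × embedding_dim` regardless of input length, which is what lets texts of any length share a single FAISS index.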
diff --git a/embeddings_data_models.py b/embeddings_data_models.py index 7807e73..f81f29f 100644 --- a/embeddings_data_models.py +++ b/embeddings_data_models.py @@ -84,17 +84,17 @@ def update_document_hash_on_remove(target, value, initiator): # Request/Response models start here: class EmbeddingRequest(BaseModel): - text: str - llm_model_name: str - embedding_pooling_method: str - corpus_identifier_string: str + text: str = "" + llm_model_name: str = DEFAULT_MODEL_NAME + embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD + corpus_identifier_string: str = "" class SimilarityRequest(BaseModel): - text1: str - text2: str - llm_model_name: str - embedding_pooling_method: str - similarity_measure: str + text1: str = "" + text2: str = "" + llm_model_name: str = DEFAULT_MODEL_NAME + embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD + similarity_measure: str = "all" @field_validator('similarity_measure') def validate_similarity_measure(cls, value): valid_measures = ["all", "spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"] @@ -103,11 +103,11 @@ def validate_similarity_measure(cls, value): return value.lower() class SemanticSearchRequest(BaseModel): - query_text: str - number_of_most_similar_strings_to_return: int - llm_model_name: str - embedding_pooling_method: str - corpus_identifier_string: str + query_text: str = "" + number_of_most_similar_strings_to_return: int = 10 + llm_model_name: str = DEFAULT_MODEL_NAME + embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD + corpus_identifier_string: str = "" class SemanticSearchResponse(BaseModel): query_text: str @@ -116,13 +116,13 @@ class SemanticSearchResponse(BaseModel): results: List[dict] # List of similar strings and their similarity scores using cosine similarity with Faiss (in descending order) class AdvancedSemanticSearchRequest(BaseModel): - query_text: str - llm_model_name: str - embedding_pooling_method: str - corpus_identifier_string: str - similarity_filter_percentage: float - number_of_most_similar_strings_to_return: int - result_sorting_metric: str + query_text: str = "" + llm_model_name: str = DEFAULT_MODEL_NAME + embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD + corpus_identifier_string: str = "" + similarity_filter_percentage: float = 0.01 + number_of_most_similar_strings_to_return: int = 10 + result_sorting_metric: str = "hoeffding_d" @field_validator('result_sorting_metric') def validate_similarity_measure(cls, value): valid_measures = ["all", "spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"] @@ -168,12 +168,12 @@ class AllDocumentsResponse(BaseModel): documents: List[str] class TextCompletionRequest(BaseModel): - input_prompt: str - llm_model_name: str - temperature: float - grammar_file_string: str - number_of_tokens_to_generate: int - number_of_completions_to_generate: int + input_prompt: str = "" + llm_model_name: str = DEFAULT_MODEL_NAME + temperature: float = DEFAULT_COMPLETION_TEMPERATURE + grammar_file_string: str = "" + number_of_tokens_to_generate: int = DEFAULT_MAX_COMPLETION_TOKENS + number_of_completions_to_generate: int = DEFAULT_NUMBER_OF_COMPLETIONS_TO_GENERATE class TextCompletionResponse(BaseModel): input_prompt: str @@ -241,50 +241,3 @@ class AddGrammarRequest(BaseModel): class AddGrammarResponse(BaseModel): valid_grammar_files: List[str] - -def fill_default_values_in_request(request): - if isinstance(request, EmbeddingRequest): - if 
request.llm_model_name is None: - request.llm_model_name = DEFAULT_MODEL_NAME - if request.embedding_pooling_method is None: - request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD - if request.corpus_identifier_string is None: - request.corpus_identifier_string = "" - elif isinstance(request, SimilarityRequest): - if request.llm_model_name is None: - request.llm_model_name = DEFAULT_MODEL_NAME - if request.embedding_pooling_method is None: - request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD - if request.similarity_measure is None: - request.similarity_measure = "all" - elif isinstance(request, SemanticSearchRequest): - if request.llm_model_name is None: - request.llm_model_name = DEFAULT_MODEL_NAME - if request.embedding_pooling_method is None: - request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD - if request.corpus_identifier_string is None: - request.corpus_identifier_string = "" - elif isinstance(request, AdvancedSemanticSearchRequest): - if request.llm_model_name is None: - request.llm_model_name = DEFAULT_MODEL_NAME - if request.embedding_pooling_method is None: - request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD - if request.corpus_identifier_string is None: - request.corpus_identifier_string = "" - if request.similarity_filter_percentage is None: - request.similarity_filter_percentage = 0.01 - if request.number_of_most_similar_strings_to_return is None: - request.number_of_most_similar_strings_to_return = 10 - if request.result_sorting_metric is None: - request.result_sorting_metric = "hoeffding_d" - elif isinstance(request, TextCompletionRequest): - if request.llm_model_name is None: - request.llm_model_name = DEFAULT_MODEL_NAME - if request.temperature is None: - request.temperature = DEFAULT_COMPLETION_TEMPERATURE - if request.grammar_file_string is None: - request.grammar_file_string = "" - if request.number_of_tokens_to_generate is None: - request.number_of_tokens_to_generate = DEFAULT_MAX_COMPLETION_TOKENS - if request.number_of_completions_to_generate is None: - request.number_of_completions_to_generate = DEFAULT_NUMBER_OF_COMPLETIONS_TO_GENERATE diff --git a/end_to_end_tests.py b/end_to_end_tests.py index 159e439..8193c68 100644 --- a/end_to_end_tests.py +++ b/end_to_end_tests.py @@ -30,8 +30,7 @@ async def get_model_names() -> List[str]: return [name for name in model_names if "llava" not in name] async def get_embedding_pooling_methods() -> List[str]: - pooling_methods = ['means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', - 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'] + pooling_methods = ['svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'] print(f"Using embedding pooling methods: {pooling_methods}") return pooling_methods diff --git a/service_functions.py b/service_functions.py index 5335a6c..5951c30 100644 --- a/service_functions.py +++ b/service_functions.py @@ -22,7 +22,6 @@ from urllib.parse import quote import numpy as np import pandas as pd -import scipy import textract import zstandard as zstd from sqlalchemy import select @@ -38,10 +37,8 @@ from mutagen import File as MutagenFile from magika import Magika import httpx -from sklearn.decomposition import TruncatedSVD, FastICA, FactorAnalysis, NMF -from sklearn.preprocessing import StandardScaler, MinMaxScaler +from sklearn.decomposition import TruncatedSVD, FastICA, FactorAnalysis from sklearn.random_projection import 
GaussianRandomProjection -from numpy.linalg import qr, cholesky logger = setup_logger() magika = Magika() @@ -269,9 +266,8 @@ async def calculate_sentence_embeddings_list(llama, texts: list, embedding_pooli raise ValueError("Inconsistent number of embeddings found.") list_of_embedding_entry_dicts = [] cnt = 0 - for i, current_text in enumerate(texts): + for i, current_text in enumerate(texts): current_set_of_embeddings = sentence_embeddings_list[i]['embedding'] - # Check if `current_set_of_embeddings` is a list of lists or just a list; if it's just a list, then number_of_embeddings will be 1 and we need to convert it to a list of lists if isinstance(current_set_of_embeddings[0], list): number_of_embeddings = len(current_set_of_embeddings) else: @@ -280,83 +276,54 @@ async def calculate_sentence_embeddings_list(llama, texts: list, embedding_pooli logger.info(f"Sentence {i + 1} of {len(texts):,} has {number_of_embeddings:,} embeddings for text '{current_text[:50]}...'") embeddings = np.array(current_set_of_embeddings) dimension_of_token_embeddings = embeddings.shape[1] - if embedding_pooling_method == "means": - means = np.mean(embeddings, axis=0) - flattened_vector = means.flatten() - elif embedding_pooling_method == "means_mins_maxes": - if number_of_embeddings == 1: - flattened_vector = embeddings[0].flatten() - else: - means = np.mean(embeddings, axis=0) - mins = np.min(embeddings, axis=0) - maxes = np.max(embeddings, axis=0) - combined_feature_vector = np.concatenate([means, mins, maxes]).flatten() - flattened_vector = combined_feature_vector.flatten() - elif embedding_pooling_method == "means_mins_maxes_stds_kurtoses": - if number_of_embeddings == 1: - flattened_vector = embeddings[0].flatten() - else: - means = np.mean(embeddings, axis=0) - mins = np.min(embeddings, axis=0) - maxes = np.max(embeddings, axis=0) - stds = np.std(embeddings, axis=0) - kurtoses = scipy.stats.kurtosis(embeddings, axis=0) - combined_feature_vector = np.concatenate([means, mins, maxes, stds, kurtoses]) - flattened_vector = combined_feature_vector.flatten() - elif embedding_pooling_method == "svd": - if number_of_embeddings == 1: - flattened_vector = embeddings[0].flatten() - else: - svd = TruncatedSVD(n_components=2) # Set n_components to 2 + # Ensure embeddings have enough dimensions for the pooling method + required_components = { + "svd": 2, + "svd_first_four": 4, + "ica": 2, + "factor_analysis": 2, + "gaussian_random_projection": 2 + } + if number_of_embeddings > 1: + min_components = required_components.get(embedding_pooling_method, 1) + if number_of_embeddings < min_components: + padding = np.zeros((min_components - number_of_embeddings, dimension_of_token_embeddings)) + embeddings = np.vstack([embeddings, padding]) + if embedding_pooling_method == "svd": + svd = TruncatedSVD(n_components=2) svd_embeddings = svd.fit_transform(embeddings.T) flattened_vector = svd_embeddings.flatten() - elif embedding_pooling_method == "svd_first_four": - svd = TruncatedSVD(n_components=4) # Set n_components to 4 + elif embedding_pooling_method == "svd_first_four": + svd = TruncatedSVD(n_components=4) svd_embeddings = svd.fit_transform(embeddings.T) flattened_vector = svd_embeddings.flatten() - elif embedding_pooling_method == "covariance_matrix": - covariance_matrix = np.cov(embeddings.T, rowvar=False) - eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix) - flattened_vector = np.concatenate([eigenvalues, eigenvectors.flatten()]).flatten() - elif embedding_pooling_method == "qr_decomposition": - q, r = 
qr(embeddings.T) - flattened_vector = np.concatenate([q.flatten(), r.flatten()]).flatten() - elif embedding_pooling_method == "cholesky_decomposition": - try: - cholesky_matrix = cholesky(np.cov(embeddings.T, rowvar=False)) - flattened_vector = cholesky_matrix.flatten() - except np.linalg.LinAlgError: - flattened_vector = np.zeros((embeddings.shape[1] * embeddings.shape[1],)) - elif embedding_pooling_method == "ica": - ica = FastICA(n_components=2) - ica_embeddings = ica.fit_transform(embeddings.T) - flattened_vector = ica_embeddings.flatten() - elif embedding_pooling_method == "nmf": - scaler = MinMaxScaler() - scaled_embeddings = scaler.fit_transform(embeddings.T) - nmf = NMF(n_components=2) - nmf_embeddings = nmf.fit_transform(scaled_embeddings) - flattened_vector = nmf_embeddings.flatten() - elif embedding_pooling_method == "factor_analysis": - fa = FactorAnalysis(n_components=2) - fa_embeddings = fa.fit_transform(embeddings.T) - flattened_vector = fa_embeddings.flatten() - elif embedding_pooling_method == "gaussian_random_projection": - grp = GaussianRandomProjection(n_components=2) - grp_embeddings = grp.fit_transform(embeddings.T) - flattened_vector = grp_embeddings.flatten() + elif embedding_pooling_method == "ica": + ica = FastICA(n_components=2) + ica_embeddings = ica.fit_transform(embeddings.T) + flattened_vector = ica_embeddings.flatten() + elif embedding_pooling_method == "factor_analysis": + fa = FactorAnalysis(n_components=2) + fa_embeddings = fa.fit_transform(embeddings.T) + flattened_vector = fa_embeddings.flatten() + elif embedding_pooling_method == "gaussian_random_projection": + grp = GaussianRandomProjection(n_components=2) + grp_embeddings = grp.fit_transform(embeddings.T) + flattened_vector = grp_embeddings.flatten() + else: + raise ValueError(f"Unknown embedding_pooling_method: {embedding_pooling_method}") + combined_embedding = flattened_vector.tolist() else: - raise ValueError(f"Unknown embedding_pooling_method: {embedding_pooling_method}") - combined_embedding = flattened_vector.tolist() + flattened_vector = embeddings.flatten().tolist() + combined_embedding = embeddings.flatten().tolist() embedding_length = len(combined_embedding) cnt += 1 embedding_json = json.dumps(combined_embedding) embedding_hash = sha3_256(embedding_json.encode('utf-8')).hexdigest() - embedding_entry_dict = {'text_index': i, 'text': current_text, 'embedding_pooling_method': embedding_pooling_method,'number_of_token_embeddings_used': number_of_embeddings, 'embedding_length': embedding_length, 'embedding_hash': embedding_hash,'embedding': combined_embedding} + embedding_entry_dict = {'text_index': i, 'text': current_text, 'embedding_pooling_method': embedding_pooling_method, 'number_of_token_embeddings_used': number_of_embeddings, 'embedding_length': embedding_length, 'embedding_hash': embedding_hash, 'embedding': combined_embedding} list_of_embedding_entry_dicts.append(embedding_entry_dict) end_time = datetime.utcnow() total_time = (end_time - start_time).total_seconds() - logger.info(f"Calculated {len(flattened_vector):,}-dimensional embeddings (relative to the underlying token embedding dimensions of {dimension_of_token_embeddings:,}) for {total_number_of_sentences:,} sentences in a total of {total_time:,.1f} seconds.") + logger.info(f"Calculated {len(flattened_vector):,}-dimensional embeddings (relative to the underlying token embedding dimensions of {dimension_of_token_embeddings:,}) for {total_number_of_sentences:,} sentences in a total of {total_time:,.1f} seconds.") 
logger.info(f"That's an average of {1000*total_time/total_number_of_sentences:,.2f} ms per sentence and {total_number_of_sentences/total_time:,.3f} sentences per second (and {total_characters/(1000*total_time):,.4f} total characters per ms) using pooling method '{embedding_pooling_method}'") return list_of_embedding_entry_dicts diff --git a/swiss_army_llama.py b/swiss_army_llama.py index 2a77d55..dcb10b7 100644 --- a/swiss_army_llama.py +++ b/swiss_army_llama.py @@ -4,7 +4,7 @@ from database_functions import AsyncSessionLocal from ramdisk_functions import clear_ramdisk from misc_utility_functions import build_faiss_indexes, configure_redis_optimally -from embeddings_data_models import DocumentEmbedding, ShowLogsIncrementalModel, fill_default_values_in_request +from embeddings_data_models import DocumentEmbedding, ShowLogsIncrementalModel from embeddings_data_models import EmbeddingRequest, SemanticSearchRequest, AdvancedSemanticSearchRequest, SimilarityRequest, TextCompletionRequest, AddGrammarRequest from embeddings_data_models import EmbeddingResponse, SemanticSearchResponse, AdvancedSemanticSearchResponse, SimilarityResponse, AllStringsResponse, AllDocumentsResponse, TextCompletionResponse, AudioTranscriptResponse, ImageQuestionResponse, AddGrammarResponse from service_functions import get_or_compute_embedding, get_or_compute_transcript, add_model_url, download_file, start_resource_monitoring, end_resource_monitoring, decompress_data @@ -295,7 +295,7 @@ async def add_new_model(model_url: str, token: str = None) -> Dict[str, Any]: The request must contain the following attributes: - `text`: The input text for which the embedding vector is to be retrieved. - `llm_model_name`: The model used to calculate the embedding (optional, will use the default model if not provided). -- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). +- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). ### Example (note that `llm_model_name` is optional): ```json @@ -321,7 +321,6 @@ async def get_embedding_vector_for_string(request: EmbeddingRequest, req: Reques logger.warning(f"Unauthorized request from client IP {client_ip}") raise HTTPException(status_code=403, detail="Unauthorized") try: - request = fill_default_values_in_request(request) request.text = prepare_string_for_embedding(request.text) unique_id = f"get_embedding_{request.text}_{request.llm_model_name}_{request.embedding_pooling_method}" lock = await shared_resources.lock_manager.lock(unique_id) @@ -372,7 +371,6 @@ async def compute_similarity_between_strings(request: SimilarityRequest, req: Re raise HTTPException(status_code=403, detail="Unauthorized") logger.info(f"Received request: {request}") request_time = datetime.utcnow() - request = fill_default_values_in_request(request) request.text1 = prepare_string_for_embedding(request.text1) request.text2 = prepare_string_for_embedding(request.text2) similarity_measure = request.similarity_measure.lower() @@ -437,7 +435,7 @@ async def compute_similarity_between_strings(request: SimilarityRequest, req: Re The request must contain the following attributes: - `query_text`: The input text for which to find the most similar string. 
- `llm_model_name`: The model used to calculate embeddings. -- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). +- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). - `corpus_identifier_string`: An optional string identifier to restrict the search to a specific corpus. - `number_of_most_similar_strings_to_return`: (Optional) The number of most similar strings to return, defaults to 10. @@ -472,7 +470,6 @@ async def search_stored_embeddings_with_query_string_for_semantic_similarity(req raise HTTPException(status_code=403, detail="Unauthorized") global faiss_indexes, associated_texts_by_model_and_pooling_method request_time = datetime.utcnow() - request = fill_default_values_in_request(request) request.query_text = prepare_string_for_embedding(request.query_text) unique_id = f"semantic_search_{request.query_text}_{request.llm_model_name}_{request.embedding_pooling_method}_{request.corpus_identifier_string}_{request.number_of_most_similar_strings_to_return}" # Unique ID for this operation lock = await shared_resources.lock_manager.lock(unique_id) @@ -521,7 +518,10 @@ async def search_stored_embeddings_with_query_string_for_semantic_similarity(req total_time = (response_time - request_time).total_seconds() logger.info(f"Finished searching for the most similar string in the FAISS index in {total_time:,.2f} seconds. Found {len(results):,} results, returning the top {num_results:,}.") logger.info(f"Found most similar strings for query string {request.query_text}: {results}") - return {"query_text": request.query_text, "corpus_identifier_string": request.corpus_identifier_string,"results": results} # Return the response matching the SemanticSearchResponse model + if len(results) == 0: + logger.info(f"No results found for query string {request.query_text}.") + raise HTTPException(status_code=400, detail=f"No results found for query string {request.query_text} and model {llm_model_name} and pooling method {embedding_pooling_method} and corpus {request.corpus_identifier_string}.") + return {"query_text": request.query_text, "corpus_identifier_string": request.corpus_identifier_string, "embedding_pooling_method": embedding_pooling_method, "results": results} # Return the response matching the SemanticSearchResponse model except Exception as e: logger.error(f"An error occurred while processing the request: {e}") logger.error(traceback.format_exc()) # Print the traceback @@ -547,7 +547,7 @@ async def search_stored_embeddings_with_query_string_for_semantic_similarity(req The request must contain the following attributes: - `query_text`: The input text for which to find the most similar string. - `llm_model_name`: The model used to calculate embeddings. -- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). +- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). 
- `corpus_identifier_string`: An optional string identifier to restrict the search to a specific corpus. - `similarity_filter_percentage`: (Optional) The percentage of embeddings to filter based on cosine similarity, defaults to 0.02 (i.e., top 2%). - `number_of_most_similar_strings_to_return`: (Optional) The number of most similar strings to return after applying the second similarity measure, defaults to 10. @@ -586,7 +586,6 @@ async def advanced_search_stored_embeddings_with_query_string_for_semantic_simil raise HTTPException(status_code=403, detail="Unauthorized") global faiss_indexes, associated_texts_by_model_and_pooling_method request_time = datetime.utcnow() - request = fill_default_values_in_request(request) request.query_text = prepare_string_for_embedding(request.query_text) unique_id = f"advanced_semantic_search_{request.query_text}_{request.llm_model_name}_{request.embedding_pooling_method}_{request.similarity_filter_percentage}_{request.number_of_most_similar_strings_to_return}" lock = await shared_resources.lock_manager.lock(unique_id) @@ -651,6 +650,7 @@ async def advanced_search_stored_embeddings_with_query_string_for_semantic_simil return {"query_text": request.query_text, "corpus_identifier_string": request.corpus_identifier_string, "results": results} except Exception as e: logger.error(f"An error occurred while processing the request: {e}") + traceback.print_exc() raise HTTPException(status_code=500, detail="Internal Server Error") finally: await shared_resources.lock_manager.unlock(lock) @@ -658,6 +658,7 @@ async def advanced_search_stored_embeddings_with_query_string_for_semantic_simil return {"status": "already processing"} + @app.post("/get_all_embedding_vectors_for_document/", summary="Get Embeddings for a Document", description="""Extract text embeddings for a document. This endpoint supports plain text, .doc/.docx (MS Word), PDF files, images (using Tesseract OCR), and many other file types supported by the textract library. - `hash`: SHA3-256 hash of the document file to verify integrity (optional; in lieu of `file`). - `size`: Size of the document file in bytes to verify completeness (optional; in lieu of `file`). - `llm_model_name`: The model used to calculate embeddings (optional). -- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). +- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). - `corpus_identifier_string`: An optional string identifier for grouping documents into a specific corpus. - `json_format`: The format of the JSON response (optional, see details below). - `send_back_json_or_zip_file`: Whether to return a JSON file or a ZIP file containing the embeddings file (optional, defaults to `zip`). +- `query_text`: An optional query text to perform a semantic search with the same parameters used for the document embedding request (see the example below). - `token`: Security token (optional).
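### Combined Response When `query_text` Is Provided

If `query_text` is supplied, the document embeddings are computed and stored as usual, and a semantic search for the query is then run against the same corpus, model, and pooling method, returning up to 15 of the most similar strings. The JSON payload is wrapped in an object with two top-level keys; the example below shows only the shape, not literal output:

```json
{
  "document_embedding_results": ["...the embedding records, in the requested json_format..."],
  "semantic_search_results": ["...up to 15 most similar strings with their similarity scores..."]
}
```

If `query_text` is omitted, the payload is wrapped under `document_embedding_results` alone.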
### JSON Format Options: @@ -698,8 +700,9 @@ async def get_all_embedding_vectors_for_document( embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD, corpus_identifier_string: str = "", json_format: str = 'records', - token: str = None, send_back_json_or_zip_file: str = 'zip', + query_text: str = None, + token: str = None, req: Request = None ): logger.info(f"Received request with embedding_pooling_method: {embedding_pooling_method}") @@ -790,6 +793,23 @@ async def get_all_embedding_vectors_for_document( raise HTTPException(status_code=400, detail="Error while computing embeddings for document") finally: end_resource_monitoring(context) + + if query_text: + search_request = SemanticSearchRequest( + query_text=query_text, + llm_model_name=llm_model_name, + embedding_pooling_method=embedding_pooling_method, + corpus_identifier_string=corpus_identifier_string, + number_of_most_similar_strings_to_return=15 + ) + search_response = await search_stored_embeddings_with_query_string_for_semantic_similarity(search_request, req, token) + search_results = search_response["results"] + json_content_dict = {"document_embedding_results": json.loads(json_content), "semantic_search_results": search_results} + json_content = json.dumps(json_content_dict) + else: + json_content_dict = {"document_embedding_results": json.loads(json_content)} + json_content = json.dumps(json_content_dict) + overall_total_time = (datetime.utcnow() - request_time).total_seconds() json_content_length = len(json_content) if json_content_length > 0: @@ -801,10 +821,10 @@ async def get_all_embedding_vectors_for_document( else: original_filename_without_extension, _ = os.path.splitext(file.filename if file else os.path.basename(url)) json_file_path = f"/tmp/{original_filename_without_extension}.json" - with open(json_file_path, 'wb') as json_file: + with open(json_file_path, 'w') as json_file: json_file.write(json_content) zip_file_path = f"/tmp/{original_filename_without_extension}.zip" - with zipfile.ZipFile(zip_file_path, 'w') as zipf: + with zipfile.ZipFile(zip_file_path, 'w', compression=zipfile.ZIP_DEFLATED) as zipf: zipf.write(json_file_path, os.path.basename(json_file_path)) logger.info(f"Returning ZIP response for document containing {len(sentences):,} sentences with model {llm_model_name}; first 100 characters out of {json_content_length:,} total of JSON response: {json_content[:100]}") return FileResponse(zip_file_path, headers={"Content-Disposition": f"attachment; filename={original_filename_without_extension}.zip"}) @@ -891,7 +911,6 @@ async def get_text_completions_from_input_prompt(request: TextCompletionRequest, if USE_SECURITY_TOKEN and use_hardcoded_security_token and (token is None or token != SECURITY_TOKEN): logger.warning(f"Unauthorized request from client IP {client_ip}") raise HTTPException(status_code=403, detail="Unauthorized") - request = fill_default_values_in_request(request) context = start_resource_monitoring("get_text_completions_from_input_prompt", request.dict(), client_ip) try: unique_id = f"text_completion_{hash(request.input_prompt)}_{request.llm_model_name}" @@ -1105,7 +1124,7 @@ async def turn_pydantic_model_description_into_bnf_grammar_for_llm( - `size`: Size of the audio file in bytes to verify completeness. - `compute_embeddings_for_resulting_transcript_document`: Boolean to indicate if document embeddings should be computed (optional, defaults to True). - `llm_model_name`: The language model used for computing embeddings (optional, defaults to the default model name). 
-- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four', 'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). +- `embedding_pooling_method`: The method used to pool the embeddings (Choices: 'svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection'; default is 'svd'). - `req`: HTTP Request object for additional request metadata (optional). - `token`: Security token for API access (optional). - `client_ip`: Client IP for logging and security (optional).
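A closing note on the `embeddings_data_models.py` changes above: because defaults are now declared directly on the Pydantic models, partially specified requests pick up the default model, pooling method, and completion settings without the removed `fill_default_values_in_request()` helper. Below is a minimal sketch of the resulting behavior, assuming the `DEFAULT_*` constants defined in that module:

```python
# Illustrative only: demonstrates the field defaults added in embeddings_data_models.py.
from embeddings_data_models import EmbeddingRequest, TextCompletionRequest

# Omitted fields now fall back to the module-level defaults automatically.
embedding_request = EmbeddingRequest(text="What is a Swiss Army Llama?")
print(embedding_request.llm_model_name)            # DEFAULT_MODEL_NAME
print(embedding_request.embedding_pooling_method)  # DEFAULT_EMBEDDING_POOLING_METHOD

completion_request = TextCompletionRequest(input_prompt="Tell me a joke about llamas.")
print(completion_request.temperature)                         # DEFAULT_COMPLETION_TEMPERATURE
print(completion_request.number_of_completions_to_generate)   # DEFAULT_NUMBER_OF_COMPLETIONS_TO_GENERATE
```

One behavioral difference worth keeping in mind: the old helper replaced explicit `None` values, whereas field defaults only apply when a field is omitted, so passing `llm_model_name=None` will now raise a validation error rather than being silently substituted.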