Commit

Fixes and included semantic search in document embedding endpoint
Dicklesworthstone committed May 27, 2024
1 parent 46c0841 commit 900464f
Showing 5 changed files with 105 additions and 202 deletions.
45 changes: 5 additions & 40 deletions README.md
@@ -11,16 +11,9 @@ The Swiss Army Llama is designed to facilitate and optimize the process of worki
Some additional useful endpoints are provided, such as computing semantic similarity between submitted text strings. The service leverages a high-performance Rust-based library, `fast_vector_similarity`, to offer a range of similarity measures including `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_similarity`, and [`hoeffding_d`](https://blogs.sas.com/content/iml/2021/05/03/examples-hoeffding-d.html). Additionally, semantic search across all your cached embeddings is supported using FAISS vector searching. You can either use the built-in cosine similarity from FAISS, or supplement this with a second pass that computes the more sophisticated similarity measures for the most relevant subset of the stored vectors found using cosine similarity (see the advanced semantic search endpoint for this functionality).
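To make the two-pass flow concrete, here is a minimal sketch of the idea (not the service's actual code), assuming embeddings are stored as unit-normalized NumPy rows so that inner product equals cosine similarity; SciPy's Spearman's rho stands in for the second-pass measures that the service computes with `fast_vector_similarity`:

```python
import faiss  # pip install faiss-cpu
import numpy as np
from scipy.stats import spearmanr

def two_pass_search(query_vec, stored_vecs, top_k=10, first_pass_k=100):
    # First pass: cosine similarity via FAISS inner product over unit-normalized vectors.
    index = faiss.IndexFlatIP(stored_vecs.shape[1])
    index.add(stored_vecs.astype(np.float32))
    first_pass_k = min(first_pass_k, stored_vecs.shape[0])
    _, ids = index.search(query_vec.astype(np.float32).reshape(1, -1), first_pass_k)
    # Second pass: re-rank only the small candidate subset with a costlier measure.
    rescored = []
    for i in ids[0]:
        rho, _ = spearmanr(query_vec, stored_vecs[i])
        rescored.append((int(i), float(rho)))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]
```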

Also, we now support multiple embedding pooling methods for combining token-level embedding vectors into a single fixed-length embedding vector for any length of input text, including the following:
-- `means`: Element-wise average of the token embeddings.
-- `means_mins_maxes`: Concatenation of element-wise mean, min, and max of the token embeddings.
-- `means_mins_maxes_stds_kurtoses`: Concatenation of element-wise mean, min, max, standard deviation, and kurtosis of the token embeddings.
- `svd`: Concatenation of the first two singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix.
- `svd_first_four`: Concatenation of the first four singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix.
-- `gram_matrix`: Flattened Gram matrix (dot product of the token embeddings matrix with its transpose).
-- `qr_decomposition`: Concatenation of the flattened Q and R matrices from QR decomposition of the token embeddings.
-- `cholesky_decomposition`: Flattened lower triangular matrix from Cholesky decomposition of the covariance matrix of the token embeddings.
- `ica`: Flattened independent components obtained from Independent Component Analysis (ICA) of the token embeddings.
-- `nmf`: Flattened components obtained from Non-Negative Matrix Factorization (NMF) of the token embeddings.
- `factor_analysis`: Flattened factors obtained from Factor Analysis of the token embeddings.
- `gaussian_random_projection`: Flattened embeddings obtained from Gaussian Random Projection of the token embeddings.
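As a rough illustration of the simplest statistical pooling methods (a sketch, not the repo's implementation; note that this commit drops several of them from the supported set), each reduces a `(num_tokens, dim)` token-embedding matrix along the token axis:

```python
import numpy as np

def pool_means(token_embeddings: np.ndarray) -> np.ndarray:
    # `means`: element-wise average over tokens, (num_tokens, dim) -> (dim,)
    return token_embeddings.mean(axis=0)

def pool_means_mins_maxes(token_embeddings: np.ndarray) -> np.ndarray:
    # `means_mins_maxes`: concatenated mean, min, and max -> (3 * dim,)
    return np.concatenate([
        token_embeddings.mean(axis=0),
        token_embeddings.min(axis=0),
        token_embeddings.max(axis=0),
    ])
```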

@@ -694,50 +687,22 @@ The primary goal of these pooling methods is to retain as much useful informatio

#### Explanation of Pooling Methods

-1. **Means**:
-- **How it works**: Computes the element-wise average of the token embeddings.
-- **Rationale**: The mean pooling method provides a simple yet effective way to summarize the central tendency of the token embeddings, capturing the overall semantic content of the text.
-
-2. **Means_Mins_Maxes**:
-- **How it works**: Concatenates the element-wise mean, min, and max of the token embeddings.
-- **Rationale**: This method captures the central tendency (mean) as well as the range (min and max) of the embeddings, providing a richer representation by considering the distribution of values.
-
-3. **Means_Mins_Maxes_Stds_Kurtoses**:
-- **How it works**: Concatenates the element-wise mean, min, max, standard deviation, and kurtosis of the token embeddings.
-- **Rationale**: This method captures various statistical properties of the embeddings, including their central tendency, variability, and distribution shape, offering a comprehensive summary of the token embeddings.
-
-4. **SVD (Singular Value Decomposition)**:
+1. **SVD (Singular Value Decomposition)**:
- **How it works**: Concatenates the first two singular vectors obtained from the SVD of the token embeddings matrix.
- **Rationale**: SVD is a dimensionality reduction technique that captures the most important features of the data. Using the first two singular vectors provides a compact representation that retains significant information.

-5. **SVD_First_Four**:
+2. **SVD_First_Four**:
- **How it works**: Uses the first four singular vectors obtained from the SVD of the token embeddings matrix.
- **Rationale**: By using more singular vectors, this method captures more of the variance in the data, providing a richer representation while still reducing dimensionality.

-6. **Gram_Matrix**:
-- **How it works**: Computes the Gram matrix (dot product of the embeddings matrix with its transpose) and flattens it.
-- **Rationale**: The Gram matrix captures the pairwise similarities between token embeddings, providing a summary of their relationships.
-
-7. **QR_Decomposition**:
-- **How it works**: Performs QR decomposition on the embeddings matrix and concatenates the flattened Q and R matrices.
-- **Rationale**: QR decomposition provides an orthogonal basis (Q) and upper triangular matrix (R), summarizing the embeddings in terms of these basis vectors and their coefficients.
-
-8. **Cholesky_Decomposition**:
-- **How it works**: Performs Cholesky decomposition on the covariance matrix of the embeddings and flattens the resulting matrix.
-- **Rationale**: This method factors the covariance matrix into a lower triangular matrix, capturing the structure of the variance in the embeddings.
-
-9. **ICA (Independent Component Analysis)**:
+3. **ICA (Independent Component Analysis)**:
- **How it works**: Applies ICA to the embeddings matrix to find statistically independent components, then flattens the result.
- **Rationale**: ICA is useful for identifying independent sources in the data, providing a representation that highlights these independent features.

-10. **NMF (Non-Negative Matrix Factorization)**:
-- **How it works**: Applies NMF to the embeddings matrix and flattens the result.
-- **Rationale**: NMF finds parts-based representations by factorizing the data into non-negative components, useful for interpretability and feature extraction.
-
-11. **Factor_Analysis**:
+4. **Factor_Analysis**:
- **How it works**: Applies factor analysis to the embeddings matrix to identify underlying factors, then flattens the result.
- **Rationale**: Factor analysis models the data in terms of latent factors, providing a summary that captures these underlying influences.

-12. **Gaussian_Random_Projection**:
+5. **Gaussian_Random_Projection**:
- **How it works**: Applies Gaussian random projection to reduce the dimensionality of the embeddings, then flattens the result.
- **Rationale**: This method provides a fast and efficient way to reduce dimensionality while preserving the pairwise distances between points, useful for large datasets.
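For the decomposition- and projection-based methods that remain supported, a rough NumPy/scikit-learn sketch follows (the service's actual implementation may differ in details such as component counts and output shapes):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, FastICA
from sklearn.random_projection import GaussianRandomProjection

def pool_svd(token_embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    # Concatenate the first k right-singular vectors (k=2 for `svd`,
    # k=4 for `svd_first_four`); output length is k * dim regardless of token count.
    _, _, vt = np.linalg.svd(token_embeddings, full_matrices=False)
    return vt[:k].flatten()

def pool_ica(token_embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    # Flatten the fitted ICA component matrix (k x dim), so the output
    # length does not depend on the number of tokens.
    return FastICA(n_components=k).fit(token_embeddings).components_.flatten()

def pool_factor_analysis(token_embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    # Flatten the loading matrix of k latent factors (k x dim).
    return FactorAnalysis(n_components=k).fit(token_embeddings).components_.flatten()

def pool_random_projection(token_embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    # Project tokens into k dimensions and flatten the result
    # (output length here scales with the number of tokens).
    return GaussianRandomProjection(n_components=k).fit_transform(token_embeddings).flatten()
```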
101 changes: 27 additions & 74 deletions embeddings_data_models.py
@@ -84,17 +84,17 @@ def update_document_hash_on_remove(target, value, initiator):
# Request/Response models start here:

class EmbeddingRequest(BaseModel):
-    text: str
-    llm_model_name: str
-    embedding_pooling_method: str
-    corpus_identifier_string: str
+    text: str = ""
+    llm_model_name: str = DEFAULT_MODEL_NAME
+    embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD
+    corpus_identifier_string: str = ""

class SimilarityRequest(BaseModel):
-    text1: str
-    text2: str
-    llm_model_name: str
-    embedding_pooling_method: str
-    similarity_measure: str
+    text1: str = ""
+    text2: str = ""
+    llm_model_name: str = DEFAULT_MODEL_NAME
+    embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD
+    similarity_measure: str = "all"
    @field_validator('similarity_measure')
    def validate_similarity_measure(cls, value):
        valid_measures = ["all", "spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
@@ -103,11 +103,11 @@ def validate_similarity_measure(cls, value):
        return value.lower()

class SemanticSearchRequest(BaseModel):
-    query_text: str
-    number_of_most_similar_strings_to_return: int
-    llm_model_name: str
-    embedding_pooling_method: str
-    corpus_identifier_string: str
+    query_text: str = ""
+    number_of_most_similar_strings_to_return: int = 10
+    llm_model_name: str = DEFAULT_MODEL_NAME
+    embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD
+    corpus_identifier_string: str = ""
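With these defaults in place, a client can send just the query text and rely on the model to fill in the rest; a hypothetical example (the endpoint path and port below are illustrative, not taken from this diff):

```python
import requests

# Illustrative endpoint path and port -- consult the service's API docs for the real ones.
response = requests.post(
    "http://localhost:8089/semantic_search/",
    json={"query_text": "fast vector similarity"},  # remaining fields use their defaults
)
print(response.json())
```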

class SemanticSearchResponse(BaseModel):
    query_text: str
@@ -116,13 +116,13 @@ class SemanticSearchResponse(BaseModel):
    results: List[dict]  # List of similar strings and their similarity scores using cosine similarity with Faiss (in descending order)

class AdvancedSemanticSearchRequest(BaseModel):
-    query_text: str
-    llm_model_name: str
-    embedding_pooling_method: str
-    corpus_identifier_string: str
-    similarity_filter_percentage: float
-    number_of_most_similar_strings_to_return: int
-    result_sorting_metric: str
+    query_text: str = ""
+    llm_model_name: str = DEFAULT_MODEL_NAME
+    embedding_pooling_method: str = DEFAULT_EMBEDDING_POOLING_METHOD
+    corpus_identifier_string: str = ""
+    similarity_filter_percentage: float = 0.01
+    number_of_most_similar_strings_to_return: int = 10
+    result_sorting_metric: str = "hoeffding_d"
    @field_validator('result_sorting_metric')
    def validate_similarity_measure(cls, value):
        valid_measures = ["all", "spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
@@ -168,12 +168,12 @@ class AllDocumentsResponse(BaseModel):
    documents: List[str]

class TextCompletionRequest(BaseModel):
-    input_prompt: str
-    llm_model_name: str
-    temperature: float
-    grammar_file_string: str
-    number_of_tokens_to_generate: int
-    number_of_completions_to_generate: int
+    input_prompt: str = ""
+    llm_model_name: str = DEFAULT_MODEL_NAME
+    temperature: float = DEFAULT_COMPLETION_TEMPERATURE
+    grammar_file_string: str = ""
+    number_of_tokens_to_generate: int = DEFAULT_MAX_COMPLETION_TOKENS
+    number_of_completions_to_generate: int = DEFAULT_NUMBER_OF_COMPLETIONS_TO_GENERATE

class TextCompletionResponse(BaseModel):
    input_prompt: str
@@ -241,50 +241,3 @@ class AddGrammarRequest(BaseModel):

class AddGrammarResponse(BaseModel):
    valid_grammar_files: List[str]

-def fill_default_values_in_request(request):
-    if isinstance(request, EmbeddingRequest):
-        if request.llm_model_name is None:
-            request.llm_model_name = DEFAULT_MODEL_NAME
-        if request.embedding_pooling_method is None:
-            request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD
-        if request.corpus_identifier_string is None:
-            request.corpus_identifier_string = ""
-    elif isinstance(request, SimilarityRequest):
-        if request.llm_model_name is None:
-            request.llm_model_name = DEFAULT_MODEL_NAME
-        if request.embedding_pooling_method is None:
-            request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD
-        if request.similarity_measure is None:
-            request.similarity_measure = "all"
-    elif isinstance(request, SemanticSearchRequest):
-        if request.llm_model_name is None:
-            request.llm_model_name = DEFAULT_MODEL_NAME
-        if request.embedding_pooling_method is None:
-            request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD
-        if request.corpus_identifier_string is None:
-            request.corpus_identifier_string = ""
-    elif isinstance(request, AdvancedSemanticSearchRequest):
-        if request.llm_model_name is None:
-            request.llm_model_name = DEFAULT_MODEL_NAME
-        if request.embedding_pooling_method is None:
-            request.embedding_pooling_method = DEFAULT_EMBEDDING_POOLING_METHOD
-        if request.corpus_identifier_string is None:
-            request.corpus_identifier_string = ""
-        if request.similarity_filter_percentage is None:
-            request.similarity_filter_percentage = 0.01
-        if request.number_of_most_similar_strings_to_return is None:
-            request.number_of_most_similar_strings_to_return = 10
-        if request.result_sorting_metric is None:
-            request.result_sorting_metric = "hoeffding_d"
-    elif isinstance(request, TextCompletionRequest):
-        if request.llm_model_name is None:
-            request.llm_model_name = DEFAULT_MODEL_NAME
-        if request.temperature is None:
-            request.temperature = DEFAULT_COMPLETION_TEMPERATURE
-        if request.grammar_file_string is None:
-            request.grammar_file_string = ""
-        if request.number_of_tokens_to_generate is None:
-            request.number_of_tokens_to_generate = DEFAULT_MAX_COMPLETION_TOKENS
-        if request.number_of_completions_to_generate is None:
-            request.number_of_completions_to_generate = DEFAULT_NUMBER_OF_COMPLETIONS_TO_GENERATE
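The removed helper above is made redundant by the field defaults added to the models: Pydantic now fills in any omitted field at validation time, so there is no need to patch requests after the fact. A minimal sketch of the new behavior:

```python
# Omitted fields are populated from the declared defaults automatically:
request = EmbeddingRequest(text="some input text")
assert request.llm_model_name == DEFAULT_MODEL_NAME
assert request.embedding_pooling_method == DEFAULT_EMBEDDING_POOLING_METHOD
```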
3 changes: 1 addition & 2 deletions end_to_end_tests.py
@@ -30,8 +30,7 @@ async def get_model_names() -> List[str]:
    return [name for name in model_names if "llava" not in name]

async def get_embedding_pooling_methods() -> List[str]:
-    pooling_methods = ['means', 'means_mins_maxes', 'means_mins_maxes_stds_kurtoses', 'svd', 'svd_first_four',
-                       'qr_decomposition', 'cholesky_decomposition', 'ica', 'nmf', 'factor_analysis', 'gaussian_random_projection']
+    pooling_methods = ['svd', 'svd_first_four', 'ica', 'factor_analysis', 'gaussian_random_projection']
    print(f"Using embedding pooling methods: {pooling_methods}")
    return pooling_methods
