Break classes up
Get all filepaths working for chat/completions

Check invalid input type combinations

Catch JSON

Fix tests, add copyrights

Fix vLLM backend, add JSON input file check

Fix TRT-LLM backend

Remove unused imports
dyastremsky committed Jul 15, 2024
1 parent db888f1 commit 4cd5a1a
Showing 13 changed files with 793 additions and 1,926 deletions.
16 changes: 8 additions & 8 deletions src/c++/perf_analyzer/genai-perf/docs/embeddings.md
@@ -36,18 +36,18 @@ GenAI-Perf allows you to profile embedding models running on an
To create a sample embeddings input file, use the following command:

```bash
-echo '{"text": "What was the first car ever driven?"}
-{"text": "Who served as the 5th President of the United States of America?"}
-{"text": "Is the Sydney Opera House located in Australia?"}
-{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
+echo '{"text_input": "What was the first car ever driven?"}
+{"text_input": "Who served as the 5th President of the United States of America?"}
+{"text_input": "Is the Sydney Opera House located in Australia?"}
+{"text_input": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This will generate a file named embeddings.jsonl with the following content:
```jsonl
-{"text": "What was the first car ever driven?"}
-{"text": "Who served as the 5th President of the United States of America?"}
-{"text": "Is the Sydney Opera House located in Australia?"}
-{"text": "In what state did they film Shrek 2?"}
+{"text_input": "What was the first car ever driven?"}
+{"text_input": "Who served as the 5th President of the United States of America?"}
+{"text_input": "Is the Sydney Opera House located in Australia?"}
+{"text_input": "In what state did they film Shrek 2?"}
```
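
The same file can also be produced programmatically. The following is a minimal sketch, not part of this commit: the helper name `write_jsonl` and the temporary output location are assumptions made for illustration. It writes one JSON object per line, each with the single `text_input` key that the updated documentation uses.

```python
import json
import tempfile
from pathlib import Path

PROMPTS = [
    "What was the first car ever driven?",
    "Who served as the 5th President of the United States of America?",
    "Is the Sydney Opera House located in Australia?",
    "In what state did they film Shrek 2?",
]


def write_jsonl(prompts, path):
    """Hypothetical helper: write one {"text_input": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"text_input": prompt}) + "\n")


# Write into a temporary directory so the sketch does not touch real files.
out_path = Path(tempfile.mkdtemp()) / "embeddings.jsonl"
write_jsonl(PROMPTS, out_path)
lines = out_path.read_text(encoding="utf-8").splitlines()
```

Using `json.dumps` for each line avoids quoting mistakes that are easy to make when prompts contain quotes or apostrophes inside a shell `echo`.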

## Start an OpenAI Embeddings-Compatible Server
16 changes: 8 additions & 8 deletions src/c++/perf_analyzer/genai-perf/docs/rankings.md
@@ -44,19 +44,19 @@ mkdir rankings_jsonl
Inside this directory, create a JSONL file named queries.jsonl with queries data:

```bash
-echo '{"text": "What was the first car ever driven?"}
-{"text": "Who served as the 5th President of the United States of America?"}
-{"text": "Is the Sydney Opera House located in Australia?"}
-{"text": "In what state did they film Shrek 2?"}' > rankings_jsonl/queries.jsonl
+echo '{"text_input": "What was the first car ever driven?"}
+{"text_input": "Who served as the 5th President of the United States of America?"}
+{"text_input": "Is the Sydney Opera House located in Australia?"}
+{"text_input": "In what state did they film Shrek 2?"}' > rankings_jsonl/queries.jsonl
```

Create another JSONL file named passages.jsonl with passages data:

```bash
-echo '{"text": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
-{"text": "Kevin Loader is a British film and television producer."}
-{"text": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
-{"text": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}' > rankings_jsonl/passages.jsonl
+echo '{"text_input": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
+{"text_input": "Kevin Loader is a British film and television producer."}
+{"text_input": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
+{"text_input": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}' > rankings_jsonl/passages.jsonl
```
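
To illustrate how a directory of JSONL files such as this one can be combined into a single dataset, here is a rough sketch that pairs the i-th entry of each file into one row keyed by file name. It is a standalone approximation for illustration only: the function name `load_directory`, the blank-line skipping, and the length check are assumptions and are not taken from GenAI-Perf.

```python
import json
import tempfile
from pathlib import Path


def load_directory(directory):
    """Combine every *.jsonl file in a directory into rows keyed by file stem."""
    data = {}
    for file_path in sorted(Path(directory).glob("*.jsonl")):
        with open(file_path, "r", encoding="utf-8") as f:
            # Assumption: blank lines are skipped rather than parsed.
            data[file_path.stem] = [json.loads(line) for line in f if line.strip()]
    if not data:
        raise ValueError("No JSONL files found in the directory.")
    lengths = {len(entries) for entries in data.values()}
    if len(lengths) != 1:
        # Assumption: all files must have the same number of entries.
        raise ValueError("All JSONL files must contain the same number of lines.")
    num_entries = lengths.pop()
    rows = [{"row": {key: data[key][i] for key in data}} for i in range(num_entries)]
    return {"rows": rows}


# Build a small demo directory with two entries per file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "queries.jsonl").write_text(
        '{"text_input": "query one"}\n{"text_input": "query two"}\n',
        encoding="utf-8",
    )
    (demo_dir / "passages.jsonl").write_text(
        '{"text_input": "passage one"}\n{"text_input": "passage two"}\n',
        encoding="utf-8",
    )
    dataset = load_directory(demo_dir)
```

Each resulting row holds one `queries` entry and one `passages` entry, which is why the two files are expected to line up entry by entry.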

## Start a Hugging Face Re-Ranker-Compatible Server
@@ -0,0 +1,99 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path
from typing import Any, Dict, List

import requests
from genai_perf.exceptions import GenAIPerfException
from genai_perf.llm_inputs.synthetic_prompt_generator import SyntheticPromptGenerator
from genai_perf.tokenizer import Tokenizer
from genai_perf.utils import load_json_str


class DatasetRetriever:
    @staticmethod
    def from_url(url: str, starting_index: int, length: int) -> List[Dict[str, Any]]:
        url += f"&offset={starting_index}&length={length}"
        response = requests.get(url)
        response.raise_for_status()
        dataset = response.json()
        rows = dataset.get("rows", [])[starting_index : starting_index + length]
        formatted_rows = [
            {
                "text_input": row["row"].get("question", ""),
                "system_prompt": row["row"].get("system_prompt", ""),
                "response": row["row"].get("response", ""),
            }
            for row in rows
        ]
        return formatted_rows

    @staticmethod
    def from_file(file_path: Path) -> List[Dict[str, str]]:
        with open(file_path, "r") as file:
            data = [load_json_str(line) for line in file]

            for item in data:
                if not isinstance(item, dict):
                    raise GenAIPerfException(
                        "File content is not in the expected format."
                    )
                if "text_input" not in item:
                    raise GenAIPerfException(
                        "Missing 'text_input' field in one or more items."
                    )
                if len(item) != 1 or "text_input" not in item:
                    raise GenAIPerfException(
                        "Each item must only contain the 'text_input' field."
                    )

        return [{"text_input": item["text_input"]} for item in data]

    @staticmethod
    def from_directory(directory_path: Path) -> Dict:
        # TODO: Add support for an extra preprocessing step after loading the files to optionally create/modify the dataset.
        # For files calling this method (e.g. rankings), it is a must to create the dataset before converting to the generic format.
        dataset: Dict = {"rows": []}
        data = {}

        # Check all JSONL files in the directory
        for file_path in directory_path.glob("*.jsonl"):
            # Get the file name without suffix
            key = file_path.stem
            with open(file_path, "r") as file:
                data[key] = [load_json_str(line) for line in file]

        # Create rows with keys based on file names without suffix
        num_entries = len(next(iter(data.values())))
        for i in range(num_entries):
            row = {key: data[key][i] for key in data}
            dataset["rows"].append({"row": row})

        return dataset

    @staticmethod
    def from_synthetic(
        tokenizer: Tokenizer,
        prompt_tokens_mean: int,
        prompt_tokens_stddev: int,
        num_of_output_prompts: int,
    ) -> List[Dict[str, str]]:
        synthetic_prompts = []
        for _ in range(num_of_output_prompts):
            synthetic_prompt = SyntheticPromptGenerator.create_synthetic_prompt(
                tokenizer, prompt_tokens_mean, prompt_tokens_stddev
            )
            synthetic_prompts.append({"text_input": synthetic_prompt})
        return synthetic_prompts
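
The checks in `from_file` explain why the documentation examples in this commit switch from `text` to `text_input`. The sketch below reproduces the idea without depending on `genai_perf`, so it can be run on its own. It is an approximation: `validate_items` is a hypothetical name, `json.loads` stands in for `load_json_str`, `ValueError` stands in for `GenAIPerfException`, and skipping blank lines is an added assumption.

```python
import json


def validate_items(lines):
    """Accept only JSON objects whose single key is 'text_input'."""
    items = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored.
        item = json.loads(line)
        if not isinstance(item, dict):
            raise ValueError(f"Line {number}: expected a JSON object.")
        if "text_input" not in item:
            raise ValueError(f"Line {number}: missing 'text_input' field.")
        if len(item) != 1:
            raise ValueError(f"Line {number}: only 'text_input' is allowed.")
        items.append({"text_input": item["text_input"]})
    return items


# The new key passes validation.
valid = validate_items(['{"text_input": "In what state did they film Shrek 2?"}'])

# The old key from the previous documentation is rejected.
try:
    validate_items(['{"text": "In what state did they film Shrek 2?"}'])
    old_key_error = None
except ValueError as exc:
    old_key_error = str(exc)
```
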
@@ -0,0 +1,45 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Any, Dict, List


class JSONConverter:
    @staticmethod
    def to_generic(dataset: List[Dict[str, Any]]) -> Dict:
        if isinstance(dataset, list) and len(dataset) > 0:
            if isinstance(dataset[0], dict):
                # Assume dataset is a list of dictionaries
                converted_data = []
                for item in dataset:
                    row_data = {
                        "text_input": item.get("text_input", ""),
                        "system_prompt": item.get("system_prompt", ""),
                        "response": item.get("response", ""),
                    }
                    converted_data.append(row_data)
                return {
                    "features": ["text_input", "system_prompt", "response"],
                    "rows": [{"row": item} for item in converted_data],
                }
            elif isinstance(dataset[0], str):
                # Assume dataset is a list of strings
                return {
                    "features": ["text_input"],
                    "rows": [{"row": {"text_input": item}} for item in dataset],
                }
            else:
                raise ValueError("Dataset is not in a recognized format.")
        else:
            raise ValueError("Dataset is empty or not in a recognized format.")
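
To make the shape of the generic format concrete, the following standalone sketch mirrors the two branches of `to_generic` and shows what each returns for a one-element input. The function `to_generic_sketch` is a condensed stand-in written for illustration so that it runs without `genai_perf`; it is not the class from this commit.

```python
from typing import Any, Dict, List, Union


def to_generic_sketch(dataset: List[Union[Dict[str, Any], str]]) -> Dict[str, Any]:
    """Illustrative stand-in that follows the branches of JSONConverter.to_generic."""
    if not isinstance(dataset, list) or not dataset:
        raise ValueError("Dataset is empty or not in a recognized format.")
    first = dataset[0]
    if isinstance(first, dict):
        # Dictionaries are normalized to three fields, with "" for missing ones.
        fields = ["text_input", "system_prompt", "response"]
        rows = [{"row": {f: item.get(f, "") for f in fields}} for item in dataset]
        return {"features": fields, "rows": rows}
    if isinstance(first, str):
        # Plain strings become rows with only a text_input field.
        rows = [{"row": {"text_input": text}} for text in dataset]
        return {"features": ["text_input"], "rows": rows}
    raise ValueError("Dataset is not in a recognized format.")


from_dicts = to_generic_sketch([{"text_input": "Hello"}])
from_strings = to_generic_sketch(["Hello"])
```

Note that only the first element decides which branch is taken, so a list mixing dictionaries and strings would not be handled consistently by this approach.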
@@ -0,0 +1,27 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from pathlib import Path
from typing import Dict

from genai_perf.constants import DEFAULT_INPUT_DATA_JSON


class JSONWriter:
    @staticmethod
    def write_to_file(json_data: Dict, output_dir: Path) -> None:
        filename = output_dir / DEFAULT_INPUT_DATA_JSON
        with open(filename, "w") as f:
            f.write(json.dumps(json_data, indent=2))
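
A quick way to see what the writer produces is to write a small generic dataset and read it back. The sketch below does this in a temporary directory. The file name `input_data.json` is a placeholder assumption, since the actual value of `DEFAULT_INPUT_DATA_JSON` is not shown in this diff, and the function here returns the written path purely for convenience in the example.

```python
import json
import tempfile
from pathlib import Path

# Assumption: placeholder for genai_perf.constants.DEFAULT_INPUT_DATA_JSON,
# whose real value is not visible in this diff.
INPUT_DATA_FILENAME = "input_data.json"


def write_to_file(json_data, output_dir):
    """Write the dataset as indented JSON and return the path that was written."""
    filename = Path(output_dir) / INPUT_DATA_FILENAME
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(json_data, f, indent=2)
    return filename


generic = {"features": ["text_input"], "rows": [{"row": {"text_input": "Hello"}}]}

with tempfile.TemporaryDirectory() as tmp:
    written = write_to_file(generic, tmp)
    # Read the file back to confirm the JSON round-trips unchanged.
    round_tripped = json.loads(written.read_text(encoding="utf-8"))
```
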