-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support payload input in GAP #219
base: feature-request-level-timing
Are you sure you want to change the base?
Changes from all commits
2f333df
3f6e35c
bf2bc21
52020cc
482a3b7
6964ee0
6096b2c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
# | ||
# Redistribution and use in source and binary forms, with or without | ||
# modification, are permitted provided that the following conditions | ||
# are met: | ||
# * Redistributions of source code must retain the above copyright | ||
# notice, this list of conditions and the following disclaimer. | ||
# * Redistributions in binary form must reproduce the above copyright | ||
# notice, this list of conditions and the following disclaimer in the | ||
# documentation and/or other materials provided with the distribution. | ||
# * Neither the name of NVIDIA CORPORATION nor the names of its | ||
# contributors may be used to endorse or promote products derived | ||
# from this software without specific prior written permission. | ||
# | ||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY | ||
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR | ||
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR | ||
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, | ||
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, | ||
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR | ||
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY | ||
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | ||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
|
||
|
||
from pathlib import Path | ||
from typing import Any, Dict, List, Tuple, Union | ||
|
||
from genai_perf.inputs.retrievers.base_input_retriever import BaseInputRetriever | ||
from genai_perf.inputs.retrievers.generic_dataset import FileData, GenericDataset | ||
|
||
|
||
class BaseFileInputRetriever(BaseInputRetriever): | ||
""" | ||
A base input retriever class that defines file input methods. | ||
""" | ||
|
||
def _verify_file(self, filename: Path) -> None: | ||
""" | ||
Verifies that the file exists. | ||
|
||
Args | ||
---------- | ||
filename : Path | ||
The file path to verify. | ||
|
||
Raises | ||
------ | ||
FileNotFoundError | ||
If the file does not exist. | ||
""" | ||
if not filename.exists(): | ||
raise FileNotFoundError(f"The file '{filename}' does not exist.") | ||
|
||
def _get_content_from_input_file(self, filename: Path) -> Union[ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you want to organize these into the implemented ones vs the non-implemented ones? And perhaps alphabetize them within each of those sections for easy navigation? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry. This was also my mess. I agree that it needs to be cleaned up more. |
||
Tuple[List[str], List[str]], | ||
Tuple[List[str], List[str], List[Dict[Any, Any]]], | ||
]: | ||
""" | ||
Reads the content from a JSONL file and returns lists of each content type. | ||
|
||
""" | ||
raise NotImplementedError("This method should be implemented by subclasses.") | ||
|
||
def _get_input_dataset_from_file(self, filename: Path) -> FileData: | ||
""" | ||
Retrieves the dataset from a specific JSONL file. | ||
|
||
""" | ||
|
||
raise NotImplementedError("This method should be implemented by subclasses.") | ||
|
||
def retrieve_data(self) -> GenericDataset: | ||
""" | ||
Retrieves the dataset from a file or directory. | ||
""" | ||
raise NotImplementedError("This method should be implemented by subclasses.") |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,25 +25,37 @@ | |
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
|
||
from dataclasses import dataclass, field | ||
from typing import Dict, List, TypeAlias | ||
from typing import Any, Dict, List, TypeAlias, Union | ||
|
||
Filename: TypeAlias = str | ||
TypeOfData: TypeAlias = str | ||
ListOfData: TypeAlias = List[str] | ||
DataRowDict: TypeAlias = Dict[TypeOfData, ListOfData] | ||
DataRowDict: TypeAlias = Dict[str, Union[List[str], Dict[str, Any], str]] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a way to make this alias more straightforward? If it's this complex, I don't think aliasing is helpful anymore. We'd need to find a way to simplify this logic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Our schema, which is what this functionally represents, has gotten more complex over time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe loop in our alias whiz, @nv-braf? My guess though is that if it's this complex, we're doing some overloading with typing that's really unclean and masking other issues. I also would have no idea what's going on here from a quick review of the code. I can't speak to the complexity over time, which we have tried to reduce, but if you look at the before/after, one of these is relatively straightforward and the other is relatively messy. Having three possible datatypes values for a DataRowDict (or anything) doesn't sit right with me. That's too much overloading. We could fix the underlying issue in a future refactor, theoretically, but it looks like this PR is creating the above issue so we should fix it now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My take is that DataRowDict is a "useless" type. I think static typing provides two benefits:
In this case I don't your getting either benefit. This could serve as a form of documentation if the values stored in the Dict were also typed. Something like: |
||
GenericDatasetDict: TypeAlias = Dict[Filename, List[DataRowDict]] | ||
|
||
|
||
@dataclass | ||
class DataRow: | ||
texts: List[str] = field(default_factory=list) | ||
images: List[str] = field(default_factory=list) | ||
timestamp: str = "" | ||
optional_data: Dict[str, Any] = field(default_factory=dict) | ||
|
||
def to_dict(self) -> DataRowDict: | ||
""" | ||
Converts the DataRow object to a dictionary. | ||
""" | ||
return {"texts": self.texts, "images": self.images} | ||
datarow_dict: DataRowDict = {} | ||
|
||
if self.texts: | ||
datarow_dict["texts"] = self.texts | ||
if self.images: | ||
datarow_dict["images"] = self.images | ||
if self.timestamp: | ||
datarow_dict["timestamp"] = self.timestamp | ||
if self.optional_data: | ||
datarow_dict["optional_data"] = self.optional_data | ||
return datarow_dict | ||
|
||
|
||
@dataclass | ||
|
@@ -55,8 +67,8 @@ def to_list(self) -> List[DataRowDict]: | |
Converts the FileData object to a list. | ||
Output format example for two payloads from a file: | ||
[ | ||
{'texts': ['text1', 'text2'], 'images': ['image1', 'image2']}, | ||
{'texts': ['text3', 'text4'], 'images': ['image3', 'image4']} | ||
{'texts': ['text1', 'text2'], 'images': ['image1', 'image2'], 'timestamp': 'timestamp1', 'optional_data': {}}, | ||
{'texts': ['text3', 'text4'], 'images': ['image3', 'image4'], 'timestamp': 'timestamp2', 'optional_data': {}}, | ||
] | ||
""" | ||
return [row.to_dict() for row in self.rows] | ||
|
@@ -71,8 +83,8 @@ def to_dict(self) -> GenericDatasetDict: | |
Converts the entire DataStructure object to a dictionary. | ||
Output format example for one payload from two files: | ||
{ | ||
'file_0': [{'texts': ['text1', 'text2'], 'images': ['image1', 'image2']}], | ||
'file_1': [{'texts': ['text1', 'text2'], 'images': ['image1', 'image2']}] | ||
'file_0': [{'texts': ['text1', 'text2'], 'images': ['image1', 'image2'], 'timestamp': 'timestamp1', 'optional_data': {}}], | ||
'file_1': [{'texts': ['text1', 'text2'], 'images': ['image1', 'image2'], 'timestamp': 'timestamp2', 'optional_data': {}}], | ||
} | ||
""" | ||
return { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this approach. Is there a way to wrap this and make it cleaner? This is going to make the converters less straightforward to write. You'll also need to update the customizable frontends tutorial to reflect any changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean for this converter? OR are you making a general statement about the way this was implemented? Just trying to understand the scope of the ask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the converters. I think this is the "hot path" for adding a new converter (i.e. customizable frontend). We want it to be as simple, concise, and intuitive as possible.
Any changes to this process also need to be documented in customizable_frontends.md.