The implementation of "table origins" is currently a half-baked mess:

- We have a `TableOriginCSV` class, but nothing for tables coming from sources other than CSV files.
- Even `TableOriginCSV` isn't used properly: it's supposed to indicate not only which file the table is in, but also which row the table starts on, yet that row never gets set.
- In fact, `origin` is in most places set to be just a `str`.
- There is currently a flurry of function parameters named `origin` in the reader and parser functions; however, they evidently don't hold a `TableOrigin` but rather... something else, apparently a file name. Confusing nomenclature.
A few things need to get done:

- Each call to `parse_blocks()` should be made with arguments (perhaps kwargs, perhaps a new `OriginContext` parameter object? See the sketch after this list.) that are sufficient to determine the context (this context would usually be the "sheet", in the StarTable sense of the term).
  - For CSV, this would be: file name.
  - For Excel: file name and sheet name.
  - For JsonData: ...?
  - Probably also a clear flag of whether it comes from CSV, Excel, JsonData or something else: we know this would be useful when resolving certain ambiguities during parsing, e.g. in `_parse_datetime_column()` when `val` is `None`.
- Each call to `parse_blocks()` should be interpreted as starting on row 0 of that context/sheet (or row 1, if we want 1-based for Excel?).
- `parse_blocks()` makes a `TableOrigin` on the fly, based on the context it was passed plus the row number it's currently on.
- All variables/parameters that aren't meant to hold a `TableOrigin` should be renamed to something other than `origin`; also check whether they even need to exist.
- Have a `TableOrigin` subclass for each expected kind of origin (CSV, Excel, JsonData, ... raw cell data?), as well as an "anonymous" one that tables get by default unless it is overwritten by one of the others. This means that tables will have an anonymous origin if created literally, and also if they are created in some way we don't yet know how to handle (e.g. imported from a Word doc or whatever), though such cases should in time get their own `TableOrigin` subclass as they become supported in `pdtable`. A possible hierarchy is sketched after this list.
- Also consider whether it would make sense for `TableOrigin` to accept another `TableOrigin` in the constructor, to keep track of a chain of origins. Is there a use case for this?
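To make the above concrete, here is a minimal sketch of what an `OriginContext`, a `TableOrigin` hierarchy with an anonymous default, and the on-the-fly combination of context and row could look like. All fields, constructors and the `InputFormat`/`origin_from` names are placeholders for discussion, not the current pdtable implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class InputFormat(Enum):
    """Flag for the kind of source; handy for format-specific ambiguities during parsing."""
    CSV = auto()
    EXCEL = auto()
    JSONDATA = auto()


@dataclass(frozen=True)
class OriginContext:
    """Everything parse_blocks() needs to locate a block, except the row number."""
    input_format: InputFormat
    file_name: Optional[str] = None   # CSV and Excel
    sheet_name: Optional[str] = None  # Excel only


class TableOrigin:
    """Anonymous origin: the default for tables created literally or from unsupported sources."""
    def __str__(self) -> str:
        return "<anonymous>"


class TableOriginCSV(TableOrigin):
    def __init__(self, file_name: str, row: int):
        self.file_name = file_name
        self.row = row

    def __str__(self) -> str:
        return f'"{self.file_name}" row {self.row}'


class TableOriginExcel(TableOrigin):
    def __init__(self, file_name: str, sheet_name: str, row: int):
        self.file_name = file_name
        self.sheet_name = sheet_name
        self.row = row

    def __str__(self) -> str:
        return f'"{self.file_name}" sheet "{self.sheet_name}" row {self.row}'


def origin_from(context: OriginContext, row: int) -> TableOrigin:
    """What parse_blocks() could do on the fly: combine its context with the current row."""
    if context.input_format is InputFormat.CSV:
        return TableOriginCSV(context.file_name, row)
    if context.input_format is InputFormat.EXCEL:
        return TableOriginExcel(context.file_name, context.sheet_name, row)
    return TableOrigin()  # anonymous until e.g. JsonData gets its own subclass
```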
Yes, that part was at best sketched. My thinking was something along the following lines.

The table origin data must somehow reference metadata for the file being read. This metadata must be passed into the reader by the caller, and could be a file path relative to a project root, records-system metadata, a blob key, or any other file-level metadata. Based on this, the reader must then be able to create a table-level origin object suitable for the caller.

One way of doing this could be to provide a `TableOriginPolicy` object as the `origin` input (or maybe rename that input):
```python
from abc import abstractmethod
from typing import Optional, Protocol

# "TableOrigin" is pdtable's table-origin class, defined elsewhere;
# quoted annotations avoid assuming a particular import path here.


class TableOriginPolicy(Protocol):
    @abstractmethod
    def origin_for(self, **kwargs) -> "TableOrigin":
        """
        Provide table origin for the location specified by keyword arguments.

        Table sources are free to call this method with any arguments, but should
        prefer the convenience methods `origin_for_line` and `origin_for_workbook_location`.
        """
        pass

    def origin_for_line(self, line: int) -> "TableOrigin":
        """
        Returns origin for a line in a text-based input source such as CSV.
        """
        return self.origin_for(line=line)

    def origin_for_workbook_location(self, row: int, column: str, sheet: Optional[str] = None) -> "TableOrigin":
        """
        Returns origin for a cell location in a workbook-based input source such as Excel.
        """
        return self.origin_for(row=row, column=column, sheet=sheet)
```
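As a usage sketch building on the protocol above (the `ProjectFileOrigin` and `ProjectFileOriginPolicy` names are made up for illustration, not existing pdtable API): the caller wraps whatever file-level metadata it has in a policy, and the reader only has to report line numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectFileOrigin:  # stand-in for a concrete TableOrigin subclass
    relative_path: str
    line: int

    def __str__(self) -> str:
        return f"{self.relative_path}:{self.line}"


class ProjectFileOriginPolicy(TableOriginPolicy):
    """Carries the caller's file-level metadata (here, a project-relative path)."""

    def __init__(self, relative_path: str):
        self.relative_path = relative_path

    def origin_for(self, **kwargs) -> ProjectFileOrigin:
        return ProjectFileOrigin(self.relative_path, kwargs.get("line", 0))


# Caller side: build the policy from the file-level metadata and hand it to the reader.
policy = ProjectFileOriginPolicy("inputs/setup.csv")

# Reader side: only the line number is known locally while parsing.
origin = policy.origin_for_line(line=42)
print(origin)  # inputs/setup.csv:42
```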
In terms of actually building a full reader system, we will need a broader context setup -- as we discussed earlier, the pipeline processing as currently sketched will not be able to handle includes etc. Whether this should be part of the core I am not sure.