Give each table a proper TableOrigin #35

Open
jfcorbett opened this issue Sep 21, 2020 · 2 comments
Labels
enhancement (New feature or request), ergonomics (Improve usability, API, discoverability)

Comments

@jfcorbett
Member

jfcorbett commented Sep 21, 2020

The implementation of "table origins" is currently a half-baked mess:

  • Right now we have a TableOriginCSV class, but nothing for tables coming from sources other than CSV files.
  • Even TableOriginCSV doesn't get used properly: it's supposed to indicate not only which file the table is in, but also which row the table is on, yet that row never gets set.
  • In fact, origin is in most places set to be just a str.
  • There is currently a flurry of function parameters named origin in the reader and parser functions; however, they evidently don't hold a TableOrigin but rather... something else, apparently a file name. Confusing nomenclature.

A few things need to get done:

  • Each call to parse_blocks() should be done with arguments (perhaps kwargs, perhaps a new OriginContext parameter object?) that are sufficient to determine the context (this context would usually be the "sheet", in the StarTable sense of the term).
    • For CSV, this would be: file name.
    • For Excel: file name and sheet name.
    • For JsonData: ...?
    • Probably also, a clear flag of whether it comes from CSV, Excel, JsonData, or something else: we know this would be useful when resolving certain ambiguities during parsing, e.g. in _parse_datetime_column() when val is None.
  • Each call to parse_blocks() should be interpreted as starting on row 0 of that context/sheet (or row 1, if we want 1-based for Excel?).
  • parse_blocks() makes a TableOrigin on the fly, based on the context it was passed plus the row number that it's currently on.
  • Rename all variables/parameters that aren't meant to contain a TableOrigin to something other than origin, and check whether they even need to exist.
  • Have a TableOrigin subclass for each expected kind of origin (CSV, Excel, JsonData, ... raw cell data?), as well as an "anonymous" one that tables get by default unless it is overwritten by one of the others. This means that tables will have an anonymous origin if created literally, and also if they get created in some unexpected way, e.g. read from a source that we don't yet know how to handle (imported from a Word doc or whatever), though such cases should in time get their own TableOrigin subclass as they become supported in pdtable. A rough sketch of what this could look like follows this list.
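For illustration only, here is a minimal sketch of what the context object and origin hierarchy could look like. Every name below (InputType, OriginContext, and the subclass fields) is hypothetical, not existing pdtable API:

# Illustrative sketch only -- none of these names exist in pdtable yet.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class InputType(Enum):
    # Explicit flag of the input kind, useful e.g. in _parse_datetime_column()
    # when val is None and the ambiguity must be resolved per source type.
    CSV = auto()
    EXCEL = auto()
    JSON_DATA = auto()
    OTHER = auto()


@dataclass(frozen=True)
class OriginContext:
    # Everything parse_blocks() needs to know about the "sheet" it is parsing;
    # the row number is tracked by parse_blocks() itself.
    input_type: InputType
    file_name: Optional[str] = None
    sheet_name: Optional[str] = None  # Excel only


class TableOrigin:
    # The "anonymous" default origin: for tables created literally, or read
    # from sources we don't yet know how to handle.
    pass


@dataclass(frozen=True)
class TableOriginCSV(TableOrigin):
    file_name: str
    row: int  # row in the file where the table block starts


@dataclass(frozen=True)
class TableOriginExcel(TableOrigin):
    file_name: str
    sheet_name: str
    row: int  # could be 1-based, to match Excel's own numbering

With something like this, parse_blocks() could build, say, TableOriginCSV(context.file_name, row) on the fly whenever it encounters a table block, per the bullets above.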

Also consider whether it would make sense for TableOrigin to accept another TableOrigin in the constructor, to keep track of a chain of origins. Is there a use case for this?

@jfcorbett added the enhancement and ergonomics labels on Sep 21, 2020
@jfcorbett changed the title from "Clarify TableOrigin interface and scope" to "Give each table a proper TableOrigin" on Dec 14, 2020
@jfcorbett
Member Author

@JanusWesenberg Comments on this issue please? This is my next clean-up project! It would be nice to be aligned up front.

@JanusWesenberg
Contributor

Yes, that part was at best sketched. My thinking was something along the following lines.

The table origin data must somehow reference metadata for the file being read. This metadata must be passed into the reader by the caller, and could be a file path relative to a project root, records-system metadata, a blob key, or any other file-level metadata. Based on this, the reader must then be able to create a table-level origin object suitable for the caller.

One way of doing this could be to provide a TableOriginPolicy object as the origin input (or maybe rename that input).


from abc import abstractmethod
from typing import Optional, Protocol


class TableOriginPolicy(Protocol):
    # Assumes a TableOrigin class as discussed above.
    @abstractmethod
    def origin_for(self, **kwargs) -> TableOrigin:
        """
        Provide table origin for the location specified by keyword arguments.

        Table sources are free to call this method with any arguments, but should
        prefer the convenience methods `origin_for_line` and `origin_for_workbook_location`.
        """

    def origin_for_line(self, line: int) -> TableOrigin:
        """
        Returns origin for a line in a text-based input source such as CSV.
        """
        return self.origin_for(line=line)

    def origin_for_workbook_location(self, row: int, column: str, sheet: Optional[str] = None) -> TableOrigin:
        """
        Returns origin for a cell location in a workbook-based input source such as Excel.
        """
        return self.origin_for(row=row, column=column, sheet=sheet)
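
To make the intended division of labor concrete, here is a hypothetical caller-side implementation; CsvFileOriginPolicy is an invented name, and the TableOriginCSV it returns reuses the sketch earlier in this thread:

# Hypothetical concrete policy -- not existing pdtable API. Explicitly
# subclassing the Protocol inherits the convenience methods above.
class CsvFileOriginPolicy(TableOriginPolicy):
    def __init__(self, relative_path: str):
        # File-level metadata held by the caller, e.g. a path relative to a project root.
        self.relative_path = relative_path

    def origin_for(self, **kwargs) -> TableOrigin:
        # A CSV policy only expects the 'line' keyword; combine it with the
        # caller-held file metadata into a table-level origin.
        return TableOriginCSV(file_name=self.relative_path, row=kwargs["line"])


policy = CsvFileOriginPolicy("data/input.csv")
origin = policy.origin_for_line(42)  # what a reader would do per table block

The reader then only tracks locations within the input; all knowledge of where the input came from stays with the caller.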

In terms of actually building a full reader system, we will need a broader context setup -- as we discussed earlier, the pipeline processing as currently sketched will not be able to handle includes and the like. Whether this should be part of the core, I am not sure.
