Give each table a proper TableOrigin #35

Open
jfcorbett opened this issue Sep 21, 2020 · 2 comments
Labels
enhancement (New feature or request), ergonomics (Improve usability, API, discoverability)

Comments

@jfcorbett
Member

jfcorbett commented Sep 21, 2020

The implementation of "table origins" is currently a half-baked mess:

  • Right now we have a TableOriginCSV class, but nothing for tables coming from sources other than CSV files.
  • Even TableOriginCSV doesn't get used properly: it's supposed to indicate not only which file the table is in, but also which row the table is on, yet that row never gets set.
  • In fact, origin is in most places set to be just a str.
  • There is currently a flurry of function parameters named origin in the reader and parser functions; however, they evidently don't hold a TableOrigin but rather... something else, apparently a file name. Confusing nomenclature.

A few things need to get done:

  • Each call to parse_blocks() should be done with arguments (perhaps kwargs, perhaps a new OriginContext parameter object?) that are sufficient to determine the context (this context would usually be the "sheet", in the StarTable sense of the term).
    • For CSV, this would be: file name.
    • For Excel: file name and sheet name.
    • For JsonData: ...?
    • Probably also, a clear flag of whether it comes from CSV, Excel, JsonData, or something else: we know this would be useful when resolving certain ambiguities during parsing, e.g. in _parse_datetime_column() when val is None.
  • Each call to parse_blocks() should be interpreted as starting on row 0 of that context/sheet (or row 1, if we want 1-based for Excel?).
  • parse_blocks() makes a TableOrigin on the fly, based on the context it was passed plus the row number that it's currently on.
  • Rename all variables/parameters that aren't meant to contain a TableOrigin to something other than origin, and check whether they even need to exist.
  • Have a TableOrigin subclass for each expected kind of origin (CSV, Excel, JsonData, ... raw cell data?), as well as an "anonymous" one that tables get by default unless it is overwritten by one of the others. This means that tables will have an anonymous origin if created literally, and also if they get created in some unexpected way, e.g. read from a source that we don't yet know how to handle (imported from a Word doc or whatever), though such cases should in time get their own TableOrigin subclass as they become supported in pdtable. A rough sketch of what this could look like follows this list.
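For illustration only, here is a minimal sketch of what the context object and origin hierarchy could look like. Every name below (InputType, OriginContext, and the subclass fields) is hypothetical, not existing pdtable API:

# Illustrative sketch only -- none of these names exist in pdtable yet.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class InputType(Enum):
    # Explicit flag of the input kind, useful e.g. in _parse_datetime_column()
    # when val is None and the ambiguity must be resolved per source type.
    CSV = auto()
    EXCEL = auto()
    JSON_DATA = auto()
    OTHER = auto()


@dataclass(frozen=True)
class OriginContext:
    # Everything parse_blocks() needs to know about the "sheet" it is parsing;
    # the row number is tracked by parse_blocks() itself.
    input_type: InputType
    file_name: Optional[str] = None
    sheet_name: Optional[str] = None  # Excel only


class TableOrigin:
    # The "anonymous" default origin: for tables created literally, or read
    # from sources we don't yet know how to handle.
    pass


@dataclass(frozen=True)
class TableOriginCSV(TableOrigin):
    file_name: str
    row: int  # row in the file where the table block starts


@dataclass(frozen=True)
class TableOriginExcel(TableOrigin):
    file_name: str
    sheet_name: str
    row: int  # could be 1-based, to match Excel's own numbering

With something like this, parse_blocks() could build, say, TableOriginCSV(context.file_name, row) on the fly whenever it encounters a table block, per the bullets above.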

Also consider whether it would make sense for TableOrigin to accept another TableOrigin in the constructor, to keep track of a chain of origins. Is there a use case for this?

@jfcorbett added the enhancement and ergonomics labels on Sep 21, 2020
@jfcorbett changed the title from "Clarify TableOrigin interface and scope" to "Give each table a proper TableOrigin" on Dec 14, 2020
@jfcorbett
Member Author

@JanusWesenberg Comments on this issue please? This is my next clean-up project! It would be nice to be aligned up front.

@JanusWesenberg
Contributor

Yes, that part was at best sketched. My thinking was something along the following lines.

The table origin data must somehow reference metadata for the file being read. This metadata must be passed into the reader by the caller, and could be a file path relative to a project root, records-system metadata, a blob key, or any other file-level metadata. Based on this, the reader must then be able to create a table-level origin object suitable for the caller.

One way of doing this could be to provide a TableOriginPolicy object as the origin input (or maybe rename that input).


from abc import abstractmethod
from typing import Optional, Protocol


class TableOriginPolicy(Protocol):
    # Assumes a TableOrigin class as discussed above.
    @abstractmethod
    def origin_for(self, **kwargs) -> TableOrigin:
        """
        Provide table origin for the location specified by keyword arguments.

        Table sources are free to call this method with any arguments, but should
        prefer the convenience methods `origin_for_line` and `origin_for_workbook_location`.
        """

    def origin_for_line(self, line: int) -> TableOrigin:
        """
        Returns origin for a line in a text-based input source such as CSV.
        """
        return self.origin_for(line=line)

    def origin_for_workbook_location(self, row: int, column: str, sheet: Optional[str] = None) -> TableOrigin:
        """
        Returns origin for a cell location in a workbook-based input source such as Excel.
        """
        return self.origin_for(row=row, column=column, sheet=sheet)
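
To make the intended division of labor concrete, here is a hypothetical caller-side implementation; CsvFileOriginPolicy is an invented name, and the TableOriginCSV it returns reuses the sketch earlier in this thread:

# Hypothetical concrete policy -- not existing pdtable API. Explicitly
# subclassing the Protocol inherits the convenience methods above.
class CsvFileOriginPolicy(TableOriginPolicy):
    def __init__(self, relative_path: str):
        # File-level metadata held by the caller, e.g. a path relative to a project root.
        self.relative_path = relative_path

    def origin_for(self, **kwargs) -> TableOrigin:
        # A CSV policy only expects the 'line' keyword; combine it with the
        # caller-held file metadata into a table-level origin.
        return TableOriginCSV(file_name=self.relative_path, row=kwargs["line"])


policy = CsvFileOriginPolicy("data/input.csv")
origin = policy.origin_for_line(42)  # what a reader would do per table block

The reader then only tracks locations within the input; all knowledge of where the input came from stays with the caller.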

In terms of actually building a full reader system, we will need a broader context setup -- as we discussed earlier, the pipeline processing as currently sketched will not be able to handle includes and the like. Whether this should be part of the core, I am not sure.
