Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParseFixer is too aggressive, should be turned off by default #71

Open
jfcorbett opened this issue Nov 10, 2020 · 1 comment
Open

ParseFixer is too aggressive, should be turned off by default #71

jfcorbett opened this issue Nov 10, 2020 · 1 comment
Assignees
Labels
bug Something isn't working ergonomics Improve usability, API, discoverability
Milestone

Comments

@jfcorbett
Copy link
Member

jfcorbett commented Nov 10, 2020

Currently, the reader functions use a default "ParseFixer" by default. This means that errors are swallowed / autocorrected until some threshold number of errors is reached. This is a problem for usability; and it may hide unexpected bugs (at least one discovered so far #72 ).

This Issue requests the following, in descending order of priority:

  1. turn off ParseFixer by default
  2. probably redesign the whole ParseFixer architecture, currently quite convoluted
  3. easily allow user to specify their own fixer policy

Rationale: multiple reasons:

  • I suspect there could be more hidden bugs in the way the default fixer "fixes" "errors". Quotation marks here, because sometimes the fixes actually make things worse and creates errors by fixing things that weren't errors in the first place (cf. Reading table blocks fails when there is a comment to the right of the column name row #72 ).

  • The default ParseFixer makes debugging difficult; during debugging we usually want errors to be raised immediately. There is a way to kind of achieve this by monkeypatching a _called_from_test attribute onto the ParseFixer, but that's much too convoluted.

  • There should be an easy way to run read_csv() and read_excel() without any ParseFixer. Right now read_csv(..., , fixer=None) doesn't do that, because "None" gets interpreted as, use the default ParseFixer.

  • Principle of least surprise: the default parse fixer is quite aggressive. It will more or less silently replace omitted values by "defaults", fixer second-guesses what the user "really" intended to write; but the user may not have wanted these values and they could do damage downstream.

Possible use cases for a ParseFixer
I can kind of see the following use cases, with my comments:

  1. Ergonomics: Don't crash on the first tiny error, but instead collect all errors as they are encoutered, and crash at the end, reporting all errors at once. So if there are 10 errors, don't have to run the reader 10 times to reveal them all. <<< this is an option I think makes sense to have... but that's not what the current default fixer does.

  2. Fix errors as they are encountered, according to a user-specified policy. <<< This is not something that should be done by default like it is now. But it would be okay to give the user the option to do so, on their own terms, if they explicitly specify this.

  3. Report all encountered errors, for user checking in an external tool, e.g. the StarTable Editor that @BennyLassen proposed. <<< This is a good idea. But again, it should not be the default that we force-feed to all pdtable users. At best make it an option to be specified explicitly.

@jfcorbett jfcorbett added the ergonomics Improve usability, API, discoverability label Nov 10, 2020
@jfcorbett jfcorbett self-assigned this Nov 10, 2020
@jfcorbett jfcorbett changed the title ParseFixer makes debugging difficult ParseFixer is too aggressive, should be turned off by default Nov 11, 2020
@jfcorbett jfcorbett added the bug Something isn't working label Nov 11, 2020
@JanusWesenberg
Copy link
Contributor

Think we are in agreement here. My take would be that the only purpose would be to allow a channel for reporting parser errors. Since we already have the block stream, the simplest way of doing so would probably be to emit Error blocks. That way, users would even be free to implement "fixers" on top.

@guilhermebs guilhermebs added this to the Version 0.1 milestone Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ergonomics Improve usability, API, discoverability
Projects
None yet
Development

No branches or pull requests

3 participants