ParseFixer is too aggressive, should be turned off by default #71

jfcorbett · 2020-11-10T14:31:59Z

Currently, the reader functions use a ParseFixer by default. This means that errors are swallowed or autocorrected until some threshold number of errors is reached. This hurts usability, and it may hide unexpected bugs (at least one discovered so far: #72).

This Issue requests the following, in descending order of priority:

turn off ParseFixer by default
probably redesign the whole ParseFixer architecture, currently quite convoluted
easily allow user to specify their own fixer policy

Rationale, for multiple reasons:

I suspect there could be more hidden bugs in the way the default fixer "fixes" "errors". The quotation marks are there because sometimes the fixes actually make things worse and create errors by fixing things that weren't errors in the first place (cf. Reading table blocks fails when there is a comment to the right of the column name row #72 ).
The default ParseFixer makes debugging difficult; during debugging we usually want errors to be raised immediately. There is a way to kind of achieve this by monkeypatching a _called_from_test attribute onto the ParseFixer, but that's much too convoluted.
There should be an easy way to run read_csv() and read_excel() without any ParseFixer. Right now read_csv(..., fixer=None) doesn't do that, because None gets interpreted as "use the default ParseFixer".
Principle of least surprise: the default parse fixer is quite aggressive. It more or less silently replaces omitted values with "defaults"; the fixer second-guesses what the user "really" intended to write. But the user may not have wanted these values, and they could do damage downstream.
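
To illustrate the `fixer=None` point above, here is a minimal sketch of the requested behaviour: no fixer unless one is passed explicitly. This is not pdtable's code; `parse_cells`, `nan_fixer` and the `Fixer` signature are hypothetical names invented for the illustration.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical fixer signature (an assumption, not pdtable's API):
# receives the offending cell text and the error, returns a replacement value.
Fixer = Callable[[str, Exception], float]


def parse_cells(cells: Iterable[str], fixer: Optional[Fixer] = None) -> List[float]:
    """Parse cells as floats. With fixer=None (the default), errors propagate."""
    values: List[float] = []
    for cell in cells:
        try:
            values.append(float(cell))
        except ValueError as err:
            if fixer is None:
                raise  # no fixer requested: fail immediately
            values.append(fixer(cell, err))  # fixing only happens on explicit opt-in
    return values


def nan_fixer(cell: str, err: Exception) -> float:
    """Example user-specified policy: replace unparseable cells with NaN."""
    return float("nan")
```

With this shape, `parse_cells(cells)` and `parse_cells(cells, fixer=None)` both mean "no fixing", and a user who wants autocorrection has to ask for it, e.g. `parse_cells(cells, fixer=nan_fixer)`.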

Possible use cases for a ParseFixer
I can kind of see the following use cases, with my comments:

Ergonomics: Don't crash on the first tiny error, but instead collect all errors as they are encountered, and crash at the end, reporting all errors at once. So if there are 10 errors, the user doesn't have to run the reader 10 times to reveal them all. <<< This is an option I think makes sense to have... but that's not what the current default fixer does.
Fix errors as they are encountered, according to a user-specified policy. <<< This is not something that should be done by default like it is now. But it would be okay to give the user the option to do so, on their own terms, if they explicitly specify this.
Report all encountered errors, for user checking in an external tool, e.g. the StarTable Editor that @BennyLassen proposed. <<< This is a good idea. But again, it should not be the default that we force-feed to all pdtable users. At best make it an option to be specified explicitly.
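
The first use case (collect everything, crash once at the end) could look roughly like the sketch below. It is only an illustration under assumed names: `ParseErrors` and `parse_all` are hypothetical and do not exist in pdtable.

```python
from typing import Iterable, List, Tuple


class ParseErrors(Exception):
    """Raised once at the end, carrying every error that was collected."""

    def __init__(self, errors: List[Tuple[int, str]]):
        self.errors = errors
        details = "; ".join(f"row {row}: {msg}" for row, msg in errors)
        super().__init__(f"{len(errors)} parse error(s): {details}")


def parse_all(cells: Iterable[str]) -> List[float]:
    """Parse every cell; nothing is 'fixed', errors are only collected."""
    values: List[float] = []
    errors: List[Tuple[int, str]] = []
    for row, cell in enumerate(cells):
        try:
            values.append(float(cell))
        except ValueError as err:
            errors.append((row, str(err)))  # remember the error and keep going
    if errors:
        raise ParseErrors(errors)  # single failure that reports all problems
    return values
```

The key difference from the current default is that no value is silently substituted: the read still fails, it just fails once with the full list.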

JanusWesenberg · 2020-11-11T13:21:22Z

Think we are in agreement here. My take would be that the only purpose would be to allow a channel for reporting parser errors. Since we already have the block stream, the simplest way of doing so would probably be to emit Error blocks. That way, users would even be free to implement "fixers" on top.
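
A rough sketch of what emitting Error blocks into the stream might look like, with a "fixer"-style policy layered on top. Everything here is assumed for illustration: `BlockKind`, `parse_blocks` and `strict` are hypothetical stand-ins, not pdtable's actual block types or functions.

```python
from enum import Enum, auto
from typing import Any, Iterable, Iterator, List, Tuple


class BlockKind(Enum):  # hypothetical stand-in for the library's block type enum
    TABLE = auto()
    ERROR = auto()


Block = Tuple[BlockKind, Any]


def parse_blocks(raw_blocks: Iterable[List[str]]) -> Iterator[Block]:
    """Yield one block per input; a failed parse becomes an ERROR block."""
    for cells in raw_blocks:
        try:
            yield BlockKind.TABLE, [float(c) for c in cells]
        except ValueError as err:
            yield BlockKind.ERROR, err  # report through the stream, do not fix


def strict(blocks: Iterable[Block]) -> Iterator[Block]:
    """One possible user policy on top of the stream: raise on the first error."""
    for kind, payload in blocks:
        if kind is BlockKind.ERROR:
            raise payload
        yield kind, payload
```

Because errors are ordinary items in the stream, other policies (skip, collect, hand off to an external editor) are just different wrappers around `parse_blocks`.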

jfcorbett added the ergonomics Improve usability, API, discoverability label Nov 10, 2020

jfcorbett self-assigned this Nov 10, 2020

jfcorbett changed the title ~~ParseFixer makes debugging difficult~~ ParseFixer is too aggressive, should be turned off by default Nov 11, 2020

jfcorbett mentioned this issue Nov 11, 2020

Fix/parser fails on comments #73

Merged

jfcorbett added the bug Something isn't working label Nov 11, 2020

guilhermebs added this to the Version 0.1 milestone Dec 2, 2022
