Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for escaping the delimiter like the python counterpart (Write) #234

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

leaty
Copy link

@leaty leaty commented Jun 3, 2021

This is a proof of concept for escaping the delimiter in Writer. However, it overrides the default of QuoteStyle::Never, i.e. it will rather escape than produce potentially broken CSV.

One con of this, if you force Never because you know there is nothing to escape by quoting, it will still check if it has to escape something and if so escape using escape char. That's naturally going to be slower, although probably negligible.

Another con is that the escape char will always be escaped when it's on Never. This breaks reading it if you don't enable escape in Reader after #233. I would likely need to add a new option to avoid these cons because escape is not an Option in Writer.

Basically, any delimiter/escape char is prepended with the escape char. E.g.

\ = \\
, = \,

Examples below, note though that \\ in StringRecord is just escaped output for \:

Writing:

[StringRecord(["field1", "field2 with, comma and backslash \\", ",field3, with, commas,", "field4"])]
=> field1,field2 with\, comma and backslash \\,\,field3\, with\, commas\,,field4

Reading with #233:

field1,field2 with\, comma and backslash \\,\,field3\, with\, commas\,,field4
=> [StringRecord(["field1", "field2 with, comma and backslash \\", ",field3, with, commas,", "field4"])]

@leaty leaty force-pushed the feature/escape-delimiter-write branch from 46a8554 to 8d84634 Compare June 3, 2021 15:47
@gootorov
Copy link

gootorov commented Jun 7, 2021

One con of this, if you force Never because you know there is nothing to escape by quoting, it will still check if it has to escape something and if so escape using escape char. That's naturally going to be slower, although probably negligible.

Forcing checking for escaping is very slow. We're using QuoteStyle::Never in our code base and our benchmarks give roughly 40% performance increase with this option enabled (Disclaimer: this is locally on my laptop when writing to /tmp. Haven't tested on server grade hardware). We're relying on it since we know our data never needs escaping. So changing this would be quite undesirable.

@leaty
Copy link
Author

leaty commented Jun 7, 2021

I assumed performance would take a hit but not by that much, thanks for commenting. I totally agree with you that it's undesirable. As I mentioned it would have to be a new option then, since escape is not an Option<u8> in Writer like it is in Reader. One slightly illogical but simple change would be to add e.g. QuoteStyle::EscapeDelimiter, it's just illogical because it's not really based on quoting. Otherwise a new function would do, but then what would you name it when there's a function called escape already. Last but not least one could make escape take an Option instead, but that's API breaking.

I've been pondering about this a little, but I'd like to hear from @BurntSushi first on what the philosophies are, and if this feature would even be considered in Writer or not. All I know is that it exists in python std, and I personally prefer this CSV style to all others. You save a few bytes in storage, reading and writing, it's simple and predictable (just escape delimiter + escape char). It's also (for me) more readable raw.

@BurntSushi
Copy link
Owner

So basically, I don't have time to really dig into this right now, but I'll say these things based on skimming the comments above:

  • We should not remove the optimization that makes "never" quoting fast. CSV writing is really all about trying to shovel as much bytes through as quickly as possible, so additional checks can be quite costly here.
  • Breaking changes in csv-core are OK as long as they are well motivated. But if we can avoid it through a new option and it isn't too much of a kludge, then I'd probably prefer that.
  • If the Reader supports it than so to should the Writer.
  • Overall, I think this feature is probably a good fit. I think the only real issue to getting it merged is me: it's going to be a while before I can context switch into this project and give it a thorough review.

One question I personally have is whether or where this particular dialect of CSV is used. @leaty Have you seen it used in the wild? Or is this just something you happen to personally use?

@leaty
Copy link
Author

leaty commented Jun 7, 2021

I don't have time to really dig into this right now

That's absolutely fine, no need to rush. It was more a "should I elaborate for upstream or not" moment since what I've done already works for my immediate purposes. As long as it's beneficial to the library as a whole I'll fix it up and await the review, days or months from now doesn't really matter. Answer in your own time.

Have you seen it used in the wild?

I am of course biased, and I can't speak for other companies or organizations, but my company has used this style in all CSV data paths for years and I've personally become a strong advocate for it. Disregarding the benefit of it generally resulting in less bytes, it's also useful in some minor cases where you have no access to any good CSV libraries, or if you're reading it raw in shell for some obscure debugging reason- since you can literally just temporarily replace \, with something else and cleanly split the lines by , while with quoting you have to handle it with regex or reading byte by byte.

Breaking changes in csv-core are OK as long as they are well motivated. But if we can avoid it through a new option and it isn't too much of a kludge, then I'd probably prefer that.

My thoughts are one of the following:

  • QuoteStyle::EscapeDelimiter (slightly illogical)
  • WriterBuilder.escape_delimiter(bool) (doesn't look great and can be confused with escape)
  • Change escape to take Option<u8> instead of u8 (breaking change; will be identical with ReaderBuilder, not sure how to handle a None yet though)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants