Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lib does not work if cell contains \n characters #92

Open
reaver2oo3 opened this issue Jan 28, 2020 · 5 comments
Open

Lib does not work if cell contains \n characters #92

reaver2oo3 opened this issue Jan 28, 2020 · 5 comments

Comments

@reaver2oo3
Copy link

Hello,

I have integrated your library into an application that I am working on and what I have found is that if a cell has text in it that is spread into multiple lines, then the application will crash. :(
Is there any hope for a fix?
A possible solution for this, that I have thought of is that, if a cell starts with a quote but it isn't found by the end of the line, then more lines should be read until the pair is found. I haven't managed to write a fix attempt yet, as I haven't looked into your code enough.

Kind regards,
Daniel

@lubomirmatus
Copy link

lubomirmatus commented Feb 26, 2020

Hi

I have the same problem. Cells with double quoted strings that contain \n inside are parsed incorrectly.

id, text
1, "abc
efg"

I have fixed this localy and it seems to be working.
Problem is in function LineReader::next_line, line 465, link.
Replace original block, lines 465-468 with this:

int line_end = data_begin;
bool is_in_string = false;
bool has_quote = false;
while(line_end != data_end){
        if (buffer[line_end] == '\"') {
            if (is_in_string)
                has_quote = !has_quote;
            else
                is_in_string = true;
        }
        else if (buffer[line_end] == '\n') {
            if (!is_in_string)
                break;
        }
        else {
            if (is_in_string && has_quote) {
                is_in_string = false;
                has_quote = false;
            }
        }
        ++line_end;
}

This code does not consider other quote_policies. This needs to be done correctly.

I'm using csv reader with double_qoute_escape policy like this:

CSVReader<2, trim_chars<>, double_quote_escape<',', '\"'>> csv(file);

and everything works for me.

@ben-strasser
Copy link
Owner

ben-strasser commented Mar 4, 2020

The issue with \n in quoted strings has been raised a lot in the past. This is known. The summary of the problems are:

  • The newline handling is in LineReader, which is not parameterizable. This make it difficult to do changes in a clean and backward compatible way.
  • It is unclear how to handle "\n" vs "\r\n" newlines. Do we want automatically translate these? Some people want to, others not. There a lot of corner cases that can lead to unexpected behavior. Not handling them at all does at least not silently do the wrong thing.
  • Error message generation gets a lot more complex because of possibly run-away problems if a quote is missing.
  • My personally opinion is that CSV is a text file and text files should be readable in a text editor. If a column spans multiple lines then this gets hard to read as records can no longer be copied line-wise. My opinion is thus that you are abusing the format, if you have literal newlines. The solution is to escape newlines.

Up to now I have not yet seen a good solution.

@lubomirmatus
Copy link

We are processing many csv files downloaded from various webservices and/or generated by tools.
The \n inside the csv string is absolutely common, with no escaping. I think, the only escape used in csv files, is double double-quote "" to escape double-quote.

The problem of runaway due to missing closing double-quote is not a problem of parser, it is problem of generator, this should not be corrected by the parser.

Solving this by "correcting" csv before reading by escaping all multi-line records with some escape and then after reading unescape all escaped new lines is fairly inefficient and complicated.

@johndunlap
Copy link

Muti-line cells are absolutely required for my use case. You cannot tell users that they can't put multiple lines in text boxes.

@ben-strasser
Copy link
Owner

ben-strasser commented Jul 2, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants