Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skipping empty rows #159

Open
binh-vu opened this issue Jul 3, 2019 · 14 comments
Open

Skipping empty rows #159

binh-vu opened this issue Jul 3, 2019 · 14 comments

Comments

@binh-vu
Copy link

binh-vu commented Jul 3, 2019

What version of the csv crate are you using?

1.1.1

Briefly describe the question, bug or feature request.

The library skips the empty rows even when flexible is set to be true.

The default behavior in python csv library, Excel, Number is keeping empty rows, so I think we should follow that (the users may have hard time to find the different between the data they see in Excel and the result from the Rust code).

Include a complete program demonstrating a problem.

use csv; // 1.1.1

fn main() {
    let s = "year,make\n\n1948,Porsche\n1967,Ford";

    let reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(s.as_bytes());
        
    for (i, record) in reader.into_records().enumerate() {
        println!("row: {} - record: {:?}", i, record.unwrap());
    }
}

What is the observed behavior of the code above?

row: 0 - record: StringRecord(["year", "make"])
row: 1 - record: StringRecord(["1948", "Porsche"])
row: 2 - record: StringRecord(["1967", "Ford"])

What is the expected or desired behavior of the code above?

row: 0 - record: StringRecord(["year", "make"])
row: 1 - record: StringRecord([])
row: 2 - record: StringRecord(["1948", "Porsche"])
row: 3 - record: StringRecord(["1967", "Ford"])

Here is the python code for the same csv:

import io, csv
s = "year,make\n\n1948,Porsche\n1967,Ford"
print(list(csv.reader(io.StringIO(s))))
# output >>> [['year', 'make'], [], ['1948', 'Porsche'], ['1967', 'Ford']]
@binh-vu binh-vu changed the title Skip empty row The library skips empty rows Jul 3, 2019
@binh-vu binh-vu changed the title The library skips empty rows Skipping empty rows Jul 3, 2019
@jaredforth
Copy link

Has this issue still not been addressed?

@BurntSushi
Copy link
Owner

@jaredforth Correct.

Re-reading this issue now, I am just not convinced that this change should be made. At the very least, this would be a breaking change, and I'm not convinced of the utility of including empty rows. The only reason given is for consistency with Python's csv reader, but I don't think that's a good enough reason on its own.

@JamesMc86
Copy link

To add another voice to this - this has also just caught me out. I'm running in flexible mode and an empty row has meaning to the structure of the file I'm trying to parse.

I have a workaround in this case but wanted to add another voice to this feature

@BurntSushi
Copy link
Owner

Folks could help their case here by providing a more detailed motivation for why they need this feature. Give some examples. Explain why there is no other way or easy workaround.

@gblock0
Copy link

gblock0 commented Oct 18, 2022

We are working on a POC with this, so it's not blocking anything at the moment, but I figured I'd add our use case.

We use json-schema to define TSV/CSV files and then would like to use this library to help us validate that the TSV/CSV file adheres to the schema. To give a more concrete example, we allow some fields to be required and some to be nullable. The nullable can be empty string, "N/A", or really anything we want. We'd like to parse the whole file and know if there is an empty line because if there are required columns that don't exist or the nullable column definition is not adhered to, then it would fail the validation.

Would it be possible to add this as an optional param like "skip_empty_lines" and defaults to true so it wouldn't be a breaking change?

@BurntSushi
Copy link
Owner

@gblock0 I see. So it sounds like the idea here is that you want to do some kind of verification step that possibly precludes the existence of empty lines in a CSV file, but since this parser completely skips them and makes them inaccessible to you, it's not possible to (easily) check whether they exist.

Would it be possible to add this as an optional param like "skip_empty_lines" and defaults to true so it wouldn't be a breaking change?

Yes, if I were to add this, then it would have to come in under an option because I won't release a csv 2.0 just for this.

FWIW, I don't see this happening any time soon. IIRC, this behavior is baked in all the way down in csv-core's finite state machine. I am not convinced that this is really something worth supporting, but if someone did come up with a simple patch for this (and is an opt-in feature), then I think I'd be willing to bring it in.

@gblock0
Copy link

gblock0 commented Oct 19, 2022

@BurntSushi yep exactly. Totally understand that it's not priority, and if I was more familiar with Rust I'd take a swing at it.

Thanks for hearing me out and all the work you do!

@jccalvojackson
Copy link

tried to do the first problem of AOC but as it skips the empty lines, csv is not useful here 🤷‍♂️

@BurntSushi
Copy link
Owner

I only skimmed the problem, but I don't see any CSV data there. It's just line delimited. There's only a single column.

@jccalvojackson
Copy link

yeah, it is not csv in the strict sense, just a file. I'm using csv as a generic file reader whenever the file has a column structure (this being the simplest with just one column). So granted, not a use case that really justifies this change.

@Yarn
Copy link

Yarn commented Mar 24, 2023

To add another specific use case, I'm using xsv fixlengths and xsv table to view a rather interesting csv, essentially multiple csvs and some other random info separated by blank lines. Losing the blank lines makes it substantially harder to distinguish between different sections quickly.

@amansawhney
Copy link

I have a use case for this feature. I am reading in a CSV with one column that is deserialized using serde. The column is an optional usize. If the rows are skipped all the NONE vals are skipped and hence the data is parsed incorrectly.

@amansawhney
Copy link

I have a use case for this feature. I am reading in a CSV with one column that is deserialized using serde. The column is an optional usize. If the rows are skipped all the NONE vals are skipped and hence the data is parsed incorrectly.

I was actually able to fix this by wrapping every value in my csv with quotes when I serialized the data earlier in the pipeline. I still think this feature should be added since not everyone has the flexibility of being able to change their data pipelines.

@dgendill
Copy link

dgendill commented Jun 20, 2024

This feature would be very useful to me as well. Same as Yarn's comment, I'm working with a CSV that basically has multiple tables embedded inside of it. Here is a very minimal example or a real world CSV that some financial companies export.

Account Number,Investment Name,Symbol,Shares,Share Price,Total Value,
1234567,NVIDIA Corp,NVDA,1,130.7,130.7,



Account Number,Trade Date,Settlement Date,Transaction Type,Transaction Description,Investment Name,Symbol,Shares,Share Price,Principal Amount,Commissions and Fees,Net Amount,Accrued Interest,Account Type,
1234567,2023-03-27,2023-03-27,Buy,Buy,Nvidia Corp,NVDA,0.00000,1.0,0.30,0.0,0.30,0.0,CASH,



Being able to see the trailing empty lines would signal that I'm approaching a different table of data.

I was able to test Yarn's PR and it seems to work the way I expect. Looks like a non-breaking change that only adds a new option to the builder. I tested by changing my Cargo.toml file to point to Yarn's PR branch...

[dependencies]
csv = { git = "https://github.com/Yarn/rust-csv.git" }

And then calling skip_blank_lines(false) on the builder.

use std::{error::Error, process};

fn example() -> Result<(), Box<dyn Error>> {
    let file = std::fs::File::open("documents/example2.csv")?;
    let reader = std::io::BufReader::new(file);

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .skip_blank_lines(false)
        .trim(csv::Trim::None)
        .from_reader(reader);    

    for result in rdr.records() {        
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants