
[Docs] Add caveat about autodetect_column_names #79

Open
AndyHunt66 opened this issue Feb 19, 2020 · 3 comments
AndyHunt66 commented Feb 19, 2020

https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names

  • Version: 3.0.10

When using autodetect_column_names, if either

  • logstash is stopped and restarted in the middle of reading a csv file, or
  • logstash finishes reading a file with one column layout and starts reading a different file with a different column layout

then the behaviour is not what might be expected.

In the first case, the column names will be re-read from the next event where LS left off before being stopped - so the event data in that row becomes the column names for the rest of the file.

In the second case, column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.

Additionally, I think in the second case, the header line will be ingested as data, even if it is column names.
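The two failure modes above both follow from one design choice: the column names are captured exactly once per parser lifetime, not once per file. A minimal simulation (plain Python, not Logstash code; the class and field names are invented for illustration) shows why a restart mid-file and a layout change between files both go wrong:

```python
# Simulation of a "capture column names once" CSV strategy, mimicking
# what autodetect_column_names does across restarts and file changes.

class AutodetectParser:
    """The first line ever seen becomes the column names
    for every subsequent line."""
    def __init__(self):
        self.columns = None

    def parse(self, line):
        fields = line.split(",")
        if self.columns is None:
            self.columns = fields   # header captured exactly once
            return None
        return dict(zip(self.columns, fields))

# Case 1: restart in the middle of a file. The fresh parser treats the
# first data row it resumes on as the header.
parser = AutodetectParser()
parser.parse("name,age")            # real header of file 1
parser.parse("alice,30")
parser = AutodetectParser()         # "restart" mid-file
parser.parse("bob,41")              # a data row becomes the column names
assert parser.columns == ["bob", "41"]

# Case 2: a second file with a different layout. Column names are not
# re-read, so its rows (header included) are mapped to the old names.
parser = AutodetectParser()
parser.parse("name,age")            # header of file 1
event = parser.parse("city,population")   # header of file 2, ingested as data
assert event == {"name": "city", "age": "population"}
```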

We should add a caveat in the docs to cover these scenarios.

Something along the lines of a note like:

When autodetect_column_names is set to true, column names are parsed only once, when Logstash starts. Avoid this setting if there is a chance Logstash will restart in the middle of a file, or if you are ingesting multiple csv files that each have column names as the first line.

@colinsurprenant
Contributor

Note that the new csv codec will help for the cases of changing files with new headers.

@vmchiran

> Note that the new csv codec will help for the cases of changing files with new headers.

Same behaviour with the csv codec:

  • column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
  • additionally, the header line is ingested as data, even if it is column names.

input {
  file {
    path => "/mnt/elk-fss/es-datasets/*.csv"
    mode => "read"
    codec => csv {
      autodetect_column_names => true
      include_headers => false
      skip_empty_columns => true
    }
  }
}

Here is the exception for a new header line:
[WARN ] 2020-07-31 10:11:02.763 [[main]>worker0] elasticsearch - Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mytestindex", :routing=>nil, :_type=>"_doc"}, #LogStash::Event:0x1887f568], :response=>{"index"=>{"_index"=>"mytestindex", "_type"=>"_doc", "_id"=>"x4hapHMBG3hrlXQBhPiG", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field

Data from the new file is handled in the previous format, which of course leads to inconsistencies.
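One possible workaround until the docs or plugin address this (a sketch, not an official fix; the column names below are hypothetical placeholders) is to avoid autodetection entirely and declare the columns explicitly with the csv filter, using skip_header to drop the header row, with one pipeline per file layout:

```
filter {
  csv {
    # hypothetical column names -- replace with the real header of each file
    columns => ["name", "age"]
    # drop the first row when it matches the declared column names
    skip_header => true
    skip_empty_columns => true
  }
}
```

This trades convenience for determinism: column names survive restarts because they live in the config, but files with a different layout need their own pipeline.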

@tomryanx
Copy link

tomryanx commented Mar 8, 2021

@colinsurprenant - could you elaborate on why you think the csv codec might address the case of changing files with new headers?
