
[Docs] Add caveat about autodetect_column_names #79

Open
AndyHunt66 opened this issue Feb 19, 2020 · 3 comments
AndyHunt66 commented Feb 19, 2020

https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names

  • Version: 3.0.10

When using autodetect_column_names, if either

  • logstash is stopped and restarted in the middle of reading a csv file, or
  • logstash finishes reading a file with one column layout and starts reading a different file with a different column layout

then the behaviour is not what might be expected.

In the first case, the column names will be re-read from the next event where LS left off before being stopped - so the event data in that row becomes the column names for the rest of the file.

In the second case, column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.

Additionally, I think in the second case, the header line will be ingested as data, even if it is column names.
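The two failure modes above both follow from one design choice: the column names are captured exactly once per parser lifetime, not once per file. A minimal simulation (plain Python, not Logstash code; the class and field names are invented for illustration) shows why a restart mid-file and a layout change between files both go wrong:

```python
# Simulation of a "capture column names once" CSV strategy, mimicking
# what autodetect_column_names does across restarts and file changes.

class AutodetectParser:
    """The first line ever seen becomes the column names
    for every subsequent line."""
    def __init__(self):
        self.columns = None

    def parse(self, line):
        fields = line.split(",")
        if self.columns is None:
            self.columns = fields   # header captured exactly once
            return None
        return dict(zip(self.columns, fields))

# Case 1: restart in the middle of a file. The fresh parser treats the
# first data row it resumes on as the header.
parser = AutodetectParser()
parser.parse("name,age")            # real header of file 1
parser.parse("alice,30")
parser = AutodetectParser()         # "restart" mid-file
parser.parse("bob,41")              # a data row becomes the column names
assert parser.columns == ["bob", "41"]

# Case 2: a second file with a different layout. Column names are not
# re-read, so its rows (header included) are mapped to the old names.
parser = AutodetectParser()
parser.parse("name,age")            # header of file 1
event = parser.parse("city,population")   # header of file 2, ingested as data
assert event == {"name": "city", "age": "population"}
```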

We should add a caveat in the docs to cover these scenarios.

Something along the lines of a note like:

When autodetect_column_names is set to true, column names are parsed only once, when Logstash starts. Avoid this setting if there is a chance Logstash will restart in the middle of a file, or if you are ingesting multiple csv files that each have column names as the first line.

@colinsurprenant
Contributor

Note that the new csv codec will help for the cases of changing files with new headers.

@vmchiran

> Note that the new csv codec will help for the cases of changing files with new headers.

Same behaviour with the csv codec:

  • column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
  • additionally, the header line is ingested as data, even if it is column names.

input {
  file {
    path => "/mnt/elk-fss/es-datasets/*.csv"
    mode => "read"
    codec => csv {
      autodetect_column_names => true
      include_headers => false
      skip_empty_columns => true
    }
  }
}

Here is the exception for a new header line:
[WARN ] 2020-07-31 10:11:02.763 [[main]>worker0] elasticsearch - Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mytestindex", :routing=>nil, :_type=>"_doc"}, #LogStash::Event:0x1887f568], :response=>{"index"=>{"_index"=>"mytestindex", "_type"=>"_doc", "_id"=>"x4hapHMBG3hrlXQBhPiG", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field

Data from the new file is handled in the previous format, which of course leads to inconsistencies.
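One possible workaround until the docs or plugin address this (a sketch, not an official fix; the column names below are hypothetical placeholders) is to avoid autodetection entirely and declare the columns explicitly with the csv filter, using skip_header to drop the header row, with one pipeline per file layout:

```
filter {
  csv {
    # hypothetical column names -- replace with the real header of each file
    columns => ["name", "age"]
    # drop the first row when it matches the declared column names
    skip_header => true
    skip_empty_columns => true
  }
}
```

This trades convenience for determinism: column names survive restarts because they live in the config, but files with a different layout need their own pipeline.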

@tomryanx
Copy link

tomryanx commented Mar 8, 2021

@colinsurprenant - could you elaborate on why you think the csv codec might address the case of changing files with new headers?
