
Create option for machine readable output for scan #984

Open
keith-turner opened this issue Dec 22, 2017 · 10 comments

@keith-turner
Contributor

It would be useful if the fluo scan command had an option to produce machine-readable output. This could be something like fluo scan -a app1 --json or fluo scan -a app1 --csv. I'm not sure what the best output format is. If we start with something like a --json option, we can always add something like a --csv option later.

@blueshift-brasil
Contributor

Suggestion: refactor org.apache.fluo.core.util.ScanUtil to accept a java.io.OutputStream or a java.io.Writer (or both), to avoid code like this:

System.out.println(sb.toString());

Could this be done in that same thread?
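A minimal sketch of what such a refactoring might look like. The class and method names here are illustrative, not Fluo's actual ScanUtil API:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Illustrative sketch: a scan helper that writes to a caller-supplied
// Writer instead of printing directly to System.out. Names are
// hypothetical, not the real ScanUtil API.
public class ScanOutputSketch {

  // Writes one formatted scan entry to the given Writer.
  static void writeEntry(Writer out, String row, String cf, String cq, String val)
      throws IOException {
    out.write(row + " " + cf + " " + cq + "\t" + val + "\n");
  }

  public static void main(String[] args) throws IOException {
    // Callers can pass a writer over System.out, a file, or any other sink.
    StringWriter sink = new StringWriter();
    writeEntry(sink, "row1", "fam", "qual", "val1");
    System.out.print(sink);
  }
}
```

Passing the sink in also makes the formatting logic testable without capturing stdout.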

@keith-turner
Contributor Author

keith-turner commented Feb 1, 2018

I think passing in something like an OutputStream is much cleaner. Would need to preserve the checkError() behavior where the scan breaks when the output stream is closed. Looking into this I realized the current code is inefficient because checkError() also flushes. It would be more efficient to use something like an OutputStream and when it throws an IOException just stop the scan. Or if using a PrintStream, then checkError() could be called less frequently like every 100 or 1000 lines.
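A rough sketch of the "call checkError() less frequently" idea. The scan loop and entry source here are stand-ins for illustration, not Fluo code:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.List;

// Sketch: only call checkError() (which also flushes) every N lines
// instead of on every line. The List stands in for a real scan.
public class CheckErrorSketch {

  static final int CHECK_INTERVAL = 1000;

  // Returns the number of lines actually written before stopping.
  static int printEntries(PrintStream out, List<String> entries) {
    int count = 0;
    for (String entry : entries) {
      out.println(entry);
      count++;
      // checkError() flushes, so calling it on every line is costly;
      // a periodic check keeps close-detection while staying cheap.
      if (count % CHECK_INTERVAL == 0 && out.checkError()) {
        break; // output stream is broken/closed; stop the scan
      }
    }
    return count;
  }

  public static void main(String[] args) {
    PrintStream out = new PrintStream(new ByteArrayOutputStream());
    System.out.println(printEntries(out, List.of("a", "b", "c")));
  }
}
```

The trade-off is that up to CHECK_INTERVAL entries may be formatted after the stream is closed before the scan notices.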

@blueshift-brasil
Contributor

I'm working on this. Do you think we could use the commons-csv lib?
I believe we could use it for both the current format (TSV-like) and for the CSV format.

I'm adding some properties to fluo-app.properties for the CSV format:

## Fluo Scan properties
## -----------------
## Properties to export the scan result to CSV format.
fluo.scan.csv.delimiter = ;
fluo.scan.csv.header = true
fluo.scan.csv.quote = "
# Possible values: ALL, ALL_NON_NULL, MINIMAL, NON_NUMERIC and NONE
# @see org.apache.commons.csv.QuoteMode
fluo.scan.csv.quoteMode = ALL
fluo.scan.csv.comment = #
fluo.scan.csv.escape = \
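For what it's worth, resolving those properties with fallbacks could look roughly like this. The property names come from the snippet above; the default values chosen here are just examples, not decided behavior:

```java
import java.util.Properties;

// Sketch: resolve the fluo.scan.csv.* properties with defaults, so an
// unset property falls back to a sensible library default.
public class CsvConfigSketch {

  static String delimiter(Properties props) {
    return props.getProperty("fluo.scan.csv.delimiter", ",");
  }

  static boolean header(Properties props) {
    return Boolean.parseBoolean(props.getProperty("fluo.scan.csv.header", "true"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("fluo.scan.csv.delimiter", ";");
    System.out.println(delimiter(props)); // configured value
    System.out.println(header(props));    // falls back to default
  }
}
```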

@blueshift-brasil
Contributor

In the distribution module's fetch.sh file, is there any reason for this dependency to be pinned at this version?

download com.google.code.gson:gson:jar:2.2.4

In Fluo's pom.xml we are on 2.8.0. Can I upgrade it for use in the --json scan?

@blueshift-brasil
Contributor

Sample of CSV file:

[root@6cf4e94e7248 share]# fluo scan -a myapp --csv
"ROW";"COLUMN_FAMILY";"COLUMN_QUALIFIER";"COLUMN_VISIBILITY";"VALUE"
"HISTORICO:123:1:10:100:17bc30e1-4c55-4037-9bb8-032b2c422935";"cadastral";"DAT_NSC";"";"111111111"
"HISTORICO:123:1:10:100:17bc30e1-4c55-4037-9bb8-032b2c422935";"cadastral";"NOM_RAZ_SOC";"";"yyy"
"HISTORICO:123:1:10:100:18f73b40-d14c-4717-83cf-4a63f4012e9c";"cadastral";"DAT_NSC";"";"111111111"
"HISTORICO:123:1:10:100:18f73b40-d14c-4717-83cf-4a63f4012e9c";"cadastral";"NOM_RAZ_SOC";"";"yyy ;"
"HISTORICO:123:1:10:100:5a30271c-513a-47f3-86fd-dbf0edb98f93";"cadastral";"DAT_NSC";"";"111111111"
"HISTORICO:123:1:10:100:5a30271c-513a-47f3-86fd-dbf0edb98f93";"cadastral";"NOM_RAZ_SOC";"";"yyy"
"HISTORICO:123:1:10:100:ddba1524-05c6-4bbe-89db-a85a8bca6b22";"cadastral";"DAT_NSC";"";"111111111"
"HISTORICO:123:1:10:100:ddba1524-05c6-4bbe-89db-a85a8bca6b22";"cadastral";"NOM_RAZ_SOC";"";"yyy"
"HISTORICO:123:1:10:100:fd36a744-54de-4377-9396-3043ac01064d";"cadastral";"DAT_NSC";"";"111111111"
"HISTORICO:123:1:10:100:fd36a744-54de-4377-9396-3043ac01064d";"cadastral";"NOM_RAZ_SOC";"";"yyy"

JSON file:

[root@6cf4e94e7248 share]# fluo scan -a myapp --json
{"ROW":"HISTORICO:123:1:10:100:0a5f7383-58ec-48e1-8a4a-545bb99ace9f","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralDAT_NSC","COLUMN_VISIBILITY":"","VALUE":"111111111"}
{"ROW":"HISTORICO:123:1:10:100:0a5f7383-58ec-48e1-8a4a-545bb99ace9f","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralNOM_RAZ_SOC","COLUMN_VISIBILITY":"","VALUE":"yyy"}
{"ROW":"HISTORICO:123:1:10:100:60aabb11-cc1a-4bdd-9c12-29545ceae5ea","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralDAT_NSC","COLUMN_VISIBILITY":"","VALUE":"111111111"}
{"ROW":"HISTORICO:123:1:10:100:60aabb11-cc1a-4bdd-9c12-29545ceae5ea","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralNOM_RAZ_SOC","COLUMN_VISIBILITY":"","VALUE":"yyy"}
{"ROW":"HISTORICO:123:1:10:100:d4cc616a-629a-4361-b7ee-0efa007a400b","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralDAT_NSC","COLUMN_VISIBILITY":"","VALUE":"111111111"}
{"ROW":"HISTORICO:123:1:10:100:d4cc616a-629a-4361-b7ee-0efa007a400b","COLUMN_FAMILY":"cadastral","COLUMN_QUALIFIER":"cadastralNOM_RAZ_SOC","COLUMN_VISIBILITY":"","VALUE":"yyy"}

@keith-turner
Contributor Author

My slight preference would be to make the csv options command line options, because I think this would give more predictable results. However, I can see the convenience of putting the options in the config file, so I am uncertain which is best.

If doing command line options, could be something like the following.

fluo scan -a app1 --csv --csv-delimiter '"'

I would replace COLUMN_FAMILY, COLUMN_QUALIFIER, and COLUMN_VISIBILITY with FAMILY, QUALIFIER, and VISIBILITY in json and csv output to make it shorter.

I think using commons csv is fine.

For the json dependency upgrade, we need to make sure it does not cause issues with the Accumulo and Hadoop libraries. If it does not, then it's ok to upgrade.
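For illustration, a JSON line with the shortened field names might be assembled like this. This deliberately skips gson and does no escaping of special characters; it only shows the intended shape of each output line:

```java
// Sketch: build one JSON line per scan entry using the shortened field
// names (FAMILY, QUALIFIER, VISIBILITY). Real code would use gson and
// handle escaping; this only illustrates the shape.
public class JsonLineSketch {

  static String toJsonLine(String row, String fam, String qual, String vis, String val) {
    return String.format(
        "{\"ROW\":\"%s\",\"FAMILY\":\"%s\",\"QUALIFIER\":\"%s\",\"VISIBILITY\":\"%s\",\"VALUE\":\"%s\"}",
        row, fam, qual, vis, val);
  }

  public static void main(String[] args) {
    System.out.println(toJsonLine("r1", "cadastral", "DAT_NSC", "", "111111111"));
  }
}
```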

@blueshift-brasil
Contributor

About the config on the command line: I believe it's possible to keep both. It's a good idea!
The command line could override the property config. If the user doesn't pass any parameter on the command line or in the property file, we assume the default behavior of the component (commons-csv).

About the short names. Sounds good! Done!

About the dependency: is it common to have two versions of the same component like this?
How can I properly test whether the upgrade works well? Any ideas?
Detail: with version 2.2.4 it was not possible to write the JSON file.

@keith-turner
Contributor Author

I think it's ok to update gson to 2.8.0 in the fetch script. I was looking at mvn dependency:tree, and 2.8.0 is the version that Hadoop 2.6 depends on. I am not sure why fetch.sh has a different version than the pom.

@blueshift-brasil
Contributor

blueshift-brasil commented Feb 21, 2018

Now we have 3 options to configure the scan command:

  • Based on fluo-app.properties
  • Based on --csv-* parameters
  • And based on the -o override parameter

[root@9bb10c2c941e share]# fluo scan -a myapp --csv
[root@9bb10c2c941e share]# fluo scan -a myapp --csv --csv-delimiter '|'
[root@9bb10c2c941e share]# fluo scan -a myapp --csv -o fluo.scan.csv.delimiter='|'
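The precedence among those three sources could be resolved along these lines. The ordering assumed here (--csv-* flag first, then -o override, then fluo-app.properties, then the built-in default) is my reading of the discussion, not confirmed behavior:

```java
import java.util.Map;

// Sketch: pick the first value present, in assumed precedence order:
// explicit --csv-* flag, then -o override, then the properties file,
// then the built-in default.
public class PrecedenceSketch {

  static String resolve(String cliFlag, Map<String, String> oOverrides,
      Map<String, String> appProps, String key, String dflt) {
    if (cliFlag != null) {
      return cliFlag;
    }
    if (oOverrides.containsKey(key)) {
      return oOverrides.get(key);
    }
    return appProps.getOrDefault(key, dflt);
  }

  public static void main(String[] args) {
    Map<String, String> o = Map.of("fluo.scan.csv.delimiter", "|");
    Map<String, String> app = Map.of("fluo.scan.csv.delimiter", ";");
    // -o wins over the properties file when no --csv-delimiter flag is given
    System.out.println(resolve(null, o, app, "fluo.scan.csv.delimiter", ","));
  }
}
```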

@keith-turner
Contributor Author

I forgot about -o.
