diff --git a/docs/UserGuide.md b/docs/UserGuide.md index 3a6692502..0dea7c2a9 100644 --- a/docs/UserGuide.md +++ b/docs/UserGuide.md @@ -370,7 +370,7 @@ After loading the set from the file, this constraint behaves identically to the ### `inMap` _(field, file, key)_ ```javascript -{ +{ "field": "country", "is": "inMap", "file": "countries.csv", @@ -380,7 +380,7 @@ After loading the set from the file, this constraint behaves identically to the Is satisfied if `field`'s value is in the map with the key `Country`. -For each field using the same map when one value is picked from the map all fields will use the same row. +For each field using the same map when one value is picked from the map all fields will use the same row. It populates the map from a new-line delimited file (with suffix `.csv`), where each line represents a value to load. A header is required in the file to identify which column is related to a key. @@ -652,3 +652,120 @@ Is satisfied if either: - The `if` constraint is not satisfied, and the `else` constraint is While it's not prohibited, wrapping conditional constraints in any other kind of constraint (eg, a `not`) may cause unintuitive results. + + +#Running a Profile + +Profiles can be run against a jar using the command line. + +##Command line arguments + +Currently the only mode fully supported by the data helix is generate mode. An example command would be something like + +`java -jar generator.jar generate --max-rows=100 --replace --profile-file=profile.json --output-path=output.csv` + +### Generate command line arguments +Option switches are case-sensitive, arguments are case-insensitive + +* `--profile-file=` (or `-p `) + * Path to the input profile file. +* `--output-path=` (or `-o `) + * Path to the output file. If not specified, output will be to standard output. +* `--replace` + * Overwrite/replace existing output files. Defaults to false. +* `-n ` or `--max-rows ` + * Emit at most `` rows to the output file, if not specified will limit to 10,000,000 rows. + * Mandatory in `RANDOM` mode. +* `--generation-type` + * Determines the type of (data generation)[Link] performed. Supported options are `FULL_SEQUENTIAL` and `RANDOM`(default). +* `--combination-strategy` + * Determines the type of combination strategy used in full sequential mode. Supported options are `MINIMAL`(default), `EXHAUSTIVE` and `PINNING`. +* `--output-format` + * Determines the output format. Supported options are `csv`(default) and `json`. +* `--ndjson` + * When combined with the `--output-format=json` flag sets the output format to [ndjson](http://ndjson.org/). Defaults to true if the `--output-flag` is not set and to false if it is set. + +By default the generator will report how much data has been generated over time, the other options are below: +* `--verbose` + * Will report in-depth detail of data generation +* `--quiet` + * Will disable velocity reporting + +`--quiet` will be ignored if `--verbose` is supplied. + +##Generation strategies +The generation mode can be specified by the `--generation-type` flag. + +The generator supports the following data generation types + +* Random (_default_) +* Full Sequential + +### Random +Generate some random data that abides by the given set of constraints. + +Examples: + +| Constraint | Emitted valid data | +| ---- | ---- | ---- | +| `Field 1 > 10 AND Field 1 < 20` | _(any values > 10 & < 20)_ | +| `Field 1 in set [A, B, C]` | _(A, B or C in any order, repeated as needed)_ | + +Notes: +- Random generation of data is infinite and is limited to 1000 by default, use `--max-rows` to enable generation of more data. + +#### Full Sequential +Generate all data that can be generated in order from lowest to highest. + +Examples: + +| Constraint | Emitted valid data | +| ---- | ---- | ---- | +| `Field 1 > 0 AND Field 1 < 5` | _(null, 1, 2, 3, 4)_ | +| `Field 1 in set [A, B, C]` | _(null, A, B, C)_ | + +* Note that null will only be produced depending on the [properties](https://github.com/finos/datahelix/blob/master/docs/UserGuide.md#Fields) of Field 1. + + +####Combination strategies +There are a few different combination strategies which can be used in **full sequential mode** with minimal being the default. In modes other than full sequential, combination strategy will have no effect. + +It is simplest to see how the different combination strategies work by look at the effect on a simple example profile. The following [profile](https://github.com/finos/datahelix/tree/master/examples/multiple-fields/profile.json) contains two fields: + * field1 - has values in set [ "A", "B" ] + * field2 - has values in set [ 1, 2, 3 ] + +##### Minimal + +The minimal strategy outputs the minimum data required to exemplify each value at least once. Per the example, the output would be: + +* "A",1 +* "B",2 +* "B",3 + +Note that minimal is the default combination strategy. + +##### Exhaustive + +The exhaustive strategy outputs all possible combinations. Given the fields as defined above, possible outputs would be: + +* "A",1 +* "B",1 +* "A",2 +* "B",2 +* "A",3 +* "B",3 + + +##### Pinning + +The pinning strategy establishes a baseline for each field (generally by picking the first available value for that field) and then creates outputs such that either: + +* All values equal the baseline for the respective field +* All values except one equal the baseline for the respective field + +To generate these outputs, we first output the first case (all values from baseline) and then iterate through each field, F, fixing all other fields at their baseline and generating the full range of values for F. For the example, the output would be: + +* "A",1 +* "A",2 +* "A",3 +* "B",1 diff --git a/examples/multiple-fileds/README.md b/examples/multiple-fileds/README.md new file mode 100644 index 000000000..e690428f3 --- /dev/null +++ b/examples/multiple-fileds/README.md @@ -0,0 +1,3 @@ +Two different fields using in set constraints. + +This profile is used as an aide for understanding [combination strategies](https://github.com/finos/datahelix/blob/master/docs/UserGuide.md#combination-strategies). diff --git a/examples/multiple-fileds/profile.json b/examples/multiple-fileds/profile.json new file mode 100644 index 000000000..91e82d354 --- /dev/null +++ b/examples/multiple-fileds/profile.json @@ -0,0 +1,31 @@ +{ + "schemaVersion": "0.11", + "fields": [ + { + "name": "field1", + "type": "string", + "nullable": false + }, + { + "name": "field2", + "type": "integer", + "nullable": false + } + ], + "rules": [ + { + "constraints": [ + { + "field": "field1", + "is": "inSet", + "values": ["A", "B"] + }, + { + "field": "field2", + "is": "inSet", + "values": ["1", "2", "3"] + } + ] + } + ] +}