diff --git a/docs/archive/developer/CucumberSyntax.md b/docs/archive/developer/CucumberSyntax.md deleted file mode 100644 index 5fe80c449..000000000 --- a/docs/archive/developer/CucumberSyntax.md +++ /dev/null @@ -1,162 +0,0 @@ -# Cucumber syntax - -We use cucumber for behaviour-driven development and testing, with [gherkin](https://docs.cucumber.io/gherkin/)-based tests like the below: - -```gherkin -Feature: the name of my feature - - Background: - Given the generation strategy is interesting - And there is a non nullable field foo - - Scenario: Running the generator should emit the correct data - Given foo is equal to 8 - Then the following data should be generated: - | foo | - | 8 | - | null | -``` - -More examples can be seen in the [generator cucumber features](https://github.com/finos/datahelix/tree/master/orchestrator/src/test/java/com/scottlogic/datahelix/generator/orchestrator/cucumber) - -The framework supports setting configuration options for the generator, defining the profile and describing the expected outcome. All of these are described below. All variable elements (e.g. `{generationStrategy}`) are case insensitive, but all fields and values **are case sensitive**.
- -### Configuration options -* _the generation strategy is `{generationStrategy}`_ see [generation strategies](../user/generationTypes/GenerationTypes.md) - default: `random` -* _the combination strategy is `{combinationStrategy}`_ see [combination strategies](../user/CombinationStrategies.md) - default: `exhaustive` -* _the data requested is `{generationMode}`_, either `violating` or `validating` - default: `validating` -* _the generator can generate at most `{int}` rows_, ensures that the generator will only emit `int` rows, default: `1000` -* _we do not violate constraint `{operator}`_, prevent this operator from being violated (see **Operators** section below), you can specify this step many times if required - -### Defining the profile -It is important to remember that constraints are built up of 3 components: a field, an operator and most commonly an operand. In the following example the operator is 'greaterThan' and the operand is 5. - -``` -foo is greaterThan 5 -``` - -Operators are converted to English language equivalents for use in cucumber, so 'greaterThan' is expressed as 'greater than'. 
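For example, the `greaterThan` constraint shown above would be written in a cucumber step along these lines:

```gherkin
Given foo is greater than 5
```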
- -* _there is a non nullable field `{field}`_, adds a field called `field` to the profile -* _the following non nullable fields exist:_, adds a set of fields to the profile (is followed by a single column set of strings, each representing a field name) -* _`{field}` is null_, adds a null constraint to the profile for the field `field` -* _`{field}` is anything but null_, adds a not(is null) constraint to the profile for field `field` -* _`{field}` is `{operator}` `{operand}`_, adds an `operator` constraint to the field `field` with the data `operand`, see **operators** and **operands** sections below -* _`{field}` is anything but `{operator}` `{operand}`_, adds a negated `operator` constraint to the field `field` with the data `operand`, see **operators** and **operands** sections below -* _there is a constraint:_, adds the given JSON block as a constraint as if it were read from the profile file itself. It should only be used where the constraint cannot otherwise be expressed, e.g. for `anyOf`, `allOf` and `if`. -* _the maximum string length is {length}_, sets the maximum length for strings to the _max_ for the given scenario. The default is _200_ (for performance reasons), however in production the limit is _1000_. -* _untyped fields are allowed_, sets the `--allow-untyped-fields` flag to false - default: flag is true - -#### Operators -See [Predicate constraints](../user/UserGuide.md#Predicate-constraints), [Grammatical Constraints](../user/UserGuide.md#Grammatical-constraints) and [Presentational Constraints](../user/UserGuide.md#Presentational-constraints) for details of the constraints. - -#### Operands -When specifying an operator for a field, format the value as shown in the table below: - -| data type | example | -| ---- | ---- | -| string | `"my value"` | -| number | `1.234` | -| datetime | `2001-02-03T04:05:06.000` | -| null | `null` | - -datetimes must be expressed as above (i.e.
`yyyy-MM-ddTHH:mm:ss.fff`) - -#### Examples -* `ofType` → `Given foo has type "string"` -* `equalTo` → `Given foo is equal to 5` -* `inSet` → -``` -Given foo is in set: - | foo | - | 1 | - | 2 | - | 3 | -``` -* `not(after 01/02/2003)` → `Given foo is anything but after 2003-02-01T00:00:00.00` - -In addition, the following shows how the _there is a constraint_ step can be used: -``` -And there is a constraint: - """ - { - "if": { "field": "foo", "equalTo": "dddd" }, - "then": { "field": "bar", "equalTo": "4444" }, - "else": { "field": "bar", "is": "shorterThan", "value": 1 } - } - """ -``` - -### Describing the outcome -* _the profile is invalid because "`{reason}`"_, executes the generator and asserts that a `ValidationException` or `JsonParseException` was thrown with the message `{reason}`, where reason is a regular expression*. -* _no data is created_, executes the generator and asserts that no data was emitted -* _the following data should be generated:_, executes the generator and asserts that no exceptions were thrown and the given data appears in the generated data, no additional data is permitted. -* _the following data should be generated in order:_, executes the generator and asserts that no exceptions were thrown and the given data appears **in the same order** in the generated data, no additional data is permitted.
- -* _the following data should be included in what is generated:_, executes the generator and asserts that no exceptions were thrown and the given data is present in the generated data (regardless of order) -* _the following data should not be included in what is generated:_, executes the generator and asserts that no exceptions were thrown and the given data is **not** present in the generated data (regardless of order) -* _some data should be generated_, executes the generator and asserts that at least one row of data was emitted -* _{number} of rows of data are generated_, executes the generator and asserts that exactly the given number of rows are generated - -\* Because `{reason}` is a regular expression, certain characters will need to be escaped by preceding them with a `\`, e.g. `\(`, `\)`, `\[`, `\]`, etc. - -### Validating the data in the output - -#### DateTime -* _{field} contains datetime data_, executes the generator and asserts that _field_ contains either `null` or datetimes (other types are allowed) -* _{field} contains only datetime data_, executes the generator and asserts that _field_ contains only `null` or datetimes -* _{field} contains anything but datetime data_, executes the generator and asserts that _field_ contains either `null` or data that is not a datetime. -* _{field} contains datetimes between {min} and {max} inclusively_, executes the generator and asserts that _field_ contains either `null` or datetimes between _{min}_ and _{max}_. Does so in an inclusive manner for both min and max. -* _{field} contains datetimes outside {min} and {max}_, executes the generator and asserts that _field_ contains either `null` or datetimes outside _{min}_ and _{max}_.
-* _{field} contains datetimes before or at {before}_, executes the generator and asserts that _field_ contains either `null` or datetimes at or before _{before}_ -* _{field} contains datetimes after or at {after}_, executes the generator and asserts that _field_ contains either `null` or datetimes at or after _{after}_ - -#### Numeric -Note these steps work for asserting both integer and decimal data. There are no current steps for asserting general granularity. -* _{field} contains numeric data_, executes the generator and asserts that _field_ contains either `null` or numeric values (other types are allowed) -* _{field} contains only numeric data_, executes the generator and asserts that _field_ contains only `null` or numeric values -* _{field} contains anything but numeric data_, executes the generator and asserts that _field_ contains either `null` or data that is not numeric. -* _{field} contains numeric values between {min} and {max} inclusively_, executes the generator and asserts that _field_ contains either `null` or numeric values between _{min}_ and _{max}_. Does so in an inclusive manner for both min and max. -* _{field} contains numeric values outside {min} and {max}_, executes the generator and asserts that _field_ contains either `null` or numeric values outside _{min}_ and _{max}_. 
-* _{field} contains numeric values less than or equal to {value}_, executes the generator and asserts that _field_ contains either `null` or numeric values less than or equal to _{value}_ -* _{field} contains numeric values greater than or equal to {value}_, executes the generator and asserts that _field_ contains either `null` or numeric values greater than or equal to _{value}_ - -#### String -* _{field} contains string data_, executes the generator and asserts that _field_ contains either `null` or string values (other types are allowed) -* _{field} contains only string data_, executes the generator and asserts that _field_ contains only `null` or string values -* _{field} contains anything but string data_, executes the generator and asserts that _field_ contains either `null` or data that is not a string. -* _{field} contains strings of length between {min} and {max} inclusively_, executes the generator and asserts that _field_ contains either `null` or strings with lengths between _{min}_ and _{max}_. Does so in an inclusive manner for both min and max. -* _{field} contains strings of length outside {min} and {max}_, executes the generator and asserts that _field_ contains either `null` or strings with lengths outside _{min}_ and _{max}_. -* _{field} contains strings matching /{regex}/_, executes the generator and asserts that _field_ contains either `null` or strings that match the given regular expression. -* _{field} contains anything but strings matching /{regex}/_, executes the generator and asserts that _field_ contains either `null` or strings that do not match the given regular expression. 
- -* _{field} contains strings shorter than or equal to {length}_, executes the generator and asserts that _field_ contains either `null` or string values shorter than or equal to _{length}_ -* _{field} contains strings longer than or equal to {length}_, executes the generator and asserts that _field_ contains either `null` or string values longer than or equal to _{length}_ - - -#### Null (absence/presence) -* _{field} contains anything but null_, executes the generator and asserts that _field_ has a value in every row (i.e. no `null`s) - -### Cucumber test style guide -* Each test should be specific to one requirement. -* Tests should specify definite expected results rather than using "should include". -* All tables should be padded to the width of the largest item. -* All block-level indentation should be 2 spaces, as below: - -```gherkin -Feature: ... - ... - - Background: - Given ... - - Scenario: ... - Given ...: - | ... | - | ... | - | ... | - And ...: - """ - """ - When ... - Then ... - And ... -``` diff --git a/docs/archive/developer/DependencyInjection.md b/docs/archive/developer/DependencyInjection.md deleted file mode 100644 index 1cb5f0626..000000000 --- a/docs/archive/developer/DependencyInjection.md +++ /dev/null @@ -1,40 +0,0 @@ -# Dependency Injection - -### What - -We are using [Guice](https://github.com/google/guice) to achieve Dependency Injection (DI), and hence -Inversion of Control (IoC). Guice is a lightweight framework that removes the need to "new up" -dependencies within an object. This separates the responsibility of creating an object from -using it. - -### How - -To use Guice you must first initialise a container, which holds all the bindings for your dependencies.
-Within the container you can then bind your classes/interfaces to either: - -- A direct implementation - >bind(myInterface.class).to(myConcreteImplementation.class) - -- A provider, which creates a binding depending on user input - >bind(myInterface.class).toProvider(myImplementationProvider.class) - -You are then able to inject the bound dependency into the desired constructor using the **@Inject** -annotation. - -``` -private BoundInterface dependency; - -@Inject -public MyClass(BoundInterface dependency) { - this.dependency = dependency; -} -``` - -### Key Decisions - -One main change in the code is that the parsing of command line arguments and execution with those arguments -have been separated. They are now connected by a common CommandLineBase class, which is also where the -IoC container is initialised. - -Guice was selected as the DI framework due to its lightweight nature and the ease of integrating it into the -project. Other frameworks were investigated and discarded. \ No newline at end of file diff --git a/docs/archive/developer/DeveloperGuide.md b/docs/archive/developer/DeveloperGuide.md deleted file mode 100644 index 3748a8d35..000000000 --- a/docs/archive/developer/DeveloperGuide.md +++ /dev/null @@ -1,28 +0,0 @@ -## Key Concepts - -1. [Design Decisions](KeyDecisions.md) -1. [Decision Trees](decisionTrees/DecisionTrees.md) -1. [Profile Syntax](../user/Schema.md) - - -## Development - -1. [Contributing](../../../.github/CONTRIBUTING.md) -1. [Build and Run the Generator](../user/gettingStarted/BuildAndRun.md) -1. [Adding Schema Versions](HowToAddSupportForNewSchemaVersion.md) -1. [Dependency Injection](DependencyInjection.md) -1. [Cucumber Testing](CucumberSyntax.md) -1. [Git Merging](GitMerging.md) - -## Behavioural Explanations - -1. [Behaviour in Detail](behaviour/BehaviourInDetail.md) -1. [Null Operator](behaviour/NullOperator.md) - -## Key Algorithms and Data Structures - -1. [Decision Trees](decisionTrees/DecisionTrees.md) -1.
[Generation Algorithm](algorithmsAndDataStructures/GenerationAlgorithm.md) -1. [Field Fixing Strategy](algorithmsAndDataStructures/FieldFixingStrategy.md) -1. [String Generation](algorithmsAndDataStructures/StringGeneration.md) -1. [Tree Walker Types](decisionTreeWalkers/TreeWalkerTypes.md) \ No newline at end of file diff --git a/docs/archive/developer/DockerSetup.md b/docs/archive/developer/DockerSetup.md deleted file mode 100644 index 277faa2f5..000000000 --- a/docs/archive/developer/DockerSetup.md +++ /dev/null @@ -1,45 +0,0 @@ -# Build and run the generator using Docker - -The instructions below explain how to download the source code, and then build and run it using Docker. This generates a self-contained executable Docker image which can then run the generator without needing to install a JRE. If you would like to download and build the source code in order to contribute to development, we recommend you [build and run the generator using an IDE](../user/gettingStarted/BuildAndRun.md) instead. - -## Get Code - -Clone the repository to your local development folder. - -``` -git clone https://github.com/finos/datahelix.git -``` - -## Installation requirements - -* Docker EE or CE - -## Building the generator using Docker - -A Data Helix generator docker image can be built by running the following command in the root source code directory: - -``` -docker build . --tag datahelix -``` - -If you are on Linux, or any other system with a `sh`-compatible shell available, the following command is equivalent: - -``` -./docker-build.sh -``` - -## Running the generator using Docker - -Once built, you can run the image with: - -``` -docker run -ti -v mydir:/data datahelix [parameters] -``` - -Note that the `-v` option specifies how to map your local filesystem into the Docker image, so that the DataHelix generator can access the profile file that you pass to it, and can write its output to a location you can access. 
For example, if you run the image inside the profile directory, on a system with Unix-style environment variables, you can run the following command: - -``` -docker run -ti -v $PWD:/data datahelix --profile-file=/data/examples/actor-names/profile.json -``` - -This maps your current working directory (using the `$PWD` environment variable) to the `/data` directory in the Docker image's virtual filesystem, and uses this mapping to tell the generator to use the file `./examples/actor-names/profile.json` as its profile input. With this example, the generator writes its output to the console, but you can write the output data to a mapped directory in the same way. \ No newline at end of file diff --git a/docs/archive/developer/GitMerging.md b/docs/archive/developer/GitMerging.md deleted file mode 100644 index eed1c1050..000000000 --- a/docs/archive/developer/GitMerging.md +++ /dev/null @@ -1,20 +0,0 @@ -# Git Merging Instructions - -Suppose you have been developing a feature on the `feature` branch, but `master` has changed since you started work. - -This is depicted below: -``` -- - - master - | - | - - feature -``` - -To make a Pull Request, you will first need to merge `master` into `feature`. - -First, ensure that the local master is up to date. Then, check out the `feature` branch. - -If in doubt, `git merge master` then `git push` will work. - -If you don't want to have merge commits, you can rebase using `git rebase master` and push with `git push --force-with-lease`. - -Make sure you don't `git pull` between the rebase and the push because it can cause changes to be merged incorrectly. diff --git a/docs/archive/developer/HowToAddSupportForNewSchemaVersion.md b/docs/archive/developer/HowToAddSupportForNewSchemaVersion.md deleted file mode 100644 index 04843a1a2..000000000 --- a/docs/archive/developer/HowToAddSupportForNewSchemaVersion.md +++ /dev/null @@ -1,43 +0,0 @@ -## How to add support for a new schema version - -1.
Copy a package in _profile/src/main/resources/profileschema/_ and rename it to the new version number. -1. Change the _schemaVersion_ const from the old version number to the new one. - -### Example -If the file structure currently looks like the below... -``` -- profileschema - |- 0.1 - |- datahelix.schema.json -``` -...and the new version is 0.2 then change it to the following: -``` -- profileschema - |- 0.1 - |- datahelix.schema.json - |- 0.2 - |- datahelix.schema.json -``` - -Then change the below (in the new file)... -``` -... -"schemaVersion": { - "title": "The version of the DataHelix profile schema", - "const": "0.1" -}, -... -``` -...to this: -``` -... -"schemaVersion": { - "title": "The version of the DataHelix profile schema", - "const": "0.2" -}, -... -``` - -You will need to update the test in _ProfileSchemaImmutabilityTests_ to contain the new schema version generated. Old versions should **not** be modified. This is reflected by the test failing if any existing schemas are modified. - -If you experience any issues with this test not updating the schema in IntelliJ, it is recommended to invalidate the cache and restart, or to delete the _profile/out_ directory and rebuild. diff --git a/docs/archive/developer/KeyDecisions.md b/docs/archive/developer/KeyDecisions.md deleted file mode 100644 index dbaf9b698..000000000 --- a/docs/archive/developer/KeyDecisions.md +++ /dev/null @@ -1,84 +0,0 @@ -# Key decisions - -## Do type-specific constraints imply corresponding type constraints? - -For instance, what is generated for the price field in this example? - -```javascript -{ "field": "price", "greaterThan": 4 } -``` - -This constraint means: - * Everything except numbers less than or equal to 4 (eg, strings are valid). Users are expected to supplement it with type constraints. - - -## Does negating a constraint complement its denotation? - -In other words, given a constraint `C`, is `¬C` satisfied by everything that doesn't satisfy `C`?
- -In some cases this is intuitive: - -- If `C` says that a field is null, `¬C` should permit that field to be anything _other_ than null. -- If `C` says that a field is in a set, `¬C` should permit anything _not_ in that set. -- If `C` says that a field is a decimal, `¬C` should permit strings, datetimes, etc. - -But: - -- If `C` says that a field is a number greater than 3, it might be intuitive to say that `¬C` permits numbers less than or equal to 3. - -Note that negation of the integer type is not yet fully defined, as we do not have a negation of granularTo implemented. - -## Does an inSet constraint imply anything about nullability? - -```javascript -{ "field": "product_type", "inSet": [ "a", "b" ] } -``` - -Given the above, should we expect nulls? If null is considered a _value_ then no would be a reasonable answer, but it can equally be considered the absence of a value. - -## What do datetime constraints mean when datetimes are partially specified? - -```javascript -{ "field": "creationDate", "after": "2015-01-01" } -``` - -Should I be able to express the above, and if so what does it mean? Intuitively, we might say that it cannot be satisfied with, eg, `2015-01-01 12:00:00`, but it depends on how we interpret underspecified datetimes: - -* As an **instant**? If so, `2015-01-01` is interpreted as `2015-01-01 00:00:00.000`, and even a datetime 0.01 milliseconds later would satisfy the above constraint. -* As a **range**? If so, `2015-01-01` is interpreted as `2015-01-01 00:00:00.000 ≤ X < 2015-01-02 00:00:00.000`, and the `before` and `after` constraints are interpreted relative to the start and end of this range, respectively. - -Both of these approaches seem more or less intuitive in different cases (for example, how should `equalTo` constraints be applied?). To resolve this problem, we currently require datetime expressions to be fully specified down to thousandths of milliseconds. - -## How should we generate characters outside the Basic Latin character set?
- -We currently only support generation of characters represented in the range 002c-007e. - -Either we can: -1) Update the tool to reject any regular expressions that contain characters outside of this range. -2) Update the tool to accept & generate these characters. - -## Is the order of rows emitted between each run consistent? - -We currently do not guarantee that the order of valid rows is consistent between each generation run. For example, on one execution we may produce the following output for three fields: - -| Field A | Field B | Field C | -|---------|---------|---------| -| 1 | Z | 100.5 | -| 1 | Z | 95.2 | -| 2 | Z | 100.5 | -| 2 | Z | 95.2 | -| 1 | Y | 100.5 | -| 1 | Y | 95.2 | - -However, on another run using the same profile we may produce the following output: - -| Field A | Field B | Field C | -|---------|---------|---------| -| 1 | Z | 100.5 | -| 1 | Z | 95.2 | -| 1 | Y | 100.5 | -| 1 | Y | 95.2 | -| 2 | Z | 100.5 | -| 2 | Z | 95.2 | - -Both produce valid output and both produce the same values as a whole, but in slightly different order. \ No newline at end of file diff --git a/docs/archive/developer/algorithmsAndDataStructures/GenerationAlgorithm.md b/docs/archive/developer/algorithmsAndDataStructures/GenerationAlgorithm.md deleted file mode 100644 index 8cb2f5503..000000000 --- a/docs/archive/developer/algorithmsAndDataStructures/GenerationAlgorithm.md +++ /dev/null @@ -1,64 +0,0 @@ -# Decision tree generation - -Given a set of rules, generate a [decision tree](../decisionTrees/DecisionTrees.md) (or multiple if [partitioning](../decisionTrees/Optimisation.md#Partitioning) was successful). - -## Decision tree interpretation - -An interpretation of the decision tree is defined by choosing an option for every decision visited in the tree.
- -![](../../user/images/interpreted-graph.png) - -In the above diagram the red lines represent one interpretation of the graph: for every decision an option has been chosen, and we end up with the set of constraints that the red lines touch at any point. These constraints are reduced into a fieldspec (see [Constraint Reduction](#constraint-reduction) below). - -Every decision introduces new interpretations: every option of each decision is combined with every option of every other decision. If there are many decisions then this can result in too many interpretations. - -# Constraint reduction - -An interpretation of a decision tree could contain several atomic constraints related to a single field. To make it easier to reason about these collectively, we **reduce** them into more detailed, holistic objects. These objects are referred to as **fieldspecs**, and can express any restrictions expressed by a constraint. For instance, the constraints: - -* `X greaterThanOrEqualTo 3` -* `X lessThanEqualTo 6` -* `X not null` - -could collapse to - -``` -{ - min: 3, - max: 6, - nullability: not_null -} -``` - -*(note: this is a conceptual example and not a reflection of actual object structure)* - -See [Set restriction and generation](../../user/SetRestrictionAndGeneration.md) for a more in-depth explanation of how the constraints are merged and data generated. - -This object has all the information needed to produce the values `[3, 4, 5, 6]`. - -The reduction algorithm works by converting each constraint into a corresponding, sparsely-populated fieldspec, and then merging them together.
During merging, three outcomes are possible: - -* The two fieldspecs specify distinct, compatible restrictions (eg, `X is not null` and `X > 3`), and the merge is uneventful -* The two fieldspecs specify overlapping but compatible restrictions (eg, `X is in [2, 3, 8]` and `X is in [0, 2, 3]`), and the more restrictive interpretation is chosen (eg, `X is in [2, 3]`). -* The two fieldspecs specify overlapping but incompatible restrictions (eg, `X > 2` and `X < 2`); the merge fails and the interpretation of the decision tree is rejected - -# Databags - -A **databag** is an immutable mapping from fields to outputs, where outputs are a pairing of a *value* and *formatting information* (eg, a date formatting string or a number of decimal places). - -Databags can be merged, but merging two databags fails if they have any keys in common. - -Fieldspecs are able to produce streams of databags containing valid values for the field they describe. Additional operations can then be applied over these streams, such as: - -* A memoization decorator that records values being output so they can be replayed inexpensively -* A filtering decorator that prevents repeated values being output -* A merger that takes multiple streams and applies one of the available [combination strategies](../../user/CombinationStrategies.md) -* A concatenator that takes multiple streams and outputs all the members of each - -# Output - -Once fieldspecs have generated streams of single-field databags, and databag stream combiners have merged them together, we should have a stream of databags that each contains all the information needed for a single datum. At this point, a serialiser can take each databag in turn and create an output. For instance: - -## CSV - -Given a databag, iterate through the fields in the profile, in order, and look up values from the databag. Create a row of output from those values.
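The databag merge rule described above (merging fails if the two bags share a field) can be sketched as follows. This is an illustrative sketch only: `DataBag` here is a hypothetical class, not the generator's actual implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a databag: an immutable field-to-output mapping.
final class DataBag {
    final Map<String, Object> values;

    DataBag(Map<String, Object> values) {
        // Defensive copy wrapped as unmodifiable, matching the
        // "immutable mapping" description above.
        this.values = Collections.unmodifiableMap(new HashMap<>(values));
    }

    // Merging fails if the two bags have any keys (fields) in common.
    static DataBag merge(DataBag left, DataBag right) {
        Map<String, Object> merged = new HashMap<>(left.values);
        for (Map.Entry<String, Object> entry : right.values.entrySet()) {
            if (merged.containsKey(entry.getKey())) {
                throw new IllegalArgumentException(
                    "databags share field: " + entry.getKey());
            }
            merged.put(entry.getKey(), entry.getValue());
        }
        return new DataBag(merged);
    }
}
```

Failing loudly on a shared key, rather than silently preferring one side, matches the behaviour described above: two streams should never produce values for the same field.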
diff --git a/docs/archive/developer/algorithmsAndDataStructures/OptimisationProcess.md b/docs/archive/developer/algorithmsAndDataStructures/OptimisationProcess.md deleted file mode 100644 index 661728628..000000000 --- a/docs/archive/developer/algorithmsAndDataStructures/OptimisationProcess.md +++ /dev/null @@ -1,49 +0,0 @@ -# Decision Tree Optimiser - -We optimise the decision tree to improve the performance of the generator. - -The optimiser only does ["Unification" optimisation](../decisionTrees/Optimisation.md), which works by finding decisions that have two options, -where one option has a constraint and the other option has the negation of that constraint. -It then combines all of those decisions together. - -Because this is the only type of optimisation it does, it relies on the tree being created in a way that will contain these types of decision. -So the tree factory specifically creates if statements in the form `if X then Y == ¬X || (X & Y)`, so that X is the shared constraint and can be optimised against other "if X then [...]" constraints. - -## Strategy - -The process follows the steps below (exit early if there is nothing to process): - -* From a given `ConstraintNode`, get all the atomic constraints in all _options_ of __decisions immediately below itself__ -* Group the atomic constraints to identify if there are any which are prolific, i.e. where the atomic constraint is repeated.
-* Order the groups of atomic constraints so that _NOT_ constraints are disfavoured -* Pick the first group in the ordered set, this contains an identification of the _most prolific constraint_ -* Create a new `DecisionNode` and attach it to the current _root_ `ConstraintNode` -* Create a new `ConstraintNode` (_A_) under the new `DecisionNode` with the most prolific constraint as the single atomic constraint -* Create a new `ConstraintNode` (_B_) under the new `DecisionNode` with the negated form of the most prolific constraint as the single atomic constraint -* Iterate through all decisions under the root `ConstraintNode`, for each decision inspect each option (`ConstraintNode`s) -* If the option has an atomic constraint that matches the most prolific constraint, clone the option (`ConstraintNode`) [comment 2] _excluding the most prolific constraint_, add it to the new constraint (_A_) -* If the option has an atomic constraint that matches the negated form of the most prolific constraint, clone the option (`ConstraintNode`) [comment 2] _excluding the negated form of the most prolific constraint_, add it to the new constraint (_B_) -* If the option does NOT have any atomic constraint that matches the most prolific constraint (or its negated form), add the option as another option under the new `DecisionNode` -* __Simplification:__ Inspect the decisions under the most prolific constraint node (_A_), if any has a single option, then hoist the atomic constraint up to the parent constraint node. Repeat the process for the negated constraint node (_B_) -* __Recursion:__ Start the process again for both of the newly created constraint nodes (to recurse through the tree) - -[comment 1] -The process will address negated (_NOT_) atomic constraints if they are the most prolific. The process will simply prefer to optimise a NON-negated constraint over that of a negated constraint where they have the same usage count. 
- -[comment 2] -`ConstraintNodes` can be __cloned, excluding a given atomic constraint__; when this happens all other properties are copied across to the new instance, therefore all of the decisions on the constraint are preserved. - -[comment 3] -The optimiser is used when generating data. It makes no attempt to maintain the order of the tree when optimising it, as order isn't important for the process of data generation. - -## Processing - -The optimisation process will repeat at the current level for a maximum of __50__ (_default_) times. The process will stop repeating if no optimisations were made in the previous iteration. - -The process will recurse through the tree each time it makes an optimisation; it will process any constraint node that meets the following criteria: -* The constraint has at least __2__ decisions -* The current depth is less than __10m__ (_default_) - -## Testing - -The optimisation process has integration tests that are run as part of every build; they take a given profile as an input and assert that the factorised tree (regardless of ordering) matches what is expected. \ No newline at end of file diff --git a/docs/archive/developer/algorithmsAndDataStructures/StringGeneration.md b/docs/archive/developer/algorithmsAndDataStructures/StringGeneration.md deleted file mode 100644 index 4919c0505..000000000 --- a/docs/archive/developer/algorithmsAndDataStructures/StringGeneration.md +++ /dev/null @@ -1,124 +0,0 @@ -# String Generation - -We use a Java library called [dk.brics.automaton](http://www.brics.dk/automaton/) to analyse regexes and generate valid (and invalid for [violation](../../user/alphaFeatures/DeliberateViolation.md)) strings based on them. It works by representing the regex as a finite state machine. It might be worth reading about state machines for those who aren't familiar: [https://en.wikipedia.org/wiki/Finite-state_machine](https://en.wikipedia.org/wiki/Finite-state_machine).
Consider the following regex: `ABC[a-z]?(A|B)`. It would be represented by the following state machine:
-
-![](../../user/images/finite-state-machine.svg)
-
-The [states](http://www.brics.dk/automaton/doc/index.html) (circular nodes) represent a string that may (green) or may not (white) be valid. For example, `s0` is the empty string, and `s5` is `ABCA`.
-
-The [transitions](http://www.brics.dk/automaton/doc/index.html) represent adding another character to the string. The characters allowed by a transition may be a range (as with `[a-z]`). A state does not itself store the string it represents; rather, it is defined by the ordered list of transitions that lead to that state.
-
-Another project that also uses dk.brics.automaton in a similar way to us might be useful to look at for further study: [https://github.com/mifmif/Generex](https://github.com/mifmif/Generex).
-
-Other than the fact that we can use the state machine to generate strings, the main benefits we get from using this library are:
-* Finding the intersection of two regexes, used when there are multiple regex constraints on the same field.
-* Finding the complement of a regex, which we use for generating invalid strings for violation.
-
-Due to the way the generator computes textual data internally, string generation is not deterministic and may output valid values in a different order with each generation run.
-
-## Anchors
-
-dk.brics.automaton doesn't support the start and end anchors `^` & `$`; instead it matches the entire word as if the anchors were always present. For some of our use cases, though, we may want to match the regex in the middle of a string somewhere, so we have two versions of the regex constraint - [matchingRegex](../../user/UserGuide.md#predicate-matchingregex) and [containingRegex](../../user/UserGuide.md#predicate-containingregex). If `containingRegex` is used then we simply add `.*` to the start and end of the regex before passing it into the automaton.
Any `^` or `$` characters passed at the start or end of the string respectively are removed, as the automaton would otherwise treat them as literal characters.
-
-## Automaton data types
-The automaton represents the state machine using the following types:
-- `Transition`
-- `State`
-
-### `Transition`
-A transition holds the following properties, and is represented as a line in the graph above:
-- `min: char` - The minimum permitted character that can be emitted at this position
-- `max: char` - The maximum permitted character that can be emitted at this position
-- `to: State[]` - The `State`s that can follow this transition
-
-In the graph above, the `A` transition looks like:
-
-| property | initial | \[a-z\] |
-| ---- | ---- | ---- |
-| min | A | a |
-| max | A | z |
-| to | 1 state, `s1` | 1 state, `s4` |
-
-### `State`
-A state holds the following properties, and is represented as a circle in the graph above:
-- `accept: boolean` - is this a termination state; can string production stop here?
-- `transitions: HashSet` - which transitions, if any, follow this state
-- `number: int` - the number of this state
-- `id: int` - the id of this state (not sure what this is used for)
-
-In the graph above, `s0` looks like:
-
-| property | initial | s3 |
-| ---- | ---- | ---- |
-| accept | false | false |
-| transitions | 1 transition, → `A` | 2 transitions:
→ `[a-z]`
→ `A|B` |
-| number | 4 | 0 |
-| id | 49 | 50 |
-
-### Textual representation
-The automaton can represent the state machine in a textual form such as:
-
-```
-initial state: 4
-state 0 [reject]:
-  a-z -> 3
-  A-B -> 1
-state 1 [accept]:
-state 2 [reject]:
-  B -> 5
-state 3 [reject]:
-  A-B -> 1
-state 4 [reject]:
-  A -> 2
-state 5 [reject]:
-  C -> 0
-```
-
-This shows each state and each transition in the automaton; lines 2-4 show the `State` described in the previous section, and lines 10-11 show the transition 'A' described in the prior section.
-
-The pathway through the automaton is:
-- start at state **4** (the initial state; the empty string is rejected/incomplete)
-- add an 'A' (state 4)
-- transition to state **2** (because the current state "A" is rejected/incomplete)
-- add a 'B' (state 2)
-- transition to state **5** (because the current state "AB" is rejected/incomplete)
-- add a 'C' (state 5)
-- transition to state **0** (because the current state "ABC" is rejected/incomplete)
-  - _either_
-    - add a letter between `a..z` (state 0, transition 1)
-    - transition to state **3** (because the current state "ABCa" is rejected/incomplete)
-    - add either 'A' or 'B' (state 3)
-    - transition to state **1**
-    - the current state "ABCaA" is accepted, so exit with the current string "ABCaA"
-  - _or_
-    - add either 'A' or 'B' (state 0, transition 2)
-    - transition to state **1**
-    - the current state "ABCA" is accepted, so exit with the current string "ABCA"
-
-## Character support
-
-The generator does not support generating strings above the Basic Multilingual Plane (Plane 0). Using regexes that match characters above the basic plane may lead to unexpected behaviour.
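The pathway walkthrough above can be reproduced with a toy model of the automaton (a Python sketch for illustration only; the real generator uses dk.brics.automaton in Java):

```python
import random

# Toy encoding of the state machine for ABC[a-z]?(A|B): each state maps
# to (accept, [(min_char, max_char, next_state), ...]).
STATES = {
    4: (False, [("A", "A", 2)]),
    2: (False, [("B", "B", 5)]),
    5: (False, [("C", "C", 0)]),
    0: (False, [("a", "z", 3), ("A", "B", 1)]),
    3: (False, [("A", "B", 1)]),
    1: (True, []),
}

def generate(rng, initial=4):
    """Walk the automaton from the initial state, picking a random
    transition and a random character in its [min, max] range, until an
    accepting state with no outgoing transitions is reached."""
    state, out = initial, []
    while True:
        accept, transitions = STATES[state]
        if accept and not transitions:
            return "".join(out)
        lo, hi, state = rng.choice(transitions)
        out.append(chr(rng.randint(ord(lo), ord(hi))))

print(generate(random.Random(0)))  # a string matching ABC[a-z]?(A|B)
```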
diff --git a/docs/archive/developer/behaviour/BehaviourInDetail.md b/docs/archive/developer/behaviour/BehaviourInDetail.md deleted file mode 100644 index 8b10a7927..000000000 --- a/docs/archive/developer/behaviour/BehaviourInDetail.md +++ /dev/null @@ -1,100 +0,0 @@ -# Behaviour in Detail -## Nullness -### Behaviour -Nulls can always be produced for a field, except when a field is explicitly not null. - -### Misleading Examples -|Field is |Null produced| -|:----------------------|:-----------:| -|Of type X | ✔ | -|Not of type X | ✔ | -|In set [X, Y, ...] | ✔ | -|Not in set [X, Y, ...] | ✔ | -|Equal to X | ❌ | -|Not equal to X | ✔ | -|Greater than X | ✔ | -|Null | ✔ | -|Not null | ❌ | - -For the profile snippet: -``` -{ "if": - { "field": "A", "equalTo": 1 }, - "then": - { "field": "B", "equalTo": 2 } -}, -{ "field": "A", "equalTo": 1 } -``` - -|Allowed value of A|Allowed value of B| -|------------------|------------------| -|1 |2 | - -## Type Implication -### Behaviour -No operators imply type (except ofType ones). By default, all values are allowed. - -### Misleading Examples -Field is greater than number X: - -|Values |Can be produced| -|----------------------|:-------------:| -|Numbers greater than X|✔ | -|Numbers less than X |❌ | -|Null |✔ | -|Strings |✔ | -|Date-times |✔ | - -## Violation of Rules -### Behaviour -Rules, constraints and top level `allOf`s are all equivalent in normal generation. - -In violation mode (an _alpha feature_), rules are treated as blocks of violation. -For each rule, a file is generated containing data that can be generated by combining the -violation of that rule with the non-violated other rules. - -This is equivalent to the behaviour for constraints and `allOf`s, but just splitting it -into different files. - -## General Strategy for Violation -_This is an alpha feature. Please do not rely on it. 
If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._
-### Behaviour
-The violation output is not guaranteed to be able to produce any data,
-even when the negation of the entire profile could produce data.
-
-### Why
-The violation output could have been calculated by simply negating an entire rule or profile. This could then produce all data that breaks the original profile in any way. However, this includes data that breaks the profile in multiple ways at once. This could be very noisy, because the user is expected to test one small breakage in a localised area at a time.
-
-To minimise the noise in an efficient way, a guarantee of completeness is broken. The system does not guarantee to be able to produce violating data in all cases where such data could exist. In some cases this means that no data is produced at all.
-
-In normal negation, negating `allOf [A, B, C]` gives any of the following:
-1) `allOf[NOT(A), B, C]`
-2) `allOf[A, NOT(B), C]`
-3) `allOf[A, B, NOT(C)]`
-4) `allOf[NOT(A), NOT(B), C]`
-5) `allOf[A, NOT(B), NOT(C)]`
-6) `allOf[NOT(A), B, NOT(C)]`
-7) `allOf[NOT(A), NOT(B), NOT(C)]`
-
-These are listed from least to most noisy. The current system only tries to generate data by negating one sub-constraint at a time (in this case, producing only 1, 2 and 3).
-
-### Misleading examples
-When a field is a string, an integer and not null, no data can be produced normally,
-but data can be produced in violation mode.
-
-|Values                |Can be produced when in violation mode|
-|----------------------|:-------------------------------------|
-|Null                  |✔ (By violating the "not null" constraint) |
-|Numbers               |✔ (By violating the "string" constraint) |
-|Strings               |✔ (By violating the "integer" constraint) |
-|Date-times            |❌ (This would need both the "string" and "integer" constraints to be violated at the same time) |
-
-If a field is set to null twice, no data can be produced in violation mode because it tries to evaluate both null and not null:
-
-|Values                |Can be produced when in violation mode|
-|----------------------|:-------------------------------------|
-|Null                  |❌|
-|Numbers               |❌|
-|Strings               |❌|
-|Date-times            |❌|
-
diff --git a/docs/archive/developer/behaviour/NullOperator.md b/docs/archive/developer/behaviour/NullOperator.md
deleted file mode 100644
index 1dc8a04f9..000000000
--- a/docs/archive/developer/behaviour/NullOperator.md
+++ /dev/null
@@ -1,161 +0,0 @@
-# `null`, presence, absence and the empty set
-
-The `null` operator in a profile, expressed as `"is": "null"` or its negated equivalent, has several meanings, each described below:
-
-### Possible scenarios:
-
-| Absence / Presence | Field value |
-| ---- | ---- |
-| (A) _null operator omitted_
**The default**. The field's value may be absent or present | (B) `is null`
The field will have _no value_ | -| (C) `not(is null)`
The field's value must be present | (D) `not(is null)`
The field must have a value |
-
-Therefore the null operator can:
-- (C, D) `not(is null)`: express fields that must have a value (otherwise known as a non-nullable field)
-- (B) `is null`: express fields as having no value (otherwise known as setting the value to `null`)
-- (A) _By omitting the constraint_: express fields as permitting the absence or presence of a value (otherwise known as a nullable field)
-
-### `null` and interoperability
-`null` is a keyword/term that exists in other technologies and languages; so far as this tool is concerned, it relates to the absence or presence of a value. See [set restriction and generation](../../user/SetRestrictionAndGeneration.md) for more details.
-
-When a field is serialised or otherwise written to a medium as the output of the generator, the serialiser may choose to represent the absence of a value by using the format's `null` representation, or some other form such as omitting the property, and so on.
-
-#### For illustration
-CSV files do not have any standard for representing the absence of a value differently to an empty string (unless all strings are always wrapped in quotes ([#441](https://github.com/ScottLogic/data-engineering-generator/pull/441))).
-
-JSON files could be presented with `null` as the value for a property, or by excluding the property from the serialised result. This is the responsibility of the serialiser and depends on the use cases.
-
-## The `null` operator and the `if` constraint
-With `if` constraints, the absence of a value needs to be considered in order to understand how the generator will behave. Remember, every set contains the empty set, unless excluded by way of the `not(is null)` constraint; for more details see [set restriction and generation](../../user/SetRestrictionAndGeneration.md).
-
-Consider the following if constraint:
-
-```
-{
-    "if": {
-        "field": "field1",
-        "equalTo": 5
-    },
-    "then": {
-        "field": "field2",
-        "equalTo": "a"
-    }
-}
-```
-
-The generator will expand the `if` constraint as follows, to ensure the constraint is fully balanced:
-
-```
-{
-    "if": {
-        "field": "field1",
-        "equalTo": 5
-    },
-    "then": {
-        "field": "field2",
-        "equalTo": "a"
-    },
-    "else": {
-        "not": {
-            "field": "field1",
-            "equalTo": 5
-        }
-    }
-}
-```
-
-This expression does not prevent the consequence (the `then` constraints) from being considered when `field1` has no value. Equally it does not say anything about the alternative consequence (the `else` constraints). As such, both outcomes are applicable at any time.
-
-The solution to this is to express the `if` constraint as follows. This is not 'auto completed' for profiles as it would remove functionality that may be intended; it must be explicitly included in the profile.
-
-```
-{
-    "if": {
-        "allOf": [
-            {
-                "field": "field1",
-                "equalTo": 5
-            },
-            {
-                "not": {
-                    "field": "field1",
-                    "is": "null"
-                }
-            }
-        ]
-    },
-    "then": {
-        "field": "field2",
-        "equalTo": "a"
-    }
-}
-```
-
-The generator will expand this `if` constraint as follows, to ensure the constraint is fully balanced:
-
-```
-{
-    "if": {
-        "allOf": [
-            {
-                "field": "field1",
-                "equalTo": 5
-            },
-            {
-                "not": {
-                    "field": "field1",
-                    "is": "null"
-                }
-            }
-        ]
-    },
-    "then": {
-        "field": "field2",
-        "equalTo": "a"
-    },
-    "else": {
-        "anyOf": [
-            {
-                "not": {
-                    "field": "field1",
-                    "equalTo": 5
-                }
-            },
-            {
-                "field": "field1",
-                "is": "null"
-            }
-        ]
-    }
-}
-```
-
-In this case the `then` constraints will only be applicable when `field1` has a value. Where `field1` has no value, either of the `else` constraints can be considered applicable. Nevertheless `field2` will only have the value `"a"` when `field1` has the value `5`, not also when it is absent.
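The difference in applicability can be modelled as a small truth table (an illustrative Python sketch, not generator behaviour; `None` stands for an absent value):

```python
def applicable_outcomes(field1_value, guard_excludes_null):
    """Return which branches of the expanded if/else can apply.

    Without the not(is null) guard, an absent value leaves both the
    'then' and 'else' outcomes applicable; with the guard, an absent
    value satisfies the 'else' side only."""
    if field1_value is None:
        return {"else"} if guard_excludes_null else {"then", "else"}
    return {"then"} if field1_value == 5 else {"else"}

# Without the guard, an absent field1 leaves both outcomes applicable:
print(sorted(applicable_outcomes(None, guard_excludes_null=False)))  # ['else', 'then']
# With the explicit not(is null) guard, only the else branch applies:
print(sorted(applicable_outcomes(None, guard_excludes_null=True)))   # ['else']
print(sorted(applicable_outcomes(5, guard_excludes_null=True)))      # ['then']
```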
- -### Examples: -Considering this use case, you're trying to generate data to be imported into a SQL server database. Below are some examples of constraints that may help define fields and their mandatoriness or optionality. - -* A field that is non-nullable
-`field1 ofType string and field1 not(is null)` - -* A field that is nullable
-`field1 ofType string` - -* A field that has no value
-`field1 is null`
-
-## Violations
-Violations are a special case for the `null` operator; see [Deliberate Violation](../../user/alphaFeatures/DeliberateViolation.md) for more details.
\ No newline at end of file
diff --git a/docs/archive/developer/decisionTreeWalkers/DecisionBasedWalker.md b/docs/archive/developer/decisionTreeWalkers/DecisionBasedWalker.md
deleted file mode 100644
index 9efe84604..000000000
--- a/docs/archive/developer/decisionTreeWalkers/DecisionBasedWalker.md
+++ /dev/null
@@ -1,9 +0,0 @@
-The Decision Based Tree solver generates row specs by:
- 1. choosing and removing a decision from the tree
- 2. selecting an option from that decision
- 3. adding the constraints from the chosen option to the root of the tree
-    - adding the sub decisions from the chosen option to the root of the tree
- 4. "pruning" the tree by removing any options from the tree that contradict the new root node
-    - any decisions that only have 1 remaining option will have that option also moved up the tree, and pruned again.
- 5. restarting from 1, until there are no decisions left
- 6. creating a rowspec from the constraints in the remaining root node.
\ No newline at end of file
diff --git a/docs/archive/developer/decisionTrees/DecisionTrees.md b/docs/archive/developer/decisionTrees/DecisionTrees.md
deleted file mode 100644
index d8c439896..000000000
--- a/docs/archive/developer/decisionTrees/DecisionTrees.md
+++ /dev/null
@@ -1,42 +0,0 @@
-# Decision Trees
-
-**Decision Trees** contain **Constraint Nodes** and **Decision Nodes**:
-
-* Constraint Nodes contain atomic constraints and a set of decision nodes, and are satisfied by a data entry if it satisfies all atomic constraints and decision nodes.
-* Decision Nodes contain Constraint Nodes, and are satisfied if at least one Constraint Node is satisfied.
-
-Every Decision Tree is rooted by a single Constraint Node.
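These satisfaction rules can be sketched with a tiny recursive model (illustrative Python; the generator's real node types are Java classes):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Atomic = Callable[[Dict], bool]  # an atomic constraint tests one row

@dataclass
class DecisionNode:
    options: List["ConstraintNode"]

    def satisfied(self, row: Dict) -> bool:
        # A decision node is satisfied if at least one option is.
        return any(opt.satisfied(row) for opt in self.options)

@dataclass
class ConstraintNode:
    atomics: List[Atomic] = field(default_factory=list)
    decisions: List[DecisionNode] = field(default_factory=list)

    def satisfied(self, row: Dict) -> bool:
        # A constraint node needs all atomics and all decisions satisfied.
        return (all(a(row) for a in self.atomics)
                and all(d.satisfied(row) for d in self.decisions))

# Root node for: x = 3 AND (y = 1 OR y = 2)
root = ConstraintNode(
    atomics=[lambda r: r["x"] == 3],
    decisions=[DecisionNode([
        ConstraintNode(atomics=[lambda r: r["y"] == 1]),
        ConstraintNode(atomics=[lambda r: r["y"] == 2]),
    ])],
)
print(root.satisfied({"x": 3, "y": 2}))  # True
print(root.satisfied({"x": 3, "y": 5}))  # False
```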
- -## Example - -In our visualisations, we notate constraint nodes with rectangles and decision nodes with triangles. - -![](hoisting.before.svg) - -## Derivation - -Given a set of input constraints, we can build an equivalent Decision Tree. - -One process involved in this is **constraint normalisation**, which transforms a set of constraints into a new set with equivalent meaning but simpler structure. This happens through repeated application of some known equivalences, each of which consumes one constraint and outputs a set of replacements: - -| Input | Outputs | -| ------------------ | ----------------------------- | -| `¬¬X` | `X` | -| `AND(X, Y)` | `X, Y` | -| `¬OR(X, Y, ...)` | `¬X, ¬Y, ...` | -| `¬AND(X, Y, ...)` | `OR(¬X, ¬Y, ...)` | -| `¬IF(X, Y)` | `X, ¬Y` | -| `¬IFELSE(X, Y, Z)` | `OR(AND(X, ¬Y), AND(¬X, ¬Z))` | - -We can convert a set of constraints to a Constraint Node as follows: - -1. Normalise the set of constraints -2. Take each constraint in sequence: - * If the constraint is atomic, add it to the Constraint Node - * If the constraint is an `OR`, add a Decision Node. Convert the operands of the `OR` into Constraint Nodes - * If the constraint is an `IF(X, Y)`, add a Decision Node with two Constraint Nodes. One is converted from `AND(X, Y)`, the other from `¬X` - * If the constraint is an `IFELSE(X, Y, Z)`, add a Decision Node with two Constraint Nodes. One is converted from `AND(X, Y)`, the other from `AND(¬X, Z)` - -## Optimisation - -As a post-processing step, we apply [optimisations](Optimisation.md) to yield equivalent but more tractable trees. diff --git a/docs/archive/developer/decisionTrees/Optimisation.md b/docs/archive/developer/decisionTrees/Optimisation.md deleted file mode 100644 index dafd20bec..000000000 --- a/docs/archive/developer/decisionTrees/Optimisation.md +++ /dev/null @@ -1,51 +0,0 @@ -# Decision tree optimisation - -## Partitioning - -The expression of a field may depend on the expression of other fields. 
For instance, given `X = 3 OR Y = 5`, `Y` must be `5` if `X` is not `3`; `X` and `Y` can be said to _co-vary_. This covariance property is transitive; if `X` and `Y` co-vary and `Y` and `Z` co-vary, then `X` and `Z` also co-vary. Given this definition, it's usually possible to divide a profile's fields into smaller groups of fields that co-vary. This process is called **partitioning**. - -For example, given the below tree: - -![](partitioning.before.svg) - -We can observe that variations in `x` and `y` have no implications on one another, and divide into two trees: - -![](partitioning.after1.svg) ![](partitioning.after2.svg) - -The motivation for partitioning is to determine which fields can vary independently of each other so that streams of values can be generated for them independently (and potentially in parallel execution threads) and then recombined by any preferred [combination strategy](../../user/CombinationStrategies.md). - -## Unification - -Consider the below tree: - -![](unification.before.svg) - -It's impossible to [partition](#Partitioning) this tree because the `type` field affects every decision node. However, we can see that the below tree is equivalent: - -![](unification.after.svg) - -Formally: If you can identify pairs of sibling, equivalent-valency decision nodes A and B such that for each constraint node in A, there is precisely one mutually satisfiable node in B, you can collapse the decisions. There may be multiple ways to do this; the ordering of combinations affects how extensively the tree can be reduced. - -## Deletion - -Consider the below tree: - -![](deletion.before.svg) - -Because the leftmost node contradicts the root node, we can delete it. Thereafter, we can pull the content of the other constraint node up to the root node. However, because `¬(x > 12)` is entailed by `x = 3`, we delete it as well. 
This yields: - -![](deletion.after.svg) - -## Hoisting - -Consider the below tree: - -![](hoisting.before.svg) - -We can simplify to: - -![](hoisting.after.svg) - -Formally: If a Decision Node `D` contains a Constraint Node `C` with no constraints and a single Decision Node `E`, `E`'s Constraint Nodes can be added to `D` and `C` removed. - -This optimisation addresses situations where, for example, an `anyOf` constraint is nested directly inside another `anyOf` constraint. diff --git a/docs/archive/developer/decisionTrees/deletion.after.json b/docs/archive/developer/decisionTrees/deletion.after.json deleted file mode 100644 index 15b79e47b..000000000 --- a/docs/archive/developer/decisionTrees/deletion.after.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "fields": [ - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { "field": "x", "equalTo": 3 } - ] -} diff --git a/docs/archive/developer/decisionTrees/deletion.after.svg b/docs/archive/developer/decisionTrees/deletion.after.svg deleted file mode 100644 index 97e4af7b2..000000000 --- a/docs/archive/developer/decisionTrees/deletion.after.svg +++ /dev/null @@ -1,17 +0,0 @@ - - - - - - -tree - -c0 - -x = 3 - - - diff --git a/docs/archive/developer/decisionTrees/deletion.before.json b/docs/archive/developer/decisionTrees/deletion.before.json deleted file mode 100644 index 617ecf97b..000000000 --- a/docs/archive/developer/decisionTrees/deletion.before.json +++ /dev/null @@ -1,14 +0,0 @@ -{ - "fields": [ - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { "field": "x", "equalTo": 3 }, - - { - "if": { "field": "x", "greaterThan": 12 }, - "then": { "field": "y", "equalTo": 5 } - } - ] -} diff --git a/docs/archive/developer/decisionTrees/deletion.before.svg b/docs/archive/developer/decisionTrees/deletion.before.svg deleted file mode 100644 index 40bcaf82b..000000000 --- a/docs/archive/developer/decisionTrees/deletion.before.svg +++ /dev/null @@ -1,44 +0,0 @@ - - - - - - -tree - -c0 - -x = 3 - - -d1 - - - 
-c0--d1 - - - -c2 - -x > 12 -y = 5 - - -d1--c2 - - - -c3 - -¬(x > 12) - - -d1--c3 - - - - diff --git a/docs/archive/developer/decisionTrees/hoisting.after.svg b/docs/archive/developer/decisionTrees/hoisting.after.svg deleted file mode 100644 index b916d09f9..000000000 --- a/docs/archive/developer/decisionTrees/hoisting.after.svg +++ /dev/null @@ -1,51 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -c2 - -x = a - - -d1--c2 - - - -c3 - -x = b - - -d1--c3 - - - -c4 - -x = c - - -d1--c4 - - - - diff --git a/docs/archive/developer/decisionTrees/hoisting.before.json b/docs/archive/developer/decisionTrees/hoisting.before.json deleted file mode 100644 index f358c47a2..000000000 --- a/docs/archive/developer/decisionTrees/hoisting.before.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "fields": [ - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { "anyOf": [ - { "field": "x", "equalTo": "a" }, - { "anyOf": [ - { "field": "x", "equalTo": "b" }, - { "field": "x", "equalTo": "c" } - ] } - ] } - ] -} diff --git a/docs/archive/developer/decisionTrees/hoisting.before.svg b/docs/archive/developer/decisionTrees/hoisting.before.svg deleted file mode 100644 index e03910d4d..000000000 --- a/docs/archive/developer/decisionTrees/hoisting.before.svg +++ /dev/null @@ -1,67 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -c2 - -x = a - - -d1--c2 - - - -c3 - - - -d1--c3 - - - -d4 - - - -c3--d4 - - - -c5 - -x = b - - -d4--c5 - - - -c6 - -x = c - - -d4--c6 - - - - diff --git a/docs/archive/developer/decisionTrees/partitioning.after1.svg b/docs/archive/developer/decisionTrees/partitioning.after1.svg deleted file mode 100644 index aef73a40c..000000000 --- a/docs/archive/developer/decisionTrees/partitioning.after1.svg +++ /dev/null @@ -1,42 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -c2 - -x = a - - -d1--c2 - - - -c3 - -x = b - - -d1--c3 - - - - diff --git a/docs/archive/developer/decisionTrees/partitioning.after2.svg 
b/docs/archive/developer/decisionTrees/partitioning.after2.svg deleted file mode 100644 index 3d7a5c486..000000000 --- a/docs/archive/developer/decisionTrees/partitioning.after2.svg +++ /dev/null @@ -1,42 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -c2 - -y = 1 - - -d1--c2 - - - -c3 - -y = 2 - - -d1--c3 - - - - diff --git a/docs/archive/developer/decisionTrees/partitioning.before.json b/docs/archive/developer/decisionTrees/partitioning.before.json deleted file mode 100644 index b33a7f829..000000000 --- a/docs/archive/developer/decisionTrees/partitioning.before.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "fields": [ - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { "anyOf": [ - { "field": "x", "equalTo": "a" }, - { "field": "x", "equalTo": "b" } - ] }, - { "anyOf": [ - { "field": "y", "equalTo": 1 }, - { "field": "y", "equalTo": 2 } - ] } - ] -} diff --git a/docs/archive/developer/decisionTrees/partitioning.before.svg b/docs/archive/developer/decisionTrees/partitioning.before.svg deleted file mode 100644 index a753924af..000000000 --- a/docs/archive/developer/decisionTrees/partitioning.before.svg +++ /dev/null @@ -1,68 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -d4 - - - -c0--d4 - - - -c2 - -x = a - - -d1--c2 - - - -c3 - -x = b - - -d1--c3 - - - -c5 - -y = 1 - - -d4--c5 - - - -c6 - -y = 2 - - -d4--c6 - - - - diff --git a/docs/archive/developer/decisionTrees/unification.after.json b/docs/archive/developer/decisionTrees/unification.after.json deleted file mode 100644 index 0bb0e8024..000000000 --- a/docs/archive/developer/decisionTrees/unification.after.json +++ /dev/null @@ -1,20 +0,0 @@ -{ - "fields": [ - { "name": "type" }, - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { - "if": { "field": "type", "equalTo": "a" }, - "then": { "allOf": [ - { "field": "x", "equalTo": "0.9" }, - { "field": "y", "equalTo": "12" } - ]}, - "else": { "allOf": [ - { "field": "x", "is": "null" }, - { "field": "y", "is": 
"null" } - ]} - } - ] -} diff --git a/docs/archive/developer/decisionTrees/unification.after.svg b/docs/archive/developer/decisionTrees/unification.after.svg deleted file mode 100644 index 9621f8fa2..000000000 --- a/docs/archive/developer/decisionTrees/unification.after.svg +++ /dev/null @@ -1,46 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -c2 - -type = a -x = 0.9 -y = 12 - - -d1--c2 - - - -c3 - -¬(type = a) -x is null -y is null - - -d1--c3 - - - - diff --git a/docs/archive/developer/decisionTrees/unification.before.json b/docs/archive/developer/decisionTrees/unification.before.json deleted file mode 100644 index 0fb388f1b..000000000 --- a/docs/archive/developer/decisionTrees/unification.before.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "fields": [ - { "name": "type" }, - { "name": "x" }, - { "name": "y" } - ], - "constraints": [ - { "if": { "field": "type", "equalTo": "a" }, - "then": { "field": "x", "equalTo": "0.9" }, - "else": { "field": "x", "is": "null" }}, - - { "if": { "field": "type", "equalTo": "a" }, - "then": { "field": "y", "equalTo": "12" }, - "else": { "field": "y", "is": "null" }} - ] -} diff --git a/docs/archive/developer/decisionTrees/unification.before.svg b/docs/archive/developer/decisionTrees/unification.before.svg deleted file mode 100644 index 3b4f97615..000000000 --- a/docs/archive/developer/decisionTrees/unification.before.svg +++ /dev/null @@ -1,72 +0,0 @@ - - - - - - -tree - -c0 - - - -d1 - - - -c0--d1 - - - -d4 - - - -c0--d4 - - - -c2 - -type = a -x = 0.9 - - -d1--c2 - - - -c3 - -¬(type = a) -x is null - - -d1--c3 - - - -c5 - -type = a -y = 12 - - -d4--c5 - - - -c6 - -¬(type = a) -y is null - - -d4--c6 - - - - diff --git a/docs/archive/user/BuildAndRun.md b/docs/archive/user/BuildAndRun.md deleted file mode 100644 index 0d66ff0f9..000000000 --- a/docs/archive/user/BuildAndRun.md +++ /dev/null @@ -1,112 +0,0 @@ -## Build and run the generator - -The instructions below explain how to download the generator source code, 
build it and run it, using a Java IDE. This is the recommended setup if you would like to contribute to the project yourself. If you would like to use Docker to build the source code and run the generator, [please follow these alternate instructions](https://github.com/finos/datahelix/blob/master/docs/developer/DockerSetup.md). - -### Get Code - -Clone the repository to your local development folder. - -``` -git clone https://github.com/finos/datahelix.git -``` - -### Installation Requirements - -* Java version 1.8 -* Gradle -* Cucumber -* Preferred: One of IntelliJ/Eclipse IDE - -### Java - -[Download JDK 8 SE](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). - -*(Please note, this has been tested with jdk1.8.0_172 but later versions of JDK 1.8 may still work)* - -In Control Panel: edit your environment variables; set `JAVA_HOME=C:\Program Files\Java\jdk1.8.0_172`. -Add Java binary utilities to your `PATH` (`C:\Program Files\Java\jdk1.8.0_172\bin`). - -### Gradle - -Download and install Gradle, following the [instructions on their project website](https://docs.gradle.org/current/userguide/installation.html). - -### IntelliJ IDE - -Get IntelliJ. [EAP](https://www.jetbrains.com/idea/nextversion/) gives you all features of Ultimate (improves framework support and polyglot). - -### Eclipse - -Alternatively, download and install [Eclipse](https://www.eclipse.org/downloads/). Please note we do not have detailed documentation for using the generator from Eclipse. - -### Cucumber - -Add **Gherkin** and **Cucumber for Java** plugins (file > settings > plugins if using IntelliJ IDE). - -Currently the tests cannot be run from the TestRunner class. - -To run a feature file you’ll have to modify the configuration by removing .steps from the end of the Glue field. - -An explanation of the particular syntax used can be found [here](https://github.com/finos/datahelix/blob/master/docs/developer/CucumberSyntax.md). 
-
-### Command Line
-
-Build the tool with all its dependencies:
-
-`gradle build`
-
-Check the setup worked with this example command:
-
-`java -jar orchestrator\build\libs\generator.jar --replace --profile-file=docs/user/gettingStarted/ExampleProfile1.json --output-path=out.csv`
-
-To generate valid data, run the following command from the command line:
-
-`java -jar <path to JAR file> [options] --profile-file="<profile path>" --output-path="<output path>"`
-
-* `<path to JAR file>` - the location of `generator.jar`.
-* `[options]` - optionally a combination of [options](https://github.com/finos/datahelix/blob/master/docs/user/commandLineOptions/GenerateOptions.md) to configure how the command operates.
-* `<profile path>` - the location of the JSON profile file.
-* `<output path>` - the location of the generated data.
-
-To generate violating data, run the following command from the command line:
-
-`java -jar <path to JAR file> violate [options] --profile-file="<profile path>" --output-path="<output directory>"`
-
-* `<path to JAR file>` - the location of `generator.jar`.
-* `[options]` - a combination of any (or none) of [the options documented here](https://github.com/finos/datahelix/blob/master/docs/user/commandLineOptions/ViolateOptions.md) to configure how the command operates.
-* `<profile path>` - the location of the JSON profile file.
-* `<output directory>` - the location of a folder in which to create generated data files.
-
-### IntelliJ
-
-On IntelliJ's splash screen, choose "Open".
-
-Open the repository root directory, `datahelix`.
-
-Right-click the backend module, `generator`, and choose "Open Module Settings".
-
-In "Project": specify a Project SDK (Java 1.8), clicking "New..." if necessary.
-Set the Project language level to 8.
-
-Open the "Gradle" Tool Window (this is an extension that may need to be installed), and double-click Tasks > build > build.
-Your IDE may do this automatically for you.
-
-Navigate to the [`App.java` file](https://github.com/finos/datahelix/blob/master/orchestrator/src/main/java/com/scottlogic/datahelix/generator/orchestrator/App.java). Right-click it and debug.
-
-Now edit the run configuration on the top toolbar created by the initial run. Name the run configuration 'Generate' and under 'Program Arguments' enter the following, replacing the paths with your desired files:
-
-```
---profile-file="<profile path>" --output-path="<output path>"
-```
-
-For example, run this command:
-```
-java -jar orchestrator\build\libs\generator.jar --replace --profile-file=docs/user/gettingStarted/ExampleProfile1.json --output-path=out.csv
-```
-
-Additionally, create another run configuration called GenerateViolating and add the program arguments:
-
-```
-violate --profile-file="<profile path>" --output-path="<output directory>"
-```
-
-Run both of these configurations to test that the installation is successful.
diff --git a/docs/archive/user/Contradictions.md b/docs/archive/user/Contradictions.md
deleted file mode 100644
index 8c31efa6a..000000000
--- a/docs/archive/user/Contradictions.md
+++ /dev/null
@@ -1,64 +0,0 @@
-# Contradictions
-
-Contradictions are where constraints are combined in a way that means no data, or an incorrectly reduced amount of data, can be produced.
-The following categories of contradiction can occur in a profile:
-
-| Type | Explanation | Example |
-| ---- | ---- | ---- |
-| 'Hard' | _cannot_ create any rows of data (for the field, and therefore the output file) | `foo ofType string` and `foo ofType decimal` and `foo not(is null)` - no rows would be emitted |
-| 'Partial' | _could_ create some data, but some scenarios would produce none | `foo equalTo 1` and `anyOf foo equalTo 1 or foo equalTo 2` (1 is produced), also `foo ofType string` and `foo ofType decimal` (`null` is produced) |
-
-Note that the phrases "there is a hard contradiction", "the profile is fully contradictory" and "the profile is wholly contradictory" are synonymous.
-
-The concept of contradictions is underpinned by [how data is generated](SetRestrictionAndGeneration.md).
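Hard versus partial contradiction can be illustrated by treating each constraint as a set of permitted values for a field and intersecting them (a simplified Python sketch; the generator actually operates on field restrictions, not enumerated sets):

```python
def combine(first, *rest):
    """Intersect the permitted values of several constraints on one
    field; None stands for the absence of a value (null)."""
    result = set(first)
    for permitted in rest:
        result &= set(permitted)
    return result

strings = {"a", "b", None}
decimals = {1.5, 2.5, None}
not_null = {"a", "b", 1.5, 2.5}  # everything except None

# 'Partial': string AND decimal still admits null, so some data is emitted.
print(combine(strings, decimals))            # {None}
# 'Hard': adding not(is null) empties the set; no rows can be produced.
print(combine(strings, decimals, not_null))  # set()
```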
- -## 'Hard' contradictions -This is where contradictions occur in such a way that no rows can be satisfied for at least one field. If this is the case then no rows can be emitted for any field in the profile, so the output file(s) would be empty. -Hard contradictions can otherwise be described as removing all possible values from the universal set while also denying the absence of values for the field (`not(is null)`). - -Note that these contradictions can occur even if most of the profile is correct. If no data can be generated for a single field, then it will prevent all the other fields from producing data. - -See [how data is generated](SetRestrictionAndGeneration.md) for more detail on how constraints are combined and in turn reduce the set of permissible values for a field. - -Examples are: -* `is null` and `not(is null)` -* `ofType string` and `ofType decimal` and `not(is null)` -* `ofType string` and `shorterThan 1` and `not(is null)` -* `equalTo 1` and `equalTo 2` - -The generator can detect some contradictions upfront and will report them, but for performance reasons cannot detect all types. -If no data is generated for a file, this means that the profile has a hard contradiction. - -Examples of profiles are: -* [Null Validation](../../examples/hard-contradiction-null-validation/profile.json) -* [Type Validation 1](../../examples/hard-contradiction-type-validation-1/profile.json) -* [Type Validation 2](../../examples/hard-contradiction-type-validation-2/profile.json) - -## 'Partial' contradictions -This is where part of a profile is fully contradictory, i.e. where an otherwise valid profile contains a fully contradictory section. Partially contradictory profiles include ones with multiple fully contradictory sections. These usually occur when one part of an `anyOf` or an `if-then-else` is contradictory.
- -Examples are: -* (`is null` and `not(is null)`) or `equalTo 1` -* (`ofType string` and `ofType decimal` and `not(is null)`) or `ofType string` -* (`ofType string` and `shorterThan 1` and `not(is null)`) or `is null` -* (`equalTo 1` and `equalTo 2`) or (`is null` and `not(is null)`) or `ofType string` - -Examples of profiles are: -* [Partial Contradiction in anyOf](../../examples/partial-contradictions/profile.json) - -## Non-contradictory examples -The following are examples of where constraints can be combined and (whilst potentially dubious) are not contradictory: -* `foo inSet ["a", "b", 1, 2]` and `foo greaterThan 1` - * this can emit `"a", "b", 2` or nothing (`null`) as there is nothing to say it must have a value, or must be of a particular type -* `foo greaterThan 1` and `foo ofType string` - * this can emit all strings, or emit no value (`null`) (the `greaterThan` constraint is ignored as it only applies to `decimal` or `integer` values, of which none will be generated) - -## What happens -1. Detected 'hard' contradictions give an error and produce no data. -1. Detected 'partial' contradictions give an error but produce some data. -1. The generator will try to generate data whether contradictions are present or not; it will emit no rows only if the combinations are fully contradictory. - -## Current stance -1. More thorough contradiction checking will be deferred to tooling for the generator, such as profile writers, user interfaces, etc. -1. More detailed contradiction checking [(#1090)](https://github.com/finos/datahelix/issues/1090) and better upfront warnings [(#896)](https://github.com/finos/datahelix/issues/896) are being considered as features.
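The set-based view underlying these categories can be sketched in a few lines — hypothetical Python, with tiny stand-in sets rather than the real universal set — to show how a 'hard' contradiction differs from a 'partial' one:

```python
# Each constraint restricts the set of values a field may take; combining
# constraints intersects those sets. The stand-in sets are illustrative only.
def permitted_values(constraint_sets, null_allowed):
    values = set.intersection(*constraint_sets)
    # Absence of a value (null) sits outside the value sets.
    return values | ({None} if null_allowed else set())

strings = {"a", "bb"}     # stand-in for `ofType string`
decimals = {1.5, 2.0}     # stand-in for `ofType decimal`

# Hard: empty intersection and null denied -> no rows at all.
print(permitted_values([strings, decimals], null_allowed=False))  # set()
# Partial scenario: same empty intersection, but null still permitted.
print(permitted_values([strings, decimals], null_allowed=True))   # {None}
```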
diff --git a/docs/archive/user/alphaFeatures/DeliberateViolation.md b/docs/archive/user/alphaFeatures/DeliberateViolation.md deleted file mode 100644 index 6dff00983..000000000 --- a/docs/archive/user/alphaFeatures/DeliberateViolation.md +++ /dev/null @@ -1,125 +0,0 @@ -# Deliberate violation - -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -## Invocation - -Violation is invoked using the `violate` command. An output directory must be specified rather than a single file. - -## Algorithm - -1. For each rule `R`: - 1. Create a new version of the profile where `R` is wrapped in a `ViolationConstraint`. A violation constraint works similarly to a `not` constraint, except that `and` constraints are treated differently (see below). - 1. Create a decision tree from that profile, and pass it to the generator as normal. - 1. Write the output to a file in the output directory with a numerical file name. -1. Output a valid version with no rules violated. -1. Output a manifest file, listing which output file corresponds to which rule. - -## Manifest - -An example of a manifest file when violating one rule. "001.csv" is where the first rule has been violated. -```json -[ - { - "filepath": "001.csv", - "violatedRules": [ "Price field should not accept nulls" ] - } -] -``` - -## ViolationConstraint - -The `violation` constraint is an internal constraint and cannot be used directly in a profile; it exists only to support violation. It works exactly like a `not` constraint, except where dealing with `and` constraints. - -* A `not` constraint converts `¬AND(X, Y, Z)` into `OR(¬X, ¬Y, ¬Z)` -* A `violate` constraint converts `VIOLATE(AND(X, Y, Z))` into: -``` -OR( - AND(VIOLATE(X), Y, Z), - AND(X, VIOLATE(Y), Z), - AND(X, Y, VIOLATE(Z))) -``` - -This is so that we end up with each inner constraint violated separately.
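The two rewrites can be contrasted with a toy constraint tree — a hypothetical Python sketch in which nested tuples stand in for the generator's real constraint classes:

```python
# Toy constraint trees: ('and', [children]) is a conjunction; anything else
# is treated as an atomic constraint.
def negate(node):
    if node[0] == 'and':  # ¬AND(X, Y, Z) -> OR(¬X, ¬Y, ¬Z)
        return ('or', [negate(child) for child in node[1]])
    return ('not', node)

def violate(node):
    if node[0] == 'and':  # VIOLATE(AND(...)) -> violate one conjunct at a time
        children = node[1]
        return ('or', [('and', [violate(c) if i == j else c
                                for j, c in enumerate(children)])
                       for i in range(len(children))])
    return ('violate', node)

rule = ('and', [('atom', 'X'), ('atom', 'Y')])
print(negate(rule))   # negates every conjunct at once
print(violate(rule))  # each branch violates exactly one conjunct
```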
- -## Generating invalid data - -One of the most powerful features of the generator is its ability to generate data that violates constraints. To create invalid data use the `violate` command. This time you need to specify an output directory rather than a file: - -``` -$ java -jar generator.jar violate --max-rows=100 --replace --profile-file=profile.json --output-path=out -``` - -When the above command has finished, you'll find that the generator has created an `out` directory which has four files: - -``` -$ ls out -1.csv 2.csv 3.csv manifest.json -``` - -The manifest file details which rules have been violated in each of the output files: - -``` -{ - "cases": [ - { - "filePath": "1", - "violatedRules": ["first name"] - }, - { - "filePath": "2", - "violatedRules": ["age"] - }, - { - "filePath": "3", - "violatedRules": ["national insurance"] - } - ] -} -``` - -If you take a look at the generated file `1.csv` you'll see that data for the `firstName` field does not obey the given constraints, whilst those for `age` and `nationalInsurance` are correct: - -``` -firstName,age,nationalInsurance --619248029,71,"HT849919" -08-12-2001 02:53:16,15, -"Lorem Ipsum",11, -"Lorem Ipsum",71,"WX004081" -1263483797,19,"HG054666" -,75,"GE023082" -"Lorem Ipsum",59,"ZM850737C" -[...] -``` - -However, it might be a surprise to see nulls, numbers and dates as values for the `firstName` field alongside strings that do not match the regex given in the profile. This is because these are all defined as a single rule within the profile. You have a couple of options if you want to ensure that `firstName` is null or a string. The first is to inform the generator that it should not violate specific constraint types: - -``` -$ java -jar generator.jar violate --dont-violate=ofType \ - --max-rows=100 --replace --profile-file=profile.json --output-path=out -``` - -Or, alternatively, you can re-arrange your constraints so that the ones that define types/null are grouped as a single rule.
By re-grouping constraints, the following output, with random strings that violate the regex constraint, is generated: - -``` -firstName,age,nationalInsurance -"�",43,"PT530853D" -"뷇",56,"GE797875M" -"邦爃",84,"JA172890M" -"J㠃懇圑㊡俫杈",32,"AE613401F" -"俵튡",38,"TS256211F" -"M",60,"GE987171M" -"M",7, -"Mꞎኅ剭Ꙥ哌톞곒",97,"EN082475C" -")",80,"BX025130C" -",⑁쪝",60,"RW177969" -"5ᢃ풞ﺯ䒿囻",57,"RY904705" -[...] -``` -**NOTE** we are considering adding a feature that allows you to [specify / restrict the character set](https://github.com/finos/datahelix/issues/294) in a future release. - -## Known issues -1. The process would be expected to return vast quantities of data, as the single constraint `foo inSet [a, b, c]` when violated returns all data except [a, b, c] from the universal set. Whilst logically correct, this could result in an unusable tool/data-set due to the time taken to create it, or its eventual size. -1. The process of violating constraints also violates the type for fields, e.g. `foo ofType string` will be negated to `not(foo ofType string)`. This itself could be useful for the user to test, but could also render the data unusable (e.g. if the consumer requires the 'schema' to be adhered to). -1. The process of violating constraints also violates the nullability for fields, e.g. `foo not(is null)` will be negated to `foo is null`. This itself could be useful for the user to test, but could render the data unusable (e.g. if the consumer requires non-null values for field `foo`). -1. Implied/default rules are not negated; as every field is implied/defaulted to allowing nulls, the method of violation currently doesn't prevent null from being emitted when violating. This means that nulls can appear in both normal data generation mode AND violating data generation mode.
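The first known issue can be illustrated with sets — a hypothetical Python sketch with a small stand-in universe:

```python
# Violating `foo inSet [1, 2, 3]` permits everything in the universal set
# except those three values; this toy universe stands in for the (vastly
# larger) real universal set.
universe = set(range(1_000_000))
in_set = {1, 2, 3}
violating = universe - in_set
print(len(violating))  # 999997 values, from a mere million-element universe
```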
diff --git a/docs/archive/user/alphaFeatures/GeneratingViolatingData.md b/docs/archive/user/alphaFeatures/GeneratingViolatingData.md deleted file mode 100644 index a31909c3a..000000000 --- a/docs/archive/user/alphaFeatures/GeneratingViolatingData.md +++ /dev/null @@ -1,71 +0,0 @@ -### Example - Generating Violating Data - -The generator can be used to generate data which intentionally violates the profile constraints for testing purposes. - -Using the `violate` command produces one file per rule violated along with a manifest that lists which rules are violated in each file. - -Using the [Sample Profile](#Example-Profile) that was created in a previous section, run the following command: - -`java -jar [path to JAR file] violate --profile-file="<profile path>" --output-path="<output directory>"` - -* `<output directory>` the location of the folder in which the generated files will be saved - -Additional options are [documented here](https://github.com/finos/datahelix/blob/master/docs/user/commandLineOptions/ViolateOptions.md). - -With no additional options this should yield the following data: - -* `1.csv`: - -|Column 1 | Column 2 | -|:---------------:|:--------------:| -|-2147483648 |-2147483648 | -|-2147483648 |0 | -|-2147483648 |2147483646 | -|-2147483648 | | -|0 |-2147483648 | -|2147483646 |-2147483648 | -|1900-01-01T00:00 |-2147483648 | -|2100-01-01T00:00 |-2147483648 | -| |-2147483648 | - -* `2.csv`: - -|Column 1 Name |Column 2 Name | -|:---------------:|:--------------:| -|"Lorem Ipsum" |"Lorem Ipsum" | -|"Lorem Ipsum" |1900-01-01T00:00| -|"Lorem Ipsum" |2100-01-01T00:00| -|"Lorem Ipsum" | | -| |"Lorem Ipsum" | - -* `manifest.json`: - -``` -{ - "cases" : [ { - "filePath" : "1", - "violatedRules" : [ "Column 1 is a string" ] - }, { - "filePath" : "2", - "violatedRules" : [ "Column 2 is a number" ] - } ] -} -``` - -The data generated violates each rule in turn and records the results in separate files.
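A consumer of these files might map each output file back to its violated rules by reading the manifest — a Python sketch (the helper name `rules_by_file` is illustrative, not part of the tool):

```python
import json

def rules_by_file(manifest_text):
    # The manifest layout is as shown above: a "cases" list of
    # {"filePath": ..., "violatedRules": [...]} objects.
    manifest = json.loads(manifest_text)
    return {case["filePath"]: case["violatedRules"] for case in manifest["cases"]}

manifest_text = """
{
  "cases": [
    { "filePath": "1", "violatedRules": [ "Column 1 is a string" ] },
    { "filePath": "2", "violatedRules": [ "Column 2 is a number" ] }
  ]
}
"""
print(rules_by_file(manifest_text)["1"])  # ['Column 1 is a string']
```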
- -For example, by violating the `"ofType": "String"` constraint in the first rule, the violating data produced is of types *decimal* and *datetime*. -The manifest shows which rules are violated in which file. - -### Hints and Tips - -* The generator will output velocity and row data to the console as standard -(see [options](https://github.com/finos/datahelix/blob/master/docs/user/commandLineOptions/GenerateOptions.md) for other monitoring choices). - * If multiple monitoring options are selected the most detailed monitor will be used. -* Ensure any desired output files are not being used by any other programs or the generator will not be able to run. - * If a file already exists it will be overwritten. -* Violated data generation will produce one output file per rule being violated. - * This is why the output location is a directory and not a file. - * If there are already files in the output directory with the same names they will be overwritten. -* It is important to give your rules descriptions so that the manifest can list the violated rules clearly. -* Rules made up of multiple constraints will be violated as one rule and therefore will produce one output file per rule. -* Unless explicitly excluded, `null` will always be generated for each field. diff --git a/docs/archive/user/alphaFeatures/Interesting.md b/docs/archive/user/alphaFeatures/Interesting.md deleted file mode 100644 index c081abd01..000000000 --- a/docs/archive/user/alphaFeatures/Interesting.md +++ /dev/null @@ -1,62 +0,0 @@ -# Interesting Generation Mode - -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -The _interesting generation mode_ exists to provide a means of generating smaller sets of data.
To illustrate, consider the following profile: - -``` -{ - "fields": [ - { "name": "field1" } - ], - "constraints": [ - { - "field": "field1", - "is": "ofType", - "value": "string" - }, - { - "field": "field1", - "is": "shorterThan", - "value": 5 - }, - { - "not": { - "field": "field1", - "is": "null" - } - } - ] -} -``` - -The above describes a data set where there is one field which can emit any string so long as it is shorter than 5 characters. The generator can emit the following (see [Generation strategies](https://github.com/ScottLogic/datahelix/blob/master/docs/Options/GenerateOptions.md)): - -Unicode has 55,411 code-points (valid characters) in [the basic multilingual plane](https://en.wikipedia.org/wiki/Plane_(Unicode)), from which the generator will emit characters. In the table below this number is represented as _#[U](https://en.wikipedia.org/wiki/Universal_set)_. - -| mode | what would be emitted | potential number of rows | -| ---- | ---- | ---- | -| full sequential | any string of length 0, 1, 2 or 3 composed of any unicode characters | _#U_⁰ + _#U_¹ + _#U_² + _#U_³ = 170,135,836,825,864 | -| random | an infinite production of random values from full sequential | unlimited | -| interesting | interesting strings that abide by the constraints | 3-4 | - -Given this simple profile, using full-sequential generation you would expect to see _**170 trillion**_ rows, if the generator was unlimited in the number of rows it can emit. One of the goals of the DataHelix project is to generate data for testing systems; this amount of test data is complete, but its sheer size makes it difficult to use. - -It would be more useful for the generator to emit a smaller set of data that presents normal and abnormal attributes.
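The 170-trillion figure in the table follows from summing powers of the code-point count _#U_ — quick arithmetic, as a Python sketch, over the lengths 0–3 enumerated above:

```python
# #U = number of basic-multilingual-plane code points the generator draws from.
U = 55_411
total = sum(U ** length for length in range(4))  # strings of length 0, 1, 2, 3
print(f"{total:,}")  # 170,135,836,825,864
```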
If a user wanted to test that another product can accept these strings, you might expect the following scenarios: - -* an empty string -* a string of 4 characters -* a string of 1 character -* a string of characters including non-ASCII characters - but nevertheless unicode characters - e.g. a :slightly_smiling_face: -* a string containing at least one [null character](https://en.wikipedia.org/wiki/Null_character) - -The above generation strategy is called _interesting_ generation in the generator. The above list is indicative and designed to give a flavour of what the strategy is trying to achieve. - -The user may also want to generate data that [deliberately violates](DeliberateViolation.md) the given rules for the field, to test that the other product exhibits the expected behaviour when it is provided invalid data. If this was the case you might expect to test the system with the following data: - -* no value (otherwise represented as `null`) -* a string of 5 characters -* a numeric value -* a temporal value -* a boolean value - -The values that the generator will emit for various scenarios are [documented here](../generationTypes/GenerationTypes.md#interesting). Some of the scenarios above are not met; see the linked document for their details. \ No newline at end of file diff --git a/docs/archive/user/alphaFeatures/SelectiveViolation.md b/docs/archive/user/alphaFeatures/SelectiveViolation.md deleted file mode 100644 index f5b425408..000000000 --- a/docs/archive/user/alphaFeatures/SelectiveViolation.md +++ /dev/null @@ -1,19 +0,0 @@ -# Selective violation -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -Selective violation allows a user to choose an operator/type of constraint to not violate. -All of the constraints of that type will not be violated anywhere in the profile. - -e.g.
using the command line argument `--dont-violate=lessThan` -will mean that every `lessThan` constraint will not be violated. - -Selective violation does nothing if the generator is not run in violation mode. - -## Limitations -- `equalTo` and `inSet` are considered the same. So choosing not to violate either will make the system not violate both. -- Can't choose to not violate grammatical constraints -- If all constraints in a profile are selected to be not violated, the system will generate valid data - -## Potential future work -- Let user choose single constraints or rules to violate. -- Let user choose single constraints or rules to not violate. diff --git a/docs/archive/user/alphaFeatures/VisualisingTheDecisionTree.md b/docs/archive/user/alphaFeatures/VisualisingTheDecisionTree.md deleted file mode 100644 index 41b960269..000000000 --- a/docs/archive/user/alphaFeatures/VisualisingTheDecisionTree.md +++ /dev/null @@ -1,53 +0,0 @@ -## Visualising the Decision Tree -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -This page will detail how to use the `visualise` command to view the decision tree for a profile. - -Visualise generates a DOT-compliant representation of the decision tree, -for manual inspection, in the form of a gv file.
- -### Using the Command Line - - -To visualise the decision tree run the following command from the command line: - -`java -jar [path to JAR file] visualise [options] --profile-file="<profile path>" --output-path="<output path>"` - -* `[path to JAR file]` the location of generator.jar -* `[options]` optionally a combination of [options](https://github.com/finos/datahelix/blob/master/docs/user/commandLineOptions/VisualiseOptions.md) to configure how the command operates -* `<profile path>` the location of the JSON profile file -* `<output path>` the location of the folder for the resultant GV file of the tree - -### Example - -Using the [Sample Profile](#Example-Profile) that was created in an earlier section, run the visualise command -using your preferred method from above. - -With no options this should yield the following gv file: - -``` -graph tree { - bgcolor="transparent" - label="ExampleProfile1" - labelloc="t" - fontsize="20" - c0[bgcolor="white"][fontsize="12"][label="Column 1 Header is STRING -Column 2 Header is STRING"][shape=box] -c1[fontcolor="red"][label="Counts: -Decisions: 0 -Atomic constraints: 2 -Constraints: 1 -Expected RowSpecs: 1"][fontsize="10"][shape=box][style="dotted"] -} -``` - -This is a very simple tree; more complex profiles will generate more complex trees. - -### Hints and Tips - -* You may read a gv file with any text editor -* You can also use this representation with a visualiser such as [Graphviz](https://www.graphviz.org/). - - There may be other visualisers that are suitable to use. The currently known requirements for a visualiser are: - - gv files are encoded with UTF-8, visualisers must support this encoding. - - gv files can include HTML encoded entities, visualisers should support this feature.
diff --git a/docs/archive/user/commandLineOptions/GenerateOptions.md b/docs/archive/user/commandLineOptions/GenerateOptions.md deleted file mode 100644 index ceea83e9b..000000000 --- a/docs/archive/user/commandLineOptions/GenerateOptions.md +++ /dev/null @@ -1,27 +0,0 @@ -# Generate Options -Option switches are case-sensitive, arguments are case-insensitive - -* `--profile-file=<path>` (or `-p <path>`) - * Path to the input profile file. -* `--output-path=<path>` (or `-o <path>`) - * Path to the output file. If not specified, output will be to standard output. -* `--replace` - * Overwrite/replace existing output files. -* `-n <rows>` or `--max-rows <rows>` - * Emit at most `<rows>` rows to the output file; if not specified, the limit is 10,000,000 rows. - * Mandatory in `RANDOM` mode. -* `--disable-schema-validation` - * Generate without first checking profile validity against the schema. This can be used if you believe the schema is incorrectly rejecting your profile. -* `-o <format>` - * Output the data in the given format, either CSV (default) or JSON. - * Note that the JSON format requires all data to be held in memory until generation is complete, at which point it is flushed to disk; this could have an impact on memory and/or IO requirements -* `--allow-untyped-fields` - * Turns off type checking on fields in the profile. - -By default the generator will report how much data has been generated over time; the other options are below: -* `--verbose` - * Will report in-depth detail of data generation -* `--quiet` - * Will disable velocity reporting - -`--quiet` will be ignored if `--verbose` is supplied.
diff --git a/docs/archive/user/commandLineOptions/ViolateOptions.md b/docs/archive/user/commandLineOptions/ViolateOptions.md deleted file mode 100644 index 03811115c..000000000 --- a/docs/archive/user/commandLineOptions/ViolateOptions.md +++ /dev/null @@ -1,29 +0,0 @@ -# Violate Options -Option switches are case-sensitive, arguments are case-insensitive - -* `--profile-file=<path>` (or `-p <path>`) - * Path to input profile file. -* `--output-path=<path>` (or `-o <path>`) - * Path to output directory. -* `--replace` - * Overwrite/replace existing output files. -* `--dont-violate` - * Choose specific [predicate constraints](../UserGuide.md#Predicate-constraints) to [not violate](../alphaFeatures/SelectiveViolation.md), e.g. "--dont-violate=ofType lessThan" will not violate ANY data type constraints and will also not violate ANY less than constraints. -* `-n <rows>` or `--max-rows <rows>` - * Emit at most `<rows>` rows to the output file; if not specified, the limit is 10,000,000 rows. - * Mandatory in `RANDOM` mode. -* `--disable-schema-validation` - * Generate without first checking profile validity against the schema. This can be used if you believe the schema is incorrectly rejecting your profile. -* `-o <format>` - * Output the data in the given format, either CSV (default) or JSON. - * Note that the JSON format requires all data to be held in memory until generation is complete, at which point it is flushed to disk; this could have an impact on memory and/or IO requirements -* `--allow-untyped-fields` - * Turns off type checking on fields in the profile. - -By default the generator will report how much data has been generated over time; the other options are below: -* `--verbose` - * Will report in-depth detail of data generation -* `--quiet` - * Will disable velocity reporting - -`--quiet` will be ignored if `--verbose` is supplied.
\ No newline at end of file diff --git a/docs/archive/user/commandLineOptions/VisualiseOptions.md b/docs/archive/user/commandLineOptions/VisualiseOptions.md deleted file mode 100644 index 92271e287..000000000 --- a/docs/archive/user/commandLineOptions/VisualiseOptions.md +++ /dev/null @@ -1,15 +0,0 @@ -# Visualise Options -Option switches are case-sensitive, arguments are case-insensitive - -* `--profile-file=<path>` (or `-p <path>`) - * Path to input profile file. -* `--output-path=<path>` (or `-o <path>`) - * Path to visualisation output file. -* `-t <title>` or `--title <title>` - * Include the given `<title>` in the visualisation. If not supplied, the description in the profile will be used, or failing that the filename of the profile. -* `--no-title` - * Exclude the title from the visualisation. This setting overrides `-t`/`--title`. -* `--replace` - * Overwrite/replace existing output files. -* `--allow-untyped-fields` - * Turns off type checking on fields in the profile. diff --git a/docs/archive/user/generationTypes/GenerationTypes.md b/docs/archive/user/generationTypes/GenerationTypes.md deleted file mode 100644 index cdac4636c..000000000 --- a/docs/archive/user/generationTypes/GenerationTypes.md +++ /dev/null @@ -1,68 +0,0 @@ -# Generation types - -The generator supports the following data generation types: - -* Random (_default_) -* Full Sequential -* Interesting (_alpha_) - -## Random -Generate some random data that abides by the given set of constraints. This mode has the potential to repeat data points, as it does not keep track of values that have already been emitted.
- -Examples: - -| Constraint | Emitted valid data | Emitted violating data | -| ---- | ---- | ---- | -| `Field 1 > 10 AND Field 1 < 20` | _(any values > 10 & < 20)_ | _(any values <= 10 or >= 20)_ | -| `Field 1 in set [A, B, C]` | _(A, B or C in any order, repeated as needed)_ | _(any values except A, B or C)_ | - -Notes: -- Random generation is infinite, but output is limited to 1000 rows by default; use `--max-rows` to enable generation of more data. -- For more information about the behaviour of this example, see the [behaviour in detail](../../developer/behaviour/BehaviourInDetail.md). - -## Full Sequential -Generate all data that can be generated in order from lowest to highest. - -Examples: - -| Constraint | Emitted valid data | Emitted violating data | -| ---- | ---- | ---- | -| `Field 1 > 0 AND Field 1 < 5` | _(null, 1, 2, 3, 4)_ | _(any values <= 0 or >= 5)_ | -| `Field 1 in set [A, B, C]` | _(null, A, B, C)_ | _(any values except A, B or C)_ | - -Notes: -- For more information about the behaviour of this example, see the [behaviour in detail](../../developer/behaviour/BehaviourInDetail.md). -- There are a few [combination strategies](../CombinationStrategies.md) for full sequential mode with [minimal](../CombinationStrategies.md#Minimal) being the default. - -## Interesting -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -See [this document](../alphaFeatures/Interesting.md) for more details on the _interesting generation mode_. - -The values that are generated by the generator are shown in the matrix below. `granularTo` does not have any bearing on whether the field represents integer or decimal values. If no `ofType` constraint exists then interesting values from _all_ data types can be emitted.
- -| Data type | where there are no constraints for the field <br /> _(or only `ofType` and/or `not(is null)`)_ | where constraints exist for the field <br /> _(beyond `ofType` and `not(is null)`)_ | -| ---- | ---- | ---- | -| **string** | "Lorem Ipsum" | a string as short as possible given the intersection of other constraints <br /> a string as long as possible given the intersection of other constraints <br /> If a `containsRegex` or `matchingRegex` is present, use this to inform the characters that are included in the shortest/longest strings. Otherwise repeat a `<space>` character for the contents of both strings <br /> | -| **numeric (integer values)** <br /> where all present numeric constraints (if any) contain whole numbers and no decimal points | -100000000000000000000 <br /> 0 <br /> 100000000000000000000 | _A unique set of:_ <br /> the lowest possible value or -100000000000000000000 <br /> 0 (if permitted) <br /> the highest possible value or 100000000000000000000 | -| **numeric (decimal values)** <br /> where some numeric constraints contain a decimal point (even if the fraction is 0, e.g. 10.0) | -100000000000000000000 <br /> 0 <br /> 100000000000000000000 | _A unique set of:_ <br /> the 2 lowest possible values or -100000000000000000000, -99999999999999999999 <br /> 0 (if permitted) <br /> the 2 highest possible values or 99999999999999999999, 100000000000000000000 | -| **datetime** | 0000-01-01T00:00:00.000 <br /> 9999-12-31T23:59:59.999 | _A unique set of:_ <br /> the earliest possible date or 0000-01-01T00:00:00.000 <br /> the latest possible date or 9999-12-31T23:59:59.999 | -| **string** (`aValid ISIN`) | "GB0000000009" <br /> "GB00JJJJJJ45" <br /> "US0000000002" | NA - this constraint has no other constraints that can be used in conjunction with it - see [#488](https://github.com/finos/datahelix/issues/488) | - -`null` is considered an interesting value; it will be emitted where permitted (i.e.
where there is no `not(is null)` constraint applied to the field) - -### Conditional constraints -Any value/s mentioned within an `if` or `anyOf` will be considered interesting whether the field is constrained or not. Consider the following profile: -``` -field1 greaterThanOrEqual 0 -field1 lessThan 10 -field1 ofType integer - -if field1 equalTo 5 then - field2 equalTo "a" -``` - -Given the table above, the interesting values for `field1` would be `0` & `9`. To ensure the consequence of the `if` constraint (`field2 equalTo "a"`) is met at least once, `5` is included as an interesting value. As such, the above profile would emit `0`, `5` & `9` for `field1`. - -### Sets -All values of the `inSet` constraint (and the single value of an `equalTo` constraint) are considered interesting. There is no discrimination over the values emitted for these sets. diff --git a/docs/archive/user/gettingStarted/BasicUsage.md b/docs/archive/user/gettingStarted/BasicUsage.md deleted file mode 100644 index b1a962402..000000000 --- a/docs/archive/user/gettingStarted/BasicUsage.md +++ /dev/null @@ -1,68 +0,0 @@ -# Basic Usage - -Once [Java v1.8](https://www.java.com/en/download/manual.jsp) is installed you can run the generator with the following command: - -`java -jar <path to JAR file> [options] <arguments>` - -* `[options]` optionally a combination of options to configure how the command operates -* `<arguments>` required inputs for the command to operate - -**Note:** Do not include a trailing \ in directory paths - -## Examples -* `java -jar generator.jar profile.json profile.csv` -* `java -jar generator.jar violate profile.json violated-data-files/` - -Example profiles can be found in the [examples folder](../../../examples). - -## Commands -### Generate -#### `[options] <profile path> <output path>` - -Generates data to a specified endpoint. - -* `<profile path>`, a path to the profile JSON file -* `<output path>`, a file path to which the data should be emitted.
This will be a UTF-8 encoded CSV file or directory, option dependent. - -The full list of generate options can be viewed [here](../commandLineOptions/GenerateOptions.md). - -### Violate -#### `violate [options] <profile path> <output directory>` - -Generates violating data to a specified folder/directory. - -* `<profile path>`, a path to the profile JSON file. -* `<output directory>`, a path to a directory into which the data should be emitted. This will consist of a set of output files, and a `manifest.json` file describing which constraints are violated by which output file. - -The full list of violate options can be viewed [here](../commandLineOptions/ViolateOptions.md). - -### Visualise -#### `visualise [options] <profile path> <output path>` - -Generates a <a href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">DOT</a> compliant representation of the decision tree, -for manual inspection, in the form of a gv file. -* `<profile path>`, a path to the profile JSON file -* `<output path>`, a file path to which the tree DOT visualisation should be emitted. This will be a UTF-8 encoded DOT file. - -The full list of visualise options can be viewed [here](../commandLineOptions/VisualiseOptions.md). - -There may be other visualisers that are suitable to use. The currently known requirements for a visualiser are: -- gv files are encoded with UTF-8, visualisers must support this encoding. -- gv files can include HTML encoded entities, visualisers should support this feature. - - -#### Options -Options are optional and case-insensitive - -* `--partition` - * Enables tree partitioning during transformation. -* `--optimise` - * Enables tree optimisation during transformation. See [Decision tree optimiser](../../developer/algorithmsAndDataStructures/OptimisationProcess.md) for more details.
- -## Future invocation methods - -* Calling into a Java library -* Contacting an HTTP web service - -# -[< Previous](Visualise.md) | [Contents](StepByStepInstructions.md) diff --git a/docs/archive/user/gettingStarted/BuildAndRun.md b/docs/archive/user/gettingStarted/BuildAndRun.md deleted file mode 100644 index e32ae17c4..000000000 --- a/docs/archive/user/gettingStarted/BuildAndRun.md +++ /dev/null @@ -1,114 +0,0 @@ -# Build and run the generator - -The instructions below explain how to download the generator source code, build it and run it, using a Java IDE. This is the recommended setup if you would like to contribute to the project yourself. If you would like to use Docker to build the source code and run the generator, [please follow these alternate instructions](../../developer/DockerSetup.md). - -## Get Code - -Clone the repository to your local development folder. - -``` -git clone https://github.com/finos/datahelix.git -``` - -## Installation Requirements - -* Java version 1.8 -* Gradle -* Cucumber -* Preferred: IntelliJ or Eclipse IDE - -### Java - -[Download JDK 8 SE](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). - -*(Please note, this has been tested with jdk1.8.0_172, but later versions of JDK 1.8 may still work)* - -In Control Panel: edit your environment variables; set `JAVA_HOME=C:\Program Files\Java\jdk1.8.0_172`. -Add Java binary utilities to your `PATH` (`C:\Program Files\Java\jdk1.8.0_172\bin`). - -### Gradle - -Download and install Gradle, following the [instructions on their project website](https://docs.gradle.org/current/userguide/installation.html). - -### IntelliJ IDE - -Install IntelliJ. The [EAP](https://www.jetbrains.com/idea/nextversion/) build gives you all the features of Ultimate (improved framework support and polyglot development). - -### Eclipse - -Alternatively, download and install [Eclipse](https://www.eclipse.org/downloads/). Please note we do not have detailed documentation for using the generator from Eclipse.
- -### Cucumber - -Add the **Gherkin** and **Cucumber for Java** plugins (File > Settings > Plugins if using the IntelliJ IDE). - -Currently the tests cannot be run from the TestRunner class. - -To run a feature file you’ll have to modify the run configuration by removing `.steps` from the end of the Glue field. - -An explanation of the particular syntax used can be found [here](https://github.com/finos/datahelix/blob/master/docs/CucumberSyntax.md). - -## First time setup -### Command Line - -Build the tool with all its dependencies: - -`gradle build` - -Check the setup worked with this example command: - -`java -jar orchestrator\build\libs\generator.jar --replace --profile-file=docs/user/gettingStarted/ExampleProfile1.json --output-path=out.csv` - -To generate valid data, run the following command from the command line: - -`java -jar <path to JAR file> [options] --profile-file="<path to profile>" --output-path="<desired output path>"` - -* `<path to JAR file>` - the location of `generator.jar`. -* `[options]` - optionally a combination of [options](../commandLineOptions/GenerateOptions.md) to configure how the command operates. -* `<path to profile>` - the location of the JSON profile file. -* `<desired output path>` - the location of the generated data. - -To generate violating data, run the following command from the command line: - -`java -jar <path to JAR file> violate [options] --profile-file="<path to profile>" --output-path="<desired output folder>"` - -* `<path to JAR file>` - the location of `generator.jar`. -* `[options]` - a combination of any (or none) of [the options documented here](../commandLineOptions/ViolateOptions.md) to configure how the command operates. -* `<path to profile>` - the location of the JSON profile file. -* `<desired output folder>` - the location of a folder in which to create generated data files. - - -### IntelliJ - -On IntelliJ's splash screen, choose "Open". - -Open the repository root directory, `datahelix`.
- -Right-click the backend module, `generator`, and choose "Open Module Settings". - -In "Project": specify a Project SDK (Java 1.8), clicking "New..." if necessary. -Set Project language level to 8. - -Open the "Gradle" Tool Window (this is an extension that may need to be installed), and double-click Tasks > build > build. -Your IDE may do this automatically for you. - -Navigate to the [`App.java` file](../../../orchestrator/src/main/java/com/scottlogic/datahelix/generator/orchestrator/App.java). Right-click it and choose "Debug". - -Now edit the run configuration created by the initial run (via the top toolbar). Name the run configuration 'Generate' and under 'Program Arguments' enter the following, replacing the paths with your desired files: - -``` ---profile-file="<path to an example JSON profile>" --output-path="<desired output file path>" -``` - -For example, run this command: -``` -java -jar orchestrator\build\libs\generator.jar --replace --profile-file=docs/user/gettingStarted/ExampleProfile1.json --output-path=out.csv -``` - -Additionally, create another run configuration called GenerateViolating and add the program arguments: - -``` -violate --profile-file="<path to an example JSON profile>" --output-path="<desired output directory path>" -``` - -Run both of these configurations to test that installation is successful. diff --git a/docs/archive/user/gettingStarted/CreatingAProfile.md b/docs/archive/user/gettingStarted/CreatingAProfile.md deleted file mode 100644 index 429985de7..000000000 --- a/docs/archive/user/gettingStarted/CreatingAProfile.md +++ /dev/null @@ -1,56 +0,0 @@ -# Creating a Profile - -This page will walk you through creating basic profiles with which you can generate data. - -[Profiles](../UserGuide.md#profiles) are JSON documents consisting of two sections: the list of fields and the constraints. - -- **List of Fields** - An array of column headings, each defined with a unique "name" key.
-``` - "fields": [ - { - "name": "Column 1" - }, - { - "name": "Column 2" - } - ] -``` -- **Constraints** - Constraints reduce the data in each column from the [universal set](../SetRestrictionAndGeneration.md) -to the desired range of values. They are formatted as JSON objects. There are three types of constraints: - - - [Predicate Constraints](../UserGuide.md#Predicate-constraints) - predicates that define any given value as being - _valid_ or _invalid_ - - [Grammatical Constraints](../UserGuide.md#Grammatical-constraints) - used to combine or modify other constraints - - [Presentational Constraints](../UserGuide.md#Presentational-constraints) - used by output serialisers where - string output is required - -Here is a list of two constraints: - -``` - "constraints": [ - { - "field": "Column 1", - "is": "ofType", - "value": "string" - }, - { - "field": "Column 2", - "is": "ofType", - "value": "integer" - } - ] - -``` - - -These two sections are combined to form the [complete profile](ExampleProfile1.json).
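Because a profile is just these two JSON sections, it can also be assembled programmatically. A minimal sketch producing the same structure as the example above — the `make_profile` helper is hypothetical, not part of the datahelix tooling:

```python
import json

def make_profile(field_types):
    """Build a minimal profile from {field name: type} pairs."""
    return {
        "fields": [{"name": name} for name in field_types],
        "constraints": [
            {"field": name, "is": "ofType", "value": type_name}
            for name, type_name in field_types.items()
        ],
    }

# Reproduce the example profile from this page
profile = make_profile({"Column 1": "string", "Column 2": "integer"})
profile_json = json.dumps(profile, indent=2)
```

Writing `profile_json` to a file yields a document the generator can consume in the same way as a hand-written profile.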
- -## Further Information -* More detail on key decisions to make while constructing a profile can be found [here](../../developer/KeyDecisions.md) -* FAQs about constraints can be found [here](../FrequentlyAskedQuestions.md) -* For a larger profile example see [here](../Schema.md) -* Sometimes constraints can contradict one another; click [here](../Contradictions.md) to find out what happens in these cases - -# - -[Contents](StepByStepInstructions.md) | [Next Section >](GeneratingData.md) diff --git a/docs/archive/user/gettingStarted/ExampleProfile1.json b/docs/archive/user/gettingStarted/ExampleProfile1.json deleted file mode 100644 index 59fdb629f..000000000 --- a/docs/archive/user/gettingStarted/ExampleProfile1.json +++ /dev/null @@ -1,22 +0,0 @@ -{ - "fields": [ - { - "name": "Column 1" - }, - { - "name": "Column 2" - } - ], - "constraints": [ - { - "field": "Column 1", - "is": "ofType", - "value": "string" - }, - { - "field": "Column 2", - "is": "ofType", - "value": "integer" - } - ] -} diff --git a/docs/archive/user/gettingStarted/GeneratingData.md b/docs/archive/user/gettingStarted/GeneratingData.md deleted file mode 100644 index 89c3dea8b..000000000 --- a/docs/archive/user/gettingStarted/GeneratingData.md +++ /dev/null @@ -1,111 +0,0 @@ -# Generating Data - -This page details how to generate data with a given profile. - - -## Using the Command Line - -For first time setup, see the [Generator setup instructions](BuildAndRun.md). - -To generate data, run the following command from the command line: - -`java -jar <path to JAR file> [options] --profile-file="<path to profile>" --output-path="<desired output path>"` - -* `<path to JAR file>` the location of `generator.jar` -* `[options]` optionally a combination of [options](../commandLineOptions/GenerateOptions.md) to configure how the command operates -* `<path to profile>` the location of the JSON profile file -* `<desired output path>` the location of the generated data.
If this option is omitted, generated data will be streamed to the standard output. - -## Example - Generating Valid Data - -Using the [Sample Profile](ExampleProfile1.json) that was created in the [previous](CreatingAProfile.md) section, run the following command: - - `java -jar <path to JAR file> --profile-file="<path to ExampleProfile1.json>" --output-path="<path to desired output file>"` - -* `<path to desired output file>` the file path to the desired output file - -With no other options this should yield the following data: - -|Column 1 |Column 2 | -|:-------------:|:-----------:| -|"Lorem Ipsum" |-2147483648 | -|"Lorem Ipsum" |0 | -|"Lorem Ipsum" |2147483646 | -|"Lorem Ipsum" | | -| |-2147483648 | - - -## Example - Generating Violating Data - -The generator can be used to generate data which intentionally violates the profile constraints for testing purposes. - -Using the `violate` command produces one file per rule violated along with a manifest that lists which rules are violated in each file. - -Using the [Sample Profile](ExampleProfile1.json) that was created in the [first](CreatingAProfile.md) section, run the following command: - -`java -jar <path to JAR file> violate --profile-file="<path to ExampleProfile1.json>" --output-path="<path to desired output directory>"` - -* `<path to desired output directory>` the location of the folder in which the generated files will be saved - -Additional options are [documented here](../commandLineOptions/ViolateOptions.md). 
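Alongside the violating output files, the command writes a `manifest.json` mapping each file to the rules it violates. A small sketch of reading that mapping downstream — the `violated_rules_by_file` helper is hypothetical, and the manifest structure follows the example shown in this section:

```python
import json

def violated_rules_by_file(manifest_text):
    """Map each output file path to the list of rules it violates."""
    manifest = json.loads(manifest_text)
    return {case["filePath"]: case["violatedRules"] for case in manifest["cases"]}

# Manifest structure as produced for the example profile in this guide
example = """{
  "cases" : [ {
    "filePath" : "1",
    "violatedRules" : [ "Column 1 is a string" ]
  }, {
    "filePath" : "2",
    "violatedRules" : [ "Column 2 is a number" ]
  } ]
}"""
mapping = violated_rules_by_file(example)
```

A mapping like this is handy in test harnesses that need to pick the output file violating a particular rule.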
- -With no additional options this should yield the following data: - -* `1.csv`: - -|Column 1 | Column 2 | -|:---------------:|:--------------:| -|-2147483648 |-2147483648 | -|-2147483648 |0 | -|-2147483648 |2147483646 | -|-2147483648 | | -|0 |-2147483648 | -|2147483646 |-2147483648 | -|1900-01-01T00:00 |-2147483648 | -|2100-01-01T00:00 |-2147483648 | -| |-2147483648 | - -* `2.csv`: - -|Column 1 |Column 2 | -|:---------------:|:--------------:| -|"Lorem Ipsum" |"Lorem Ipsum" | -|"Lorem Ipsum" |1900-01-01T00:00| -|"Lorem Ipsum" |2100-01-01T00:00| -|"Lorem Ipsum" | | -| |"Lorem Ipsum" | - -* `manifest.json`: - -``` -{ - "cases" : [ { - "filePath" : "1", - "violatedRules" : [ "Column 1 is a string" ] - }, { - "filePath" : "2", - "violatedRules" : [ "Column 2 is a number" ] - } ] -} -``` - -The data generated violates each rule in turn and records the results in separate files. -For example, by violating the `"ofType": "string"` constraint in the first rule, the violating data produced is of types *decimal* and *datetime*. -The manifest shows which rules are violated in which file. - -## Hints and Tips - -* The generator will output velocity and row data to the console as standard -(see [options](../commandLineOptions/GenerateOptions.md) for other monitoring choices). - * If multiple monitoring options are selected, the most detailed monitor will be used. -* Ensure any desired output files are not being used by any other program, or the generator will not be able to run. - * If a file already exists it will be overwritten. -* Violated data generation will produce one output file per rule being violated. - * This is why the output location is a directory and not a file. - * If there are already files in the output directory with the same names, they will be overwritten. -* It is important to give your rules descriptions so that the manifest can list the violated rules clearly.
- -* Rules made up of multiple constraints will be violated as one rule and therefore will produce one output file per rule. -* Unless explicitly excluded, `null` will always be generated for each field. - -# -[< Previous](CreatingAProfile.md) | [Contents](StepByStepInstructions.md) | [Next Section >](Visualise.md) diff --git a/docs/archive/user/gettingStarted/StepByStepInstructions.md b/docs/archive/user/gettingStarted/StepByStepInstructions.md deleted file mode 100644 index a7269d3cc..000000000 --- a/docs/archive/user/gettingStarted/StepByStepInstructions.md +++ /dev/null @@ -1,15 +0,0 @@ -# Step By Step Instructions - -You must have Java v1.8 installed (it can be [downloaded here](https://www.java.com/en/download/manual.jsp)) to be able -to run the generator. - -Download the Jar file (`generator.jar`) from the [GitHub project releases page](https://github.com/ScottLogic/data-engineering-generator/releases/). -The generator will then be run from the command line using the commands detailed in this guide. - -## Contents - -1. [Build and Run](BuildAndRun.md) -1. [Creating a Profile](CreatingAProfile.md) -1. [Using a Profile to Generate Data](GeneratingData.md) -1. [Visualising the Decision Tree](Visualise.md) -1. [Basic Usage](BasicUsage.md) \ No newline at end of file diff --git a/docs/archive/user/gettingStarted/Visualise.md b/docs/archive/user/gettingStarted/Visualise.md deleted file mode 100644 index 443b10f7f..000000000 --- a/docs/archive/user/gettingStarted/Visualise.md +++ /dev/null @@ -1,57 +0,0 @@ -# Visualising the Decision Tree -_This is an alpha feature. Please do not rely on it. If you find issues with it, please [report them](https://github.com/finos/datahelix/issues)._ - -This page will detail how to use the `visualise` command to view the decision tree for a profile.
- -Visualise generates a <a href=https://en.wikipedia.org/wiki/DOT_(graph_description_language)>DOT</a> compliant representation of the decision tree, -for manual inspection, in the form of a gv file. - -## Using the Command Line - - -To visualise the decision tree, run the following command from the command line: - -`java -jar <path to JAR file> visualise [options] --profile-file="<path to profile>" --output-path="<path to desired output GV file>"` - -* `<path to JAR file>` the location of `generator.jar` -* `[options]` optionally a combination of [options](../commandLineOptions/VisualiseOptions.md) to configure how the command operates -* `<path to profile>` the location of the JSON profile file -* `<path to desired output GV file>` the location of the folder for the resultant GV file of the tree - -## Example - -Using the [Sample Profile](ExampleProfile1.json) that was created in the [first](CreatingAProfile.md) section, run the visualise command -as described above. - -With no options this should yield the following gv file: - -``` -graph tree { - bgcolor="transparent" - label="ExampleProfile1" - labelloc="t" - fontsize="20" - c0[bgcolor="white"][fontsize="12"][label="Column 1 Header is STRING -Column 2 Header is STRING"][shape=box] -c1[fontcolor="red"][label="Counts: -Decisions: 0 -Atomic constraints: 2 -Constraints: 1 -Expected RowSpecs: 1"][fontsize="10"][shape=box][style="dotted"] -} -``` - -This is a very simple tree; more complex profiles will generate more complex trees. - -## Hints and Tips - -* You may read a gv file with any text editor -* You can also use this representation with a visualiser such as [Graphviz](https://www.graphviz.org/). - - Other visualisers may also be suitable. The current requirements for a visualiser are: - - gv files are encoded with UTF-8, visualisers must support this encoding. - - gv files can include HTML encoded entities, visualisers should support this feature.
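The visualiser requirements above (UTF-8 encoding, well-formed DOT) can be sanity-checked before handing a gv file to a tool. A rough sketch, not a full DOT parser — the `check_gv` helper is hypothetical:

```python
def check_gv(data: bytes):
    """Rough sanity checks for a generator-produced gv file."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    assert text.count("{") == text.count("}"), "unbalanced braces"
    assert text.lstrip().startswith("graph"), "not an undirected DOT graph"
    return text

# Abbreviated version of the example gv output shown above
sample = b'graph tree {\n  label="ExampleProfile1"\n  c0[label="Column 1 is STRING"]\n}\n'
text = check_gv(sample)
```

If the checks pass, the file can then be rendered with any DOT-compatible visualiser such as Graphviz.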
- -# -[< Previous](GeneratingData.md) | [Contents](StepByStepInstructions.md) | [Next Section >](BasicUsage.md) - diff --git a/docs/archive/user/CombinationStrategies.md b/docs/user/CombinationStrategies.md similarity index 100% rename from docs/archive/user/CombinationStrategies.md rename to docs/user/CombinationStrategies.md diff --git a/examples/partitioning/README.md b/examples/partitioning/README.md deleted file mode 100644 index 2c9e19bb7..000000000 --- a/examples/partitioning/README.md +++ /dev/null @@ -1 +0,0 @@ -Many fields that can be partitioned separately. diff --git a/examples/partitioning/profile.json b/examples/partitioning/profile.json deleted file mode 100644 index 9b6962e67..000000000 --- a/examples/partitioning/profile.json +++ /dev/null @@ -1,180 +0,0 @@ -{ - "fields": [ - { - "name": "p1f1", - "type": "string" - }, - { - "name": "p1f2", - "type": "string" - }, - { - "name": "p1f3", - "type": "string" - }, - { - "name": "p2f1", - "type": "string" - }, - { - "name": "p2f2", - "type": "string" - }, - { - "name": "p2f3", - "type": "string" - }, - { - "name": "p3f1", - "type": "string" - }, - { - "name": "p3f2", - "type": "string" - }, - { - "name": "p3f3", - "type": "string" - } - ], - "constraints": [ - { - "field": "p1f1", - "inSet": [ - "p1-null", - "p1-string" - ] - }, - { - "if": { - "field": "p1f1", - "equalTo": "p1-null" - }, - "then": { - "field": "p1f2", - "isNull": true - }, - "else": { - "field": "p1f2", - "inSet": [ - "hello", - "goodbye" - ] - } - }, - { - "if": { - "field": "p1f1", - "equalTo": "p1-null" - }, - "then": { - "field": "p1f3", - "isNull": true - }, - "else": { - "anyOf": [ - { - "field": "p1f3", - "equalTo": "string-1" - }, - { - "field": "p1f3", - "equalTo": "string-2" - } - ] - } - }, - { - "field": "p2f1", - "inSet": [ - "p2-null", - "p2-string" - ] - }, - { - "if": { - "field": "p2f1", - "equalTo": "p2-null" - }, - "then": { - "field": "p2f2", - "isNull": true - }, - "else": { - "field": "p2f2", - "inSet": [ - 
"hello", - "goodbye" - ] - } - }, - { - "if": { - "field": "p2f1", - "equalTo": "p2-null" - }, - "then": { - "field": "p2f3", - "isNull": true - }, - "else": { - "anyOf": [ - { - "field": "p2f3", - "equalTo": "string-1" - }, - { - "field": "p2f3", - "equalTo": "string-2" - } - ] - } - }, - { - "field": "p3f1", - "inSet": [ - "p3-null", - "p3-string" - ] - }, - { - "if": { - "field": "p3f1", - "equalTo": "p3-null" - }, - "then": { - "field": "p3f2", - "isNull": true - }, - "else": { - "field": "p3f2", - "inSet": [ - "hello", - "goodbye" - ] - } - }, - { - "if": { - "field": "p3f1", - "equalTo": "p3-null" - }, - "then": { - "field": "p3f3", - "isNull": true - }, - "else": { - "anyOf": [ - { - "field": "p3f3", - "equalTo": "string-1" - }, - { - "field": "p3f3", - "equalTo": "string-2" - } - ] - } - } - ] -} \ No newline at end of file