This repository has been archived by the owner on Apr 14, 2023. It is now read-only.

Merge pull request #1150 from finos/1143-split-docs-into-two
Tidy Documentation
ms14981 authored Jul 26, 2019
2 parents ae264e7 + c7ca0e5 commit 6925c33
Showing 68 changed files with 311 additions and 313 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -21,7 +21,7 @@ jobs:
- image: circleci/openjdk:8-jdk-browsers
steps:
- checkout
# If changing build tools be sure to update GeneratorSetup.md in docs
# If changing build tools be sure to update BuildAndRun.md in docs
- run: gradle fatJar :output:test :profile:test :generator:test :common:test :orchestrator:test
- run:
name: Save test results
51 changes: 13 additions & 38 deletions README.md
@@ -1,12 +1,13 @@
# DataHelix Generator [![CircleCI](https://circleci.com/gh/finos/datahelix.svg?style=svg)](https://circleci.com/gh/finos/datahelix) [![FINOS - Incubating](https://cdn.jsdelivr.net/gh/finos/contrib-toolbox@master/images/badge-incubating.svg)](https://finosfoundation.atlassian.net/wiki/display/FINOS/Incubating)

![DataHelix logo](logo.png)
![DataHelix logo](docs/logo.png)

The generation of representative test and simulation data is a challenging and time-consuming task. The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation. The generator supports a number of generation modes, allowing the creation of data that both conforms to, or violates, the profile.

DataHelix is a proud member of the [Fintech Open Source Foundation](https://www.finos.org/) and operates within the [Data Technologies Program](https://www.finos.org/dt).

- [Getting Started](#Getting-Started)
- [First Time Setup](docs/user/gettingStarted/BuildAndRun.md)
- [Creating your first profile](#Creating-your-first-profile)
- [Adding constraints](#Adding-constraints)
- [Generating large datasets](#Generating-large-datasets)
@@ -17,15 +18,16 @@ DataHelix is a proud member of the [Fintech Open Source Foundation](https://www.
- [Contributing](#Contributing)
- [License](#License)


# Getting Started

_The following guide gives a 10 minute introduction to the generator via various practical examples. For more detailed documentation please refer to the [Profile Developer Guide](docs/ProfileDeveloperGuide.md), and if you are interested in extending / modifying the generator itself, refer to the [DataHelix Generator Developer Guide](docs/GeneratorDeveloperGuide.md)._
_The following guide gives a 10 minute introduction to the generator via various practical examples. For more detailed documentation please refer to the [User Guide](docs/user/UserGuide.md). If you are interested in extending / modifying the generator itself please refer to the [Developer Guide](docs/developer/DeveloperGuide.md)._

The generator has been written in Java, allowing it to work on Microsoft Windows, Apple Mac and Linux. You will need Java v1.8 installed to run the generator (you can run `java version` to check whether you meet this requirement), it can be [downloaded here](https://www.java.com/en/download/manual.jsp).
The generator has been written in Java, allowing it to work on Microsoft Windows, Apple Mac and Linux. You will need Java v1.8 installed to run the generator (you can run `java -version` to check whether you meet this requirement), it can be [downloaded here](https://www.java.com/en/download/manual.jsp).

The generator is distributed as a JAR file, with the latest release always available from the [GitHub releases page](https://github.com/finos/datahelix/releases/). The project is currently in beta and under active development. You can expect breaking changes in future releases, and new features too!

You are also welcome to download the source code and build the generator yourself. To do so, follow the instructions for [downloading and building it using a Java IDE](generator/docs/GeneratorSetup.md), or for [downloading and building it using Docker](generator/docs/DockerSetup.md).
You are also welcome to download the source code and build the generator yourself. To do so, follow the instructions for [downloading and building it using a Java IDE](docs/user/gettingStarted/BuildAndRun.md), or for [downloading and building it using Docker](docs/developer/DockerSetup.md).

Your feedback on the beta would be greatly appreciated. If you have any issues, feature requests, or ideas, please share them via the [GitHub issues page](https://github.com/finos/datahelix/issues).

@@ -157,8 +159,6 @@ The generator supports four different data types:
- **string** - sequences of unicode characters up to a maximum length of 1000 characters
- **datetime** - specific moments in time, with values in the range 0001-01-01T00:00:00.000 to 9999-12-31T23:59:59.999, with an optional granularity / precision (from a maximum of one year to a minimum of one millisecond) that can be defined via a `granularTo` constraint.

<!-- TODO: rename as datetime -->

We'll expand the example profile to add a new `age` field, a not-null integer in the range 1-99:

```json
@@ -296,42 +296,17 @@ firstName,age,nationalInsurance
[...]
```

You can find out more about the various constraints the generator supports in the detailed [Profile Developer Guide](docs/ProfileDeveloperGuide.md).
You can find out more about the various constraints the generator supports in the detailed [User Guide](docs/user/UserGuide.md).

## Generation modes

The generator supports a number of different generation modes:

- **random** - generates random data that abides by the given set of constraints, with the number of generated rows limited via the `--max-rows` option.
- **interesting** - generates data that is typically [deemed 'interesting'](https://github.com/finos/datahelix/wiki/Interesting-data-generation) from a test perspective, for example exploring [boundary values](https://en.wikipedia.org/wiki/Boundary-value_analysis).

The mode is specified via the `--generation-type` option. The following example outputs 'interesting' values for the current profile:

```
$ java -jar generator.jar generate --generation-type interesting --replace --profile-file=profile.json --output-path=output.csv
```

In this case it generates just 14 rows where you can see that it is exploring the boundary values of the constraints:

```
firstName,age,nationalInsurance
"Jon",18,"AA000000"
"John",18,"AA000000"
"Jon",18,"AJ000000F"
"John",18,"AJ000000F"
"Jon",19,"AA000000"
"John",19,"AA000000"
"Jon",19,"AJ000000F"
"John",19,"AJ000000F"
"Jon",1,
"John",1,
"Jon",99,"AA000000"
"John",99,"AA000000"
"Jon",99,"AJ000000F"
"John",99,"AJ000000F"
```
- **random** - _(default)_ generates random data that abides by the given set of constraints, with the number of generated rows limited via the `--max-rows` option.
- **full** - generates all the data that abides by the given set of constraints, with the number of generated rows limited via the `--max-rows` option.
- **interesting** - _(alpha feature)_ generates data that is typically [deemed 'interesting'](docs/user/alphaFeatures/Interesting.md) from a test perspective, for example exploring [boundary values](https://en.wikipedia.org/wiki/Boundary-value_analysis).

<!-- I've got a few questions about this output! -->
The mode is specified via the `--generation-type` option.

## Generating invalid data

@@ -411,8 +386,8 @@ firstName,age,nationalInsurance

## Next steps

That's the end of our getting started guide. Hopefully it has given you a good understanding of what the DataHelix generator is capable of. If you'd like to find out more about the various constraints the tool supports, the [Profile Developer Guide](docs/ProfileDeveloperGuide.md) is a good next step. You might also be interested in the [examples folder](https://github.com/finos/datahelix/tree/master/examples), which illustrates various features of the generator.
For more detail about the behaviour of certain profiles, see the [behaviour in detail.](./docs/BehaviourInDetail.md)
That's the end of our getting started guide. Hopefully it has given you a good understanding of what the DataHelix generator is capable of. If you'd like to find out more about the various constraints the tool supports, the [User Guide](docs/user/UserGuide.md) is a good next step. You might also be interested in the [examples folder](https://github.com/finos/datahelix/tree/master/examples), which illustrates various features of the generator.
For more detail about the behaviour of certain profiles, see the [behaviour in detail.](docs/developer/behaviour/BehaviourInDetail.md)

## Contributing

13 changes: 0 additions & 13 deletions docs/GeneratorDeveloperGuide.md

This file was deleted.

8 changes: 4 additions & 4 deletions docs/CucumberSyntax.md → docs/developer/CucumberSyntax.md
@@ -22,9 +22,9 @@ More examples can be seen in the [generator cucumber features](https://github.co
The framework supports setting configuration settings for the generator, defining the profile and describing the expected outcome. All of these are described below. All variable elements (e.g. `{generationStrategy}`) are case insensitive; all fields and values **are case sensitive**.

### Configuration options
* _the generation strategy is `{generationStrategy}`_ see [generation strategies](https://github.com/finos/datahelix/blob/master/generator/docs/GenerationTypes.md) - default: `random`
* _the combination strategy is `{combinationStrategy}`_ see [combination strategies](https://github.com/finos/datahelix/blob/master/generator/docs/CombinationStrategies.md) - default: `exhaustive`
* _the walker type is `{walkerType}`_ see [walker types](https://github.com/finos/datahelix/blob/master/generator/docs/TreeWalkerTypes.md) - default: `reductive`
* _the generation strategy is `{generationStrategy}`_ see [generation strategies](https://github.com/finos/datahelix/blob/master/docs/user/generationTypes/GenerationTypes.md) - default: `random`
* _the combination strategy is `{combinationStrategy}`_ see [combination strategies](https://github.com/finos/datahelix/blob/master/docs/user/CombinationStrategies.md) - default: `exhaustive`
* _the walker type is `{walkerType}`_ see [walker types](https://github.com/finos/datahelix/blob/master/docs/developer/decisionTreeWalkers/TreeWalkerTypes.md) - default: `reductive`
* _the data requested is `{generationMode}`_, either `violating` or `validating` - default: `validating`
* _the generator can generate at most `{int}` rows_, ensures that the generator will only emit `int` rows, default: `1000`
* _we do not violate constraint `{operator}`_, prevent this operator from being violated (see **Operators** section below), you can specify this step many times if required
@@ -49,7 +49,7 @@ Operators are converted to English language equivalents for use in cucumber, so
* _untyped fields are allowed_, sets the --allow-untyped-fields flag to false - default: flag is true

#### Operators
See [Predicate constraints](ProfileDeveloperGuide.md#Predicate-constraints), [Grammatical Constraints](ProfileDeveloperGuide.md#Grammatical-constraints) and [Presentational Constraints](ProfileDeveloperGuide.md#Presentational-constraints) for details of the constraints.
See [Predicate constraints](../user/UserGuide.md#Predicate-constraints), [Grammatical Constraints](../user/UserGuide.md#Grammatical-constraints) and [Presentational Constraints](../user/UserGuide.md#Presentational-constraints) for details of the constraints.

#### Operands
When specifying the operator/s for a field, ensure to format the value as in the table below:
File renamed without changes.
26 changes: 26 additions & 0 deletions docs/developer/DeveloperGuide.md
@@ -0,0 +1,26 @@
## Key Concepts

1. [Design Decisions](KeyDecisions.md)
1. [Decision Trees](decisionTrees/DecisionTrees.md)
1. [Profile Syntax](../user/Schema.md)


## Development

1. [Contributing](../../.github/CONTRIBUTING.md)
2. [Build and Run the Generator](../user/gettingStarted/BuildAndRun.md)
4. [Dependency Injection](DependencyInjection.md)
5. [Cucumber Testing](CucumberSyntax.md)

## Behavioural Explanations

1. [Behaviour in Detail](behaviour/BehaviourInDetail.md)
1. [Null Operator](behaviour/NullOperator.md)

## Key Algorithms and Data Structures

1. [Decision Trees](decisionTrees/DecisionTrees.md)
1. [Generation Algorithm](algorithmsAndDataStructures/GenerationAlgorithm.md)
1. [Field Fixing Strategy](algorithmsAndDataStructures/FieldFixingStrategy.md)
1. [String Generation](algorithmsAndDataStructures/StringGeneration.md)
1. [Tree Walker Types](decisionTreeWalkers/TreeWalkerTypes.md)
@@ -1,6 +1,6 @@
# Build and run the generator using Docker

The instructions below explain how to download the source code, and then build and run it using Docker. This generates a self-contained executable Docker image which can then run the generator without needing to install a JRE. If you would like to download and build the source code in order to contribute to development, we recommend you [build and run the generator using an IDE](GeneratorSetup.md) instead.
The instructions below explain how to download the source code, and then build and run it using Docker. This generates a self-contained executable Docker image which can then run the generator without needing to install a JRE. If you would like to download and build the source code in order to contribute to development, we recommend you [build and run the generator using an IDE](../user/gettingStarted/BuildAndRun.md) instead.

## Get Code

File renamed without changes.
@@ -1,12 +1,12 @@
# Decision tree generation

Given a set of rules, generate a [decision tree](../../docs/DecisionTrees/DecisionTrees.md) (or multiple if [partitioning](../../docs/DecisionTrees/Optimisation.md#Partitioning) was successful).
Given a set of rules, generate a [decision tree](../decisionTrees/DecisionTrees.md) (or multiple if [partitioning](../decisionTrees/Optimisation.md#Partitioning) was successful).

## Decision tree interpretation

An interpretation of the decision tree is defined by choosing an option for every decision visited in the tree.

![](interpreted-graph.png)
![](../../user/images/interpreted-graph.png)

In the above diagram the red lines represent one interpretation of the graph: an option has been chosen for every decision, leaving the set of constraints that the red lines touch at any point. These constraints are reduced into a fieldspec (see [Constraint Reduction](#constraint-reduction) below).
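
As a rough illustration of what "choosing an option for every decision" means, the sketch below enumerates every interpretation of a tiny tree. It is not the generator's real object model: the `Node` and `Decision` classes, the `interpretations` method and the string constraints are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model; not the generator's actual classes.
final class InterpretationSketch {

    /** Constraints that always apply at this node, plus decisions still to be made. */
    static final class Node {
        final List<String> constraints;
        final List<Decision> decisions;

        Node(List<String> constraints, List<Decision> decisions) {
            this.constraints = constraints;
            this.decisions = decisions;
        }
    }

    /** A choice between alternative sub-nodes (the options). */
    static final class Decision {
        final List<Node> options;

        Decision(List<Node> options) {
            this.options = options;
        }
    }

    /** Returns one constraint list per interpretation: exactly one option taken per decision visited. */
    static List<List<String>> interpretations(Node node) {
        List<List<String>> results = new ArrayList<>();
        results.add(new ArrayList<>(node.constraints));
        for (Decision decision : node.decisions) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> partial : results) {
                for (Node option : decision.options) {
                    // Decisions nested inside the chosen option are resolved recursively.
                    for (List<String> chosen : interpretations(option)) {
                        List<String> combined = new ArrayList<>(partial);
                        combined.addAll(chosen);
                        extended.add(combined);
                    }
                }
            }
            results = extended;
        }
        return results;
    }

    public static void main(String[] args) {
        Node lessThanTen = new Node(Arrays.asList("x < 10"), Collections.emptyList());
        Node isNull = new Node(Arrays.asList("x is null"), Collections.emptyList());
        Node root = new Node(
                Arrays.asList("x is integer"),
                Collections.singletonList(new Decision(Arrays.asList(lessThanTen, isNull))));
        // Prints [x is integer, x < 10] and then [x is integer, x is null]
        interpretations(root).forEach(System.out::println);
    }
}
```

Each returned list corresponds to one set of red lines in the diagram; reducing such a list into a fieldspec is a separate step that this sketch does not attempt.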

@@ -32,7 +32,7 @@ could collapse to

*(note: this is a conceptual example and not a reflection of actual object structure)*

See [Set restriction and generation](SetRestrictionAndGeneration.md) for a more indepth explanation of how the constraints are merged and data generated.
See [Set restriction and generation](../../user/SetRestrictionAndGeneration.md) for a more in depth explanation of how the constraints are merged and data generated.

This object has all the information needed to produce the values `[3, 4, 5, 6]`.

@@ -50,9 +50,9 @@ Databags can be merged, but merging two databags fails if they have any keys in

Fieldspecs are able to produce streams of databags containing valid values for the field they describe. Additional operations can then be applied over these streams, such as:

* A memoizing decorator that records values being output so they can be replayed inexpensively
* A memoization decorator that records values being output so they can be replayed inexpensively
* A filtering decorator that prevents repeated values being output
* A merger that takes multiple streams and applies one of the available [combination strategies](CombinationStrategies.md)
* A merger that takes multiple streams and applies one of the available [combination strategies](../../user/CombinationStrategies.md)
* A concatenator that takes multiple streams and outputs all the members of each
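
The decorators listed above can be approximated with the standard `java.util.stream` API. This is only a sketch under stated assumptions: it streams plain values rather than databags, the class and method names are hypothetical, the memoization is eager (a real decorator might record values lazily as they are emitted), and the merger with its combination strategies is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helpers; the generator's own stream types are not shown here.
final class StreamDecoratorSketch {

    /** Filtering decorator: suppresses values that have already been output. */
    static <T> Stream<T> withoutRepeats(Stream<T> source) {
        return source.distinct();
    }

    /** Concatenator: outputs all the members of each stream, in order. */
    @SafeVarargs
    static <T> Stream<T> concatenate(Stream<T>... sources) {
        return Arrays.stream(sources).flatMap(s -> s);
    }

    /** Memoization (eager simplification): records the values once so they can be replayed cheaply. */
    static <T> Supplier<Stream<T>> memoize(Stream<T> source) {
        List<T> recorded = source.collect(Collectors.toList());
        return recorded::stream;
    }

    public static void main(String[] args) {
        Stream<Integer> combined = withoutRepeats(concatenate(Stream.of(3, 4), Stream.of(4, 5, 6)));
        Supplier<Stream<Integer>> replayable = memoize(combined);
        // Both lines print [3, 4, 5, 6]; the second replays the recorded values.
        System.out.println(replayable.get().collect(Collectors.toList()));
        System.out.println(replayable.get().collect(Collectors.toList()));
    }
}
```

Returning a `Supplier` from `memoize` reflects the fact that a Java `Stream` can only be consumed once, so replaying requires a fresh stream over the recorded values each time.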

# Output
@@ -1,8 +1,8 @@
# String Generation

We use a Java library called [dk.brics.automaton](http://www.brics.dk/automaton/) to analyse regexes and generate valid (and invalid for [violation](DeliberateViolation.md)) strings based on them. It works by representing the regex as a finite state machine. It might be worth reading about state machines for those who aren't familiar: [https://en.wikipedia.org/wiki/Finite-state_machine](https://en.wikipedia.org/wiki/Finite-state_machine). Consider the following regex: `ABC[a-z]?(A|B)`. It would be represented by the following state machine:
We use a Java library called [dk.brics.automaton](http://www.brics.dk/automaton/) to analyse regexes and generate valid (and invalid for [violation](../../user/alphaFeatures/DeliberateViolation.md)) strings based on them. It works by representing the regex as a finite state machine. It might be worth reading about state machines for those who aren't familiar: [https://en.wikipedia.org/wiki/Finite-state_machine](https://en.wikipedia.org/wiki/Finite-state_machine). Consider the following regex: `ABC[a-z]?(A|B)`. It would be represented by the following state machine:

![](finite-state-machine.svg)
![](../../user/images/finite-state-machine.svg)

<!-- graphvis dot file for the above graph
graph {
@@ -41,7 +41,7 @@ Due to the way that the generator computes textual data internally the generatio

## Anchors

dk.brics.automaton doesn't support start and end anchors `^` & `$` and instead matches the entire word as if the anchors were always present. For some of our use cases though it may be that we want to match the regex in the middle of a string somewhere, so we have two versions of the regex constraint - [matchingRegex](https://github.com/finos/datahelix/blob/master/docs/ProfileDeveloperGuide.md#predicate-matchingregex) and [containingRegex](https://github.com/finos/datahelix/blob/master/docs/ProfileDeveloperGuide.md#predicate-containingregex). If `containingRegex` is used then we simply add a `.*` to the start and end of the regex before passing it into the automaton. Any `^` or `$` characters passed at the start or end of the string respectively are removed, as the automaton will treat them as literal characters.
dk.brics.automaton doesn't support start and end anchors `^` & `$` and instead matches the entire word as if the anchors were always present. For some of our use cases though it may be that we want to match the regex in the middle of a string somewhere, so we have two versions of the regex constraint - [matchingRegex](../../user/UserGuide.md#predicate-matchingregex) and [containingRegex](../../user/UserGuide.md#predicate-containingregex). If `containingRegex` is used then we simply add a `.*` to the start and end of the regex before passing it into the automaton. Any `^` or `$` characters passed at the start or end of the string respectively are removed, as the automaton will treat them as literal characters.
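
The anchor handling described above might look roughly like the following. This is a minimal sketch, not the generator's actual code: the class and method names are hypothetical, escaped anchors such as `\$` are not handled, stripping before wrapping is an assumed order, and `java.util.regex` stands in for dk.brics.automaton purely to demonstrate whole-word matching.

```java
import java.util.regex.Pattern;

// Hypothetical helper names; the real preprocessing lives elsewhere in the generator.
final class RegexAnchorSketch {

    /** Removes a leading ^ and a trailing $, which the automaton would treat as literal characters. */
    static String stripAnchors(String regex) {
        String result = regex;
        if (result.startsWith("^")) {
            result = result.substring(1);
        }
        if (result.endsWith("$")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    /** matchingRegex: the whole string must match, so only the anchors are dropped. */
    static String forMatchingRegex(String regex) {
        return stripAnchors(regex);
    }

    /** containingRegex: wrap in .* so the pattern may occur anywhere in the string. */
    static String forContainingRegex(String regex) {
        return ".*" + stripAnchors(regex) + ".*";
    }

    public static void main(String[] args) {
        String regex = "^[0-9]{3}$";
        // Pattern.matches requires the entire input to match, mimicking the automaton's behaviour.
        System.out.println(Pattern.matches(forMatchingRegex(regex), "ab123cd"));   // false
        System.out.println(Pattern.matches(forContainingRegex(regex), "ab123cd")); // true
    }
}
```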

## Automaton data types
The automaton represents the state machine using the following types:
File renamed without changes.