This repository has been archived by the owner on Apr 14, 2023. It is now read-only.

Generate value for a nullable column with a percentage #1704

Open
semisft opened this issue Sep 10, 2020 · 3 comments · May be fixed by #1718
Labels
bug Something isn't working

Comments

@semisft
semisft commented Sep 10, 2020

Some columns must be populated in only a given percentage of rows. For example, one field must be filled in 10% of rows and another in 30%, within the same profile.
For 10%, I tried driving the field from a weighted inSet file through an if constraint, but the results come out at about 50%.
How can I configure this?

percent10.csv

1,10
0,90

profile.json

{
	"fields": [
		{
			"name": "percent10",
			"type": "integer"
		},
		{
			"name": "name",
			"type": "firstname",
			"nullable": true
		}
	],
	"constraints": [
		{
			"field": "percent10",
			"inSet": "percent10.csv"
		},
		{
			"if": {
				"field": "percent10",
				"equalTo": 1
			},
			"then": {
				"field": "name",
				"isNull": false
			},
			"else": {
				"field": "name",
				"isNull": true
			}
		}
	]
}

@Tom-hayden
Contributor

Hi @semisft, this appears to be a bug in datahelix. I have raised an issue for it: #1705

@ghost
ghost commented Jan 14, 2021

I've re-run the above profile against the latest version of the code to check whether the issue still exists. An example of the output (30 rows) is below:

percent10,name
1,Rory
1,Lily
1,Finn
0,
0,
0,
0,
0,
1,Amelia
1,Thea
1,Zara
1,Christina
1,Jake
0,
1,Maya
1,Liam
0,
1,Zac
1,Hamish
0,
0,
0,
0,
1,Lila
0,
0,
0,
1,Frank
0,
1,Phoebe

This shows a 50% spread across the two values of percent10 (15 rows each), where the weights in percent10.csv call for roughly 10% (3 rows) with 1 and 90% (27 rows) with 0. The issue is confirmed to still be valid; I will investigate further.
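
For reference, the split can be checked by tallying the percent10 column of the 30 rows above. This is only a quick verification sketch in Python, not part of datahelix; the list is transcribed by hand from the sample output.

```python
from collections import Counter

# percent10 column of the 30 sample rows above, in the order they appear.
sample = [
    1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
    1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
    0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
]

counts = Counter(sample)
share_of_ones = counts[1] / len(sample)

# percent10.csv gives 1 a weight of 10 and 0 a weight of 90,
# so about 3 of 30 rows would be expected to contain 1.
expected_ones = round(len(sample) * 10 / (10 + 90))

print(f"rows with 1: {counts[1]}, rows with 0: {counts[0]}")
print(f"observed share of 1: {share_of_ones:.2f}, expected rows with 1: {expected_ones}")
```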

@ghost
ghost commented Jan 14, 2021

Investigation:
In RandomRowSpecDecisionTreeWalker, a list of rowSpecs is generated that represents the kinds of row that can be produced. For this profile they are:

  1. name=not null & in (names) and percent10=not null & in (1)
  2. name=null and percent10=not null & in (0)

The generator then selects randomly between the two rowSpecs above for each row it generates. The rowSpecs themselves carry no weighting (it could have been inherited from the weights on the percent10 values), so the selection is uniform and the output is an even spread of rows from the two specs.
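
The effect can be reproduced outside datahelix with a small simulation. The following is a minimal Python sketch, not the actual Java code: `ROW_SPECS`, `pick_uniform` and `share_of_ones` are hypothetical names standing in for the two rowSpecs and the unweighted selection described above.

```python
import random

# Hypothetical stand-ins for the two rowSpecs listed above.
ROW_SPECS = [
    {"percent10": 1, "name_is_null": False},
    {"percent10": 0, "name_is_null": True},
]


def pick_uniform(rng):
    """Pick a row spec with equal probability, ignoring any weights."""
    return rng.choice(ROW_SPECS)


def share_of_ones(n, seed=0):
    """Fraction of n generated rows whose percent10 value is 1."""
    rng = random.Random(seed)
    ones = sum(pick_uniform(rng)["percent10"] for _ in range(n))
    return ones / n


observed = share_of_ones(100_000)
print(f"share of rows with percent10 == 1: {observed:.3f}")
```

With two specs and a uniform choice, the share settles near 0.5 no matter what weights percent10.csv declares, which matches the output seen above.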

Either of the below (or something more elegant) would be required:

  1. Each rowSpec carries its weighting (i.e. item 1 = 10% and item 2 = 90%), and the getRandomRowSpec() method uses it when selecting
  2. The rowSpecs are duplicated as many times as needed to create a representative spread, i.e. 9 copies of item 2 for every 1 copy of item 1, giving a pool of rowSpecs that can be selected from uniformly
  3. something else
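
As a rough illustration of options 1 and 2, here is a Python sketch under the same assumptions as before: the names `WEIGHTED_SPECS`, `pick_weighted` and `expand_by_weight` are hypothetical, and this is not how datahelix or the linked pull request implements the fix.

```python
import random
from functools import reduce
from math import gcd

# Hypothetical rowSpecs paired with the weights from percent10.csv.
WEIGHTED_SPECS = [
    ({"percent10": 1, "name_is_null": False}, 10),
    ({"percent10": 0, "name_is_null": True}, 90),
]


def pick_weighted(rng):
    """Option 1: choose a spec with probability proportional to its weight."""
    specs = [spec for spec, _ in WEIGHTED_SPECS]
    weights = [weight for _, weight in WEIGHTED_SPECS]
    return rng.choices(specs, weights=weights, k=1)[0]


def expand_by_weight():
    """Option 2: repeat each spec in proportion to its weight (1 : 9 here)."""
    divisor = reduce(gcd, [weight for _, weight in WEIGHTED_SPECS])
    pool = []
    for spec, weight in WEIGHTED_SPECS:
        pool.extend([spec] * (weight // divisor))
    return pool


rng = random.Random(0)
n = 100_000

weighted_share = sum(pick_weighted(rng)["percent10"] for _ in range(n)) / n

pool = expand_by_weight()
duplicated_share = sum(rng.choice(pool)["percent10"] for _ in range(n)) / n

print(f"option 1 share of 1s: {weighted_share:.3f}")
print(f"option 2 share of 1s: {duplicated_share:.3f} (pool size {len(pool)})")
```

Both approaches bring the share of rows with percent10 = 1 to about 10%. Duplication only works directly for integer weights, and the pool grows with the weight ratio, whereas weighted selection handles arbitrary weights without extra memory.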

@ghost ghost linked a pull request Jan 14, 2021 that will close this issue
ghost pushed a commit that referenced this issue Jan 18, 2021
@ghost ghost self-assigned this Jan 18, 2021