New Feature: Simple Non-Correlated Tabular Generator #332

taylorfturner · 2023-08-09T20:03:36Z

This PR bundles many improvements (net diff of +1406).

- Many of the changes are immaterial: they come from adding pre-commit hooks (isort, flake8, black, etc.), and the resulting formatting changes are included in this PR.
- New Feature: a simple (i.e. non-correlated) tabular generator API. It uses the same paradigm: data --> profile --> passed to the TabularGenerator class --> outputs a pd.DataFrame of synthetic data based on the profile's values.

* Float generator
* extra line
* another line
* Update tests/test_float_generator.py (Co-authored-by: Michael Davis)
* another line
* readability per michael's request
* clean up
* assertGreaterEqual
* better test_sig_figs
* sig_fig protection
* clearer assert

Co-authored-by: Michael Davis

ssharpe42 · 2023-08-23T15:35:15Z

synthetic_data/distinct_generators/datetime_generator.py

+
+def random_datetimes(
+    rng: Generator,
+    format: Optional[str] = None,


The format parameter should be typed List[str], and could be renamed formats. Is it supposed to generate a different format for each date?

taylorfturner (Aug 25, 2023): updating in #334
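
For illustration only, here is a rough sketch of one reading of this suggestion: accept a list of strftime formats and draw one format per generated date. This is not the code that landed in #334. The name random_datetimes_sketch, the formats parameter name, the start/end defaults, and the fallback format are all assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional

import numpy as np
from numpy.random import Generator


def random_datetimes_sketch(
    rng: Generator,
    formats: Optional[List[str]] = None,
    start: datetime = datetime(2000, 1, 1),
    end: datetime = datetime(2020, 1, 1),
    num_rows: int = 1,
) -> np.ndarray:
    """Hypothetical sketch: draw num_rows dates, each rendered in a randomly chosen format."""
    if not formats:
        formats = ["%Y-%m-%d"]  # assumed fallback, not taken from the PR
    span_seconds = int((end - start).total_seconds())
    # One random offset (in seconds) and one format per row.
    offsets = rng.integers(0, span_seconds + 1, size=num_rows)
    chosen = rng.choice(formats, size=num_rows)
    return np.array(
        [
            (start + timedelta(seconds=int(offset))).strftime(str(fmt))
            for offset, fmt in zip(offsets, chosen)
        ]
    )


rng = np.random.default_rng(0)
dates = random_datetimes_sketch(rng, formats=["%Y-%m-%d", "%d/%m/%Y"], num_rows=5)
```

With this reading, a single column can mix formats across rows. If one format per column is the intent, the choice would be drawn once instead of per row.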

ssharpe42 · 2023-08-23T15:36:34Z

synthetic_data/distinct_generators/datetime_generator.py

+    :rtype: numpy array
+    """
+    date_list = [""] * num_rows
+    if not format:


Maybe validate that format is a list? Passing a single string for format raised a not-so-helpful error:

ValueError: a must be a sequence or an integer, not <class 'str'>

taylorfturner (Aug 25, 2023): update in #334
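
A minimal sketch of the kind of guard being asked for, assuming a bare string should be wrapped into a one-element list rather than rejected. normalize_formats is a hypothetical helper name, not a function from this repository, and the actual change in #334 may look different.

```python
from typing import List, Optional, Sequence, Union


def normalize_formats(
    formats: Union[None, str, Sequence[str]],
) -> Optional[List[str]]:
    """Hypothetical helper: coerce the format argument to a list of strings."""
    if formats is None:
        return None
    if isinstance(formats, str):
        # A bare string is wrapped so later code can always treat it as a list.
        return [formats]
    formats = list(formats)
    bad = [f for f in formats if not isinstance(f, str)]
    if bad:
        # Fail early with a message that names the offending values.
        raise TypeError(
            f"format must be a string or a list of strings, got {bad!r}"
        )
    return formats


wrapped = normalize_formats("%Y-%m-%d")
unchanged = normalize_formats(["%Y-%m-%d", "%d/%m/%Y"])
```

Calling a helper like this at the top of the generator would surface a clear message before the value reaches the random choice step.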

ssharpe42 · 2023-08-23T15:41:32Z

synthetic_data/distinct_generators/text_generator.py

+
+    :param rng: the np rng object used to generate random values
+    :type rng: numpy Generator
+    :param vocab: a list of values that are allowed in a string or None


Does this assume every vocab entry is a single character? If entries contain multiple characters, min and max no longer limit the length of the string.

Maybe just document the assumption or add an assert.

taylorfturner (Aug 25, 2023): Follow-up PR; creating an issue for this and the comments below. Need to think about this.
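
To make the concern concrete, here is a small sketch. It assumes the generator builds each string by joining a number of vocab draws, which is how this comment reads but is not confirmed by the diff shown here. check_single_char_vocab is a hypothetical helper.

```python
from typing import List, Optional


def check_single_char_vocab(vocab: Optional[List[str]]) -> None:
    """Hypothetical check: reject vocab entries that are not exactly one character."""
    if vocab is None:
        return
    multi = [v for v in vocab if len(v) != 1]
    if multi:
        raise ValueError(
            f"vocab entries must be single characters, got {multi!r}"
        )


# Why it matters: three draws from a multi-character vocab can exceed a max length of 3.
vocab = ["ab", "cde"]
max_len = 3
worst_case = "".join(["cde"] * max_len)
worst_case_length = len(worst_case)
```

The alternative to rejecting such a vocab is to document that min and max count vocab draws rather than characters.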

ssharpe42 · 2023-08-23T15:44:42Z

synthetic_data/generators.py

+            col_["rng"] = self.rng
+            col_["num_rows"] = num_samples
+
+            if (generator_name == "string") or (generator_name == "text"):


Could this be generator_name in ["string", "text"]?

taylorfturner (Aug 23, 2023): good fix. upcoming PR

taylorfturner (Aug 25, 2023): fixed in #334

ssharpe42 · 2023-08-23T15:59:31Z

synthetic_data/generators.py

+            noise_level = self.noise_level
+
+        if self.is_correlated:
+            return make_data_from_report(


Consider handling NaNs in the correlation matrix. I tried to make synthetic data from a DataFrame with numeric and categorical columns and got:

Exception: The function only supports numerical variables

taylorfturner (Aug 23, 2023): good call out

taylorfturner (Aug 25, 2023): Follow-up PR; creating an issue for this and the comments below. Need to think about this.
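
One possible guard, sketched under an assumption about the report layout that this thread does not confirm: non-numeric columns show up as NaN rows and columns in the correlation matrix. numeric_submatrix is a hypothetical helper, not part of make_data_from_report.

```python
import numpy as np


def numeric_submatrix(corr):
    """Hypothetical helper: drop columns whose diagonal correlation entry is NaN."""
    corr = np.asarray(corr, dtype=float)
    # Assumption: a NaN on the diagonal marks a column with no usable correlations.
    keep = ~np.isnan(np.diag(corr))
    idx = np.flatnonzero(keep)
    sub = corr[np.ix_(idx, idx)]
    if np.isnan(sub).any():
        # Remaining NaNs would need a separate policy (e.g. imputation), so fail loudly.
        raise ValueError("correlation matrix still contains NaN after filtering")
    return idx, sub


corr = np.array(
    [
        [1.0, 0.3, np.nan],
        [0.3, 1.0, np.nan],
        [np.nan, np.nan, np.nan],
    ]
)
idx, sub = numeric_submatrix(corr)
```

The returned indices identify which columns can go through the correlated path. The dropped columns would need their own handling, such as falling back to a non-correlated generator.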

ssharpe42 · 2023-08-23T16:03:02Z

synthetic_data/generators.py

+
+        if self.is_correlated:
+            return make_data_from_report(
+                report=self.profile.report(),


This might belong in another PR, but how does make_data_from_report determine the distribution of "possible" discrete columns? One of the columns was the iris target (0/1), but the synthetic data contained 0.0, 1.0, 2.0.

taylorfturner (Aug 25, 2023): Follow-up PR; creating an issue for this and the comments below. Need to think about this.
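
This thread does not say how make_data_from_report treats discrete columns. As one crude post-processing option (an assumption, not the library's behavior), generated values could be snapped to the nearest value observed in the original column. snap_to_observed is a hypothetical helper.

```python
from typing import Iterable, List


def snap_to_observed(values: Iterable[float], observed: Iterable[float]) -> List[float]:
    """Hypothetical helper: map each value to the closest level seen in the source column."""
    levels = sorted(set(observed))
    if not levels:
        raise ValueError("observed must contain at least one value")
    # Ties resolve to the lower level because min() returns the first minimum.
    return [min(levels, key=lambda level: abs(level - v)) for v in values]


snapped = snap_to_observed([0.0, 1.0, 2.0, 0.4], observed=[0, 1])
```

Snapping keeps the output inside the observed support, but it distorts the category frequencies. Sampling directly from the observed frequencies is probably the better long-term fix.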

taylorfturner added the enhancement New feature or request label Aug 9, 2023

taylorfturner self-assigned this Aug 9, 2023

taylorfturner added the work_in_progress label Aug 9, 2023

taylorfturner changed the title ~~New Feature: Simple Non-Correlated Tabular Generator~~ [WIP] New Feature: Simple Non-Correlated Tabular Generator Aug 9, 2023

lizlouise1335 and others added 26 commits August 10, 2023 09:00

Datetime generator and tests

3efc788

mock added

5750fe2

clean up comments

13e141c

fix: add feature to test workflow

a7b0181

git ignore and rm DSstore (#291)

a4ee6d4

* git ignore and rm DSstore
* Update .gitignore (Co-authored-by: Taylor Turner)

Co-authored-by: Taylor Turner

pre-made list

0ef059e

start and end type specification

cf4d1e9

removing unneeded space

d01d863

better name for a function

a0bf09f

better format catch from michael

a8fbf07

better format catch from michael

5621fed

changed to equate better across languages

e7c54da

docstring fix and update

20874f6

default values at declaration

548a4c1

del space

95f12cc

testing for format usage

02fb9c5

testing for format usage

d32a96b

categorical test + gen

5b8356b

space and ensuring num_rows does not exceed nbr of categories

36240c3

space fix

203f99e

second row test

a283001

fixed merge conflicts

9d04f99

random int generator and test

29b4550

refactored implementation of int generator and tests

4e7c534

added space between class and imports. added more test cases

72e7dc7

drahc1R added 12 commits August 10, 2023 09:08

broken test updates:

f9f2f08

categorical fix

90501d3

int string error

48932a1

tests for get_ordered_column_integration, uncorrelated_synthesize outputs, and params_build

d1641e3

remove print statements

5ae54a0

fixed precision edge case of int

88bdaa1

reintegrated outdated tests

801c2b0

added test case for None columns

6a6ff57

removed print

3d96f79

changed test to f string

6d40678

fixed docstrings for datetime generator

3171ca7

empty commit

37413db

taylorfturner force-pushed the feature/simple-tabular-generator branch from c55f906 to 37413db Compare August 10, 2023 13:16

taylorfturner requested review from tazitoo, ssharpe42, ksneab7 and danielbarcklow August 10, 2023 13:19

taylorfturner removed the work_in_progress label Aug 10, 2023

taylorfturner changed the title ~~[WIP] New Feature: Simple Non-Correlated Tabular Generator~~ New Feature: Simple Non-Correlated Tabular Generator Aug 10, 2023

taylorfturner enabled auto-merge (squash) August 10, 2023 13:31

ksneab7 approved these changes Aug 10, 2023

ssharpe42 reviewed Aug 23, 2023

micdavis approved these changes Aug 23, 2023

taylorfturner merged commit b9e5c56 into main Aug 23, 2023
5 checks passed

taylorfturner mentioned this pull request Aug 25, 2023

Quick Fixes #334

Merged
