New Feature: Simple Non-Correlated Tabular Generator #332

Merged
merged 178 commits into from
Aug 23, 2023

@taylorfturner taylorfturner commented Aug 9, 2023

Many improvements in this one PR (net diff of +1406):

  • Many non-functional changes from adopting pre-commit hooks (isort, flake8, black, etc.).

    • This PR also includes the changes related to the added pre-commit hooks.
  • New Feature: a simple (i.e. non-correlated) tabular generator API. It keeps the same paradigm: data --> profile --> passed to the TabularGenerator class --> outputs a pd.DataFrame of synthetic data based on the profile's values.
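To make the data --> profile --> generator --> synthetic data flow concrete, here is a rough, standard-library-only sketch. It is not this package's API: `build_profile` and `synthesize` are hypothetical names, the "profile" is reduced to a per-column (min, max) pair, and the output is a dict of lists standing in for the pd.DataFrame. Each column is sampled independently, which is what "non-correlated" means here.

```python
import random
from typing import Dict, List, Tuple

Profile = Dict[str, Tuple[float, float]]


def build_profile(data: Dict[str, List[float]]) -> Profile:
    """Hypothetical stand-in for profiling: record (min, max) per column."""
    return {name: (min(values), max(values)) for name, values in data.items()}


def synthesize(profile: Profile, num_samples: int, seed: int = 0) -> Dict[str, List[float]]:
    """Sample every column independently from its own profiled range."""
    rng = random.Random(seed)
    return {
        name: [rng.uniform(low, high) for _ in range(num_samples)]
        for name, (low, high) in profile.items()
    }


# Usage: real data in, profile in the middle, synthetic columns out.
data = {"age": [21.0, 35.0, 58.0], "height": [1.5, 1.8, 1.7]}
synthetic = synthesize(build_profile(data), num_samples=5)
```

Because no column looks at any other column while sampling, relationships between columns in the original data are not preserved; only each column's own range is.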

@taylorfturner added the enhancement (New feature or request) label Aug 9, 2023
@taylorfturner self-assigned this Aug 9, 2023
@taylorfturner changed the title from "New Feature: Simple Non-Correlated Tabular Generator" to "[WIP] New Feature: Simple Non-Correlated Tabular Generator" Aug 9, 2023
lizlouise1335 and others added 26 commits August 10, 2023 09:00
* git ignore and rm DSstore

* Update .gitignore

Co-authored-by: Taylor Turner <[email protected]>

---------

Co-authored-by: Taylor Turner <[email protected]>
* fFloat generator

* extra line

* another line

* Update tests/test_float_generator.py

Co-authored-by: Michael Davis <[email protected]>

* another line

* readability per michael's request

* clean up

* assertGreaterEqual

* better test_sig_figs

* sig_fig protection

* clearer assert

---------

Co-authored-by: Michael Davis <[email protected]>
@taylorfturner force-pushed the feature/simple-tabular-generator branch from c55f906 to 37413db August 10, 2023 13:16
@taylorfturner changed the title from "[WIP] New Feature: Simple Non-Correlated Tabular Generator" to "New Feature: Simple Non-Correlated Tabular Generator" Aug 10, 2023
@taylorfturner enabled auto-merge (squash) August 10, 2023 13:31

def random_datetimes(
    rng: Generator,
    format: Optional[str] = None,
Contributor

This should be List[str]

Contributor

(Maybe also rename it to formats.) Is it supposed to generate a different format for each date?

Contributor Author

updating in #334

:rtype: numpy array
"""
date_list = [""] * num_rows
if not format:
Contributor

Maybe validate that it is a list? I got a not-so-helpful error when I passed a string for the format:

ValueError: a must be a sequence or an integer, not <class 'str'>

Contributor Author

update in #334
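
One way the check requested in this thread could look, as a sketch only: the ValueError above is likely raised when a bare string reaches the rng's choice call, so the argument can be normalized to a list first. `normalize_formats` and `DEFAULT_FORMATS` are hypothetical names for illustration, not code from this PR or from #334, and the default format is an assumption.

```python
from typing import List, Optional, Sequence, Union

# Assumed default for illustration; the real default is not shown in this thread.
DEFAULT_FORMATS: List[str] = ["%Y-%m-%d"]


def normalize_formats(formats: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Return a list of format strings, or fail early with a clear message."""
    if not formats:
        # None, "" or an empty list: fall back to the default formats.
        return list(DEFAULT_FORMATS)
    if isinstance(formats, str):
        # A single format string is wrapped instead of being rejected.
        return [formats]
    if not isinstance(formats, (list, tuple)) or not all(
        isinstance(f, str) for f in formats
    ):
        raise TypeError(
            "formats must be a format string or a list of format strings, "
            f"got {type(formats).__name__}"
        )
    return list(formats)
```

With this in place, `normalize_formats("%Y-%m-%d")` yields a one-element list, so a caller who passes a plain string gets dates back instead of an error from deep inside the sampling code.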


:param rng: the np rng object used to generate random values
:type rng: numpy Generator
:param vocab: a list of values that are allowed in a string or None
Contributor

Does this assume every vocab entry is a single character? If entries have multiple characters, min and max no longer limit the length of the string.

Contributor

Maybe just document that, or add an assert.

Contributor Author

Follow-up PR; creating an issue for this and the comments below. Need to think about this.
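
To illustrate the concern in this thread, here is a small sketch of the "add assert" option. `validate_vocab` is a hypothetical helper, not part of this PR; it assumes strings are built by joining a number of vocab draws, so a multi-character entry makes the result longer than the number of draws.

```python
from typing import List, Optional


def validate_vocab(vocab: Optional[List[str]]) -> None:
    """Reject vocab entries that are not exactly one character long."""
    if vocab is None:
        return  # None is allowed per the docstring above
    bad = [v for v in vocab if not isinstance(v, str) or len(v) != 1]
    if bad:
        raise ValueError(f"vocab entries must be single characters, got {bad!r}")


# Why it matters: three draws that all land on "ab" give six characters,
# so a maximum of three draws no longer caps the string length at three.
too_long = "".join(["ab", "ab", "ab"])
assert len(too_long) == 6
```

The alternative raised in the thread, documenting the single-character assumption in the docstring, avoids the runtime check but leaves the length bounds silently wrong for multi-character vocab.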

col_["rng"] = self.rng
col_["num_rows"] = num_samples

if (generator_name == "string") or (generator_name == "text"):
Contributor

generator_name in ["string", "text"]?

Contributor Author

good fix. upcoming PR

Contributor Author

fixed in #334
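
For reference, the suggested membership test behaves the same as the chained comparison in the diff. The function names below are made up for this comparison and do not appear in the PR.

```python
def is_text_like_chained(generator_name: str) -> bool:
    # Form used in the diff under review.
    return (generator_name == "string") or (generator_name == "text")


def is_text_like_membership(generator_name: str) -> bool:
    # Form suggested by the reviewer; easier to extend with more names.
    return generator_name in ["string", "text"]


# Both forms agree on matching and non-matching names.
for name in ["string", "text", "int", "float", "datetime", ""]:
    assert is_text_like_chained(name) == is_text_like_membership(name)
```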

noise_level = self.noise_level

if self.is_correlated:
    return make_data_from_report(
Contributor

Consider handling NaNs in the correlation matrix. I tried to make synthetic data from a DataFrame with numeric and categorical columns and got:

Exception: The function only supports numerical variables

Contributor Author

good call out

Contributor Author

Follow-up PR; creating an issue for this and the comments below. Need to think about this.


if self.is_correlated:
    return make_data_from_report(
        report=self.profile.report(),
Contributor

Might be another PR, but how does make_data_from_report determine the distribution of "possible" discrete columns? One of the columns was the iris target (0/1), and the synthetic data produced 0.0, 1.0, 2.0.

Contributor Author

Follow-up PR; creating an issue for this and the comments below. Need to think about this.

@taylorfturner merged commit b9e5c56 into main Aug 23, 2023
5 checks passed
@taylorfturner mentioned this pull request Aug 25, 2023
Labels
enhancement New feature or request

7 participants