-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Feature: Simple Non-Correlated Tabular Generator #332
Conversation
* git ignore and rm DSstore * Update .gitignore Co-authored-by: Taylor Turner <[email protected]> --------- Co-authored-by: Taylor Turner <[email protected]>
* fFloat generator * extra line * another line * Update tests/test_float_generator.py Co-authored-by: Michael Davis <[email protected]> * another line * readability per michael's request * clean up * assertGreaterEqual * better test_sig_figs * sig_fig protection * clearer assert --------- Co-authored-by: Michael Davis <[email protected]>
…puts, and params_build
c55f906
to
37413db
Compare
|
||
def random_datetimes( | ||
rng: Generator, | ||
format: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be List[str]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maybe also say formats
) --- is it supposed to generate different formats for each date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating in #334
:rtype: numpy array | ||
""" | ||
date_list = [""] * num_rows | ||
if not format: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make sure it is a list? got a not so helpful error when I put in a string for the format
ValueError: a must be a sequence or an integer, not <class 'str'>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update in #334
|
||
:param rng: the np rng object used to generate random values | ||
:type rng: numpy Generator | ||
:param vocab: a list of values that are allowed in a string or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume these are all one character? If these are set to multiple characters, min max are not limiting the length of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just specify or add assert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up PR / creating an issue for this and below. Need to think about this
col_["rng"] = self.rng | ||
col_["num_rows"] = num_samples | ||
|
||
if (generator_name == "string") or (generator_name == "text"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generator_name in ["string", "text"]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good fix. upcoming PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in #334
noise_level = self.noise_level | ||
|
||
if self.is_correlated: | ||
return make_data_from_report( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider handling nans in correlation matrix. Tried to make synthetic data from a dataframe with numeric and categorical columns.
got Exception: The function only supports numerical variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up PR / creating an issue for this and below. Need to think about this
|
||
if self.is_correlated: | ||
return make_data_from_report( | ||
report=self.profile.report(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be another PR, but how does make_data_from_report determine the distribution of "possible" discrete columns. One of the columns was the iris target (0/1) --- synthetic data produced 0.0, 1.0, 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up PR / creating an issue for this and below. Need to think about this
Lots of improvements in this one PR (diff net of + 1406)
Lots of immaterial changes with pre-commit implementation with
isort
,flake8
,black
, etc...New Feature: Simple (i.e. non-correlated) tabular generator API implementation. This is implemented such that the same paradigm is utilized:
data
-->profile
--> passed toTabularGenerator
class --> outputspd.DataFrame
of synthetic data based on profile's values