ClusterBasedNormalizer vs GaussianNormalizer vs PowerTransformer #613

Open
candalfigomoro opened this issue Feb 13, 2023 · 1 comment
Labels
question General question about the software

Comments

candalfigomoro commented Feb 13, 2023

When using CTGAN, data is normalized using ClusterBasedNormalizer.

In RDT, GaussianNormalizer is also implemented.

What are the advantages of ClusterBasedNormalizer and GaussianNormalizer compared to using sklearn's PowerTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) with the Yeo-Johnson method? Couldn't a power transform be used instead (which would perhaps be faster than ClusterBasedNormalizer)?
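For reference, here is a minimal sketch of the Yeo-Johnson transform and its inverse for a fixed lambda, written in plain Python. It is not sklearn's implementation: as far as I understand, `PowerTransformer` also estimates lambda per feature by maximum likelihood and standardizes the output by default. The function names `yeo_johnson` and `yeo_johnson_inverse` and the sample values are hypothetical, chosen only for illustration.

```python
import math


def yeo_johnson(x: float, lmbda: float) -> float:
    """Apply the Yeo-Johnson power transform to one value for a fixed lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)


def yeo_johnson_inverse(y: float, lmbda: float) -> float:
    """Invert yeo_johnson for the same lambda (the transform preserves sign)."""
    if y >= 0:
        if lmbda == 0:
            return math.expm1(y)
        return (lmbda * y + 1.0) ** (1.0 / lmbda) - 1.0
    if lmbda == 2:
        return -math.expm1(-y)
    return 1.0 - (1.0 - (2.0 - lmbda) * y) ** (1.0 / (2.0 - lmbda))


# Illustrative values only (not taken from any real dataset).
sample = [-3.5, -1.0, 0.0, 0.5, 10.0]
lmbda = 0.5  # assumed here; sklearn would estimate this from the data
transformed = [yeo_johnson(v, lmbda) for v in sample]
recovered = [yeo_johnson_inverse(t, lmbda) for t in transformed]
print(transformed)
print(recovered)
```

The transform is a single monotonic map per column, so it is cheap to fit and invert, and it preserves the ordering of the values.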

Thank you

candalfigomoro added the "new" and "question" labels on Feb 13, 2023
npatki (Contributor) commented Mar 29, 2023

Hi @candalfigomoro, thanks for the feedback. We'll keep this issue open to share any information as we investigate the specifics of these transformers.

Some considerations:

  • Quality: Does using this transformer significantly improve the quality of the synthetic data? To evaluate quality, we use the SDMetrics quality report.
  • Performance: How quickly can this transformer fit, transform and reverse transform compared to the others?
  • Memory: What would the overall file size be if you saved a synthesizer that used this transformer vs. the others?

If you have done any exploration yourself along these lines, we'd be very eager to see it!
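As a starting point for that kind of exploration, here is a rough sketch of a harness covering the performance and memory considerations above. It takes plain callables, so it makes no assumption about any particular transformer's method names. `benchmark` and `ZScoreScaler` are hypothetical names: the scaler is only a stand-in to exercise the harness, not one of the transformers discussed here. Quality is out of scope for this sketch, since it needs a full synthesizer run and the SDMetrics quality report.

```python
import pickle
import statistics
import time


class ZScoreScaler:
    """Hypothetical stand-in transformer, used only to exercise the harness."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # guard against zero spread

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def reverse_transform(self, values):
        return [v * self.std + self.mean for v in values]


def benchmark(fit, transform, reverse, fitted_state, data):
    """Time fit / transform / reverse and measure the pickled size of the state.

    fitted_state is a zero-argument callable returning the object to pickle
    after fitting; for a real transformer that would typically be the
    transformer itself.
    """
    start = time.perf_counter()
    fit(data)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    transformed = transform(data)
    transform_seconds = time.perf_counter() - start

    start = time.perf_counter()
    recovered = reverse(transformed)
    reverse_seconds = time.perf_counter() - start

    return {
        "fit_seconds": fit_seconds,
        "transform_seconds": transform_seconds,
        "reverse_seconds": reverse_seconds,
        "pickled_bytes": len(pickle.dumps(fitted_state())),
        # Sanity check that the reverse transform recovers the input.
        "max_round_trip_error": max(abs(a - b) for a, b in zip(data, recovered)),
    }


# Illustrative data only.
data = [float(i % 17) for i in range(1000)]
scaler = ZScoreScaler()
result = benchmark(
    fit=scaler.fit,
    transform=scaler.transform,
    reverse=scaler.reverse_transform,
    fitted_state=lambda: vars(scaler),
    data=data,
)
print(result)
```

To compare real transformers, pass each one's own fit, transform and reverse-transform callables (wrapped in a lambda if the signature needs extra arguments) and run them on the same column.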

npatki added the "under discussion" label and removed the "new" label on Mar 29, 2023
npatki removed the "under discussion" label on Mar 6, 2024