NaN values for numerical variables DISAPPEAR when using CTGANSynthesizer #2288
Comments
Hi there @wilcovanvorstenbosch, are you able to share just the sdtypes in your metadata for your numerical columns? You can either:
With the sdtypes, I can try to reproduce the issue on my end! |
Dear Srini [@srinify ], First of all: I'm delighted that you are willing to help out. It is beyond my expectations, and I greatly appreciate it.
Hope this suffices. If you need any more info, let me know. Kind regards, |
Hi @wilcovanvorstenbosch I tried to reproduce this issue with a fake dataset (with a few thousand rows) but wasn't able to unfortunately. My fake dataset with lots of missing values still had the same ratios in the synthetic data when using your code. Were you able to train CTGANSynthesizer on your full dataset (with 2 million rows) or did you use a subset? I'm asking because training with such a large dataset was taking an incredibly long time, so I figured I would clarify the size of your training data! Do you mind swapping CTGANSynthesizer out with GaussianCopulaSynthesizer instead to see if that helps unblock you? |
Dear Srini, I am indeed using a subset, for now. I'm randomly sampling 10,000 rows from the original dataset to test this package. How does GaussianCopulaSynthesizer compare to CTGAN? Does it retain correlations between attributes in a similar fashion? Either way, I would like to compare the results against the CTGAN synthesizer, so I will be looking to fix this. Do you think the parameters of the model could somehow prevent this issue? Kind regards, |
Just to clarify @srinify: the values are not missing at random. Often, the variable was not relevant for a specific row because of a certain value for another variable. I was hoping that the synthesizer would be able to pick up on these correlations. It should, right? |
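One way to confirm that missingness depends on another column is to compute the NaN rate per group. The snippet below is a minimal sketch using pandas only. The column names `loan_type` and `collateral_value` and the toy data are hypothetical and are not taken from the dataset in this issue.

```python
import numpy as np
import pandas as pd


def missing_rate_by_group(df, numeric_col, group_col):
    """Return the fraction of NaN values in `numeric_col` for each value of `group_col`."""
    return df[numeric_col].isna().groupby(df[group_col]).mean()


# Hypothetical toy data: 'collateral_value' is only relevant for secured loans.
toy = pd.DataFrame({
    'loan_type': ['secured', 'secured', 'unsecured', 'unsecured', 'unsecured'],
    'collateral_value': [1000.0, 2500.0, np.nan, np.nan, np.nan],
})

rates = missing_rate_by_group(toy, 'collateral_value', 'loan_type')
print(rates)
```

If the rates differ sharply between groups, the data is not missing completely at random. Running the same check on the synthetic data shows whether that dependency was preserved.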
Definitely - all of our synthesizers try their best to learn statistical patterns between columns. In fact, we actually include the correlation between pairs of columns (we call them "Column Pair Trends") in our Quality Report. This report will compare the similarity of the column-level distributions (we call them "Column Shapes") and the Column Pair Trends between your real and synthetic data. You can run and compare these reports for each batch of synthetic data you create, which can also help you compare data created using different synthesizers or synthesizer parameters.
That would be awesome if you're able to!
Oh that's interesting. So are you saying that the synthetic data contains NaN values but their occurrences aren't in line with some inter-column logic that you expect? Or are you seeing no NaN values at all in your synthetic data for the numeric columns? We actually created a feature called Constraints to help you define specific rules that the synthetic data must follow. We have a few pre-defined constraint classes or you can create custom constraint classes with more open-ended logic. |
In my synthetic data, there are no NaN values for the numeric columns. By the way, I found out that the discriminator and generator loss values are all over the place. They do not seem to "converge". This might be the issue. I will try tweaking the parameters, although I found a comment of yours on another issue saying that this a) is difficult, b) is outside your expertise, and that sometimes the data does not suit this synthesizer. What do you mean by the last statement? I imagined that a GAN-based synthesizer would be better than other methods at dealing with ANY type of dataset. Have you found that this is not true? |
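To quantify the symptom described above, the per-column NaN fraction of the real and synthetic tables can be put side by side. This is a minimal pandas sketch. The helper name `compare_missing_rates` and the toy frames are assumptions made for illustration and are not part of SDV.

```python
import numpy as np
import pandas as pd


def compare_missing_rates(real, synthetic):
    """Side-by-side NaN fraction for every column present in both DataFrames."""
    shared = [col for col in real.columns if col in synthetic.columns]
    return pd.DataFrame({
        'real': real[shared].isna().mean(),
        'synthetic': synthetic[shared].isna().mean(),
    })


# Hypothetical toy frames standing in for the real and synthetic tables.
real = pd.DataFrame({'amount': [1.0, np.nan, np.nan, 4.0], 'term': [12, 24, 36, 48]})
synthetic = pd.DataFrame({'amount': [1.5, 2.0, 3.0, 3.5], 'term': [12, 36, 24, 48]})

report = compare_missing_rates(real, synthetic)
print(report)
```

A column whose `real` rate is well above zero while its `synthetic` rate is exactly zero matches the behaviour reported in this issue.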
Hi @wilcovanvorstenbosch and @srinify, quickly jumping in here with a few clarifications.
Details about NaN values
We expect CTGAN to produce synthetic NaN values at roughly the same proportion as the real data. Internally, NaN values are handled by SDV at a level that is outside the scope of the internal GAN algorithm. Therefore:
The only thing that might affect NaN values is if you are (a) manipulating the data in any way after reading from CSV, or (b) making customizations such as updating transformers/adding constraints. This doesn't seem to be the case.
Diagnosing this current issue: Synthetic data doesn't contain NaNs
Unfortunately neither @srinify nor I have been able to replicate this. We have tried all sorts of combinations of sdtypes (numerical, categorical) with the same proportion of np.nan values. CTGANSynthesizer always gives us back np.nan values as expected. I think @srinify's suggestion to try GaussianCopula is to help discover if this bug is isolated to CTGAN.
One more thing that may help: For the column you are visualizing, I know that the missing values are stored as Is there any other information you can provide that might be useful to replicate? Perhaps if you are able to replicate this on unrelated (or made up) data, you can share that? In the meantime, we will continue trying to replicate but it's proving to be a bit tricky!
Other Notes
|
Sorry for the late reply. I was busy with other work, but will be working on this topic for most of this week so I'll try to further clarify the issue and maybe create a dataset that I can share. Regarding your question: The column that I was visualising is of type I'm currently testing the GaussianCopula method to see if the same issue persists. |
One note on this: the NaN values are not synthesized exactly to the mean value, but they indeed are close.
You mentioned that NaN values are handled outside of the GANs. Can you point me to the pieces of code that handle this? I can't seem to find it. |
Hi @wilcovanvorstenbosch, no problem at all. I'm glad to hear that the GaussianCopula synthesizer does not have this problem with NaNs. If you want, you're welcome to file a new issue about improving the distribution quality so we can discuss that separately. (Hint: With GaussianCopula, there is a lot more you can do to control & customize the quality. For example, you can put in the exact shape you want using the I'm going to update the title of this current NaN issue to mention that it is for CTGAN only. One thing that stands out to me is that you mention your column is dtype To get an
Sure, NaN handling is done during the data pre-processing stage, with the help of RDT transformers. Underlying algorithms (such as CTGAN) are not designed to work with NaNs, so this pre-processing stage will typically fill the NaN values with some random other values (and keep track of what it did, so it is reversible later). After fitting, you can see which transformers were used, and whether it learned the % of missing values in your column:

    all_transformers = synthesizer.get_transformers()
    column_transformer = all_transformers[COLUMN_NAME]  # add the name of the numerical column to debug
    try:
        # Only works if the transformer has a null transformer attached
        print('Learned proportion of missing values:', column_transformer.null_transformer._null_percentage)
    except AttributeError:
        print('Transformer used:', column_transformer)

For more information see the RDT documentation site and RDT GitHub. |
Dear @npatki, in fact I am loading the dataset directly from an SQL database. I did not think this would matter much. Still, I put it to the test. I will come back to you when I have some 'shareable' information about my dataset.
Is this true for the GaussianCopula synthesizer as well? Kind regards, |
I have filed a separate issue, #2310 dedicated to discussing this particular topic (data missing completely at random vs. not). Re the original issue for being unable to sample NaN values:
I understand that the issue is happening for numerical columns in general. I just wanted to sanity check the dtypes because SDV has only been tested with Perhaps there is some other property of your dataset (unrelated to dtype) that is causing the bug, but I'm not sure what it could be. Please do let us know if you have any shareable information! This problem has been trickier for us to replicate than most :) |
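For the dtype sanity check mentioned above, a quick summary of what pandas actually holds for each numerical column can help, especially when the data comes from a SQL driver. The snippet below is a sketch under assumptions: the helper name `summarize_numeric_candidates`, the column names and the toy data are hypothetical, and it only reports what pandas sees. It makes no claim about which dtypes SDV supports.

```python
import numpy as np
import pandas as pd


def summarize_numeric_candidates(df, columns):
    """Report dtype and missing-value counts before and after numeric coercion."""
    rows = []
    for col in columns:
        series = df[col]
        # Values that cannot be parsed as numbers become NaN here.
        coerced = pd.to_numeric(series, errors='coerce')
        rows.append({
            'column': col,
            'dtype': str(series.dtype),
            'n_missing': int(series.isna().sum()),
            'n_missing_after_coercion': int(coerced.isna().sum()),
        })
    return pd.DataFrame(rows).set_index('column')


# Hypothetical toy data: 'income' arrives as Python objects with None,
# 'age' is a regular float column with np.nan.
toy = pd.DataFrame({
    'income': pd.Series([1000, None, 2500], dtype=object),
    'age': [30.0, np.nan, 41.0],
})

summary = summarize_numeric_candidates(toy, ['income', 'age'])
print(summary)
```

If a column shows up with an unexpected dtype, converting it explicitly (for example with `pd.to_numeric`) before fitting is one way to rule the dtype out as a cause.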
Dear @npatki , I am pretty sure it has nothing to do with the dtypes. From previous tests, it looked like it generated values around the mean. Edit: |
As I mentioned earlier, by default all SDV synthesizers will consider the data "missing completely at random". I.e. they just learn the % of NaN values and randomly add them back into your synthetic data. Please check #2310 for more details.
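To make the "missing completely at random" behaviour concrete, here is a toy sketch of the idea: learn the NaN fraction, fill the gaps before modelling, and re-insert NaNs at random positions afterwards. This is not the RDT or SDV implementation. The class `ToyNullHandler` is hypothetical, and filling with the column mean is an assumption chosen for simplicity.

```python
import numpy as np
import pandas as pd


class ToyNullHandler:
    """Illustrative only; not the RDT/SDV implementation."""

    def fit(self, series):
        # Learn the fraction of missing values and a fill value (mean ignores NaN).
        self.null_fraction = float(series.isna().mean())
        self.fill_value = float(series.mean())
        return self

    def transform(self, series):
        # The downstream model never sees NaN.
        return series.fillna(self.fill_value)

    def reverse_transform(self, series, rng):
        # Re-insert NaNs at random rows, independent of any other column.
        mask = rng.random(len(series)) < self.null_fraction
        return series.mask(mask)


rng = np.random.default_rng(0)
values = rng.normal(size=10_000)
real = pd.Series(np.where(rng.random(10_000) < 0.9, np.nan, values))

handler = ToyNullHandler().fit(real)
filled = handler.transform(real)
restored = handler.reverse_transform(filled, rng)
print(handler.null_fraction, restored.isna().mean())
```

Because the mask is drawn independently of every other column, the overall NaN rate is roughly preserved, but any dependency between missingness and another variable is lost.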
Very strange! We will continue to investigate with this new info that a smaller subsample does not have the same issue as the original (full) dataset. |
Environment Details
Error Description
I tried to synthesize a DataFrame containing ~2 mil entries of loan data. The dataset has 9 numerical variables, and 11 categorical variables. The problem occurs only on the numerical variables. During synthesizing, all NaN values disappear even if the original variable had 90% missing values. I was told the SDV should be able to handle this, so I am left confused. Any help would be appreciated!
To be clear: the missing values in the numerical columns are of type np.nan
Steps to reproduce
I've attached a screenshot to show you the issue. Unfortunately I am very hesitant to share any of the original (meta)data.