PARSynthesizer samples uniformly distributed time series data #2241

Closed
ardulat opened this issue Sep 26, 2024 · 17 comments
Assignees: srinify
Labels: data:sequential, question, resolution:duplicate


ardulat commented Sep 26, 2024

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.15.9
  • Python version: 3.12
  • Operating System: linux/amd64 (Docker image)

Problem description

I've been using SDV for quite a while now. Recently, after analyzing the sampled data, I observed weird behavior when sampling time series data: the sequential model PARSynthesizer keeps generating near-uniform distributions for almost all of my columns. I am attaching two plots that clearly show the difference.

Actual data distribution plot:
[screenshot: actual data distribution]

Synthetic data distribution plot:
[screenshot: synthetic data distribution]

What I already tried

I tried synthesizing on different datasets and with different numbers of epochs. Here is the code snippet for model fitting:

# Initialize synthesizer
self.synthesizer = self.model_class(
    self.metadata,
    epochs=self.epochs,
    cuda=self.cuda,
    context_columns=list(self.context_columns),
    verbose=True,
    # Control whether the synthetic data should adhere to the same min/max
    # boundaries set by the real data
    enforce_min_max_values=True,
    # Control whether the synthetic data should have the same number of decimal
    # digits as the real data
    enforce_rounding=False,
)

# Fit synthesizer
self.synthesizer.fit(data)
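
For reference, sampling after fitting uses the standard PARSynthesizer API (the num_sequences value below is arbitrary):

# Sample synthetic sequences from the fitted model
synthetic_data = self.synthesizer.sample(num_sequences=100)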

I can't share the data or anything related to that (including metadata) since it is sensitive medical data.


srinify commented Sep 26, 2024

Hi @ardulat, without metadata this might be challenging to debug, but let's try!

  • Are the synthetic distributions uniform for context columns and non-context columns?
  • How many columns fall into each bucket (context vs non-context)?

In general, PARSynthesizer is one of our less mature synthesizers compared to our other single and multi table synthesizers. That alone could be causing this behavior, but it would be great to rule out a few other things first.


ardulat commented Sep 27, 2024

Hi @srinify, thank you for your quick response.

Here is what the metadata looks like (I removed the exact column names to preserve privacy):

{
  "columns": {
    "sequence_id": {
      "sdtype": "id"
    },
    "context_column1": {
      "sdtype": "categorical"
    },
    "context_column2": {
      "sdtype": "categorical"
    },
    "context_column3": {
      "sdtype": "categorical"
    },
    "context_column4": {
      "sdtype": "categorical"
    },
    "context_column5": {
      "sdtype": "categorical"
    },
    "context_column6": {
      "sdtype": "numerical"
    },
    "context_column7": {
      "sdtype": "categorical"
    },
    "time_series_column1": {
      "sdtype": "numerical"
    },
    "time_series_column2": {
      "sdtype": "numerical"
    },
    "time_series_column3": {
      "sdtype": "categorical"
    },
    "time_series_column4": {
      "sdtype": "categorical"
    },
    "time_series_column5": {
      "sdtype": "categorical"
    },
    "time_series_column6": {
      "sdtype": "numerical"
    },
    "time_series_column7": {
      "sdtype": "numerical"
    },
    "time_series_column8": {
      "sdtype": "numerical"
    },
    "time_series_column9": {
      "sdtype": "numerical"
    },
    "time_series_column10": {
      "sdtype": "numerical"
    },
    "time_series_column11": {
      "sdtype": "numerical"
    },
    "time_series_column12": {
      "sdtype": "numerical"
    },
    "time_series_column13__steps": {
      "sdtype": "numerical"
    },
    "time_series_column14": {
      "sdtype": "numerical"
    },
    "time_series_column15": {
      "sdtype": "numerical"
    },
    "time_series_column16": {
      "sdtype": "numerical"
    },
    "time_series_column17": {
      "sdtype": "numerical"
    },
    "time_series_column18": {
      "sdtype": "numerical"
    },
    "time_series_column19": {
      "sdtype": "numerical"
    },
    "time_series_column20": {
      "sdtype": "numerical"
    },
    "time_series_column21": {
      "sdtype": "numerical"
    },
    "date": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },
    "primary_key": {
      "sdtype": "id"
    }
  },
  "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
  "primary_key": "primary_key",
  "sequence_index": "date",
  "sequence_key": "sequence_id",
  "synthesizer_info": {
    "class_name": "PARSynthesizer",
    "creation_date": "2024-09-18",
    "is_fit": true,
    "last_fit_date": "2024-09-18",
    "fitted_sdv_version": "1.15.0"
  }
}

A few issues with this metadata:

  1. context_column6 corresponds to dates converted to timestamps, following "InvalidDataError when fitting datetime columns as context columns in PARSynthesizer" (#2115). But this leads to implausible dates, e.g., years 1617 and 2253.
  2. Categorical time series columns (suffixes 3-5) are sampled as floats. The same applies to numerical columns of integer type.
  3. Sampled dates contain 36% null values on average.

Answering your questions:

  • No -- among the context columns, the only numerical one is not uniform. Among the non-context columns, all or almost all form uniform distributions.
  • There are 7 context columns and 21 non-context columns (not counting the id and date columns).


ardulat commented Oct 3, 2024

Hi @srinify, are there any updates on this?


ardulat commented Oct 8, 2024

Hi @npatki, can you please elaborate on this?


srinify commented Oct 8, 2024

Hi @ardulat, apologies for the delay! It seems like there are a few issues here to discuss:

Uniform distribution for some of the time series columns

I will attempt to reproduce this today, though I may run into trouble if the issue is very data-specific. Let's see.

Context column 6 isn't respecting timestamp boundaries.

The issue you linked to has since been fixed -- do you mind avoiding timestamps and just using datetimes directly (with a datetime_format specified in the metadata) to see if that resolves your issue?
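
For example, the column can be re-declared in the metadata along these lines (a minimal sketch assuming the SingleTableMetadata API; the file name, column name, and format are placeholders):

from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata.load_from_json('metadata.json')

# Declare the context column as a datetime rather than a numerical timestamp
metadata.update_column(
    column_name='context_column6',
    sdtype='datetime',
    datetime_format='%Y-%m-%d',
)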

Categorical time series columns produce float numbers

Are your original values float numbers (e.g. 0.0, 1.0, etc) as well?

Sampled data includes 36% null values

Are these all in a specific column? Or entire rows with null values? And how does this match the null patterns in your real data?


ardulat commented Oct 9, 2024

Hi, @srinify! Thank you for your reply. Further discussion on the issues:

I will attempt to reproduce this today, though I may run into trouble if the issue is very data-specific. Let's see.

I can't help here since the data I am working with is sensitive and private. I haven't tested PARSynthesizer on other datasets.

The issue you linked to has since been fixed -- do you mind avoiding timestamps and just using datetimes directly (with a datetime_format specified in the metadata) to see if that resolves your issue?

Will do, thanks.

Are your original values float numbers (e.g. 0.0, 1.0, etc) as well?

The issue is that some of my columns contain integers, but the model samples are floats. I can do rounding, but I'm not sure if that's the right thing to do.

Are these all in a specific column? Or entire rows with null values? And how does this match the null patterns in your real data?

Apologies for my unclear explanation. The issue is that the date (not data) column, which is the sequence_index in the metadata, contains 36% nulls.
I've been using the MissingValueSimilarity metric from SDMetrics, and it shows good results for the similarity of null values in the other columns. However, for the column representing dates (the sequence_index), the model sometimes produces many null values, resulting in a MissingValueSimilarity score below 1.0. In other words, the actual data does not have many nulls, but the sampled data has relatively more.
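
For reference, the check I run looks roughly like this (a sketch assuming the SDMetrics single-column API; real_data and synthetic_data are the corresponding DataFrames):

from sdmetrics.single_column import MissingValueSimilarity

# A score of 1.0 means the proportion of missing values matches
# between the real and synthetic column
score = MissingValueSimilarity.compute(
    real_data=real_data['date'],
    synthetic_data=synthetic_data['date'],
)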

To sum up, there are a couple of issues, and I will follow your suggestions where applicable. However, the main issue stopping us from using SDV is the produced uniform distributions, hence the issue title. Thank you for your help!


srinify commented Oct 9, 2024

Hi @ardulat, unfortunately I wasn't able to recreate any of these issues with a fake dataset of my own whose metadata matches yours. It's likely my fake dataset is too simple, of course.

Let me chat more internally with the team to understand if they've encountered these issues before and what other debugging tips we can try!

Also, if you haven't updated to the latest version of SDV, I always recommend trying that to see if any of these issues get resolved :)


ardulat commented Oct 16, 2024

Hi, @srinify. To help debug the issue, I have prepared a toy dataset that clearly shows the issues I previously described here. The CSV file with training data is attached below.
train.csv

The updated metadata is as follows:

{
  "columns": {
    "id": {
      "sdtype": "id"
    },
    "date_of_birth": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },
    "gender": {
      "sdtype": "categorical"
    },
    "steps": {
      "sdtype": "numerical"
    },
    "date": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },
    "primary_key": {
      "sdtype": "id"
    }
  },
  "primary_key": "primary_key",
  "sequence_key": "id",
  "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
  "sequence_index": "date",
  "synthesizer_info": {
    "class_name": "PARSynthesizer",
    "creation_date": "2024-10-16",
    "is_fit": true,
    "last_fit_date": "2024-10-16",
    "fitted_sdv_version": "1.17.0"
  }
}

Here, date_of_birth and gender are context columns, steps and date are time series columns, and primary_key is just a range column. Also, I have updated the SDV version, and the datetime column date_of_birth in the context is now sampled correctly.
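
For completeness, the toy setup can be reproduced roughly like this (a minimal sketch; metadata_dict is the dictionary above, the num_sequences value is arbitrary, and the synthesizer_info block may need to be dropped before loading):

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

data = pd.read_csv('train.csv')
metadata = SingleTableMetadata.load_from_dict(metadata_dict)

# date_of_birth and gender are context columns; steps and date vary over time
synthesizer = PARSynthesizer(
    metadata,
    context_columns=['date_of_birth', 'gender'],
    epochs=128,
    verbose=True,
)
synthesizer.fit(data)
synthetic = synthesizer.sample(num_sequences=100)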

Here is the distribution plot for the training steps column:
[screenshot: training steps distribution]

And the distribution plot for the sampled/synthetic steps column after training for 128 epochs:
[screenshot: synthetic steps distribution after 128 epochs]

I hope this helps. Let me know your thoughts on how to fix the uniform distributions and the rest of the issues.


srinify commented Oct 17, 2024

Awesome @ardulat, I'll take a look today and circle back! Full disclosure though: some (or all) of these might just be issues we need to open, track, and eventually address.


ardulat commented Oct 23, 2024

Hi @srinify! I was able to "fix" the issue. The problem was that SDV samples float values (although the actual data was integers). As a result, each sampled value was unique with a frequency of 1 (e.g., 7230.05 and 7230.33 map to different bars in the plot). It would be great if SDV sampled integers for columns that are integers in the training data.
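
The workaround, in case it helps others, is a plain pandas post-processing step (a sketch; synthetic is the sampled DataFrame):

# Round sampled floats back to integers; the nullable Int64 dtype
# lets any null values in the column survive the cast
synthetic['steps'] = synthetic['steps'].round().astype('Int64')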

Anyway, the issue with uniform distribution is fixed now. However, I experienced the same challenge as in #2230, where the actual and synthetic distributions differed. I will add a comment on the mentioned issue. Here is what the distributions look like:
[screenshot: actual vs. synthetic steps distributions]


srinify commented Oct 27, 2024

Hi @ardulat, when I tried the PARSynthesizer workflow with the train.csv dataset you shared with me, I noticed that the steps column contains float values (e.g. 6979.0, 5104.0, etc). Analyzing the column more closely, all of them could be shortened to integers (the trailing .0 carries no information).

So if I'm understanding your last issue correctly -- the steps column has trailing-.0 floats and some null values, but you'd like SDV to treat them as integer values and only generate integers, right?

Regarding #2230, thanks for adding to that thread! PAR is one of our less mature synthesizers, so we're collecting examples that showcase its shortcomings so we can improve it down the road!


ardulat commented Oct 27, 2024

Hi, @srinify! Yes, you are right; the data is all floats. My bad, I didn't notice that. But as far as I remember, SDV generates floats even for integer columns. I guess this happens because the numerical sdtype does not distinguish between integers and floats.

Yes, that is an issue, but it's a minor one that I can fix by simply rounding the generated floats. Our major issue now is the mismatch between the actual and synthetic distributions, which currently limits our use of SDV's PARSynthesizer. I would love to hear your thoughts on how to sample more similar distributions. Thank you!


srinify commented Oct 28, 2024

@ardulat I see what you mean now! Even if a column contains only integer values, pandas will usually assign it the float dtype if it has missing values. Then, even if you specify 'Int64' or another integer computer representation in the metadata, SDV still generates float values, not integers. Let me sync up internally and see if we need to open an issue here to address this!

Regarding the differing distributions -- unfortunately I can't offer much help here. This is a current limitation of PARSynthesizer, but thank you for adding your example to that thread. It will go a long way in helping us prioritize changes to our sequential synthesizer and improve it!


srinify commented Oct 31, 2024

@ardulat after chatting internally, it looks like the behavior around float values in the synthetic data is intended. In pandas, integer columns that contain missing values are converted to float by default.

Pandas does support a newer nullable Int64 dtype that better supports missing values, but support for this dtype in SDV has not yet been released. When it is, SDV will be able to synthesize both integer and missing values in the same column.
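
To illustrate the pandas behavior (a quick sketch, nothing SDV-specific):

import pandas as pd

# An integer column with a missing value is silently upcast to float64 ...
s = pd.Series([1, 2, None])
print(s.dtype)  # float64 -- values are 1.0, 2.0, NaN

# ... whereas the nullable Int64 dtype keeps integers alongside missing values
s = pd.Series([1, 2, None], dtype='Int64')
print(s.dtype)  # Int64 -- values are 1, 2, <NA>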

Thanks for commenting in #2230 -- we can track any developments in that issue itself.

It seems like we have action items for the issues you've mentioned, so I propose we close out this specific issue. Let me know if I missed anything that still needs to be reproduced and/or filed!


npatki commented Oct 31, 2024

Hi @srinify and @ardulat -- wanted to jump in here to clarify something about integer columns with null values.

If you have an integer column with null values, pandas will read it in as a float64 column. All non-null values will then be represented as 1.0, 2.0, etc. In such cases, I would expect any SDV synthesizer (including PAR) to produce synthetic data that:

  • Has the exact same representation as the real data (in this case, a float column), and
  • Follows the same rounding scheme as the real data (e.g., 1.0, 2.0, etc.)

The fact that the produced synthetic data does not have the same rounding scheme as the real data is a bug that ought to be fixed. I will file a new issue (bug) to track this.

As for producing better quality data -- some users have reported that data pre-processing greatly helps with this. We do offer some basic data processors in the RDT library (installed by default with SDV). It may be worth a try.
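
For example, one option is to customize the transformer SDV assigns to a numerical column before fitting (a sketch assuming RDT's FloatFormatter and the synthesizer's update_transformers hook; the parameter choices are illustrative):

from rdt.transformers.numerical import FloatFormatter

# Assign the default transformers first, then override the one for 'steps'
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers({
    'steps': FloatFormatter(
        learn_rounding_scheme=True,
        enforce_min_max_values=True,
    ),
})
synthesizer.fit(data)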


ardulat commented Nov 1, 2024

Hi @srinify and @npatki! Thanks for your valuable input on the issue.

As I said earlier, float values remain a minor issue in our case, since we can simply round the sampled values to get integers. However, I would expect something more principled, as rounding feels like a hack rather than a proper solution.

Overall, data quality is our major issue. @npatki, could you please give some specific examples of data pre-processing that may help resolve it?

It's worth noting that another issue I listed here is that PARSynthesizer produces 36% null values in the sequence_index column, although the original data does not contain any missing values. I observe this behavior consistently, and I would expect the synthesizer to reproduce the distribution of missing values, especially for the sequence_index column.

Thanks again for your help! I don't have anything more to add here, so we can close the issue.


npatki commented Nov 1, 2024

Hi @ardulat, absolutely -- happy to help in any way I can. I really appreciate all the data examples, metadata, code snippets, and plots you have been providing.

I want to make sure we address all of the topics that you are bringing up, so I suggest we use a different GitHub issue for each topic -- otherwise, we may accidentally miss a few. Here is what I propose:

  1. This issue can be dedicated to the float/int problem, which, as you mention, is a minor one. Still, we never want our users to resort to hacky solutions. We have the bug "PARSynthesizer is not learning rounding scheme for numerical columns" (#2274) for this, so I can close this one off as a duplicate.
  2. I have filed the bug "Unexpected null values in sequence_index column" (#2276) for the problem of unexpected null values in the sequence_index column.
  3. We can use the existing issue "Synthetic data from PARSynthesizer does not follow original data distribution" (#2230) for the problem of column shapes. Thank you for adding your example there -- we will investigate and respond further in that issue.

If there are any other topics that are still relevant (that were missed in this thread), please feel free to file a separate issue for each one. It will make it easier to track and ensure we get you an answer for each. Thanks for helping us keep the GitHub organized :)

@npatki npatki closed this as completed Nov 1, 2024