PARSynthesizer is not learning rounding scheme for numerical columns #2274

Closed
npatki opened this issue Oct 31, 2024 · 0 comments · Fixed by #2289
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)

Environment Details

  • SDV version: 1.17.1

Error Description

First observed in #2241: If I have a numerical, sequential column with a particular rounding scheme, I would expect every SDV synthesizer to learn that rounding scheme and produce synthetic data rounded the same way. But this is not the case for PARSynthesizer.

Steps to reproduce

In the example below, the numerical column col_A is always rounded to 2 decimal places. Observe how the synthetic data does not follow that scheme.

import pandas as pd
import numpy as np

from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer

data = pd.DataFrame(data={
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
    'col_A': [5000.23, 4500.23, 4300.45, 2300.11, 3212.31, np.nan, 3456.34, 7890.12, 8201.00, 9810.12]
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'sequence_key': 'id',
            'columns': {
                'id': { 'sdtype': 'id' },
                'col_A': { 'sdtype': 'numerical'}
            }
        },
    }
})

synthesizer = PARSynthesizer(metadata, epochs=1)
synthesizer.fit(data)
synthesizer.sample(num_sequences=2)
(image: screenshot of the sampled PARSynthesizer output)
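
The mismatch can also be checked programmatically instead of by eye. The following is a minimal sketch, not part of SDV: `max_decimal_places` is a hypothetical helper written for this illustration, and it assumes the column holds finite floats (or NaN). It reports the largest number of decimal places found in a column, which is 2 for the real col_A above.

```python
from decimal import Decimal

import numpy as np
import pandas as pd


def max_decimal_places(values: pd.Series) -> int:
    """Return the largest number of decimal places among the non-missing values.

    Hypothetical helper for illustration; assumes finite floats (NaN is skipped).
    """
    places = 0
    for value in values.dropna():
        # repr() yields the shortest string that round-trips the float,
        # e.g. 5000.23 -> '5000.23' and 8201.00 -> '8201.0'.
        digits = Decimal(repr(float(value))).normalize()
        exponent = digits.as_tuple().exponent
        # A negative exponent counts digits after the decimal point.
        places = max(places, -exponent if exponent < 0 else 0)
    return places


# The real column from the reproduction steps above.
real_col_a = pd.Series(
    [5000.23, 4500.23, 4300.45, 2300.11, 3212.31, np.nan, 3456.34, 7890.12, 8201.00, 9810.12]
)
real_places = max_decimal_places(real_col_a)
```

Under the expectation stated in this issue, applying the same helper to the col_A column of the DataFrame returned by `synthesizer.sample(num_sequences=2)` should give a value no larger than `real_places`. A larger value indicates the rounding scheme was not learned.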

Additional Context

Observe also that other synthesizers, such as GaussianCopulaSynthesizer, correctly learn the rounding scheme and produce synthetic data that is correctly formatted.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthesizer.sample(num_rows=5)
(image: screenshot of the sampled GaussianCopulaSynthesizer output)