Bugs using keras.src.utils.split_dataset on tf.data.Dataset loaded using tf.data.experimental.make_csv_dataset on versions v3.4.0+ #20538

Open
sfenu-3 opened this issue Nov 22, 2024 · 0 comments
Bug description:

We've noticed two bugs when calling split_dataset on a tf.data.Dataset loaded with tf.data.experimental.make_csv_dataset, on Keras versions 3.4.0 onward. One of two things happens: either the split_dataset call hangs indefinitely, or the returned train and test datasets have their column names shuffled, so that values end up under the wrong keys.

Tested Keras versions: 3.5.0, 3.6.0. Tested TensorFlow version: 2.18.

Steps to reproduce:

```python
from keras.src.utils import split_dataset
import tensorflow as tf
import pandas as pd

# Each column holds a distinct constant, so a mislabeled column is easy to spot.
data_dict = {
    'a': [1.] * 10,
    'b': [20.] * 10,
    'c': [300.] * 10,
    'd': [4000.] * 10
}

df = pd.DataFrame(data_dict)

# Case 1: dataset built in memory from the DataFrame (works as expected).
valid_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
print("Dataframe dataset sample: ", [e for e in valid_dataset.take(1)])
train, test = split_dataset(valid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])

# Case 2: the same data loaded from a CSV file (hangs or shuffles column names).
df.to_csv('bug_report_test_data.csv', index=False)

invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
print("CSV dataset sample: ", [e for e in invalid_dataset.take(1)])
train, test = split_dataset(invalid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])
```
In the first case, split_dataset works as expected. In the second case, the split_dataset call either hangs indefinitely or returns elements whose column names have been reassigned to the wrong values, e.g. `{'d': [1], 'b': [300], 'c': [4000], 'a': [20]}`.
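To make the column mix-up easier to check programmatically, here is a minimal sketch of a helper that compares one dataset element against the constants used in the repro above. The names `EXPECTED` and `find_mislabeled_columns` are hypothetical and not part of Keras or TensorFlow; the helper only assumes each element is a dict mapping column names to a scalar, a list, or a tensor/array exposing `.numpy()` or `.tolist()`.

```python
# Hypothetical helper, not part of Keras or TensorFlow.
# Expected constant per column, taken from data_dict in the repro script.
EXPECTED = {"a": 1.0, "b": 20.0, "c": 300.0, "d": 4000.0}


def find_mislabeled_columns(element, expected=EXPECTED):
    """Return {column: (expected, actual)} for each column holding the wrong value."""
    mismatches = {}
    for name, want in expected.items():
        value = element[name]
        # Unwrap tf.Tensor -> NumPy -> plain Python, when those methods exist.
        if hasattr(value, "numpy"):
            value = value.numpy()
        if hasattr(value, "tolist"):
            value = value.tolist()
        # Batched elements (batch_size=1) arrive as one-item lists.
        got = value[0] if isinstance(value, (list, tuple)) else value
        if float(got) != want:
            mismatches[name] = (want, float(got))
    return mismatches
```

With the repro script, this could be called as `find_mislabeled_columns(next(iter(train)))`; an empty dict would mean every column still carries its original value.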

Reverting the function _restore_dataset_from_list in keras.src.utils.dataset_utils to its version 3.3.3 implementation resolves the issue.
