Bug description:
We've noticed two bugs that appear when split_dataset is called on a TensorFlow dataset loaded with tf.data.experimental.make_csv_dataset, in Keras versions 3.4.0 onward. One of two things happens: either the split_dataset call hangs indefinitely, or the output train and test datasets have their column names shuffled.
Tested Keras versions: 3.5.0, 3.6.0. Tested TensorFlow version: 2.18.
Steps to reproduce:
```python
from keras.src.utils import split_dataset
import tensorflow as tf
import pandas as pd

# Each column holds a distinct constant, so a misassigned column is easy to spot.
data_dict = {
    'a': [1.] * 10,
    'b': [20.] * 10,
    'c': [300.] * 10,
    'd': [4000.] * 10
}
df = pd.DataFrame(data_dict)

# Case 1: dataset built from the in-memory DataFrame.
valid_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
print("Dataframe dataset sample: ", [e for e in valid_dataset.take(1)])
train, test = split_dataset(valid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])

# Case 2: the same data written to CSV and loaded with make_csv_dataset.
df.to_csv('bug_report_test_data.csv', index=False)
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
print("CSV dataset sample: ", [e for e in invalid_dataset.take(1)])
train, test = split_dataset(invalid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])
```
In the first case, split_dataset works as expected. In the second case, the split_dataset call either hangs indefinitely or returns datasets whose values are assigned to the wrong column names, for example {'d': [1], 'b': [300], 'c': [4000], 'a': [20]}.
Reverting the function _restore_dataset_from_list in keras.src.utils.dataset_utils to its version 3.3.3 implementation resolves the issue.
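Because every column in the reproduction data holds a single constant, the column reassignment described above can be detected mechanically. The following is a minimal sketch of such a check, not part of Keras or TensorFlow: the helper name find_mismatched_columns and the EXPECTED table are hypothetical, and the sample is assumed to have already been converted to a plain dict of lists.

```python
# Expected per-column constants, copied from data_dict in the reproduction script.
EXPECTED = {"a": 1.0, "b": 20.0, "c": 300.0, "d": 4000.0}


def find_mismatched_columns(sample, expected=EXPECTED):
    """Return the sorted names of columns whose values differ from the expected constant.

    `sample` is a plain dict mapping column name -> list of values.
    This helper is hypothetical and only meant for checking the reproduction output.
    """
    mismatched = []
    for name, constant in expected.items():
        values = sample.get(name)
        # A missing column or any value other than the constant counts as a mismatch.
        if values is None or any(v != constant for v in values):
            mismatched.append(name)
    return sorted(mismatched)


# A correctly labelled sample, and one modelled on the shuffled output reported above.
correct = {"a": [1.0], "b": [20.0], "c": [300.0], "d": [4000.0]}
shuffled = {"d": [1.0], "b": [300.0], "c": [4000.0], "a": [20.0]}

print(find_mismatched_columns(correct))   # []
print(find_mismatched_columns(shuffled))  # ['a', 'b', 'c', 'd']
```

For a batched element taken from the split CSV dataset, something like `{k: v.numpy().tolist() for k, v in element.items()}` should produce a dict in the shape this helper expects, assuming eager execution.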