Hey Everyone,
I'm working with a Twitter dataset to build an NLP model, and I split the data as follows:
from sklearn.model_selection import train_test_split
train_sen, val_sen, train_labels, val_labels = train_test_split(train_df['text'].to_numpy(), train_df['target'].to_numpy(), test_size=0.1, random_state=42)
len(train_sen), len(train_labels), len(val_sen), len(val_labels)
len(set(train_sen) & set(val_sen))  # number of samples common to both sets
12
Some samples appear in both the train and validation sets. Doesn't that mean the data is leaking? Is this supposed to happen? I don't know what I'm missing here. I've checked it cell by cell but still get the same output. (In this case, I got an accuracy of 82.16… on train and 82.62… on val.)
Is it normal to have some common samples for better generalization or does it lead to overfitting?
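One possible explanation, not confirmed for this particular dataset: `train_test_split` partitions rows, not unique values, so if the same tweet text occurs in more than one row of `train_df`, the copies can land on both sides of the split. Below is a minimal sketch of checking for repeated texts and dropping them before splitting. The `toy_df` frame and its tweets are made up for illustration, and `test_size=0.25` is chosen only to suit the toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train_df; the tweets are made up.
toy_df = pd.DataFrame({
    "text": ["fire downtown", "nice weather", "fire downtown", "flood warning",
             "nice weather", "traffic jam", "storm coming", "good morning"],
    "target": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Count rows whose text already appeared earlier in the frame.
n_duplicate_rows = int(toy_df["text"].duplicated().sum())

# Keep only the first row for each unique text before splitting.
deduped_df = toy_df.drop_duplicates(subset="text").reset_index(drop=True)

train_sen, val_sen, train_labels, val_labels = train_test_split(
    deduped_df["text"].to_numpy(),
    deduped_df["target"].to_numpy(),
    test_size=0.25,
    random_state=42,
)

# Every text is now unique, so the two sets cannot share a sample.
n_common = len(set(train_sen) & set(val_sen))
print(n_duplicate_rows, n_common)
```

Note that `drop_duplicates` keeps the first occurrence, so if repeated texts carry conflicting labels it is worth inspecting them before dropping. It also only catches exact string matches, so near-duplicates such as retweets with small edits would still slip through.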