Data Leakage problem #636

Khey17 · 2024-04-02T10:29:58Z


Hey everyone,
I'm working with a Twitter dataset to build an NLP model, and I've split the data as follows:
from sklearn.model_selection import train_test_split

train_sen, val_sen, train_labels, val_labels = train_test_split(train_df['text'].to_numpy(), train_df['target'].to_numpy(), test_size=0.1, random_state=42)
len(train_sen), len(train_labels), len(val_sen), len(val_labels)

len(set(train_sen) & set(val_sen))  # number of texts present in both splits
# Output: 12

Some samples appear in both the training and validation sets. Doesn't that mean the data is leaking? Is this supposed to happen? I don't know what exactly I'm missing here. I've checked it cell by cell but still get the same output. (In this case, I got an accuracy of 82.16… (train) and 82.62… (val).)

Is it normal to have some samples in common between the two sets, and does that help generalization or does it lead to overfitting?
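
For anyone who wants to reproduce the check: `train_test_split` assigns each row to exactly one side, so one plausible cause of the overlap is that the same tweet text appears in more than one row of `train_df`. That is an assumption, not something confirmed for this dataset. Below is a minimal sketch that counts duplicated texts and drops them before splitting. The toy `train_df` is hypothetical stand-in data; only the `text` and `target` column names come from the code above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real train_df.
# It deliberately repeats two texts to mimic duplicate tweets.
train_df = pd.DataFrame({
    "text": [
        "fire downtown", "nice day", "fire downtown", "flood warning",
        "nice day", "lunch time", "storm coming", "new song out",
        "earthquake hits", "coffee break",
    ],
    "target": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Count rows whose text already appeared in an earlier row.
n_dupes = train_df["text"].duplicated().sum()

# Keep only the first row for each distinct text, then split as before.
deduped = train_df.drop_duplicates(subset="text").reset_index(drop=True)

train_sen, val_sen, train_labels, val_labels = train_test_split(
    deduped["text"].to_numpy(),
    deduped["target"].to_numpy(),
    test_size=0.1,
    random_state=42,
)

# Texts present in both splits; empty once every text is unique.
overlap = set(train_sen) & set(val_sen)
print(n_dupes, len(overlap))
```

If `n_dupes` is greater than zero on the real data, the shared samples most likely come from these repeated rows. Note that `drop_duplicates` keeps the first occurrence, so if a repeated text carries conflicting `target` values, those rows are worth inspecting by hand before dropping them.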
